[Solved] How can a batch script bulk-download the PDF files linked from web pages? - BAT Help & Discussion - 批处理之家 BAT, CMD, batch, PowerShell, VBS, DOS


ivor

Colonel

Rank: 6

Posts: 979
Points: 3381
Tech: 172
Donations: 40
Registered: 2012-1-7

Post #1

Posted 2016-3-22 10:26

Last edited by ivor on 2017-11-24 21:40

Reply to #1 wzf1024

Python 3.5
The script saves the download addresses to list.txt; paste them into Xunlei for bulk download.

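# coding:utf-8
# Requires the third-party package beautifulsoup4 (imported as bs4).
import urllib.request
from urllib.parse import unquote

import bs4

web_site = 'http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=%s'
with open('list.txt', 'w', encoding='utf-8') as wfile:
    # 10001..16714 with the leading "1" stripped gives the zero-padded ids 0001..6714.
    for num in range(10001, 16715):
        page_id = str(num)[1:]
        try:
            with urllib.request.urlopen(web_site % page_id) as resp:
                soup = bs4.BeautifulSoup(resp, 'html.parser')
            # The download link is the <a> whose text is exactly '全文下载' ("Download full text").
            for a in soup.find_all('a'):
                if a.string == '全文下载':
                    download_url = unquote(a.get('href'))
                    print(download_url, file=wfile, flush=True)
                    print(download_url)
                    break
        except Exception as err:
            print("Server error, please check the URL. Current id=%s (%s)" % (page_id, err))
input("Press Enter to exit")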
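If Xunlei is not at hand, the download step could also be done in Python. This is only a rough sketch under a few assumptions: list.txt holds one URL per line, as the script above writes it; the helper names (read_urls, filename_from_url, fetch_bytes, download_all) are made up for illustration; and the server is assumed to hand out the files without login or cookies, which I have not verified.

```python
import os
import urllib.request
from urllib.parse import urlsplit, unquote


def read_urls(list_path):
    """Return the non-empty lines of the URL list, in order."""
    with open(list_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def filename_from_url(link, index):
    """Build a local file name: a running number plus the last path segment."""
    name = os.path.basename(unquote(urlsplit(link).path)) or "file.pdf"
    # The numeric prefix keeps names unique even if several URLs end the same way.
    return "%04d_%s" % (index, name)


def fetch_bytes(link, timeout=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(link, timeout=timeout) as resp:
        return resp.read()


def download_all(links, dest_dir, fetch=fetch_bytes):
    """Save every link into dest_dir; return the list of links that failed."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for index, link in enumerate(links, start=1):
        target = os.path.join(dest_dir, filename_from_url(link, index))
        try:
            data = fetch(link)
        except OSError as err:  # URLError and socket timeouts are OSError subclasses
            print("failed: %s (%s)" % (link, err))
            failed.append(link)
            continue
        with open(target, "wb") as out:
            out.write(data)
    return failed
```

A typical call would be `failed = download_all(read_urls('list.txt'), 'pdf')`, where the `pdf` directory name is arbitrary. The `fetch` parameter exists only so the loop can be tried out without touching the network.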

#&cls&@powershell "Invoke-Expression ([Io.File]::ReadAllText('%~0',[Text.Encoding]::UTF8))" &pause&exit


ivor

Post #2

Posted 2016-3-22 10:52

Reply to #6 codegay

OK.


ivor

Post #3

Posted 2016-3-22 10:59

Last edited by ivor on 2017-11-24 21:34

Reply to #1 wzf1024

Link: http://pan.baidu.com/s/1bpB9F1l

These are the PDF links I scraped.

Rated by 2 users

wzf1024: Helpful. Many thanks. Tech +1
codegay: 1 Tech +1


ivor

Post #4

Posted 2017-11-24 21:41

Reply to #12 775405984

The download link has been updated. I'd suggest you learn some Python.


ivor

Post #5

Posted 2018-2-3 22:39

Reply to #17 775405984

https://pan.baidu.com/s/1dGQORIh


ivor

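Post #6

Posted 2018-2-19 22:10

Last edited by ivor on 2018-2-20 10:19

I tried a different approach: every thread pops ids off one shared list. It turned out to be quite convenient. Throughput still depends mainly on how fast the server responds.
Ten threads should be fast enough now...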

# coding:utf-8
# 10 threads
# Requires the third-party package beautifulsoup4 (imported as bs4).

import threading
import time
import urllib.request
from urllib.parse import unquote

import bs4


s = time.time()
pdfUrl = []
numList = ['{:0>4}'.format(i) for i in range(1, 2150)]  # ids '0001'..'2149'

def getPdfUrl(threadKey='default'):
    web_site = r'http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=%s'
    while True:
        try:
            # list.pop() is atomic in CPython; catching IndexError avoids the race
            # between checking the length and popping when the list runs dry.
            num = numList.pop()
        except IndexError:
            break
        try:
            with urllib.request.urlopen(web_site % num) as resp:
                soup = bs4.BeautifulSoup(resp, 'html.parser')
            for a in soup.find_all('a'):
                if a.string == '全文下载':  # link text: "Download full text"
                    pdf = unquote(a.get('href'))
                    pdfUrl.append(pdf + '\n')
                    print("Thread[%s]: %s" % (threadKey, pdf))
                    break
        except Exception as err:
            print("Server error! Current id=%s (%s)" % (num, err))
    print("Thread[%s]: End!!" % threadKey)

def writeList(pdfLink):
    with open("list.txt", "w", encoding="utf-8") as file:
        file.writelines(pdfLink)

# Start the worker threads, then wait for all of them to finish.
threads = [threading.Thread(target=getPdfUrl, args=('t%d' % n,)) for n in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()

writeList(pdfUrl)
print("\n\nElapsed: %f seconds" % (time.time() - s))
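The shared-list idea above could also be written with the standard library's thread pool, which hands out the ids and collects the results itself. This is a sketch, not a replacement for the script: make_ids, collect_links and the lookup callback are hypothetical names, and lookup is assumed to take one id string, return the PDF link or None, and catch its own network errors (an exception raised inside it would propagate out of collect_links).

```python
from concurrent.futures import ThreadPoolExecutor


def make_ids(first, last, width=4):
    """Zero-padded id strings for first..last inclusive."""
    return ["{:0>{w}}".format(n, w=width) for n in range(first, last + 1)]


def collect_links(ids, lookup, workers=10):
    """Run lookup(id) on a pool of threads and keep the non-None results.

    pool.map yields results in the same order as ids, so the output is
    ordered by id no matter which thread finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lookup, ids))
    return [link for link in results if link is not None]
```

With a lookup function that wraps the urlopen and BeautifulSoup steps from the script, `collect_links(make_ids(1, 2149), lookup)` would cover the same id range. Unlike the popping version, the resulting list comes back sorted by id.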
