Batch Home (批处理之家) - Powered by Discuz! Board

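Title: [Network] [Solved] How can a batch script bulk-download the PDF files linked from web pages?

Author: wzf1024 Time: 2016-3-21 23:35 Title: [Solved] How can a batch script bulk-download the PDF files linked from web pages?

Last edited by pcl_test on 2016-11-17 23:21

The URLs follow a regular pattern: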
http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=0001
……
http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=2150

        	
        <li><a href="#method">方法</a></li>
        <li><a href="#result">结果诊断</a></li>
        <li><a href="#note">注意事项</a></li>
        <li><a href="http://pmmp.cnki.net/Resources/CDDPdf/%e4%b8%b4%e5%ba%8a%e6%93%8d%e4%bd%9c%e8%a7%84%e8%8c%83%5cB2.6.1.6 %e4%bd%93%e4%bd%8d%e5%8f%98%e6%8d%a2%e8%af%95%e9%aa%8c.pdf">全文下载</a></li>
    </ul>
    </div>

</div>

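Each page has a "全文下载" (Download full text) link that points to a PDF file. The file names follow no pattern, and downloading them one by one is too tedious. How can I download all of these PDF files with a batch script?
Could someone help? Many thanks!

Author: CrLf Time: 2016-3-21 23:58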

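@echo off
more +4 "%~f0" | mshta http://bathome.net/s/hta/ eval(WSH.StdIn.ReadAll())
pause & exit /b

// Everything from this line on is JScript: "more +4" skips the 4 lines above and pipes the rest to mshta.
// web() and download() are not defined here; they come from the HTA helper page.
for(var i=10001;i<=12150;i++){
	// ids run 0001..2150: count from 10001 and drop the leading "1" to keep the zero padding
	var url='http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id='+(''+i).substr(1)
	var html=web(url)
	// \u0022 is a double quote: capture the href that precedes the link text
	var match=html.match(/([^\u0022]+?)\u0022>全文下载/)
	if(!match)continue	// no download link on this page: skip it
	download(decodeURI(match[1]))
}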
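For readers who cannot run the HTA helper, the extraction step of the script above can be tried on its own. The following is a minimal Python sketch, not CrLf's code: `extract_pdf_url` and `LINK_RE` are names made up for this illustration, and `unquote` only roughly mirrors what `decodeURI` does in the JScript. The sample string is built from the HTML fragment in the first post.

```python
import re
from urllib.parse import unquote

# Same pattern as the JScript: everything between a double quote and the
# link text "全文下载" (the "Download full text" anchor).
LINK_RE = re.compile(r'([^"]+?)">全文下载')


def extract_pdf_url(html):
    """Return the decoded PDF URL found in one page, or None if there is none."""
    match = LINK_RE.search(html)
    if match is None:
        return None
    # The href is percent-encoded UTF-8; decode it into readable text.
    return unquote(match.group(1))


# Two list items taken from the fragment quoted in the first post.
sample = (
    '<li><a href="#note">注意事项</a></li>'
    '<li><a href="http://pmmp.cnki.net/Resources/CDDPdf/'
    '%e4%b8%b4%e5%ba%8a%e6%93%8d%e4%bd%9c%e8%a7%84%e8%8c%83%5cB2.6.1.6 '
    '%e4%bd%93%e4%bd%8d%e5%8f%98%e6%8d%a2%e8%af%95%e9%aa%8c.pdf">全文下载</a></li>'
)
print(extract_pdf_url(sample))
```

Because `[^"]` cannot cross a quote, the capture group always starts right after the opening quote of the `href`, so earlier links on the page are ignored.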

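Author: wzf1024 Time: 2016-3-22 01:24

Reply to 2# CrLf

How do I use this? I'm a beginner and don't understand it. Running the bat produces no result. Please give me some pointers.

Author: wzf1024 Time: 2016-3-22 10:14

How is CrLf's hybrid script supposed to be used? Could someone explain?
After I save it as a bat file and run it, I get nothing except "Press any key to continue", and pressing a key just exits. Why? I'm on Windows XP.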

Author: ivor Time: 2016-3-22 10:26

Last edited by ivor on 2017-11-24 21:40

Reply to 1# wzf1024

Python 3.5
The download URLs are saved to list.txt; paste them into Xunlei (Thunder) for batch downloading.

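# coding: utf-8
import urllib.request
from urllib.parse import unquote

import bs4


web_site = 'http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=%04d'
# list.txt gets one decoded PDF URL per line; UTF-8 because the names contain Chinese
with open('list.txt', 'w', encoding='utf-8') as wfile:
    # The question covers ids 0001..2150; 6715 is the last id of the
    # Disease series requested later in the thread.
    for num in range(1, 6715 + 1):
        try:
            req = urllib.request.urlopen(web_site % num)
            soup = bs4.BeautifulSoup(req, 'html.parser')
            for i in soup.find_all('a'):
                if i.string == '全文下载':  # link text on the page ("Download full text")
                    downloadUrl = unquote(i.get('href'))
                    print(downloadUrl, file=wfile, flush=True)
                    print(downloadUrl)
                    break
        except Exception as err:
            print("Server error! Please check the URL or connection. Current id=%04d (%s)" % (num, err))
input("Press Enter to exit")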
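If Xunlei is not available, the URLs in list.txt could also be fetched from Python. This is only a sketch under assumptions the thread does not confirm: that list.txt is UTF-8 with one decoded URL per line, and that the server accepts the path once it is percent-encoded again. `to_request_url`, `local_name`, `download_all`, and the `pdf` output directory are names made up for this illustration.

```python
import os
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import urlretrieve


def to_request_url(decoded_url):
    """Percent-encode the path again so the decoded URL can be requested."""
    parts = urlsplit(decoded_url)
    return urlunsplit(parts._replace(path=quote(parts.path, safe='/')))


def local_name(decoded_url):
    """Use the text after the last '/' or backslash as the local file name."""
    return decoded_url.replace('\\', '/').rsplit('/', 1)[-1]


def download_all(list_path='list.txt', out_dir='pdf'):
    """Download every URL listed in list_path into out_dir (assumed names)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(list_path, encoding='utf-8') as lines:
        for line in lines:
            url = line.strip()
            if not url:
                continue
            target = os.path.join(out_dir, local_name(url))
            try:
                urlretrieve(to_request_url(url), target)
                print("saved", target)
            except OSError as err:  # URLError and HTTPError are OSError subclasses
                print("failed", url, err)
```

Calling `download_all()` next to an existing list.txt would then write the PDFs into a `pdf` subfolder and report any URL that fails instead of stopping.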

Author: codegay Time: 2016-3-22 10:37

Reply to 5# ivor

You should mention that it needs Python 3.

Author: ivor Time: 2016-3-22 10:52

Reply to 6# codegay

OK.

Author: ivor Time: 2016-3-22 10:59

Last edited by ivor on 2017-11-24 21:34

Reply to 1# wzf1024

Link: http://pan.baidu.com/s/1bpB9F1l

These are the PDF links I scraped.

Author: CrLf Time: 2016-3-22 14:39

Reply to 4# wzf1024

Didn't a bunch of PDFs appear in the bat file's directory?

Author: wzf1024 Time: 2016-3-22 18:25

Reply to 9# CrLf

No.
I don't know why.

Author: pcl_test Time: 2016-11-17 23:19

Last edited by pcl_test on 2016-11-17 23:29

rem Run on Windows 7 or later
set "url=http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id="
powershell -c "Add-Type -AssemblyName System.Web;$web=New-Object Net.WebClient;$web.Encoding=[Text.Encoding]::utf8;1..2150|%%{$htmltext=$web.DownloadString('%url%'+('{0:d4}' -f $_));if($htmltext -match '[^\"]+(?=\">全文下载)'){[Web.HttpUtility]::UrlDecode(''+$_+' '+$matches[0])}}"
pause

Author: 775405984 Time: 2017-11-23 21:55

Last edited by 775405984 on 2017-11-23 22:42

Reply to 8# ivor

Could you post the link again? The one above has expired.
There is also this series:
http://pmmp.cnki.net/Disease/Details.aspx?id=0001
......
......
http://pmmp.cnki.net/Disease/Details.aspx?id=6715

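Author: 775405984 Time: 2017-11-23 22:46

Reply to 11# pcl_test

Moderator, how do I save the retrieved download links to a text file?

Author: ivor Time: 2017-11-24 21:41

Reply to 12# 775405984

The download link has been updated. I suggest you learn some Python.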

Author: 775405984 Time: 2018-2-3 19:36

Reply to 14# ivor

I'm a medical student; just managing to find this site already felt like a blessing...

Could you please help out and post a list.txt?

http://pmmp.cnki.net/Disease/Details.aspx?id=0001
......
......
http://pmmp.cnki.net/Disease/Details.aspx?id=6715

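Author: codegay Time: 2018-2-3 19:59

Reply to 15# 775405984

It has nothing to do with gender or major. Here are the notes of a medical student, a young woman I believe. Many tech sites have reposted her study notes.
Her ability to learn, take notes, and organize material really amazes me.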

https://woaielf.github.io/2017/06/13/python3-all/

Author: 775405984 Time: 2018-2-3 20:05

Reply to 16# codegay

Better to teach someone to fish than to hand them a fish; I get the idea...

Author: ivor Time: 2018-2-3 22:39

Reply to 17# 775405984

https://pan.baidu.com/s/1dGQORIh

Author: 523066680 Time: 2018-2-4 21:26

Four threads plus a queue. Mojo itself supports multithreading, but I haven't learned how to use that yet.

=info
    4 threads + queue
    523066680@163.com
=cut

use Modern::Perl;
use Encode;
use threads;
use threads::shared;
use Thread::Queue;
use URI::Escape;
use Mojo::UserAgent;
STDOUT->autoflush(1);

my @ths;
my $que = Thread::Queue->new();    # A new empty queue
my $link = "http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=";
my @mission = map { sprintf "%s%04d", $link, $_ } (1 .. 2150);

# Create the worker threads
push @ths, threads->create( \&thread_func, $_ ) for 0 .. 3;

$que->enqueue( @mission );
$que->end();    # no more items: dequeue() returns undef once the queue drains
$_->join() for @ths;

exit;

sub thread_func
{
    my ( $id ) = @_;

    my $ua = Mojo::UserAgent->new();
    $ua = $ua->max_redirects(5);

    while (defined(my $page = $que->dequeue()))
    {
        # Skip pages that fail to load instead of killing the thread
        my $res = eval { $ua->get( $page )->result } or next;
        # Take the first http...pdf link on the page; skip pages without one
        next unless $res->body =~ /(http.*?\.pdf)/;
        # Decode the percent-encoded UTF-8 name and re-encode it as GBK for the console
        say encode('gbk', decode('utf8', uri_unescape($1)));
    }
}

See the attachment for the results.
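The same layout of four workers fed from one queue can be sketched in Python, the other scripting language used in this thread. This is an illustration under stated assumptions, not a port of the Perl above: `run_workers` and `handle` are made-up names, and `handle` stands for whatever function fetches one page and returns its PDF link (or None).

```python
import queue
import threading


def run_workers(tasks, handle, workers=4):
    """Run handle(task) for every task on `workers` threads and collect results."""
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)  # the queue is filled before any worker starts

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            try:
                value = handle(task)
            except Exception as err:
                print("failed: %s (%s)" % (task, err))
                continue
            if value is not None:
                with lock:
                    results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every task is queued before the workers start, an empty queue is a reliable stop signal here. Results arrive in completion order, so sort them if the order of ids matters.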

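Author: 775405984 Time: 2018-2-9 21:53

Reply to 18# ivor

Thanks!

Author: ivor Time: 2018-2-19 22:10

Last edited by ivor on 2018-2-20 10:19

I tried a different approach, having the threads pop elements off a shared list, and it turned out to be quite convenient. Throughput mostly depends on how fast the server responds.
Ten threads should make this fast enough.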

# coding: utf-8
# 10 threads
#

import threading
import time
import urllib.request
from urllib.parse import unquote

import bs4


s = time.time()
pdfUrl = []
numList = ['{:0>4}'.format(i) for i in range(1, 2150 + 1)]  # ids 0001..2150

def getPdfUrl(threadKey='default'):
    web_site = 'http://pmmp.cnki.net/OperatingDiscipline/Details.aspx?id=%s'
    while True:
        try:
            num = numList.pop()  # atomic in CPython, so each id goes to exactly one thread
        except IndexError:       # the list is empty: nothing left to do
            break
        try:
            req = urllib.request.urlopen(web_site % num)
            soup = bs4.BeautifulSoup(req, 'html.parser')
            for i in soup.find_all('a'):
                if i.string == '全文下载':  # link text on the page ("Download full text")
                    pdf = unquote(i.get('href'))
                    pdfUrl.append(pdf + '\n')
                    print("Thread[%s]: %s" % (threadKey, pdf))
                    break
        except Exception as err:
            print("Server error! Current id=%s (%s)" % (num, err))

    print("Thread[%s]: End!!" % threadKey)

def writeList(pdfLink):
    with open("list.txt", "w", encoding="utf-8") as file:
        file.writelines(pdfLink)

# Start the 10 worker threads, then wait for all of them to finish
threads = [threading.Thread(target=getPdfUrl, args=('t%d' % n,)) for n in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()

writeList(pdfUrl)
print("\n\nElapsed: %f seconds" % (time.time() - s))
