Title: [Text processing] (Solved) Extracting image URLs from a single-line web page file, without duplicates

Author: web    Time: 2018-1-18 16:00     Title: (Solved) Extracting image URLs from a single-line web page file, without duplicates

Last edited by web on 2018-1-19 13:57

<p style="padding: 0px; line-height: 1.5; clear: both; color: rgb(51, 51, 51); font-family: &quot;Hiragino Sans GB&quot;, Tahoma, Arial, 宋体, sans-serif;"><img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_4181.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_4181.jpg" /><br /> <br /> <br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_8536.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_8536.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_2145.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_2145.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_4315.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_4315.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_5113.jpg" style="border: none; visibility: visible; vertical-align: bottom; 
max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_5113.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_7621.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_7621.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_2878.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_2878.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_9000.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_9000.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_605.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_605.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_8239.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_8239.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_5145.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_5145.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_3003.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" 
data-original="/upload/externalpic/1214218/1214218_20170827123756_3003.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_6521.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_6521.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_9915.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_9915.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_2703.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_2703.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123757_1357.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123757_1357.jpg" /></p>





<p><img src="/upload/files/2017/08/20/1503223004634.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223004634.jpg" height="1311" width="740" /><img src="/upload/files/2017/08/20/1503223032210.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223032210.jpg" height="9860" width="740" /><img src="/upload/files/2017/08/20/1503223054641.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223054641.jpg" height="5919" width="740" /></p>



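Each of the two sample files above consists of a single line. I am looking for one method that works for both.
Goal: extract the image URLs from the web page file, with no duplicates and without the surrounding quotes.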
Author: 523066680    Time: 2018-1-18 19:38

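  use strict;
  use warnings;
  use Mojo::DOM;
  use File::Slurp;
  # Read the whole page and parse it as HTML
  my $html = read_file( "a.htm" );
  my $dom = Mojo::DOM->new( $html );
  # Print the data-original attribute of every <img> element, one per line
  print $_->attr("data-original"), "\n" for $dom->find("img")->each;
Output: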
/upload/externalpic/1214218/1214218_20170827123754_9480.jpg
/upload/externalpic/1214218/1214218_20170827123754_5111.jpg
/upload/externalpic/1214218/1214218_20170827123754_4181.jpg
/upload/externalpic/1214218/1214218_20170827123754_8536.jpg
/upload/externalpic/1214218/1214218_20170827123754_2145.jpg
/upload/externalpic/1214218/1214218_20170827123754_4315.jpg
/upload/externalpic/1214218/1214218_20170827123755_5113.jpg
/upload/externalpic/1214218/1214218_20170827123755_7621.jpg
/upload/externalpic/1214218/1214218_20170827123755_2878.jpg
/upload/externalpic/1214218/1214218_20170827123755_9000.jpg
/upload/externalpic/1214218/1214218_20170827123755_605.jpg
/upload/externalpic/1214218/1214218_20170827123755_8239.jpg
/upload/externalpic/1214218/1214218_20170827123756_5145.jpg
/upload/externalpic/1214218/1214218_20170827123756_3003.jpg
/upload/externalpic/1214218/1214218_20170827123756_6521.jpg
/upload/externalpic/1214218/1214218_20170827123756_9915.jpg
/upload/externalpic/1214218/1214218_20170827123756_2703.jpg
/upload/externalpic/1214218/1214218_20170827123757_1357.jpg
/upload/files/2017/08/20/1503223004634.jpg
/upload/files/2017/08/20/1503223032210.jpg
/upload/files/2017/08/20/1503223054641.jpg

Author: slore    Time: 2018-1-19 10:00

Last edited by slore on 2018-1-19 10:01

extractimg.rb (ruby)
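  puts File.read('a.html').scan(/\/upload[^.]+\.jpg/).uniq
Explanation: read the file, scan it for every match of the regular expression that describes a .jpg path, then use the array's uniq method to drop the duplicate matches.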
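To make the scan-then-uniq step concrete, here is a minimal sketch that runs the same regular expression on an inline string instead of a.html. The `html` string is a shortened, hypothetical stand-in built from the second sample above, and the variable names are only for illustration.

```ruby
# Hypothetical inline sample standing in for the contents of a.html
html = '<img src="/upload/files/2017/08/20/1503223004634.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223004634.jpg" />' \
       '<img src="/upload/files/2017/08/20/1503223032210.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223032210.jpg" />'

# scan returns every match, so each path appears twice:
# once from src and once from data-original
all_matches = html.scan(/\/upload[^.]+\.jpg/)

# uniq keeps the first occurrence of each path and preserves the order
unique_paths = all_matches.uniq

puts unique_paths
```

Note that `[^.]+` stops at the first dot, so this pattern assumes the directory and file names contain no dot before the `.jpg` extension, which holds for both samples in this thread.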
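Author: web    Time: 2018-1-19 10:50

Thanks for the replies, everyone. Is there a way to do this in a batch file, or in a batch file that calls a third-party tool? I don't know how to use other languages yet. Sorry for the trouble.
Author: WHY    Time: 2018-1-19 13:17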

Last edited by WHY on 2018-1-20 19:55
  @echo off
  PowerShell -c "[string]$s=type a.html;[regex]::Matches($s,'(?<=src=\")[^^\"]+')|%%{$_.Value}"
  pause
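The batch file above leaves the real work to a regular expression with a lookbehind: `(?<=src=")` requires the match to start right after `src="` without including it, and the character class then runs up to the closing quote. The sketch below shows the same idea in Ruby on a hypothetical two-image string trimmed from the first sample. The character class is simplified to `[^"]`, which is an assumption of this sketch and not a copy of the batch version.

```ruby
# Hypothetical inline sample, shortened from the first sample in this thread
html = '<img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" ' \
       'class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" />' \
       '<br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" ' \
       'class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" />'

# (?<=src=") is a lookbehind: the text src=" must come right before the
# match but is not part of it. [^"]+ then takes everything up to the next quote.
src_values = html.scan(/(?<=src=")[^"]+/)

puts src_values
```

Because `data-original="` does not end in `src="`, only the `src` attribute is matched, so each image is listed once and no separate deduplication step is needed for these samples.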

Author: web    Time: 2018-1-19 13:52

Reply to 5# WHY


    Thanks, that solved it.
I also searched around just now and found this, which comes close:
sed "y/;&/\n\n/" utf.txt | sed -n "/.*src=/ s/.*src=//p">b.txt
Author: WHY    Time: 2018-1-20 19:59

Reply to 6# web


    If third-party tools are allowed, I recommend grep:
  grep -P -o "(?<=src=\")[^^\"]+" a.html
If you insist on sed, something like this might work:
  sed -r "s/(src=|[^\"]\.jpg)\"/\1\n/g" a.html | findstr /b /e "\/.*\.jpg"
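The sed and findstr pipeline above works in two steps: first it turns the quote after `src=` and the quote after `.jpg` into a line break, then it keeps only the lines that begin with `/` and end with `.jpg`. The following Ruby sketch mimics those two steps on a hypothetical inline string taken from the second sample. It only illustrates the idea and is not a replacement for the commands above.

```ruby
# Hypothetical inline sample, shortened from the second sample in this thread
html = '<p><img src="/upload/files/2017/08/20/1503223004634.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223004634.jpg" height="1311" width="740" />' \
       '<img src="/upload/files/2017/08/20/1503223032210.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223032210.jpg" height="9860" width="740" /></p>'

# Step 1 (the sed part): replace the quote that follows src= or .jpg
# with a newline, keeping the text captured in group 1
broken = html.gsub(/(src=|[^"]\.jpg)"/) { "#{$1}\n" }

# Step 2 (the findstr /b /e part): keep only lines that start with /
# and end with .jpg
paths = broken.lines.map(&:chomp).select { |line| line.match?(/\A\/.*\.jpg\z/) }

puts paths
```

The `data-original` copies are filtered out in step 2 because their lines still begin with the leftover attribute text and not with `/`.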





Welcome to 批处理之家 (http://www.bathome.net/) Powered by Discuz! 7.2