lina: a web page link checker

February 9, 2021 / 36 views / Last modified February 9, 2021
Open source projects

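I wrote a tool for finding broken links on web pages a long time ago. It was called blogchecker, and I was never happy with it: it was single-threaded and slow, and even though it relied on powerful third-party libraries such as requests and Beautiful Soup, it still did the job poorly. I was also just starting to learn Python back then. This time I started over. lina (Link Analyzer) uses a thread pool, so it is fast; it drops the third-party libraries in favor of urllib and re from the Python standard library; and it stores its data in a sqlite3 database, which the standard library fortunately supports as well.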

Repository: https://github.com/xinlin-z/lina

Help info:

$ python3 lina.py -h

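The rest of this post introduces lina through concrete use cases.

Suppose we want to check the blog at www.pynote.net with 24 threads, have each thread rest for 500 milliseconds after finishing a job, filter out links that match the regular expression wp-login.php, and store the data in the database file pynote_db: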

$ python3 lina.py --url https://www.pynote.net -d pynote_db -w 24 -t 500 -e 'wp-login.php'
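To make the -e filter concrete, here is a minimal sketch of excluding links by regular expression. The helper name filter_links is hypothetical, and using re.search (a match anywhere in the URL) is an assumption, not something confirmed from lina's source.

```python
import re


def filter_links(links, exclude_pattern):
    """Return the links that the exclude pattern does NOT match.

    Hypothetical helper; matching anywhere in the URL via re.search
    is an assumption about how such a filter would behave.
    """
    pattern = re.compile(exclude_pattern)
    return [link for link in links if not pattern.search(link)]


links = [
    "https://www.pynote.net/about",
    "https://www.pynote.net/wp-login.php",
    "https://www.pynote.net/wp-login.php?action=lostpassword",
]
print(filter_links(links, "wp-login.php"))
# → ['https://www.pynote.net/about']
```

Note that an unescaped dot in a regular expression matches any character, so wp-login.php would also match a URL containing wp-loginXphp. Writing the pattern as 'wp-login\.php' is stricter.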

lina then gets to work, printing progress information as it goes.

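lina command-line arguments:

  • --url: one of three mutually exclusive options; the value must include the scheme (http:// or https://);
  • -d: the database file name; lina stores its data in a sqlite3 database;
  • --showpage: one of three mutually exclusive options; shows what the database holds for a single page (described below);
  • -w: optional; the number of worker threads in the thread pool; the default is chosen by Python; (only with --url)
  • -t: optional; how long each worker thread rests after finishing a job; (only with --url)
  • -e: optional; a regular expression used to filter out matching links; (only with --url)
  • -s: optional; single-page mode (described below); (only with --url)

Together, -w and -t let you tune how much load lina puts on the site.

--url, --stat, and --showpage are mutually exclusive. -d is given together with whichever one is used, as every example in this post shows.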
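How a worker count and a rest interval combine to limit load can be sketched with the standard library's ThreadPoolExecutor. This is an illustration under stated assumptions, not lina's code: run_checks, check, and fake_check are hypothetical names, and the real check would be an HTTP request rather than a stub.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_checks(links, check, workers=None, rest_ms=0):
    """Run check(link) for every link on a thread pool.

    workers plays the role of -w and rest_ms the role of -t.
    Each task sleeps for rest_ms milliseconds after its check,
    so a worker pauses before picking up the next link.
    """
    def task(link):
        try:
            return link, check(link)
        finally:
            time.sleep(rest_ms / 1000)

    # max_workers=None lets Python choose the pool size.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, links))


def fake_check(link):
    """Stand-in for a real HTTP check (hypothetical)."""
    return 200


results = run_checks(
    ["https://example.com/a", "https://example.com/b"],
    fake_check, workers=2, rest_ms=10,
)
print(results)
```

Fewer workers and a longer rest both lower the request rate; more workers and no rest finish sooner at the cost of more pressure on the server.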

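In practice, lina almost always runs into problems while checking links: a sluggish network, timeouts, and so on. Some checks therefore fail, or you may stop the run yourself with Ctrl+C. lina was designed with this in mind. Rechecking the failed items and resuming an interrupted run work the same way: repeat the command line you used to start lina last time.

Running the same command line again rechecks the failed items and continues from where the previous run stopped. The key is that the --url argument is the same and the -d database file is unchanged.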
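One way a database-backed checker can support this kind of resume is sketched below. It reuses the link_data table from this post, but the selection logic is an assumption and not taken from lina's source: a link counts as pending if its status is NULL (never checked) or is not a plain HTTP status code (the last attempt ended in an error). The function name pending_links and the sample rows are hypothetical.

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS link_data (
    link_id INTEGER PRIMARY KEY,
    link TEXT UNIQUE,
    type INT,
    status TEXT,
    sub_links TEXT)"""


def pending_links(conn):
    """Links that still need a check (assumed rule, see lead-in)."""
    rows = conn.execute(
        "SELECT link, status FROM link_data ORDER BY link_id")
    return [link for link, status in rows
            if status is None or not status.isdigit()]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert = "INSERT OR IGNORE INTO link_data (link, status) VALUES (?, ?)"
conn.executemany(insert, [
    ("https://www.pynote.net", "200"),
    ("https://www.pynote.net/about", None),                        # never checked
    ("https://www.pynote.net/links", "<urlopen error timed out>"),  # sample error text
])
# A second run adds the start URL again; the UNIQUE constraint plus
# OR IGNORE keeps the row that is already stored.
conn.execute(insert, ("https://www.pynote.net", None))

print(pending_links(conn))
# → ['https://www.pynote.net/about', 'https://www.pynote.net/links']
```

Because the state lives in the database file rather than in memory, the same command line with the same -d file is all a second run needs.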

When lina finishes, it prints the statistics stored in the database:

$ python3 lina.py --url https://www.pynote.net -d db1 -w 12 -e 'wp-login.php'
# add https://www.pynote.net
# add https://en.wikipedia.org/wiki/Tim_Peters_(software_engineer)
# add https://www.maixj.net/ict/sqlite3-lock-24041
# add https://en.wikipedia.org/wiki/YCbCr
# add http://eccc.hpi-web.de/report/2012/137/
http://eccc.hpi-web.de/report/2012/137/ 
https://www.maixj.net/ict/sqlite3-lock-24041 404 ('Not Found', 'Nothing matches the given URI')
https://en.wikipedia.org/wiki/Tim_Peters_(software_engineer) 
https://en.wikipedia.org/wiki/YCbCr 
GET2SUBMIT timeout, submit done...
Stat in database db1:
status code : link number
200:             878
404:             1
<urlopen error [Errno 104] Connection reset by peer>: 1
<urlopen error _ssl.c:1107: The handshake operation timed out>: 2

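As shown, 878 links returned status 200 and one returned 404; the remaining three links failed with two kinds of errors.

All of this is stored in the sqlite3 database file, so you can run whatever queries you like with any sqlite3 tool. The table is defined as follows: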

CREATE TABLE IF NOT EXISTS link_data (
    link_id INTEGER PRIMARY KEY,
    link TEXT UNIQUE,
    type INT,
    status TEXT,
    sub_links TEXT);
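As an example of querying this table yourself, the sketch below uses Python's sqlite3 module to count links per status and to list the links with a given status. The function names are hypothetical, and it assumes the status column holds text such as "200" or "404", which is what the statistics output suggests. The demo runs on an in-memory copy of the table; for real data, open the file with sqlite3.connect("db1") instead.

```python
import sqlite3


def status_counts(conn):
    """Return [(status, count), ...], most frequent status first."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM link_data "
        "GROUP BY status ORDER BY COUNT(*) DESC, status"
    ).fetchall()


def links_with_status(conn, status):
    """Return every link whose status column equals the given text."""
    rows = conn.execute(
        "SELECT link FROM link_data WHERE status = ? ORDER BY link",
        (status,))
    return [row[0] for row in rows]


# In-memory demo using the table definition from this post.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS link_data (
    link_id INTEGER PRIMARY KEY,
    link TEXT UNIQUE,
    type INT,
    status TEXT,
    sub_links TEXT)""")
conn.executemany(
    "INSERT INTO link_data (link, status) VALUES (?, ?)",
    [
        ("https://www.pynote.net", "200"),
        ("https://www.pynote.net/about", "200"),
        ("https://www.maixj.net/ict/sqlite3-lock-24041", "404"),
    ],
)

print(status_counts(conn))
# → [('200', 2), ('404', 1)]
print(links_with_status(conn, "404"))
# → ['https://www.maixj.net/ict/sqlite3-lock-24041']
```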

lina also provides a --stat command-line argument that prints the same statistics:

$ python3 lina.py --stat -d db1
Stat in database db1:
status code : link number
200:             878
404:             1
<urlopen error [Errno 104] Connection reset by peer>: 1
<urlopen error _ssl.c:1107: The handshake operation timed out>: 2

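An important point about how lina works: the --url argument is a prefix. Only for URLs that start with this prefix does lina behave like a crawler and go on to fetch the links found on the page. For a URL with a different prefix, lina only determines the status of that link and ignores the other links on the page (its sub_links in the database is null).

In the example above, --url https://www.pynote.net means every link on this blog is checked. With --url https://www.pynote.net/abc, only pages whose URL starts with that prefix (https://xxx/abc) have their own links fetched and analyzed further.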
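The prefix rule can be sketched in a few lines with re and urllib.parse. Everything here is illustrative: the regular expression, extract_links, and plan are hypothetical, and lina's own link extraction may well differ.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for href/src attribute values (an assumption;
# real-world HTML can be messier than this handles).
LINK_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"'#]+)''', re.IGNORECASE)


def extract_links(page_url, html):
    """Return the absolute URLs found in html, sorted and de-duplicated."""
    return sorted({urljoin(page_url, ref) for ref in LINK_RE.findall(html)})


def plan(link, prefix):
    """'crawl' if the link shares the prefix, otherwise 'status-only'."""
    return "crawl" if link.startswith(prefix) else "status-only"


html = (
    '<a href="/archives/3481">post</a> '
    '<a href="https://en.wikipedia.org/wiki/YCbCr">wiki</a> '
    '<img src="/pics/logo.jpg">'
)
for link in extract_links("https://www.pynote.net/about", html):
    print(link, plan(link, "https://www.pynote.net"))
# → https://en.wikipedia.org/wiki/YCbCr status-only
# → https://www.pynote.net/archives/3481 crawl
# → https://www.pynote.net/pics/logo.jpg crawl
```

With the narrower prefix https://www.pynote.net/archives, only the archives link would still be crawled; the logo would get a status check and nothing more.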

lina also has a single-page mode, enabled with the optional -s argument:

$ python3 lina.py --url https://www.pynote.net -d db2 -s
# add https://www.pynote.net
https://www.pynote.net 200 OK
https://www.maixj.net/tag/vim 200 OK
https://www.pynote.net/archives/3425 200 OK
https://www.pynote.net/archives/tag/exception 200 OK
https://www.pynote.net/archives/tag/decorator 200 OK
https://www.pynote.net/archives/tag/txcl 200 OK
https://www.pynote.net/archives/tag/sqlite 200 OK
https://www.pynote.net/page/2 200 OK
https://www.pynote.net/archives/3232 200 OK
https://www.pynote.net/page/10 200 OK
https://www.pynote.net/pics/uploads/2019/07/dotpy-150x150.jpg 200 OK
https://www.pynote.net/ 200 OK
https://www.pynote.net/archives/tag/sys 200 OK
https://www.pynote.net/archives/tag/tkinter 200 OK
https://www.pynote.net/archives/tag/list 200 OK
https://www.pynote.net/page/20 200 OK
https://www.pynote.net/archives/3234 200 OK
https://www.pynote.net/archives/tag/dict 200 OK
https://www.pynote.net/archives/tag/argparse 200 OK
https://www.pynote.net/pics/logo.jpg 200 OK
https://www.pynote.net/archives/tag/ann 200 OK
https://www.pynote.net/archives/3183 200 OK
https://www.pynote.net/archives/tag/excel 200 OK
https://www.pynote.net/contact 200 OK
https://www.pynote.net/archives/tag/osprj 200 OK
https://www.pynote.net/archives/3446 200 OK
https://www.pynote.net/archives/3428 200 OK
https://www.pynote.net/archives/tag/built-in-func 200 OK
https://www.maixj.net/tag/cc 200 OK
https://www.pynote.net/archives/tag/print 200 OK
https://www.maixj.net/tag/git 200 OK
https://www.pynote.net/page/26 200 OK
https://www.pynote.net/archives/tag/install 200 OK
https://www.pynote.net/archives/tag/thread 200 OK
https://www.pynote.net/archives/tag/subprocess 200 OK
https://www.pynote.net/archives/tag/args 200 OK
https://www.pynote.net/archives/tag/ctypes 200 OK
https://www.pynote.net/pics/pyicon.png 200 OK
https://www.pynote.net/archives/tag/time 200 OK
https://www.pynote.net/archives/3101 200 OK
https://www.pynote.net/page/3 200 OK
https://www.pynote.net/archives/tag/bytes 200 OK
https://www.pynote.net/archives/tag/cgi 200 OK
https://www.pynote.net/archives/tag/re 200 OK
https://www.pynote.net/archives/tag/os 200 OK
https://www.maixj.net/tag/mathjax 200 OK
https://www.pynote.net/archives/3109 200 OK
https://www.pynote.net/about 200 OK
https://www.pynote.net/page/5 200 OK
https://www.pynote.net/archives/3241 200 OK
https://www.pynote.net/archives/tag/algo 200 OK
https://www.pynote.net/archives/tag/matplotlib 200 OK
https://www.pynote.net/archives/tag/configparser 200 OK
https://www.pynote.net/sitemap 200 OK
https://www.pynote.net/cookie 200 OK
https://www.maixj.net/tag/linux-cmd 200 OK
https://www.pynote.net/archives/tag/gushi 200 OK
https://www.pynote.net/archives/3353 200 OK
https://www.pynote.net/archives/3181 200 OK
https://www.pynote.net/pics/goTop.jpg 200 OK
https://www.pynote.net/archives/3094 200 OK
https://www.pynote.net/archives/3218 200 OK
https://www.pynote.net/archives/tag/syntax 200 OK
https://www.pynote.net/archives/tag/calculate 200 OK
https://www.pynote.net/archives/tag/logging 200 OK
https://www.pynote.net/archives/3315 200 OK
https://www.maixj.net/tag/ssh 200 OK
https://www.pynote.net/archives/tag/email 200 OK
https://www.pynote.net/archives/3091 200 OK
https://www.pynote.net/archives/tag/unittest 200 OK
https://www.pynote.net/archive 200 OK
https://www.pynote.net/archives/tag/string 200 OK
https://www.pynote.net/archives/tag/set 200 OK
https://www.pynote.net/pics/uploads/2020/12/dog_edge_1-200x201.jpg 200 OK
https://www.pynote.net/archives/tag/socket 200 OK
https://www.pynote.net/links 200 OK
https://www.pynote.net/wp-content/themes/mt3/style.css 200 OK
https://www.pynote.net/page/4 200 OK
https://www.pynote.net/archives/tag/numpy 200 OK
https://www.pynote.net/archives/tag/oop 200 OK
https://www.pynote.net/archives/tag/multiprocess 200 OK
https://www.pynote.net/archives/3289 200 OK
https://www.pynote.net/archives/tag/python-cmd 200 OK
https://www.pynote.net/archives/tag/pyftpdlib 200 OK
https://www.pynote.net/archives/3154 200 OK
https://www.pynote.net/archives/3457 200 OK
GET2SUBMIT timeout, submit done...
Stat in database db2:
status code : link number
200:             86

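Single-page mode prints only a limited amount of output, so all of it is copied here.

One reason lina is relatively fast is that it sends an HTTP HEAD request for resource files such as images, CSS, and JS. HTML pages are all fetched with HTTP GET.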
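The HEAD-versus-GET choice can be sketched with urllib.request, whose Request class accepts a method argument. Deciding by file extension, the extension list, and the function names are all assumptions made for illustration; how lina recognizes a resource file is not described here.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse

# Assumed list of resource extensions, for illustration only.
RESOURCE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")


def build_request(url):
    """Use HEAD for resource files and GET for everything else."""
    path = urlparse(url).path.lower()
    method = "HEAD" if path.endswith(RESOURCE_EXTS) else "GET"
    return urllib.request.Request(url, method=method)


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url (performs a network call)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still a result.
        return err.code


print(build_request("https://www.pynote.net/pics/logo.jpg").get_method())
# → HEAD
print(build_request("https://www.pynote.net/about").get_method())
# → GET
```

A HEAD response carries headers only, so the checker learns the status of an image or stylesheet without downloading its body. HTML pages need GET because their body has to be scanned for further links.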

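Once a site has been checked, you can read the statistics or query the data with a standalone sqlite3 tool. lina also offers a --showpage argument that lists the information for every link on a given page: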

$ python3 lina.py --showpage https://www.pynote.net/about -d db1
link https://www.pynote.net/about is 200, sub_links:
https://www.pynote.net/archives/tag/thread 200
https://www.pynote.net/sitemap 200
https://www.pynote.net/archives/tag/bytes 200
https://www.pynote.net/archives/tag/email 200
https://www.pynote.net/archives/2445 200
https://www.pynote.net/archives/tag/sys 200
https://www.pynote.net/cookie 200
https://www.pynote.net/archives/tag/excel 200
https://www.pynote.net/archives/967 200
https://www.pynote.net/archives/2497 200
https://www.pynote.net/archives/tag/oop 200
https://www.pynote.net/archives/2606 200
https://www.pynote.net/archives/2121 200
https://www.pynote.net/archives/tag/built-in-func 200
https://www.pynote.net/archives/tag/calculate 200
https://www.pynote.net/archives/tag/algo 200
https://www.pynote.net/archives/tag/ann 200
https://www.pynote.net/archives/tag/os 200
https://www.pynote.net/archives/tag/logging 200
https://www.pynote.net/archives/tag/argparse 200
https://www.pynote.net/archives/tag/multiprocess 200
https://www.pynote.net/archives/tag/cgi 200
https://www.pynote.net/archives/tag/subprocess 200
https://www.pynote.net/archives/tag/pyftpdlib 200
https://www.pynote.net/archives/tag/unittest 200
https://www.pynote.net/archives/tag/install 200
https://www.pynote.net/archives/tag/set 200
https://www.pynote.net/archives/tag/string 200
https://www.pynote.net/archives/3428 200
https://www.pynote.net/wp-content/themes/mt3/style.css 200
https://www.pynote.net/pics/goTop.jpg 200
https://www.pynote.net/ 200
https://www.pynote.net/archives/2745 200
https://www.pynote.net/archives/1321 200
https://www.pynote.net/archives/1359 200
https://www.pynote.net/archives/2273 200
https://www.pynote.net/archives/356 200
https://www.pynote.net/archives/2861 200
https://www.maixj.net 200
https://www.pynote.net/pics/uploads/2019/06/use_python.jpg 200
https://www.pynote.net/archives/tag/syntax 200
https://www.pynote.net/links 200
https://www.pynote.net/pics/logo.jpg 200
https://www.pynote.net/archives/tag/list 200
https://www.pynote.net/archives/tag/gushi 200
https://www.pynote.net/archives/3183 200
https://www.pynote.net/archives/tag/sqlite 200
https://www.pynote.net/about 200
https://www.pynote.net/archives/1874 200
https://www.pynote.net 200
https://www.pynote.net/archives/tag/ctypes 200
https://www.pynote.net/archives/1175 200
https://www.pynote.net/archives/tag/decorator 200
https://www.pynote.net/archives/tag/numpy 200
https://www.pynote.net/archives/2210 200
https://www.pynote.net/archives/2000 200
https://www.pynote.net/contact 200
https://www.pynote.net/archives/tag/configparser 200
https://www.pynote.net/archives/tag/osprj 200
https://www.pynote.net/archives/2219 200
https://www.pynote.net/archives/tag/txcl 200
https://www.pynote.net/archives/3066 200
https://www.pynote.net/archives/939 200
https://www.pynote.net/archives/tag/matplotlib 200
https://www.pynote.net/archives/tag/print 200
https://www.pynote.net/pics/pyicon.png 200
https://www.pynote.net/archives/tag/python-cmd 200
https://www.pynote.net/archives/tag/tkinter 200
https://www.pynote.net/archives/tag/time 200
https://www.pynote.net/archives/tag/args 200
https://www.pynote.net/archives/tag/exception 200
https://www.pynote.net/archive 200
https://www.pynote.net/archives/tag/socket 200
https://www.pynote.net/archives/tag/dict 200
https://www.pynote.net/archives/tag/re 200
https://www.pynote.net/archives/2557 200
https://www.pynote.net/archives/1325 200

-- EOF --

Permalink: https://www.pynote.net/archives/3481

©Copyright 麦新杰 Since 2019 Python笔记
