4. Urllib -- urllib.robotparser

2018-06-14  江湖十年

Block all crawlers from accessing any content

User-Agent:  *
Disallow:  /

Allow all crawlers to access all content

User-Agent:  *
Disallow: 

Allow crawlers to access only the public/ directory

User-agent: *
Disallow: /
Allow: /public/
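
Rule sets like the ones above can be checked offline by passing their lines to RobotFileParser.parse(). The following is a minimal sketch rather than part of the original examples: example.com and the two paths are placeholder assumptions. Allow is listed before Disallow here because urllib.robotparser appears to apply the first rule in a group that matches the path, so with Disallow: / listed first the standard-library parser may report /public/ as blocked, even though crawlers that use longest-match semantics would allow it.

```python
import urllib.robotparser

# Hypothetical rule set: only /public/ is open to crawlers.
# Allow comes first so a first-match parser reaches it before "Disallow: /".
rules = """\
User-agent: *
Allow: /public/
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes the lines directly; no network access

# example.com is a placeholder host
public_ok = rp.can_fetch('*', 'https://example.com/public/index.html')
private_ok = rp.can_fetch('*', 'https://example.com/private/index.html')
print(public_ok, private_ok)
```

Swapping the Allow and Disallow lines in this sketch is a quick way to see how a given parser treats rule order.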

Allow specific crawlers to access specific directories

User-agent:  Baiduspider
Allow:  /article
Allow:  /oshtml
Disallow:  /product/
Disallow:  /

User-Agent:  Googlebot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Disallow:  /

User-Agent:  *
Disallow:  /
# Taobao's robots.txt, the source of the example above
https://www.taobao.com/robots.txt

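
To see how per-crawler groups play out, the Taobao rules shown above can be parsed locally and queried for different user agents. This is a sketch under assumptions: the /product/1 path and the SomeOtherBot name are made up for illustration, and the rules are the copy quoted here, not the live file.

```python
import urllib.robotparser

# Rules copied from the Taobao example above.
rules = """\
User-agent:  Baiduspider
Allow:  /article
Allow:  /oshtml
Disallow:  /product/
Disallow:  /

User-Agent:  Googlebot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Disallow:  /

User-Agent:  *
Disallow:  /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The path and "SomeOtherBot" are made-up values for illustration.
url = 'https://www.taobao.com/product/1'
results = {agent: rp.can_fetch(agent, url)
           for agent in ('Baiduspider', 'Googlebot', 'SomeOtherBot')}
print(results)
```

Only the Googlebot group has a rule that allows /product; Baiduspider hits Disallow: /product/ and any other crawler falls through to the catch-all group.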
Crawler name    Company          Website
Baiduspider     Baidu            www.baidu.com
Googlebot       Google           www.google.com
Bingbot         Microsoft Bing   cn.bing.com
360Spider       360 Search       www.so.com
Yisouspider     Shenma Search    http://m.sm.cn/
Sogouspider     Sogou Search     https://www.sogou.com/
Yahoo! Slurp    Yahoo            https://www.yahoo.com/
...             ...              ...

urllib.robotparser.RobotFileParser(url='')

urllib.robotparser.RobotFileParser(url='https://www.taobao.com/robots.txt')

Common methods:

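  • set_url(url): sets the URL of the robots.txt file. If a url argument was passed when the RobotFileParser was instantiated, there is no need to call this method.
  • read(): fetches the robots.txt file and hands its contents to parse() for parsing.
  • parse(lines): parses robots.txt content supplied as a sequence of lines.
  • can_fetch(useragent, url): takes a User-Agent and the URL to fetch, and returns True if that user agent is allowed to fetch the URL under the parsed rules, otherwise False.
  • mtime(): returns the time the robots.txt file was last fetched and parsed. This is useful for long-running crawlers that need to re-check robots.txt periodically.
  • modified(): sets the time the robots.txt file was last fetched and parsed to the current time; likewise useful for long-running crawlers.
  • crawl_delay(useragent): returns the value of the Crawl-delay parameter from robots.txt for the given User-Agent. Returns None if there is no such parameter, if it does not apply to the given User-Agent, or if the robots.txt entry for it has invalid syntax. (New in Python 3.6)
  • request_rate(useragent): returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). Returns None if there is no such parameter, if it does not apply to the given useragent, or if the robots.txt entry for it has invalid syntax. (New in Python 3.6)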
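
The mtime() and modified() descriptions above suggest a simple refresh pattern for long-running crawlers. The following is a minimal sketch, not a prescribed approach: is_stale, refresh_if_stale, and the one-hour limit are hypothetical names and values chosen for illustration.

```python
import time
import urllib.robotparser


def is_stale(rp, max_age_seconds, now=None):
    """Return True if rp has no parsed rules yet or they are too old."""
    now = time.time() if now is None else now
    last_checked = rp.mtime()  # 0 until rules are parsed or modified() is called
    return last_checked == 0 or now - last_checked > max_age_seconds


def refresh_if_stale(rp, max_age_seconds=3600):
    """Re-download robots.txt when the cached copy is too old (network call)."""
    if is_stale(rp, max_age_seconds):
        rp.read()  # fetches the URL given to set_url() or the constructor
    return rp


rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp, 3600)            # nothing loaded yet
rp.parse(['User-agent: *', 'Disallow: /'])   # parse() records the current time
stale_after = is_stale(rp, 3600)
stale_later = is_stale(rp, 3600, now=rp.mtime() + 7200)  # pretend two hours passed
print(stale_before, stale_after, stale_later)
```

A crawler loop could call refresh_if_stale(rp) before each batch of can_fetch() checks; the demonstration at the bottom stays offline by using parse() instead of read().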

Request-rate and Crawl-delay both limit how often a crawler may request pages. Zhihu's robots.txt (https://www.zhihu.com/robots.txt) uses both parameters; for more about them, see this Zhihu answer: https://www.zhihu.com/question/264161961/answer/278828570
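
One way a crawler might turn these two parameters into a wait time is sketched below. The rule set is modelled on the MSNBot group from Zhihu's robots.txt; polite_delay, its 1-second default, and the SomeOtherBot name are assumptions made for this illustration, not part of urllib.robotparser.

```python
import urllib.robotparser

# One rate-limited group plus a catch-all group.
rules = """\
User-agent: MSNBot
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 10
Disallow: /login

User-Agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())


def polite_delay(rp, useragent, default=1.0):
    """Hypothetical helper: seconds to wait between requests for useragent."""
    delays = [default]
    crawl_delay = rp.crawl_delay(useragent)
    if crawl_delay is not None:
        delays.append(float(crawl_delay))
    rate = rp.request_rate(useragent)
    if rate is not None and rate.requests > 0:
        delays.append(rate.seconds / rate.requests)
    return max(delays)  # honor the strictest limit


msn_rate = rp.request_rate('MSNBot')
msn_delay = polite_delay(rp, 'MSNBot')
other_delay = polite_delay(rp, 'SomeOtherBot')
print(msn_rate, msn_delay, other_delay)
```

Taking the maximum is a conservative choice: when both parameters are present, the crawler waits for whichever asks for the longer pause.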

"""
Analyze Zhihu's robots.txt
"""
import urllib.robotparser


rp = urllib.robotparser.RobotFileParser()

# Set the URL of the robots.txt file
rp.set_url('https://www.zhihu.com/robots.txt')

# read() is required; without it there are no parsed rules to check against
rp.read()

# Check whether each user agent is allowed to fetch the URL
print(rp.can_fetch('Googlebot', 'https://www.zhihu.com/question/264161961/answer/278828570'))
print(rp.can_fetch('*', 'https://www.zhihu.com/question/264161961/answer/278828570'))

# Time at which robots.txt was last fetched and parsed
print(rp.mtime())

# Set the last-fetched time to the current time
rp.modified()
print(rp.mtime())  # the timestamp differs on the second print

# Request-rate limit from robots.txt
print(rp.request_rate('*'))
print(rp.request_rate('MSNBot').requests)

# Crawl-delay limit from robots.txt
print(rp.crawl_delay('*'))
print(rp.crawl_delay('MSNBot'))

Output:

True
False
1525659777.562644
1525659777.5746505
None
1
None
10

"""
Use the parse() method to parse robots.txt content that was fetched separately
"""
import urllib.robotparser
import urllib.request


rp = urllib.robotparser.RobotFileParser()
rp.parse(urllib.request.urlopen('https://www.zhihu.com/robots.txt').read().decode('utf-8').split('\n'))

print(rp.can_fetch('Googlebot', 'https://www.zhihu.com/question/264161961/answer/278828570'))
print(rp.can_fetch('*', 'https://www.zhihu.com/question/264161961/answer/278828570'))

Output:

True
False
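
Fetching the file yourself and passing it to parse() also makes it possible to control the request, which read() does not expose. The sketch below is one possible arrangement under assumptions: fetch_robots_text, build_parser, the example-crawler/0.1 user agent, and the 10-second timeout are hypothetical names and values, and the demonstration uses a made-up rule set so that it runs offline.

```python
import urllib.request
import urllib.robotparser


def fetch_robots_text(url, user_agent='example-crawler/0.1', timeout=10):
    """Download robots.txt with an explicit User-Agent header (network call)."""
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())  # splitlines() copes with \n and \r\n
    return rp


# Offline demonstration with a made-up rule set.
sample = 'User-agent: *\nDisallow: /private\r\n'
rp = build_parser(sample)
allowed = rp.can_fetch('*', 'https://example.com/index.html')
blocked = rp.can_fetch('*', 'https://example.com/private/data')
print(allowed, blocked)
```

Against a live site the two helpers would be combined as build_parser(fetch_robots_text('https://www.zhihu.com/robots.txt')).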

Zhihu's robots.txt file

User-agent: Googlebot
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Googlebot-Image
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Baiduspider-news
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Baiduspider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Baiduspider-image
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sosospider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: bingbot
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: 360Spider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: HaosouSpider 
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: yisouspider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: YoudaoBot
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou Orion spider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou News Spider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou blog
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou spider2
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou inst spider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: Sogou web spider
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: EasouSpider
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 10
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-agent: MSNBot
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 10
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
Disallow: /people/*

User-Agent: *
Disallow: /