Scrapy with rules
2017-07-15 · 方方块
Use case - extracting links
from scrapy.spiders import CrawlSpider, Rule
Rule
LinkExtractor() - given a page, grabs all the URLs on it
from scrapy.linkextractors import LinkExtractor
rules = (Rule(LinkExtractor()),)
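Putting the two imports together, a minimal runnable sketch (the spider name and start URL are placeholders, not from the original post):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LinksSpider(CrawlSpider):
    name = 'links'                       # placeholder name
    start_urls = ['http://example.com']  # placeholder start URL

    # one bare rule: extract every link on each page and request it
    rules = (
        Rule(LinkExtractor()),
    )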
callback - the name of the spider method to call for each page this rule matches
rules = (Rule(LinkExtractor(), callback='parse_page'),)
the name parse is reserved by CrawlSpider itself, so pick a different callback name (e.g. parse_page)
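For example, a sketch of a spider with a callback (parse_page and the title extraction here are illustrative):

class PagesSpider(CrawlSpider):
    name = 'pages'                       # placeholder name
    start_urls = ['http://example.com']  # placeholder start URL

    rules = (
        Rule(LinkExtractor(), callback='parse_page'),
    )

    def parse_page(self, response):
        # called once for every extracted page; yield items from here
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}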
follow - keep following links from the pages this rule matches (e.g. on to the next page)
rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)
Since Scrapy automatically filters out duplicate requests, there is no fear of re-crawling the same page even when every page links back to the same categories!
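A typical use of follow=True is pagination; a sketch, assuming the page links sit inside a .pagination element (that CSS selector is an assumption):

rules = (
    # only extract links inside the pagination widget, parse each page,
    # and keep following so every page in the chain is visited exactly once
    Rule(LinkExtractor(restrict_css='.pagination'),
         callback='parse_page', follow=True),
)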
deny_domains - domains whose links should not be extracted
beware of following links out to sites like google.com; you might get banned
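A sketch for keeping the crawl on-site (the listed domains are illustrative):

rules = (
    # drop any extracted links pointing at these external domains
    Rule(LinkExtractor(deny_domains=['google.com', 'facebook.com']),
         callback='parse_page', follow=True),
)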
allow - only extract URLs matching the given regex pattern(s)
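allow takes a regex or a list of regexes matched against each URL; a sketch, assuming we only want category pages (the /category/ pattern is hypothetical):

rules = (
    # only URLs whose path matches the regex are extracted at all
    Rule(LinkExtractor(allow=r'/category/'),
         callback='parse_page', follow=True),
)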