Intro

2017-07-15, 方方块

(Optional) Create virtual environment

Prefer Python 3:
mkvirtualenv --python=/usr/bin/python3 python3

Run pip --version to confirm that the pip in the environment belongs to Python 3.
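The same check can be made from inside the interpreter. This is a small sketch that complements pip --version; it only reports which Python is running and assumes nothing about the environment name.

```python
import sys

# Major version of the interpreter running this script; it should be 3
# when the Python 3 virtual environment is active.
major = sys.version_info.major
print(f"Running Python {major}.{sys.version_info.minor} from {sys.executable}")
```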

Steps

scrapy startproject name
scrapy genspider botname url

ROBOTSTXT_OBEY in settings.py should be True, so that the spider only crawls pages permitted by the site's robots.txt and stays a good web citizen.
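A minimal fragment of the project's settings.py, assuming the default layout created by scrapy startproject:

```python
# settings.py (fragment)
# When True, Scrapy downloads each site's robots.txt first and skips
# requests that the file disallows.
ROBOTSTXT_OBEY = True
```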

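Inside the project folder, run the spider with scrapy crawl botname.
Test selectors in the Scrapy shell before running a full crawl.
Run scrapy crawl botname -o xx.json (or xx.csv) to write the results to a file and inspect them.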

shell to debug and test

scrapy shell

Test that a URL is reachable - fetch(url)
Check the HTML that Scrapy received - view(response), which opens the response in a browser

Alternative XPath testing tool
http://www.freeformatter.com/xpath-tester.html

Xpath docs

XPath queries run on a selector built from the response.

A selector, as the name suggests, selects parts of the HTML content:
from scrapy.selector import Selector
Since this is a common operation, response.selector.xpath() is shortened to response.xpath()

Extra
CSS selectors can also be used, via response.css(); Scrapy translates them to XPath internally, so these notes stick to XPath.

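//name or //* - select every instance of the HTML tag name, or every element, anywhere in the document
text() - the text content of a node, returned as a unicode string
'(//name)[1]' - the first match in the document; XPath indexes from 1, so this is the same as response.xpath('//name')[0] in Python, use either. Without the parentheses, '//name[1]' matches every name that is the first name child of its parent
. - the current node; start the expression with it (for example .//name) when querying a selector that is not the response itself, or omit the leading // altogether
@ - grabs an attribute, for example //a/@href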

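If an element has an itemprop attribute, select on it rather than on class; itemprop describes the data itself, while class names tend to change with the page styling.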

Tools to get XPath fast:

https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
