Data Acquisition: Web Crawling in Practice

2018-07-08 Fitz_Lee

Introductory articles on web crawling

https://zhuanlan.zhihu.com/p/24669128
https://zhuanlan.zhihu.com/p/24769534
https://zhuanlan.zhihu.com/p/25200262
https://zhuanlan.zhihu.com/p/26257790

User-Agent and dynamic IP (proxy) settings

http://lawtech0902.com/2017/06/11/scrapy-useragent-proxyip/
https://zhuanlan.zhihu.com/p/29733174
https://github.com/hellysmile/fake-useragent
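
As a rough illustration of the idea covered in the links above, a Scrapy downloader middleware can set a random User-Agent and proxy on each outgoing request. The class name, the User-Agent strings, and the proxy addresses below are placeholders invented for this sketch, not values from the linked articles; `process_request` and `request.meta["proxy"]` are the standard Scrapy hooks.

```python
import random

# Placeholder values for illustration only; replace with your own lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0",
]
PROXIES = [
    "http://127.0.0.1:8080",  # hypothetical local proxy
    "http://127.0.0.1:8081",  # hypothetical local proxy
]


class RandomUserAgentProxyMiddleware:
    """Hypothetical downloader middleware: rotate User-Agent and proxy."""

    def process_request(self, request, spider):
        # Scrapy calls this for every request passing through the downloader.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # None tells Scrapy to keep processing the request
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in settings.py with its module path (for example a hypothetical `myproject.middlewares.RandomUserAgentProxyMiddleware`) and a priority such as 543.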

Download delay and disabling cookies

https://blkstone.github.io/2016/03/02/crawler-anti-anti-cheat/
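
In a Scrapy project, both measures are usually set in settings.py. The fragment below is a minimal sketch; the 2-second delay is an arbitrary example value, not a recommendation from the linked article.

```python
# settings.py (fragment)

# Wait about 2 seconds between requests to the same site (example value).
DOWNLOAD_DELAY = 2

# Scrapy then waits a random 0.5x to 1.5x of DOWNLOAD_DELAY (this is the default).
RANDOMIZE_DOWNLOAD_DELAY = True

# Do not store or send cookies, so requests are harder to tie to one session.
COOKIES_ENABLED = False
```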

Handling Ajax with PhantomJS and Selenium

https://my.oschina.net/lewisgong/blog/872257
https://chaycao.github.io/2016/08/19/Scrapy-Selenium-Phantomjs/

Page parsing: Beautiful Soup, XPath, CSS selectors

https://cuiqingcai.com/1319.html

Python

Installing lxml

https://pypi.org/project/lxml/#files
pip install lxml-4.2.1-cp27-cp27m-win_amd64.whl
https://blog.csdn.net/g1apassz/article/details/46574963
https://blog.csdn.net/acingdreamer/article/details/53348649

Upgrading pip

pip install --upgrade pip

Creating and using requirements.txt

https://blog.csdn.net/orangleliu/article/details/60958525

Python path and imports

https://blog.csdn.net/tony_wong/article/details/18044273
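
One common way to make a module outside the current directory importable is to append its directory to `sys.path` before importing. The sketch below assumes a hypothetical `libs` directory; the helper name is also invented for this example.

```python
import os
import sys


def add_to_import_path(directory):
    """Append a directory to sys.path (only once) and return its absolute path."""
    path = os.path.abspath(directory)
    if path not in sys.path:
        sys.path.append(path)
    return path


# Hypothetical layout: helper modules live in ./libs under the working directory.
libs_dir = add_to_import_path("libs")
```

After this call, `import some_module` would also search `libs` for `some_module.py`.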

Scrapy installation error: Microsoft Visual C++ 14.0 is required...

https://blog.csdn.net/nima1994/article/details/74931621?locationNum=10&fps=1

Scrapy shell

https://blog.csdn.net/laoyang360/article/details/52809927

Scrapy runtime error: ImportError: No module named win32api

https://blog.csdn.net/u013687632/article/details/57075514

XPath

https://blog.csdn.net/manongpengzai/article/details/77109600
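
To give a feel for XPath expressions, here is a small sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. It is a stand-in chosen so the example runs without extra packages; lxml's `xpath()` and Scrapy's `response.xpath()` accept full XPath 1.0. The HTML snippet is invented for this example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed snippet invented for this example.
PAGE = """
<html>
  <body>
    <div class="post">
      <a href="/p/1">First post</a>
    </div>
    <div class="post">
      <a href="/p/2">Second post</a>
    </div>
    <div class="ad">
      <a href="/ad">Sponsored</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# .//div[@class='post']/a selects every <a> directly under a <div class="post">.
links = root.findall(".//div[@class='post']/a")

hrefs = [a.get("href") for a in links]   # attribute values
titles = [a.text for a in links]         # text content of each link
```

The `[@class='post']` predicate filters out the `ad` block, so only the two post links are returned.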

Python logging

https://blog.csdn.net/chosen0ne/article/details/7319306
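
A minimal sketch of the standard `logging` module, written for Python 3: one logger, one handler, one formatter. The logger name `crawler` and the helper `build_logger` are hypothetical, and an in-memory buffer stands in for `sys.stderr` or a log file.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Return a logger that writes timestamped messages to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the root logger's handlers
    return logger


buffer = io.StringIO()  # stands in for sys.stderr or a log file
log = build_logger("crawler", buffer)
log.info("fetched %d pages", 3)
log.debug("this is below INFO and is dropped")
output = buffer.getvalue()
```

Because the logger level is INFO, the `debug` call produces no output; lowering the level to `logging.DEBUG` would include it.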

Scrapy link extractor

https://www.jianshu.com/p/ff9125650697

Starting the crawler

From the project's root directory, run the following command to start the spider:
scrapy crawl dmoz
