pyspider: using fetch_type='js' for pages that need JavaScript

2019-08-28 blaze冰叔

While learning pyspider I ran into a problem. When a page builds its content with JavaScript, you need to pass fetch_type='js' to self.crawl, like this:

    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('http://travel.qunar.com/travelbook/list.htm', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # Detail pages need JavaScript, so ask the fetcher to render them.
            for each in response.doc('li > .tit > a').items():
                self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')
            # Follow the pagination link, skipping the crawl when there is none.
            next_url = response.doc('.next').attr.href
            if next_url:
                self.crawl(next_url, callback=self.index_page)

        @config(priority=2)
        def detail_page(self, response):
            return {
                "url": response.url,
                "title": response.doc('title').text(),
            }

After adding fetch_type='js', however, those requests started failing with this error:

[E 190828 13:41:01 base_handler:203] 501 Server Error

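I suspected PhantomJS, the library pyspider relies on to process JavaScript, and it turned out PhantomJS was not installed on my machine. To set it up:

1. Download PhantomJS. The project has been archived and is no longer maintained, so the latest release is 2.1.1.
2. Put the extracted folder in any directory you are unlikely to move or delete.
3. Add its bin directory to PATH in .bash_profile.

Open the file:

    vim ~/.bash_profile

and add an export line for PATH:

    export PATH=/{your own path}/phantomjs-2.1.1-macos/bin:$PATH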
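Next, reload the profile so the change takes effect:

    source ~/.bash_profile

Then print the variable:

    echo $PATH

and confirm that the PhantomJS bin directory appears in the output.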
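
The same check can be done from Python, which is where pyspider itself runs. What follows is a minimal sketch, not part of pyspider: find_phantomjs and phantomjs_version are hypothetical helper names, and it assumes Python 3.7+ and that the binary is named phantomjs. It looks the executable up on PATH with shutil.which and then asks it for its version.

```python
import shutil
import subprocess


def find_phantomjs(name="phantomjs", path=None):
    """Return the absolute path of the executable if it is on PATH, else None."""
    return shutil.which(name, path=path)


def phantomjs_version(executable):
    """Run `<executable> --version` and return its output, or None on failure."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing file, non-zero exit code, or timeout.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    exe = find_phantomjs()
    if exe is None:
        print("phantomjs not found on PATH; check the export line in .bash_profile")
    else:
        print("phantomjs found at", exe, "version:", phantomjs_version(exe))
```

If this prints the not-found message in the same terminal you start pyspider from, the export line has not taken effect in that shell yet.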

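The last step is the important one: restart pyspider from a shell that has the updated PATH, so it can find PhantomJS.
After that, requests with fetch_type='js' work normally.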
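
To confirm that the restarted pyspider actually brought up its PhantomJS component, one option is to probe the port that component listens on. This sketch assumes the port is 25555 and that pyspider runs on the same machine. Both are assumptions to verify against your own setup, and is_port_open is a hypothetical helper, not a pyspider API.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Assumed address of pyspider's PhantomJS component; adjust to your setup.
    host, port = "127.0.0.1", 25555
    state = "reachable" if is_port_open(host, port) else "not reachable"
    print("phantomjs component at {}:{} is {}".format(host, port, state))
```

A "not reachable" result while pyspider is running suggests PhantomJS still was not found at startup, which brings you back to the PATH check.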
