Problems Running the Pyspider Sample Code

2018-10-25  万事皆成

Level 1: HTML and CSS Selector

The code given in the official tutorial no longer produces correct results. A modified version follows:

import re
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1', callback=self.index_page)


    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https"]').items():
            # Keep only movie detail links and drop the ?ref_= tracking suffix.
            movie = re.match(r'(https://www\.imdb\.com/title/tt\d+/)\?ref_=', each.attr.href)
            if movie:
                self.crawl(movie.group(1), callback=self.detail_page)
                
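    @config(priority=2)
    def detail_page(self, response):
        # Take the title from <h1 class=""> and strip the &nbsp; entity;
        # fall back to None instead of raising when the markup does not match.
        match = re.search(r'<h1 class="">(.*?)<', response.text)
        title = match.group(1).replace('&nbsp;', '').strip() if match else None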
        
        item_list = response.doc('.credit_summary_item').items()
        director = []
        for item in item_list:
            if "Director" in item('h4').text():
                director = [x.text() for x in item('a').items()]
                          
        return {
            "url": response.url,
            "title": title,
            "rating": response.doc('[itemprop="ratingValue"]').text(),
            "director": director,
        }
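
The link filter can be checked on its own, outside pyspider. The sketch below is a minimal standalone version of the pattern that index_page uses to pick out movie links (with the dots escaped). The helper name canonical_movie_url and the two sample hrefs are illustrative assumptions, not taken from the tutorial.

```python
import re

# Pattern for IMDb movie detail links that carry a ?ref_= tracking suffix.
MOVIE_LINK = re.compile(r'(https://www\.imdb\.com/title/tt\d+/)\?ref_=')


def canonical_movie_url(href):
    """Return the detail-page URL without its query string, or None."""
    match = MOVIE_LINK.match(href or "")
    return match.group(1) if match else None


# Illustrative hrefs (assumed shapes, not captured from the live page).
print(canonical_movie_url("https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt"))
print(canonical_movie_url("https://www.imdb.com/name/nm0000209/?ref_=adv_li_st"))
```

The first call prints the bare title URL and the second prints None, since person pages are not movie links.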

Level 2: AJAX and More HTTP

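Replaying the request with Postman failed. To be completed!
The original XHR data endpoint no longer shows up when tracing requests in the browser: http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1
The new JSON data request goes to: https://gql.twitch.tv/gql
headers: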

POST /gql HTTP/1.1
Host: gql.twitch.tv
Connection: keep-alive
Content-Length: 255
Pragma: no-cache
Cache-Control: no-cache
Origin: https://www.twitch.tv
Accept-Language: zh-CN
Client-Id: kimne78kx3ncx6brgo4mv6wki5h1ko
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
X-Device-Id: 326baa0403887e01
Content-Type: text/plain;charset=UTF-8
Accept: */*
Referer: https://www.twitch.tv/directory/game/Dota%202
Accept-Encoding: gzip, deflate, br

payload:

[{"operationName":"DirectoryPage_Game","variables":{"name":"dota 2","limit":30,"sort":"VIEWER_COUNT","tags":[],"cursor":"Nzc="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"f7c5ea69517715f8ab06d30ce66f6355af61593ac0ff806b518286932d177cc7"}}}]
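
For reference, the captured request can be rebuilt in Python as follows. This is only a sketch: replaying it with Postman failed, so there is no guarantee it succeeds from pyspider either, and the Client-Id and sha256Hash are simply the captured values, which may since have changed. The helper name build_gql_request and the callback name json_parser are assumptions made for illustration.

```python
import json

GQL_URL = "https://gql.twitch.tv/gql"


def build_gql_request(game="dota 2", limit=30, cursor="Nzc="):
    """Rebuild the captured DirectoryPage_Game request as url, headers and body."""
    headers = {
        # Copied from the captured request; the server may reject stale values.
        "Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko",
        "Content-Type": "text/plain;charset=UTF-8",
        "Origin": "https://www.twitch.tv",
        "Referer": "https://www.twitch.tv/directory/game/Dota%202",
    }
    payload = [{
        "operationName": "DirectoryPage_Game",
        "variables": {
            "name": game,
            "limit": limit,
            "sort": "VIEWER_COUNT",
            "tags": [],
            "cursor": cursor,
        },
        "extensions": {
            "persistedQuery": {
                "version": 1,
                "sha256Hash": "f7c5ea69517715f8ab06d30ce66f6355af61593ac0ff806b518286932d177cc7",
            }
        },
    }]
    # Compact separators reproduce the whitespace-free form of the captured body.
    body = json.dumps(payload, separators=(",", ":"))
    return {"url": GQL_URL, "headers": headers, "data": body}


req = build_gql_request()
print(req["data"])

# Inside a pyspider Handler the request would be issued roughly like this
# (json_parser is a hypothetical callback):
#     self.crawl(req["url"], method="POST", headers=req["headers"],
#                data=req["data"], callback=self.json_parser)
```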

Level 3: Render with PhantomJS

Fetching the page http://www.twitch.tv/directory/game/Dota%202 with PhantomJS failed!

Succeeded: pyspider crawler tutorial (3): rendering JS-driven pages with PhantomJS (pyspider 爬虫教程(三):使用 PhantomJS 渲染带 JS 的页面)

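Problem: DOM elements that are visible in the browser cannot be retrieved by pyspider + phantomjs.
Solution: in the project list, set the project's status to debug or running, then run the project again.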
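
For context, PhantomJS rendering is requested per crawl call through pyspider's fetch_type='js' option. The sketch below collects the relevant options in a dict; the scroll script is a hypothetical example for triggering lazily loaded content and is not part of the original notes.

```python
# Options for pyspider's self.crawl that route the fetch through PhantomJS.
JS_CRAWL_OPTIONS = {
    "fetch_type": "js",           # render with PhantomJS instead of a plain HTTP fetch
    "js_run_at": "document-end",  # run js_script after the page has loaded
    # Hypothetical script: scroll to the bottom so lazy-loaded items are requested.
    "js_script": "function() { window.scrollTo(0, document.body.scrollHeight); }",
}

# Inside a Handler method the options would be passed along like this:
#     self.crawl('http://www.twitch.tv/directory/game/Dota%202',
#                callback=self.index_page, **JS_CRAWL_OPTIONS)
```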
