[spider] Scraping a football site with Scrapy

2019-01-24  Franckisses

Today I wrote a crawler that scrapes content from a well-known domestic football site.
First, create the project:

scrapy startproject dongXXXdi

Then create a spider:

scrapy genspider DQD "dongXXXdi.com"

This produces the following directory layout:

[Image: project directory structure]

I won't go into the page structure here; you can refer to the Chrome Network-panel debugging in my previous post.
Straight to the code. First, spider.py:

# -*- coding: utf-8 -*-
import json

import scrapy


class DqdSpider(scrapy.Spider):
    name = "DQD"
    allowed_domains = ["dongqiudi.com"]
    start_urls = ['http://dongqiudi.com/archives/1?page=1']

    def parse(self, response):
        text = json.loads(response.text)
        for data in text['data']:
            yield data
        # Only crawl the first 50 pages for now; Scrapy's duplicate
        # filter drops the repeated requests scheduled by later pages.
        for i in range(2, 50):
            new_url = "http://dongqiudi.com/archives/1?page={}".format(i)
            yield scrapy.Request(url=new_url, callback=self.parse)
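The archives endpoint returns JSON rather than HTML, which is why parse() simply decodes the response body. Here is a standalone sketch of that decoding step, using a made-up payload (the entry fields are assumed from the item definition, not taken from the real API):

```python
import json

# Hypothetical sample of the archives API payload -- the real
# response shape is assumed, not verified.
sample = '''{"data": [
  {"id": 1, "title": "Match report", "comments_total": 12},
  {"id": 2, "title": "Transfer news", "comments_total": 3}
]}'''

def parse_payload(text):
    """Mimic the spider's parse(): decode the JSON body and yield each entry."""
    payload = json.loads(text)
    for entry in payload["data"]:
        yield entry

items = list(parse_payload(sample))
print(len(items))         # 2
print(items[0]["title"])  # Match report
```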

Next, items.py:

import scrapy


class DongqiudiItem(scrapy.Item):
    # define the fields for your item here like:
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    user_id = scrapy.Field()
    type = scrapy.Field()
    display_time = scrapy.Field()
    thumb = scrapy.Field()
    comments_total = scrapy.Field()
    web_url = scrapy.Field()
    official_account = scrapy.Field()
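These fields are meant to mirror the keys of each JSON entry from the archives API (the sample entry below is hypothetical). A quick sketch of filtering an entry down to just these fields with a plain dict, which is what the item effectively does:

```python
# Field names taken from the item definition above; the sample
# entry is made up, not a real API response.
FIELDS = {
    "id", "title", "description", "user_id", "type",
    "display_time", "thumb", "comments_total", "web_url",
    "official_account",
}

entry = {
    "id": 42,
    "title": "Derby preview",
    "description": "A short blurb",
    "comments_total": 7,
    "extra_key": "dropped",   # not declared in the item, filtered out below
}

item = {k: v for k, v in entry.items() if k in FIELDS}
print(sorted(item))  # ['comments_total', 'description', 'id', 'title']
```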

Then pipelines.py, which stores every item as a line of JSON:

import json

class DongqiudiPipeline(object):
    def process_item(self, item, spider):
        with open("DQD.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
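Appending one JSON object per line produces a JSON-Lines file, which is easy to load back later. Note the `dict(item)` call: a Scrapy Item is dict-like but not a plain dict, so `json.dumps` needs the conversion. A small round-trip sketch (the file path and items are made up):

```python
import json
import os
import tempfile

# Two hypothetical items, written the way the pipeline does
# (one JSON object per line) and then read back.
items = [
    {"id": 1, "title": "标题一"},
    {"id": 2, "title": "标题二"},
]

path = os.path.join(tempfile.gettempdir(), "DQD.json")
with open(path, "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]

print(restored == items)  # True
```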

Finally, settings.py:

BOT_NAME = 'dongqiudi'

SPIDER_MODULES = ['dongqiudi.spiders']
NEWSPIDER_MODULE = 'dongqiudi.spiders'
ROBOTSTXT_OBEY = False  # do not obey robots.txt

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}

ITEM_PIPELINES = {
    'dongqiudi.pipelines.DongqiudiPipeline': 300,
}

That's all the code, done simply. Now let's look at the crawl results.


[Image: crawl results]