Scrapy Learning Notes 2

2017-05-08  by 枫落柠

Tags: Information Retrieval


1. Create a Scrapy project

scrapy startproject tutorial
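
Running this command creates a `tutorial` directory with Scrapy's standard project layout:

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # Item definitions go here
        pipelines.py
        settings.py
        spiders/          # spiders go here
            __init__.py
```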

2. Define the Item to extract

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

3. Write a spider to crawl the site and extract Items

3.1 Write the initial spider

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
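
The filename is the second-to-last path segment of the URL, so each start URL is saved under its category name ("Books", "Resources"). The split can be checked with plain Python:

```python
# parse() derives a filename by splitting the URL on "/" and taking the
# second-to-last segment (the last segment is empty for URLs ending in "/").
url = "http://dmoztools.net/Computers/Programming/Languages/Python/Books/"
parts = url.split("/")
filename = parts[-2]
print(filename)  # → Books
```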

3.2 Run the crawl

scrapy crawl dmoz

4. Store the extracted Items (the data)

4.1 Extract the data

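The XPath expressions used in the next step ('//ul/li', 'a/text()', 'a/@href') select list items, the link text inside them, and the link targets. As a rough illustration of what those expressions match — using the standard library's limited XPath support rather than Scrapy's selectors, and a made-up HTML fragment shaped like the DMOZ listing pages:

```python
# Illustration only: xml.etree supports a small XPath subset; in Scrapy you
# would use the scrapy shell and response.xpath(...) instead.
import xml.etree.ElementTree as ET

# Hypothetical fragment resembling the DMOZ book listings.
html = """<ul>
  <li><a href="http://example.com/book1">Book One</a> - a description</li>
  <li><a href="http://example.com/book2">Book Two</a> - another description</li>
</ul>"""

root = ET.fromstring(html)
for li in root.findall("li"):   # cf. response.xpath('//ul/li')
    a = li.find("a")
    title = a.text              # cf. sel.xpath('a/text()')
    link = a.get("href")        # cf. sel.xpath('a/@href')
    print(title, link)
```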

4.2 Modify the spider to extract data

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
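
Note that parse() uses yield rather than building and returning a list: it is a generator, and Scrapy pulls items (and any follow-up requests) from it one at a time. The same pattern in plain Python, with dicts standing in for DmozItem (rows and values below are made up for illustration):

```python
# Generator sketch of the parse() pattern; dicts stand in for DmozItem.
def parse_sketch(rows):
    for title, link, desc in rows:
        item = {"title": title, "link": link, "desc": desc}
        yield item  # hand one item back without ending the loop

rows = [("Book One", "http://example.com/1", "desc one"),
        ("Book Two", "http://example.com/2", "desc two")]
items = list(parse_sketch(rows))
print(items[0]["title"])  # → Book One
```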

4.3 Save the crawled data

scrapy crawl dmoz -o items.json
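
The -o flag uses Scrapy's feed exports: items.json will contain a JSON array with one object per yielded item. Because extract() returns a list of all matches, each field value is a list. A sketch of the shape (values are placeholders, not real crawl output):

```
[
  {"title": ["Book One"], "link": ["http://example.com/book1"], "desc": [" - a description "]},
  {"title": ["Book Two"], "link": ["http://example.com/book2"], "desc": [" - another description "]}
]
```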


Further reading:
the official Scrapy documentation

