
Building a crawler with Scrapy + MongoDB + Redis

2017-03-16  by a十二_4765

1. Install Scrapy:  pip install scrapy

    Install scrapy-redis:  pip install scrapy-redis

2. Install MongoDB

mongod.exe is the server; mongo.exe is the client shell.

Install the MongoDB service; here it is placed under F:\php\mongodb.

F:\php\mongodb\bin>dir   (list the directory contents)

mongod --dbpath F:\php\mongodb   (F:\php\mongodb is where the data is stored)

Start the MongoDB server:

mongod.exe  --dbpath F:/php/mongodb/bin/

Open another cmd window, cd into the bin directory, and run mongo.exe to get a client shell.

Install pymongo for Python:

pip install pymongo
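To confirm the driver can reach the server, here is a quick check (a minimal sketch, assuming mongod is still running on the default port 27017):

    import pymongo

    # connect to the local mongod started above
    client = pymongo.MongoClient('127.0.0.1', 27017)
    print(client.server_info()['version'])  # prints the server version if reachable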

3. Install Redis
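The post gives no Redis setup details. Once a Redis server is running locally, a quick reachability check from Python looks like this (a sketch assuming the default port 6379; the redis package comes in as a dependency of scrapy-redis):

    import redis

    # ping the local Redis server that scrapy-redis will use
    r = redis.StrictRedis(host='127.0.0.1', port=6379)
    print(r.ping())  # True when the server is reachable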

Crawl target: lottery draw data from http://www.bwlc.net/

First, create the project:

scrapy startproject fucai 

cd into the fucai directory

Then create the spider:  scrapy genspider ff bwlc.net

Edit items.py:

    import scrapy

    class FucaiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        qihao = scrapy.Field()
        kaijiang = scrapy.Field()
        riqi = scrapy.Field()

Then edit ff.py in the spiders directory:

    import scrapy
    from scrapy.http import Request
    from fucai.items import FucaiItem
    from scrapy_redis.spiders import RedisSpider

    class FfSpider(RedisSpider):
        name = "ff"
        # RedisSpider reads its start URLs from this Redis list,
        # so no start_requests override is needed (overriding it
        # would bypass the Redis queue entirely)
        redis_key = 'ff:start_urls'
        allowed_domains = ["bwlc.net"]

        def parse(self, response):
            # total page count shown in the pager, printed for reference
            url = response.xpath('//div[@class="fc_fanye"]/span[2]/b[@class="col_red"]/text()').extract()
            print(url)
            # crawl the first two result pages
            for j in range(1, 3):
                page = "http://www.bwlc.net/bulletin/prevqck3.html?page=" + str(j)
                yield Request(url=page, callback=self.next2)

        def next2(self, response):
            # each <tr class="..."> row holds one draw record
            for i in response.xpath('//tr[@class]'):
                item = FucaiItem()
                item["qihao"] = i.xpath('td/text()').extract()[0]     # draw number
                item["kaijiang"] = i.xpath('td/text()').extract()[1]  # winning numbers
                item["riqi"] = i.xpath('td/text()').extract()[2]      # draw date
                yield item
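Note that a RedisSpider does not begin crawling on its own: after launch it blocks and waits for URLs to appear in the ff:start_urls list in Redis (seeded in the last step below).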

Configure settings.py:

    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'

    ITEM_PIPELINES = {
        'fucai.pipelines.FucaiPipeline': 300,
    }

    MONGODB_HOST = '127.0.0.1'
    MONGODB_PORT = 27017
    MONGODB_DBNAME = 'jike'
    MONGODB_DOCNAME = 'reada'
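By default scrapy-redis connects to Redis at localhost:6379; if your Redis server runs elsewhere, also set REDIS_HOST and REDIS_PORT here.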

Next, edit pipelines.py to write the scraped items into MongoDB; a sketch follows.
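The original post does not show the pipeline code. A minimal sketch of fucai/pipelines.py that stores each item using the MONGODB_* settings above (the exact implementation is my assumption) could be:

    import pymongo
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()

    class FucaiPipeline(object):
        def __init__(self):
            # connection details come from the settings shown above
            client = pymongo.MongoClient(
                host=settings['MONGODB_HOST'],
                port=settings['MONGODB_PORT'])
            db = client[settings['MONGODB_DBNAME']]
            self.post = db[settings['MONGODB_DOCNAME']]

        def process_item(self, item, spider):
            # store each scraped draw as one document
            self.post.insert_one(dict(item))
            return item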

Seed Redis with the URL to crawl:
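For example, from redis-cli (the key must match the spider's redis_key; the exact start URL is an assumption based on the listing page used in parse):

    lpush ff:start_urls http://www.bwlc.net/bulletin/prevqck3.html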

Then run scrapy crawl ff to start crawling.

For questions, contact QQ: 1158219108