Building a crawler with Scrapy + MongoDB + Redis
1. Install Scrapy: pip install scrapy
Install scrapy-redis: pip install scrapy-redis
2. Install MongoDB
mongod.exe is the server; mongo.exe is the client shell.
Unpack MongoDB on the F drive under php/mongodb.
F:\php\mongodb\bin>dir   (list the directory contents)
Start the server with --dbpath pointing at the data directory (F:\php\mongodb here):
mongod.exe --dbpath F:/php/mongodb
Then open another cmd window, cd into the bin directory, and run mongo.exe to connect.
Install pymongo for Python:
pip install pymongo
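A quick connection test confirms that mongod and pymongo are talking to each other (a minimal sketch; the 'jike' database and 'reada' collection names are the ones used in settings.py later, and insert_one/count_documents assume pymongo 3.7+):

import pymongo

# connect to the local mongod started above
client = pymongo.MongoClient('127.0.0.1', 27017)
coll = client['jike']['reada']
coll.insert_one({'ping': 'ok'})
print(coll.count_documents({}))  # should print at least 1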
3. Install Redis
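scrapy-redis talks to a running Redis server (the redis Python client is pulled in as a dependency of scrapy-redis). Assuming a Windows Redis build has been unpacked somewhere on disk, start and verify it with:

redis-server.exe
redis-cli.exe ping   (should answer PONG)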
Crawl target: lottery draw data from http://www.bwlc.net/
First, create the crawler project:
scrapy startproject fucai
cd into the fucai directory, then generate the spider:
scrapy genspider ff bwlc.net
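After these two commands the project should look roughly like this (the standard Scrapy layout; exact files may vary slightly with the Scrapy version):

fucai/
    scrapy.cfg
    fucai/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ff.py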
Edit items.py:
import scrapy

class FucaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    qihao = scrapy.Field()     # issue number
    kaijiang = scrapy.Field()  # winning numbers
    riqi = scrapy.Field()      # draw date
Then go into the spiders directory and edit ff.py:
from scrapy.http import Request
from fucai.items import FucaiItem
from scrapy_redis.spiders import RedisSpider

class FfSpider(RedisSpider):
    name = "ff"
    # RedisSpider pulls its start URLs from this Redis list,
    # so no start_requests override is needed
    redis_key = 'ff:start_urls'
    allowed_domains = ["bwlc.net"]

    def parse(self, response):
        # total page count shown in the pager
        url = response.xpath('//div[@class="fc_fanye"]/span[2]/b[@class="col_red"]/text()').extract()
        print(url)
        for j in range(1, 3):
            page = "http://www.bwlc.net/bulletin/prevqck3.html?page=" + str(j)
            yield Request(url=page, callback=self.next2)

    def next2(self, response):
        for i in response.xpath('//tr[@class]'):
            item = FucaiItem()
            tds = i.xpath('td/text()').extract()  # cell texts of one result row
            item["qihao"] = tds[0]     # issue number
            item["kaijiang"] = tds[1]  # winning numbers
            item["riqi"] = tds[2]      # draw date
            yield item
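Because FfSpider is a RedisSpider, it does not schedule its own first request: after startup it sits idle until a URL is pushed onto the ff:start_urls list in Redis. From redis-cli, for example (the exact start URL is an assumption based on the page pattern used in parse):

lpush ff:start_urls http://www.bwlc.net/bulletin/prevqck3.html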
Configure settings.py:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
ITEM_PIPELINES = {
    'fucai.pipelines.FucaiPipeline': 300,
}
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'jike'
MONGODB_DOCNAME = 'reada'
Edit pipelines.py so scraped items get written to MongoDB (a sketch follows below), then push the start URL into Redis as shown above.
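The notes do not show the pipeline code itself; the following is a minimal sketch of pipelines.py that stores each item in MongoDB using the MONGODB_* settings above (assuming pymongo 3+ for MongoClient/insert_one):

import pymongo

class FucaiPipeline(object):

    def __init__(self, host, port, dbname, docname):
        self.host = host
        self.port = port
        self.dbname = dbname
        self.docname = docname

    @classmethod
    def from_crawler(cls, crawler):
        # read the custom MONGODB_* keys defined in settings.py
        s = crawler.settings
        return cls(
            host=s.get('MONGODB_HOST', '127.0.0.1'),
            port=s.getint('MONGODB_PORT', 27017),
            dbname=s.get('MONGODB_DBNAME'),
            docname=s.get('MONGODB_DOCNAME'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.host, self.port)
        self.coll = self.client[self.dbname][self.docname]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one document per draw: qihao / kaijiang / riqi
        self.coll.insert_one(dict(item))
        return item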
Then run scrapy crawl ff to start crawling.
For questions, contact QQ: 1158219108