大数据 爬虫Python AI SqlPython小哥哥

Python scrapy框架爬取瓜子二手车信息数据!

2019-05-09  本文已影响0人  14e61d025165

项目实施依赖:

python,scrapy ,fiddler

scrapy安装依赖的包:

可以到 https://www.lfd.uci.edu/~gohlke/pythonlibs/ 下载 pywin32,lxml,Twisted,scrapy然后pip安装

项目实施开始:

Python学习交流群:1004391443,这里是python学习者聚集地,有大牛答疑,有资源共享!小编也准备了一份python学习资料,有想学习python编程的,或是转行,或是大学生,还有工作中想提升自己能力的,正在学习的小伙伴欢迎加入学习。

1、创建scrapy项目:cmd中cd到需创建的文件目录下

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy startproject guazi
View Code
</pre>

2、创建爬虫:cd到创建好的项目下

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">1 scrapy genspider gz guazi.com
View Code
</pre>

3、分析目标网址:

第一次我直接用的谷歌浏览器的抓包分析,取得UA和Cookies请求,返回的html数据完全缺失,分析可能是携带的Cookies

有问题,然后就用fiddler抓包才,得到Cookies与谷歌上得到Cookies多了UA,时间等参数,

4、将UA,Cookies添加到下载中间中去:

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">1 class Guzi1DownloaderMiddleware(object):
2 def process_request(self, request, spider):
3 # 需要对得到的cookies处理成字典类型
4 request.cookies={}
5 request.headers["User-Agent"]=""
View Code
</pre>

5、在settings中将DOWNLOADER_MIDDLEWARES打开

6、在spiders目录下找到gz.py开始编写爬虫逻辑处理

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> 1 import scrapy
2 import time
3
4 class GzSpider(scrapy.Spider):
5 name = 'gz'
6 allowed_domains = ['guazi.com']
7 start_urls = ['https://www.guazi.com/cd/buy/0']
8
9 def parse(self, response):
10 # 得到页面上所有车辆的url
11 url_list = response.xpath('//ul[@class="carlist clearfix js-top"]//li/a/@href').extract()
12 url_list = [response.urljoin(url) for url in url_list]
13 url_list = [url.replace("cq", "cd") for url in url_list]
14 for url in url_list:
15 yield scrapy.Request(url=url, callback=self.parse1, dont_filter=True)
16
17 # 获取下一页的url
18 next_url = response.urljoin(response.xpath('//span[text()="下一页"]/../@href').extract_first())
19 if next_url:
20 yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)
21 time.sleep(2)
22
23 def parse1(self, response):
24 # 判断是否有数据
25 if response.xpath('//h2/text()').extract_first():
26 print(response.xpath('//h2/text()').extract_first().strip())
27 item = {}
28 item["车型"] = response.xpath('//h2/text()').extract_first().strip()
29 item["选车类型"] = response.xpath('//h2/span/text()').extract_first()
30 item["价格/万"] = response.xpath('//div[@class="pricebox js-disprice"]/span[1]/text()').extract_first().strip()
31 item["新车价格"] = response.xpath('//div[@class="pricebox js-disprice"]/span[2]/text()').extract_first().strip()
32 item["上牌时间"] = response.xpath('//ul[@class="basic-eleven clearfix"]/li[1]/div/text()').extract_first().strip()
33 item["公里数"] = response.xpath('//ul[@class="basic-eleven clearfix"]/li[2]/div/text()').extract_first().strip()
34 item["排量"] = response.xpath('//ul[@class="basic-eleven clearfix"]/li[3]/div/text()').extract_first().strip()
35 item["变速箱"] = response.xpath('//ul[@class="basic-eleven clearfix"]/li[4]/div/text()').extract_first().strip()
36 item["配置信息"] = response.xpath('//span[@class="type-gray"]//text()').extract()
37 item["网址"] = response.url
38 yield item
View Code
</pre>

7、启动爬虫并保存为csv文件

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy crawl gz -o guanzi.csv
View Code
</pre>

8、最后得到了想要的二手车信息,贴上部分截图

<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1557401600825" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

上一篇 下一篇

猜你喜欢

热点阅读