Beike (ke.com) Wuhan Second-Hand Housing Data Analysis: Data Collection
2019-11-17
一半芒果
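Approach:
1. The Beike listing page for second-hand homes in Wuhan is at https://wh.ke.com/ershoufang/.
2. Use the Scrapy framework to loop over 100 list pages, each holding 30 listings.
3. Collect each listing's title, housing estate, tags, total price, unit price, floor, year built, layout, orientation, publication date, and follower count.
4. Parse the page data with XPath.
5. Save the results as a CSV file.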
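The pagination in step 2 can be sketched outside Scrapy. The snippet below is only an illustration: the helper name `build_page_urls` and the constants are made up for this example, while the URL pattern is the one the spider in this post uses.

```python
# Sketch: build the list-page URLs described in the approach above.
BASE_URL = "https://wh.ke.com/ershoufang/PG{}/"  # pattern used by the spider in this post
PAGES = 100             # number of list pages to visit
LISTINGS_PER_PAGE = 30  # listings per page, as stated in step 2


def build_page_urls(base_url=BASE_URL, pages=PAGES):
    """Return the list-page URLs for pages 1..pages (inclusive)."""
    return [base_url.format(page) for page in range(1, pages + 1)]


urls = build_page_urls()
expected_rows = PAGES * LISTINGS_PER_PAGE  # upper bound on rows in the CSV
print(len(urls), urls[0], urls[-1], expected_rows)
```

If every page really holds 30 listings, a complete crawl should produce at most 3000 rows, which is a handy sanity check on the exported file.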
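I. Preparation
- Create a Scrapy project (named ITEM here, to match the import and settings paths used below):
scrapy startproject ITEM
- Create the spider file from inside the project directory:
cd ITEM
scrapy genspider beike ke.com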
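II. Building the framework
(1) items.py: define the item
import scrapy


class ItemItem(scrapy.Item):
    # One field per value extracted from a listing card
    detailinfo = scrapy.Field()   # listing title
    info = scrapy.Field()         # house details (floor, year built, layout, orientation)
    location = scrapy.Field()     # housing estate
    followinfo = scrapy.Field()   # follower count and publication time
    tag = scrapy.Field()          # listing tags
    totalprice = scrapy.Field()   # total price
    unitprice = scrapy.Field()    # unit price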
(2) spider.py (generated as spiders/beike.py)
import scrapy

from ITEM.items import ItemItem


class BeikeSpider(scrapy.Spider):
    name = 'beike'
    allowed_domains = ['ke.com']
    baseurl = 'https://wh.ke.com/ershoufang/PG{}/'
    # Build the URLs of list pages 1 to 100
    start_urls = []
    for i in range(1, 101):
        url = baseurl.format(i)
        start_urls.append(url)

    def parse(self, response):
        # Each listing card is a <div class="info clear"> inside the result list
        room_list = response.xpath('//*[@id="beike"]//ul[@class="sellListContent"]//div[@class="info clear"]')
        # print('count:', len(room_list))
        for i in room_list:
            item = ItemItem()
            item['detailinfo'] = i.xpath('.//div[@class="title"]/a/text()').extract()[0]
            # The second text node holds the details; strip spaces and newlines
            item['info'] = i.xpath('.//div[@class="houseInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '')
            item['location'] = i.xpath('.//div[@class="flood"]/div[@class="positionInfo"]/a/text()').extract_first().strip()
            # Replace '/' with '|' as the separator between sub-fields
            item['followinfo'] = i.xpath('.//div[@class="followInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
            item['tag'] = i.xpath('.//div[@class="tag"]//text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
            item['totalprice'] = i.xpath('.//div[@class="totalPrice"]/span/text()').extract_first()
            item['unitprice'] = i.xpath('.//div[@class="unitPrice"]//@data-price').extract_first()
            yield item
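The `extract()[1]` calls in `parse` raise `IndexError` whenever a card lacks the expected second text node. One possible guard is a small cleaning helper. This is a sketch, not part of the original spider: `clean_text` and its parameters are hypothetical names, the sample strings are invented, and unlike the original `replace` chain it removes every kind of whitespace, tabs included.

```python
def clean_text(parts, index=0, slash_to_pipe=False, default=None):
    """Return parts[index] with all whitespace removed, or default if it is missing."""
    if not 0 <= index < len(parts):
        return default
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    cleaned = "".join(parts[index].split())
    if slash_to_pipe:
        cleaned = cleaned.replace("/", "|")
    return cleaned


# Made-up text nodes, shaped like what .extract() returns for a followInfo div.
raw = ["\n", "\n  12人关注 / 3天前发布\n  "]
print(clean_text(raw, 1, slash_to_pipe=True))  # 12人关注|3天前发布
print(clean_text(raw, 5))                      # None
```

Inside the spider this could be called as `clean_text(i.xpath('.//div[@class="followInfo"]/text()').extract(), 1, slash_to_pipe=True)`, so a malformed card yields `None` instead of aborting the page.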
(3) middlewares.py
import random


# Attach a randomly chosen User-Agent to every request
class ItemDownloaderMiddleware(object):
    def __init__(self):
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

    def process_request(self, request, spider):
        # Pick a User-Agent at random for this request
        ug = random.choice(self.user_agent_list)
        request.headers['User-Agent'] = ug
        return None

    def process_response(self, request, response, spider):
        # Print the User-Agent that was actually sent, as a quick check
        print(request.headers['User-Agent'])
        return response
(4) pipelines.py
from scrapy.exporters import CsvItemExporter


# Persist the data by saving it in CSV format
class ItemPipeline(object):
    def open_spider(self, spider):
        # CsvItemExporter writes bytes, so the file is opened in binary mode
        self.file = open('beike.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
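(5) With everything in place, don't forget settings.py
It helps to enable each setting as soon as the matching piece of code is written, so that nothing is forgotten.
LOG_FILE = 'beike.log'
LOG_LEVEL = 'INFO'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DOWNLOADER_MIDDLEWARES = {
    'ITEM.middlewares.ItemDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'ITEM.pipelines.ItemPipeline': 300,
}
# Only takes effect when AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_MAX_DELAY = 60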
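`AUTOTHROTTLE_MAX_DELAY` is read by Scrapy's AutoThrottle extension, which is off by default, so on its own the setting above changes nothing. If adaptive throttling is wanted, a settings.py fragment along these lines should switch it on. The values are illustrative, not taken from the original project.

```python
# settings.py (fragment): enabling AutoThrottle alongside the fixed delay
AUTOTHROTTLE_ENABLED = True    # turn the AutoThrottle extension on
AUTOTHROTTLE_START_DELAY = 5   # initial delay in seconds (Scrapy's default)
AUTOTHROTTLE_MAX_DELAY = 60    # ceiling for the adaptive delay, as set above
DOWNLOAD_DELAY = 3             # AutoThrottle never goes below this value
```

With this in place the delay between requests adapts to the server's response latency, staying between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`.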
III. Running the spider
scrapy crawl beike
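IV. Viewing the data
The pipeline writes the scraped records to beike.csv in the directory where the crawl is run. (Screenshot in the original post: image.png)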
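The exported file can also be checked with Python's standard `csv` module. The sketch below is only an illustration: `load_listings` and `to_float` are hypothetical helper names, and the two sample rows are invented stand-ins for beike.csv rather than real scraped data.

```python
import csv
import io


def load_listings(csv_file):
    """Read exported rows into a list of dicts keyed by the CSV header."""
    return list(csv.DictReader(csv_file))


def to_float(value):
    """Convert a price string to float; return None for blanks or non-numbers."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Invented two-row sample standing in for beike.csv.
sample = io.StringIO(
    "detailinfo,totalprice,unitprice\n"
    "Sample listing A,150,15000\n"
    "Sample listing B,,\n"
)
rows = load_listings(sample)
prices = [to_float(row["totalprice"]) for row in rows]
print(len(rows), prices)
```

For the real file, pass `open('beike.csv', newline='', encoding='utf-8')` instead of the sample, assuming the exporter was left at its default UTF-8 encoding.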