Beike Wuhan Second-Hand Housing Data Analysis: Data Collection

2019-11-17 · 一半芒果
Approach:

1. Beike's Wuhan second-hand housing list page: https://wh.ke.com/ershoufang/
2. Use the Scrapy framework to crawl all 100 list pages, 30 listings per page.
3. Extract the title, property info, house tags, total price, unit price, floor, construction year, layout, orientation, publish time, follower count, and related fields.
4. Parse the page data with XPath.
5. Save the results as a CSV table.

1. Preparation

 scrapy startproject ITEM
 scrapy genspider beike ke.com

Note that the project name must match the package name used in the imports and settings below (ITEM.items, ITEM.middlewares, ITEM.pipelines), which is why the project is created as ITEM.
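
For reference, these two commands generate a skeleton like the following (the file names are fixed by Scrapy; spiders/beike.py comes from genspider, the rest from startproject):

 ITEM/
 ├── scrapy.cfg
 └── ITEM/
     ├── __init__.py
     ├── items.py
     ├── middlewares.py
     ├── pipelines.py
     ├── settings.py
     └── spiders/
         ├── __init__.py
         └── beike.py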
2. Building the framework

(1) items.py / define the Item

import scrapy

class ItemItem(scrapy.Item):
    detailinfo = scrapy.Field()   # listing title
    info = scrapy.Field()         # house info: layout, area, orientation, floor, year built
    location = scrapy.Field()     # complex / neighborhood name
    followinfo = scrapy.Field()   # follower count and publish time
    tag = scrapy.Field()          # listing tags
    totalprice = scrapy.Field()   # total price (10k CNY)
    unitprice = scrapy.Field()    # unit price (CNY per square meter)
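
Scrapy Items behave like dicts, which is how the spider below fills them in; a quick interactive check (with a hypothetical value):

item = ItemItem()
item['totalprice'] = '156'       # assign fields like dict keys
print(dict(item))                # {'totalprice': '156'}
print(list(ItemItem.fields))     # names of all declared fields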

(2) spiders/beike.py

import scrapy
from ITEM.items import ItemItem

class BeikeSpider(scrapy.Spider):
    name = 'beike'
    allowed_domains = ['ke.com']
    baseurl = 'https://wh.ke.com/ershoufang/pg{}/'   # pagination path uses lowercase 'pg'
    # build the 100 list-page URLs up front
    start_urls = []
    for i in range(1, 101):
        start_urls.append(baseurl.format(i))

    def parse(self, response):
        # one node per listing card on the list page
        room_list = response.xpath('//*[@id="beike"]//ul[@class="sellListContent"]//div[@class="info clear"]')
        # print('count:', len(room_list))  # should be 30 per page
        for i in room_list:
            item = ItemItem()
            # listing title
            item['detailinfo'] = i.xpath('.//div[@class="title"]/a/text()').extract()[0]
            # house info (layout / area / orientation / ...); [1] skips a leading whitespace text node
            item['info'] = i.xpath('.//div[@class="houseInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '')
            # complex / neighborhood name
            item['location'] = i.xpath('.//div[@class="flood"]/div[@class="positionInfo"]/a/text()').extract_first().strip()
            # follower count and publish time, with '/' normalized to '|'
            item['followinfo'] = i.xpath('.//div[@class="followInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
            # listing tags
            item['tag'] = i.xpath('.//div[@class="tag"]//text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
            # total price (10k CNY)
            item['totalprice'] = i.xpath('.//div[@class="totalPrice"]/span/text()').extract_first()
            # unit price, taken from the data-price attribute
            item['unitprice'] = i.xpath('.//div[@class="unitPrice"]//@data-price').extract_first()
            yield item
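
Precomputing all 100 URLs in start_urls works, but Scrapy also lets a spider generate requests lazily via start_requests, which avoids building the list at class-definition time; a minimal sketch of the same pagination:

    def start_requests(self):
        # one request per list page, generated on demand
        for i in range(1, 101):
            yield scrapy.Request(self.baseurl.format(i), callback=self.parse)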

(3) middlewares.py

import random
# Rotate the User-Agent header on each outgoing request
class ItemDownloaderMiddleware(object):
    def __init__(self):
        self.user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]

    def process_request(self, request, spider):
        # attach a random User-Agent to every outgoing request;
        # returning None lets the request continue through the middleware chain
        ua = random.choice(self.user_agent_list)
        request.headers['User-Agent'] = ua
        return None

    def process_response(self, request, response, spider):
        # debug aid: show which User-Agent was actually sent
        print(request.headers['User-Agent'])
        return response
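
Hard-coding the list in __init__ is fine for a small project; if you would rather keep it in settings.py, the standard from_crawler hook can read it from a custom setting. A sketch, assuming you add a USER_AGENT_LIST setting yourself (it is not built into Scrapy):

import random

class SettingsUserAgentMiddleware(object):
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom setting you would define in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None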

(4) pipelines.py

from scrapy.exporters import CsvItemExporter
# Persist scraped items as CSV
class ItemPipeline(object):
    def open_spider(self, spider):
        # CsvItemExporter requires a binary-mode file handle
        self.file = open('beike.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
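
By default CsvItemExporter writes the columns in no guaranteed order; if a fixed column order matters, the exporter accepts a fields_to_export argument. A sketch reusing the field names from items.py, as a drop-in replacement for the exporter line in open_spider:

        self.exporter = CsvItemExporter(
            self.file,
            fields_to_export=['detailinfo', 'info', 'location',
                              'followinfo', 'tag', 'totalprice', 'unitprice'],
        )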

(5) settings.py: everything is in place, but don't forget the settings
A good habit is to enable each setting as soon as you write the corresponding code, so nothing gets forgotten:

LOG_FILE = 'beike.log'     # write the crawl log to a file
LOG_LEVEL = 'INFO'
ROBOTSTXT_OBEY = False     # do not check robots.txt before crawling
DOWNLOAD_DELAY = 3         # wait 3 seconds between requests to avoid being blocked
DOWNLOADER_MIDDLEWARES = {
   'ITEM.middlewares.ItemDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'ITEM.pipelines.ItemPipeline': 300,
}
AUTOTHROTTLE_ENABLED = True    # AUTOTHROTTLE_MAX_DELAY only takes effect when this is on
AUTOTHROTTLE_MAX_DELAY = 60
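
For a crawl this simple, Scrapy's built-in feed exports could replace the custom pipeline entirely; a sketch using the FEED_* settings available in Scrapy versions of that era:

# settings.py: built-in CSV export instead of ITEM_PIPELINES
FEED_URI = 'beike.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_ENCODING = 'utf-8-sig'   # BOM helps Excel display Chinese text

The same can be done ad hoc on the command line with scrapy crawl beike -o beike.csv.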

3. Running the spider

scrapy crawl beike

4. Inspecting the data


(The original post shows a screenshot of the exported beike.csv here.)
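
A quick way to sanity-check the export is to load it with pandas (a sketch; pandas is not otherwise part of this project):

import pandas as pd

df = pd.read_csv('beike.csv')
print(df.shape)   # expect roughly (3000, 7): 100 pages x 30 listings
print(df.head())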