
Scraping a batch of 5i5j (我爱我家) rental listings with Python

2018-12-21  不存在的一角

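Requirements

I'm about to graduate and have started looking for an internship, so I need a place to rent. This is my first time apartment hunting, and I don't really know the local rental market. Should I rent a single room with a private bathroom, or a shared room? Roughly how much is the rent? Which areas are more expensive?

So I decided to scrape the rental listings for one region from 5i5j (我爱我家) and run some data analysis on them. (This post only covers the scraping.)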

https://sh.5i5j.com/zufang/

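The region I chose to crawl is Shanghai.

I. Detailed requirements

The site shows 13234 rental listings for Shanghai. For each listing we want the title, layout (bedrooms and living rooms), floor area, orientation, floor, address, publication time, tags, monthly rent, rental mode, district, and so on.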

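II. How the data is loaded

Take a look at the page source.

Almost everything we need is right there, which is very convenient.

A first look shows 30 listings per page, so there are roughly 440 pages of data (13234 / 30 ≈ 441).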
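
As a quick sanity check on the page count, here is a small sketch of the arithmetic. The two totals come from this post (13234 from the listing page, 13014 from the docstring in the script at the end); `page_count` is only an illustrative helper name, not something from the site or the script.

```python
import math

PER_PAGE = 30  # listings per page, from the initial analysis


def page_count(total: int, per_page: int = PER_PAGE) -> int:
    """Number of pages needed to show `total` listings, `per_page` at a time."""
    return math.ceil(total / per_page)


# 13234 is the count shown on the listing page; 13014 is the count
# recorded in the script's docstring.
for total in (13234, 13014):
    print(total, "listings ->", page_count(total), "pages")
```

The last page is usually only partly filled, which is why the count is rounded up rather than down.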

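For data of this size we would normally reach for Scrapy. It is built on Twisted, so its asynchronous requests make a crawler noticeably more efficient, and it is one of the most useful tools in crawler development.

For this job, however, Scrapy seems to come up short. With Scrapy we would normally set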

ROBOTSTXT_OBEY = False

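but here it has to be set to True before the main site can be accessed properly. That means obeying the robots protocol, which restricts what the crawler may fetch, and that is a disadvantage for a crawler.

That is not the worst of it. More annoying still, after crawling just a few pages you find that no more data comes back, whatever you try. Why?

Print the response body and you will find this sentence: 该请求已被网宿云WAF拦截...... ("This request has been blocked by the Wangsu Cloud WAF")

In short, your crawler has been detected and your IP has been banned.

Fine, so I switch to another IP? That works, but before long you get banned again......

In other words, we will still crawl with a script, but the main site's anti-crawling measures are fairly aggressive.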
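
To notice the ban early instead of silently collecting block pages, a crawler can check each response body for that notice. This is a minimal sketch: the marker string is the phrase quoted above, while `looks_blocked` is a hypothetical helper name, and the real block page may well contain other text.

```python
# Phrase quoted above from the block page returned by the WAF
WAF_MARKER = "网宿云WAF拦截"


def looks_blocked(body: str) -> bool:
    """Return True if the response body contains the WAF block notice."""
    return WAF_MARKER in body


blocked_page = "<html><body>该请求已被网宿云WAF拦截</body></html>"
normal_body = '{"houses": []}'
print(looks_blocked(blocked_page), looks_blocked(normal_body))
```

A crawl loop could stop, or back off for a while, as soon as `looks_blocked` returns True.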

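So what now, pick a different site? Of course not. What kind of crawler writer gives up at a small obstacle like this?

Here is the key point. Think it over: is there another way to get the data? As mentioned in an earlier post, Weibo has three sites you can pull data from, so might 5i5j have an m site or a wap site too?

There is no wap site, but luckily 5i5j does have an m site!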

https://m.5i5j.com/sh/zufang/

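After the initial excitement, the usual analysis shows that the m site also embeds the data in its page source, but scraping the page source still gets you banned. The safest route is the interface that loads data via ajax and returns JSON. The JSON contains everything we need, and more.

The URL is the same as the page URL, but to get JSON back the request must carry the right request headers; otherwise the response is still the HTML page source.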
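
A minimal sketch of such a request follows. The URL pattern and header values are taken from the script at the end of this post. To keep the sketch dependency-free it uses the standard library's urllib rather than requests, which is my substitution and not what the script does, and the helper names are illustrative. Whether this exact header set is enough for the server to answer with JSON is an assumption.

```python
import json
import urllib.request

# URL pattern used by the script at the end of this post
BASE_URL = "https://m.5i5j.com/sh/zufang/index-n{}"

# Header values copied from that script (cookie and accept-encoding left out)
AJAX_HEADERS = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-language": "zh-CN,zh;q=0.9",
    "referer": "https://m.5i5j.com/sh/zufang/index",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}


def page_url(page: int) -> str:
    """URL of the given result page."""
    return BASE_URL.format(page)


def build_request(page: int) -> urllib.request.Request:
    """Request object for one page, carrying the ajax headers."""
    return urllib.request.Request(page_url(page), headers=AJAX_HEADERS)


def fetch_page(page: int, timeout: float = 10.0) -> dict:
    """Fetch one page and decode it as JSON.

    json.loads raises ValueError if the server sent HTML instead of JSON.
    """
    with urllib.request.urlopen(build_request(page), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_page(1)` performs a real network request, so it is left out of the sketch itself.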

For convenience we can simply copy the entire set of request headers. On the response, response.json() gives access to every value in it.

{
            "_index": "shanghaiv1_shexchangehouse",
            "_type": "shexchangehouse",
            "_id": "sale_9_9_38153363_0",
            "_score": 5,
            "_source": {
                "qsdy": "产权清晰,无抵押",
                "heattypeid": null,
                "memo": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
                "location": [
                    121.253125,
                    31.10956
                ],
                "buildage": 7,
                "memo4": "产权清晰,无抵押",
                "housetype": "普通住宅",
                "housetypeid": 1,
                "sqname": "泗泾",
                "memo1": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
                "buildarea": 76,
                "pricechangetimelong": null,
                "buildingfloor": 1,
                "premisespermit": "",
                "tag": [
                    4,
                    8
                ],
                "loopline": null,
                "searchphrase": "一手动迁精装修两房边套有钥匙随时看房家具家电全送,泽悦路325弄1-30号,新凯家园四期茉莉雅苑,松江区,泗泾,songjiangqu,sijing",
                "uptime": 20181219203748864,
                "flag3d": 0,
                "house_quality": "优质房源",
                "unitprice": 26316,
                "contacttime": "随时看房",
                "houseallfloor": 14,
                "housetitle": "一手动迁精装修两房边套有钥匙随时看房家具家电全送",
                "downtime": null,
                "traffic": null,
                "floorPositionStr": "底层",
                "livingroom_cn": "一厅",
                "communityid": 325090,
                "firstuptime": 1521939038968,
                "tags": [
                    "jdbc_logstash_sale_sh"
                ],
                "jtcx": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
                "communityname": "新凯家园四期茉莉雅苑",
                "cityid": 9,
                "gptime": "2017-07-07",
                "memo3": "",
                "subwaystationids": [],
                "heading": "南",
                "x": 121.253125,
                "pre_price": 0,
                "pricechangeflag": 0,
                "y": 31.10956,
                "subway": null,
                "img3d": null,
                "decoratelevel": "精装",
                "headingid": 3,
                "bedroom_cn": "二室",
                "qyspell": "songjiangqu",
                "sqid": 40000067,
                "dkqk": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
                "isdeleted": 0,
                "istop": 0,
                "government_qr": "",
                "joins": 0,
                "rim": "191.45.1845路直达泗泾站",
                "price": 200,
                "hasimg": 1,
                "othertypeid": 1,
                "hxjs": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
                "checkintime": "",
                "toilet_cn": "一卫",
                "pricetrend": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
                "decorate_time": "",
                "cjdatestr": "2018-12-13 22:41:52",
                "housesid": 38153363,
                "floorPositionId": -1,
                "qyname": "松江区",
                "isnew": 0,
                "imgs": [
                    "https://image18.5i5j.com/erp/house/3815/38153363/shinei/mgdamgkfe0d02fed.jpg_P5.jpg",
                    "https://image17.5i5j.com/erp/house/3815/38153363/shinei/elgaegjpe0dcedce.jpg_P5.jpg",
                    "https://image17.5i5j.com/erp/house/3815/38153363/shinei/lnooanoke0de2c1d.jpg_P5.jpg",
                    "https://image18.5i5j.com/erp/house/3815/38153363/shinei/dogeeiane0de3926.jpg_P5.jpg",
                    "https://image17.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P5.jpg",
                    "https://image16.5i5j.com/erp/house/3815/38153363/shinei/nhnfhjnje0d03998.jpg_P5.jpg",
                    "https://image16.5i5j.com/erp/house/3815/38153363/shinei/cfapbkkce0d054c5.jpg_P5.jpg",
                    "https://image17.5i5j.com/erp/house/3815/38153363/shinei/idahicfae0d2837c.jpg_P5.jpg",
                    "https://image18.5i5j.com/erp/house/3815/38153363/shinei/pkkjjomne0d27ad9.jpg_P5.jpg",
                    "https://image16.5i5j.com/erp/house/3815/38153363/huxing/knhhpcmne0c95f57.jpg_P5.jpg"
                ],
                "hxmd": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
                "buildage_cn": "七年",
                "sqspell": "sijing",
                "livingroom": 1,
                "house_quality_id": 2,
                "bedroom": 2,
                "memo2": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
                "updown": 1,
                "parking": "",
                "sfjx": "",
                "bookin_time": "2017-07-07",
                "tagwall": [
                    "随时看",
                    "满二年"
                ],
                "memo5": "小区2012年交房,适合居住",
                "rightprop": "使用权房",
                "toilet": 1,
                "buildyear": 2012,
                "housesexchangescore": 17.9,
                "subwaystations": [],
                "cjflag": 0,
                "subwaylineids": [],
                "updatetimelong": 1545223070844,
                "floortypeid": null,
                "sectionname": "泽悦路325弄1-30号",
                "qyid": 73,
                "decoratelevelid": 3,
                "floortype": "",
                "esid": "sale_9_9_38153363_0",
                "hximg": "",
                "pricechangetime": null,
                "xqxx": "小区2012年交房,适合居住",
                "cjdate": 1544712112000,
                "citycode": null,
                "payment": "70.00",
                "zbpt": "191.45.1845路直达泗泾站",
                "heattype": "",
                "firstuptimestr": "2018-03-25 08:50:38",
                "updatetimestr": "2018-12-19 20:37:50",
                "government_code": "",
                "img3durl": null,
                "imgurl": "https://image16.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P7.jpg",
                "subwaylines": []
            }
        },
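
To show how a record like this can be flattened, here is a small sketch that picks a few fields out of `_source`. The field names and values are copied from the sample above; `summarize` and `ms_to_cst` are illustrative names, and reading `firstuptime` as epoch milliseconds in UTC+8 is an inference from the sample (it reproduces the `firstuptimestr` value), not documented behaviour of the interface.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, UTC+8


def ms_to_cst(ms: int) -> str:
    """Format an epoch-milliseconds value as 'YYYY-MM-DD HH:MM:SS' in UTC+8."""
    return datetime.fromtimestamp(ms / 1000, tz=CST).strftime("%Y-%m-%d %H:%M:%S")


def summarize(record: dict) -> dict:
    """Pick a few fields out of one element of the returned list."""
    src = record["_source"]
    return {
        "id": src["housesid"],
        "title": src["housetitle"],
        "layout": src["bedroom_cn"] + src["livingroom_cn"] + src["toilet_cn"],
        "district": src["qyname"],
        "tags": ",".join(src.get("tagwall") or []),
        "traffic": src.get("traffic") or "",  # null in the sample above
        "first_listed": ms_to_cst(src["firstuptime"]),
    }


# Trimmed copy of the sample record above
sample = {
    "_source": {
        "housesid": 38153363,
        "housetitle": "一手动迁精装修两房边套有钥匙随时看房家具家电全送",
        "bedroom_cn": "二室",
        "livingroom_cn": "一厅",
        "toilet_cn": "一卫",
        "qyname": "松江区",
        "tagwall": ["随时看", "满二年"],
        "traffic": None,
        "firstuptime": 1521939038968,
    }
}
print(summarize(sample))
```

Using `.get(...) or ""` for nullable fields avoids passing None further down the pipeline.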

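After that it is just a matter of saving everything to the database. Crawling through this interface, you do not have to worry about your IP being banned, but you should still throttle your requests so as not to put too much load on their server. If 5i5j later restricts this interface, it will become very hard to collect data in bulk.

III. Points to note when extracting the data

1. The crawl target is 5i5j's m site, and requests to the interface must carry the request headers

2. Throttle the requests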
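
The throttling point can be sketched as a loop that sleeps for a random interval after every page. The 1 to 3 second range and the names `polite_delay` and `crawl` are my assumptions for illustration; the script at the end of this post uses its own delay.

```python
import random
import time


def polite_delay(low: float = 1.0, high: float = 3.0) -> float:
    """Pick a random pause, in seconds, between `low` and `high`."""
    if low < 0 or high < low:
        raise ValueError("need 0 <= low <= high")
    return random.uniform(low, high)


def crawl(pages, fetch, sleep=time.sleep):
    """Call `fetch(page)` for each page, pausing after every request."""
    results = []
    for page in pages:
        results.append(fetch(page))
        sleep(polite_delay())
    return results
```

Passing `sleep` as a parameter makes it possible to try the loop with a stand-in function instead of actually waiting.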

IV. Results

The crawl collected all of 5i5j's rental listings for Shanghai, a little over 13,000 in total, ready for later data analysis.


Here is the code!

#!/usr/bin/python
# -*- coding:utf-8 -*-
# author:joel 18-6-5

import random
import time

import pymysql
import requests

# CREATE TABLE `wiwj_sh_zufang` (
#   `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
#   `house_id` char(16) NOT NULL,
#   `house_url` varchar(127) NOT NULL,
#   `house_jpg` varchar(512) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'cover image',
#   `house_title` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
#   `house_type` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
#   `house_buildarea` varchar(64) CHARACTER SET utf8mb4 NOT NULL,
#   `house_heading` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
#   `house_floor` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
#   `house_decoratelevel` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
#   `house_place` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
#   `house_firstuptime` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
#   `house_price` int(16) NOT NULL,
#   `house_renttype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
#   `house_paytype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
#   `house_area` varchar(32) CHARACTER SET utf8mb4 NOT NULL,
#   `house_tags` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
#   `house_subwaylines` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
#   `house_traffic` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
#   `house_quality` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
#   PRIMARY KEY (`id`),
#   KEY `houseid` (`house_id`) USING BTREE
# ) ENGINE=InnoDB AUTO_INCREMENT=2671 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;


class Wiwj(object):
    def __init__(self):
        """
        13014 listings at 30 per page: 434 pages in total
        """
        self.start_url = 'https://m.5i5j.com/sh/zufang/index-n{}'
        # Sending only 'x-requested-with' may not return JSON; simply send the full set of request headers
        self.headers = {
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cache-control': 'no-cache',
            'cookie': '',
            'pragma': 'no-cache',
            'referer': 'https://m.5i5j.com/sh/zufang/index',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }

    def gethouselist(self):
        """ 5i5j Shanghai rentals """
        for page in range(1, 435):
            print("-----------------------" + str(page) + "------------------")
            # A timeout keeps one stalled request from hanging the whole crawl
            r = requests.get(self.start_url.format(page), headers=self.headers, timeout=10)
            result = r.json()
            # Each element of 'houses' keeps the listing fields under '_source'
            for house in result['houses']:
                src = house['_source']
                house_id = src['housesid']
                house_url = 'https://m.5i5j.com/sh/zufang/{}.html'.format(house_id)
                house_jpg = src['imgurl']
                house_title = src['housetitle']
                house_type = src['bedroom_cn'] + src['livingroom_cn'] + src['toilet_cn']
                house_buildarea = src['area']
                house_heading = src['heading']
                house_floor = src['floorPositionStr'] + '/' + str(src['houseallfloor'])
                house_decoratelevel = src['decoratelevel']
                house_place = str(src['sqname']) + ' ' + str(src['communityname'])
                house_firstuptime = src['firstuptimestr']
                house_price = src['price']
                house_renttype = src['rentmodename']
                house_paytype = src['pay']
                house_area = src['qyname']
                house_tags = ','.join(src['tagwall'])
                house_subwaylines = ','.join(src['subwaylines'])
                house_traffic = src['traffic']
                house_quality = src['house_quality']
                self.insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                                 house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                                 house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
            # Pause 0 to 2 seconds between pages to limit the load on the server
            time.sleep(random.randint(0, 2))

    @staticmethod
    def insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                    house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                    house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality):
        # Fill in your own connection settings; 3306 is only MySQL's default port, used here as a placeholder
        conn = pymysql.connect(host='', port=3306, user='', password='', database='wiwj', charset='utf8mb4')
        cursor = conn.cursor()

        # %s placeholders let the driver escape the values, so a quote in a title cannot break the statement
        insert_sql = ("insert into `wiwj_sh_zufang` (`house_id`, `house_url`, `house_jpg`, `house_title`, "
                      "`house_type`, `house_buildarea`, `house_heading`, `house_floor`, `house_decoratelevel`, "
                      "`house_place`, `house_firstuptime`, `house_price`, `house_renttype`, `house_paytype`, "
                      "`house_area`, `house_tags`, `house_subwaylines`, `house_traffic`, `house_quality`) "
                      "values (" + ", ".join(["%s"] * 19) + ")")
        values = (house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                  house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                  house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
        # Most columns are NOT NULL, so store missing (null) values as empty strings
        params = tuple('' if v is None else v for v in values)
        select_sql = "select `house_id` from `wiwj_sh_zufang` where `house_id`=%s"

        try:
            # execute() returns the number of matching rows; any match means the listing is already stored
            if cursor.execute(select_sql, (house_id,)):
                print('Listing already exists...')
            else:
                try:
                    cursor.execute(insert_sql, params)
                    conn.commit()
                    print('Listing inserted...')
                except Exception as e:
                    print('Insert failed...', e)
                    conn.rollback()
        except Exception as e:
            print('Query failed...', e)
            conn.rollback()
        finally:
            cursor.close()
            conn.close()


if __name__ == '__main__':
    # Shanghai
    wiwj = Wiwj()
    wiwj.gethouselist()
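
As a side note on the "check whether it exists, then insert" step, the duplicate check can also be left to the database by making the listing id a unique key. The sketch below is not the script's method: it swaps MySQL for the standard library's sqlite3 so that it runs without a server, uses a cut-down three-column table, and all names in it are illustrative. MySQL has a comparable `INSERT IGNORE` statement, but the script above would need a UNIQUE index on `house_id` for that to apply.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a cut-down listings table keyed by house_id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zufang ("
        "house_id TEXT PRIMARY KEY, "
        "house_title TEXT NOT NULL, "
        "house_price INTEGER NOT NULL)"
    )
    return conn


def save_listing(conn: sqlite3.Connection, house_id, title: str, price: int) -> bool:
    """Insert one listing; return True if a row was added, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO zufang (house_id, house_title, house_price) VALUES (?, ?, ?)",
        (str(house_id), title, price),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_db()
print(save_listing(conn, 38153363, "it's a listing title", 200))
print(save_listing(conn, 38153363, "same id again", 200))
```

The `?` placeholders play the same role as parameterized queries in any other driver: values are passed separately from the SQL text, so quotes in a title are harmless.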
