Check-in: 1-3 Scraping real web data

2016-07-19, 早禾

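【Data to scrape】

❤ From the list pages, scrape the URL of each detail page
❤ From each detail page, scrape the listing's details:
title, address, price, first image, host name, host gender, host avatar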

(Screenshot: list page)

(Screenshot: detail page)

【Result (written to a TXT file)】

【Code】

from bs4 import BeautifulSoup
import requests
import time

# 13 pages of listings in total
list_urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 14)]

# Map the host's CSS class list (as returned by get('class')) to a gender
def sexual(sex):
    if sex == ['member_ico']:
        return 'male'
    elif sex == ['member_ico1']:
        return 'female'
    else:
        return 'unknown'
    
# Fetch one list page and return its 24 detail-page URLs as a list
def getList(list_url):
    respond = requests.get(list_url)
    wb_data = BeautifulSoup(respond.text, 'lxml')
    get_urls = wb_data.select('#page_list > ul > li > a')
    urls = []
    for a_url in get_urls:
        url = a_url.get('href')
        urls.append(url)
    return urls

# Fetch one detail page and return its fields as a dict
def getDetail(url):
    respond = requests.get(url)
    wb_data = BeautifulSoup(respond.text, 'lxml')
    titles = wb_data.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em')
    positions = wb_data.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span.pr5')
    prices = wb_data.select('#pricePart > div.day_l > span')
    images = wb_data.select('#detailImageBox > div.pho_show_l > div > div:nth-of-type(2) > img')
    names = wb_data.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    host_images = wb_data.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    sexs = wb_data.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')

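    # Each detail page holds a single listing, so return on the first match
    for title, position, price, image, name, host_image, sex in zip(
            titles, positions, prices, images, names, host_images, sexs):
        detail = {
            'title': title.get_text(),
            # strip() removes the newline and padding around the address
            'position': position.get_text().strip(),
            'price': price.get_text(),
            'image': image.get('src'),
            'name': name.get_text(),
            'host_image': host_image.get('src'),
            # get('class') returns a list such as ['member_ico']
            'sex': sexual(sex.get('class'))
        }
        return detail
    # Nothing matched: the page layout may have changed
    return None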

# List that will hold the 300 listing URLs
urls = []

# Each list page yields 24 URLs; keep appending until the list holds 300
for list_url in list_urls:
    time.sleep(2)
    urls1 = getList(list_url)
    for urls2 in urls1:
        if len(urls) < 300:
            urls.append(urls2)

    print(len(urls))

# Print the 300 URLs
print(urls)

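# Open the output file in append mode (UTF-8 so Chinese text is written safely)
path = 'C:/Users/Administrator/DeskTop/sores.txt'
file = open(path, 'a', encoding='utf-8')

# Running index of the current listing
count = 0
# Scrape and write out the details of the 300 listings
for url in urls:
    count = count + 1
    if count % 3 == 0:  # progress: with 300 listings, every 3 is 1%
        print(count // 3, '%')

    file.write('\n\ncount = ' + str(count) + '\n')
    time.sleep(2)
    file.write(str(getDetail(url)))

file.close()
print('done')
# End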
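
The write step above stores each record with str(), which is easy to write but awkward to read back later. As a rough sketch of one alternative (not what the original script does), each dict could be appended as one JSON line. The helper names append_record and read_records, the file name, and the sample record are made up for illustration.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one dict to the file as a single JSON line (hypothetical helper)."""
    with open(path, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese titles readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample record, only to show the round trip
sample = {'title': 'demo listing', 'price': '298', 'sex': 'female'}
demo_path = os.path.join(tempfile.mkdtemp(), 'listings_demo.jsonl')

append_record(demo_path, sample)
print(read_records(demo_path))
```

With this layout, the per-record "count = N" header is no longer needed, since the line number in the file already gives the position of each listing.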

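【Debug diary~】

The gender obtained through select and get('class') is actually a list. At first I compared it against a string, so the check always fell through to else. Also, the trailing 1 in ico1 is the digit 1, not the letter l.
The image carousel is driven by JS, so the image's selector cannot be copied directly, but appending > img to the parent element's selector gets the URL of the first image.
At first I could not get the avatar URL and wondered what was wrong. Switching to a mobile header did not help either. It turned out I had simply forgotten to call select (how dumb...).
Using sleep to space out requests does protect against being blocked, but it is really slow. I need to learn a better approach.
A fairly basic point: write only accepts strings, so the dict has to be converted to a string first (str() is so convenient. Life is short, you need Python).
The character in the middle of User-Agent is a hyphen, not an underscore!
When looking for a unique feature (hardly needed this time, since every element was already unique), generally use . to target a class and [] to target an attribute.
Happily finished! (A small tip: if you follow the tutorial and still cannot scrape the information you need, check whether the page content in respond has changed.)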
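
The first note above (the class value is a list, not a string) can be reproduced without any network access. Below is a minimal sketch, assuming the same two class names used in the script; host_gender is a hypothetical variant of sexual() that tests membership instead of comparing the whole list, so it would also tolerate extra classes on the element.

```python
def host_gender(classes):
    """Map a list of CSS class names to a gender label.

    `classes` is what tag.get('class') returns in BeautifulSoup:
    a list of strings, or None when the attribute is missing.
    """
    classes = classes or []
    if 'member_ico1' in classes:   # ends with the digit 1, not the letter l
        return 'female'
    if 'member_ico' in classes:
        return 'male'
    return 'unknown'


# Comparing the list against a plain string is always False,
# which is why the first attempt always reached the else branch
print(['member_ico'] == 'member_ico')

print(host_gender(['member_ico']))
print(host_gender(['member_ico1']))
print(host_gender(None))
```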
