Python Hands-on Plan, Week 2: First Project

2016-05-16  kaurala

Scrape the first three pages of short-term rental listings, store the data in MongoDB, and print the listings priced at 500 yuan or more.
Code:

import requests
from bs4 import BeautifulSoup
import pymongo

client = pymongo.MongoClient('localhost')
duanzufang = client['dzf']
listings = duanzufang['listings']  # renamed to avoid shadowing the built-in list

# Build the URLs for the first three result pages
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'}

for url in urls:
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')

    titles = soup.find_all('span', {'class': 'result_title hiddenTxt'})
    infos = soup.find_all('em', {'class': 'hiddenTxt'})
    prices = soup.find_all('span', {'class': 'result_price'})

    for title, info, price in zip(titles, infos, prices):
        # The info text has the form "type - comment count - address";
        # strip whitespace once and split it instead of re-parsing per field
        parts = info.get_text().replace('\n', '').replace(' ', '').split('-')
        data = {
            'title': title.get_text(),
            'type': parts[0],
            'comment_num': parts[1],
            'address': parts[2],
            'price': int(price.i.get_text()),
        }
        listings.insert_one(data)

# Query MongoDB for listings at 500 yuan or more
for item in listings.find({'price': {'$gte': 500}}):
    print(item)

Takeaways:
1. Understood the structure of the page
2. Learned how to use the find-family functions by reading the BS4 docs
3. Learned how to create a database and insert data into it
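As a quick illustration of the find-family functions mentioned in point 2, here is a minimal sketch run against a hand-written HTML snippet (the snippet itself is made up; the class names mirror the ones used in the scraper above):

```python
from bs4 import BeautifulSoup

# A tiny hand-written snippet standing in for a real listing page
html = '''
<div>
  <span class="result_title hiddenTxt">Cozy room near subway</span>
  <span class="result_price"><i>468</i></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag (or None); find_all() returns a list.
# For multi-valued class attributes, matching any one class is enough.
title = soup.find('span', {'class': 'result_title'})
prices = soup.find_all('span', {'class': 'result_price'})

print(title.get_text())              # Cozy room near subway
print(int(prices[0].i.get_text()))   # 468
```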

Question:
What if I want to scrape not just the first three pages but all of them? Looking at the page numbers at the bottom, they change dynamically. Teacher, how should this case be handled?
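One common approach is to stop hard-coding the page count and instead keep incrementing the page number until a page comes back empty. This is only a sketch, not tested against xiaozhu.com: the stop condition (an empty result list meaning we ran past the last page) is an assumption, and `fetch_titles` is a hypothetical callback wrapping the requests + find_all logic from the main code.

```python
import itertools

def page_url(i):
    # Same URL pattern as in the scraper above
    return 'http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i)

def crawl_all(fetch_titles):
    """fetch_titles(url) -> list of titles on that page (e.g. requests + bs4).

    Yields titles page by page, stopping at the first empty page.
    """
    for i in itertools.count(1):
        titles = fetch_titles(page_url(i))
        if not titles:   # assumption: past the last page, results are empty
            break
        for t in titles:
            yield t
```

In real use, `fetch_titles` would do the `requests.get` plus `soup.find_all` from the main code, and a short `time.sleep` between requests keeps the crawl polite.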
