Muggle Coding (麻瓜编程) · Python in Practice · Assignment 1-2: Scraping Product Information

2016-08-09 bbjoe

My result:

[Image: Paste_Image.png]

My code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# List that will hold the star rating of each product
rates = []

# Parse the page, then collect the images (image_urls), prices (product_prices),
# titles (product_titles), review counts (comment_numbs) and star ratings (product_rates)
with open('/Users/Administrator/Desktop/PycharmProjects/OReillyWebScraping/小白\\html/1_2answer_of_homework/index.html', 'r') \
        as html_data:
    soup = BeautifulSoup(html_data, 'lxml')
    image_urls = soup.select(
        'body > div > div > div.col-md-9 > div:nth-of-type(2) > div > div > img')
    product_prices = soup.select(
        'body > div > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.caption > h4.pull-right')
    product_titles = soup.select(
        'body > div > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.caption > h4:nth-of-type(2) > a')
    comment_numbs = soup.select(
        'body > div > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.ratings > p.pull-right')
    # Each product has a one-to-many relationship with its stars,
    # so fetch the parent element here instead of the stars themselves
    product_rates = soup.find_all('div', class_='ratings')

# Handle the star ratings separately and collect the results in rates
for i in product_rates:
    # 'star' appears once per star (filled or empty), 'empty' once per empty star,
    # so the difference is the number of filled stars
    star = str(i).count('star')
    empty = str(i).count('empty')
    rates.append(star - empty)

# Use zip() to build one dict per product as the result
for image, price, title, comment, rate in zip(
        image_urls, product_prices, product_titles, comment_numbs, rates):

    data = {
        'image': image.get('src'),
        'price': price.get_text(),
        'title': title.get_text(),
        'comment': comment.get_text().replace(' reviews', ''),
        'rate': rate
    }
    print(data)
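
To see why star - empty gives the number of filled stars, here is a small sketch that runs the same count() arithmetic on a made-up ratings block. The markup is an assumption (Bootstrap-style glyphicon spans, not copied from the homework page), and build_ratings_html and count_filled_stars are hypothetical helper names used only for this illustration.

```python
# Assumed markup for the star icons (not copied from the homework page)
FILLED = '<span class="glyphicon glyphicon-star"></span>'
EMPTY = '<span class="glyphicon glyphicon-star-empty"></span>'


def build_ratings_html(filled, empty, reviews):
    """Build a made-up ratings block with the given number of stars."""
    return ('<div class="ratings"><p class="pull-right">{} reviews</p>'
            '<p>{}{}</p></div>').format(reviews, FILLED * filled, EMPTY * empty)


def count_filled_stars(html):
    """Same arithmetic as the loop above, applied to one HTML string."""
    total = html.count('star')    # one hit per star span, filled or empty
    hollow = html.count('empty')  # one hit per empty star span only
    return total - hollow


sample = build_ratings_html(filled=4, empty=1, reviews=65)
print(count_filled_stars(sample))  # prints 4
```

The trick depends on the words "star" and "empty" appearing nowhere else inside the ratings block. If the block also held text such as a product name containing "star", the count would be off.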

My takeaways:

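I got stuck for a while at the find_all step for the star ratings. I kept looking for a method that would return the number directly, but that was the wrong approach. Things clicked once I thought of "one-to-many → go to the parent": take each element from the returned list, convert it to a string, and then work on the string, for example with count(). That solved the problem.
This assignment took me 1 hour and 45 minutes.
I am gradually coming to appreciate what "learn by doing" means. The learning process is full of small details, and working through them yourself not only helps you remember them but also slowly gives you a systematic way of approaching problems.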
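
On the wish for something that returns the number directly: one possible alternative, not what I used above, is to select only the filled-star spans inside each parent and take the length of the result. This is a minimal sketch on made-up markup. The glyphicon class names are an assumption about the page, ratings_block is a hypothetical helper, and 'html.parser' is used here only so the snippet does not need lxml.

```python
from bs4 import BeautifulSoup

# Assumed markup for the star icons (not copied from the homework page)
FILLED = '<span class="glyphicon glyphicon-star"></span>'
EMPTY = '<span class="glyphicon glyphicon-star-empty"></span>'


def ratings_block(filled, reviews):
    """Build a made-up five-star ratings block with `filled` filled stars."""
    stars = FILLED * filled + EMPTY * (5 - filled)
    return ('<div class="ratings"><p class="pull-right">{} reviews</p>'
            '<p>{}</p></div>').format(reviews, stars)


sample_html = ratings_block(4, 65) + ratings_block(2, 12)
soup = BeautifulSoup(sample_html, 'html.parser')

# A class selector matches whole class names, so 'span.glyphicon-star'
# does not match spans whose class is 'glyphicon-star-empty'
rates = [len(div.select('span.glyphicon-star'))
         for div in soup.select('div.ratings')]
print(rates)  # prints [4, 2]
```

This still follows the "go to the parent" idea: the parent div is selected first, and the stars are counted inside it.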
