Python实战计划 (Python in Practice) Study Notes 1.2: Parsing a Web Page

2016-12-17  纷飞清扬

Parse a local web page and extract each item's title, image URL, price, review count, and star rating.
The page looks like this:

作业1.2.png

Code

from bs4 import BeautifulSoup

# Raw string, so the backslashes in the Windows path are not read as escape sequences
with open(r'D:\宣宣\homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the page content
    images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
    titles = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
    prices = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
    reviews = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
    stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')

    # print(images, titles, prices, reviews, stars, sep='\n--------------\n')

    for title, image, price, review, star in zip(titles, images, prices, reviews, stars):
        data = {
            'title': title.get_text(),  # extract the text
            'image': image.get('src'),  # the src attribute holds the image URL
            'price': price.get_text(),
            'review': review.get_text(),
            # count the filled-star spans inside the ratings paragraph
            'star': len(star.find_all('span', class_='glyphicon glyphicon-star'))
        }
        print(data)
# Selectors as copied from the browser (Copy CSS selector), kept for reference:
# body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
# body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
# body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3)
# body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p.pull-right
# body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right

Output

122.png
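
The selectors passed to soup.select() in the code are the copied browser selectors with the positional parts (div:nth-child(1) and so on) removed. A copied selector points at one product, while the shortened one matches every product on the page. Below is a minimal sketch of that difference. The HTML fragment and the names html, one and every are made up for illustration and are not taken from the homework page, and :nth-of-type stands in for the positional filter because older BeautifulSoup releases may reject :nth-child in select().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the product grid (not the real page).
html = """
<div class="row">
  <div class="item"><div class="caption"><h4><a href="#">Product A</a></h4></div></div>
  <div class="item"><div class="caption"><h4><a href="#">Product B</a></h4></div></div>
</div>
"""

# html.parser is used here only so the sketch needs no extra parser package.
soup = BeautifulSoup(html, 'html.parser')

# Positional selector: matches the link in the first item only.
one = soup.select('div.row > div:nth-of-type(1) > div.caption > h4 > a')

# Same path without the position: matches the link in every item.
every = soup.select('div.row > div > div.caption > h4 > a')

print([a.get_text() for a in one])
print([a.get_text() for a in every])
```

With this fragment the first list holds only Product A and the second holds both products, which is why the shortened selectors can be zipped together item by item.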

Summary

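1. To scrape information from a web page with Python, you first need a basic understanding of the page, including how to find the HTML behind a given image or piece of text in the browser. The useful data can then be extracted with the selector obtained from Copy CSS selector.
2. For the star rating I used stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)'). The copied CSS selector was body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3). At first I left the trailing span:nth-child(3) in place and got star = 0, because the selector then returned a single span and find_all only searches inside the selected tag, which contains no further span tags. To count all the stars, the selector has to stop at the parent tag, p:nth-child(2). nth-child itself caused an error in select() (at least with the BeautifulSoup version used here), so it should be written as nth-of-type(2), which matches every element that is the second child of its type under its parent.
3. Making mistakes repeatedly, checking them against the reference answer, and reading the documentation deepened my understanding of the code. Getting the code to run in the end was a real pleasure, and it keeps me motivated to learn.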
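
The star-counting issue in point 2 can be reproduced on a small fragment. This is only a sketch: the ratings markup below is a made-up stand-in that follows the glyphicon class names used in the code, and count_stars, parent and single are hypothetical names introduced for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: 3 filled stars and 2 empty ones.
html = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')


def count_stars(tag):
    # find_all searches only the descendants of `tag`; the class_ string is
    # compared with the full class attribute, so empty stars are not counted.
    return len(tag.find_all('span', class_='glyphicon glyphicon-star'))


# Selector that stops at the parent <p>: every star span is a descendant.
parent = soup.select('div.ratings > p:nth-of-type(2)')[0]

# Selector that goes one level too deep: a single span with nothing inside.
single = soup.select('div.ratings > p:nth-of-type(2) > span:nth-of-type(3)')[0]

print(count_stars(parent))
print(count_stars(single))
```

Stopping at the parent gives the number of filled stars, while the over-specific selector gives zero, which matches the star = 0 result described above.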
