Python Hands-On Project Study Notes 0629
2016-06-29
个十滴水
Day one of the hands-on project: I scraped a local web page.
The final result looks like this:
[Image: Paste_Image.png]
My code:
from bs4 import BeautifulSoup

info = []  # collects one dict per item on the page

# Parse the local HTML file with the lxml parser
with open('E:/PycharmProjects/homework2/homework2/1_2_homework_required/index.html','r') as data:
    Soup = BeautifulSoup(data,'lxml')

# Each selector returns a list with one element per item on the page
images = Soup.select('body > div > div > div.col-md-9 > div > div > div > img')
titles = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
prices = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
grades = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
counts = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
# print(images,titles,grades,prices,counts)

# Walk the five lists in parallel and build one record per item
for title,image,price,grade,count in zip(titles,images,prices,grades,counts):
    data1 = {
        'title' : title.get_text(),
        'image' : image.get('src'),
        'price' : price.get_text(),
        # the number of filled-star spans in this item's rating paragraph
        'grade' : len(grade.find_all("span" , class_ = "glyphicon glyphicon-star" )),
        'count' : count.get_text()
    }
    print(data1)
    info.append(data1)
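The script above needs the course's index.html on disk. As a rough, self-contained sketch of the same idea, the snippet below parses a small made-up HTML string instead. The markup, the `parse_items` helper and the `div.thumbnail` class are assumptions for illustration, not the real page. It also loops over each item card rather than zipping five parallel lists, which is a different technique from the script above; the intent is only that a missing field in one item cannot shift the other records out of alignment.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the page structure; not the real index.html.
HTML = """
<div class="col-md-9">
  <div class="thumbnail">
    <img src="img/pic_0000.jpg">
    <div class="caption">
      <h4 class="pull-right">$24.99</h4>
      <h4><a href="#">First Item</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">65 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
      </p>
    </div>
  </div>
  <div class="thumbnail">
    <img src="img/pic_0001.jpg">
    <div class="caption">
      <h4 class="pull-right">$64.99</h4>
      <h4><a href="#">Second Item</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">12 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
      </p>
    </div>
  </div>
</div>
"""


def parse_items(html):
    """Return one dict per item card (hypothetical helper)."""
    # html.parser ships with Python, so lxml is not required for this sketch
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("div.thumbnail"):
        items.append({
            "title": card.select_one("div.caption > h4 > a").get_text(strip=True),
            "image": card.select_one("img").get("src"),
            "price": card.select_one("div.caption > h4.pull-right").get_text(strip=True),
            # span.glyphicon-star matches filled stars only, not glyphicon-star-empty
            "grade": len(card.select("div.ratings span.glyphicon-star")),
            "count": card.select_one("div.ratings > p.pull-right").get_text(strip=True),
        })
    return items


if __name__ == "__main__":
    for item in parse_items(HTML):
        print(item)
```

Selecting inside each card keeps the selectors short, at the cost of assuming every item shares one wrapper class.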
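Summary
- BeautifulSoup can use three HTML parsers, lxml among them (the other two are html.parser and html5lib)
- :nth-child(1) > img pins the selector to one specific child node; to grab all matching elements, delete that part or change it to nth-of-type
- Steps: 1. parse the page with BeautifulSoup; 2. copy the CSS path (the format must be exact, especially the spaces); 3. filter out the information needed; 4. build a dict per item and grow the list with info.append(data1)
- () tuple, [] list, {} dict
- Difference between grade and grades: grades is the list of matched parent nodes, one per item on the page; grade is a single one of those parent nodes, and grade.find_all(...) returns the list of star elements under it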
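Two of the points above, the positional selector and the grade/grades distinction, can be checked on a tiny example. This is a minimal sketch, assuming an invented HTML string and a BeautifulSoup version recent enough for select() to accept :nth-child. The names `first_only`, `all_counts` and `stars_per_item` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two items, each with a review count and a row of stars.
HTML = """
<html><body><div class="col-md-9">
  <div class="ratings">
    <p class="pull-right">12 reviews</p>
    <p><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span></p>
  </div>
  <div class="ratings">
    <p class="pull-right">7 reviews</p>
    <p><span class="glyphicon glyphicon-star"></span></p>
  </div>
</div></body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A copied path that keeps :nth-child(1) stays pinned to the first item.
first_only = soup.select("div.col-md-9 > div:nth-child(1) > p.pull-right")

# Dropping the positional part matches the same element in every item.
all_counts = soup.select("div.col-md-9 > div > p.pull-right")

# grades is a list of <p> tags, one per item; grade is a single <p> tag,
# and find_all() returns the list of star spans inside it.
grades = soup.select("div.ratings > p:nth-of-type(2)")
stars_per_item = [
    len(grade.find_all("span", class_="glyphicon-star")) for grade in grades
]

print(len(first_only), len(all_counts), stars_per_item)
```

Here `first_only` should hold a single tag while `all_counts` holds one per item, and `stars_per_item` gives the star count of each `grade` in turn.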