Check-in: 1-2 Scraping information from my own web page

2016-07-18  早禾
Source of the information to scrape

Scraped results

image : img/pic_0000_073a9256d9624c92a05dc680fc28865f.jpg
price : $24.99
view : 65 reviews
describe : See more snippets like this online store item at web store 
score : 5
title : EarPod


image : img/pic_0005_828148335519990171_c234285520ff.jpg
price : $64.99
view : 12 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : New Pocket


image : img/pic_0006_949802399717918904_339a16e02268.jpg
price : $74.99
view : 31 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : New sunglasses


image : img/pic_0008_975641865984412951_ade7a767cfc8.jpg
price : $84.99
view : 6 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 3
title : Art Cup


image : img/pic_0001_160243060888837960_1c3bcd26f5fe.jpg
price : $94.99
view : 18 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : iphone gamepad


image : img/pic_0002_556261037783915561_bf22b24b9e4e.jpg
price : $214.5
view : 18 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : Best Bed


image : img/pic_0011_1032030741401174813_4e43d182fce7.jpg
price : $500
view : 35 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : iWatch


image : img/pic_0010_1027323963916688311_09cc2d7648d9.jpg
price : $15.5
view : 8 reviews
describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
score : 4
title : Park tickets

Source code

from bs4 import BeautifulSoup

# Parse the local copy of the page and collect each field with a CSS selector.
with open('./index.html', 'r') as wbdata:
    soup = BeautifulSoup(wbdata, 'lxml')
    images = soup.select('div > div.col-md-9 > div > div > div > img')
    titles = soup.select('div.caption > h4:nth-of-type(2) > a')
    prices = soup.select('div.caption > h4.pull-right')
    describes = soup.select('div.caption > p')
    views = soup.select('div.ratings > p.pull-right')
    scores = soup.select('div > div.ratings > p:nth-of-type(2)')

# Build one dict per product; zip() stops at the shortest of the six lists.
info = []
for title, image, price, describe, view, score in zip(titles, images, prices, describes, views, scores):
    data = {
        'title': title.get_text(),
        'image': image.get('src'),
        'price': price.get_text(),
        'describe': describe.get_text(),
        'view': view.get_text(),
        # The score is the number of filled-star spans in the rating paragraph.
        'score': len(score.find_all('span', 'glyphicon glyphicon-star'))
    }
    info.append(data)

# Print every field of every product, with a gap between products.
for i in info:
    for a in i:
        print(a, ':', i[a])
    print('\n')
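To see how the image and score fields are extracted, here is a minimal sketch that runs without index.html. The markup in SAMPLE_HTML is hypothetical, written only to match the selectors used above (the real page is not shown in this post), and it uses the built-in 'html.parser' instead of lxml so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors above; not the real index.html.
SAMPLE_HTML = """
<div class="thumbnail">
  <img src="img/sample.jpg" alt="">
  <div class="ratings">
    <p class="pull-right">12 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# get() reads an attribute value from a single tag.
image = soup.select('div.thumbnail > img')[0]
image_src = image.get('src')

# find_all() searches inside a tag; a string with a space is compared
# against the whole class attribute, so the "-empty" stars are skipped.
stars_p = soup.select('div.ratings > p:nth-of-type(2)')[0]
score = len(stars_p.find_all('span', 'glyphicon glyphicon-star'))

# The review count is plain text inside the first paragraph.
view = soup.select('div.ratings > p.pull-right')[0].get_text()

print('image :', image_src)
print('view :', view)
print('score :', score)
```

Here the score is 3 because only three spans carry exactly the class string "glyphicon glyphicon-star".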

Notes

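1. The Beautiful Soup version I used does not support the nth-child syntax in select(), so it has to be replaced with nth-of-type (or that part of the selector can simply be dropped). Newer releases, which hand CSS selection to the soupsieve package, do accept nth-child.
2. In soup.select(), avoid pasting the full selector path where possible; a short selector anchored on a distinctive class is easier to read and less likely to break.
3. Learn to look things up in the collection of common mistakes and in the documentation by myself.
4. Read the debug messages patiently.
5. To read an attribute of a tag, use get(), for example image.get('src'). find_all() does a different job: it finds the tags inside an element that match a name and class, which is how the stars are counted for the score.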
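Notes 1 and 2 can be tried out on a small piece of markup. This is only a sketch: SAMPLE_HTML is hypothetical, shaped like the caption block the selectors above expect, and 'html.parser' is used in place of lxml to keep it dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like one product caption; not the real page.
SAMPLE_HTML = """
<div class="container">
  <div class="row">
    <div class="col-md-9">
      <div class="thumbnail">
        <div class="caption">
          <h4 class="pull-right">$24.99</h4>
          <h4><a href="#">EarPod</a></h4>
          <p>Short description.</p>
        </div>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Full path, the kind a browser's "copy selector" produces.
long_selector = ('div.container > div.row > div.col-md-9 > '
                 'div.thumbnail > div.caption > h4:nth-of-type(2) > a')
# Short selector anchored on the distinctive class only.
short_selector = 'div.caption > h4:nth-of-type(2) > a'

long_titles = [a.get_text() for a in soup.select(long_selector)]
short_titles = [a.get_text() for a in soup.select(short_selector)]

# The price is the <h4> that carries the pull-right class.
price = soup.select('div.caption > h4.pull-right')[0].get_text()

print(long_titles, short_titles, price)
```

Both selectors find the same link, but the short one keeps working if a wrapper div is added or removed further up the page.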
