Python Web Scraping 1: Using BeautifulSoup

2017-07-15 sunhaiyu

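These notes follow the book by Ryan Mitchell, Web Scraping with Python (Chinese edition: 《Python网络数据采集》). The examples are copied straight from the book, but I find that typing them out myself still helps, so I am recording them here.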

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>

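Fetching a page with urllib looks like the code below. read() returns bytes, which must be decoded to get text, e.g. a.read().decode('utf-8'). When parsing with bs4, however, you can pass the response object returned by urllib directly.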

import urllib.request

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>
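
The two ways of feeding a page to bs4 can be compared without touching the network. This is only a small sketch: the raw bytes below are a made-up stand-in for what a.read() would return, io.BytesIO plays the role of the urllib response object, and I use the built-in 'html.parser' so that lxml is not required.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes that a.read() would return.
raw = b"<html><body><h1>An Interesting Title</h1></body></html>"

# Option 1: decode the bytes yourself, then parse the resulting text.
soup_from_text = BeautifulSoup(raw.decode('utf-8'), 'html.parser')

# Option 2: hand over a file-like object; BeautifulSoup calls read() itself.
soup_from_file = BeautifulSoup(io.BytesIO(raw), 'html.parser')

title_from_text = soup_from_text.h1.string
title_from_file = soup_from_file.h1.string
print(title_from_text)
print(title_from_file)
```

Both variants end up with the same h1 text, so passing the response object directly just saves the explicit decode step.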

Grab every span tag whose CSS class attribute is green; these are people's names.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(res.text, 'lxml')
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
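
One detail worth knowing when printing name.string: .string returns None when a tag has more than one child, while get_text() joins all the text inside the tag. A minimal sketch on a made-up HTML fragment (the markup is my own, not the book's page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second green span contains a nested <b> tag.
html = """
<div>
  <span class="green">Anna Pavlovna</span>
  <span class="red">Well, Prince</span>
  <span class="green">Prince <b>Vasili</b></span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
green = soup.find_all('span', class_='green')

# .string only works when the tag has exactly one child.
strings = [tag.string for tag in green]
# get_text() concatenates every piece of text inside the tag.
texts = [tag.get_text() for tag in green]

print(strings)  # ['Anna Pavlovna', None]
print(texts)    # ['Anna Pavlovna', 'Prince Vasili']
```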

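A child and a descendant are not the same thing. A child tag is exactly one level below its parent tag, whereas the descendants include every tag at any depth beneath the parent. In short, every child is a descendant, but not every descendant is a child.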

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
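
To make the child/descendant difference concrete, here is a small sketch on a made-up table (my own markup, parsed with the built-in 'html.parser'). Text nodes are children as well, which is where the blank lines in the printed output come from, so this sketch keeps only Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, compact table with no whitespace between tags.
html = ('<table id="giftList">'
        '<tr><th>Item</th></tr>'
        '<tr id="gift1"><td>Basket <span>New!</span></td></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='giftList')

# .children yields only the direct children of the table.
child_names = [c.name for c in table.children if isinstance(c, Tag)]
# .descendants walks the whole subtree in document order.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)       # ['tr', 'tr']
print(descendant_names)  # ['tr', 'th', 'tr', 'td', 'span']
```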

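After finding the table, select its first tr as the current node and get the sibling nodes that follow it. Because the first tr is the table header row, this extracts every data row and leaves the header out.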

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
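
next_siblings also yields the whitespace strings between the rows. If only the tr tags are wanted, find_next_siblings('tr') is one way to skip them. A sketch on a made-up table (hypothetical markup, built-in 'html.parser'):

```python
from bs4 import BeautifulSoup

# Hypothetical table with one header row and two data rows, one per line.
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('table', id='giftList').tr

# Every following sibling, including the newline strings between rows.
all_siblings = list(header.next_siblings)
# Only the following siblings that are <tr> tags.
rows = header.find_next_siblings('tr')
ids = [row['id'] for row in rows]

print(len(all_siblings))  # 5: newline, row, newline, row, newline
print(ids)                # ['gift1', 'gift2']
```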

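To find a product's price, locate the product's image, go up to its parent <td> tag, and take that tag's previous sibling, which holds the price.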

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)

$15.00
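
The parent.previous_sibling chain depends on there being no whitespace between the td tags. When the cells sit on separate lines, previous_sibling is just a newline string; find_previous_sibling('td') is a more tolerant alternative. A sketch with a made-up row (hypothetical markup):

```python
from bs4 import BeautifulSoup

# Hypothetical row whose cells are separated by newlines.
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>"""
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img', src='../img/gifts/img1.jpg')
cell = img.parent  # the <td> that wraps the image

# Here the previous sibling is the newline between the two cells.
raw_previous = cell.previous_sibling
# find_previous_sibling skips text nodes and returns the previous <td>.
price = cell.find_previous_sibling('td').get_text(strip=True)

print(repr(raw_previous))  # '\n'
print(price)               # $15.00
```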

Collect all the product images. To keep other images from slipping in, match the src attribute against a regular expression.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img.*\.jpg'))
for img in imgs:
    print(img['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
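
In a regular expression an unescaped . matches any character, so a pattern for this path is stricter when the literal dots are escaped. The sketch below uses only the re module; the candidate strings are made up for illustration.

```python
import re

# Unescaped dots: each "." matches any single character.
loose = re.compile(r'../img/gifts/img.*.jpg')
# Escaped dots: "\." matches only a literal dot.
strict = re.compile(r'\.\./img/gifts/img.*\.jpg')

real_path = '../img/gifts/img1.jpg'
# Hypothetical look-alike that is not a relative .jpg path.
look_alike = 'xx/img/gifts/img1-jpg'

loose_hits = [bool(loose.search(s)) for s in (real_path, look_alike)]
strict_hits = [bool(strict.search(s)) for s in (real_path, look_alike)]

print(loose_hits)   # [True, True]
print(strict_hits)  # [True, False]
```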

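find_all() also accepts a function. The function takes a tag as its only argument and should return a boolean: tags for which it returns True are kept, and tags for which it returns False are dropped.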

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# lambda tag: tag.name=='img'
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)

<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>

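Here tag is a bs4 Tag object. has_attr checks whether the tag has the given attribute, and tag.name returns the tag's name. On the page above, the two filters below return the same result, because every tag with a src attribute on that page is an img:
lambda tag: tag.has_attr('src') or lambda tag: tag.name=='img'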
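
The two filters only agree when every tag with a src attribute happens to be an img. On other pages they can differ, as this sketch on a made-up fragment shows (hypothetical markup, built-in 'html.parser'):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a script with src, and an img without src.
html = ('<div>'
        '<img src="a.jpg"/>'
        '<script src="app.js"></script>'
        '<img alt="no source"/>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

# Any tag that carries a src attribute.
with_src = [t.name for t in soup.find_all(lambda tag: tag.has_attr('src'))]
# Any img tag, with or without src.
imgs = [t.name for t in soup.find_all(lambda tag: tag.name == 'img')]
# Combining both conditions keeps only img tags that have a src.
both = soup.find_all(lambda tag: tag.name == 'img' and tag.has_attr('src'))

print(with_src)   # ['img', 'script']
print(imgs)       # ['img', 'img']
print(len(both))  # 1
```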

by @sunhaiyu

2017.7.14
