Python Web Scraping Basics 3: Getting Started with Beautiful Soup

2019-06-01 波波在敲代码

1. Installation

Install Beautiful Soup from the command line (cmd):

pip3 install beautifulsoup4

The following page is used for testing:

import requests
vUrl = requests.get("http://python123.io/ws/demo.html")
vUrl.text

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")
print(vSoup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
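If the demo site is unreachable, the same steps can be followed offline. The sketch below parses a copy of the HTML shown above from a string; it assumes the live page still serves this content, and the names `DEMO_HTML`, `page_title`, and `link_count` are illustrative only. `find_all` is a standard Beautiful Soup method that is not otherwise covered in this post.

```python
from bs4 import BeautifulSoup

# Copy of the demo page's HTML as printed earlier in this post
# (assumption: the live page still returns the same content).
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>\r\n'
    '<body>\r\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses:\r\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>'
    '.</p>\r\n'
    '</body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
link_count = len(soup.find_all("a"))  # number of <a> tags in the page

print(page_title)
print(link_count)
```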

2. Beautiful Soup elements

Parsers supported by Beautiful Soup:

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(data, "html.parser")	bs4 installed
lxml's HTML parser	BeautifulSoup(data, "lxml")	pip3 install lxml
lxml's XML parser	BeautifulSoup(data, "xml")	pip3 install lxml
html5lib's parser	BeautifulSoup(data, "html5lib")	pip3 install html5lib
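Because only "html.parser" ships with Python, a script may want to fall back to it when an optional parser is missing. The following is a minimal sketch of that idea; `make_soup` and its `preferred` parameter are hypothetical names introduced here, while `FeatureNotFound` is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper for illustration; not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>Hello</b></p>")
bold_text = soup.b.string  # the text inside <b>, whichever parser was used
print(bold_text)
```

Different parsers can build slightly different trees from the same fragment (lxml, for example, wraps it in html and body tags), so the fallback is safest when the code only looks up tags by name as above.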

Basic elements of the BeautifulSoup class

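Element	Description
Tag	A tag, the basic unit of organization; it opens with <> and closes with </>
Name	The tag's name; for <p>...</p> the name is "p". Syntax: `<tag>`.name
Attributes	The tag's attributes, organized as a dictionary. Syntax: `<tag>`.attrs
NavigableString	The non-attribute string inside a tag, that is, the text between <> and </>. Syntax: `<tag>`.string
Comment	The comment portion of a string inside a tag; a special type of NavigableString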
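The demo page contains no comments, so the Comment type is easy to miss. The sketch below uses a small made-up HTML string (an assumption, not taken from the demo page) to show that `.string` returns a Comment for a commented tag and a plain NavigableString otherwise, and that the two can be told apart with `isinstance`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Made-up markup for illustration: one tag holds a comment, the other plain text.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string  # a Comment object
p_string = soup.p.string  # a plain NavigableString

print(type(b_string))
print(type(p_string))

# Comment is a subclass of NavigableString, so check for Comment explicitly.
is_comment_b = isinstance(b_string, Comment)
is_comment_p = isinstance(p_string, Comment)
```

Printing either string shows only the text, without the `<!-- -->` markers, which is why the type check matters when comments should be skipped.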

import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")
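vSoup.title # returns the title tag
vTag = vSoup.a # store the first a tag in a variable
print(vTag.name) # name of the a tag
print(vSoup.a.parent.name) # same as vTag.parent.name: name of the parent tag
print(vSoup.a.parent.parent.name) # name of the grandparent tag
print(vTag.attrs) # attributes as a dict
print(vTag.attrs["class"]) # the class attribute
print(vTag.attrs["href"]) # the href attribute
print(type(vTag.attrs)) # type of the attributes object
print(type(vTag)) # type of the tag object
print(vTag.string) # text inside the tag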

a
p
body
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
Basic Python
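The output above shows `class` as a list while `href` is a plain string. A short sketch of why, and of a safer way to read attributes that may be absent: the extra class value "featured" is made up for this example, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The first link of the demo page, with a hypothetical second class value added.
html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1 featured" id="link1">Basic Python</a>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

classes = tag["class"]      # class can hold several values, so bs4 returns a list
href = tag.get("href")      # single-valued attributes come back as strings
missing = tag.get("title")  # .get returns None instead of raising KeyError

print(classes)
print(href)
print(missing)
```

`tag["title"]` would raise KeyError here, so `.get` is the more forgiving choice when scraping pages whose tags do not all carry the same attributes.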

3. Traversing Beautiful Soup elements

Traversal comes in three kinds: downward, upward, and sibling (parallel) traversal.

Downward traversal

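Attribute	Description	Data type
.contents	List of child nodes; all direct children of <tag> are stored in a list	list
.children	Iterator over the child nodes; like .contents, used to loop over direct children	iterator
.descendants	Iterator over all descendant nodes (children, grandchildren, and so on), used in loops	iterator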

import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")

print(vSoup.head)
print(vSoup.head.contents)

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]

print(vSoup.body.contents)

['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

len(vSoup.body.contents) # 5: three '\n' strings and two p tags, as listed above

vSoup.body.contents[1]

<p class="title"><b>The demo python introduces several python courses.</b></p>

vSoup.body.contents[2]

'\n'

vSoup.body.contents[3]

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

Code for downward traversal of the tag tree:

# Iterate over the direct children
for vChild in vSoup.body.children:
    print(vChild)

# Iterate over all descendants (children, grandchildren, and so on)
for vChild in vSoup.body.descendants:
    print(vChild)
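The difference between the two loops is easier to see on a tiny document. The sketch below uses a shortened, made-up version of the demo page body and collects only the tag names, skipping text nodes; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up body: the first <p> contains a nested <b>.
html = ('<body><p class="title"><b>Bold text</b></p>'
        '<p class="course">Plain text</p></body>')
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of <body>.
child_names = [node.name for node in soup.body.children
               if isinstance(node, Tag)]

# .descendants also walks into each child, in document order.
descendant_names = [node.name for node in soup.body.descendants
                    if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```

The nested `b` tag appears only in the second list, because `.children` stops at the first level while `.descendants` keeps going down.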

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags

import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")

vSoup.title.parent # the parent of title is head

<head><title>This is a python demo page</title></head>

vSoup.html.parent # the parent of html is the BeautifulSoup object, so the whole document is shown

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

vSoup.parent # vSoup is a special object; its parent is None, so nothing is shown

# Print the names of all ancestor tags
import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")
for parent in vSoup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

p
body
html
[document]
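The same `.parents` iterator can be used to build a readable path from the document root down to a tag. This is a minimal sketch on a trimmed, made-up version of the demo page; `names` and `path` are illustrative variable names.

```python
from bs4 import BeautifulSoup

# Trimmed, made-up version of the demo page: html > body > p > a.
html = ('<html><body><p class="course">'
        '<a id="link1">Basic Python</a></p></body></html>')
soup = BeautifulSoup(html, "html.parser")

# Start with the tag itself, then append each ancestor from nearest to farthest.
names = [soup.a.name] + [parent.name for parent in soup.a.parents]

# Reverse so the path reads from the root down to the tag.
path = " > ".join(reversed(names))
print(path)
```

The last ancestor is always the BeautifulSoup object, whose name is "[document]", which matches the final line of the output above.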

Sibling traversal

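Attribute	Description	Data type
.next_sibling	The next sibling node in HTML document order	a single node (Tag or NavigableString), or None
.previous_sibling	The previous sibling node in HTML document order	a single node (Tag or NavigableString), or None
.next_siblings	All following sibling nodes in HTML document order	iterator
.previous_siblings	All preceding sibling nodes in HTML document order	iterator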

Sibling traversal only happens among nodes that share the same parent.

import requests
from bs4 import BeautifulSoup
vUrl = requests.get("http://python123.io/ws/demo.html")
vDemo = vUrl.text
vSoup = BeautifulSoup(vDemo, "html.parser")

vSoup.a.next_sibling # the next sibling is a string

' and '

vSoup.a.next_sibling.next_sibling # the sibling after that is another a tag

<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

vSoup.a.previous_sibling # the previous sibling is a piece of text

'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

vSoup.a.previous_sibling.previous_sibling # there is no sibling before that one (the value is None), so nothing is shown

Sibling traversal of the tag tree:

for vSibling in vSoup.a.next_siblings:
    print(vSibling)

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

for vSibling in vSoup.a.previous_siblings:
    print(vSibling)

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
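As the outputs show, siblings include plain strings as well as tags. When only the tags are wanted, one option is to filter by type, and another is bs4's `find_next_sibling` method. The sketch below uses a shortened, made-up version of the course paragraph; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up version of the demo page's course paragraph.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Every following sibling: the string ' and ', the second <a>, and the string '.'.
all_next = list(first_link.next_siblings)

# Option 1: keep only Tag objects, dropping the text nodes.
tag_siblings = [node for node in first_link.next_siblings
                if isinstance(node, Tag)]

# Option 2: let bs4 search for the next sibling tag with a given name.
next_tag = first_link.find_next_sibling("a")

print(len(all_next))
print(tag_siblings)
print(next_tag["id"])
```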
