Chapter 2: The BeautifulSoup Library

2018-06-11 sszhang

from bs4 import BeautifulSoup

2.1 Understanding the tag tree

<p class="title"> ... </p>
p is the tag; class="title" is an attribute. A tag can carry zero or more attributes.

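2.2 BeautifulSoup parsers

BeautifulSoup(mk, 'html.parser'): Python's built-in HTML parser, available as soon as bs4 is installed
BeautifulSoup(mk, 'lxml'): the HTML parser from lxml (requires lxml to be installed)
BeautifulSoup(mk, 'xml'): the XML parser from lxml (requires lxml to be installed)
BeautifulSoup(mk, 'html5lib'): the html5lib parser (requires html5lib to be installed)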
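
Since only 'html.parser' works without an extra install, a script that prefers lxml may want a fallback. The following is a minimal sketch of that idea, not part of the original notes: the helper name make_soup and the sample markup are made up for illustration, and it relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'>Hello</p>")
print(soup.p.string)
```

Either parser finds the same <p> tag here; the two mainly differ in speed and in how they repair badly broken markup.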

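2.3 Basic elements of the BeautifulSoup class

Tag: a tag, the most basic unit of information; opened with <> and closed with </>
Name: the tag's name; the name of <p>...</p> is 'p'; accessed as <tag>.name
Attributes: the tag's attributes, organized as a dict; accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment: the comment part of a string inside a tag; a special kind of NavigableString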

from bs4 import BeautifulSoup
# demo holds the HTML text of the demo page
soup = BeautifulSoup(demo, 'html.parser')
tag_title = soup.title
# <title>This is a python demo</title>
tag_a = soup.a
# <a class="py1" href="http://www.icourse163.org/coursers/BIT-26001" id="link1">Basic Python</a>
soup.a.name
# 'a'
soup.a.parent.name
# 'p'
tag_a.attrs
# {'class': ['py1'], 'href': 'http://www.icourse163.org/coursers/BIT-26001', 'id': 'link1'}
tag_a.attrs['class']
# ['py1']  (class is multi-valued, so bs4 returns a list)
soup.a.string
# 'Basic Python'

newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>', 'html.parser')
newsoup.b.string
# 'This is a comment'
type(newsoup.b.string)
# <class 'bs4.element.Comment'>
newsoup.p.string
# 'This is not a comment'
type(newsoup.p.string)
# <class 'bs4.element.NavigableString'>
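
Because .string hands back comments and ordinary text alike, the printed value alone does not say which one it is; the type has to be checked. A minimal sketch of such a check follows. The helper name visible_text is hypothetical and only serves this example.

```python
from bs4 import BeautifulSoup, Comment

html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Return tag.string unless it is missing or an HTML comment.

    visible_text is a hypothetical helper written for this example.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_text(soup.b))  # None: the only string inside <b> is a comment
print(visible_text(soup.p))  # This is not a comment
```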

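2.4 Tidying markup with prettify()

prettify() adds line breaks and indentation to the markup, so that each tag and each string sits on its own line.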

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
print(soup.a.prettify())
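
To see the effect without fetching the demo page, prettify() can be tried on a one-line snippet. This is a small sketch on made-up markup; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")
pretty = soup.prettify()
print(pretty)

# Strip the indentation to show that every tag and string got its own line.
lines = [line.strip() for line in pretty.splitlines()]
print(lines)
```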

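2.5 Traversing the tag tree

Downward traversal
.contents: a list of child nodes; all direct children of <tag> are stored in the list
.children: an iterator over the child nodes; like .contents, but meant for looping over the children
.descendants: an iterator over all descendant nodes (children, grandchildren, and so on), meant for looping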

soup = BeautifulSoup(demo, 'html.parser')
soup.head
soup.head.contents
soup.body.contents
# .contents returns a list
for child in soup.body.children:
    print(child)
for descendant in soup.body.descendants:
    print(descendant)
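
The difference between the three attributes is easiest to see by counting nodes. The sketch below uses a made-up snippet instead of the demo page; note that strings count as nodes too.

```python
from bs4 import BeautifulSoup, Tag

html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents               # list of direct children only
everything = list(div.descendants)  # children, grandchildren, strings included
tag_names = [n.name for n in everything if isinstance(n, Tag)]

print(len(direct))      # 2
print(len(everything))  # 6
print(tag_names)        # ['p', 'b', 'p']
```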

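Upward traversal
.parent: the node's parent tag; returns a single node, not a list
.parents: an iterator over the node's ancestors, meant for looping over them
.parent returns the whole parent node, contents included. The parent of <html> is the BeautifulSoup object itself, so soup.html.parent prints the whole document; the BeautifulSoup object has no parent, so soup.parent is None.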

soup = BeautifulSoup(demo, 'html.parser')
soup.title.parent
# <head><title>...</title></head>
soup.html.parent
# the BeautifulSoup object, which prints as the whole document: <html>...</html>

soup = BeautifulSoup(demo, 'html.parser')
for parent in soup.a.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
# prints p, body, html, [document], one per line
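
The same ancestor chain can be collected into a list on a small made-up snippet, which also makes the two edge cases at the top of the tree visible.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)                 # ['p', 'body', 'html', '[document]']

print(soup.html.parent is soup)  # True: <html> hangs off the BeautifulSoup object
print(soup.parent is None)       # True: nothing sits above the document
```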

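Sibling traversal (siblings must share the same parent node, and NavigableStrings count as nodes too)
.next_sibling: the next sibling node in HTML text order
.previous_sibling: the previous sibling node in HTML text order
.next_siblings: an iterator over all following sibling nodes in HTML text order
.previous_siblings: an iterator over all preceding sibling nodes in HTML text order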

soup = BeautifulSoup(demo, 'html.parser')
soup.a.next_sibling
soup.a.next_sibling.next_sibling
soup.a.previous_sibling
soup.a.previous_sibling.previous_sibling
# the result may be a string (NavigableString) rather than a tag, or None

for sibling in soup.a.next_siblings:
  print(sibling)
for sibling in soup.a.previous_siblings:
  print(sibling)
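
Since the text between two tags is itself a sibling, code that only wants tags has to filter. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup, Tag

html = "<p><a id='link1'>A</a> and <a id='link2'>B</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the text between the two links, not a tag.
next_node = str(first.next_sibling)
print(next_node)

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])  # ['link2']

print(first.previous_sibling)           # None: nothing comes before the first link
```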

2.6 Three ways to mark up information (XML, JSON, YAML)

XML
Information is expressed with tags and attributes

<img src="China.jpg" size="10">...</img>
or
<img src="China.jpg" size="10" />
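
XML is strict about attribute syntax: values must be quoted and attributes are separated by whitespace, not commas. The sketch below checks this with the standard library's xml.etree.ElementTree, chosen here only because it needs no extra install; it is not the parser used elsewhere in these notes.

```python
import xml.etree.ElementTree as ET

# Well-formed: quoted values, attributes separated by spaces.
img = ET.fromstring('<img src="China.jpg" size="10" />')
print(img.tag)     # img
print(img.attrib)  # {'src': 'China.jpg', 'size': '10'}

# Not well-formed: comma between attributes and an unquoted value.
try:
    ET.fromstring("<img src='China.jpg', size=10 />")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```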

JSON
Typed key-value pairs, written "key": "value"; keys and string values must be wrapped in double quotes
Multiple values go in [ ]
Nesting uses { }

"key": "value"
"key": ["value1", "value2"]
"key": {"subkey": "subvalue"}
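
The double-quote rule is enforced by JSON parsers. A short sketch with Python's standard json module, using made-up keys and values:

```python
import json

text = '{"key": "value", "list": ["value1", "value2"], "nested": {"subkey": "subvalue"}}'
data = json.loads(text)
print(data["list"][1])           # value2
print(data["nested"]["subkey"])  # subvalue

# Single quotes are not valid JSON, so parsing fails.
try:
    json.loads("{'key': 'value'}")
    accepted = True
except json.JSONDecodeError:
    accepted = False
print(accepted)                  # False
```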

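YAML
Untyped key-value pairs, written key: value
Indentation expresses hierarchy
- marks the items of a list
| introduces a multi-line block of text
# starts a comment

key: value
key: # comment
  - value1
  - value2
key:
  subkey: subvalue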

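2.7 Ways to extract information

Method 1: fully parse the markup, then extract the key information
Pro: the information is parsed accurately
Con: the extraction process is tedious and slow
Method 2: ignore the markup and search for the key information directly
Pro: the extraction process is simple
Con: the accuracy of the result depends on the content
Method 3: a combination of the two
Requires both a markup parser and text search functions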

from bs4 import BeautifulSoup
import requests

# requests needs an explicit scheme in the URL
r = requests.get('http://www.python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for link in soup.find_all('a'):
  print(link.get('href'))
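
Why the accuracy of a raw-text search "depends on the content" shows up as soon as the page contains a commented-out link. The sketch below compares the two approaches on made-up markup with example.com placeholder URLs; the regular expression is an assumption chosen for this snippet, not a general-purpose link matcher.

```python
import re

from bs4 import BeautifulSoup

html = """
<p>
  <a href="http://example.com/a">A</a>
  <!-- <a href="http://example.com/old">old link</a> -->
  <a href="http://example.com/b">B</a>
</p>
"""

# Method 2: ignore the markup and search the raw text.
by_regex = re.findall(r'href="([^"]+)"', html)

# Parse first, then query the tree; the comment is not treated as a tag.
soup = BeautifulSoup(html, "html.parser")
by_parser = [a.get("href") for a in soup.find_all("a")]

print(by_regex)   # three URLs, including the commented-out one
print(by_parser)  # only the two live links
```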

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(demo, 'html.parser')

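2.8 BeautifulSoup methods

<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list holding the search results
name: the string to match against tag names
attrs: the string to match against tag attribute values
recursive: whether to search all descendants; defaults to True
string: the string to match against the text between <> and </>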

soup.find_all('a')
soup.find_all(['a', 'b'])
# True matches every tag
for tag in soup.find_all(True):
    print(tag.name)
# a compiled regex matches every tag whose name contains 'b'
for tag in soup.find_all(re.compile('b')):
    print(tag.name)

# <p> tags whose class is 'course'
soup.find_all('p', 'course')
soup.find_all(id='link1')

soup.find(string='Basic Python')
soup.find_all(string=re.compile('python'))
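
The calls above can be run end to end on a small inline document. The markup below is hypothetical, loosely modelled on the demo page, with example.com placeholder URLs; the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup written for this example.
html = (
    '<html><body>'
    '<p class="title"><b>Demo</b></p>'
    '<p class="course">Courses: '
    '<a href="http://example.com/1" id="link1">Basic Python</a> and '
    '<a href="http://example.com/2" id="link2">Advanced Python</a></p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_list = [t.name for t in soup.find_all(["a", "b"])]
by_regex = [t.name for t in soup.find_all(re.compile("b"))]
by_class = soup.find_all("p", "course")
by_id = soup.find_all(id="link1")
by_string = [str(s) for s in soup.find_all(string=re.compile("Python"))]
top_level = soup.find_all("a", recursive=False)  # direct children only

print(by_list)            # ['b', 'a', 'a']
print(by_regex)           # ['body', 'b']
print(len(by_class))      # 1
print(by_id[0]["href"])   # http://example.com/1
print(by_string)          # ['Basic Python', 'Advanced Python']
print(len(top_level))     # 0
```

The last call returns nothing because the only direct child of the document is <html>; with recursive left at its default, both links would be found.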

Shorthand for find_all: the method name can be left out by calling the object directly
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
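
A quick check of the shorthand on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><a>two</a></p>", "html.parser")

same_on_soup = soup("a") == soup.find_all("a")
same_on_tag = soup.p("a") == soup.p.find_all("a")

print(same_on_soup)    # True
print(same_on_tag)     # True
print(len(soup("a")))  # 2
```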

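2.9 Other BeautifulSoup search methods

<>.find(): searches and returns only the first result, a single node rather than a list (None if nothing matches)
<>.find_parents(): searches the ancestor nodes; returns a list
<>.find_parent(): searches the ancestor nodes for one result; returns a single node
<>.find_next_siblings(): searches the following sibling nodes; returns a list
<>.find_next_sibling(): searches the following sibling nodes for one result; returns a single node
<>.find_previous_siblings(): searches the preceding sibling nodes; returns a list
<>.find_previous_sibling(): searches the preceding sibling nodes for one result; returns a single node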
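
The singular methods return one node or None, while the plural ones return a list. A minimal sketch on made-up markup shows both kinds side by side:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">'
    '<a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")
second = soup.find(id="link2")

print(first["id"])                         # link1
print(soup.find("table"))                  # None: no match
print(first.find_parent("p")["class"])     # ['course']
print(first.find_next_sibling("a")["id"])  # link2
print(first.find_previous_sibling("a"))    # None: no <a> before the first one

previous_ids = [t["id"] for t in second.find_previous_siblings("a")]
print(previous_ids)                        # ['link1']
```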