Parsing with Beautiful Soup

2018-06-12 北游_

find(tag, attributes, recursive, text, keywords)

find: returns the first match, searching the document from top to bottom

findAll(tag, attributes, recursive, text, limit, keywords)
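The difference between the two calls can be sketched as follows. The markup is invented for illustration, and the snippet uses `find_all`, the name Beautiful Soup 4 gives to `findAll` (the older spelling is kept as an alias).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<div><p>first</p><p>second</p></div>"
bsObj = BeautifulSoup(html, "html.parser")

first_p = bsObj.find("p")       # a single Tag: the first <p> in document order
all_p = bsObj.find_all("p")     # a list-like ResultSet holding every <p>
missing = bsObj.find("table")   # None, because nothing matches

first_text = first_p.get_text()
all_texts = [p.get_text() for p in all_p]
```

Since find returns None when nothing matches, it is worth checking the result before chaining further calls on it.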

Parameters of the findAll method:

tag: a tag name, or a Python list of tag names
- .findAll('a')
- .findAll(['h1','h2','h3','h4'])
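attributes: a Python dictionary of attribute names and the values to match. Because it is the second positional argument, the tag argument must be supplied as well; to match any tag, pass an empty string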
- .findAll('span',{'class':{'green','red'}})
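recursive: a Boolean. True searches recursively (children and all deeper descendants); False looks only at the first-level tags, that is, the direct children of the object being searched. The default is True.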
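text: matches on the text content of a tag rather than on its attributes. For example, to find the strings in a page whose text is exactly "the price":
- nameList = html.findAll(text='the price')
- nameList is a list of the matching strings, not an integer; use len(nameList) to get the number of matches. In Beautiful Soup 4.4.0 and later this argument is also available under the name string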
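limit: available only on findAll; find behaves like findAll with limit set to 1, except that it returns the single result itself rather than a list. Restricts the results to the first x matches in the document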
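keyword: selects tags that have the specified attribute value (the tags themselves may differ)
- .findAll(id='text') is equivalent to .findAll('', {'id': 'text'})
- Special case:
  - class is a reserved keyword in Python, so it cannot be used as a variable or argument name
  - Beautiful Soup's workaround for keyword arguments is class_
  - For example: .findAll(class_='green')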
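The parameters above can be tried side by side in a minimal sketch. The page content is invented for this example, `find_all` is the Beautiful Soup 4 spelling of `findAll`, and `string=` is used where older releases would take `text=`.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this example.
html = """
<html><body>
  <h1>Title</h1>
  <div id="text">
    <span class="green">apple</span>
    <span class="red">cherry</span>
    <span class="blue">the price</span>
  </div>
  <h2>Subtitle</h2>
</body></html>
"""
bsObj = BeautifulSoup(html, "html.parser")

# tag: one name, or a list of names
headings = bsObj.find_all(["h1", "h2"])

# attributes: a dict of attribute name -> value (or list of accepted values)
coloured = bsObj.find_all("span", {"class": ["green", "red"]})

# recursive=False: only direct children of the tag being searched
body = bsObj.find("body")
direct_spans = body.find_all("span", recursive=False)  # the spans sit inside <div>
nested_spans = body.find_all("span")                   # default: recursive=True

# string (formerly text): returns the matching strings; count them with len()
price_strings = bsObj.find_all(string="the price")

# limit: stop after the first N matches
first_two = bsObj.find_all("span", limit=2)

# keyword arguments: filter on an attribute; class needs a trailing underscore
by_id = bsObj.find_all(id="text")
green = bsObj.find_all(class_="green")
```

Here `direct_spans` comes back empty while `nested_spans` holds all three spans, which is the practical difference the recursive flag makes.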

# get_text() returns the text content of the given tag; note that the parentheses are required
bsObj.find('img',{'src': '……/1.jpg'}).parent.get_text()
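As a small illustration of the line above, the sketch below looks up an image by its src, steps to its parent, and reads the text. The markup and the path "img/1.jpg" are made up for the example and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical table row; "img/1.jpg" is a made-up path.
html = (
    '<table><tr>'
    '<td>Vegetable basket <img src="img/1.jpg"></td>'
    '<td>$15.00</td>'
    '</tr></table>'
)
bsObj = BeautifulSoup(html, "html.parser")

img = bsObj.find("img", {"src": "img/1.jpg"})
cell = img.parent              # the <td> that contains the image
cell_text = cell.get_text()    # text inside that cell only, with tags stripped
method_obj = cell.get_text     # without parentheses: a bound method, not a string
```

Leaving off the parentheses does not raise an error, it just hands back the method object, which is why the mistake is easy to miss.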

Navigating the tree

An entire HTML page can be mapped to a tree.

# # Child tags
# children attribute: all direct children
bsObj.find('table', {'id': 'giftList'}).children

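# Single child: there is no child attribute (.child would search for a <child> tag and return None);
# contents[0] gives the first child node
bsObj.find('table', {'id': 'giftList'}).contents[0]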

# # Descendant tags
# descendants attribute: tags at every level beneath the given parent
bsObj.find('table', {'id': 'giftList'}).descendants

# # Sibling tags (very handy for tables that have a header row)
# next_siblings attribute: all siblings that follow
bsObj.find('table', {'id': 'giftList'}).next_siblings

# next_sibling attribute: the single sibling that follows
bsObj.find('table', {'id': 'giftList'}).next_sibling

# previous_siblings attribute: all siblings that come before
bsObj.find('table', {'id': 'giftList'}).previous_siblings

# previous_sibling attribute: the single sibling that comes before
bsObj.find('table', {'id': 'giftList'}).previous_sibling

# # Parent tag
bsObj.find('img',{'src': '……/1.jpg'}).parent
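The navigation attributes listed above can be compared on one small tree. This is only a sketch: the gift table is invented, and whitespace between tags is deliberately left out so that every child is a tag rather than a whitespace string.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical gift table, written without whitespace between tags.
html = (
    '<div><h2>Gifts</h2>'
    '<table id="giftList">'
    '<tr><th>Item</th></tr>'
    '<tr><td>Basket</td></tr>'
    '<tr><td>Bouquet</td></tr>'
    '</table>'
    '<p>Footer</p></div>'
)
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# children: direct children only (the three rows)
children = [c.name for c in table.children]

# descendants: every level below the table; text nodes are filtered out here
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

# next_siblings: skip the header row and keep the data rows
header_row = table.find("tr")
data_rows = [row.get_text() for row in header_row.next_siblings]

# single-step navigation around the table itself
after = table.next_sibling       # the <p> that follows
before = table.previous_sibling  # the <h2> that precedes
parent = table.parent            # the enclosing <div>
```

On real pages the markup usually contains newlines and indentation, so the sibling and children attributes will also yield whitespace strings; filtering with isinstance(node, Tag), as done for descendants here, is one way to skip them.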
