BeautifulSoup / Python 3: Basic Usage
2018-11-10
疯帮主
Getting started
import bs4
# This markup is deliberately incomplete: some tags are never closed
html = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载
<link rel="icon" href="/static/favicon.ico">
<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
</head>
<body>
<header>
<div class="header-box">
<div class="container">
<span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
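# Pretty-print the parsed tree. The parser has to guess where the missing closing tags go, so the result is not always accurate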
print(soup.prettify())
print(soup.title.string)
print(soup.span.string)
Output:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8"/>
<title>
迅影网,迅雷电影下载,最新电影下载,高清电影下载
<link href="/static/favicon.ico" rel="icon"/>
<link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
</title>
</head>
<body>
<header>
<div class="header-box">
<div class="container">
<span class="header-help">
欢迎来到迅影网,一起分享电影给我们带来的快乐。
</span>
<div class="pull-right">
<a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">
Ctrl+D 加入收藏夹
</a>
-
<a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
保存到桌面
</a>
</div>
</div>
</div>
</header>
</body>
</html>
None
欢迎来到迅影网,一起分享电影给我们带来的快乐。
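The `None` in the output above comes from `soup.title.string`: the parser placed the `<link>` tags inside the unclosed `<title>`, and `.string` returns `None` when a tag has more than one child. The sketch below shows the same effect on a small made-up snippet. It assumes the standard-library `html.parser` backend instead of `lxml`, only so that it runs without extra dependencies.

```python
import bs4

# Hypothetical markup: <p> has two children (a text node and a <b> tag).
html = "<p>Hello <b>world</b></p>"
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None: more than one child, so .string gives up
print(soup.b.string)      # world
print(soup.p.get_text())  # Hello world
```

When a tag may contain nested markup, `get_text()` is usually the safer way to read its text.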
Tag selectors
Selecting elements
html = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
<link rel="icon" href="/static/favicon.ico">
<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
</head>
<body>
<header>
<div class="header-box">
<div class="container">
<span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(type(soup.title))
print(soup.title)
print(soup.head)
print(soup.link)
Output:
<class 'bs4.element.Tag'>
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
<head>
<meta charset="utf-8"/>
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
<link href="/static/favicon.ico" rel="icon"/>
<link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
</head>
<link href="/static/favicon.ico" rel="icon"/>
When several tags share the same name, only the first one is returned.
Getting the tag name
html = """
<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.link.name)
Output:
link
Getting attributes
html = """
<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.link['rel'])
print(soup.link.attrs['rel'])
Output:
['stylesheet']
['stylesheet']
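The value comes back as a list because `rel` is one of the attributes Beautiful Soup treats as multi-valued (like `class`). Most attributes, such as `href`, come back as plain strings. A minimal sketch, assuming the `html.parser` backend; the `media` attribute is just an example of something the tag does not have.

```python
import bs4

html = '<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.link

print(link["rel"])        # ['stylesheet']  (multi-valued attribute, so a list)
print(link["href"])       # /static/bootstrap/css/bootstrap.min.css
print(link.get("media"))  # None; link["media"] would raise KeyError instead
```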
Getting the content
html = """
<div>
<b>在这</b>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.string)
html = """
<div><b>在这</b></div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.string)
Output:
None
在这
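A single line break is enough to break the match: the <div> then has three children (two whitespace strings and the <b> tag), and .string returns None whenever a tag has more than one child.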
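If the surrounding whitespace cannot be controlled, `get_text(strip=True)` or `stripped_strings` may be a more forgiving way to read the text. This is a sketch on the same kind of markup, assuming the `html.parser` backend and an English placeholder text.

```python
import bs4

html = """
<div>
<b>here</b>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
div = soup.div

print(div.string)                  # None: the line breaks are extra children
print(div.get_text(strip=True))    # here
print(list(div.stripped_strings))  # ['here']
```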
Nested selection
html = """
<div>
<b>在这</b>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.b.string)
Output:
在这
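One thing to watch with chained access: if any tag in the chain does not exist, that step evaluates to `None`, and the next attribute access fails. A small sketch of a guard, assuming the `html.parser` backend; the `<span>` lookup is a made-up example of a missing tag.

```python
import bs4

html = "<div><b>here</b></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.div.b.string)  # here

missing = soup.div.span   # there is no <span> inside the <div>
print(missing)            # None

# Guard before going one level deeper; missing.string would raise AttributeError.
text = missing.string if missing is not None else ""
print(repr(text))         # ''
```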
Getting child nodes
Using contents
html = """
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.contents)
Output:
['\n', <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>, ' -\n ', <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>, '\n']
Each tag, and each run of text between tags, is a separate element of the list.
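Since the whitespace strings are part of the list, it is often convenient to keep only the real tags. One way to do that is an `isinstance` check against `bs4.element.Tag`. The sketch assumes the `html.parser` backend and shortened, made-up link targets.

```python
import bs4
from bs4.element import Tag

html = """
<div class="pull-right">
<a href="#first">first</a> -
<a href="#second">second</a>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Keep Tag objects, drop the text nodes ('\n', ' -\n', '\n').
tags_only = [node for node in soup.div.contents if isinstance(node, Tag)]

print(len(soup.div.contents))          # 5
print([t["href"] for t in tags_only])  # ['#first', '#second']
```

`soup.div.find_all(True, recursive=False)` should give the same two tags without the explicit type check.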
Using children
html = """
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.children)
for i,child in enumerate(soup.div.children):
print(i, child)
Output:
<list_iterator object at 0x000001EC5AC591D0>
0
1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
2 -
3 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
4
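children returns an iterator rather than a list.
enumerate yields each item together with its index.
Using descendants to get all descendant nodes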
html = """
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">
<span>
<b>保存到桌面<b>
</span>
</a>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.descendants)
for i,child in enumerate(soup.div.descendants):
print(i, child)
Output:
<generator object Tag.descendants at 0x000001EC5AC66A20>
0
1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
2 Ctrl+D 加入收藏夹
3 -
4 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
<span>
<b>保存到桌面<b>
</b></b></span>
</a>
5
6 <span>
<b>保存到桌面<b>
</b></b></span>
7
8 <b>保存到桌面<b>
</b></b>
9 保存到桌面
10 <b>
</b>
11
12
13
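In the output above, the doubled `</b></b>` appears because the sample writes `<b>` where `</b>` was presumably intended, so the parser opens a second, nested `<b>`. With well-formed markup the traversal is easier to follow: `descendants` walks the tree depth-first, yielding every tag and every string. A sketch assuming the `html.parser` backend and a simplified snippet:

```python
import bs4
from bs4.element import Tag

html = "<div><a href='#'><span><b>saved</b></span></a></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Tags only, in document order (outermost first).
tag_names = [node.name for node in soup.div.descendants if isinstance(node, Tag)]
print(tag_names)  # ['a', 'span', 'b']

# Text nodes only.
texts = [str(node) for node in soup.div.descendants if not isinstance(node, Tag)]
print(texts)      # ['saved']
```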
Parent nodes
Direct parent
html = """
<html>
<body>
<div class="pull-right">
<a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
<a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
</div>
</body>
</html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.parent)
Output:
<body>
<div class="pull-right">
<a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a> -
<a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
</div>
</body>
Ancestor nodes
html = """
<html>
<body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
</html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.p.parents)
for i, parent in enumerate(soup.p.parents):
print(i, parent)
Output:
<generator object PageElement.parents at 0x000001EC5AD93F48>
0 <div>
<p>I am</p>
<p>Here</p>
</div>
1 <body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
2 <html>
<body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
</html>
3 <html>
<body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
</html>
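Entries 2 and 3 in the output look identical, but they are different objects: the last ancestor is the `BeautifulSoup` object itself, which prints as the whole document. Printing the names instead of the nodes makes this visible. A sketch assuming the `html.parser` backend:

```python
import bs4

html = "<html><body><div><p>I am</p><p>Here</p></div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Walk from the first <p> up to the root and record each ancestor's name.
names = [parent.name for parent in soup.p.parents]
print(names)  # ['div', 'body', 'html', '[document]']
```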
Sibling nodes
html = """
<div>
<p>I am here?</p>
<p>Where are you now?</p>
<P>See you late</p>
<p>You are my sunshine</p>
<p>How much I love you</p>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
# Siblings that come after this node
print(list(enumerate(soup.p.next_siblings)))
# Siblings that come before this node
print(list(enumerate(soup.p.previous_siblings)))
Output:
[(0, '\n'), (1, <p>Where are you now?</p>), (2, '\n'), (3, <p>See you late</p>), (4, '\n'), (5, <p>You are my sunshine</p>), (6, '\n'), (7, <p>How much I love you</p>), (8, '\n')]
[(0, '\n')]
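As the output shows, the sibling iterators also yield the `'\n'` strings between tags. When only tags are of interest, `find_next_sibling` / `find_next_siblings` (and their `previous` counterparts) skip the text nodes. A sketch assuming the `html.parser` backend and shortened paragraph texts:

```python
import bs4

html = "<div>\n<p>one</p>\n<p>two</p>\n<p>three</p>\n</div>"
soup = bs4.BeautifulSoup(html, "html.parser")
first = soup.p

print(repr(first.next_sibling))                            # '\n'
print(first.find_next_sibling("p"))                        # <p>two</p>
print([p.string for p in first.find_next_siblings("p")])   # ['two', 'three']
print(first.find_previous_sibling("p"))                    # None
```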
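Standard selectors
find_all(name, attrs, recursive, string, limit, **kwargs)
The examples below use text, the older name of the string argument (it was renamed in Beautiful Soup 4.4.0).
name: tag name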
html = """
<div>
<p>I am here?</p>
<p>Where are you now?</p>
<P>See you late</p>
<p>You are my sunshine</p>
<p>How much I love you</p>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all('p'))
print(type(soup.find_all('p')))
print(soup.find_all('p')[0])
print(type(soup.find_all('p')[0]))
Output:
[<p>I am here?</p>, <p>Where are you now?</p>, <p>See you late</p>, <p>You are my sunshine</p>, <p>How much I love you</p>]
<class 'bs4.element.ResultSet'>
<p>I am here?</p>
<class 'bs4.element.Tag'>
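`name` does not have to be a single string. Passing a list matches any of the listed tags, and `limit` stops the search after a given number of hits. A sketch assuming the `html.parser` backend and a made-up snippet:

```python
import bs4

html = "<div><h1>Title</h1><p>one</p><p>two</p><p>three</p></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order.
mixed = soup.find_all(["h1", "p"])
print([tag.name for tag in mixed])      # ['h1', 'p', 'p', 'p']

# limit caps the number of results.
first_two = soup.find_all("p", limit=2)
print([tag.string for tag in first_two])  # ['one', 'two']
```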
attrs: attributes
html = """
<div class="item active">
<a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
<div class="carousel-caption">反贪风暴3 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
<div class="carousel-caption">黄金兄弟 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
<div class="carousel-caption">江湖儿女 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
<div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
</div></div>"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'href': 'http://www.xunyingwang.com/movie/430296.html'}))
print(soup.find_all(attrs={"class": 'carousel-caption'}))
print(soup.find_all(class_='carousel-caption'))
Output:
[<a href="http://www.xunyingwang.com/movie/430296.html" target="_blank"><img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/></a>]
[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
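A detail worth knowing when filtering by class: `class` is multi-valued, so searching for one class also matches tags that carry additional classes, such as `class="item active"` in the sample. A sketch assuming the `html.parser` backend and a shortened snippet:

```python
import bs4

html = (
    '<div class="item active">a</div>'
    '<div class="item">b</div>'
    '<div class="other">c</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

items = soup.find_all(class_="item")
print([d.string for d in items])    # ['a', 'b']

active = soup.find_all(attrs={"class": "active"})
print([d.string for d in active])   # ['a']

# The attribute itself is stored as a list of classes.
print(items[0]["class"])            # ['item', 'active']
```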
text: text content
html = """
<div class="item active">
<a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
<div class="carousel-caption">反贪风暴3 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
<div class="carousel-caption">黄金兄弟 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
<div class="carousel-caption">江湖儿女 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
<div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div></div>"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all(text='超人总动员2 迅雷下载'))
Output:
['超人总动员2 迅雷下载', '超人总动员2 迅雷下载']
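A plain string only matches text nodes that are exactly equal to it. For partial matches, a compiled regular expression can be passed instead. The sketch below assumes Beautiful Soup 4.4.0 or later (where the argument is called `string`), the `html.parser` backend, and made-up English captions.

```python
import re

import bs4

html = (
    '<div class="carousel-caption">Movie A download</div>'
    '<div class="carousel-caption">Movie B download</div>'
    '<div>About</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# The pattern is searched inside each text node, so a substring is enough.
matches = soup.find_all(string=re.compile("download"))
print([str(m) for m in matches])         # ['Movie A download', 'Movie B download']

# Each result is a text node; .parent leads back to the enclosing tag.
print([m.parent.name for m in matches])  # ['div', 'div']
```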
The find method
html = """
<div class="item active">
<a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
<div class="carousel-caption">反贪风暴3 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
<div class="carousel-caption">黄金兄弟 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
<div class="carousel-caption">江湖儿女 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
<div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div></div>"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find(text='超人总动员2 迅雷下载'))
Output:
超人总动员2 迅雷下载
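`find` takes the same filters as `find_all` but returns only the first match. The two differ when nothing matches: `find` gives `None`, `find_all` gives an empty list. A sketch assuming the `html.parser` backend; the class name `c` is a placeholder.

```python
import bs4

html = '<div class="c">A</div><div class="c">B</div>'
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="c")
same = soup.find_all("div", class_="c", limit=1)

print(first.string)           # A
print(first is same[0])       # True: the same node in the tree
print(soup.find("span"))      # None
print(soup.find_all("span"))  # []
```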
CSS selectors
html = """
<div class="item active">
<a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
<div class="carousel-caption">反贪风暴3 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
<div class="carousel-caption">黄金兄弟 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
<div class="carousel-caption">江湖儿女 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
<div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
</div> <div class="item">
<a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
</div></div>"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.select(".item .carousel-caption"))
print(soup.select(".item a img"))
print(soup.select(".item div")[2].get_text())
print(soup.select(".item a img")[4]['alt'])
Output:
[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>]
[<img alt="反贪风暴3 迅雷下载" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" width="100%"/>, <img alt="黄金兄弟 迅雷下载" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>, <img alt="江湖儿女 迅雷下载" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" width="100%"/>, <img alt="蚁人2:黄蜂女现身 迅雷下载" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>]
超人总动员2 迅雷下载
蚁人2:黄蜂女现身 迅雷下载
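`select` always returns a list. When a single element is wanted, `select_one` returns the first match or `None`. Attribute selectors work as well. The sketch assumes a Beautiful Soup version recent enough to ship CSS support through the `soupsieve` package (4.7 or later), the `html.parser` backend, and made-up paths and `alt` texts.

```python
import bs4

html = """
<div class="item active"><a href="/movie/1.html"><img alt="Movie A"></a></div>
<div class="item"><a href="/movie/2.html"><img alt="Movie B"></a></div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Compound class selector: a tag that has both classes.
print(soup.select_one(".item.active img")["alt"])  # Movie A

# Direct child combinator plus an attribute suffix match.
hrefs = [a["href"] for a in soup.select('.item > a[href$=".html"]')]
print(hrefs)                                       # ['/movie/1.html', '/movie/2.html']

print(soup.select_one(".missing"))                 # None
```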
Reference documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html