Crawler

BeautifulSoup/Python 3 Basic Usage

2018-11-10  疯帮主

Quick start

import bs4

# This HTML is intentionally incomplete: some tags are never closed
html = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载
    <link rel="icon" href="/static/favicon.ico">
    <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
</head>
<body>
<header>
<div class="header-box">
    <div class="container">
        <span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
        <div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面
        </div>

"""
soup = bs4.BeautifulSoup(html, 'lxml')
# Pretty-print the parsed tree. The result is not always ideal: the parser has to guess where unclosed tags end
print(soup.prettify())
print(soup.title.string)
print(soup.span.string)

Output:

<!DOCTYPE html>
<html lang="zh-CN">
 <head>
  <meta charset="utf-8"/>
  <title>
   迅影网,迅雷电影下载,最新电影下载,高清电影下载
   <link href="/static/favicon.ico" rel="icon"/>
   <link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  </title>
 </head>
 <body>
  <header>
   <div class="header-box">
    <div class="container">
     <span class="header-help">
      欢迎来到迅影网,一起分享电影给我们带来的快乐。
     </span>
     <div class="pull-right">
      <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">
       Ctrl+D 加入收藏夹
      </a>
      -
      <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
       保存到桌面
      </a>
     </div>
    </div>
   </div>
  </header>
 </body>
</html>
None
欢迎来到迅影网,一起分享电影给我们带来的快乐。

Tag selectors

Selecting elements

html = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
    <link rel="icon" href="/static/favicon.ico">
    <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
</head>
<body>
<header>
<div class="header-box">
    <div class="container">
        <span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
        <div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
        </div>

"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(type(soup.title))
print(soup.title)
print(soup.head)
print(soup.link)

Output:

<class 'bs4.element.Tag'>
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
<head>
<meta charset="utf-8"/>
<title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
<link href="/static/favicon.ico" rel="icon"/>
<link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
</head>
<link href="/static/favicon.ico" rel="icon"/>

When several tags share the same name, only the first one is returned.

Getting the tag name
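The first-match behaviour can be checked side by side with find and find_all. This is a minimal sketch on a made-up two-link snippet; it assumes Python's built-in "html.parser" instead of the lxml parser used in the article, so that it runs without extra packages.

```python
import bs4

# Hypothetical snippet with two <link> tags (not taken from the page above)
html = '<head><link rel="icon" href="/a.ico"><link rel="stylesheet" href="/b.css"></head>'
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.link                    # attribute access: first <link> only
same = soup.find("link")             # find() also returns the first match
all_links = soup.find_all("link")    # find_all() returns every match

first_href = first["href"]
all_hrefs = [link["href"] for link in all_links]
print(first_href)
print(all_hrefs)
```

To reach the second and later tags, use find_all (or a CSS selector) rather than attribute access.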

html = """
    <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.link.name)

Output:

link

Getting attributes

html = """
    <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.link['rel'])
print(soup.link.attrs['rel'])

Output:

['stylesheet']
['stylesheet']
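The value comes back as a list because Beautiful Soup treats rel (like class) as a multi-valued attribute; ordinary attributes such as href come back as plain strings. The sketch below also shows get, which returns None (or a default) for a missing attribute instead of raising KeyError. It assumes the built-in "html.parser"; the "media" attribute is only an illustration of something that is absent.

```python
import bs4

html = '<link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">'
link = bs4.BeautifulSoup(html, "html.parser").link

rel = link["rel"]                      # multi-valued attribute -> list
href = link["href"]                    # ordinary attribute -> str
missing = link.get("media")            # absent attribute -> None, no KeyError
fallback = link.get("media", "all")    # absent attribute with a default

print(rel, href, missing, fallback)
```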

Getting the content

html = """
<div>
<b>在这</b>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.string)

html = """
<div><b>在这</b></div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.string)

Output:

None
在这

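A single line break is enough to make .string return None: the newlines around <b> become extra text nodes, and .string only returns a value when the tag has exactly one child.

Nested selection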
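When the markup contains whitespace like this, get_text or stripped_strings is usually a more forgiving way to read the text than .string. A small sketch, assuming the built-in "html.parser":

```python
import bs4

html = "<div>\n<b>在这</b>\n</div>"
div = bs4.BeautifulSoup(html, "html.parser").div

direct = div.string                    # None: the div has three children
children = len(div.contents)           # '\n', <b>, '\n'
text = div.get_text(strip=True)        # all text, whitespace stripped
pieces = list(div.stripped_strings)    # each non-empty string, stripped

print(direct, children, text, pieces)
```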

html = """
<div>
<b>在这</b>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.b.string)

Output:

在这

Getting child nodes

Using contents

html = """
<div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
        </div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.contents)

Output:

['\n', <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>, ' -\n            ', <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>, '\n']

Every tag, and every run of text between tags, is a separate element of the list.

Using children
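Since the whitespace strings are rarely wanted, one common approach is to keep only the Tag objects. The sketch below uses a shortened, made-up version of the snippet and the built-in "html.parser"; find_all(recursive=False) is shown as an equivalent way to get direct child tags.

```python
import bs4
from bs4.element import Tag

# Shortened, hypothetical version of the example above
html = '<div>\n<a href="/one">One</a> -\n<a href="/two">Two</a>\n</div>'
div = bs4.BeautifulSoup(html, "html.parser").div

total = len(div.contents)    # tags and text nodes together
child_tags = [child for child in div.contents if isinstance(child, Tag)]
hrefs = [tag["href"] for tag in child_tags]

# find_all with recursive=False looks at direct children only and skips text
direct = div.find_all(recursive=False)

print(total, hrefs, len(direct))
```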

html = """
<div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
        </div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.children)
for i,child in enumerate(soup.div.children):
    print(i, child)

Output:

<list_iterator object at 0x000001EC5AC591D0>
0 

1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
2  -
            
3 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
4 

children returns an iterator.
enumerate yields each index together with its item.

Using descendants to get all descendant nodes

html = """
<div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">
            <span>
            <b>保存到桌面<b>
            </span>
            </a>
        </div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.descendants)
for i,child in enumerate(soup.div.descendants):
    print(i, child)

Output:

<generator object Tag.descendants at 0x000001EC5AC66A20>
0 

1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
2 Ctrl+D 加入收藏夹
3  -
            
4 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
<span>
<b>保存到桌面<b>
</b></b></span>
</a>
5 

6 <span>
<b>保存到桌面<b>
</b></b></span>
7 

8 <b>保存到桌面<b>
</b></b>
9 保存到桌面
10 <b>
</b>
11 

12 

13 
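The difference between children and descendants is easier to see on well-formed markup. In this sketch (made-up HTML, built-in "html.parser") the div has a single direct child, but descendants walks all the way down and also yields the text node.

```python
import bs4
from bs4.element import NavigableString, Tag

# Hypothetical, well-formed variant of the snippet above
html = '<div><a href="/x"><span><b>保存到桌面</b></span></a></div>'
div = bs4.BeautifulSoup(html, "html.parser").div

n_children = len(list(div.children))          # direct children only
n_descendants = len(list(div.descendants))    # every nested node

tag_names = [n.name for n in div.descendants if isinstance(n, Tag)]
strings = [str(n) for n in div.descendants if isinstance(n, NavigableString)]

print(n_children, n_descendants, tag_names, strings)
```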

Parent nodes

Direct parent

html = """
<html>
<body>
<div class="pull-right">
            <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
        </div>
</body>
</html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.div.parent)

Output:

<body>
<div class="pull-right">
<a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a> -
            <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
</div>
</body>

Ancestor nodes

html = """
<html>
    <body>
        <div>
            <p>I am</p>
            <p>Here</p>
        </div>
    </body>
</html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.p.parents)
for i, parent in enumerate(soup.p.parents):
    print(i, parent)

Output:

<generator object PageElement.parents at 0x000001EC5AD93F48>
0 <div>
<p>I am</p>
<p>Here</p>
</div>
1 <body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
2 <html>
<body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
</html>
3 <html>
<body>
<div>
<p>I am</p>
<p>Here</p>
</div>
</body>
</html>
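Items 2 and 3 print the same markup, but they are different objects: the last ancestor is the BeautifulSoup object itself, whose name is "[document]". Printing the names makes this visible. A sketch, assuming the built-in "html.parser":

```python
import bs4

html = "<html><body><div><p>I am</p><p>Here</p></div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Walk upwards from the first <p> and record each ancestor's name
names = [parent.name for parent in soup.p.parents]

# find_parent jumps straight to the nearest ancestor with a given name
body = soup.p.find_parent("body")

print(names)
print(body.name)
```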

Sibling nodes

html = """
<div>
    <p>I am here?</p>
    <p>Where are you now?</p>
    <P>See you late</p>
    <p>You are my sunshine</p>
    <p>How much I love you</p>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
# Following siblings (everything after the first <p>)
print(list(enumerate(soup.p.next_siblings)))
# Preceding siblings (everything before the first <p>)
print(list(enumerate(soup.p.previous_siblings)))

Output:

[(0, '\n'), (1, <p>Where are you now?</p>), (2, '\n'), (3, <p>See you late</p>), (4, '\n'), (5, <p>You are my sunshine</p>), (6, '\n'), (7, <p>How much I love you</p>), (8, '\n')]
[(0, '\n')]
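As the output shows, next_siblings and previous_siblings include the newline strings between tags. If only the sibling tags matter, find_next_sibling and find_next_siblings filter by tag name. A sketch on made-up paragraphs, assuming the built-in "html.parser":

```python
import bs4

html = "<div>\n<p>one</p>\n<p>two</p>\n<p>three</p>\n</div>"
soup = bs4.BeautifulSoup(html, "html.parser")
first = soup.p

raw = list(first.next_siblings)              # tags and '\n' strings
tags_only = first.find_next_siblings("p")    # only the following <p> tags
nearest = first.find_next_sibling("p")       # just the next <p>

texts = [tag.get_text() for tag in tags_only]
print(len(raw), texts, nearest.get_text())
```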

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

name: the tag name

html = """
<div>
    <p>I am here?</p>
    <p>Where are you now?</p>
    <P>See you late</p>
    <p>You are my sunshine</p>
    <p>How much I love you</p>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all('p'))
print(type(soup.find_all('p')))
print(soup.find_all('p')[0])
print(type(soup.find_all('p')[0]))

Output:

[<p>I am here?</p>, <p>Where are you now?</p>, <p>See you late</p>, <p>You are my sunshine</p>, <p>How much I love you</p>]
<class 'bs4.element.ResultSet'>
<p>I am here?</p>
<class 'bs4.element.Tag'>
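The name argument also accepts a list of tag names, and find_all takes a limit argument to stop early. A short sketch on made-up markup, assuming the built-in "html.parser":

```python
import bs4

html = "<div><h1>Title</h1><p>one</p><p>two</p><span>three</span></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# A list matches any of the given names, in document order
two_kinds = [tag.name for tag in soup.find_all(["h1", "span"])]

# limit caps the number of results
first_p = soup.find_all("p", limit=1)
p_count = len(soup.find_all("p"))

print(two_kinds, first_p, p_count)
```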

attrs: attributes

html = """
<div class="item active">
        <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
        <div class="carousel-caption">反贪风暴3 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
        <div class="carousel-caption">黄金兄弟 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
        <div class="carousel-caption">江湖儿女 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
        <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
    </div></div>"""

soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'href': 'http://www.xunyingwang.com/movie/430296.html'}))
print(soup.find_all(attrs={"class": 'carousel-caption'}))
print(soup.find_all(class_='carousel-caption'))

Output:

[<a href="http://www.xunyingwang.com/movie/430296.html" target="_blank"><img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/></a>]
[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
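Two details are worth a quick check: class_ matches any single class of a multi-class tag, and attribute values can be compiled regular expressions. The sketch below uses a shortened, made-up version of the carousel markup and the built-in "html.parser".

```python
import re

import bs4

# Shortened, hypothetical version of the carousel markup
html = (
    '<div class="item active"><a href="/movie/1.html">A</a></div>'
    '<div class="item"><a href="/movie/2.html">B</a></div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

items = soup.find_all("div", class_="item")       # matches "item active" too
active = soup.find_all("div", class_="active")    # only the first div
links = soup.find_all("a", href=re.compile(r"^/movie/"))

hrefs = [a["href"] for a in links]
print(len(items), len(active), hrefs)
```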

text: text content

html = """
<div class="item active">
        <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
        <div class="carousel-caption">反贪风暴3 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
        <div class="carousel-caption">黄金兄弟 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
        <div class="carousel-caption">江湖儿女 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
        <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div></div>"""

soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all(text='超人总动员2 迅雷下载'))

Output:

['超人总动员2 迅雷下载', '超人总动员2 迅雷下载']
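Searching by text alone returns the strings themselves, not the tags that contain them. From Beautiful Soup 4.4.0 onwards the documentation names this argument string, with text as the older name. The sketch below (shortened made-up markup, built-in "html.parser") uses string with a regular expression, then shows two ways to get back to the enclosing tag.

```python
import re

import bs4

html = (
    '<div class="carousel-caption">反贪风暴3 迅雷下载</div>'
    '<div class="carousel-caption">黄金兄弟 迅雷下载</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# Text-only search: the result holds strings
strings = soup.find_all(string=re.compile("黄金"))
parent_name = strings[0].parent.name    # step up to the containing tag

# Combined with a tag name, the result holds tags whose .string matches
tags = soup.find_all("div", string=re.compile("黄金"))

print(strings, parent_name, tags)
```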

The find method

html = """
<div class="item active">
        <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
        <div class="carousel-caption">反贪风暴3 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
        <div class="carousel-caption">黄金兄弟 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
        <div class="carousel-caption">江湖儿女 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
        <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div></div>"""

soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find(text='超人总动员2 迅雷下载'))

Output:

超人总动员2 迅雷下载
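find returns only the first match, and it returns None when nothing matches, whereas find_all returns an empty list. A guard like the one below avoids a TypeError when indexing a missing tag. This is a sketch on made-up markup, assuming the built-in "html.parser".

```python
import bs4

html = '<div class="item"><a href="/movie/1.html">A</a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

hit = soup.find("a")            # first matching tag
miss = soup.find("img")         # no <img> here -> None
empty = soup.find_all("img")    # no <img> here -> empty list

# Check for None before reading attributes of a possibly missing tag
src = miss["src"] if miss is not None else None

print(hit["href"], miss, len(empty), src)
```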

CSS selectors

html = """
<div class="item active">
        <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
        <div class="carousel-caption">反贪风暴3 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
        <div class="carousel-caption">黄金兄弟 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
        <div class="carousel-caption">江湖儿女 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
        <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
    </div>    <div class="item">
        <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
        <div class="carousel-caption">超人总动员2 迅雷下载</div>
    </div></div>"""

soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.select(".item .carousel-caption"))
print(soup.select(".item a img"))
print(soup.select(".item div")[2].get_text())
print(soup.select(".item a img")[4]['alt'])

Output:

[<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>]
[<img alt="反贪风暴3 迅雷下载" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" width="100%"/>, <img alt="黄金兄弟 迅雷下载" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>, <img alt="江湖儿女 迅雷下载" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" width="100%"/>, <img alt="蚁人2:黄蜂女现身 迅雷下载" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>]
超人总动员2 迅雷下载
蚁人2:黄蜂女现身 迅雷下载
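Besides select, there is select_one, which returns the first match (or None) instead of a list. Selectors can also combine classes, use the child combinator, and match on attributes. A sketch with a shortened, made-up version of the carousel markup, assuming the built-in "html.parser":

```python
import bs4

# Shortened, hypothetical version of the carousel markup
html = (
    '<div class="item active"><a href="/movie/1.html"><img alt="One"></a>'
    '<div class="carousel-caption">One</div></div>'
    '<div class="item"><a href="/movie/2.html"><img alt="Two"></a>'
    '<div class="carousel-caption">Two</div></div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

first_caption = soup.select_one(".item .carousel-caption").get_text()
active_alt = soup.select_one("div.item.active img")["alt"]    # both classes required
direct_links = [a["href"] for a in soup.select("div.item > a")]    # direct children
by_attr = soup.select('a[href$="2.html"]')    # href ends with "2.html"

print(first_caption, active_alt, direct_links, len(by_attr))
```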

Reference documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
