Python_Scrapy: Installing and Using Third-Party Modules

2019-01-19  Just_do_1995

Installing third-party modules
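Before going through the individual libraries, it can help to check which of them are already installed. The following is a minimal sketch, not part of the original article: the helper name `missing_packages` and the `REQUIRED` table are assumptions made for illustration. It relies on the fact that the `bs4` module is published on PyPI as `beautifulsoup4`.

```python
import importlib.util

# Import name -> package name on PyPI (bs4 is published as beautifulsoup4)
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "lxml": "lxml"}


def missing_packages(required):
    """Return the PyPI names of the modules that cannot be imported here."""
    return [pkg for module, pkg in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Suggest the usual pip command for whatever is missing
    print("Install with: python -m pip install " + " ".join(missing))
else:
    print("All required modules are available.")
```

Running the script prints either a ready-to-copy pip command or a confirmation that nothing is missing.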

1. Installing and using the requests library

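The requests library essentially imitates what a browser does when it opens a web page: it sends an HTTP request. With it we can quickly download the HTML source of the requested page and keep it locally.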

  1. First, import the requests package

    import requests

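Next, fetch the HTML source of Baidu's index page and store the response in the variable r.
Note that the http:// prefix of the URL must be written out: unlike a real browser, requests does not fill in the protocol for us.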

r = requests.get("http://www.baidu.com")
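To illustrate the note about the http:// prefix, here is a small sketch that checks a URL without touching the network. The helper name `has_scheme_error` is an assumption made for this example. It builds a request and calls `prepare()`, which validates the URL, and requests raises `requests.exceptions.MissingSchema` when the scheme is absent.

```python
import requests


def has_scheme_error(url):
    """Return True if requests rejects the URL because it has no scheme."""
    try:
        # prepare() validates the URL; no network connection is opened
        requests.Request("GET", url).prepare()
    except requests.exceptions.MissingSchema:
        return True
    return False


print(has_scheme_error("www.baidu.com"))         # True: scheme missing
print(has_scheme_error("http://www.baidu.com"))  # False: scheme present
```

Calling `requests.get("www.baidu.com")` directly fails with the same exception.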

Print the downloaded content:

print(r.text)

  1. The Baidu page source that was retrieved
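In practice, a download step usually also handles timeouts, HTTP errors, and text encoding. The sketch below is one possible way to wrap this. The function name `fetch_html` and the 10-second default timeout are assumptions, not something the original article prescribes. Setting `r.encoding` from `r.apparent_encoding` is a common workaround when a server does not declare a charset and the printed text comes out garbled.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses
    r.raise_for_status()
    # Use the encoding detected from the body instead of the header-based guess
    r.encoding = r.apparent_encoding
    return r.text


# Example use (needs network access):
#     print(fetch_html("http://www.baidu.com")[:200])
```

The example call is left as a comment so that the snippet does not touch the network on import.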

2. Installing and using the bs4 library

The bs4 (Beautiful Soup 4) library is a toolkit for parsing, traversing, and maintaining a "tag tree".

<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>

2. Now let's parse this piece of HTML with the bs4 library.

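# Import BeautifulSoup from the bs4 module
from bs4 import BeautifulSoup
# html is assumed to be a string holding the HTML document shown above
soup = BeautifulSoup(html, 'html.parser')
# Print the formatted result
print(soup.prettify())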

'''
OUT:

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
'''

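Put simply: bs4 parses the HTML source into a tag tree (prettify() prints that tree with one tag per line),
which makes it easy to work with the nodes, tags, and attributes it contains.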
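As an illustration of working with tags and attributes, the sketch below parses a shortened copy of the sample document and reads a few values from it. The variable names are assumptions chosen for this example. The calls used (`soup.title`, `find_all`, `find`, `get_text`, and subscript access to attributes) are standard Beautiful Soup API.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document from this section
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string                       # text inside <title>
links = [a["href"] for a in soup.find_all("a")]      # href attribute of every <a>
names = [a.get_text() for a in soup.find_all("a", class_="sister")]
tillie = soup.find(id="link3")                       # look up a single tag by id

print(title_text)       # The Dormouse's story
print(links)
print(names)            # ['Elsie', 'Lacie', 'Tillie']
print(tillie["class"])  # ['sister'] (class is a multi-valued attribute)
```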

3. Installing and using a parser for bs4

Here we use the lxml parser.

import bs4


# First, cook the HTML file into a "soup" using the lxml parser
with open('Beautiful Soup 爬虫/demo.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

# Print the result: a clearly indented tree structure
print(soup.prettify())

'''
OUT:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''
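If lxml is not installed, bs4 raises `bs4.FeatureNotFound` when asked for the 'lxml' parser. The sketch below shows one way to fall back to the built-in 'html.parser' in that case. The helper name `make_soup` is an assumption made for this example, not something the original article defines.

```python
import bs4


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return bs4.BeautifulSoup(markup, "lxml")
    except bs4.FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return bs4.BeautifulSoup(markup, "html.parser")


fragment = "<p class='title'><b>The Dormouse's story</b></p>"
soup = make_soup(fragment)
print(soup.b.string)    # The Dormouse's story
print(soup.p["class"])  # ['title']
```

The two parsers are not fully interchangeable: lxml wraps a bare fragment in html and body tags, while html.parser leaves it as written, so code that depends on the exact tree shape should pin one parser.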