A Simple Introduction to Web Scraping with Python

2021-09-07, by 纵春水东流

1. Required packages

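# urllib, used below to download pages, ships with Python and needs no install
# (urllib3 is a separate third-party package and is not required here)
pip install beautifulsoup4  # parses pages and manages (adds, deletes, modifies, queries) page data
pip install lxml  # parser backend used by BeautifulSoup(html, 'lxml') below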

import urllib.request  # in Python 3, urlopen lives in urllib.request
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

2. Extracting information

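soup = BeautifulSoup(html, 'lxml')  # parse the HTML first; soup must exist before it is queried
a = soup.find(class_='story')  # first tag whose class is "story"
print('(1):\n', a.get_text())

print('(2):\n', soup.find('p', 'story').get_text())  # same lookup, restricted to p tags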
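
Both lookups above return only the first match. The following is a minimal sketch of collecting every match instead. It assumes a simplified copy of the sample HTML (the comment inside the first link is replaced by plain text) and uses the built-in html.parser so that lxml is not required; neither choice comes from the original post.

```python
from bs4 import BeautifulSoup

# Simplified copy of the sample page (an assumption made for this sketch).
html = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, not just the first one.
stories = [p.get_text() for p in soup.find_all('p', class_='story')]

# Pair each link's visible text with its href attribute.
links = [(a.get_text(), a['href']) for a in soup.find_all('a')]

for text, href in links:
    print(text, href)
```

Indexing a tag with a['href'] raises KeyError when the attribute is missing; a.get('href') returns None instead, which is usually safer on real pages.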

3. Other
(1) Querying

# html = urllib.request.urlopen('https://www.baidu.com').read()  # urlopen downloads the page data from Baidu
soup = BeautifulSoup(html, 'lxml')  # convert html into a BeautifulSoup object
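
The commented-out download step can be sketched with the standard library roughly as follows. The helper name fetch_soup and the 10-second timeout are assumptions made for illustration, and html.parser is used here so the sketch does not depend on lxml.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return the page parsed as a BeautifulSoup object.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # raw bytes; BeautifulSoup detects the encoding
    return BeautifulSoup(raw, 'html.parser')


# Example call (requires network access, so it is left commented out):
# soup = fetch_soup('https://www.baidu.com')
# print(soup.title)
```

The URL must include a scheme such as https://, otherwise urlopen raises ValueError. Some sites also reject requests that lack a browser-like User-Agent header, in which case a urllib.request.Request object with custom headers can be passed in place of the plain URL string.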

Structure of the soup object

# Print the whole page, re-indented
print(soup.prettify())

# (1) Tag information
print('First p tag:', soup.p)
print('First a tag:', soup.a, "\n")
# (2) Tag names
print("Name of the first p tag:", soup.p.name)
print("Name of the first a tag:", soup.a.name)
print("Name of the first a tag's parent:", soup.a.parent.name)
print("Name of the grandparent tag:", soup.a.parent.parent.name, "\n")
# (3) Tag attributes
print("Type of a tag:", type(soup.a))
print("All attributes of the tag:", soup.a.attrs)
print("Type of the attributes container:", type(soup.a.attrs))
print("Value of one attribute:", soup.a.attrs['class'], '\n')
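# (4) Tag contents
print('String of the a tag, i.e. the text between its opening and closing tags:', soup.a.string)
print('Type of the tag string:', type(soup.a.string))
# Note: p wraps a single b tag, so .string descends into b and returns its text
# without the b markup. A tag with several children has .string equal to None.
print('String of the p tag:', soup.p.string, '\n')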
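# (5) Searching for tags with find_all
# • name: tag name(s) to match
# • attrs: attribute values to match; a specific attribute can be named
# • recursive: whether to search all descendants; defaults to True
# • string: text to match in the string region between <> and </>
print('All a tags:', soup.find_all('a'))  # find_all() looks up a tags by name and returns a list (a ResultSet)
print('All a and b tags:', soup.find_all(['a', 'b']))  # pass a list to match a and b tags in one call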
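
The calls above only use the name parameter. The following is a minimal sketch that exercises the other listed parameters as well. It assumes a simplified copy of the sample HTML (the comment inside the first link is replaced by plain text) and the built-in html.parser; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Simplified copy of the sample page (an assumption made for this sketch).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# name: match by tag name.
all_links = soup.find_all('a')

# attrs: match by attribute value.
link2 = soup.find_all('a', attrs={'id': 'link2'})

# class_ is a keyword shortcut, because "class" is reserved in Python.
sisters = soup.find_all(class_='sister')

# string: match text nodes; here, strings ending in "ie".
names = soup.find_all(string=re.compile('ie$'))

# recursive=False: search only the direct children of the html tag.
direct_links = soup.html.find_all('a', recursive=False)

print(len(all_links), link2[0]['href'], len(sisters))
print(names)
print(direct_links)
```

With recursive=False the last search finds nothing, because the a tags sit inside p tags rather than directly under html.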
