《Python网络数据采集》 (Web Scraping with Python): Day 2

2018-07-16, by 三横一竖是我

7-16

The BeautifulSoup library

bsObj = BeautifulSoup(html.read())
Running this line produces a warning:

The code that caused this warning is on line 4 of the file F:/pythonwork/a01.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html.parser")

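This happens because no parser was specified explicitly, so BeautifulSoup had to pick a default one. Changing the code as the message suggests makes the warning go away.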
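As a minimal sketch of the fix, the snippet below passes "html.parser" explicitly. The inline markup in html_doc is a made-up stand-in for html.read(), so nothing here touches the network.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for html.read()
html_doc = "<html><body><h1>An Example Page</h1></body></html>"

# Naming the parser explicitly avoids the "no parser specified" warning
bs_obj = BeautifulSoup(html_doc, "html.parser")
print(bs_obj.h1.get_text())  # prints: An Example Page
```

"html.parser" ships with Python, so it needs no extra install; other parsers such as lxml have to be installed separately.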

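Using find() and findAll()
Both functions are mostly used with two arguments, tag and attributes.
The tag argument is a tag name, or a collection of tag names:
.findAll({"h1", "h2"})
The attributes argument is a Python dictionary that maps attribute names to the values to match:
.findAll("span", {"class": {"green", "red"}})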
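The two arguments can be tried out on a small document. This is only a sketch: the markup is invented, find_all is the current name for findAll, and lists are used instead of sets because lists are the form shown in the BeautifulSoup documentation.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration
html_doc = """
<h1>Title</h1>
<h2>Subtitle</h2>
<p><span class="green">Anna</span> spoke to <span class="red">the prince</span>
<span class="blue">quietly</span>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag argument: match any of several tag names
headings = soup.find_all(["h1", "h2"])

# attributes argument: match spans whose class is green or red
colored = soup.find_all("span", {"class": ["green", "red"]})

print([t.name for t in headings])
print([s.get_text() for s in colored])
```

Both calls return the matches in document order, so the blue span is simply left out of the second result.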

Handling server errors

The first case is that the server returns an HTTP error, for example 404 Not Found. In all such cases the urlopen function raises an HTTPError:

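from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://")
except HTTPError as e:
    # The server responded with an error status
    print(e)
else:
    # The program continues
    pass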

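The second case is that the server does not exist at all, for example because the URL is mistyped. urlopen is described as returning None here, which can be checked as shown below. Note, however, that in Python 3 urllib.request.urlopen actually raises URLError (the parent class of HTTPError) in this situation, so catching URLError is the reliable way to handle it.

if html is None:
    print("url is not found")
else:
    # The program continues
    pass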
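A minimal sketch that handles both failure modes in one place. The helper name fetch and its opener parameter are assumptions made for illustration (opener exists only so the function can be tried without a network connection); HTTPError and URLError are the real exception classes from urllib.error.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response for url, or None if the request fails.

    fetch and opener are hypothetical names used only in this sketch.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("HTTP error:", e.code)
    except URLError as e:
        # The server could not be reached at all (bad domain, refused connection)
        print("Server not reachable:", e.reason)
    return None
```

HTTPError has to be caught before URLError, because it is a subclass of URLError and would otherwise be swallowed by the more general handler.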

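When accessing a tag on a BeautifulSoup object, it is best to add a check in case the tag does not exist.
If the tag does not exist, accessing it returns None, and calling anything further on that None raises an exception: AttributeError: 'NoneType' object has no attribute 'someTag'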
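The failure can be reproduced on a tiny document. This is a sketch with invented markup that deliberately contains no h1 tag.

```python
from bs4 import BeautifulSoup

# Invented markup with no <h1>, to trigger the missing-tag case
soup = BeautifulSoup("<html><body><p>No heading here</p></body></html>", "html.parser")

heading = soup.body.h1  # the tag does not exist, so this is None
print(heading)

try:
    heading.get_text()  # calling a method on None
except AttributeError as e:
    print("AttributeError:", e)

# Checking for None first avoids the exception
text = heading.get_text() if heading is not None else None
print(text)
```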
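Reorganizing and combining these checks gives a getTitle function:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server returned an HTTP error
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None, so .h1 cannot be accessed
        return None
    return title

title = getTitle("")
if title is None:
    print("Title could not be found")
else:
    print(title)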

Regular expressions

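Regular expressions are used to scan large amounts of text: if a given string matches the rules it is returned, otherwise it is ignored. The strings they describe are those that can be built from a series of linear rules, for example:
aa*bbbbb(cc)*(d | )
Here aa* is an a followed by any number of further a's, so a appears at least once; bbbbb is exactly five b's; (cc)* is any number of pairs of c (including none); and (d | ) means the string ends with either a d followed by a space, or just a space.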
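To check this reading of the rules, the pattern can be run through Python's re module. The sketch assumes the pattern is aa*bbbbb(cc)*(d | ), as the explanation describes; the candidate strings are made up.

```python
import re

# One or more a's, exactly five b's, any number of "cc" pairs,
# then either "d " or a single space
pattern = re.compile(r"aa*bbbbb(cc)*(d | )")

candidates = [
    "aaaabbbbbccccd ",  # matches
    "abbbbb ",          # matches: no c's, ends with a space
    "abbbbbccc ",       # odd number of c's
    "bbbbb ",           # no leading a
    "aabbbb ",          # only four b's
]

# fullmatch requires the whole string to fit the pattern
matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)  # prints: ['aaaabbbbbccccd ', 'abbbbb ']
```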
A classic application of regular expressions is recognizing email addresses.
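As a rough illustration, the sketch below uses a deliberately simplified pattern. Both the pattern and the helper name looks_like_email are assumptions for this example: it only accepts four top-level domains and a single-label domain name, so it is far from a complete email validator.

```python
import re

# Simplified, assumed pattern: local part, "@", a letters-only domain,
# and one of four top-level domains
email_pattern = re.compile(r"[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)")


def looks_like_email(text):
    """Return True if the whole string fits the simplified pattern."""
    return email_pattern.fullmatch(text) is not None


print(looks_like_email("someone@example.com"))
print(looks_like_email("not-an-email"))
```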
An online tutorial: http://www.runoob.com/regexp/regexp-tutorial.html
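Regular expressions make BeautifulSoup functions such as findAll more flexible: passed as an argument, a regular expression works much like a filter.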
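A small sketch of this filtering effect: a compiled pattern is passed as the value to match for an attribute. The markup and image paths are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Invented markup: two product images and one logo
html_doc = """
<img src="../img/gifts/img1.jpg">
<img src="../img/logo.png">
<img src="../img/gifts/img2.jpg">
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only <img> tags whose src fits the product-image pattern
gift_images = soup.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
sources = [img["src"] for img in gift_images]
print(sources)
```

The logo is dropped because its src does not fit the pattern, which a plain string match on a fixed value could not express.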

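To get a tag's attributes, use myTag.attrs, which returns a Python dictionary. A single attribute is then looked up by key, for example myTag.attrs["src"].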
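A short sketch of attribute access on an invented img tag. my_tag here plays the role of myTag.

```python
from bs4 import BeautifulSoup

# Invented single-tag document
soup = BeautifulSoup('<img src="logo.png" alt="Site logo">', "html.parser")
my_tag = soup.img

print(my_tag.attrs)          # the whole attribute dictionary
print(my_tag.attrs["src"])   # one attribute value, looked up by key
print(my_tag.get("width"))   # .get() returns None for a missing attribute
```

Looking up a missing key with my_tag.attrs["width"] would raise KeyError, as with any dictionary, so .get() is the safer choice when an attribute may be absent.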

Lambda expressions

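A lambda expression is essentially a function that can be passed as an argument to another function. BeautifulSoup allows certain functions to be passed to findAll, with the requirement that the function takes a tag object as its argument and returns a boolean.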
soup.findAll(lambda tag: len(tag.attrs) == 2)
This retrieves every tag that has exactly two attributes.
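The same lambda can be tried on a small invented document, to see which tags pass the test. find_all is the current name for findAll.

```python
from bs4 import BeautifulSoup

# Invented markup: div and img have two attributes, a has one, p has none
html_doc = """
<div id="main" class="content">
  <a href="/home">Home</a>
  <img src="a.png" alt="A">
  <p>Plain paragraph</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# The lambda receives each tag and returns True or False
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]
print(names)  # prints: ['div', 'img']
```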
