Web Crawler Study Notes

2020-03-03  zfz_amzing

Web Crawling

Python Review

Data Types

File Operations

A file opened with mode "w" is saved when it is closed, but reopening it with "w" truncates the file, so new writes overwrite the previous contents.

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()

# File contents: 快活啊

Switching to mode "a" appends to the existing contents instead.

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "a", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()
# File contents: 来啊快活啊

Exception Handling
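This section is blank in the notes; as a minimal sketch, Python handles errors with try/except/finally (the safe_divide helper below is made up for illustration):

```python
def safe_divide(a, b):
    """Divide a by b, returning None instead of raising on division by zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        # runs only when b == 0
        result = None
    finally:
        # runs whether or not an exception occurred
        pass
    return result

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # None
```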

Object-Oriented Programming

A method in a class can be called on an instance when its first parameter is self (which receives the instance automatically).

class Father:
    def speak(self):
        print("I can speak")


class Mother:
    def write(self):
        print("I can write")


class Son(Father):
    def speak(self):
        print("I can speak well")


class Daughter(Father, Mother):
    pass


father = Father()
father.speak()
son = Son()
son.speak()
daughter = Daughter()
daughter.speak()
daughter.write()


# I can speak
# I can speak well
# I can speak
# I can write

Regular Expressions

Regular expressions can filter data and extract the information we care about.

Atoms

Regular expressions in Python are used through the re module.

An atom is the most basic unit of a regular expression, and every regular expression contains at least one atom. Common atom types include:

  1. Ordinary characters as atoms

    import re
    
    string = "xixihaha"
    # ordinary characters as atoms
    pat = "ih"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(3, 5), match='ih'>
    
    
  2. Non-printing characters as atoms

    # non-printing characters as atoms
    # \n matches a newline, \t a tab
    string = '''xixi
    haha
    '''
    pat = "\n"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(4, 5), match='\n'>
    
  3. General character atoms

    • \w matches letters, digits, and underscores; \W matches anything else
    • \d matches decimal digits; \D matches non-digits
    • \s matches whitespace characters; \S matches non-whitespace
    # general character atoms
    string = '''xixiha123haha'''
    pat = r"\d\d"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(6, 8), match='12'>
    
  4. Atom tables: grouping different atoms into one class

    string = '''zhangfuzhi'''
    pat = "zhang[efg]u" # the class [efg] contains 'e', 'f', and 'g'; if any one of them matches at that position, the match succeeds
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(0, 7), match='zhangfu'>
    # Counter-example:
    string = '''zhangfuzhi'''
    pat = "zhang[eg]u"
    result = re.search(pat, string)
    print(result)
    # None
    

Metacharacters

Metacharacters are characters with special meaning in a regular expression, such as repeating the preceding character N times.

Pattern Modifiers (Flags)

Flags change how a pattern is applied; for example, re.I makes matching case-insensitive:

string = '''Python'''
pat = "pyt"
result = re.search(pat, string, re.I)
print(result)
# <re.Match object; span=(0, 3), match='Pyt'>

Greedy vs. Lazy Matching

Greedy matching tries to match as much as possible; lazy matching tries to match as little as possible. Matching is greedy by default.

string = '''Pythony'''
pat = "p.*y"  # greedy mode
pat2 = "p.*?y"  # the trailing ? switches the quantifier to lazy mode
result = re.search(pat, string, re.I)
result2 = re.search(pat2, string, re.I)
print(result)
print(result2)

# <re.Match object; span=(0, 7), match='Pythony'>
# <re.Match object; span=(0, 2), match='Py'>

Regular Expression Functions
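This section is empty in the notes; a quick sketch of the common re functions, including the search and findall calls used in the examples below:

```python
import re

s = "cat bat cat"

# re.match anchors at the start of the string
print(re.match(r"bat", s))           # None
# re.search scans for the first match anywhere
print(re.search(r"bat", s).group())  # bat
# re.findall returns every non-overlapping match as a list
print(re.findall(r"[cb]at", s))      # ['cat', 'bat', 'cat']
# re.sub replaces every match
print(re.sub(r"cat", "dog", s))      # dog bat dog
```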

Regex Examples

Matching .com or .cn

string = "<a href='http://www.baidu.com'>百度</a><a href='http://www.jd.com'>百度</a>"
pat = r"[a-zA-Z]+://[^\s]*?(?:\.com|\.cn)"  # (?:\.com|\.cn) requires the match to end in a literal .com or .cn
result = re.compile(pat).findall(string)
print(result)
# ['http://www.baidu.com', 'http://www.jd.com']

Matching phone numbers

string = "dashd0534-5657888dasbd a0534-5325695asdshgiusae//.001-12345678"
pat = r"\d{4}-\d{7}|\d{3}-\d{8}"
result = re.compile(pat).findall(string)
print(result)
# ['0534-5657888', '0534-5325695', '001-12345678']

A Simple Exercise

Scrape publisher names and write them to a file

import re
import urllib.request

url = "https://read.douban.com/provider/all"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36 Edg/80.0.361.62'}
ret = urllib.request.Request(url, headers=header)
res = urllib.request.urlopen(ret)
data = res.read().decode('utf-8')

# <div class="name">重庆大学出版社</div>
pat = ">(.{1,10}?出版社)"
result = re.compile(pat).findall(data)
print(result)
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
for i in result:
    fh.write(i + '\n')
fh.close()

The Requests Library

The BeautifulSoup Library

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

XPath

XPath is a language for finding information in XML documents.

Nodes

Element, attribute, text, namespace, and document (root) nodes.

Node relationships

XPath Syntax

| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of the named node |
| / | Selects direct children of the current node |
| // | Selects descendants of the current node at any depth |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
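The expressions above can be tried without third-party libraries: Python's built-in xml.etree.ElementTree supports a limited XPath subset (the bookstore XML below is made-up sample data):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<bookstore>
  <book category="web"><title>XPath Basics</title></book>
  <book category="cooking"><title>Everyday Meals</title></book>
</bookstore>
""")

# nodename: all <book> children of the root
print(len(root.findall("book")))                          # 2
# //: <title> descendants at any depth
print(root.findall(".//title")[0].text)                   # XPath Basics
# @: filter elements by attribute value
print(root.find("book[@category='cooking']/title").text)  # Everyday Meals
```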