Crawler Notes (3): Regular Expressions and Cookies

2017-12-13 WeirdoSu

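Regular expression basics:

Atoms: the most basic building blocks of a regular expression:

Ordinary characters used as atoms;
Non-printing characters: newline \n, tab \t;
Generic character classes: \w matches any single letter, digit, or underscore; \W matches any character other than a letter, digit, or underscore; \d matches a decimal digit; \D matches any character that is not a decimal digit; \s matches a whitespace character;
Atom tables (character sets): [] matches any one character listed inside the brackets; [^] matches any one character not listed inside the brackets.
Metacharacters:
Any character: . matches any character except a newline;
Anchors: ^ matches the start position, $ matches the end position;
Quantifiers: * matches 0 or more times, ? matches 0 or 1 time, + matches 1 or more times, {n} exactly n times, {n,} at least n times, {n,m} between n and m times;
Alternation: | chooses between alternative patterns;
Grouping: () combines small atoms into one larger atom.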
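The atoms and metacharacters above can be tried out with a short sketch like the following. The sample strings are made up for illustration, and `(?:...)` is a non-capturing variant of the grouping parentheses, used here so that `findall()` returns the whole match.

```python
import re

text = "cat_9 bat\trat"

# Generic character classes
words = re.findall(r"\w+", text)        # runs of letters, digits, underscores
digits = re.findall(r"\d", text)        # single decimal digits

# Atom tables (character sets)
b_or_c = re.findall(r"[bc]at", text)    # "at" preceded by b or c
neither = re.findall(r"[^bc]at", text)  # "at" preceded by any other character

# . matches any character except a newline
dotted = re.findall(r"a.c", "abc a\nc a-c")

# Anchors: ^ for the start, $ for the end
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_rat = re.search(r"rat$", text) is not None

# Quantifier, alternation, grouping
repeats = re.findall(r"ab{2,3}", "ab abb abbb abbbb")
language = re.search(r"(py|java)script", "javascript").group(1)
grouped = re.findall(r"(?:ab)+", "ababab ab")

print(words, digits, b_or_c, neither, dotted)
print(starts_with_cat, ends_with_rat, repeats, language, grouped)
```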

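Pattern modifiers (flags):

I: ignore case;
M: multi-line matching;
L: locale-aware matching (in Python 3, only usable with bytes patterns);
U: interpret characters according to Unicode (already the default for str patterns in Python 3);
S: make . match any character, including a newline;
Example: re.search(pattern, string, re.I)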
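A small sketch of how the I, M, and S flags change matching. The sample text is an assumption chosen for illustration; flags can be combined with `|`.

```python
import re

text = "Python\npython\nPYTHON"

# re.I: case-insensitive matching
any_case = re.findall(r"python", text, re.I)

# re.M: ^ and $ also match at the start and end of each line
per_line = re.findall(r"^p\w+$", text, re.I | re.M)
whole_string_only = re.findall(r"^p\w+$", text, re.I)

# re.S: . also matches the newline character
without_s = re.search(r"Python.python", text)
with_s = re.search(r"Python.python", text, re.S)

print(any_case, per_line, whole_string_only)
print(without_s, with_s.group())
```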

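Greedy and lazy modes:

Greedy mode: match as much as possible;
Lazy mode: match as little as possible;
Greedy mode is the default; adding ? right after a quantifier switches that quantifier to lazy mode.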
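The difference is easiest to see on markup-like text. This is a minimal sketch with a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b> it can reach
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b> that lets the pattern succeed
lazy = re.findall(r"<b>.*?</b>", html)

# The same applies to other quantifiers, e.g. + versus +?
greedy_digits = re.search(r"\d+", "12345").group()
lazy_digits = re.search(r"\d+?", "12345").group()

print(greedy, lazy, greedy_digits, lazy_digits)
```

When scraping pages, the lazy form is usually the one wanted for extracting repeated elements, since the greedy form merges everything between the first opening tag and the last closing tag.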

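Common functions:

re.match(pattern, string, [flags]): matches only from the first character of the string;
re.search(pattern, string, [flags]): scans the whole string and returns the first match;
Global matching: the idea is to precompile the pattern with re.compile(), then call findall() on it to get all matches;
re.sub(pattern, repl, string, [count]): searches string for pattern and replaces each match with repl, at most count times.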
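The four functions can be compared side by side on one string. This is an illustrative sketch; the sample text is made up, and `count` is passed by keyword because that form works across Python 3 versions.

```python
import re

text = "hellomypythonhispythonourpythonend"

# match() only tries at position 0, so it fails here
at_start = re.match(r"python", text)

# search() scans forward and returns the first hit
first_hit = re.search(r"python", text).span()

# compile() + findall() returns every non-overlapping match
compiled = re.compile(r".python.")
all_hits = compiled.findall(text)

# sub() replaces matches, here at most two of them
replaced = re.sub(r"python", "php", text, count=2)

print(at_start, first_hit, all_hits, replaced)
```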

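Common examples:

Match `URL`s with a `.com` or `.cn` suffix:

In [1]: import re
# Alternation needs a group: [.com|.cn] would be a character set that matches a single character
In [2]: pattern = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"
In [3]: string = "<a href='http://www.baidu.com'>百度首页</a>"
In [4]: result = re.search(pattern, string)
In [5]: print(result)
<_sre.SRE_Match object; span=(9, 29), match='http://www.baidu.com'>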

Match phone numbers:

In [6]: pattern2 = r"\d{4}-\d{7}|\d{3}-\d{8}"
In [7]: string2 = "021-23423432423423423423"
In [8]: result2 = re.search(pattern2, string2)
In [9]: print(result2)
<_sre.SRE_Match object; span=(0, 12), match='021-23423432'>
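Note that the pattern above happily matches the first 12 characters of a much longer digit string. If that is not wanted, one possible refinement (an assumption on my part, not part of the original notes) is to add lookarounds so that the number may not be preceded or followed by another digit:

```python
import re

# (?<!\d) and (?!\d) reject matches that sit inside a longer run of digits
strict_phone = re.compile(r"(?<!\d)(?:\d{4}-\d{7}|\d{3}-\d{8})(?!\d)")

too_long = strict_phone.search("021-23423432423423423423")
found = strict_phone.findall("call 021-23423432 or 0773-1234567")

print(too_long, found)
```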

Match email addresses:

In [10]: pattern3 = r"\w+([.+-]\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*"
In [11]: string3 = "mailto:c-e+o@iqi-anyue.com.cn"
In [12]: result3 = re.search(pattern3, string3)
In [13]: print(result3)
<_sre.SRE_Match object; span=(7, 29), match='c-e+o@iqi-anyue.com.cn'>

`Cookie`

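What is a `Cookie`:

Cookies come into play whenever a login is involved;
HTTP is a stateless protocol and cannot keep state between requests on its own, so information such as a successful login has to be saved some other way: with a Cookie or a Session.

Cookie: all session information is stored on the client; when another page of the same site is visited, the session information is read from the Cookie to determine the session state;
Session: session information is stored on the server, but the server sends the client a SessionID or similar identifier, which is usually stored in a Cookie on the client.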
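At the HTTP level, a session ID typically travels in a Set-Cookie response header and comes back in a Cookie request header. The sketch below parses such a header with the standard library's http.cookies module; the cookie name `sessionid` and its value are hypothetical, not taken from any real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value a server might send after a successful login
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

session_id = cookie["sessionid"].value
path = cookie["sessionid"]["path"]

# What the client would send back on later requests to the same site
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)

print(session_id, path, cookie_header)
```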

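`Cookiejar` in practice:

To find the real login URL, you need to analyze the page. There are two ways to do this:

Browser developer tools;
Packet capture tools;

The `cookiejar` library in `Python3`: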

In [22]: import urllib.parse, urllib.request

# Plain login request, without any cookie handling
In [23]: url2 = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=L768q"

In [24]: postdata2 = urllib.parse.urlencode({"username":"weisuen", "password":"aA123456"}).encode('utf-8')

In [25]: req2 = urllib.request.Request(url2, postdata2)

In [26]: req2.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')

In [27]: res2 = urllib.request.urlopen(req2).read()

In [28]: filename2 = "desktop/programming/python_work/cookie_test.html"

In [29]: with open(filename2, "wb") as f:
    ...:     f.write(res2)
    ...:

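Implementation steps:

Import the cookie-handling module http.cookiejar;
Create a CookieJar object with http.cookiejar.CookieJar();
Create a cookie processor with HTTPCookieProcessor, and build an opener object with it as the argument;
Install that opener as the global default opener.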

In [30]: import http.cookiejar, urllib.parse, urllib.request
# Simulate a browser and send the login data
In [31]: url = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=L768q"
In [32]: postdata = urllib.parse.urlencode({"username":"weisuen", "password":"aA123456"}).encode('utf-8')
In [33]: req = urllib.request.Request(url, postdata)
In [34]: req.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
# Cookie handling
In [35]: cjar = http.cookiejar.CookieJar()
In [36]: opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
In [37]: urllib.request.install_opener(opener)
In [38]: file = opener.open(req)
# Follow-up steps
In [39]: data = file.read()
In [40]: filename = "desktop/programming/python_work/cookie_test2.html"
In [41]: with open(filename, "wb") as f:
    ...:     f.write(data)
    ...:

# Visit another page; urlopen() now goes through the installed opener and sends the saved cookies
In [42]: url2 = "http://bbs.chinaunix.net/"
In [43]: data2 = urllib.request.urlopen(url2).read()
In [44]: with open("desktop/programming/python_work/cookie_test2_add.html", "wb") as f:
    ...:     f.write(data2)
    ...:
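What the cookie processor does with the jar can be sketched offline, without contacting a real site. In the sketch below, `FakeResponse` is a hypothetical stand-in for a server response, and the host `www.example.com` and the cookie `sessionid=abc123` are made up. Only `CookieJar.extract_cookies()` and `CookieJar.add_cookie_header()` are real library calls; these are the two steps the processor performs around each request.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for a response; CookieJar only calls info()."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            # Assigning the same header name again appends another header
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()

# Step 1: the "login" response sets a session cookie, which the jar stores
login_request = urllib.request.Request("http://www.example.com/login")
login_response = FakeResponse(["sessionid=abc123; Path=/"])
jar.extract_cookies(login_response, login_request)

# Step 2: a later request to the same site gets the cookie attached
next_request = urllib.request.Request("http://www.example.com/profile")
jar.add_cookie_header(next_request)

stored = {cookie.name: cookie.value for cookie in jar}
sent_cookie = next_request.get_header("Cookie")

print(stored)
print(sent_cookie)
```

Iterating over the jar in the same way after a real login is a quick check that the site actually set its session cookie.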
