Web Crawler Notes (3): Regular Expressions and Cookies

2017-12-13  WeirdoSu

Regular expression basics:

Atoms: the most basic building blocks of a regular expression:

Pattern modifiers (flags):
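
As a small illustration of pattern modifiers (called flags in Python's `re` module), the sketch below uses `re.I` to ignore case and `re.S` to let `.` match a newline. The sample text is made up for this note.

```python
import re

text = "Python\nCrawler"

# Without a modifier the match is case-sensitive.
no_flag = re.search(r"python", text)

# re.I (re.IGNORECASE) makes the match case-insensitive.
ignore_case = re.search(r"python", text, re.I)

# By default "." does not match a newline; re.S (re.DOTALL) lets it.
dot_default = re.search(r"Python.Crawler", text)
dot_all = re.search(r"Python.Crawler", text, re.S)

print(no_flag)              # None
print(ignore_case.group())  # Python
print(dot_default)          # None
print(repr(dot_all.group()))
```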

Greedy mode and lazy mode:
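
A minimal sketch of the difference, using a made-up HTML fragment: a greedy quantifier such as `.*` takes as much text as it can, while the lazy form `.*?` takes as little as it can.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: ".*" grabs as much as possible, so the match runs to the last </b>.
greedy = re.search(r"<b>.*</b>", html).group()

# Lazy: ".*?" grabs as little as possible, so the match stops at the first </b>.
lazy = re.search(r"<b>.*?</b>", html).group()

print(greedy)  # <b>first</b> and <b>second</b>
print(lazy)    # <b>first</b>
```

When scraping pages, the lazy form is usually the safer choice for pulling out one tag at a time.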

Common functions:
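
The sketch below runs a few commonly used functions from the standard `re` module on a made-up string. Which functions the original notes listed here is not known, so treat this selection as an assumption.

```python
import re

text = "abc123def456"

# re.match only matches at the start of the string.
at_start = re.match(r"\d+", text)  # None: the string starts with letters

# re.search scans the string and returns the first match anywhere.
first = re.search(r"\d+", text).group()

# A compiled pattern's findall returns every non-overlapping match.
all_numbers = re.compile(r"\d+").findall(text)

# re.sub replaces matches; the optional count limits how many.
masked = re.sub(r"\d+", "#", text)
masked_once = re.sub(r"\d+", "#", text, count=1)

print(at_start)     # None
print(first)        # 123
print(all_numbers)  # ['123', '456']
print(masked)       # abc#def#
print(masked_once)  # abc#def456
```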

Common examples:

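Matching URLs that end in .com or .cn:
In [1]: import re
# Alternation needs a group; [.com|.cn] would be a character class that matches a single character.
In [2]: pattern = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"
In [3]: string = "<a href='http://www.baidu.com'>百度首页</a>"
In [4]: result = re.search(pattern, string)
In [5]: print(result)
<_sre.SRE_Match object; span=(9, 29), match='http://www.baidu.com'>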
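
One pitfall worth checking in URL patterns is the difference between a character class and a group. Inside square brackets, `[.com|.cn]` means "one character out of `.`, `c`, `o`, `m`, `|`, `n`", not ".com or .cn"; real alternation needs a group such as `(?:\.com|\.cn)`. The sketch below compares the two on a made-up URL that ends in neither suffix.

```python
import re

class_pattern = r"[a-zA-Z]+://[^\s]*[.com|.cn]"      # character class
group_pattern = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"  # real alternation

url = "http://www.example.net"  # neither .com nor .cn

wrong = re.search(class_pattern, url)
right = re.search(group_pattern, url)

# The class version only needs the match to end in one of its characters,
# so it backtracks to the "n" of ".net" and reports a bogus match.
print(wrong.group())  # http://www.example.n
print(right)          # None
```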
Matching phone numbers:
In [6]: pattern2 = r"\d{4}-\d{7}|\d{3}-\d{8}"
In [7]: string2 = "021-23423432423423423423"
In [8]: result2 = re.search(pattern2, string2)
In [9]: print(result2)
<_sre.SRE_Match object; span=(0, 12), match='021-23423432'>
Matching email addresses:
In [10]: pattern3 = r"\w+([.+-]\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*"
In [11]: string3 = "mailto:c-e+o@iqi-anyue.com.cn"
In [12]: result3 = re.search(pattern3, string3)
In [13]: print(result3)
<_sre.SRE_Match object; span=(7, 29), match='c-e+o@iqi-anyue.com.cn'>

Cookie

What is a cookie

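Anything that involves logging in will need cookies.
HTTP is a stateless protocol and cannot keep state between requests, so information such as a successful login has to be saved by some other means: Cookie or Session.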
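
As a rough illustration of what a cookie looks like on the wire: the server sends a `Set-Cookie` response header, and the client returns the name/value pair in a `Cookie` request header on later requests. The sketch below parses a made-up `Set-Cookie` value with the standard-library `http.cookies.SimpleCookie`; the header value and the helper name `parse_set_cookie` are assumptions for this example.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Return (name -> value, name -> path) for a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    values = {name: morsel.value for name, morsel in cookie.items()}
    paths = {name: morsel["path"] for name, morsel in cookie.items()}
    return values, paths


values, paths = parse_set_cookie("sid=abc123; Path=/; HttpOnly")
print(values)  # {'sid': 'abc123'}
print(paths)   # {'sid': '/'}

# What the client would send back on the next request:
cookie_header = "; ".join("%s=%s" % item for item in values.items())
print(cookie_header)  # sid=abc123
```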

CookieJar in practice:

To find the real login URL you need to analyze the page; there are two ways to do this:

The cookiejar library in Python 3:
# Assumes urllib.request and urllib.parse were imported earlier in this session.
In [23]: url2 = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=L768q"

In [24]: postdata2 = urllib.parse.urlencode({"username":"weisuen", "password":"aA123456"}).encode('utf-8')

In [25]: req2 = urllib.request.Request(url2, postdata2)

In [26]: req2.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')

In [27]: res2 = urllib.request.urlopen(req2).read()

In [28]: filename2 = "desktop/programming/python_work/cookie_test.html"

In [29]: with open(filename2, "wb") as f:
    ...:     f.write(res2)
    ...:
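Approach:
  1. Import the cookie-handling module http.cookiejar;
  2. Create a CookieJar object with http.cookiejar.CookieJar();
  3. Create a cookie processor with HTTPCookieProcessor and pass it to build_opener to build an opener object;
  4. Install that opener as the global default opener.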
In [30]: import http.cookiejar
# Simulate a browser and send the login form data
In [31]: url = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=L768q"
In [32]: postdata = urllib.parse.urlencode({"username":"weisuen", "password":"aA123456"}).encode('utf-8')
In [33]: req = urllib.request.Request(url, postdata)
In [34]: req.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
# Cookie handling: the jar stores cookies, the opener sends them back
In [35]: cjar = http.cookiejar.CookieJar()
In [36]: opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
In [37]: urllib.request.install_opener(opener)
In [38]: file = opener.open(req)
# Follow-up steps: save the login response
In [39]: data = file.read()
In [40]: filename = "desktop/programming/python_work/cookie_test2.html"
In [41]: with open(filename, "wb") as f:
    ...:     f.write(data)
    ...:

# urlopen now goes through the installed opener, so the login cookies are sent with this request
In [42]: url2 = "http://bbs.chinaunix.net/"
In [43]: data2 = urllib.request.urlopen(url2).read()
In [44]: with open("desktop/programming/python_work/cookie_test2_add.html", "wb") as f:
    ...:     f.write(data2)
    ...:
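
To see why a later request can reuse the login state, it may help to look at the two `CookieJar` methods that `HTTPCookieProcessor` calls on the opener's behalf: `extract_cookies` stores cookies from a response, and `add_cookie_header` attaches them to the next request. The sketch below exercises both without touching the network. The `FakeResponse` class, the `www.example.com` URLs and the cookie value are all made up for this example; a real response object from `opener.open` plays the same role.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; only info() is needed."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()

# Step 1: the "login" response sets a session cookie; the jar stores it.
login_request = urllib.request.Request("http://www.example.com/login")
login_response = FakeResponse(["session=abc123; Path=/"])
jar.extract_cookies(login_response, login_request)

# Step 2: a later request to the same site gets the cookie attached.
next_request = urllib.request.Request("http://www.example.com/")
jar.add_cookie_header(next_request)

stored = [(cookie.name, cookie.value) for cookie in jar]
sent = next_request.get_header("Cookie")
print(stored)  # [('session', 'abc123')]
print(sent)    # session=abc123
```

Iterating over the jar in the same way after a real login is a quick check of whether the site actually issued any cookies.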