Web Scraping with urllib
2018-12-21
看三小
1. Import the urllib modules
import urllib.request
import urllib.parse
2. Set the target URL
url = 'https://www.baidu.com/'
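urllib.parse is imported in step 1, and one typical use for it is building a URL that carries query parameters. The sketch below is only an illustration: the /s path and the wd and pn parameter names are assumptions, not something these notes specify.

```python
import urllib.parse

base_url = 'https://www.baidu.com/s'  # assumed search path, for illustration only

# Hypothetical query parameters; urlencode escapes them for use in a URL.
params = {'wd': 'python urllib', 'pn': 10}
query = urllib.parse.urlencode(params)

search_url = base_url + '?' + query
print(search_url)
```

urlencode encodes spaces as + and joins the pairs with &, so the result can be passed directly to urllib.request.Request.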
3. Define the request header
header = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1 Safari/537.36',
}
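To check that the header is actually attached to the request, it can be read back from the Request object before anything is sent. This is a minimal sketch; the shortened User-Agent value is a placeholder, not the real browser string.

```python
import urllib.request

# Placeholder User-Agent, shortened for the example.
header = {'User-Agent': 'Mozilla/5.0 (example placeholder)'}

request = urllib.request.Request(url='https://www.baidu.com/', headers=header)

# Request stores header names via str.capitalize(), so the key becomes 'User-agent'.
stored_agent = request.get_header('User-agent')
print(stored_agent)
```

No network traffic happens here: building a Request only prepares it, and the request is sent when urlopen is called.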
4. Send the request and load the page (three steps)
request = urllib.request.Request(url=url, headers=header)
response = urllib.request.urlopen(request)
content = response.read().decode()
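The three steps can be wrapped in one function with basic error handling. The following is a sketch under assumptions: fetch_page, the 10-second timeout, and the UTF-8 fallback are illustrative choices, not part of the original notes.

```python
import urllib.error
import urllib.request


def fetch_page(url, headers, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    # Step 1: build the request with the custom headers.
    request = urllib.request.Request(url=url, headers=headers)
    try:
        # Step 2: send it; the with-block closes the connection afterwards.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Step 3: decode with the charset the server declares,
            # falling back to UTF-8 (an assumption) if none is given.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP error codes land here too.
        print(f'Request to {url} failed: {exc.reason}')
        return None
```

Calling fetch_page(url, header) with the values from steps 2 and 3 returns the page source as a string, or None when the request cannot be completed.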
5. Write the regular expression that extracts the content
1----Import the re module: import re
2----Write the extraction pattern
pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
+ r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
+ r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
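The pattern appears to target a ranked list page whose entries sit in <dd> blocks, not the Baidu home page from step 2. The sketch below shows how it could be applied with findall. The sample HTML, the parse_entries helper, and the dictionary keys are all made up for illustration.

```python
import re

# Lazy .*? skips over markup between the captured fields;
# re.S lets '.' match newlines so one entry can span several lines.
pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hypothetical page fragment shaped the way the pattern expects.
sample_html = """
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
  <p class="star">Starring: A, B</p>
  <p class="releasetime">Released: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""


def parse_entries(html):
    """Turn every <dd> match into a dictionary of its seven captured fields."""
    entries = []
    for rank, image, title, star, released, integer, fraction in pattern.findall(html):
        entries.append({
            'rank': rank,
            'image': image,
            'title': title.strip(),
            'star': star.strip(),
            'release_time': released.strip(),
            'score': integer + fraction,  # the score is split across two tags
        })
    return entries


entries = parse_entries(sample_html)
print(entries)
```

findall returns one tuple per <dd> block, with the seven groups in the order they appear in the pattern, so the same helper works on the content string fetched in step 4 if that page uses this markup.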