(2) Learning the requests Library
2018-12-03
费云帆
requests is a simple, easy-to-use HTTP library implemented in Python, and it is much more concise than urllib. For some advanced operations, urllib forces you to first create a handler object and then build an opener object around it; requests makes the same tasks far simpler.
This author explains the topic very well; my notes here follow that post, at:
https://www.cnblogs.com/mzc1997/p/7813801.html
1. Basic GET requests. requests.get() returns a Response object; once we have it, its attributes and methods give us everything we need:
import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(response.status_code)  # HTTP status code of the page
print(response.url)          # the URL that was requested
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the server
print(response.text)         # page source as decoded text (str)
print(response.content)      # page source as raw bytes
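One point the listing above glosses over: response.text is just response.content decoded with response.encoding (guessed from the headers), so a wrong guess produces mojibake. A minimal offline sketch, hand-building a Response through the internal _content field purely for illustration (real code gets the object from requests.get):

```python
import requests

# Hand-built Response so this runs without a network connection;
# requests.get() normally fills in these fields itself.
response = requests.models.Response()
response.status_code = 200
response._content = '<html>百度</html>'.encode('utf-8')

response.encoding = 'utf-8'    # correct charset: text decodes cleanly
ok = response.text
response.encoding = 'latin-1'  # wrong charset: same bytes, garbled text
bad = response.text

print(ok)   # <html>百度</html>
print(bad)  # mojibake
```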
2. The various request methods:
import requests
requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
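Each helper above is a thin wrapper that sends the corresponding HTTP verb. One way to see what would go over the wire, without opening a connection, is a sketch using requests' Request/PreparedRequest machinery:

```python
from requests import Request

# Preparing a Request shows the method and final URL each helper
# would send, without any network traffic.
for method in ('GET', 'POST', 'PUT', 'DELETE', 'HEAD', 'OPTIONS'):
    prepared = Request(method, 'http://httpbin.org/anything').prepare()
    print(prepared.method, prepared.url)
```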
3. The response to a GET request:
import requests
url='http://httpbin.org/get'
response=requests.get(url)
print(response.text)
{
"args": {}, #表示参数
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "117.28.251.74",
"url": "http://httpbin.org/get"
}
4. Passing GET parameters. There are two approaches: via the params keyword, or by embedding them directly in the URL:
import requests
url='http://httpbin.org/get'
data={
'name':'Jim Green',
'age':22
}
response=requests.get(url,params=data)
print(response.text)
{
"args": {
"age": "22",
"name": "Jim Green"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "117.28.251.74",
"url": "http://httpbin.org/get?name=Jim+Green&age=22"
}
Passing the parameters directly in the URL:
import requests
url='http://httpbin.org/get?name=Jim+Green&age=22'
response=requests.get(url)
print(response.text)
The result is identical.
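That the two forms are equivalent can be checked without sending anything: requests builds the final URL during request preparation, URL-encoding the params dict (note how the space in 'Jim Green' becomes '+'):

```python
from requests import Request

data = {'name': 'Jim Green', 'age': 22}
# The prepared URL is exactly what requests would put on the wire.
built = Request('GET', 'http://httpbin.org/get', params=data).prepare().url
manual = 'http://httpbin.org/get?name=Jim+Green&age=22'
print(built)
print(built == manual)  # True
```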
5. Parsing JSON:
import requests

url = 'http://httpbin.org/get'
response = requests.get(url)
print(response.text)
print(type(response.text))    # <class 'str'>
print(response.json())        # equivalent to json.loads(response.text)
print(type(response.json()))  # <class 'dict'>
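One caveat worth knowing: response.json() raises an exception (a ValueError subclass) when the body is not valid JSON, rather than returning None. A small offline sketch, again hand-building Responses through the internal _content field purely for illustration:

```python
import requests

good = requests.models.Response()
good._content = b'{"args": {}, "origin": "117.28.251.74"}'
data = good.json()   # parsed into a dict
print(type(data))    # <class 'dict'>

bad = requests.models.Response()
bad._content = b'<html>not json</html>'
try:
    bad.json()
except ValueError:   # also covers requests' own JSONDecodeError
    print('body was not JSON')
```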
6. Saving binary files (images, video, etc.):
import requests

url = 'http://github.com/favicon.ico'
response = requests.get(url)
print(response.text)     # mojibake: binary data decoded as text
print(response.content)  # the raw bytes
content = response.content
with open(r'E:\Bin\Python\picture.ico', 'wb') as file:
    file.write(content)
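Two details here deserve care: the file must be opened in 'wb' (binary) mode, and a Windows path literal should be a raw string so backslash escapes cannot corrupt it. A sketch with stand-in bytes instead of a live download (the icon bytes below are made up, and a temp directory stands in for the E: drive):

```python
import tempfile
from pathlib import Path

content = b'\x00\x01fake-icon-bytes'  # stand-in for response.content

# A raw string (r'E:\Bin\Python\picture.ico') or pathlib keeps Windows
# backslashes from being read as escape sequences like '\t' or '\n'.
target = Path(tempfile.gettempdir()) / 'picture.ico'

with open(target, 'wb') as file:  # 'wb': write raw bytes untouched
    file.write(content)

print(target.read_bytes() == content)  # True
```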
7. Adding request headers:
import requests
url='http://httpbin.org/get'
headers={}
headers['User-Agent']='Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'
response=requests.get(url,headers=headers)
print(response.text)
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36"
},
"origin": "117.28.251.74",
"url": "http://httpbin.org/get"
}
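The round trip to httpbin is one way to confirm the header was sent; another, entirely offline, is to inspect the prepared request before it goes out (the User-Agent string below is a made-up placeholder):

```python
from requests import Request

headers = {'User-Agent': 'my-spider/0.1'}  # hypothetical UA, for illustration
prepared = Request('GET', 'http://httpbin.org/get', headers=headers).prepare()
# The header is already attached to the outgoing request.
print(prepared.headers['User-Agent'])  # my-spider/0.1
```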
8. Adding proxy IPs (note: the example below actually fails):
import requests

url = 'http://httpbin.org/get'
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'
# The original used 'HTTP' twice as the key; duplicate dict keys silently
# collapse, and requests expects lowercase scheme names anyway.
proxies = {
    'http': 'http://95.66.157.74:34827',
    'https': 'http://93.191.14.103:43981',
}
response = requests.get(url, headers=headers, proxies=proxies)
print(response.text)
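Free proxies die quickly, which is the main reason the example fails; a secondary bug is the proxies dict itself. A defensive sketch (the proxy addresses are the long-dead ones from this post, kept only as placeholders):

```python
import requests

# Python silently collapses duplicate dict keys, so the original
# {'HTTP': ..., 'HTTP': ...} configured only ONE proxy.
broken = {'HTTP': '95.66.157.74:34827', 'HTTP': '93.191.14.103:43981'}
print(len(broken))  # 1

# requests wants lowercase scheme names, one entry per scheme.
proxies = {
    'http': 'http://95.66.157.74:34827',
    'https': 'http://93.191.14.103:43981',
}

def fetch(url, proxies, timeout=5):
    """Report a proxy failure instead of raising a traceback."""
    try:
        return requests.get(url, proxies=proxies, timeout=timeout).text
    except requests.exceptions.RequestException as exc:
        return 'proxy request failed: ' + exc.__class__.__name__
```

Calling fetch('http://httpbin.org/get', proxies) with these dead proxies returns the failure string once the timeout expires, instead of crashing the script.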
The following example is worth studying more closely when time allows:
import requests
import re

def get_html(url):
    # Lowercase scheme keys; the original repeated 'HTTP' twice,
    # which collapses to a single dict entry.
    proxy = {
        'http': 'http://95.66.157.74:34827',
        'https': 'http://93.191.14.103:43981',
    }
    heads = {}
    heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    req = requests.get(url, headers=heads, proxies=proxy)
    return req.text

def get_ipport(html):
    iplist = re.findall(r'<td data-title="IP">(.+)</td>', html)
    portlist = re.findall(r'<td data-title="PORT">(.+)</td>', html)
    typelist = re.findall(r'<td data-title="类型">(.+)</td>', html)
    # The three columns line up row by row, so pair them with zip
    # instead of the original triple-nested loops.
    sumray = [t + ',' + i + ':' + p for i, p, t in zip(iplist, portlist, typelist)]
    print('高匿代理')  # "elite (high-anonymity) proxies"
    print(sumray)

if __name__ == '__main__':
    url = 'http://www.kuaidaili.com/free/'
    get_ipport(get_html(url))
高匿代理
['HTTP,117.191.11.80:8118', 'HTTP,116.7.176.75:8118', 'HTTP,183.129.207.82:8118', 'HTTP,120.77.247.147:8118', 'HTTP,183.129.207.82:8118', 'HTTP,183.129.207.82:8118', 'HTTP,183.129.207.82:8118', 'HTTP,121.232.148.118:8118', 'HTTP,101.76.209.69:8118', 'HTTP,111.75.193.25:8118', 'HTTP,120.26.199.103:8118', 'HTTP,111.43.70.58:8118', 'HTTP,27.214.112.102:8118', 'HTTP,124.42.7.103:8118', 'HTTP,113.69.137.222:8118']
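The scraped 'TYPE,ip:port' strings still need to be turned into the mapping requests expects, and most free proxies are dead, so each one should be checked before use. A sketch (parse_entry and is_alive are helper names invented here; httpbin.org/ip serves as the liveness check):

```python
import requests

def parse_entry(entry):
    """Turn a scraped 'HTTP,117.191.11.80:8118' string into a proxies dict."""
    scheme, address = entry.split(',', 1)
    scheme = scheme.lower()
    return {scheme: scheme + '://' + address}

def is_alive(entry, timeout=5):
    """True if the proxy can complete a trivial request before the timeout."""
    try:
        requests.get('http://httpbin.org/ip',
                     proxies=parse_entry(entry), timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        return False

print(parse_entry('HTTP,117.191.11.80:8118'))
```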