(2) Learning the requests Library

2018-12-03  费云帆

requests is a simple, easy-to-use HTTP library implemented in Python, and it is much more concise than urllib. For some of the more advanced operations, urllib requires you to first create a handler object and then build an opener object around it, whereas requests handles the same tasks far more simply.

This blogger's write-up is excellent, and I used it as a reference while learning. Address:
https://www.cnblogs.com/mzc1997/p/7813801.html

1. Basic 'get' scraping: create a response object; once it is created, we can use its attributes and methods to get the information we need:

import requests

url='http://www.baidu.com'
response=requests.get(url)
print(response.status_code)   # HTTP status code
print(response.url)           # the requested URL
print(response.headers)       # response headers
print(response.cookies)       # cookies
print(response.text)          # page source as text (str)
print(response.content)       # page source as bytes
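
One thing to watch out for: response.text decodes the raw bytes using the encoding requests guesses from the response headers, which for some pages (www.baidu.com included) can come out garbled. A minimal sketch of overriding the encoding before reading .text (the 'utf-8' value here is an assumption about the target page):

import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(response.encoding)       # the encoding requests guessed from the headers
response.encoding = 'utf-8'    # assumed encoding of the target page; adjust if still garbled
print(response.text[:200])     # decoded with the encoding we just set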

2. Various request methods:

import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
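
To show one of these in action, here is a minimal sketch of a POST request carrying form data (the field names are just illustrative); httpbin echoes the submitted fields back under the "form" key:

import requests

data = {'name': 'Jim Green', 'age': 22}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)   # the submitted fields appear under "form" in the echoed JSON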

3. The response to a get request:

import requests

url='http://httpbin.org/get'
response=requests.get(url)
print(response.text)

{
  "args": {}, #表示参数
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "origin": "117.28.251.74", 
  "url": "http://httpbin.org/get"
}
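
Before working with the body it can be worth confirming that the request actually succeeded; a minimal sketch using raise_for_status(), which turns 4xx/5xx status codes into exceptions (the /status/404 endpoint is an httpbin helper that returns whatever status code you ask for):

import requests

response = requests.get('http://httpbin.org/status/404')
try:
    response.raise_for_status()   # raises HTTPError for 4xx/5xx, does nothing for 2xx
except requests.exceptions.HTTPError as e:
    print('request failed:', e)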

4. GET parameters. There are two ways to pass them: one is via the params argument, the other is to include them directly in the URL:

import requests

url='http://httpbin.org/get'
data={
    'name':'Jim Green',
    'age':22
}
response=requests.get(url,params=data)
print(response.text)

{
  "args": {
    "age": "22", 
    "name": "Jim Green"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }
}

Passing the parameters directly in the URL:

import requests

url='http://httpbin.org/get?name=Jim+Green&age=22'
response=requests.get(url)
print(response.text)

The result is the same.
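
One advantage of params is that requests URL-encodes the values for us (the space in 'Jim Green' becomes '+' in the query string). A minimal sketch that inspects the URL requests actually built:

import requests

data = {'name': 'Jim Green', 'age': 22}
response = requests.get('http://httpbin.org/get', params=data)
print(response.url)   # e.g. http://httpbin.org/get?name=Jim+Green&age=22
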
5. JSON parsing:

import requests

url='http://httpbin.org/get'
response=requests.get(url)
print(response.text)
print(type(response.text))    # str
print(response.json())        # equivalent to json.loads(response.text)
print(type(response.json()))  # dict
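
If the body is not valid JSON, response.json() raises an exception, so it can be worth guarding the call. A minimal sketch (httpbin's /html endpoint returns HTML rather than JSON, and JSONDecodeError is a subclass of ValueError):

import requests

response = requests.get('http://httpbin.org/html')   # returns HTML, not JSON
try:
    data = response.json()
except ValueError:
    # catches JSONDecodeError regardless of which requests version is installed
    data = None
    print('response body is not JSON')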

6. Saving binary files (images, videos, etc.):

import requests

url='http://github.com/favicon.ico'
response=requests.get(url)
print(response.text)      # garbled, because binary data is being decoded as text
print(response.content)   # the raw bytes
content=response.content
with open(r'E:\Bin\Python\picture.ico','wb') as file:
    file.write(content)
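
For larger files it is usually better not to hold the whole body in memory at once; a minimal sketch of a streamed download using stream=True and iter_content (the local filename is just illustrative):

import requests

url = 'http://github.com/favicon.ico'
# stream=True defers downloading the body until we iterate over it
with requests.get(url, stream=True) as response:
    with open('favicon.ico', 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)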

7. Adding request headers:

import requests

url='http://httpbin.org/get'
headers={}
headers['User-Agent']='Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'
response=requests.get(url,headers=headers)
print(response.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36"
  }, 
  "origin": "117.28.251.74", 
  "url": "http://httpbin.org/get"
}
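
If the same headers are needed across many requests, a Session object can hold them once (and it also reuses the underlying connection); a minimal sketch using the same User-Agent string as above:

import requests

session = requests.Session()
# headers set on the session are sent with every request made through it
session.headers['User-Agent'] = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'
response = session.get('http://httpbin.org/get')
print(response.text)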

8. Adding proxy IPs (note: the example below actually failed when I ran it):

import requests

url='http://httpbin.org/get'
headers={}
headers['User-Agent']='Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'
proxies={
    # proxies are keyed by URL scheme; duplicate dict keys would silently overwrite each other
    'http':'http://95.66.157.74:34827',
    'https':'http://93.191.14.103:43981'
}
response=requests.get(url,headers=headers,proxies=proxies)
print(response.text)
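
Free proxies like these die quickly, which is probably why the example failed. A minimal sketch of testing a proxy with a timeout and catching connection errors before relying on it (the address is the one from above and may well be dead):

import requests

proxy = {'http': 'http://95.66.157.74:34827'}   # likely dead by now
try:
    # httpbin.org/ip echoes the IP the request appears to come from,
    # so a working proxy should show the proxy's address, not ours
    response = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=5)
    print(response.text)
except requests.exceptions.RequestException as e:
    print('proxy failed:', e)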

The following example is worth studying more carefully when I have time:

import requests
import re

def get_html(url):
    proxy = {
        'http':'http://95.66.157.74:34827',
        'https':'http://93.191.14.103:43981'
    }
    heads = {}
    heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    req = requests.get(url, headers=heads,proxies=proxy)
    html = req.text
    return html

def get_ipport(html):
    regex = r'<td data-title="IP">(.+)</td>'
    iplist = re.findall(regex, html)
    regex2 = '<td data-title="PORT">(.+)</td>'
    portlist = re.findall(regex2, html)
    regex3 = r'<td data-title="类型">(.+)</td>'
    typelist = re.findall(regex3, html)
    sumray = []
    # the three lists are parallel (one entry per table row), so pair them up by position
    for i, p, t in zip(iplist, portlist, typelist):
        sumray.append(t + ',' + i + ':' + p)
    print('高匿代理')   # "high-anonymity proxies"
    print(sumray)


if __name__ == '__main__':
    url = 'http://www.kuaidaili.com/free/'
    get_ipport(get_html(url))

高匿代理
['HTTP,117.191.11.80:8118', 'HTTP,116.7.176.75:8118', 'HTTP,183.129.207.82:8118', 'HTTP,120.77.247.147:8118', 'HTTP,183.129.207.82:8118', 'HTTP,183.129.207.82:8118', 'HTTP,183.129.207.82:8118', 'HTTP,121.232.148.118:8118', 'HTTP,101.76.209.69:8118', 'HTTP,111.75.193.25:8118', 'HTTP,120.26.199.103:8118', 'HTTP,111.43.70.58:8118', 'HTTP,27.214.112.102:8118', 'HTTP,124.42.7.103:8118', 'HTTP,113.69.137.222:8118']
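
A natural next step (and probably what makes this worth studying) would be to check which of the scraped proxies actually work. A minimal sketch, assuming the 'TYPE,ip:port' format produced by sumray above and using httpbin.org/ip as the test endpoint (check_proxies is a hypothetical helper, not part of the original script):

import requests

def check_proxies(sumray):
    # keep only the entries ('TYPE,ip:port') that respond through httpbin within 5 seconds
    working = []
    for entry in sumray:
        scheme, address = entry.split(',', 1)
        proxy = {scheme.lower(): scheme.lower() + '://' + address}
        try:
            requests.get('http://httpbin.org/ip', proxies=proxy, timeout=5)
            working.append(entry)
        except requests.exceptions.RequestException:
            pass
    return working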