
python-requests

2017-10-08  芒果浩明

The Python requests library

1. Installing the requests library

  1. Install via pip. On Windows, open a command prompt and run pip install requests.
  2. Alternative installation method. Open http://www.lfd.uci.edu/~gohlke/pythonlibs, a site that hosts many third-party Python packages; press Ctrl+F and search for requests to find it quickly.
    1. After downloading the .whl file, rename it so its extension changes from .whl to .zip, then unzip it to obtain two folders. Copy the first folder into the Lib directory under the Python installation directory.
    2. Test the install. Open the console and type import requests; if no error appears, the installation succeeded, as in the sketch below.
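A quick check (a minimal sketch; your version string will differ — the POST example later in this article shows 2.13.0):

>>> import requests
>>> requests.__version__
'2.13.0'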

2. Testing the requests library

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> print(r.status_code)
200

3. The seven main methods of the requests library

Method               Description
requests.request()   Constructs a request; the base on which the six methods below are built
requests.get()       Fetches an HTML page; the main method, corresponding to HTTP GET
requests.head()      Fetches an HTML page's header information, corresponding to HTTP HEAD
requests.post()      Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put()       Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch()     Submits a partial-modification request, corresponding to HTTP PATCH
requests.delete()    Submits a deletion request, corresponding to HTTP DELETE
  1. The get() method
    r = requests.get(url)
    This constructs a Request object that asks the server for the resource at url; on success it returns a Response object containing the server's resource.

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Transfer-Encoding': 'chunked', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:37 GMT', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Date': 'Sun, 16 Apr 2017 12:56:05 GMT', 'Server': 'bfe/1.0.8.18', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip'}

Attributes of the Response object

Attribute            Description
r.status_code        HTTP status of the request: 200 means success, 404 (or another code) means failure
r.text               The response body as a string, i.e. the HTML of the page at url
r.encoding           The encoding guessed from the HTTP headers; if no charset is declared, requests falls back to ISO-8859-1, so non-ASCII text may display incorrectly
r.apparent_encoding  The encoding inferred from the content itself (a fallback)
r.content            The response body in binary form
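How encoding and apparent_encoding interact in practice (a sketch; the exact values depend on the page — Baidu's header above declares no charset, so requests falls back to ISO-8859-1):

>>> r = requests.get("http://www.baidu.com")
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = r.apparent_encoding    # now r.text renders the Chinese text correctly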
Exceptions in the requests library

Exception                  Description
requests.ConnectionError   Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError         HTTP error
requests.URLRequired       A valid URL is missing
requests.TooManyRedirects  The maximum number of redirects was exceeded
requests.ConnectTimeout    Connecting to the remote server timed out
requests.Timeout           The request timed out
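Catching these explicitly is safer than a bare except — a minimal sketch (the 5-second timeout is an arbitrary choice):

import requests

try:
    r = requests.get("http://www.baidu.com", timeout=5)
    r.raise_for_status()   # raises requests.HTTPError when the status code is not 200
except requests.Timeout:
    print("The request timed out")
except requests.ConnectionError:
    print("Network problem: DNS failure, refused connection, ...")
except requests.HTTPError as e:
    print("Bad HTTP status:", e)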

A generic code framework for requests crawlers


import requests

def getHTMLTEXT(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise requests.HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text
    except:
        return "An exception occurred"

How the HTTP protocol operates on resources

Resources are managed through URLs and request methods; every operation is independent and stateless, so the network channel and the server can be treated as a black box.

Understanding the difference between PATCH and PUT

Suppose a URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user has changed UserName and nothing else.
With PATCH you submit only the partial update for UserName; with PUT you must resubmit all 20 fields, because any field you omit is deleted.

The main advantage of PATCH: it saves network bandwidth (see the sketch below).
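To make the contrast concrete, a minimal sketch against httpbin.org (a stand-in server; the field names are illustrative):

import requests

# PATCH: submit only the field that changed
r = requests.patch('http://httpbin.org/patch', data={'UserName': 'newname'})

# PUT: submit the complete record, or the omitted fields are lost
full_record = {'UserID': '42', 'UserName': 'newname'}   # ...plus the other 18 fields
r = requests.put('http://httpbin.org/put', data=full_record)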

HTTP vs. the requests library

HTTP method   requests method     Functional consistency
GET           requests.get()      identical
HEAD          requests.head()     identical
POST          requests.post()     identical
PUT           requests.put()      identical
PATCH         requests.patch()    identical
DELETE        requests.delete()   identical

The requests library's head() method

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Access-Control-Allow-Credentials': 'true', 'Date': 'Fri, 21 Apr 2017 13:04:43 GMT', 'Connection': 'keep-alive', 'Content-Type': 'application/json', 'Via': '1.1 vegur', 'Content-Length': '266', 'Access-Control-Allow-Origin': '*', 'Server': 'gunicorn/19.7.1'}
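head() retrieves only the headers, not the body, so it is a cheap way to probe a resource. Continuing the session above:

>>> r.text
''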

The requests library's post() method

>>> payload = {'key1':'value1','key2':'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.13.0"
  }, 
  "json": null, 
  "origin": "223.74.34.119", 
  "url": "http://httpbin.org/post"
}
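When data is a string rather than a dict, httpbin echoes it under "data" instead of "form" — a short sketch (output abbreviated):

>>> r = requests.post('http://httpbin.org/post', data='ABC')
>>> print(r.text)
{
  ...
  "data": "ABC",
  "form": {},
  ...
}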

The requests library's put() method


>>> payload = {'key1':'value1','key2':'value2'}
>>> r = requests.put('http://httpbin.org/put', data=payload)
>>> print(r.text)

The requests library's request() method

requests.request(method, url, **kwargs)

Here method is the HTTP method ('GET', 'HEAD', 'POST', ...), url is the target link, and **kwargs are optional control parameters such as params, data, json, headers, and timeout.

params: a dict or byte sequence, appended to the url as a query string

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request('GET','http://python123.io/ws',params=kv)
>>> print(r.url)
http://python123.io/ws?key2=value2&key1=value1
>>> 

data: a dict, byte sequence, or file object, sent as the body of the request

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request('POST','http://python123.io/ws',data=kv)
>>> body = 'body content'
>>> r = requests.request('POST','http://python123.io/ws',data=body)

json: a dict in JSON format, sent as the body of the request

>>> kv = {'key':'value'}
>>> r = requests.request('POST','http://python123.io/ws',json=kv)
>>> print(r.url)
http://python123.io/ws

headers: a dict of custom HTTP headers

>>> hd = {'user-agent':'Chrome/10'}
>>> r = requests.request('POST','http://python123.io/ws',headers=hd)
>>> print(r.headers)
{'Content-Length': '9', 'Connection': 'keep-alive', 'Content-Type': 'text/plain; charset=utf-8', 'Access-Control-Max-Age': '1728000', 'Date': 'Fri, 21 Apr 2017 13:48:38 GMT', 'Access-Control-Allow-Methods': 'GET, POST, PUT, PATCH, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Content-Type, Content-Length, Authorization, Accept, X-Requested-With', 'Access-Control-Allow-Origin': '*'}

(Note that r.headers shows the response headers; to inspect the header actually sent, use r.request.headers.)
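Other frequently used control parameters include timeout and proxies — a sketch (the proxy address is a placeholder):

>>> r = requests.request('GET','http://www.baidu.com',timeout=10)
>>> pxs = {'http':'http://10.10.10.1:1234'}
>>> r = requests.request('GET','http://www.baidu.com',proxies=pxs)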

Crawler examples with the requests library

1. Fetching a JD product page

>>> import requests
>>> r = requests.get("http://item.jd.com/3434759.html")
>>> r.status_code
200
>>> r.encoding
'gbk'
>>> r.text[:1000]
'<!-- shouji -->\n<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />\n    <title>【锤子M1L】锤子 M1L(SM919)4GB+32GB 白色 全网通4G手机 双卡双待 全金属边框【行情 报价 价格 评测】-京东</title>\n    <meta name="keywords" content="smartisanM1L,锤子M1L,锤子M1L报价,smartisanM1L报价"/>\n    <meta name="description" content="【锤子M1L】京东JD.COM提供锤子M1L正品行货,并包括smartisanM1L网购指南,以及锤子M1L图片、M1L参数、M1L评论、M1L心得、M1L技巧等信息,网购锤子M1L上京东,放心又轻松" />\n    <meta name="format-detection" content="telephone=no">\n    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/3434759.html">\n    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/3434759.html">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n    <link rel="canonical" href="//item.jd.com/3434759.html"/>\n        <link rel="dns-prefetch" href="//misc.360buyimg.com"/>\n    <link rel="dns-prefetch" href="//static.360buyimg.com"/>\n    <link rel="dns-prefetch" href="//img10.360buyimg.c'

Complete code

import requests

url = "http://item.jd.com/3434759.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("Fetch failed")

2. Fetching an Amazon page

>>> import requests
>>> kv = {'user-agent':'Mozilla/5.0'}
>>> r = requests.get("https://www.amazon.cn/gp/product/B00MEY0VWW/ref=s9_acss_bw_cg_kin_1c1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=T5XCJNEEB6GW3DANWNW0&pf_rd_t=101&pf_rd_p=190844af-fd7e-4d63-b831-fbd5601cfa0d&pf_rd_i=116087071",headers=kv)
>>> r.status_code
200
>>> r.request.headers
{'Accept': '*/*', 'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'user-agent': 'Mozilla/5.0'}
>>> r.text[:100]
'<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><sc'
>>> r.text[1000:2000]
')}}}})(ue_csm);\n\n\n    var ue_err_chan = \'jserr-rw\';\n(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==n&&c!==p||a.ec++;c&&c!=k||a.ecf++;b.pageURL=""+(e.location?e.location.href:"");b.logLevel=c;b.attribution=f.attribution||b.attribution;a.erl.push({ex:f,info:b})}}function l(a,b,c,e,g){d.ueLogError({m:a,f:b,l:c,c:""+e,err:g,fromOnError:1,args:arguments},g?{attribution:g.attribution,logLevel:g.logLevel}:void 0);return!1}var k="FATAL",m="ERROR",n="WARN",p="DOWNGRADED",a={ec:0,ecf:0,\npec:0,ts:0,erl:[],ter:[],mxe:50,startTimer:function(){a.ts++;setInterval(function(){d.ue&&a.pec<a.ec&&d.uex("at");a.pec=a.ec},1E4)}};l.skipTrace=1;h.skipTrace=1;h.isStub=1;d.ueLogError=h;d.ue_err=a;e.onerror=l})(ue_csm,window);\n\n\nvar ue_id = \'SAS6G3Q9MD9R54H1QQJS\',\n    ue_url = \'/gp/uedata\',\n    ue_navtiming = 1,\n    ue_mid = \'AAHKV2X7AFYLW\',\n    ue_sid = \'459-2260863-3297312\',\n    ue_sn = \'www.amazon.cn\',\n    ue_furl = \'fls-cn.amazon.cn'
>>> 

Complete code

import requests

url = "https://www.amazon.cn/gp/product/B00MEY0VWW/ref=s9_acss_bw_cg_kin_1c1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=T5XCJNEEB6GW3DANWNW0&pf_rd_t=101&pf_rd_p=190844af-fd7e-4d63-b831-fbd5601cfa0d&pf_rd_i=116087071"
try:
    kv = {'user-agent': 'Mozilla/5.0'}   # pose as a browser; some sites reject the default python-requests agent
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("Fetch failed")

3. Submitting keywords to search engines

Baidu's keyword interface: http://www.baidu.com/s?wd=keyword

360's keyword interface: http://www.so.com/s?q=keyword

Baidu keyword submission:

>>> import requests
>>> kv = {'wd':'python'}
>>> r = requests.get("http://www.baidu.com/s",params=kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=python'
>>> len(r.text)
334380
>>> r.text[:1000]
'<!DOCTYPE html>\n<!--STATUS OK-->\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\n\n\n<html>\n\t<head>\n\t\t\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\t\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\n\t\t<meta content="always" name="referrer">\n        <meta name="theme-color" content="#2932e1">\n        <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />\n        <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg">\n        <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /> \n\t\t\n\t\t\n<title>python_百度搜索</title>\n\n\t\t\n\n\t\t\n<style data-for="result" type="text/css" id="css_newi_result">body{color:#333;background:#fff;padding:6px 0 0;margin:0;position:relative;min-width:900px}\nbody,th,td,.p1,.p2{font-family:arial}\np,form,ol,ul,li,dl,dt,dd,h3{margin:0;padding:0;list-style:none}\ninput{pad'
>>> 

Complete code

import requests

keyword = "python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Fetch failed")

Output:

http://www.baidu.com/s?wd=python
322584

360 keyword submission:

>>> import requests
>>> kv = {'q':'python'}
>>> r = requests.get("http://www.so.com/s",params=kv)
>>> r.status_code
200
>>> r.request.url
'https://www.so.com/s?q=python'
>>> len(r.text)
227744
>>> print(r.text[2000:3000])
r,a.g-a-noline:hover em,.g-a-noline a:hover em{text-decoration:underline}.g-ellipsis{overflow:hidden;text-overflow:ellipsis;white-space:nowrap}.g-f-yahei{font-family:arial,"WenQuanYi Micro Hei","Microsoft YaHei",SimHei}.g-shadow{box-shadow:0 1px 1px rgba(0,0 ,0,0.06)}.g-clearfix{zoom:1}.g-card{border:1px solid #e5e5e5;font-size:13px;_zoom:1}.g-btn{border:0;border-radius:1px;box-sizing:content-box;cursor:pointer;display:inline-block;outline:none;overflow:hidden;padding:0 10px;text-align:center;text-decoration:none;vertical-align:middle}.g-btn-icon{display:inline-block;_padding-top:7px}.g-btn-green{background:#19b955;border:1px solid #19b955;color:#fff;font-size:12px;height:24px;line-height:24px}input.g-btn,button.g-btn{line-height:20px;*padding:0 5px}.g-clearfix:after{clear:both;content:'';display:block;height:0;visibility:hidden}.g-card .g-card-foot{border-top:1px solid #e5e5e5;height:36px;line-height:36px;padding:0 10px}.g-card .g-card-foot-open,.g-card .g-card-foot-close{padding:0}.g
>>> 

Complete code:

import requests

keyword = "python"
try:
    kv = {'q': keyword}
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Fetch failed")

Output:

https://www.so.com/s?q=python
215119

4. Fetching and saving images from the web

Format of a web image link:
http://www.example.com/picture.jpg

A link to an image (here a Baidu image-search detail page; the address of the actual image file is carried in its objurl parameter):
http://image.baidu.com/search/detail?ct=503316480&z=undefined&tn=baiduimagedetail&ipn=d&word=%E7%88%B1%E5%A3%81%E7%BA%B8&step_word=&ie=utf-8&in=&cl=2&lm=-1&st=undefined&cs=3752912346,2452166601&os=3385445049,2037359231&simid=4144739372,697793985&pn=0&rn=1&di=83110628040&ln=1956&fr=&fmq=1492829415188_R&fm=&ic=undefined&s=undefined&se=&sme=&tab=0&width=undefined&height=undefined&face=undefined&is=0,0&istype=0&ist=&jit=&bdtype=0&spn=0&pi=0&gsm=0&hs=2&objurl=http%3A%2F%2Fh5.86.cc%2Fwalls%2F20150106%2F1440x900_b3cf5a29601634a.jpg&rpstart=0&rpnum=0&adpicid=0

>>> import requests
>>> path = "D:/abc.jpg"
>>> url = "http://image.baidu.com/search/detail?ct=503316480&z=undefined&tn=baiduimagedetail&ipn=d&word=%E7%88%B1%E5%A3%81%E7%BA%B8&step_word=&ie=utf-8&in=&cl=2&lm=-1&st=undefined&cs=3752912346,2452166601&os=3385445049,2037359231&simid=4144739372,697793985&pn=0&rn=1&di=83110628040&ln=1956&fr=&fmq=1492829415188_R&fm=&ic=undefined&s=undefined&se=&sme=&tab=0&width=undefined&height=undefined&face=undefined&is=0,0&istype=0&ist=&jit=&bdtype=0&spn=0&pi=0&gsm=0&hs=2&objurl=http%3A%2F%2Fh5.86.cc%2Fwalls%2F20150106%2F1440x900_b3cf5a29601634a.jpg&rpstart=0&rpnum=0&adpicid=0"
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, 'wb') as f:
        f.write(r.content)

118680
>>> 
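The other examples each end with a complete version, so here is one too — a sketch assuming a direct image URL (this one is the decoded objurl parameter from the detail page above); the file name is taken from the last segment of the URL:

import requests
import os

url = "http://h5.86.cc/walls/20150106/1440x900_b3cf5a29601634a.jpg"
root = "D:/pics/"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)          # create the target directory if needed
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)  # r.content is the binary body of the response
        print("File saved")
    else:
        print("File already exists")
except:
    print("Fetch failed")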

5. IP address geolocation lookup

The query interface:
http://m.ip138.com/ip.asp?ip=ipaddress

>>> import requests
>>> url = 'http://m.ip138.com/ip.asp?ip='
>>> r = requests.get(url+'116.7.245.184')
>>> r.status_code
200
>>> r.text[-500:]
'"submit" value="查询" class="form-btn" />\r\n\t\t\t\t\t</form>\r\n\t\t\t\t</div>\r\n\t\t\t\t<div class="query-hd">ip138.com IP查询(搜索IP地址的地理位置)</div>\r\n\t\t\t\t<h1 class="query">您查询的IP:116.7.245.184</h1><p class="result">本站主数据:广东省深圳市  电信</p><p class="result">参考数据一:广东省广州市 电信</p>\r\n\r\n\t\t\t</div>\r\n\t\t</div>\r\n\r\n\t\t<div class="footer">\r\n\t\t\t<a href="http://www.miitbeian.gov.cn/" rel="nofollow" target="_blank">沪ICP备10013467号-1</a>\r\n\t\t</div>\r\n\t</div>\r\n\r\n\t<script type="text/javascript" src="/script/common.js"></script></body>\r\n</html>\r\n'
>>> 

Complete code:

import requests

url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '116.7.245.184')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print("Fetch failed")