Python Web Scraping Basics - 1. Getting Started with the Requests Library

2019-06-01 波波在敲代码

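The "Python Web Scraping Basics" series is a set of notes taken while following the course taught by 昊天 of Beijing Institute of Technology (北理工).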

1. Requests and its installation

Install Requests, together with the few libraries it depends on, from the cmd command line using pip3:

pip3 install requests

2. Two simple examples

2.1 Fetching the Baidu home page

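### Fetching the Baidu home page
import requests # import the requests library
vText = requests.get("http://www.baidu.com") # request the given URL and keep the response
print("Status code: %i" % (vText.status_code)) # print the HTTP status code of the response
vText.encoding = "UTF-8" # decode the response body as UTF-8
print("First 400 characters of the response:", vText.text[:400]) # print the first 400 characters of the body

Status code: 200
First 400 characters of the response: <!DOCTYPE html>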
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div

A status code of 200 indicates that the request succeeded.
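To make the role of the status code concrete, here is a minimal offline sketch. It builds `requests.Response` objects by hand, which is an assumption made purely for illustration (real responses come from `requests.get()`), and the helper name `fDescribeStatus` is hypothetical.

```python
import requests

def fDescribeStatus(vCode):
    """Return "ok" or "error" for a hand-built response with the given status code."""
    vResp = requests.Response()   # built by hand for illustration; no request is sent
    vResp.status_code = vCode
    try:
        vResp.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
        return "ok"
    except requests.HTTPError:
        return "error"

for vCode in (200, 404, 500):
    print(vCode, fDescribeStatus(vCode))
```

Checking the code this way avoids comparing `status_code` against 200 by hand at every call site.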

Of the 7 methods (functions) the requests library provides, get is the most basic one: calling requests.get() is enough to retrieve the HTML of a given site.
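The notes do not list the seven functions, so the names below are an assumption about which ones are meant (the library also exposes `requests.options`, which this list omits). The sketch only checks that they exist and prepares a GET request without sending it.

```python
import requests

# Assumed list of the seven top-level functions the course refers to.
vNames = ["request", "get", "head", "post", "put", "patch", "delete"]
vAvailable = [vName for vName in vNames if callable(getattr(requests, vName, None))]
print(vAvailable)

# requests.get(url) is a thin wrapper around requests.request("GET", url).
# Preparing a request shows its method and final URL without sending anything.
vPrepared = requests.Request("GET", "http://www.baidu.com").prepare()
print(vPrepared.method, vPrepared.url)
```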

2.2 Fetching a JD product page

### Fetching a JD product page
import requests
vUrl = "https://item.jd.com/2138009.html"
vText = requests.get(vUrl)
vText.raise_for_status() # raises an exception if the status code signals an error (4xx or 5xx)
vText.encoding = vText.apparent_encoding # let requests guess the encoding from the page content
print(vText.text[:500])

<!DOCTYPE HTML>
<html lang="zh-CN">
<head>
    <!-- shouji -->
    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
    <title>【优越者Y-3098ABK】优越者(UNITEK)usb分线器3.0 带电源接口3.0高速4口HUB扩展0.3米 笔记本电脑一拖四多接口HUB集线器Y-3098ABK【行情 报价 价格 评测】-京东</title>
    <meta name="keywords" content="优越者Y-3098ABK,优越者Y-3098ABK,优越者Y-3098ABK报价,优越者Y-3098ABK报价"/>
    <meta name="description" content="【优越者Y-3098ABK】京东JD.COM提供优越者Y-3098ABK正品行货，并包括优越者Y-3098ABK网购指南，以及优越者Y-3098ABK图片、Y-3098ABK参数、Y-3098ABK评论、Y-3098ABK心得、
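The page above declares `charset=gbk`, which is why the encoding step matters. When a server's Content-Type header marks the body as text but names no charset, requests falls back to ISO-8859-1 for `.text`, while `apparent_encoding` guesses from the bytes themselves. The following sketch uses only the standard library and a made-up sample string to show what decoding the same bytes with the wrong and the right encoding looks like.

```python
# Bytes as a server might send them for a GBK page (the sample text is made up).
vRaw = "京东商品页面".encode("gbk")

vWrong = vRaw.decode("iso-8859-1")  # fallback encoding: each byte becomes one Latin-1 character
vRight = vRaw.decode("gbk")         # the encoding the page actually uses

print(repr(vWrong))  # unreadable characters
print(vRight)        # the original text
```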

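3. Request headers

### Inspecting the headers sent when fetching a JD product page
import requests
vUrl = "https://item.jd.com/2138009.html"
vText = requests.get(vUrl)
print(vText.request.headers) # the headers that requests actually sent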

{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

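The request.headers attribute of the response shows the identifying information that requests sent to the site. In the example above, the User-Agent reveals that the request was generated by python-requests. Some sites do not welcome crawlers and may restrict them, so the headers need to present the client as a browser.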

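### Fetching an Amazon product page
import requests
vUrl = "https://www.amazon.cn/dp/B01K78MSHW"
vKv = {"user-agent": "Mozilla/5.0"} # a dict of key-value pairs that supplies a browser-like identity
vText = requests.get(vUrl, headers = vKv) # pass the identity via the headers parameter
vText.raise_for_status()

The content returned by Amazon is very long, so the output is not shown here.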
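One way to confirm that the custom header takes effect without printing a long page is to prepare the request and inspect it before anything is sent. This is a sketch under the assumption that a `requests.Session` merges headers the same way `requests.get()` does; the names `vDefault` and `vCustom` are introduced here for illustration.

```python
import requests

vUrl = "https://www.amazon.cn/dp/B01K78MSHW"
vKv = {"user-agent": "Mozilla/5.0"}

vSession = requests.Session()
# prepare_request merges the session's default headers with ours; nothing is sent.
vDefault = vSession.prepare_request(requests.Request("GET", vUrl))
vCustom = vSession.prepare_request(requests.Request("GET", vUrl, headers=vKv))

print(vDefault.headers["User-Agent"])  # the library's own identifier
print(vCustom.headers["User-Agent"])   # the value supplied in vKv
```

Header names are matched case-insensitively, so the lowercase key `user-agent` replaces the default `User-Agent` entry.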

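4. Exception handling

In practice, errors occur because of a slow network or various other causes, and a good crawler framework should report them clearly. The code below uses try/except to tell the user whether the program ran normally: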

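### General-purpose page-fetching framework
import requests
def fGetHtmlText(vUrl):
    try:
        vKv = {"user-agent": "Mozilla/5.0"}
        vText = requests.get(vUrl, headers = vKv, timeout = 30)
        vText.raise_for_status() # raise HTTPError if the status code is 4xx or 5xx
        vText.encoding = vText.apparent_encoding # guess the page encoding from its content
        return vText.text
    except requests.RequestException: # network errors, timeouts, and error status codes
        return "The crawler raised an exception"
if __name__ == "__main__":
    vUrl = "http://1x1y.top/na"
    vText = fGetHtmlText(vUrl)
    print(vText[:1000])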

<!DOCTYPE html>
<html lang="zh">
<head>
  <link rel="shortcut icon" href=" /favicon.ico" /> 
  <meta charset="UTF-8">
  <meta name="keywords" content="导航">
  <meta name="description" content="这是专用于自己的导航站，收录了自己比较常登陆的网站。">
  <title>一❤一意·自用导航</title>
  <title>Hello World - </title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Arimo:400,700,400italic">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/themes/WebStack/css/fonts/linecons/css/linecons.css">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/themes/WebStack/css/font-awesome.min.css">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/themes/WebStack/css/bootstrap.css">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/themes/WebStack/css/xenon-core.css">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/themes/WebStack/css/xenon-components.css">
  <link rel="stylesheet" href="http://1x1y.top/na/usr/t
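The framework reports every failure with the same message. As a possible refinement, the sketch below distinguishes a few exception types that requests defines. The function name `fGetHtmlTextVerbose`, the `fGet` parameter, and the two fake fetchers are hypothetical and exist only so the example can run without network access.

```python
import requests

def fGetHtmlTextVerbose(vUrl, fGet=requests.get):
    """Fetch a page and report which kind of failure occurred, if any.

    fGet is a hypothetical injection point so the function can be exercised offline.
    """
    try:
        vResp = fGet(vUrl, headers={"user-agent": "Mozilla/5.0"}, timeout=30)
        vResp.raise_for_status()
        vResp.encoding = vResp.apparent_encoding
        return vResp.text
    except requests.Timeout:
        return "timeout"
    except requests.HTTPError:
        return "bad status"
    except requests.RequestException:
        return "other request error"

def fFakeTimeout(vUrl, **vKwargs):
    # Simulates a server that never answers.
    raise requests.Timeout("simulated")

def fFakeNotFound(vUrl, **vKwargs):
    # Simulates a 404 reply with a hand-built response.
    vResp = requests.Response()
    vResp.status_code = 404
    return vResp

print(fGetHtmlTextVerbose("http://example.invalid", fGet=fFakeTimeout))
print(fGetHtmlTextVerbose("http://example.invalid", fGet=fFakeNotFound))
```

The more specific exception classes are listed first because `Timeout` and `HTTPError` are both subclasses of `RequestException`.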