14 - General Crawler Module - Data Acquisition

Basic Crawler Concepts

1. HTTP and HTTPS
HTTP: HyperText Transfer Protocol, default port 80
HTTPS: HTTP + SSL (Secure Sockets Layer), default port 443
HTTPS is more secure than HTTP, but its performance is lower.

2. URL format
Format: scheme://host[:port]/path/.../[?query-string][#anchor]
scheme: protocol (e.g. http, https, ftp)
host: the server's IP address or domain name
port: the server's port number (may be omitted when the protocol's default port is used, e.g. 80 or 443)
path: path of the resource being accessed
query-string: parameters, the data sent to the HTTP server
anchor: anchor (jumps to the specified anchor position on the page)
Examples:
http://localhost:4000/filepart1/1.html
http://item.jd.com/119876.html/#product-detail
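
As a quick illustration (not part of the original notes), Python's standard urllib.parse can split a URL into exactly these components; the example URL below is hypothetical:

from urllib.parse import urlsplit

# hypothetical URL used only to show the pieces named above
parts = urlsplit("https://item.jd.com:443/119876.html?from=search#product-detail")
print(parts.scheme)    # https
print(parts.hostname)  # item.jd.com
print(parts.port)      # 443
print(parts.path)      # /119876.html
print(parts.query)     # from=search
print(parts.fragment)  # product-detail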

3. URL request forms

(figure: URL request forms)

4. Common HTTP request headers

5. Response status codes

6. Types of crawlers

7. What a crawler is
A web crawler (also known as a web spider or web robot) is a program that simulates a client sending network requests and receiving the responses, automatically scraping information from the internet according to certain rules.
In principle, anything a browser can do, a crawler can do as well.

8. More uses of crawlers

9. Limitations of general-purpose search engines

10. The robots protocol
Robots protocol: a website uses it to tell search engines which pages may be crawled and which may not. Example:
https://www.taobao.com/robots.txt

11. Where is the data on the page?

The requests library

1. Getting started with requests
[Documentation] http://docs.python-requests.org/zh_CN/latest/index.html
Differences between requests and urllib:
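
A minimal sketch (not in the original notes) of the difference: the same GET request written with the standard-library urllib and with requests.

from urllib.request import Request, urlopen

import requests

url = "http://www.baidu.com"
headers = {"User-Agent": "Mozilla/5.0"}

# urllib: build a Request object, read raw bytes, decode manually
req = Request(url, headers=headers)
html1 = urlopen(req).read().decode("utf-8")  # assumes the page is utf-8

# requests: one call; headers are a plain dict and .text decodes for you
html2 = requests.get(url, headers=headers).text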

2. Handling encoding and decoding in requests

3. The difference between response.text and response.content
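
A short sketch (not in the original notes): response.content is the raw body as bytes, while response.text is those bytes decoded with the encoding requests guessed; if the guess is wrong, decode the bytes yourself.

import requests

response = requests.get("http://www.baidu.com")

print(type(response.content))  # <class 'bytes'>, the raw body
print(type(response.text))     # <class 'str'>, decoded with the guessed response.encoding

# if the guessed encoding is wrong, decode the bytes explicitly
html = response.content.decode("utf-8")  # or "gbk", depending on the page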

4. Saving an image

import requests

r = requests.get("https://www.baidu.com/img/bd_logo1.png")

with open("a.png", "wb") as f:
    f.write(r.content)

5. Sending a simple request

response = requests.get("http://www.baidu.com") 

Common attributes of the response object:

response.text                # body decoded to str, using the encoding requests guessed
response.content             # raw body as bytes
response.status_code         # HTTP status code
response.request.headers     # headers of the request that was actually sent
response.headers             # response headers

6. Sending a request with headers
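
A minimal sketch (not in the original notes): pass the request headers as a dict via the headers= keyword; the User-Agent string is just an example value.

import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}

response = requests.get("http://www.baidu.com", headers=headers)
print(response.request.headers)  # the headers that were actually sent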

7. Sending a request with parameters
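
Another minimal sketch (not in the original notes): URL parameters can be passed as a dict via params=, and requests encodes them into the query string; the search URL and parameter name below are only illustrative.

import requests

headers = {"User-Agent": "Mozilla/5.0"}
params = {"wd": "python"}  # hypothetical query parameter for Baidu search

response = requests.get("http://www.baidu.com/s", headers=headers, params=params)
print(response.request.url)  # e.g. http://www.baidu.com/s?wd=python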

8. Tieba crawler

import requests


class TiebaSpider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        self.url_temp = "https://tieba.baidu.com/f?kw="+tieba_name+"&ie=utf-8&pn={}"
        self.headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}

    def get_url_list(self):  # build the list of page URLs
        # url_list = []
        # for i in range(1000):
        #     url_list.append(self.url_temp.format(i * 50))
        # return url_list
        return [self.url_temp.format(i*50) for i in range(1000)]

    def parse_url(self, url):  # send the request and get the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html_str, page_num):  # save the HTML string
        file_path = "{}-page{}.html".format(self.tieba_name, page_num)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(html_str)

    def run(self):  # main logic
        # 1. build the URL list
        url_list = self.get_url_list()
        # 2. iterate: send each request and get the response
        for page_num, url in enumerate(url_list, start=1):
            html_str = self.parse_url(url)
            # 3. save, using the page number in the file name
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()

9. Sending a POST request

Baidu Translate crawler

import requests
import json
import sys

query_string = sys.argv[1]

headers = {"User-Agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"}

post_data = {
    "query": query_string,
    "from": "zh",
    "to": "en",
}

post_url = "http://fanyi.baidu.com/basetrans"

r = requests.post(post_url, data=post_data, headers=headers)

dict_ret = json.loads(r.content.decode())

ret = dict_ret["trans"][0]["dst"]

print("result is:", ret)

10. Using proxies

proxies = {
    "http": "http://12.34.56.78:9527",
    "https": "https://12.34.56.78:9527",
}
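
A brief sketch (not in the original notes) of how such a dict is used: pass it via the proxies= keyword; the proxy address above is only a placeholder.

import requests

proxies = {
    "http": "http://12.34.56.78:9527",   # placeholder proxy address
    "https": "https://12.34.56.78:9527",
}

response = requests.get("http://www.baidu.com", proxies=proxies, timeout=5)
print(response.status_code)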

11. Notes on using proxy IPs

12. The difference between cookies and sessions

13. Handling cookies and sessions in a crawler

When cookies are not needed, avoid sending them.
But to fetch pages that require login, we have to send requests that carry cookies.

14. Handling cookie and session requests
requests provides a Session class to keep a session between the client and the server.
Usage:
1. Instantiate a session object
2. Use the session to send GET or POST requests
session = requests.session()
response = session.get(url, headers=headers)

15. Approach for requesting pages behind a login

16. demo

Method 1: instantiate a session, use it to send the login POST request, then use the same session to fetch pages behind the login.

import requests

session = requests.session()
post_url = "http://192.168.0.223:8080/redmine/login"
post_data = {"username": "zyx", "password": "30********35"}
headers = {"User-Agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"}
# send the POST request with the session; the login cookies are stored in it
session.post(post_url, data=post_data, headers=headers)
# then use the same session to request a page that requires login
r = session.get("http://192.168.0.223:8080/redmine/issues?utf8=%E2%9C%93&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=%21&v%5Bstatus_id%5D%5B%5D=3&f%5B%5D=assigned_to_id&op%5Bassigned_to_id%5D=%3D&v%5Bassigned_to_id%5D%5B%5D=me&f%5B%5D=created_on&op%5Bcreated_on%5D=%3E%3D&v%5Bcreated_on%5D%5B%5D=2019-03-01&f%5B%5D=&c%5B%5D=project&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=parent&c%5B%5D=start_date&c%5B%5D=due_date&c%5B%5D=done_ratio&group_by=")

# save the page
with open("renren.html", "w", encoding="utf-8") as f:
    f.write(r.content.decode())

Method 2: add a Cookie key to the headers, with the cookie string as its value.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    "Cookie": "_redmine_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTFlMzNjNjdjNWUyOGNkNDY5MjM2M2VmYWE4NGM2MTk2BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMUM4N3M1bEJLT204dTlQWFkxcUw2bmRBUy9aSXVNTUxPelNaQTU2TTlreHc9BjsARg%3D%3D--025ba44479499b96abfbfdd8bc77ead00d4f0a4c",
}

# use requests to fetch a page that requires login, carrying the cookie in the headers
r = requests.get("http://192.168.0.223:8080/redmine/issues?utf8=%E2%9C%93&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=%21&v%5Bstatus_id%5D%5B%5D=3&f%5B%5D=assigned_to_id&op%5Bassigned_to_id%5D=%3D&v%5Bassigned_to_id%5D%5B%5D=me&f%5B%5D=created_on&op%5Bcreated_on%5D=%3E%3D&v%5Bcreated_on%5D%5B%5D=2019-03-01&f%5B%5D=&c%5B%5D=project&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=parent&c%5B%5D=start_date&c%5B%5D=due_date&c%5B%5D=done_ratio&group_by=")

# save the page
with open("renren2.html", "w", encoding="utf-8") as f:
    f.write(r.content.decode())

Method 3: pass a cookies parameter to the request method. It takes the cookies as a dict, where each key is a cookie's name and each value is that cookie's value.
(The conversion below uses a dict comprehension; list comprehensions work the same way.)

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
}
cookies = "_redmine_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTFlMzNjNjdjNWUyOGNkNDY5MjM2M2VmYWE4NGM2MTk2BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMUM4N3M1bEJLT204dTlQWFkxcUw2bmRBUy9aSXVNTUxPelNaQTU2TTlreHc9BjsARg%3D%3D--025ba44479499b96abfbfdd8bc77ead00d4f0a4c"
cookies = {i.split("=")[0]:i.split("=")[1] for i in cookies.split("; ")}
print(cookies)
# use requests to fetch a page that requires login, passing the cookies dict via cookies=
r = requests.get("http://192.168.0.223:8080/redmine/issues?utf8=%E2%9C%93&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=%21&v%5Bstatus_id%5D%5B%5D=3&f%5B%5D=assigned_to_id&op%5Bassigned_to_id%5D=%3D&v%5Bassigned_to_id%5D%5B%5D=me&f%5B%5D=created_on&op%5Bcreated_on%5D=%3E%3D&v%5Bcreated_on%5D%5B%5D=2019-03-01&f%5B%5D=&c%5B%5D=project&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=parent&c%5B%5D=start_date&c%5B%5D=due_date&c%5B%5D=done_ratio&group_by=",cookies=cookies)

# save the page
with open("renren3.html", "w", encoding="utf-8") as f:
    f.write(r.content.decode())

17. Getting a logged-in page with cookies, without sending a POST request

Analyzing POST and JSON with Chrome
  1. Finding the login POST address
  2. Locating the JS you need
  3. Installing third-party modules
  4. requests tips
import requests

response = requests.get("http://www.baidu.com")

print(response.cookies)
# convert the CookieJar object to a dict
cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)
print(cookie_dict)

url = "http://tieba.baidu.com/f/index/forumpark?cn=%E9%A6%99%E6%B8%AF%E7%94%B5%E5%BD%B1&ci=0&pcn=%E7%94%B5%E5%BD%B1&pci=0&ct=1&rn=20&pn=1"
# URL-encode the address
url = requests.utils.quote(url)
print("encoded:", url)
# URL-decode the address
url = requests.utils.unquote(url)
print("decoded:", url)

# skip SSL certificate verification
response = requests.get("http://eolinker.guxiansheng.cn/#/index", verify=False)
print(response)

# set a timeout (in seconds)
response = requests.get(url, timeout=10)

# use the status code to assert that the request succeeded
assert response.status_code == 200
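
Putting several of these tips together, the helper below uses the retrying module's @retry decorator so that a failed or non-200 request is retried up to three times before giving up:
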
import requests
from retrying import retry


headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}


@retry(stop_max_attempt_number=3)  # retry up to 3 times before giving up
def _parse_url(url, method, data, proxies):
    print("*" * 20)
    if method == "POST":
        response = requests.post(url, data=data, headers=headers, proxies=proxies)
    else:
        response = requests.get(url, headers=headers, timeout=3, proxies=proxies)
    assert response.status_code == 200
    return response.content.decode()


def parse_url(url, method="GET", data=None, proxies={}):
    try:
        html_str = _parse_url(url, method, data, proxies)
    except:
        html_str = None

    return html_str


if __name__ == '__main__':
    url = "www.baidu.com"
    print(parse_url(url))