Python (63): Page Analysis, HTTP Fundamentals, and Responses

2022-02-22  Lonelyroots

Starting today I'll walk you through web scraping from every angle. If you're interested, keep an eye out for my posts!

HTTP and HTTPS: the HyperText Transfer Protocol (HTTPS adds a TLS encryption layer)
HTTPS is more secure than HTTP.
When we open Baidu in a browser, the address is usually https://www.baidu.com
Typing https://www.baidu.com actually requests https://www.baidu.com:443; the browser fills in the protocol's default port for us.
Default ports:
http: 80
https: 443

An example:
http://gy123456.cn:8088/admin/
scheme://hostname:port/path
scheme (protocol): http or https
hostname: the domain name
port: the port number
path: the route (the path on the server)
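
As a quick check of this anatomy, Python's standard-library urllib.parse splits a URL into exactly these pieces (a minimal sketch, not part of the original lesson files):

from urllib.parse import urlparse

parts = urlparse("http://gy123456.cn:8088/admin/")
print(parts.scheme)      # http
print(parts.hostname)    # gy123456.cn
print(parts.port)        # 8088
print(parts.path)        # /admin/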

页面分析、HTTP原理和响应/01_requests之请求.py:

import requests

"""
    请求库:
        它本身存在的意义是做网络测试的,但是之后被发展成为了一个爬虫工具之一
"""

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
# response = requests.get(url="http://httpbin.org/get")      # GET request
# response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123")      # GET parameters, method 1: appended to the URL
response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123",params={'kw':'百度'},headers=headers)      # GET parameters, method 2: via params
# params is typically used with GET requests; requests merges it into the URL: http://httpbin.org/get?id=Lonelyroots&id2=123&kw=百度
# output: the kw parameter shows as \u767e\u5ea6, the Unicode escape for 百度

# response = requests.post(url="http://httpbin.org/post?id=Lonelyroots&id2=123")     # POST request; parameters are normally not passed with ?
# response = requests.post(url="http://httpbin.org/post",data={'kw':'百度'})     # POST request: pass parameters through data instead of ?
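# A hedged aside (not in the original lesson): to send a JSON body instead of
# form data, requests also accepts a json= argument, which serializes the dict
# and sets Content-Type: application/json automatically.
# response = requests.post(url="http://httpbin.org/post", json={'kw': '百度'})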

# headers={
#     "content-type": "application/x-www-form-urlencoded"
# }
# response = requests.put(url="http://httpbin.org/put",headers=headers)      # PUT request: used to update a resource

# response = requests.delete(url="http://httpbin.org/delete")        # DELETE request: usually used to delete a resource

# response = requests.head(url="http://httpbin.org/get")     # HEAD request: returns the response headers only, no body (httpbin has no /head route, so HEAD an existing endpoint such as /get)
# print(response.headers)

# response = requests.patch("http://httpbin.org/patch")       # PATCH request (submits a partial modification), e.g. an API that sets whether an email is publicly visible or private

# print(response)
print(response.text)
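
httpbin echoes each request back as JSON, so the reply can also be parsed directly instead of printed as text (a small sketch, assuming the GET request above is the active one):

data = response.json()                    # parse the JSON body into a dict
print(data['args'])                       # the merged query parameters: id, id2, kw
print(data['headers']['User-Agent'])      # the UA string we sent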

页面分析、HTTP原理和响应/02_requests子响应.py:

import requests

url = "https://www.baidu.com/"
# Anti-scraping measure: the site inspects the UA and refuses to serve content when it detects a scraper, so we disguise ours
# The default UA of requests is "python-requests/2.27.1"

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
response = requests.get(url=url,headers=headers,allow_redirects=False)      # allow_redirects controls whether redirects are followed; with False, response.url is the pre-redirect route
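# A hedged check (it is an assumption that Baidu redirects this request): with
# redirects disabled, a 3xx status comes back and the target URL sits in the
# Location response header.
# print(response.status_code, response.headers.get('Location'))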

# print(response.text)        # the body as a string, for reading text    (common)
# print(response.content)       # the body as bytes, for images, video, audio   (common)
# print(type(response.content))       # <class 'bytes'>
# print(response.status_code)     # the status code
# print(response.headers)     # the response headers
# print(response.request)     # the prepared request object
# print(response.request.headers)     # the request headers
# print(response.request.url)     # the URL that was requested
# print(response.url)         # the URL of the response, e.g. the new route after a redirect  (common)

print(response.cookies)
# outputs:
# <RequestsCookieJar[<Cookie BAIDUID=5E91A0E2ABA4BCC3B6D3F4D7B2EA1E66:FG=1 for .baidu.com/>]>

# print(response.cookies['BAIDUID'])      # cookies can be indexed by key, e.g. AAB9E16204105B5ED6DB6123E67832BD:FG=1

print(requests.utils.dict_from_cookiejar(response.cookies))
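
dict_from_cookiejar flattens the cookie jar into a plain dict; requests.utils also provides the reverse conversion (a small sketch using the same response):

cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)   # jar -> dict
jar = requests.utils.cookiejar_from_dict(cookie_dict)                # dict -> jar
print(cookie_dict)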

02_页面分析、HTTP原理和响应/03_新浪.py:

import requests

url = "https://www.sina.com.cn/"
headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}

response = requests.get(url=url,headers=headers)

# print(response.text)
# print(response.encoding)        # ISO-8859-1: guessed from the response headers
# print(response.apparent_encoding)       # utf-8: detected from the body content

# Fix the garbled text: re-decode with the detected encoding
response.encoding = response.apparent_encoding
print(response.text)
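
An equivalent fix (an alternative to the above, not from the original code) is to decode the raw bytes explicitly:

html = response.content.decode(response.apparent_encoding)   # bytes -> str with the detected encoding
print(html)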

02_页面分析、HTTP原理和响应/04_百度关键字.py:

import requests

# Ajax refreshes part of a page without reloading the whole thing, e.g. keyword suggestions that appear before a search is even submitted
# Capture the URL fired for the pycharm keyword in the browser's network panel

# When the route is found by capturing a keyword request like this, disguising the UA is not required

url = "https://www.baidu.com/sugrec?pre=1&p=3&ie=utf-8&json=1&prod=pc&from=pc_web&sugsid=35104,35488,34584,35490," \
      "35872,35949,35955,35316,26350,35941&wd=requests&req=2&csor=7&cb=jQuery110205986892274942364_1645516722140&_" \
      "=1645516722141 "

response = requests.get(url=url)

print(response.text)
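
The long URL above carries a jQuery JSONP callback (cb=...), so the body comes back wrapped in a function call. A tidier variant (a sketch; which parameters the endpoint actually requires is an assumption) passes the query via params and drops the callback, which appears to make the endpoint return plain JSON:

params = {
    'ie': 'utf-8',
    'json': 1,
    'prod': 'pc',
    'wd': 'requests',            # the keyword being completed
}
resp = requests.get("https://www.baidu.com/sugrec", params=params)
print(resp.json())               # the suggestion list, without the jQuery wrapper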

02_页面分析、HTTP原理和响应/05_百度贴吧.py:

import requests
import os

"""
"https://tieba.baidu.com/f?kw=宝马&pn=0"      # 这里的kw,如果遇到乱码,可以去使用url的Unicode解码
"https://tieba.baidu.com/f?kw=宝马&pn=50"
"https://tieba.baidu.com/f?kw=宝马&pn=100"
"""
kw = input("Enter a keyword:\t")

os_path = os.getcwd()+'/html/'+kw      # os.getcwd(): the current working directory, here F:\learning_records\U1 2021.9.26\Python\06、爬虫开发\02_页面分析、HTTP原理和响应

# Create the folder (and any missing parent folders) if it does not exist
if not os.path.exists(os_path):
    os.makedirs(os_path)

# Fetch the first 3 pages (pn = 0, 50, 100)
for page in range(0,101,50):
    url = f'https://tieba.baidu.com/f?kw={kw}&pn={page}'
    response = requests.get(url=url)
    with open(f'html/{kw}/{kw}_{int(page/50)}.html','w',encoding='utf-8') as f:
        f.write(response.text)
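
By the way, requests percent-encodes non-ASCII query values automatically, so the keyword can also be passed via params instead of being spliced into the URL by hand (a small sketch of the same request):

response = requests.get("https://tieba.baidu.com/f", params={'kw': kw, 'pn': 0})
print(response.url)      # the kw value appears percent-encoded in the final URL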

That's it for this article! I hope you'll keep supporting this Python series! Six months is enough to learn Python with me; message me privately if you have questions about this article. I'll publish new articles every day, so follow along if you like them! A fellow young learner keeping you company while you study Python. No matter how busy things get, the updates will keep coming. Let's keep at it together!

Editor: Lonelyroots
