Python (63): Page Analysis, HTTP Fundamentals, and Responses

2022-02-22  Lonelyroots

Starting today I'll walk you through web scraping from every angle. If you're interested, keep an eye out for my posts!

HTTP and HTTPS: the HyperText Transfer Protocol (HTTPS adds a TLS encryption layer)
HTTPS is more secure than HTTP.
When we open Baidu in a browser, the address is usually https://www.baidu.com
Typing https://www.baidu.com actually requests https://www.baidu.com:443; the browser fills in the protocol's default port for us.
Default ports:
http: 80
https: 443

An example:
http://gy123456.cn:8088/admin/
scheme://hostname:port/path
scheme (protocol): http or https
hostname: the domain name
port: the port number
path: the route (the path on the server)
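
As a quick check of this anatomy, Python's standard-library urllib.parse splits a URL into exactly these pieces (a minimal sketch, not part of the original lesson files):

from urllib.parse import urlparse

parts = urlparse("http://gy123456.cn:8088/admin/")
print(parts.scheme)      # http
print(parts.hostname)    # gy123456.cn
print(parts.port)        # 8088
print(parts.path)        # /admin/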

页面分析、HTTP原理和响应/01_requests之请求.py:

import requests

"""
    请求库:
        它本身存在的意义是做网络测试的,但是之后被发展成为了一个爬虫工具之一
"""

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
# response = requests.get(url="http://httpbin.org/get")      # GET request
# response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123")      # GET parameters, method 1: appended to the URL
response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123",params={'kw':'百度'},headers=headers)      # GET parameters, method 2: via params
# params is typically used with GET requests; requests merges it into the URL: http://httpbin.org/get?id=Lonelyroots&id2=123&kw=百度
# output: the kw parameter shows as \u767e\u5ea6, the Unicode escape for 百度

# response = requests.post(url="http://httpbin.org/post?id=Lonelyroots&id2=123")     # POST request; parameters are normally not passed with ?
# response = requests.post(url="http://httpbin.org/post",data={'kw':'百度'})     # POST request: pass parameters through data instead of ?
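# A hedged aside (not in the original lesson): to send a JSON body instead of
# form data, requests also accepts a json= argument, which serializes the dict
# and sets Content-Type: application/json automatically.
# response = requests.post(url="http://httpbin.org/post", json={'kw': '百度'})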

# headers={
#     "content-type": "application/x-www-form-urlencoded"
# }
# response = requests.put(url="http://httpbin.org/put",headers=headers)      # PUT request: used to update a resource

# response = requests.delete(url="http://httpbin.org/delete")        # DELETE request: usually used to delete a resource

# response = requests.head(url="http://httpbin.org/get")     # HEAD request: returns the response headers only, no body (httpbin has no /head route, so HEAD an existing endpoint such as /get)
# print(response.headers)

# response = requests.patch("http://httpbin.org/patch")       # PATCH request (submits a partial modification), e.g. an API that sets whether an email is publicly visible or private

# print(response)
print(response.text)
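
httpbin echoes each request back as JSON, so the reply can also be parsed directly instead of printed as text (a small sketch, assuming the GET request above is the active one):

data = response.json()                    # parse the JSON body into a dict
print(data['args'])                       # the merged query parameters: id, id2, kw
print(data['headers']['User-Agent'])      # the UA string we sent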

页面分析、HTTP原理和响应/02_requests子响应.py:

import requests

url = "https://www.baidu.com/"
# Anti-scraping measure: the site inspects the UA and refuses to serve content when it detects a scraper, so we disguise ours
# The default UA of requests is "python-requests/2.27.1"

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
response = requests.get(url=url,headers=headers,allow_redirects=False)      # allow_redirects controls whether redirects are followed; with False, response.url is the pre-redirect route
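# A hedged check (it is an assumption that Baidu redirects this request): with
# redirects disabled, a 3xx status comes back and the target URL sits in the
# Location response header.
# print(response.status_code, response.headers.get('Location'))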

# print(response.text)        # the body as a string, for reading text    (common)
# print(response.content)       # the body as bytes, for images, video, audio   (common)
# print(type(response.content))       # <class 'bytes'>
# print(response.status_code)     # the status code
# print(response.headers)     # the response headers
# print(response.request)     # the prepared request object
# print(response.request.headers)     # the request headers
# print(response.request.url)     # the URL that was requested
# print(response.url)         # the URL of the response, e.g. the new route after a redirect  (common)

print(response.cookies)
# outputs:
# <RequestsCookieJar[<Cookie BAIDUID=5E91A0E2ABA4BCC3B6D3F4D7B2EA1E66:FG=1 for .baidu.com/>]>

# print(response.cookies['BAIDUID'])      # cookies can be indexed by key, e.g. AAB9E16204105B5ED6DB6123E67832BD:FG=1

print(requests.utils.dict_from_cookiejar(response.cookies))
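
dict_from_cookiejar flattens the cookie jar into a plain dict; requests.utils also provides the reverse conversion (a small sketch using the same response):

cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)   # jar -> dict
jar = requests.utils.cookiejar_from_dict(cookie_dict)                # dict -> jar
print(cookie_dict)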

02_页面分析、HTTP原理和响应/03_新浪.py:

import requests

url = "https://www.sina.com.cn/"
headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}

response = requests.get(url=url,headers=headers)

# print(response.text)
# print(response.encoding)        # ISO-8859-1: guessed from the response headers
# print(response.apparent_encoding)       # utf-8: detected from the body content

# Fix the garbled text: re-decode with the detected encoding
response.encoding = response.apparent_encoding
print(response.text)
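
An equivalent fix (an alternative to the above, not from the original code) is to decode the raw bytes explicitly:

html = response.content.decode(response.apparent_encoding)   # bytes -> str with the detected encoding
print(html)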

02_页面分析、HTTP原理和响应/04_百度关键字.py:

import requests

# Ajax refreshes part of a page without reloading the whole thing, e.g. keyword suggestions that appear before a search is even submitted
# Capture the URL fired for the pycharm keyword in the browser's network panel

# When the route is found by capturing a keyword request like this, disguising the UA is not required

url = "https://www.baidu.com/sugrec?pre=1&p=3&ie=utf-8&json=1&prod=pc&from=pc_web&sugsid=35104,35488,34584,35490," \
      "35872,35949,35955,35316,26350,35941&wd=requests&req=2&csor=7&cb=jQuery110205986892274942364_1645516722140&_" \
      "=1645516722141 "

response = requests.get(url=url)

print(response.text)
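
The long URL above carries a jQuery JSONP callback (cb=...), so the body comes back wrapped in a function call. A tidier variant (a sketch; which parameters the endpoint actually requires is an assumption) passes the query via params and drops the callback, which appears to make the endpoint return plain JSON:

params = {
    'ie': 'utf-8',
    'json': 1,
    'prod': 'pc',
    'wd': 'requests',            # the keyword being completed
}
resp = requests.get("https://www.baidu.com/sugrec", params=params)
print(resp.json())               # the suggestion list, without the jQuery wrapper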

02_页面分析、HTTP原理和响应/05_百度贴吧.py:

import requests
import os

"""
"https://tieba.baidu.com/f?kw=宝马&pn=0"      # 这里的kw,如果遇到乱码,可以去使用url的Unicode解码
"https://tieba.baidu.com/f?kw=宝马&pn=50"
"https://tieba.baidu.com/f?kw=宝马&pn=100"
"""
kw = input("Enter a keyword:\t")

os_path = os.getcwd()+'/html/'+kw      # os.getcwd(): the current working directory, here F:\learning_records\U1 2021.9.26\Python\06、爬虫开发\02_页面分析、HTTP原理和响应

# Create the folder (and any missing parent folders) if it does not exist
if not os.path.exists(os_path):
    os.makedirs(os_path)

# Fetch the first 3 pages (pn = 0, 50, 100)
for page in range(0,101,50):
    url = f'https://tieba.baidu.com/f?kw={kw}&pn={page}'
    response = requests.get(url=url)
    with open(f'html/{kw}/{kw}_{int(page/50)}.html','w',encoding='utf-8') as f:
        f.write(response.text)
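
By the way, requests percent-encodes non-ASCII query values automatically, so the keyword can also be passed via params instead of being spliced into the URL by hand (a small sketch of the same request):

response = requests.get("https://tieba.baidu.com/f", params={'kw': kw, 'pn': 0})
print(response.url)      # the kw value appears percent-encoded in the final URL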

That's it for this article! I hope you'll keep supporting this Python series! Six months is enough to learn Python with me; message me privately if you have questions about this article. I'll publish new articles every day, so follow along if you like them! A fellow young learner keeping you company while you study Python. No matter how busy things get, the updates will keep coming. Let's keep at it together!

Editor: Lonelyroots
