The Beauty of Python 3 Programming

Python Web Scraping

2018-07-22  icessun

A Simple Web Scraper

import urllib.request

# Open the URL, read the raw bytes of the response body, and print them
response = urllib.request.urlopen('http://gitbook.cn')
html = response.read()
print(html)

The code above is a minimal working web scraper: it opens the URL, reads the response body, and prints the raw bytes.
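urlopen returns an http.client.HTTPResponse object, so besides read() the status code and the response headers can be inspected as well; a minimal sketch against the same URL:

import urllib.request

response = urllib.request.urlopen('http://gitbook.cn')
print(response.status)                      # HTTP status code, e.g. 200
print(response.getheader('Content-Type'))   # e.g. text/html; charset=utf-8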

Mimicking Browser Access

Swap the URL above for Baidu's and you will find that no content comes back. Some sites guard against this: if the visit does not come from a browser, the server will not hand over the page.


[Screenshot: the result returned by Baidu]
import urllib.request

# Build a Request object and attach a browser User-Agent header to it
request = urllib.request.Request('https://www.baidu.com/')
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')

response = urllib.request.urlopen(request)
html = response.read()
print(html)

Decoding the Page Content

html = response.read().decode('utf-8')   # decode the raw bytes as UTF-8 text
print(html)

This makes the scraped result much easier for a human to read.
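Not every site serves UTF-8, though. A minimal sketch that asks the response for its declared charset instead of hard-coding one, falling back to UTF-8 when the server does not declare it:

import urllib.request

response = urllib.request.urlopen('http://gitbook.cn')
# The headers object can report the charset declared in Content-Type
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)
print(html)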

Requests: A Powerful Tool for Fetching Web Pages

Installation

Open a CMD console and enter:

pip install requests
Verify that the installation succeeded.
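One quick check is importing the package and printing its version; an ImportError means the install failed. A one-line sketch:

python -c "import requests; print(requests.__version__)"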
Rewriting the first scraper:
import requests

url = 'http://gitbook.cn'
web_data = requests.get(url)   # GET the page
web_info = web_data.text       # the response body, decoded to text
print(web_info)

This code achieves the same result as before. Requests also makes it easy to mimic a browser: build a dict of request headers and pass it via the headers keyword argument:

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}
r = requests.get('http://gitbook.cn/', headers=header)
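Beyond .text, the Response object exposes a few fields worth checking before trusting the body; a small self-contained sketch:

import requests

r = requests.get('http://gitbook.cn/')
print(r.status_code)                   # HTTP status, e.g. 200
print(r.encoding)                      # the encoding Requests inferred from the headers
print(r.headers.get('Content-Type'))   # the raw Content-Type header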
Request headers can also carry a login Cookie, so the site treats the scraper as a logged-in user:

import requests

url = 'https://mp.csdn.net/'
header = {
    'Cookie': '<personal Cookie omitted here>',
    'User-Agent': ' '
}
web_data = requests.get(url, headers=header)   # pass the dict via the headers keyword
web_info = web_data.text
print(web_info)

The scraped result now includes the articles that were bookmarked while logged in. The Cookie was set successfully!
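Rather than pasting a Cookie by hand, requests.Session can keep cookies across requests automatically; a minimal sketch, where the login URL and form fields are hypothetical placeholders rather than CSDN's real ones:

import requests

session = requests.Session()
# Hypothetical login endpoint and form fields, for illustration only
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})
# The session stores whatever cookies the server set and sends them on later requests
page = session.get('https://example.com/bookmarks')
print(page.text)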

BeautifulSoup: A Powerful Tool for Parsing Web Pages

Installation

As with Requests, enter the command in the console:

pip install beautifulsoup4

BeautifulSoup itself only wraps a set of commonly used functions; to parse a page it still relies on a parser working underneath, so a parser needs to be installed as well:

[Table: the parsers BeautifulSoup supports]
# Install lxml, one of the faster parsers
pip install lxml
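The parser is chosen per call; a quick sketch showing the built-in parser next to lxml on the same markup (html.parser ships with Python, lxml must be installed):

from bs4 import BeautifulSoup

doc = '<html><head><title>GitChat</title></head><body><a href="/a">link</a></body></html>'
print(BeautifulSoup(doc, 'html.parser').title.string)   # built-in, no extra install
print(BeautifulSoup(doc, 'lxml').title.string)          # faster, needs pip install lxml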

Rewriting the code above:

from bs4 import BeautifulSoup
import requests

url = 'http://gitbook.cn'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}
web_data = requests.get(url, headers=header)   # fetch the page with the custom request header
soup = BeautifulSoup(web_data.text, 'lxml')    # hand the fetched page to the lxml parser

title = soup.title.string   # the text inside the <title> tag
print(title)                # GitChat
urls = soup.find_all('a')   # every <a> tag on the page
description = soup.find(attrs={'name': 'description'})['content']   # content of the description <meta> tag
keywords = soup.find(attrs={'name': 'keywords'})['content']         # content of the keywords <meta> tag
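find_all returns a list of Tag objects, so pulling out each link's address and text is a short loop; a self-contained sketch:

from bs4 import BeautifulSoup
import requests

web_data = requests.get('http://gitbook.cn')
soup = BeautifulSoup(web_data.text, 'lxml')
# .get reads a tag attribute, .get_text() reads the enclosed text
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text())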
BeautifulSoup in Practice
from bs4 import BeautifulSoup
import requests

url = 'https://gitbook.cn/'
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}
web_data = requests.get(url, headers=header)
soup = BeautifulSoup(web_data.text, 'lxml')

title = soup.title.string
print(title)    # GitChat

# select() takes a CSS selector, here one copied from the browser's "Copy selector" menu
theme = soup.select('#indexRight > div:nth-of-type(4) > a:nth-of-type(1) > span')[0].get_text()
print(theme)    # 前端 (the "front end" category)

text = soup.get_text()                             # all the text in the document
class_name = soup.select('path')[0].get('class')   # read the class attribute of a tag
url_address = soup.select('path')[0].get('href')   # read the href (URL) of a tag
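One caveat: select() returns an empty list when the selector matches nothing, so indexing [0] straight away can raise an IndexError; a small defensive sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body></body></html>', 'lxml')
matches = soup.select('#indexRight > div > a > span')
if matches:   # only index when the selector actually matched something
    print(matches[0].get_text())
else:
    print('selector matched nothing')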