The Beauty of Python 3 Programming

Python Web Scraping

2018-07-22  icessun

A Simple Web Scraper

import urllib.request

# Open the URL, read the raw bytes of the response body, and print them
response = urllib.request.urlopen('http://gitbook.cn')
html = response.read()
print(html)

The code above is a minimal working web scraper: it opens the URL, reads the response body, and prints the raw bytes.
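urlopen returns an http.client.HTTPResponse object, so besides read() the status code and the response headers can be inspected as well; a minimal sketch against the same URL:

import urllib.request

response = urllib.request.urlopen('http://gitbook.cn')
print(response.status)                      # HTTP status code, e.g. 200
print(response.getheader('Content-Type'))   # e.g. text/html; charset=utf-8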

Mimicking Browser Access

Swap the URL above for Baidu's and you will find that no content comes back. Some sites guard against this: if the visit does not come from a browser, the server will not hand over the page.


[Screenshot: the result returned by Baidu]
import urllib.request

# Build a Request object and attach a browser User-Agent header to it
request = urllib.request.Request('https://www.baidu.com/')
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')

response = urllib.request.urlopen(request)
html = response.read()
print(html)

Decoding the Page Content

html = response.read().decode('utf-8')   # decode the raw bytes as UTF-8 text
print(html)

This makes the scraped result much easier for a human to read.
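Not every site serves UTF-8, though. A minimal sketch that asks the response for its declared charset instead of hard-coding one, falling back to UTF-8 when the server does not declare it:

import urllib.request

response = urllib.request.urlopen('http://gitbook.cn')
# The headers object can report the charset declared in Content-Type
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)
print(html)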

Requests: A Powerful Tool for Fetching Web Pages

Installation

Open a CMD console and enter:

pip install requests
Verify that the installation succeeded.
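One quick check is importing the package and printing its version; an ImportError means the install failed. A one-line sketch:

python -c "import requests; print(requests.__version__)"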
Rewriting the first scraper:
import requests

url = 'http://gitbook.cn'
web_data = requests.get(url)   # GET the page
web_info = web_data.text       # the response body, decoded to text
print(web_info)

This code achieves the same result as before. Requests also makes it easy to mimic a browser: build a dict of request headers and pass it via the headers keyword argument:

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}
r = requests.get('http://gitbook.cn/', headers=header)
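Beyond .text, the Response object exposes a few fields worth checking before trusting the body; a small self-contained sketch:

import requests

r = requests.get('http://gitbook.cn/')
print(r.status_code)                   # HTTP status, e.g. 200
print(r.encoding)                      # the encoding Requests inferred from the headers
print(r.headers.get('Content-Type'))   # the raw Content-Type header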
Request headers can also carry a login Cookie, so the site treats the scraper as a logged-in user:

import requests

url = 'https://mp.csdn.net/'
header = {
    'Cookie': '<personal Cookie omitted here>',
    'User-Agent': ' '
}
web_data = requests.get(url, headers=header)   # pass the dict via the headers keyword
web_info = web_data.text
print(web_info)

The scraped result now includes the articles that were bookmarked while logged in. The Cookie was set successfully!
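Rather than pasting a Cookie by hand, requests.Session can keep cookies across requests automatically; a minimal sketch, where the login URL and form fields are hypothetical placeholders rather than CSDN's real ones:

import requests

session = requests.Session()
# Hypothetical login endpoint and form fields, for illustration only
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})
# The session stores whatever cookies the server set and sends them on later requests
page = session.get('https://example.com/bookmarks')
print(page.text)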

BeautifulSoup: A Powerful Tool for Parsing Web Pages

Installation

As with Requests, enter the command in the console:

pip install beautifulsoup4

BeautifulSoup itself only wraps a set of commonly used functions; to parse a page it still relies on a parser working underneath, so a parser needs to be installed as well:

[Table: the parsers BeautifulSoup supports]
# Install lxml, one of the faster parsers
pip install lxml
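The parser is chosen per call; a quick sketch showing the built-in parser next to lxml on the same markup (html.parser ships with Python, lxml must be installed):

from bs4 import BeautifulSoup

doc = '<html><head><title>GitChat</title></head><body><a href="/a">link</a></body></html>'
print(BeautifulSoup(doc, 'html.parser').title.string)   # built-in, no extra install
print(BeautifulSoup(doc, 'lxml').title.string)          # faster, needs pip install lxml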

Rewriting the code above:

from bs4 import BeautifulSoup
import requests

url = 'http://gitbook.cn'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}
web_data = requests.get(url, headers=header)   # fetch the page with the custom request header
soup = BeautifulSoup(web_data.text, 'lxml')    # hand the fetched page to the lxml parser

title = soup.title.string   # the text inside the <title> tag
print(title)                # GitChat
urls = soup.find_all('a')   # every <a> tag on the page
description = soup.find(attrs={'name': 'description'})['content']   # content of the description <meta> tag
keywords = soup.find(attrs={'name': 'keywords'})['content']         # content of the keywords <meta> tag
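find_all returns a list of Tag objects, so pulling out each link's address and text is a short loop; a self-contained sketch:

from bs4 import BeautifulSoup
import requests

web_data = requests.get('http://gitbook.cn')
soup = BeautifulSoup(web_data.text, 'lxml')
# .get reads a tag attribute, .get_text() reads the enclosed text
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text())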
BeautifulSoup in Practice
from bs4 import BeautifulSoup
import requests

url = 'https://gitbook.cn/'
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}
web_data = requests.get(url, headers=header)
soup = BeautifulSoup(web_data.text, 'lxml')

title = soup.title.string
print(title)    # GitChat

# select() takes a CSS selector, here one copied from the browser's "Copy selector" menu
theme = soup.select('#indexRight > div:nth-of-type(4) > a:nth-of-type(1) > span')[0].get_text()
print(theme)    # 前端 (the "front end" category)

text = soup.get_text()                             # all the text in the document
class_name = soup.select('path')[0].get('class')   # read the class attribute of a tag
url_address = soup.select('path')[0].get('href')   # read the href (URL) of a tag
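One caveat: select() returns an empty list when the selector matches nothing, so indexing [0] straight away can raise an IndexError; a small defensive sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body></body></html>', 'lxml')
matches = soup.select('#indexRight > div > a > span')
if matches:   # only index when the selector actually matched something
    print(matches[0].get_text())
else:
    print('selector matched nothing')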