A professor spent ¥3K on three Python web-scraping textbooks; a Zhihu veteran was delighted after reading them
![](https://img.haomeiwen.com/i9305082/1594e45badd90720.jpg)
I. Introduction
This article walks through three hands-on projects:
- Downloading a web novel (a static website)
- Downloading wallpapers (a dynamic website)
- Downloading iQIYI VIP videos
II. A Brief Introduction to Web Crawlers
A web crawler, also known as a web spider, fetches page content based on a web address (URL), which is simply the link you type into your browser.
Before we get into crawling itself, we need one prerequisite skill for writing crawlers: inspecting elements.
1. Inspecting Elements
![](https://img.haomeiwen.com/i9305082/89241e4a92c326a0.jpg)
![](https://img.haomeiwen.com/i9305082/4f1118ffb1b72565.jpg)
![](https://img.haomeiwen.com/i9305082/312ee50d7113b10e.jpg)
On the right you can see a large block of code: this is HTML. What is HTML? Here is an easy analogy: just as our genes determine our original appearance, the HTML returned by the server determines a website's original appearance.
![](https://img.haomeiwen.com/i9305082/313038cfd628406e.jpg)
![](https://img.haomeiwen.com/i9305082/97775698e935a9ad.jpg)
![](https://img.haomeiwen.com/i9305082/10019e8440bff0a1.jpg)
![](https://img.haomeiwen.com/i9305082/a0b3560ab247b2a3.jpg)
Another small example: as we all know, when you use the browser's "remember password" feature, the password shows up as a row of dots and cannot be read. Can we make it visible? Yes, with a little "surgery" on the page. Taking Taobao as an example, right-click on the password input box and choose Inspect.
![](https://img.haomeiwen.com/i9305082/5ee775d9a0c06332.jpg)
![](https://img.haomeiwen.com/i9305082/486182938eaa9357.jpg)
And just like that, the password the browser "remembered" is revealed:
![](https://img.haomeiwen.com/i9305082/a88b0d051fcbd7b8.jpg)
![](https://img.haomeiwen.com/i9305082/b082f49d3fafb5aa.jpg)
2. A Simple Example
The first step of any crawler is to fetch a page's HTML from its URL. In Python 3, you can use either urllib.request or requests to do this.
- The urllib library ships with Python; no extra installation is needed.
- The requests library is a third-party package that you must install yourself.
![](https://img.haomeiwen.com/i9305082/9c5c8533a8ffc84f.jpg)
![](https://img.haomeiwen.com/i9305082/44fb530274218d28.jpg)
![](https://img.haomeiwen.com/i9305082/69f177a19c67cd68.jpg)
![](https://img.haomeiwen.com/i9305082/9ec9e4bd998a1dcd.jpg)
![](https://img.haomeiwen.com/i9305082/b6cff365570f4ac7.jpg)
The one parameter that requests.get() always needs is url, because we have to tell the GET request whose information we want. We store the response returned by the GET request in the variable req, and req.text then gives us the HTML. The result looks like this:
![](https://img.haomeiwen.com/i9305082/e78adf75340fb1a9.jpg)
![](https://img.haomeiwen.com/i9305082/367ec2008a6cc3a4.jpg)
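Since the original snippets survive only as screenshots, here is a minimal sketch of this fetch step using the standard library's urllib.request (the article's other option). The URL is a placeholder, not the site from the screenshots:

```python
from urllib.request import urlopen

def fetch_html(url, encoding="utf-8"):
    """Download a page and return its HTML decoded as text."""
    with urlopen(url) as resp:
        return resp.read().decode(encoding)

# Placeholder target -- substitute the page you actually want to scrape.
# html = fetch_html("http://example.com/")
# print(html)
```

With requests, the same step is `req = requests.get(url)` followed by `req.text`, exactly as described above.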
III. Hands-on Scraping
The projects run from simple to more involved; the difficulty rises gradually, but all of them are beginner-level. Let's start with the first one: downloading a web novel.
1. Novel Download
(1) Background
![](https://img.haomeiwen.com/i9305082/a6ee097b66122348.jpg)
![](https://img.haomeiwen.com/i9305082/3a358defa6aaf47f.jpg)
![](https://img.haomeiwen.com/i9305082/4bef7b2bba984f9c.jpg)
Let's try fetching the HTML with what we have learned so far:
![](https://img.haomeiwen.com/i9305082/538c235b879e54a0.jpg)
Running this Python scraper produces the following result:
![](https://img.haomeiwen.com/i9305082/006392cb15c61caa.jpg)
![](https://img.haomeiwen.com/i9305082/81b7aa1557f21311.jpg)
(2) Beautiful Soup
![](https://img.haomeiwen.com/i9305082/e07722321fa35434.jpg)
Beautiful Soup is installed the same way as requests, with either of these commands:
- pip install beautifulsoup4
- easy_install beautifulsoup4
![](https://img.haomeiwen.com/i9305082/e7041cba2bca8cf5.jpg)
Now, using the element-inspection skill we just learned, look at the target page and you will see the following:
![](https://img.haomeiwen.com/i9305082/f9c6a3ebdcae0d0c.jpg)
![](https://img.haomeiwen.com/i9305082/fc0b5419e3b44d77.jpg)
![](https://img.haomeiwen.com/i9305082/a190bd9ea49824ee.jpg)
![](https://img.haomeiwen.com/i9305082/3ecaa5f3449dc321.jpg)
With this information, we can use Beautiful Soup to extract the content we want:
![](https://img.haomeiwen.com/i9305082/b149767d87fd3ae1.jpg)
![](https://img.haomeiwen.com/i9305082/5edb11a15fa07fc5.jpg)
![](https://img.haomeiwen.com/i9305082/3e4fc45c87ea063a.jpg)
![](https://img.haomeiwen.com/i9305082/0fc48d81e29a5e40.jpg)
As you can see, we have matched the body text we care about, but some unwanted pieces remain, such as the div tag name, br tags, and various spaces. How do we strip them out? Let's keep coding:
![](https://img.haomeiwen.com/i9305082/24e3f82eb8d99018.jpg)
![](https://img.haomeiwen.com/i9305082/eb302d35f2448074.jpg)
![](https://img.haomeiwen.com/i9305082/a742e6ec1b8ac519.jpg)
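The cleanup code appears only as an image, so here is a sketch of the idea under an assumed page layout: a `<div id="content">` holding the text, with `<br/>` breaks and `&nbsp;` indentation, which is what the screenshots suggest. Adjust the id and the `&nbsp;` run length to the real page:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the chapter page's HTML (invented for illustration).
SAMPLE_HTML = """
<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First paragraph of the chapter.
<br/><br/>&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph of the chapter.</div>
"""

def extract_text(html):
    """Strip the tags and turn &nbsp; indentation runs into line breaks."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="content")
    # .text drops div/br tags; \xa0 is the no-break space behind &nbsp;
    return div.text.replace("\xa0" * 4, "\n")

print(extract_text(SAMPLE_HTML))
```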
The program output looks like this:
![](https://img.haomeiwen.com/i9305082/74e37c6449237e68.jpg)
![](https://img.haomeiwen.com/i9305082/42dc1004b101da11.jpg)
![](https://img.haomeiwen.com/i9305082/5497882142adb8ed.jpg)
![](https://img.haomeiwen.com/i9305082/0a677242ab506fa0.jpg)
![](https://img.haomeiwen.com/i9305082/7b48219df4d2b89e.jpg)
![](https://img.haomeiwen.com/i9305082/b80d2158997a3f22.jpg)
![](https://img.haomeiwen.com/i9305082/ea92a0948dddfb61.jpg)
![](https://img.haomeiwen.com/i9305082/542b894e29d9f7a5.jpg)
Let's compare the first chapter's URL, which we obtained earlier, with the <a> tag:
![](https://img.haomeiwen.com/i9305082/09acc7a4037f54c9.jpg)
![](https://img.haomeiwen.com/i9305082/0bc821783c776fba.jpg)
![](https://img.haomeiwen.com/i9305082/e8a7a8dfcae839ee.jpg)
Again using the find_all method, we get:
![](https://img.haomeiwen.com/i9305082/2f2222fcda12ba6a.jpg)
![](https://img.haomeiwen.com/i9305082/fb9ef537e47995c4.jpg)
The method is simple: for each match a returned by Beautiful Soup, a.get('href') retrieves the value of the href attribute, and a.string retrieves the chapter title. The code:
![](https://img.haomeiwen.com/i9305082/aa678e15cd9ec777.jpg)
![](https://img.haomeiwen.com/i9305082/eb52b6852951022d.jpg)
![](https://img.haomeiwen.com/i9305082/64acac91b4ddcd07.jpg)
![](https://img.haomeiwen.com/i9305082/c558668280d3ba4d.jpg)
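As a runnable sketch of that step (the chapter-list markup and the server prefix below are invented for illustration; match the div id and domain to the real site):

```python
from bs4 import BeautifulSoup

# Invented sample of a chapter index page.
SAMPLE_LIST = """
<div id="list">
  <a href="/novel/chapter-1.html">Chapter 1</a>
  <a href="/novel/chapter-2.html">Chapter 2</a>
</div>
"""

def chapter_index(html, server="http://www.example-novel-site.com"):
    """Return (title, absolute-url) pairs for every chapter link."""
    soup = BeautifulSoup(html, "html.parser")
    index = []
    for a in soup.find("div", id="list").find_all("a"):
        # a.get('href') is the relative link, a.string the chapter title
        index.append((str(a.string), server + a.get("href")))
    return index
```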
(3) Putting It All Together
We now have every chapter's link, title, and content. All that is left is to combine the pieces and write the content to a text file:
![](https://img.haomeiwen.com/i9305082/916f0387f0c68c26.jpg)
It is a simple program: a single process, no process pool. The download is a bit slow, so go have a cup of tea while it runs. Here is the code in action:
![](https://img.haomeiwen.com/i9305082/bf214ed8b604a32c.gif)
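The integration step can be sketched as follows. Since the full script exists only as a screenshot, the fetch-and-parse part is abstracted behind a callable so the sketch runs offline; all names here are illustrative:

```python
import os
import tempfile

def download_novel(chapters, fetch_chapter, path):
    """Write every chapter into one text file.

    chapters      -- iterable of (title, url) pairs
    fetch_chapter -- callable mapping a chapter url to its cleaned text
                     (requests + Beautiful Soup in the real script)
    path          -- output file path
    """
    with open(path, "w", encoding="utf-8") as f:
        for title, url in chapters:
            f.write(title + "\n")
            f.write(fetch_chapter(url) + "\n\n")

# Offline demo with a stub fetcher standing in for the network code.
demo_path = os.path.join(tempfile.gettempdir(), "novel_demo.txt")
download_novel(
    [("Chapter 1", "url-1"), ("Chapter 2", "url-2")],
    lambda url: "Text fetched from " + url,
    demo_path,
)
demo_text = open(demo_path, encoding="utf-8").read()
```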
2. Wallpaper Download
(1) Background
![](https://img.haomeiwen.com/i9305082/0618aebee77fc7b5.jpg)
![](https://img.haomeiwen.com/i9305082/ab5dfd44d6f30ebf.jpg)
![](https://img.haomeiwen.com/i9305082/31df3aa8b30fc0bd.jpg)
(2) Going Further
![](https://img.haomeiwen.com/i9305082/729837359d5bf599.jpg)
![](https://img.haomeiwen.com/i9305082/96ee0da64f4f9db3.jpg)
![](https://img.haomeiwen.com/i9305082/6d748379d6a437ee.jpg)
Let's lay out the process first:
- Use requests to fetch the page's full HTML;
- Use Beautiful Soup to parse the HTML, find every <img> tag, and extract its src attribute to get each image's address;
- Download the images from those addresses.
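A sketch of the first two of those steps (the gallery markup below is invented; on a static site this would be all you need):

```python
from bs4 import BeautifulSoup

# Invented sample -- what a static gallery page might look like.
SAMPLE_PAGE = """
<div class="photos">
  <img src="https://images.example.com/photo-1.jpg"/>
  <img src="https://images.example.com/photo-2.jpg"/>
</div>
"""

def image_urls(html):
    """Find every <img> tag and collect its src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [img.get("src") for img in soup.find_all("img")]
```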
Full of confidence, let's try scraping Unsplash this way:
![](https://img.haomeiwen.com/i9305082/90ac33e29d81d044.jpg)
![](https://img.haomeiwen.com/i9305082/44e957c4ea31ab92.jpg)
![](https://img.haomeiwen.com/i9305082/1f7b8b3c0c455d08.jpg)
The answer is that every image on this site is loaded dynamically. Websites come in static and dynamic varieties: the site in the previous project was static, while this one is dynamic, and part of the point of dynamic loading is precisely to deter scrapers.
![](https://img.haomeiwen.com/i9305082/580d038a872bf1d0.jpg)
Dynamic sites usually implement dynamic loading by calling JavaScript. We do not need to dig into how that works; we only need to know that dynamically loaded JavaScript scripts are like the products in a makeup kit: foundation, lipstick, mascara, and so on, each with its own purpose. A dynamic site may use many JavaScript scripts, but if we can find the one responsible for loading the images, we have found the links we need.
![](https://img.haomeiwen.com/i9305082/0f7081c8e1df82ca.jpg)
![](https://img.haomeiwen.com/i9305082/44a624f829549cf9.jpg)
![](https://img.haomeiwen.com/i9305082/dd8cb2a1c938e9c7.jpg)
![](https://img.haomeiwen.com/i9305082/b80842137111a3e1.jpg)
![](https://img.haomeiwen.com/i9305082/962ac10bd16c7f50.jpg)
![](https://img.haomeiwen.com/i9305082/76c1a90152d93f76.jpg)
![](https://img.haomeiwen.com/i9305082/611393c3b02733ac.jpg)
![](https://img.haomeiwen.com/i9305082/363e0debbd3efe22.jpg)
![](https://img.haomeiwen.com/i9305082/54e7edfdbfa8ac0a.jpg)
![](https://img.haomeiwen.com/i9305082/4de8dc74ac4e6703.jpg)
![](https://img.haomeiwen.com/i9305082/25e76b8561376567.jpg)
From the Fiddler capture we can see that clicking the download button on different images sends GET requests to different addresses, yet the addresses follow a clear pattern: only one segment in the middle changes, and everything else is identical. Does that middle segment look familiar? It should: it is the photo id from the JSON data we captured earlier. If we can parse out each photo's id, we can build the download request address for every image and then download them all. So the first task now is to parse the JSON data.
![](https://img.haomeiwen.com/i9305082/a8f77e464271ea39.jpg)
Let's write some code and try to fetch the JSON data:
![](https://img.haomeiwen.com/i9305082/6b3607e0207db027.jpg)
![](https://img.haomeiwen.com/i9305082/8bb3defed769bd14.jpg)
![](https://img.haomeiwen.com/i9305082/f3c02222fa53f002.jpg)
An idea is worth nothing until you try it:
![](https://img.haomeiwen.com/i9305082/43a6ed10349331fd.jpg)
The authentication problem is solved, but a new one appears:
![](https://img.haomeiwen.com/i9305082/11461a15012a3206.jpg)
As you can see, our GET request failed again. Why? Besides dynamic loading, this site has a second anti-scraping measure: it validates the request headers. Let's analyze those Request Headers:
![](https://img.haomeiwen.com/i9305082/62a471091d48a3d3.jpg)
![](https://img.haomeiwen.com/i9305082/3aa73f248762f168.jpg)
![](https://img.haomeiwen.com/i9305082/10e96945755124fe.jpg)
![](https://img.haomeiwen.com/i9305082/3f054e08a79a17f9.jpg)
![](https://img.haomeiwen.com/i9305082/36eff2f9b17d591d.jpg)
![](https://img.haomeiwen.com/i9305082/f16aab54a4072ed2.jpg)
![](https://img.haomeiwen.com/i9305082/791a8d11feae1592.jpg)
The headers parameter is passed in as a dictionary. Remember to replace your Client-ID in the code above with the value from your own capture. The code produces the following result:
![](https://img.haomeiwen.com/i9305082/1f8ee6d7601e4b81.jpg)
![](https://img.haomeiwen.com/i9305082/7acc9aee48883c95.jpg)
![](https://img.haomeiwen.com/i9305082/255d653b20ff8df7.jpg)
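A sketch of attaching the captured headers to a request. The endpoint path and Client-ID below are placeholders for whatever your own Fiddler capture shows; the standard library's Request object is used so the example can be checked offline, and with requests you would pass the same dictionary as `headers=`:

```python
from urllib.request import Request

# Placeholder endpoint -- use the address from your own capture.
url = "https://unsplash.com/napi/photos?page=1&per_page=12"
headers = {
    "User-Agent": "Mozilla/5.0",                  # pretend to be a browser
    "Authorization": "Client-ID your-client-id",  # value from your capture
}
req = Request(url, headers=headers)  # the actual GET would be urlopen(req)
```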
Parsing the JSON data is easy; it works just like ordinary dictionary operations, since the result is dictionaries nested inside dictionaries. json.loads() takes the raw JSON-formatted string as its argument. The program output:
![](https://img.haomeiwen.com/i9305082/be95bd7c32158a8a.jpg)
![](https://img.haomeiwen.com/i9305082/2ebca0919cc896b2.jpg)
![](https://img.haomeiwen.com/i9305082/3d28a099ae98461d.jpg)
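The JSON-parsing step, sketched against a trimmed-down stand-in payload; the field names, ids, and the download-address pattern below are illustrative, so match them to your actual capture:

```python
import json

# Trimmed-down stand-in for the captured payload (invented ids).
SAMPLE_JSON = '[{"id": "photo-id-1"}, {"id": "photo-id-2"}]'

data = json.loads(SAMPLE_JSON)  # parse the raw JSON string into Python objects
photo_ids = [photo["id"] for photo in data]

# Splice each id into the download-address pattern spotted in the capture.
download_urls = [
    "https://unsplash.com/photos/{}/download".format(pid) for pid in photo_ids
]
```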
The download speed is decent; some images are slow only because they are large. Notice the warnings printed on the right: they appear because we skipped SSL certificate verification.
![](https://img.haomeiwen.com/i9305082/ab655ccaac5e21f9.jpg)
Now that you can scrape images, a simple dynamically loaded site will no longer stop you. Go try some image sites of your own!
3. iQIYI VIP Video Download
(1) Background
![](https://img.haomeiwen.com/i9305082/ddab3c1531c903fa.jpg)
![](https://img.haomeiwen.com/i9305082/8303223dd69a5a5f.jpg)
![](https://img.haomeiwen.com/i9305082/8d6274b73205e69e.jpg)
![](https://img.haomeiwen.com/i9305082/937137208638fd14.jpg)
With this, we can watch these VIP videos online:
![](https://img.haomeiwen.com/i9305082/3abac4b66570d15a.jpg)
However, this site only plays the parsed videos online; it offers no download endpoint. If we want to save the videos, we can use a crawler plus packet sniffing to download them ourselves.
(2) Leveling Up
The analysis method is the same: we capture the traffic with Fiddler:
![](https://img.haomeiwen.com/i9305082/1bd8f603fa131fbb.jpg)
There are not many useful requests, so we will go through them one by one, starting with the response to the first request.
![](https://img.haomeiwen.com/i9305082/6ff9df395498974f.jpg)
![](https://img.haomeiwen.com/i9305082/03abd4fc723cbf07.jpg)
![](https://img.haomeiwen.com/i9305082/eb62c4fb08af0f0e.jpg)
![](https://img.haomeiwen.com/i9305082/24308059586aa826.jpg)
![](https://img.haomeiwen.com/i9305082/f579ddb0ae962278.jpg)
![](https://img.haomeiwen.com/i9305082/f2fb59b5bc173015.jpg)
![](https://img.haomeiwen.com/i9305082/fd04e2dcac58b813.jpg)
This information is escaped, but that is no problem; extracting it by hand gives us the following form:
![](https://img.haomeiwen.com/i9305082/d626393a4f8be8d5.jpg)
We already know the domain of the video-parsing server, so we prepend it:
![](https://img.haomeiwen.com/i9305082/4eeff75d3076fb49.jpg)
![](https://img.haomeiwen.com/i9305082/670e1bfee2d61909.jpg)
![](https://img.haomeiwen.com/i9305082/8bd427715492cf5b.jpg)
![](https://img.haomeiwen.com/i9305082/7353085a5fb832e9.jpg)
![](https://img.haomeiwen.com/i9305082/2ee3b6f91be6ebc5.jpg)
Opening this video address:
![](https://img.haomeiwen.com/i9305082/04ac38dbf5361026.jpg)
![](https://img.haomeiwen.com/i9305082/4d33c911a7595b48.jpg)
Next, our task is to turn the analysis above into code: given any video playback address, obtain the address where the video is actually stored.
Let's organize the programming approach:
![](https://img.haomeiwen.com/i9305082/bd3fea3d4cbcebe9.jpg)
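One way to sketch the extraction step: pull the escaped path out of the response body with a regular expression and unescape it. The response fragment and the server domain below are made up to mirror the shape described above; the real values come from your own capture:

```python
import re

# Invented fragment shaped like the parsing server's escaped response.
SAMPLE_RESPONSE = r'var url = "\/api\/url.php?key=abc123";'

def extract_path(body):
    """Find the quoted, escaped path in the response and unescape it."""
    match = re.search(r'"([^"]+)"', body)
    return match.group(1).replace("\\/", "/") if match else None

path = extract_path(SAMPLE_RESPONSE)
# Prepend the parsing server's domain (placeholder) for the full address.
video_info_url = "http://api.example-parser.com" + path if path else None
```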
(3) Writing the Code
![](https://img.haomeiwen.com/i9305082/ada699e3e2f36236.jpg)
![](https://img.haomeiwen.com/i9305082/b16d722e2e7b65c3.jpg)
![](https://img.haomeiwen.com/i9305082/5d7e4987c3acb628.jpg)
The approach has been laid out. If you enjoy scraping, I hope that after running this code you will rewrite the program from scratch yourself, because only by doing your own analysis and testing will you truly understand what the code means. Running the code above gives:
![](https://img.haomeiwen.com/i9305082/7eb708fdcadcabb5.jpg)
![](https://img.haomeiwen.com/i9305082/2d93bcd82b2faf43.jpg)
![](https://img.haomeiwen.com/i9305082/106ca9c60eec68d5.jpg)
![](https://img.haomeiwen.com/i9305082/a3b51b17ff6bbeea.jpg)
![](https://img.haomeiwen.com/i9305082/4eb7a935d973485e.jpg)
![](https://img.haomeiwen.com/i9305082/e414edd672616f23.jpg)
![](https://img.haomeiwen.com/i9305082/348b17e0e5203f41.jpg)
![](https://img.haomeiwen.com/i9305082/2c262f1417c48231.gif)