Python Crawlers

Python Crawlers -- Day 06

2019-01-08  陈small末

Processes

The Concept of a Process

Multithreading in Python is not true multithreading; to make full use of multi-core CPU resources, Python programs usually need multiple processes.

The concept of a process:
A process is one execution of a program -- an activity or task in progress. The CPU is what actually carries the task out.

The lifetime of a process:
When the operating system needs to perform a task, it creates a process. Once the process has finished the task, the system destroys it and reclaims the resources it occupied. The span from creation to destruction is the process's lifetime.

Processes are concurrent:
A system holds many processes at the same time; they take turns occupying the CPU and other resources.

Parallelism vs. concurrency:
To the user, both look like things running at the same time. Whether process or thread, each is just a task; the CPU does the real work, and a single core can execute only one task at any instant.
Parallelism: multiple tasks truly run at the same moment. This requires multiple CPUs; with N CPUs, N tasks can execute simultaneously. (CPU count >= task count)
Concurrency: pseudo-parallelism. Tasks only appear simultaneous; in reality a single CPU switches back and forth among them. (CPU count < task count)

Synchronous vs. asynchronous:
Synchronous means that when a process issues a request that takes time to answer, it waits until the response arrives before continuing. Asynchronous means the process does not wait: it moves on to other work regardless of the other side's state, and the system notifies it when the response arrives, which improves efficiency. For example, a phone call is synchronous communication; texting is asynchronous.

Multithreading vs. multiprocessing:
For CPU-bound applications, use multiple processes; for I/O-bound applications, use multiple threads. Creating a thread is far cheaper than creating a process.

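To make the CPU-bound point concrete, here is a small timing sketch (not from the original lesson) that maps the same computation over a thread pool and a process pool; `multiprocessing.dummy` is the standard library's thread-backed twin of `multiprocessing`. On a multi-core machine the process version should finish noticeably faster.

```python
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def burn(n):
    # purely CPU-bound work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [5_000_000] * 4

    start = time.time()
    with ThreadPool(4) as tp:   # threads share one GIL, so little speedup
        tp.map(burn, tasks)
    print("threads:   %.2fs" % (time.time() - start))

    start = time.time()
    with Pool(4) as pp:         # separate processes can use multiple cores
        pp.map(burn, tasks)
    print("processes: %.2fs" % (time.time() - start))
```
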
Creating Processes

Using multiprocessing.Process

```python
import multiprocessing
import time

def func(arg):
    pname = multiprocessing.current_process().name
    pid = multiprocessing.current_process().pid
    print("current process id=%d, name=%s" % (pid, pname))

    for i in range(5):
        print(arg)
        time.sleep(1)

if __name__ == "__main__":
    p = multiprocessing.Process(target=func, args=("hello",))

    # Mark it as a daemon process (it is killed when the main process exits)
    p.daemon = True

    p.start()

    # Poll until the child finishes; a plain `while True` here would never
    # let the main process reach "main over"
    while p.is_alive():
        print("Is the child process alive?", p.is_alive())
        time.sleep(1)
    print("main over")
```

Defining a Custom Process by Subclassing Process

```python
import multiprocessing
import os

# Define a custom process by subclassing Process
class MyProcess(multiprocessing.Process):
    def __init__(self, name, url):
        super().__init__()
        self.name = name
        self.url = url  # custom attribute

    # Override run()
    def run(self):
        pid = os.getpid()
        ppid = os.getppid()
        pname = multiprocessing.current_process().name
        print("current process name:", pname)
        print("current process id:", pid)
        print("parent process id:", ppid)

if __name__ == '__main__':
    # Create 3 processes
    MyProcess("Squad 1", "").start()
    MyProcess("Squad 2", "").start()
    MyProcess("Squad 3", "").start()
    print("main process id:", multiprocessing.current_process().pid)

    # Number of CPU cores
    coreCount = multiprocessing.cpu_count()
    print("my CPU has %d cores" % coreCount)

    # List of currently active child processes
    print(multiprocessing.active_children())
```

Synchronous vs. Asynchronous Execution and Process Locks

```python
import multiprocessing
import random
import time

def fn():
    name = multiprocessing.current_process().name
    print("process starting:", name)
    time.sleep(random.randint(1, 4))
    print("process finished:", name)

# --- multiple processes ---

# Run the processes asynchronously (start both, don't wait)
def processAsync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p2.start()

# Run them synchronously (join() blocks until each finishes)
def processSync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p1.join()
    p2.start()
    p2.join()

# With a lock
def processLock():
    # process lock
    lock = multiprocessing.Lock()
    p1 = multiprocessing.Process(target=fn2, name="Squad 1", args=(lock,))
    p2 = multiprocessing.Process(target=fn2, name="Squad 2", args=(lock,))
    p1.start()
    p2.start()

def fn2(lock):
    name = multiprocessing.current_process().name
    print("process starting:", name)

    # Acquiring the lock, option 1 (explicit acquire/release):
    # if lock.acquire():
    #     print("working...")
    #     time.sleep(random.randint(1, 4))
    #     lock.release()

    # Acquiring the lock, option 2 (released automatically):
    with lock:
        print("%s: working..." % name)
        time.sleep(random.randint(1, 4))

    print("%s: finished" % name)

if __name__ == '__main__':
    # Run one of these at a time to keep the output readable
    processAsync()    # asynchronous execution
    # processSync()   # synchronous execution
    # processLock()   # with a process lock
```

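Where the lock really earns its keep is shared mutable state. A minimal sketch (not from the lesson) using a shared `multiprocessing.Value` counter: the increment is a read-modify-write, so without the lock the workers can interleave and lose updates.

```python
import multiprocessing

def add_many(counter, lock, n):
    for _ in range(n):
        with lock:              # remove the lock to see lost updates
            counter.value += 1  # read-modify-write on shared memory

if __name__ == '__main__':
    counter = multiprocessing.Value('i', 0)  # shared integer
    lock = multiprocessing.Lock()
    workers = [multiprocessing.Process(target=add_many, args=(counter, lock, 100000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # 400000 with the lock; usually less without it
```
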
Limiting Maximum Concurrency with a Semaphore

```python
import multiprocessing
import time

def fn(sem):
    with sem:
        name = multiprocessing.current_process().name
        print("child process starting:", name)
        time.sleep(3)
        print("child process finished:", name)

if __name__ == '__main__':
    # At most 3 of the 8 processes run inside the semaphore at any moment
    sem = multiprocessing.Semaphore(3)
    for i in range(8):
        multiprocessing.Process(target=fn, name="Squad %d" % i, args=(sem,)).start()
```

Exercise: scrape Lianjia second-hand listings with multiple processes: https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape Lianjia with multiple processes plus coroutines: https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape the paginated Douyu category with multiple threads: https://www.douyu.com/gapi/rkc/directory/2_201/4
Exercise: scrape the paginated Douyu category with multiple processes: https://www.douyu.com/gapi/rkc/directory/2_201/4
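For the paginated exercises, the core pattern is mapping page URLs over a process pool. A rough sketch, assuming `requests` is installed; the URL template and response handling are placeholders to adapt, not the exercise solution:

```python
import multiprocessing
import requests

def fetch(page):
    # hypothetical paginated endpoint; adjust to the target site
    url = "https://www.douyu.com/gapi/rkc/directory/2_201/%d" % page
    resp = requests.get(url, timeout=10)
    return page, resp.status_code, len(resp.content)

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        for page, status, size in pool.map(fetch, range(1, 6)):
            print("page %d -> HTTP %d, %d bytes" % (page, status, size))
```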

Extras

Thread Pools

```python
import threading
import threadpool  # third-party package: pip install threadpool

import time
import random

# =================================================================

def fn(who):
    tname = threading.current_thread().getName()

    print("%s starting %s..." % (tname, who))
    time.sleep(random.randint(1, 5))
    print("-----%s, %s-----" % (tname, who))

# =================================================================

# Callback invoked when a request finishes:
#   request = the completed request
#   result  = the task's return value
def cb(request, result):
    print("cb", request, result)

if __name__ == '__main__':
    # Create a thread pool with a maximum concurrency of 4 (4 threads)
    pool = threadpool.ThreadPool(4)

    argsList = ["张三丰", "赵四", "王五", "六爷", "洪七公", "朱重八"]

    # Build the requests, attaching the callback
    requests = threadpool.makeRequests(fn, argsList, callback=cb)

    for req in requests:
        pool.putRequest(req)

    # Block until every request has returned
    # (the pool's worker threads are daemon threads)
    pool.wait()
    print("Over")
```

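The `threadpool` package is old and no longer maintained; the standard library's `concurrent.futures` covers the same ground. A sketch of the equivalent, using only the stdlib, with `as_completed` playing the role of the callback:

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fn(who):
    tname = threading.current_thread().name
    print("%s starting %s..." % (tname, who))
    time.sleep(random.randint(1, 5))
    return who

if __name__ == '__main__':
    names = ["张三丰", "赵四", "王五", "六爷", "洪七公", "朱重八"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, n) for n in names]
        for fut in as_completed(futures):
            print("done:", fut.result())
    print("Over")
```
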
Process Pools

```python
import multiprocessing
import random
import time

def fn1(arg, name):
    print("running task 1: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("process %d done!" % name)

def fn2(arg, name):
    print("running task 2: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("process %d done!" % name)

# Callback for the asynchronous results
def onback(result):
    print("got result {}".format(result))

if __name__ == "__main__":
    # Functions to execute concurrently
    funclist = [fn1, fn2, fn1, fn2]

    # Create a process pool with a concurrency of 3
    pool = multiprocessing.Pool(3)

    # Walk the function list, handing each function to the pool
    for i in range(len(funclist)):
        # synchronous: apply() blocks until the task finishes
        # pool.apply(func=funclist[i], args=("hello", i))

        # asynchronous: apply_async() returns immediately; the callback
        # receives each task's return value (None here)
        pool.apply_async(func=funclist[i], args=("hello", i), callback=onback)

    pool.close()  # stop accepting new tasks into the pool
    pool.join()   # block the main process until every pooled task finishes
```

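When every task runs the same function, `Pool.map` is the shorter form; a stdlib-only sketch (the `square` helper is just an illustration):

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:
        # distributes the inputs across the pool; results come back in order
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```
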
Introduction to the Scrapy Framework

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is very widely used.

With the Scrapy framework, a user only needs to implement a few custom modules to build a crawler that fetches page content and images with very little effort.

Scrapy uses Twisted (whose main rival is Tornado), an asynchronous, event-driven networking framework, to handle network communication. This speeds up downloads without requiring you to implement asynchrony yourself, and it exposes a variety of middleware interfaces for flexibly meeting different needs.

Scrapy Architecture


Scrapy's main components:

Scrapy Engine:
Handles the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler:
Accepts the Requests sent over by the engine, arranges and enqueues them in a defined order, and hands them back when the engine asks for them.

Downloader:
Downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them to the Spider for processing.

Spider:
Processes all Responses, analyzes and extracts data from them to fill Item fields, and submits any follow-up URLs to the engine, which feeds them back into the Scheduler.

Item Pipeline:
The place where Items obtained from the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares:
Components you can treat as customizable extensions of the download step.

Spider Middlewares:
Components that let you extend and intercept the communication between the engine and the Spider (for example, the Responses entering the Spider and the Requests leaving it).

Installing Scrapy

```
Scrapy installation notes
Official documentation: http://doc.scrapy.org/en/latest
Chinese documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Installation steps:
1. Install wheel
   pip install wheel
2. Install lxml
   pip install lxml
3. Install pyopenssl
   pip install pyopenssl
4. Install Twisted
   Download a Twisted wheel yourself and install it. The page below hosts
   wheels for many Python packages; pick the one matching your Python
   version and system: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

   # for Python 3.6 ("cp36" is the Python version)
   pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl

5. Install pywin32
   pip install pywin32
6. Install scrapy
   pip install scrapy

After installing, type scrapy in a terminal to verify the installation succeeded.
```

Using Scrapy

Building a crawler with Scrapy follows these steps:

  1. Create a Scrapy project

  2. Define the Items to extract

  3. Write a spider to crawl the site and extract the Items

  4. Write an Item Pipeline to store the extracted Items (the data)

1. Create a Project (scrapy startproject)

To create a new Scrapy project for crawling the data at http://www.meijutt.com/new100.html, use the following command:

```
scrapy startproject meiju
```

Create the Spider

```
cd meiju
scrapy genspider meijuSpider meijutt.com

# meijuSpider is the name of the spider file
# meijutt.com is the domain of the site to crawl
```

Creating a Scrapy project generates a number of files automatically. Here is a brief description of what the main ones do:

```
scrapy.cfg:
    Project configuration; mainly provides base settings for the Scrapy
    command-line tool. (The real crawler settings live in settings.py.)
items.py:
    Data models for structuring scraped data, similar to Django's Model.
pipelines.py:
    Data-processing behavior, e.g., persisting the structured data.
settings.py:
    The configuration file: crawl depth, concurrency, download delay, etc.
spiders/:
    The spider directory: create files here and write the crawling rules.

Note: spider files are usually named after the site's domain.
```
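For reference, a freshly generated project typically looks like this (the exact file set can vary slightly across Scrapy versions):

```
meiju/
├── scrapy.cfg
└── meiju/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── meijuSpider.py
```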

2. Define the Item

An Item is the container for scraped data. It is used much like a Python dict, and although a plain dict works in Scrapy, Item adds protection against undefined-field errors caused by typos.

Much as an ORM Model defines fields, we define the fields to scrape by subclassing scrapy.Item.

```python
import scrapy

class MeijuItem(scrapy.Item):
    name = scrapy.Field()
```
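The typo protection mentioned above is easy to see in a REPL: assigning to a field that was never declared raises a KeyError instead of silently creating it.

```python
item = MeijuItem()
item['name'] = 'Some Show'  # fine: a declared field
item['nmae'] = 'oops'       # KeyError: MeijuItem does not support field: nmae
```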

3. Write the Spider

```python
# -*- coding: utf-8 -*-
import scrapy
from lxml import etree
from meiju.items import MeijuItem

class MeijuspiderSpider(scrapy.Spider):
    # spider name
    name = 'meijuSpider'

    # allowed domains
    allowed_domains = ['meijutt.com']

    # starting URLs
    start_urls = ['http://www.meijutt.com/new100.html']

    # data handling; response is the page's response object
    def parse(self, response):
        # parse with lxml + XPath
        mytree = etree.HTML(response.text)
        movie_list = mytree.xpath('//ul[@class="top-list fn-clear"]/li')

        for movie in movie_list:
            name = movie.xpath('./h5/a/text()')

            # create the item (a dict-like object)
            item = MeijuItem()
            item['name'] = name
            yield item
```

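Scrapy responses also ship with built-in selectors, so the lxml import is optional. The same parse method using Scrapy's own XPath support might look like this (`.get()` is the newer spelling of `.extract_first()`):

```python
def parse(self, response):
    # response.xpath returns a SelectorList; .get() extracts the first string
    for movie in response.xpath('//ul[@class="top-list fn-clear"]/li'):
        item = MeijuItem()
        item['name'] = movie.xpath('./h5/a/text()').get()
        yield item
```
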
Enable an Item Pipeline Component

To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py, together with a priority. The integer assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. By convention the values are kept in the 0-1000 range (any value is allowed; the lower the number, the higher the component's priority).

```python
ITEM_PIPELINES = {
    'meiju.pipelines.MeijuPipeline': 300,
}
```

Set the User-Agent

Set the USER_AGENT value in settings.py:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
```

4. Write a Pipeline to Store the Extracted Items (the Data)

```python
class SomethingPipeline(object):
    def __init__(self):
        # Optional: parameter initialization, etc.
        pass

    def process_item(self, item, spider):
        # item (Item object) - the scraped item
        # spider (Spider object) - the spider that scraped the item
        # This method is required; every item pipeline component calls it.
        # It must return an Item object; dropped items are not processed
        # by any later pipeline components.
        return item

    def open_spider(self, spider):
        # spider (Spider object) - the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) - the spider that was closed
        # Optional: called when the spider is closed.
        pass
```
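As a concrete illustration (not from the lesson), a minimal pipeline that appends each item to a JSON-lines file; the file name is arbitrary:

```python
import json

class JsonLinesPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) turns the Item into a plain, serializable dict
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```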

Run the crawler:

```
scrapy crawl meijuSpider

# --nolog suppresses the log output
scrapy crawl meijuSpider --nolog
```

The simplest way for Scrapy to save the scraped data is its feed export: the -o flag writes a file in the format implied by the extension:

```
scrapy crawl meijuSpider -o meiju.json
scrapy crawl meijuSpider -o meiju.csv
scrapy crawl meijuSpider -o meiju.xml
```
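One practical note for Chinese content: JSON feeds escape non-ASCII characters by default. On recent Scrapy versions, setting the export encoding in settings.py keeps the output readable:

```python
# settings.py: write feed files as UTF-8 instead of ASCII escapes
FEED_EXPORT_ENCODING = 'utf-8'
```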

Exercise: use Scrapy to crawl Sina news and store it in a database: http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_1.shtml