21. Multiprocess Crawlers
2018-02-21 · by 橄榄的世界
1. Basic usage of multiprocessing:
from multiprocessing import Pool  # import the process-pool class
pool = Pool(processes=4)  # create a pool of 4 worker processes
pool.map(func, iterable[, chunksize])  # map() runs func across the pool; iterable supplies the arguments
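A minimal, runnable sketch of this pattern, with a trivial `square` function as an illustrative stand-in (not from the original) for the per-URL work:

```python
from multiprocessing import Pool

def square(x):
    # stand-in for real per-item work done by each worker process
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize=2 hands each worker two items at a time
        results = pool.map(square, range(8), chunksize=2)
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that `map()` preserves input order in its results, regardless of which worker finishes first.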
2. Performance comparison:
Target URL: https://www.qiushibaike.com/text/
Data scraped: user ID, joke text, laugh count, comment count
Parsing method: regular expressions
Benchmark: compare the running time of a serial run, 2 processes, and 4 processes
import requests
import re
from multiprocessing import Pool
import time

'''Parse each page, extract the fields, and run the scraper both serially and under a process pool'''
hds = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def scraper(url):
    r = requests.get(url, headers=hds)
    ids = re.findall("<h2>(.*?)</h2>", r.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', r.text, re.S)
    laughs = re.findall('<span class="stats-vote">.*?<i class="number">(.*?)</i>', r.text, re.S)
    comments = re.findall('<span class="stats-comments">.*?<i class="number">(.*?)</i>', r.text, re.S)
    infos = []
    for user_id, content, laugh, comment in zip(ids, contents, laughs, comments):
        infos.append({
            'id': user_id,
            'content': content,
            'laugh': laugh,
            'comment': comment,
        })
    return infos  # return every record, not just the last one
if __name__ == '__main__':  # required for multiprocessing on Windows; without it the child processes raise an error
    '''Build the list of page URLs'''
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]

    start1 = time.time()
    for url in url_list:
        scraper(url)
    stop1 = time.time()
    print("Serial run took", stop1 - start1)

    start2 = time.time()
    pool = Pool(processes=2)
    pool.map(scraper, url_list)
    pool.close()
    pool.join()
    stop2 = time.time()
    print("2 processes took", stop2 - start2)

    start3 = time.time()
    pool = Pool(processes=4)
    pool.map(scraper, url_list)
    pool.close()
    pool.join()
    stop3 = time.time()
    print("4 processes took", stop3 - start3)
Output:
Serial run took 2.73215651512146
2 processes took 2.477141857147217
4 processes took 2.3501341342926025
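The speedup above is modest, most likely because the crawl is dominated by network I/O rather than CPU time. For CPU-bound work the gain from a process pool is usually much closer to the process count. A small timing harness sketch illustrating this (`work` and the task sizes are illustrative names, not from the original article):

```python
import time
from multiprocessing import Pool

def work(n):
    # CPU-bound task: sum of squares below n
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    tasks = [200000] * 8

    start = time.time()
    serial = [work(n) for n in tasks]
    print("serial took", time.time() - start)

    start = time.time()
    with Pool(processes=4) as pool:
        parallel = pool.map(work, tasks)
    print("4 processes took", time.time() - start)

    assert serial == parallel  # same results, different wall time
```

Because each call to `work` keeps a core busy, the pool version can genuinely run four tasks at once, unlike the crawler, where workers mostly wait on the network.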