Python: Scraping encyclopedia.thefreedic

2020-03-03 树懒吃糖_

2020.03.03

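Goal:
Look up keywords on Wikipedia and save the text content of each result page. Because Wikipedia could not be opened, a similar site was used instead:
https://encyclopedia2.thefreedictionary.com. A manual check showed that this site carries richer information than Wikipedia, and its page source shows that it links to Wikipedia content.
Keywords to look up: about 2 million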

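After installing the 谷歌访问助手 (Google Access Helper) extension in Chrome, Wikipedia opens successfully.
Extension installation guide: https://www.jianshu.com/p/47aeb966623e
I followed the steps in that guide and had no problems. When I recommended it to a friend, however, the extension asked for activation after installation and could only be used temporarily. (I don't know why the behavior differs.)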

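First version (v1). Test environment: Windows 7, PyCharm, Python 3
Approach:
1. URL analysis: the target site uses GET requests and passes the query in the URL, with the words of a search term joined by "+":
https://encyclopedia2.thefreedictionary.com/Anthoceros+punctatus
2. Parse the HTML returned for the URL.
The pages do not follow a very regular layout, and it is not yet clear exactly which information is needed, so the script simply extracts the content of every "p" tag.
Each search term should be stored as a single record line, so the extracted content gets some light cleanup, such as removing all line breaks.
3. Loop over the list of all search terms.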
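
As a rough illustration of steps 1 and 2, the sketch below builds the request URL for a search term and flattens extracted text into a single line. The helper names `build_url` and `to_one_line` are hypothetical and not part of the original script; only the base URL and the "+" convention come from the analysis above.

```python
BASE_URL = "https://encyclopedia2.thefreedictionary.com"


def build_url(term):
    """Join the words of a search term with '+' and append them to the base URL."""
    return "{}/{}".format(BASE_URL, "+".join(term.split()))


def to_one_line(text):
    """Replace line breaks with spaces so one search term fits on one record line."""
    return text.replace("\r", " ").replace("\n", " ")


print(build_url("Anthoceros punctatus"))
# https://encyclopedia2.thefreedictionary.com/Anthoceros+punctatus
print(to_one_line("first line\nsecond line"))
# first line second line
```

Using `split()` before joining also collapses accidental double spaces in a search term, which a plain `replace(' ', '+')` would turn into "++".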

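A test with 100 search terms ran without errors. After moving the script to a Linux server (having checked that all required third-party packages were installed), running it produced an Error.
Two places need to be changed (apart from the paths):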
html = request.urlopen(url).read()
BeautifulSoup(html, 'html.parser')

from urllib import request
from bs4 import BeautifulSoup
import os
import time
import random

def get_request(url):
    """open the url and parse the html file"""
    get_result = list()
    try:
        html = request.urlopen(url)
        soup = BeautifulSoup(html, 'lxml')
        items = soup.find_all("p")         #label "p"
        for item in items:
            new = item.get_text().replace('\n', ' ').replace('\r', ' ')
            get_result.append(new)
    except Exception as e:
        print(e)
        get_result.append(repr(e)) 
    return get_result

def search_words(path):
    """search keywords"""
    species = list()
    with open(path) as file:
        for line in file:
            aa = line.split('|')
            specie = aa[1].strip()
            species.append(specie)
    return species


web = 'https://encyclopedia2.thefreedictionary.com'  # base URL of the site being crawled


def numerous_download(path):
    """Fetch every search term listed in `path` and write one record per line."""
    file = os.path.split(path)[-1]
    #out_path = r'/home/dujl/works/07.Microorganism/download_from_encyclopedia/{}_wiki.txt'.format(file)
    out_path = r'D:\wiki抓取\download_from_wiki\{}_wiki.txt'.format(file)
    species = search_words(path)
    with open(out_path, 'w', encoding="utf-8") as outer:  # closed automatically, even on error
        for specie in species:
            sw = specie.replace(' ', '+')  # search term with spaces replaced by "+"
            url = '{}/{}'.format(web, sw)
            get_result = get_request(url)

            outer.write(specie + '\t' + '\t'.join(get_result) + '\n')
            outer.flush()
            time.sleep(random.uniform(2, 6))  # random pause between requests


def running(path):
    #path = r'/home/dujl/works/07.Microorganism/test.txt'
    numerous_download(path)

    print('End time: ', time.ctime())
    print("Exiting.")


path = r'D:\wiki抓取\split_part\test.txt'
running(path)

After these changes, a second test with 10,000 search terms produced errors such as "unable to connect to the remote server", "urlopen error [errno]10060", "10056", and "403 forbidden".

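Second version (v2)
Changes:
(1) Imitate a browser
Add headers. A headers dict containing only the User-Agent is enough to work, but Host, Connection, Referer, and similar fields were filled in as well. The User-Agent is picked at random from a list of user agents:
user_agent = random.choice(user_agents)
Many posts about imitating a browser mention Selenium, but they all deal with simulated login, so it is left aside for now.

(2) Set a timeout
response = request.urlopen(req, timeout=10)
For now, URL requests that exceed the timeout are simply saved to a log file and have to be crawled again after the current round of requests finishes.
A loop that retries the request several times could also be written.

(3) Use proxy IPs
Not really needed at the moment, so this was not added.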
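
The retry loop mentioned in (2) might look like the sketch below. It is only an illustration under assumptions: `fetch_with_retries`, `crawl`, the `retries` and `delay` parameters, and the injectable `opener` argument are hypothetical names, not part of the original script. URLs that still fail after all attempts come back as None, and the caller appends them to a log file for a later round.

```python
import time
from urllib import request


def fetch_with_retries(url, retries=3, delay=2.0, timeout=10, opener=request.urlopen):
    """Try to download `url` up to `retries` times; return bytes, or None on failure."""
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            return response.read()
        except Exception as e:  # broad on purpose, like the scripts in this post
            print("attempt {} failed for {}: {!r}".format(attempt, url, e))
            if attempt < retries:
                time.sleep(delay)  # wait before trying again
    return None


def crawl(urls, log_path, **kwargs):
    """Fetch each URL; append the ones that never succeeded to a log file."""
    pages = {}
    with open(log_path, "a", encoding="utf-8") as log:
        for url in urls:
            html = fetch_with_retries(url, **kwargs)
            if html is None:
                log.write(url + "\n")  # to be crawled again in a later round
            else:
                pages[url] = html
    return pages
```

Passing the opener as a parameter keeps the loop testable without network access. A `request.Request` object with headers can be passed in place of the plain URL string when calling `fetch_with_retries` directly.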

def get_request(url):
    """open the url and parse the html file"""
    get_result = list()

    # proxy IP (not used yet)
    #proxy = {}

    # pool of User-Agent strings to choose from
    user_agents = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
                   'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
                   'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
                   'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
                   'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',]

    user_agent = random.choice(user_agents)
    headers = {"Host": 'encyclopedia2.thefreedictionary.com',
               "User-Agent": user_agent,
               "Connection": 'Keep-alive',
               "Referer": 'https://encyclopedia2.thefreedictionary.com'
               }

    try:
        req = request.Request(url, headers=headers, method="GET")
        response = request.urlopen(req, timeout=10)
        html = response.read()

        soup = BeautifulSoup(html, 'html.parser')
        items = soup.find_all("p")        #label "p"
        for item in items:
            new = item.get_text().replace('\n', ' ').replace('\r', ' ')
            get_result.append(new)
    except Exception as e:
        print(url)
        print(e)
        get_result.append(repr(e))
    return get_result

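Crawling large amounts of data
1. Put the URLs to download into a list file (the file can be split into several smaller files beforehand);
2. Record the URLs already downloaded in a second list file;
3. After an error, compare the two files and remove the entries that appear in both;
4. Continue running the download program on the de-duplicated list file.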
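
Steps 3 and 4 could be sketched as follows. The function name `remaining_urls` and the one-URL-per-line file layout are assumptions made for illustration; the original post does not show this part of the workflow.

```python
def remaining_urls(todo_path, done_path):
    """Return URLs listed in todo_path that do not appear in done_path, in order."""
    try:
        with open(done_path, encoding="utf-8") as f:
            done = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        done = set()  # nothing has been downloaded yet
    with open(todo_path, encoding="utf-8") as f:
        return [u for u in (line.strip() for line in f) if u and u not in done]
```

Keeping the finished URLs in a set makes each membership check fast, which matters when the to-do list holds around two million entries.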
Upper limit on the number of elements in a Python list:
1. 32-bit Python: roughly 536870912 elements (about 2^29).
2. 64-bit Python: 1152921504606846975 elements (2^60 - 1).
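
These figures can be checked on the local interpreter. The sketch below assumes CPython, where a list holds an array of pointers and its length is capped at `sys.maxsize` divided by the pointer size; this is an implementation detail rather than a language guarantee, and available memory runs out long before the cap is reached.

```python
import struct
import sys

pointer_size = struct.calcsize("P")         # bytes per pointer: 4 on 32-bit, 8 on 64-bit builds
max_list_len = sys.maxsize // pointer_size  # theoretical ceiling on len(list) in CPython

print(pointer_size * 8, "bit build:", max_list_len)
```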

During testing I made a really silly mistake: I submitted the crawler script to run on the Linux cluster...
