Python: Scraping encyclopedia.thefreedic

2020-03-04  树懒吃糖_

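The page source shown in the browser's developer tools is formatted differently from the HTML the scraper downloads.
Many <p> tags that appear in the page source are missing from the HTML the crawler receives.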

Is it static text?

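The script currently processes about 10,000 search terms per 24 hours. Because of the time this takes, I trimmed the list and kept only 13,000 search terms.
Looking at the results of the first run, I found that many pages had not been captured with all of their '<p>' tags, and traced this to headers["User-Agent"]. After removing some of the entries in user_agents, the problem went away. I still do not understand why this happens, though.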
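As a quick check on the numbers above: at 10,000 terms per 24 hours, each query averages about 8.64 seconds, so 13,000 terms would take roughly 31 hours. The small sketch below just reproduces that arithmetic; the function names are made up for illustration and are not part of the original script.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_query(queries_per_day):
    """Average wall-clock time spent on one query at the given daily rate."""
    return SECONDS_PER_DAY / queries_per_day


def hours_needed(total_queries, queries_per_day):
    """Hours required to finish total_queries at the given daily rate."""
    return total_queries / queries_per_day * 24


if __name__ == "__main__":
    rate = 10_000  # observed throughput: queries per 24 hours
    print(f"{seconds_per_query(rate):.2f} s per query")
    print(f"{hours_needed(13_000, rate):.1f} h for 13,000 queries")
```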
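One way to narrow down which User-Agent strings are responsible is to request the same page once per string and count the <p> tags in each response. The following is only a sketch under assumptions: it uses the standard library instead of whatever HTTP client the original script used, and the helper names (`count_p_tags`, `fetch_html`, `compare_user_agents`) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphCounter(HTMLParser):
    """Counts opening <p> tags while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "p":
            self.count += 1


def count_p_tags(html):
    """Return the number of <p> start tags found in the HTML string."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def fetch_html(url, user_agent, timeout=10):
    """Download url with the given User-Agent and return the decoded body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def compare_user_agents(url, user_agents, fetch=fetch_html):
    """Map each User-Agent string to the <p> count of the page it receives."""
    return {ua: count_p_tags(fetch(url, ua)) for ua in user_agents}
```

Calling `compare_user_agents(url, user_agents)` on one entry page with the original five strings should show whether some of them consistently receive fewer <p> tags. The `fetch` parameter can be swapped for a function that also sends the cookie and the other headers, if those turn out to matter.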

import random

# Original pool of User-Agent strings; one is picked at random for each request.
user_agents = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
]

user_agent = random.choice(user_agents)
# Request headers; the cookie was copied from a browser session.
headers = {
    "Host": 'encyclopedia2.thefreedictionary.com',
    "User-Agent": user_agent,
    "Connection": 'Keep-alive',
    "Referer": 'https://encyclopedia2.thefreedictionary.com',
    "Cookie": '_ga=GA1.2.458496055.1563949704; _pubcid=4A31DAFF-C3BC-4277-9E5E-64DD665D9979; c11=guid=07/24/2019 02:28|cn.bing.com%252f|07/24/2019 02:28|02/27/2020 21:12; _ga=GA1.3.458496055.1563949704; c01=track=1&brain=60&2.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0&3.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1&6.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0&2.0=0,0,0,2,1,2,2,1,1,1,1,1,1,2,2,2,2,2,2&3.0=0,0,0,0,2,2,2,2,2,2,2,1,2,2,0,0,0,0,0,2,2,2&5.0=0,0,3,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,3,3,3&6.0=0,0,0,2,1,2,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,0,0,0,0; __gads=ID=753af11087b3965b:T=1582855970:S=ALNI_MaBuflXrpTwFl_eZZD39mZvbPB-IQ; _gid=GA1.3.1548116447.1583039917; _gid=GA1.2.587480635.1583289807',
}

"""
After removing some of the user_agents, the list becomes:
"""
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
]
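Since the cause is still unclear, a defensive option is to verify each response and retry with a different User-Agent when the page comes back with too few <p> tags. This is a rough sketch, not the script used for the run above: `build_headers`, `fetch_with_retry`, the `min_paragraphs` threshold, and the caller-supplied `fetch` callable are all assumptions made for illustration, and the cookie is left out for brevity.

```python
import random
import re

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
]

# Matches "<p" only when followed by whitespace, ">" or "/", so <pre> is skipped.
P_TAG = re.compile(r"<p(?=[\s>/])", re.IGNORECASE)


def count_p_tags(html):
    """Rough count of <p> start tags using a regular expression."""
    return len(P_TAG.findall(html))


def build_headers(user_agent):
    """Headers from the post, minus the cookie, for one User-Agent."""
    return {
        "Host": "encyclopedia2.thefreedictionary.com",
        "User-Agent": user_agent,
        "Connection": "Keep-alive",
        "Referer": "https://encyclopedia2.thefreedictionary.com",
    }


def fetch_with_retry(url, fetch, user_agents=USER_AGENTS, min_paragraphs=1):
    """Try user agents in random order until a page has enough <p> tags.

    fetch is a caller-supplied callable taking (url, headers) and returning
    the page HTML as a string. Returns None if every attempt looks truncated.
    """
    candidates = list(user_agents)
    random.shuffle(candidates)
    for user_agent in candidates:
        html = fetch(url, build_headers(user_agent))
        if count_p_tags(html) >= min_paragraphs:
            return html
    return None
```

With the `requests` library installed, `fetch` could be as simple as `lambda url, headers: requests.get(url, headers=headers).text`. A suitable value for `min_paragraphs` would have to be chosen by looking at a few known-good pages.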