Python: Scraping encyclopedia.thefreedic
2020-03-04
树懒吃糖_
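The page source shown in the browser's developer tools differs from what the crawler downloads.
Many <p> tags that are present in the page source are missing from the HTML the crawler receives.
Static text?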
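One way to quantify the mismatch described above is to count the <p> tags in both versions of a page. The following is a minimal sketch using only the standard library; the names `ParagraphCounter`, `count_paragraphs`, `browser_html` and `crawler_html` are hypothetical, and the two HTML strings are made-up stand-ins for the source copied from the browser and the text returned to the crawler.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts opening <p> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "p":
            self.count += 1


def count_paragraphs(html):
    """Return the number of <p> start tags found in `html`."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical inputs: replace with the browser's page source and the
# crawler's response for the same URL.
browser_html = "<div><p>first</p><p>second</p><p>third</p></div>"
crawler_html = "<div><p>first</p></div>"

missing = count_paragraphs(browser_html) - count_paragraphs(crawler_html)
print(missing)
```

A non-zero difference for the same URL flags a page that was not captured completely.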
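The script currently retrieves about 10000 entries per 24 h. Given the time available, I trimmed the list of search terms and kept only 13000 of them.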
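The arithmetic behind those figures can be checked quickly. This is only a sketch that assumes the rate stays constant at the measured value; the variable names are hypothetical.

```python
RATE_PER_DAY = 10_000    # entries retrieved per 24 h (measured rate from the text)
SELECTED_TERMS = 13_000  # search terms that were kept

# Average time spent on one entry at that rate.
seconds_per_entry = 24 * 3600 / RATE_PER_DAY

# Total running time needed for the selected terms.
hours_for_selected = SELECTED_TERMS / RATE_PER_DAY * 24

print(f"{seconds_per_entry:.2f} s per entry")
print(f"{hours_for_selected:.1f} h for {SELECTED_TERMS} terms")
```

At this rate one entry takes a little under 9 seconds on average, and the 13000 terms need a bit more than one full day.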
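Analyzing the results of the first run, I found that many of them did not contain all of the '<p>' tags. The cause turned out to be headers["User-Agent"]: after some of the entries in user_agents were changed, the problem went away. I still do not understand why this happens.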
import random

# Initial pool of User-Agent strings; one is chosen at random.
user_agents = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
               'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
               'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
               'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',]
user_agent = random.choice(user_agents)
headers = {"Host": 'encyclopedia2.thefreedictionary.com',
"User-Agent": user_agent,
"Connection": 'Keep-alive',
"Referer": 'https://encyclopedia2.thefreedictionary.com',
"Cookie": '_ga=GA1.2.458496055.1563949704; _pubcid=4A31DAFF-C3BC-4277-9E5E-64DD665D9979; c11=guid=07/24/2019 02:28|cn.bing.com%252f|07/24/2019 02:28|02/27/2020 21:12; _ga=GA1.3.458496055.1563949704; c01=track=1&brain=60&2.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0&3.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1&6.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0&2.0=0,0,0,2,1,2,2,1,1,1,1,1,1,2,2,2,2,2,2&3.0=0,0,0,0,2,2,2,2,2,2,2,1,2,2,0,0,0,0,0,2,2,2&5.0=0,0,3,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,3,3,3&6.0=0,0,0,2,1,2,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,0,0,0,0; __gads=ID=753af11087b3965b:T=1582855970:S=ALNI_MaBuflXrpTwFl_eZZD39mZvbPB-IQ; _gid=GA1.3.1548116447.1583039917; _gid=GA1.2.587480635.1583289807',
}
"""
Some of the user_agents were removed; the list is now:
"""
user_agents = ['Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
]
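
To find out which User-Agent strings lead to incomplete pages, each one can be tried on the same URL and the resulting <p> counts compared. The sketch below is an assumption about how such a check could look, not the script used for the run above: it uses the standard library's `urllib.request` (the original does not say which HTTP client was used), and `build_request`, `fetch`, `paragraphs_per_agent` and `fake_fetch` are hypothetical names. The demo at the end uses a stub with made-up responses so that it runs offline.

```python
import re
import urllib.request

# Matches <p> and <p ...>, but not <pre>, <param>, etc.
P_TAG = re.compile(r"<p[\s>]", re.IGNORECASE)


def build_request(url, user_agent):
    """Build a GET request that carries one fixed User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent):
    """Download a page as text (performs real network I/O)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def paragraphs_per_agent(url, user_agents, fetch_page=fetch):
    """Map each User-Agent to the number of <p> tags in the page it receives."""
    return {ua: len(P_TAG.findall(fetch_page(url, ua))) for ua in user_agents}


def fake_fetch(url, user_agent):
    # Offline stand-in for `fetch` with made-up responses, for demonstration only.
    if user_agent == "agent-a":
        return "<div><p>one</p><p class='x'>two</p></div>"
    return "<div><p>one</p></div>"


counts = paragraphs_per_agent("https://example.invalid/page",
                              ["agent-a", "agent-b"],
                              fetch_page=fake_fetch)
print(counts)
```

With the default `fetch_page`, the same call would hit the real site once per User-Agent, so it is best run on a handful of URLs only. Entries whose count is clearly lower than the others are candidates for removal from the list.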