N Ways to Scrape a Simple Site (Part 2): requests and re

2017-06-29 · 周且南_laygin

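Here requests does the same job as urllib: fetching the source of a web page. Personally I find requests the more capable of the two; it is a third-party library, after all (opinions will differ). This post mainly records and shares how to match page content with regular expressions. In my own tests, regex matching also turned out to be somewhat faster than bs4.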

1 I was going to introduce regular expressions, but SHOW ME THE CODE

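The goal is to scrape the quotes, authors, and tags from the English quotes site Quotes to Scrape, returning a generator whose elements are dictionaries.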
import re

import requests


def getQuote(url):
    # Send a custom User-Agent with the request.
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'}
    try:
        resp = requests.get(url, headers=header)
        # One block per quote; re.S lets '.' match the newlines inside the div.
        quotes_p = re.compile(r'<div class="quote".*?>(.*?)</div>', re.S)
        text_p = re.compile(r'<span class="text".*?>(.*?)</span>')
        author_p = re.compile(r'<small class="author".*?>(.*?)</small>')
        # A quote can carry several tags, so this pattern is used with findall.
        tags_p = re.compile(r'<a class="tag".*?>(.*?)</a>')
        quotes = re.findall(quotes_p, resp.text)  # list of quote blocks
        for quote in quotes:
            # Build a fresh dict per quote so each yielded item is independent.
            quotedct = {}
            quotedct['text'] = re.search(text_p, quote).group(1)
            quotedct['author'] = re.search(author_p, quote).group(1)
            quotedct['tags'] = re.findall(tags_p, quote)
            yield quotedct
    except Exception as e:
        print(e)
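A side note on generators that yield dictionaries: if one dict is created before the loop and updated on every iteration, each yielded item is the same object. Printing items one at a time hides this, but collecting them with list() leaves only the last values. Below is a minimal sketch of the difference; the function names shared_dict_gen and fresh_dict_gen are hypothetical and exist only for this illustration.

```python
def shared_dict_gen(items):
    """Yield the same dict object every time, mutated in place."""
    record = {}  # created once, reused for every item
    for item in items:
        record['value'] = item
        yield record


def fresh_dict_gen(items):
    """Yield a new dict for every item."""
    for item in items:
        yield {'value': item}


shared = list(shared_dict_gen([1, 2, 3]))
fresh = list(fresh_dict_gen([1, 2, 3]))

print(shared)  # all entries are one object holding the last value
print(fresh)   # each entry keeps its own value
```

Creating the dict inside the loop avoids the problem whenever the results are stored rather than consumed immediately.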

2 Brief explanation

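quotes_p = re.compile(r'<div class="quote".*?>(.*?)</div>',re.S): inspecting the page source shows that everything we need sits inside a div whose class is quote. The re.S flag is there because this div spans several lines. By default . does not match a newline; with re.S it does, so the pattern can run across line breaks.
text_p = re.compile(r'<span class="text".*?>(.*?)</span>'): the text of each quote.
author_p = re.compile(r'<small class="author".*?>(.*?)</small>'): the author of each quote.
tags_p = re.compile(r'<a class="tag".*?>(.*?)</a>'): the tags of each quote. There can be several, so the final match uses findall, which returns a list.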
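To see the four patterns at work without touching the network, here is a small sketch that runs them on a hand-written HTML snippet. The snippet (SAMPLE_HTML) is an assumption, loosely modeled on one quote block rather than copied from the site, and it also shows what happens when re.S is left out.

```python
import re

# Hypothetical sample markup, loosely modeled on a single quote block.
SAMPLE_HTML = '''
<div class="quote" itemscope>
    <span class="text" itemprop="text">A sample quote.</span>
    <span>by <small class="author" itemprop="author">Some Author</small></span>
    <div class="tags">
        <a class="tag" href="/tag/humor/">humor</a>
        <a class="tag" href="/tag/life/">life</a>
    </div>
</div>
'''

quotes_p = re.compile(r'<div class="quote".*?>(.*?)</div>', re.S)
quotes_no_s = re.compile(r'<div class="quote".*?>(.*?)</div>')  # same pattern, no re.S
text_p = re.compile(r'<span class="text".*?>(.*?)</span>')
author_p = re.compile(r'<small class="author".*?>(.*?)</small>')
tags_p = re.compile(r'<a class="tag".*?>(.*?)</a>')

# With re.S the lazy group runs across newlines up to the first </div>.
block = quotes_p.search(SAMPLE_HTML).group(1)

# Without re.S, '.' stops at the first newline and nothing matches.
no_flag_match = quotes_no_s.search(SAMPLE_HTML)

text = text_p.search(block).group(1)
author = author_p.search(block).group(1)
tags = tags_p.findall(block)

print(no_flag_match)
print(text, author, tags)
```

Note that the lazy (.*?)</div> stops at the first closing div, which in this sample is the inner tags div; that is enough to capture the text, author, and tags.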

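1 Finally, all the results for a quote go into a dictionary, which is returned via yield.
2 To scrape every page, add one more parameter for the page number and build the URL from it.
3 These are fairly simple crawlers: the content we need is right there in the page, so little technique is required. If the data is wrapped in JSON, you have to capture the traffic to find the URL and then parse the response.
4 I use a custom header whenever I send a request; mostly force of habit.
5 Since I regularly need proxies, I put together a simple free proxy IP interface. It relies mainly on regular expressions, so I am posting it here. Few of the IPs actually work and they are not very stable (they are free, after all). Feel free to take a look: https://github.com/opconty/FreeProxyIP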
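Point 2 above could be sketched roughly as follows. The '/page/<n>/' URL suffix is an assumption about how the site paginates, and build_page_url and iter_page_urls are hypothetical helper names, so check the real pagination links before relying on this.

```python
def build_page_url(base_url, page):
    """Return the URL of one page, assuming a '/page/<n>/' suffix."""
    return '{}/page/{}/'.format(base_url.rstrip('/'), page)


def iter_page_urls(base_url, last_page):
    """Yield page URLs from page 1 through last_page inclusive."""
    for page in range(1, last_page + 1):
        yield build_page_url(base_url, page)


urls = list(iter_page_urls('http://quotes.toscrape.com/tag/humor/', 2))
for page_url in urls:
    print(page_url)
```

Each generated URL can then be passed to getQuote in turn.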

3 Results

The test code is as follows:
if __name__ == '__main__':
    import pprint

    url = 'http://quotes.toscrape.com/tag/humor/'
    quotedct = getQuote(url)
    # getQuote returns a generator; print each quote dict as it is produced.
    for quote in quotedct:
        pprint.pprint(quote)

results
