Web Scraping with Python, Part 1

2017-03-22 枫灬叶

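Book: "Web Scraping with Python" (《用python写网络爬虫》). These are the notes I take as I read, to help me learn. If you spot any mistakes, corrections are welcome.
This post draws on http://blog.csdn.net/u014134180/article/details/55506864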

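1. Background research

1.1 Checking robots.txt

Most websites define a robots.txt file that tells crawlers which restrictions apply when crawling the site. Although these restrictions are only advisory, a well-behaved crawler should follow them.

To view the file, append /robots.txt to the site's root URL and open it in a browser.
Example: http://www.jianshu.com/robots.txt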

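1.2 Examining the sitemap

The Sitemap file that a website provides helps crawlers locate the site's newest content.
Its location is usually given in robots.txt.

Example: http://example.webscraping.com/sitemap.xml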

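1.3 Estimating the size of a website

Searching with the site keyword gives a rough estimate of how many pages a search engine has indexed.
For our example website, search Google for site:example.webscraping.com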

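1.4 Identifying the technology used by a website

Use the builtwith module:

import builtwith
# parse() returns a dict that maps technology categories to what was detected
strd = builtwith.parse('http://www.jianshu.com')
print(strd)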

1.5 Finding the owner of a website

Use the whois module:

import whois
print(whois.whois('jianshu.com'))

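2. Crawling your first website

To scrape a website, we first need to download the web pages that contain the data of interest, a process known as crawling.
Common approaches to crawling a website:

Crawling a sitemap
Iterating over the database ID of each web page
Following web page links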

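2.1 Downloading a web page

We download pages with the requests module.
For an introduction to requests, see
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
Downloading a web page:

import requests

def download(url):
    """Download the page at url and return its HTML as text."""
    print('Downloading:', url)
    html = requests.get(url).text
    return html

# print(download('https://www.jianshu.com'))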

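1. Retrying downloads
When an overloaded server returns a 503 Service Unavailable error, we can retry the download. A 404 Not Found error, on the other hand, means the page does not currently exist, so repeating the request will not produce a different result. The version below retries only on 5xx errors and returns None when the download fails.

import requests

def download(url, num_retries=2):
    """Download url, retrying up to num_retries times on 5xx errors."""
    response = requests.get(url)
    if 500 <= response.status_code < 600 and num_retries > 0:
        # a server error may be temporary, so try again
        return download(url, num_retries - 1)
    if response.status_code >= 400:
        # 4xx errors are permanent, and the 5xx retries are used up
        return None
    return response.text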
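
The retry rule can be tried without touching the network. The following is a minimal sketch, not the article's code: `download_with_retries`, `make_flaky_fetch`, and the `fetch` callable (which returns a status code and a body) are hypothetical stand-ins for `requests.get`, and reporting a failed download as None is an assumption of this sketch.

```python
def download_with_retries(url, fetch, num_retries=2):
    """Return the page body, or None if the download ultimately fails.

    fetch is a hypothetical callable returning (status_code, body); in the
    article's code this role is played by requests.get.
    """
    status, body = fetch(url)
    if 500 <= status < 600 and num_retries > 0:
        # 5xx: the server may recover, so try again with one fewer retry
        return download_with_retries(url, fetch, num_retries - 1)
    if status >= 400:
        # 4xx, or 5xx with no retries left: give up
        return None
    return body


def make_flaky_fetch(failures):
    """Build a fake fetch that returns 503 `failures` times, then 200."""
    calls = {'count': 0}

    def fetch(url):
        calls['count'] += 1
        if calls['count'] <= failures:
            return 503, ''
        return 200, '<html>ok</html>'

    return fetch, calls


fetch, calls = make_flaky_fetch(failures=2)
result = download_with_retries('http://example.webscraping.com', fetch)
print(result, calls['count'])
```

With two failures and two retries the third attempt succeeds. A server that fails three times, or any 404 response, produces None.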

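2. Setting a user agent
Some websites block the default user agent (requests identifies itself as python-requests followed by its version number), so we set our own.

import requests

def download(url, user_agent='jians', num_retries=2):
    """Download url with a custom User-Agent, retrying on 5xx errors."""
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    if 500 <= response.status_code < 600 and num_retries > 0:
        # pass user_agent along so that it is kept on the retry
        return download(url, user_agent, num_retries - 1)
    if response.status_code >= 400:
        return None
    return response.text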

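2.2 Sitemap crawler

In our first simple crawler, we use the sitemap discovered in the example website's robots.txt file to download all of its pages. To parse the sitemap, we use a simple regular expression to extract the URLs from the <loc> tags. The next chapter introduces a more robust parsing approach: CSS selectors.

import re

# Sitemap crawler; the site must provide a sitemap.xml file
def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        print(link)
        html = download(link)
        # scrape html here

# Calling crawl_sitemap:
# url_str = "http://example.webscraping.com/sitemap.xml"
# crawl_sitemap(url_str)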
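
To see what the regular expression extracts, here is a small offline sketch. The sitemap content and its two URLs are hypothetical, written in the shape of a standard sitemap.xml file, and `extract_sitemap_links` is an illustrative helper rather than part of the crawler above.

```python
import re

# Hypothetical sitemap content, shaped like a standard sitemap.xml file
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>"""


def extract_sitemap_links(sitemap):
    """Return the URLs found inside <loc> tags, in document order."""
    return re.findall(r'<loc>(.*?)</loc>', sitemap)


links = extract_sitemap_links(SAMPLE_SITEMAP)
for link in links:
    print(link)
```

The non-greedy `(.*?)` stops at the first closing tag, so each `<loc>` element yields exactly one URL even when several appear on the same line.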

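2.3 ID iteration crawler

The example site's country pages have URLs of the form http://example.webscraping.com/view/-1 that differ only in their numeric suffix, so we can download all of the country pages by iterating over the IDs.

def id_by_download():
    # The crawler stops only after 5 consecutive download errors, which greatly
    # reduces the risk of stopping too early at a deleted record
    max_errors = 5
    num_errors = 0
    for page in range(20):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)  # expected to return None when the download fails
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break
        else:
            num_errors = 0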
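
The consecutive-error rule is easier to follow on a fake site. This is a sketch under assumptions: `iterate_ids` and `fake_download` are hypothetical names, the site is simulated with a set of existing IDs, and `itertools.count` replaces the fixed `range(20)` so that only the error rule ends the loop.

```python
import itertools


def iterate_ids(download, max_errors=5):
    """Download pages by ID until max_errors consecutive downloads fail.

    download is a callable returning the page HTML, or None on failure.
    """
    pages = {}
    num_errors = 0
    for page_id in itertools.count(1):
        html = download(page_id)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # assume we have run past the last record
        else:
            num_errors = 0  # a success resets the streak
            pages[page_id] = html
    return pages


# Hypothetical site: IDs 1 to 10 exist, except that record 4 was deleted
existing = set(range(1, 11)) - {4}
attempted = []


def fake_download(page_id):
    attempted.append(page_id)
    return '<html>%d</html>' % page_id if page_id in existing else None


pages = iterate_ids(fake_download)
print(sorted(pages))
print(attempted[-1])
```

The single gap at ID 4 does not stop the crawl, because the success at ID 5 resets the counter. The loop ends only after IDs 11 to 15 all fail.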

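Limitations: iterating over IDs is a convenient way to crawl a website, but like the sitemap crawler it is not guaranteed to be available. For example, some websites check whether the page slug matches what is expected and return a 404 Not Found error if it does not. Other websites use large non-sequential numbers as IDs, or do not use numeric IDs at all, in which case iteration is of little use.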

2.4 Link crawler

By following every link, we can easily download the pages of an entire website. However, this approach downloads many pages that we do not need, so we use a regular expression to decide which links to follow.

import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

# link_crawler('http://example.webscraping.com/', '/(index|view)')
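
The link extraction and the filtering step can be checked on a page fragment. The HTML below is hypothetical. The sketch repeats `get_links` so that it runs on its own, and it applies the same `/(index|view)` pattern used in the commented-out call above.

```python
import re


def get_links(html):
    """Return the href value of every <a> tag in html."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


# Hypothetical page fragment with three links
SAMPLE_HTML = '''
<a href="/index/1">Next</a>
<a class="country" href="/view/Afghanistan-1">Afghanistan</a>
<A HREF='/user/login'>Log in</A>
'''

links = get_links(SAMPLE_HTML)
followed = [link for link in links if re.match('/(index|view)', link)]
print(links)
print(followed)
```

All three links are extracted, including the upper-case tag with single quotes, but only the index and view links pass the filter.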

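Because /index/1 is a relative link, a browser can resolve it, but our download function has no context to resolve it against. We can convert it to an absolute link with urljoin from the urlparse module (urllib.parse in Python 3). Modify link_crawler and add the import:

from urllib.parse import urljoin
import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the relative link to an absolute one
                link = urljoin(seed_url, link)
                crawl_queue.append(link)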
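
As a quick illustration of what urljoin does with the links the crawler meets, here is a short sketch using the Python 3 import path. The second URL is a made-up example of a link that is already absolute.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

seed_url = 'http://example.webscraping.com/'

# A root-relative link is resolved against the seed's scheme and host
relative = urljoin(seed_url, '/index/1')
# A link that is already absolute is returned unchanged
absolute = urljoin(seed_url, 'http://example.webscraping.com/view/1')

print(relative)
print(absolute)
```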

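To avoid crawling the same link repeatedly, we need to keep track of which links have already been discovered. Below is the updated link_crawler function, which now stores the URLs it has found and so avoids downloading duplicates.
Modify the link_crawler function:

from urllib.parse import urljoin
import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    # record the link so that it is queued only once
                    seen.add(link)
                    crawl_queue.append(link)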
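
The effect of the seen set can be demonstrated on a tiny in-memory site. This is a sketch under assumptions: the `SITE` dictionary, the `crawl` function, and the `get_page_links` callable are hypothetical stand-ins for real downloads and for `get_links`.

```python
import re
from urllib.parse import urljoin

# Hypothetical three-page site whose pages all link to each other
SITE = {
    'http://example.webscraping.com/': ['/index/1', '/view/1'],
    'http://example.webscraping.com/index/1': ['/', '/view/1'],
    'http://example.webscraping.com/view/1': ['/index/1'],
}


def crawl(seed_url, link_regex, get_page_links):
    """Return the URLs downloaded, visiting each discovered URL once."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    downloaded = []
    while crawl_queue:
        url = crawl_queue.pop()
        downloaded.append(url)  # stands in for download(url)
        for link in get_page_links(url):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)  # remember it so it is queued only once
                    crawl_queue.append(link)
    return downloaded


downloaded = crawl('http://example.webscraping.com/', '/(index|view)',
                   lambda url: SITE.get(url, []))
print(downloaded)
```

Although the pages link to each other in a cycle, each one is downloaded exactly once. Without the `seen.add(link)` line the loop would never terminate.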
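
### Advanced features
#### Parsing robots.txt
The robotparser module first loads the robots.txt file, and its can_fetch() function then reports whether the given user agent is allowed to access a page. We parse robots.txt with the robotparser module that ships with Python (urllib.robotparser in Python 3).

from urllib import robotparser
from urllib.parse import urljoin

def get_robots(url):
    """Load and parse the robots.txt file of the site at url"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()
    return rp

url = 'http://example.webscraping.com'
rp = get_robots(url)
user_agent = 'GoodCrawler'
print(rp.can_fetch(user_agent, url))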
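
To see how can_fetch() behaves for different user agents without fetching anything, the parser can be fed rules from a string. The robots.txt content below is hypothetical, and the sketch uses the Python 3 module path `urllib.robotparser`.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from a string instead of fetched
SAMPLE_ROBOTS = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

url = 'http://example.webscraping.com'
bad_allowed = rp.can_fetch('BadCrawler', url)
good_allowed = rp.can_fetch('GoodCrawler', url)
trap_allowed = rp.can_fetch('GoodCrawler', url + '/trap')
print(bad_allowed, good_allowed, trap_allowed)
```

BadCrawler is blocked from the whole site, while any other user agent falls under the `*` section and is blocked only from /trap.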

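#### Throttling downloads
If we crawl a website too quickly, we risk being blocked or overloading the server. To reduce these risks, we can throttle the crawler by adding a delay between two downloads. Below is the code of a class that implements this.

import datetime
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            elapsed = (datetime.datetime.now() - last_accessed).total_seconds()
            sleep_secs = self.delay - elapsed
            if sleep_secs > 0:
                # this domain was accessed recently, so sleep
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.datetime.now()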

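The Throttle class records when each domain was last accessed and sleeps if the time since that access is shorter than the specified delay. We can throttle the crawler by calling Throttle before every download.

throttle = Throttle(delay)
# ... then, before each download:
throttle.wait(url)
html = download(url, user_agent, num_retries=num_retries)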
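
The throttling behaviour can be observed without actually waiting by injecting a fake clock. This is a sketch and not the class from the book: the `clock` and `sleep` parameters and the `fake_sleep` helper are hypothetical additions made only so that the example runs instantly.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Add a delay between downloads to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock    # injectable so the example runs instantly
        self.sleep = sleep
        self.last_accessed = {}  # domain -> time of the last download

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            sleep_secs = self.delay - (self.clock() - last)
            if sleep_secs > 0:
                self.sleep(sleep_secs)
        self.last_accessed[domain] = self.clock()


# Fake clock: time only advances when the throttle "sleeps"
now = [0.0]
naps = []


def fake_sleep(seconds):
    naps.append(seconds)
    now[0] += seconds


throttle = Throttle(5, clock=lambda: now[0], sleep=fake_sleep)
throttle.wait('http://example.webscraping.com/index/1')  # first visit: no wait
throttle.wait('http://example.webscraping.com/index/2')  # same domain: waits
throttle.wait('http://www.jianshu.com/')                 # new domain: no wait
print(naps)
```

Only the second request to the same domain triggers a sleep, and it lasts the full delay because no time has passed on the fake clock. With the default arguments the class uses the real clock and `time.sleep`.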

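#### Avoiding spider traps
At present, our crawler follows every link that it has not seen before. However, some websites generate page content dynamically and so can have an infinite number of pages. For example, if a website has an online calendar with links to the next month and the next year, then the next month's page also links to the month after it, and the pages go on without end. This situation is known as a spider trap.
A simple way to avoid getting stuck in a spider trap is to track how many links were followed to reach the current page, which we call the depth. Once the maximum depth is reached, the crawler no longer adds links from that page to the queue. To implement this, we change the seen variable: it previously recorded only the links that had been visited, and it now becomes a dictionary that also records the depth of each page.
Modify the link_crawler function. This version also skips URLs that robots.txt disallows, using the get_robots function defined earlier:

import re
from urllib.parse import urljoin

def link_crawler(seed_url, link_regex, user_agent='GoodCrawler', max_depth=2):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    # map each discovered URL to its depth (links followed from the seed)
    seen = {seed_url: 0}
    rp = get_robots(seed_url)
    while crawl_queue:
        url = crawl_queue.pop()
        # only download pages that robots.txt allows for this user agent
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent)
            if html is None:
                continue  # skip pages that failed to download
            depth = seen[url]
            if depth != max_depth:
                for link in get_links(html):
                    if re.match(link_regex, link):
                        link = urljoin(seed_url, link)
                        if link not in seen:
                            seen[link] = depth + 1
                            crawl_queue.append(link)
        else:
            print('Blocked by robots.txt:', url)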
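
The depth limit can be tested against a simulated spider trap. This is a sketch under assumptions: `calendar_links` is a hypothetical stand-in for a calendar site whose every page links to the next month, and `crawl` is a stripped-down version of the crawler with no downloads and no robots.txt check.

```python
from urllib.parse import urljoin


def calendar_links(url):
    """Hypothetical spider trap: every month page links to the next month."""
    month = int(url.rsplit('/', 1)[1])
    return ['/calendar/%d' % (month + 1)]


def crawl(seed_url, get_page_links, max_depth=2):
    """Return the URLs downloaded, never following links past max_depth."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}  # URL -> number of links followed to reach it
    downloaded = []
    while crawl_queue:
        url = crawl_queue.pop()
        downloaded.append(url)  # stands in for download(url)
        depth = seen[url]
        if depth != max_depth:
            for link in get_page_links(url):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return downloaded


downloaded = crawl('http://example.webscraping.com/calendar/1', calendar_links)
print(downloaded)
```

Even though the simulated site has no last page, the crawl stops after the seed and the two pages reachable within two links of it.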

#### Final version
The complete source code of this advanced link crawler can be downloaded from https://bitbucket.org/wswp/code/src/tip/chapter01/link_crawler3.py