The First Step of a Python Crawler: Downloading a Web Page

2018-02-01 海见

Before we can scrape a web page, we first have to download it. We use Python's urllib2 module to download a URL:

```
import urllib2

def download(url):
    # Fetch the page at url and return its raw HTML
    return urllib2.urlopen(url).read()
```

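Given a URL, this function downloads the web page and returns its HTML. The snippet has a problem, though: while downloading we may hit errors beyond our control, for example when the requested page does not exist. In that case urllib2 raises an exception and the script exits. To be safe, here is a more robust version that catches these exceptions: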

```
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        # Raised for missing pages, unreachable hosts and similar failures
        print 'Download error:', e.reason
        html = None
    return html
```

Now, when a download error occurs, the function catches the exception and returns None.
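The listings in this article target Python 2, where urllib2 lives. For readers on Python 3, the same idea can be sketched with `urllib.request` and `urllib.error`, the modules that replaced urllib2. This port is an assumption added for illustration, not part of the original code, and it returns bytes rather than a str.

```python
import urllib.error
import urllib.request


def download(url):
    """Return the body of url as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read()
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so missing pages land here too
        print('Download error:', e.reason)
        html = None
    return html
```

Callers should check the result for None before using it, just as with the Python 2 version.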

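Errors encountered while downloading are often temporary, such as the 503 error a server returns when it is overloaded.

The Internet Engineering Task Force defines the full list of HTTP errors; see https://tools.ietf.org/html/rfc7231#section-6. From that document we learn that 4xx errors occur when something is wrong with the request, while 5xx errors occur when something is wrong on the server side. So we only need to make sure the download function retries when a 5xx error occurs: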

```
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            # Only HTTPError has a code; retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html
```

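Now, when the download function encounters a 5xx error code, it calls itself recursively to retry. The function also takes an additional parameter that sets the number of retries, which defaults to 2.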
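The retry rule can also be written as a loop instead of recursion. The following is a minimal Python 3 sketch under stated assumptions: `is_server_error` and `download_with_retries` are hypothetical names, and `fetch` is a hypothetical callable standing in for the actual network call, so the retry logic can be exercised without touching the network.

```python
import urllib.error


def is_server_error(code):
    """True for 5xx status codes, which signal a server-side problem."""
    return code is not None and 500 <= code < 600


def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url), retrying up to num_retries times on 5xx errors.

    fetch is a hypothetical callable that returns the page body or raises
    urllib.error.URLError; in real use it would wrap urlopen(url).read().
    """
    for _ in range(num_retries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            print('Download error:', e.reason)
            # Only HTTPError carries a status code; plain URLError does not
            code = getattr(e, 'code', None)
            if not is_server_error(code):
                break  # 4xx or network-level failure: do not retry
    return None
```

With the default of 2 retries, a page is attempted at most three times, which matches the recursive version above. A 404 stops after the first attempt.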

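By default, urllib2 downloads pages with the user agent Python-urllib/2.7. An identifiable user agent is preferable, because it helps avoid some of the problems a crawler can run into. To make downloads more reliable, we need to control the user agent. The code below modifies the download function to set a default user agent of "wswp" (short for "Web Scraping with Python"):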

```
import urllib2

def download(url, user_agent="wswp", num_retries=2):
    print("Downloading: " + url)
    headers = {"User-agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        # e.reason is not always a string, so convert before concatenating
        print("Download error: " + str(e.reason))
        html = None
        if num_retries > 0:
            # Retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html
```
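To see what the header setup does without sending anything, the request object can be built and inspected on its own. This is a small Python 3 sketch using `urllib.request.Request`; the helper name `build_request` and the URL are assumptions made for illustration.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Return a Request for url that carries a custom User-Agent header."""
    headers = {'User-Agent': user_agent}
    return urllib.request.Request(url, headers=headers)


request = build_request('http://example.com')
# Request stores header names capitalized, so the key is 'User-agent'
print(request.get_header('User-agent'))
```

Passing the resulting object to `urlopen` in place of the bare URL sends the custom user agent with the request.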

Finally, if we want to save the downloaded page, we can use Python's file operations. The complete code is as follows:

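```
import urllib2

def download(url, user_agent="wswp", num_retries=2):
    print("Downloading: " + url)
    headers = {"User-agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        # e.reason is not always a string, so convert before concatenating
        print("Download error: " + str(e.reason))
        html = None
        if num_retries > 0:
            # Retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html

down = download("https://www.jianshu.com/")
# download() returns None on failure, so only write when we got a page
if down is not None:
    # Overwrite rather than append, so reruns do not duplicate the page
    with open('jianshu.html', "wb") as htmlFile:
        htmlFile.write(down)
```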
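The saving step can be isolated into a small helper that copes with a failed download. The following is a Python 3 sketch; `save_html` is a hypothetical name, and it assumes the page body arrives as bytes, or as None when the download failed.

```python
def save_html(html, path):
    """Write html (bytes) to path; return True if something was written."""
    if html is None:
        # A failed download is signalled with None, so there is nothing to save
        return False
    # 'wb' overwrites any earlier copy and writes the bytes unchanged
    with open(path, 'wb') as html_file:
        html_file.write(html)
    return True
```

Writing in binary mode avoids guessing the page's character encoding at save time.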

Pretty simple, isn't it?
