A Beginner's Introduction to Web Scraping with Python 2.7.x on Windows

2017-03-20, by 西瓜源又源

I started learning to write web scrapers in Python last week, and these are my notes.

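1. The library: I use urllib2, which ships with Python. urllib2.urlopen() opens a page directly, and calling read() on the response returns the page's HTML source, which can then be printed. For exception handling, hasattr(e, 'code') tells the error types apart so that the error code or the failure reason can be printed.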

    import urllib2
    from urllib2 import URLError

    class gethsdes:  # get information from hsdes

        def gethtml(self, url):
            # Return the HTML source of url, or None if the request fails.
            #req = urllib2.Request(url)
            try:
                response = urllib2.urlopen(url)
            except URLError as e:
                if hasattr(e, 'code'):
                    # HTTPError: the server replied with an error status.
                    print "The server couldn't fulfill the request."
                    print 'Error code: ', e.code
                elif hasattr(e, 'reason'):
                    # Plain URLError: the server could not be reached.
                    print 'We failed to reach a server.'
                    print 'Reason: ', e.reason
                return None
            else:
                # No exception was raised, so the response can be read.
                #print response.geturl()
                #print response.info()
                html = response.read()
                return html
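The hasattr check works because HTTPError is a subclass of URLError, so one except clause catches both and the 'code' attribute tells them apart. Below is a minimal sketch of just that check, which runs without network access. The helper name describe_error and the example.com URL are my own placeholders, not part of the original code, and the try/except import is only there so the sketch also runs on Python 3.

```python
try:
    from urllib2 import HTTPError, URLError  # Python 2.7, as in these notes
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 equivalent


def describe_error(e):
    """Classify an exception the same way gethtml() does (hypothetical helper)."""
    if hasattr(e, 'code'):
        # Only HTTPError carries a status code.
        return "HTTP error %d" % e.code
    elif hasattr(e, 'reason'):
        # A plain URLError only carries the reason for the failure.
        return "connection failed: %s" % (e.reason,)
    return "unknown error"


# Build the two exception types by hand instead of making real requests.
http_error = HTTPError("http://example.com", 404, "Not Found", {}, None)
url_error = URLError("timed out")

print(describe_error(http_error))
print(describe_error(url_error))
```

The order of the two checks matters: 'code' has to be tested first, since an HTTPError may also expose a 'reason'.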

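2. Proxy: When I ran the code on my company's network, access was blocked and the request failed with error code 10060 (the Windows socket error for a connection timeout). The fix is to pass the proxy address to urllib2.ProxyHandler, build an opener from it with urllib2.build_opener(), and register that opener with urllib2.install_opener(). Once installed, the opener is used by every later urllib2.urlopen() call in the program, so the proxy does not have to be set again anywhere else in the code.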

    import urllib2

    class Proxy:

        def __init__(self):
            # YYY is the port number
            proxy_support = urllib2.ProxyHandler({"http": "http://XXX:YYY"})
            opener = urllib2.build_opener(proxy_support)
            # Make this opener the default for all later urllib2.urlopen() calls
            urllib2.install_opener(opener)
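If changing the default opener for the whole program is not wanted, the opener can also be kept in a variable and used directly through its open() method. The following is a rough sketch of that alternative, not what my code above does. The function name make_proxy_opener and the address proxy.example.com:8080 are placeholders I made up, and the try/except import only exists so the sketch also runs on Python 3.

```python
try:
    import urllib2 as request_lib  # Python 2.7, as in these notes
except ImportError:
    import urllib.request as request_lib  # Python 3 equivalent


def make_proxy_opener(proxy_url):
    """Return an opener that sends http requests through proxy_url."""
    proxy_support = request_lib.ProxyHandler({"http": proxy_url})
    return request_lib.build_opener(proxy_support)


# Hypothetical proxy address; replace it with your own host and port.
opener = make_proxy_opener("http://proxy.example.com:8080")

# Only requests made through this opener use the proxy.
# The call below needs network access, so it is left commented out.
# html = opener.open("http://example.com/").read()
```

Note that the dictionary only has an "http" key, so only http URLs are routed through this proxy.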

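3. Some HTML background

nobr tag: prevents line breaks in its content.

tr and td tags: the rows of a table and the cells within each row.

div tag: marks a section of the page. Its role attribute states what the section is for; it has no visible effect and mainly makes the markup easier to read. The style attribute sets inline CSS for the element.

target attribute: links that share the same target value open in the same window.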
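To see how tr and td show up when scraping, here is a small sketch that pulls the cell text out of a table using the standard library's HTMLParser. This is only an illustration and not part of my scraper; the class name TableParser, the function parse_table, and the sample HTML are all made up for the example.

```python
try:
    from HTMLParser import HTMLParser  # Python 2.7, as in these notes
except ImportError:
    from html.parser import HTMLParser  # Python 3 equivalent


class TableParser(HTMLParser):
    """Collect the text of every td cell, grouped by tr row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])          # a new row starts
        elif tag == 'td' and self.rows:
            self.rows[-1].append('')      # a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as nobr still belongs to the cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


sample = ('<table><tr><td>1</td><td><nobr>open bug</nobr></td></tr>'
          '<tr><td>2</td><td>closed</td></tr></table>')
rows = parse_table(sample)
print(rows)
```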

4. To be continued
