Web Crawler Study Notes 1

2017-04-13 l_b_n

Overall structure of the crawler:

1. parser (the parser)
2. down_loader (the downloader)
3. url_manager (the URL manager)
4. outputer (the writer)
5. spider_main (the "engine")

How the crawler runs:

[Figure: crawl flow]

1. root_url is the root address: the very first URL you want to crawl.

(For example, to crawl the Baidu Baike page for Python, root_url = "http://baike.baidu.com/item/python".)

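2.1 Given the root_url, call the add_new_url method of url_manager to add it to new_urls.

2.1.1 url_manager

[Figure: url_manager]

url_manager holds two sets, one new and one old. The new set drives the crawl: it is the pool of URLs still waiting to be crawled. We keep taking a URL out of it, crawling that page, and putting the URLs found on the page back into the pool. The old set keeps the URLs that have already been taken out, so the same page is not crawled twice.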
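The two-set idea can be sketched roughly as below. The method names (add_new_url, add_new_urls, has_new_url, get_new_url) are the ones the main program calls; the bodies and the attribute name old_urls are my assumptions, since the original url_manager code was only shown as a picture.

```python
class UrlManager(object):
    """Minimal sketch of a URL manager built on two sets (assumed layout)."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out (assumed attribute name)

    def add_new_url(self, url):
        # Ignore empty input and anything we have already seen.
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Accepts any iterable of URLs; None or empty input is a no-op.
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the waiting pool to the "already crawled" set.
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
```

Because both containers are sets, a URL that was already crawled is silently skipped when it turns up again on a later page. Note that set.pop() returns an arbitrary element, so the crawl order is not fixed.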

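2.2 A while loop keeps the crawl going; it stops when new_urls has no URLs left.

2.2.1 Get a URL from url_manager with get_new_url (see the code).

2.2.2 Download the page for that URL with the downloader.

[Figure: downloader]

The downloader uses urllib (Python 3.x has no urllib2; its functionality now lives in urllib.request and urllib.error) and simply returns the downloaded page.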
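A downloader along these lines might look like the sketch below. The class and method names (DownLoader, download) come from the main program; the timeout parameter and the status check are my assumptions, not the original code.

```python
import urllib.request


class DownLoader(object):
    """Minimal sketch of a downloader based on urllib.request."""

    def download(self, url, timeout=10):
        # timeout is an assumed extra parameter; the main program only passes url.
        if url is None:
            return None
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # HTTP responses carry a status code; other schemes may not have one.
            status = getattr(response, "status", None)
            if status is not None and status != 200:
                return None
            # Return the raw bytes of the page.
            return response.read()
```

urlopen already raises urllib.error.HTTPError for error statuses such as 404, which is why the main loop wraps the download in try/except.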

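2.2.3 Pass the downloaded content and its URL to the parser, which parses out the new URLs and the data we need

(for example the current URL, the parsed 'content' and the 'title').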

from bs4 import BeautifulSoup

import re

import urllib

[Figure: parser]

[Figure: how to use Beautiful Soup]

[Figure: regular expressions]
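To make the parse step concrete, here is a rough sketch of a parser that returns the new URLs and a data dict. It is not the original parser: to keep it runnable with only the standard library it uses html.parser.HTMLParser as a stand-in for Beautiful Soup. The /item/ link pattern, the use of the first h1 as the title, and the helper class _PageScanner are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed pattern: entry pages live under /item/, like the root_url.
ITEM_LINK = re.compile(r"^/item/")


class _PageScanner(HTMLParser):
    """Hypothetical helper: collects <a href> values and the first <h1> text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.title_parts = []
        self._in_h1 = False
        self._h1_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "h1" and not self._h1_done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._h1_done = True

    def handle_data(self, data):
        if self._in_h1:
            self.title_parts.append(data)


class Parser(object):
    def parse(self, page_url, content):
        if page_url is None or content is None:
            return set(), None
        if isinstance(content, bytes):
            content = content.decode("utf-8", errors="replace")
        scanner = _PageScanner()
        scanner.feed(content)
        # Keep only links matching the assumed pattern and make them absolute.
        new_urls = {urljoin(page_url, h) for h in scanner.hrefs if ITEM_LINK.match(h)}
        new_data = {"url": page_url, "title": "".join(scanner.title_parts).strip()}
        return new_urls, new_data
```

The regular expression filters which links are worth following, and urljoin turns a relative link such as /item/xxx into a full URL based on the page it was found on. With Beautiful Soup the same filtering is usually done by passing the compiled pattern to find_all.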

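2.2.4 The new URLs are added to the new_urls set, and the new data is appended to the datas list,

where each element is a dict.

[Figure: outputer]

Once crawling is finished, the craw method calls the outputer at the very end to write everything out.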
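An outputer in this spirit could be sketched as follows. The names OutPuter, collect_data and output_info are the ones the main program calls; the HTML table layout, the path parameter with its default "output.html", and the dict keys "url" and "title" are assumptions of mine, since the original outputer was only shown as a picture.

```python
import html


class OutPuter(object):
    """Minimal sketch: collect dicts in a list, then write them as an HTML table."""

    def __init__(self):
        self.datas = []  # one dict per crawled page

    def collect_data(self, data):
        # Pages that could not be parsed yield None and are skipped.
        if data is None:
            return
        self.datas.append(data)

    def output_info(self, path="output.html"):
        # path is an assumed parameter; calling output_info() uses the default.
        rows = []
        for data in self.datas:
            rows.append("<tr><td>%s</td><td>%s</td></tr>" % (
                html.escape(data["url"]), html.escape(data["title"])))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><body><table>\n%s\n</table></body></html>\n" % "\n".join(rows))
```

Writing with an explicit encoding="utf-8" matters here because the crawled titles are mostly Chinese text.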

Here is the main program:

# Assumes the four modules live in a package named baidubaike_spider.
from baidubaike_spider import outputer, url_manager, downloader, parser


class SpiderMain(object):

    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = downloader.DownLoader()
        self.parser = parser.Parser()
        self.outputer = outputer.OutPuter()

    def craw(self, root_url):
        self.urls.add_new_url(root_url)
        count = 1
        # Keep crawling as long as there are new URLs.
        while self.urls.has_new_url():
            # Some URLs have changed or cannot be fetched, so catch the error and move on.
            try:
                new_url = self.urls.get_new_url()
                # Print a running count together with the URL being crawled.
                print('craw %d:%s' % (count, new_url))
                downloaded_content = self.downloader.download(new_url)
                # Parse the page into newly found URLs and the extracted data.
                new_urls, new_data = self.parser.parse(new_url, downloaded_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                count += 1
                # count starts at 1, so this stops after 9 successful pages.
                if count == 10:
                    break
            except Exception as e:
                print('craw fail: %s' % e)
        self.outputer.output_info()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/python"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
