Design a web crawler

2018-06-03 MontyOak

Original article: Design a web crawler

Scrapy's source code is a useful reference implementation.

1. Describe use cases and constraints

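Use cases:

Crawl a list of URLs
- Generate a reverse index mapping words to the pages that contain them
- Generate a title and snippet for each page
Let users search pages by keyword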

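Assumptions and constraints:

Traffic is not evenly distributed, so hot spots may occur
Search should be as fast as possible
1 billion URLs to crawl
- Pages must be refreshed regularly; the refresh frequency depends on how popular the site is and averages once a week, so about 4 billion pages are crawled per month
- Average page size is 500 KB
100 billion searches per month on average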

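Capacity estimates:

2 PB of page content per month (500 KB × 4 billion)
1,600 write requests per second (4 billion crawls over roughly 2.5 million seconds in a month)
40,000 search requests per second (100 billion searches over the same period)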
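
The estimates above can be reproduced with a few lines of arithmetic. This is only a rough sketch: rounding a month to 2.5 million seconds is an assumption (a 30-day month has 2,592,000 seconds), and the variable names are illustrative.

```python
# Back-of-the-envelope check of the capacity estimates.
# Assumption: one month is rounded to 2.5 million seconds.
SECONDS_PER_MONTH = 2_500_000

total_links = 1_000_000_000             # 1 billion URLs
refreshes_per_month = 4                 # about once a week on average
avg_page_size_kb = 500                  # average page size in KB
searches_per_month = 100_000_000_000    # 100 billion searches

crawls_per_month = total_links * refreshes_per_month  # 4 billion

# Storage: 500 KB * 4 billion = 2e12 KB = 2 PB (decimal units).
stored_pb_per_month = crawls_per_month * avg_page_size_kb / 1e12

writes_per_second = crawls_per_month / SECONDS_PER_MONTH
searches_per_second = searches_per_month / SECONDS_PER_MONTH

print(stored_pb_per_month)   # 2.0
print(writes_per_second)     # 1600.0
print(searches_per_second)   # 40000.0
```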

2. Create a high-level design

High-level system design diagram

3. Design core components

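Use case: crawl a list of URLs
Assume a list of URLs waiting to be crawled, links_to_crawl, already exists. A crawled_links table tracks the URLs that have already been crawled. Both links_to_crawl and crawled_links can be kept in a key-value cache.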

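The main flow of the crawler service:
- Take the highest-priority URL from links_to_crawl
  - If crawled_links already holds a page with a matching signature, lower this URL's priority to avoid cycles and take the next URL
  - Otherwise fetch the page and push it to the reverse index service, which builds the index
  - Extract the title and generate a snippet from the page content
  - Remove the URL from links_to_crawl and insert it into crawled_links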

The crawler service relies on a data store with roughly the following methods:

class PagesDataStore(object):

    def __init__(self, db):
        self.db = db
        ...

    def add_link_to_crawl(self, url):
        """Add the given link to `links_to_crawl`."""
        ...

    def remove_link_to_crawl(self, url):
        """Remove the given link from `links_to_crawl`."""
        ...

    def reduce_priority_link_to_crawl(self, url):
        """Reduce the priority of a link in `links_to_crawl` to avoid cycles."""
        ...

    def extract_max_priority_page(self):
        """Return the highest priority link in `links_to_crawl`."""
        ...

    def insert_crawled_link(self, url, signature):
        """Add the given link to `crawled_links`."""
        ...

    def crawled_similar(self, signature):
        """Determine if we've already crawled a page matching the given signature."""
        ...
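
To make the interface above concrete, here is a minimal in-memory sketch. It is not the article's implementation: the class name `InMemoryLinkStore`, the integer priorities, the `extract_max_priority_link` method, and the exact-match signature check are all assumptions for illustration. A real deployment would back this with a shared key-value store.

```python
class InMemoryLinkStore:
    """Toy stand-in for PagesDataStore that keeps everything in process memory."""

    def __init__(self):
        self.links_to_crawl = {}   # url -> priority (higher is crawled sooner)
        self.crawled_links = {}    # url -> page signature

    def add_link_to_crawl(self, url, priority=0):
        # Skip URLs that are already queued or already crawled.
        if url not in self.links_to_crawl and url not in self.crawled_links:
            self.links_to_crawl[url] = priority

    def remove_link_to_crawl(self, url):
        self.links_to_crawl.pop(url, None)

    def reduce_priority_link_to_crawl(self, url, amount=1):
        if url in self.links_to_crawl:
            self.links_to_crawl[url] -= amount

    def extract_max_priority_link(self):
        # Returns the top URL without removing it; O(n) scan.
        # A heap or sorted set would be used at scale.
        if not self.links_to_crawl:
            return None
        return max(self.links_to_crawl, key=self.links_to_crawl.get)

    def insert_crawled_link(self, url, signature):
        self.crawled_links[url] = signature

    def crawled_similar(self, signature):
        # Exact match only; real similarity detection is fuzzier.
        return signature in self.crawled_links.values()


store = InMemoryLinkStore()
store.add_link_to_crawl("https://example.com/a", priority=5)
store.add_link_to_crawl("https://example.com/b", priority=9)
top = store.extract_max_priority_link()
store.remove_link_to_crawl(top)
store.insert_crawled_link(top, "sig-b")
print(top)  # https://example.com/b
```

The linear scan keeps the sketch short; with a billion links the priority ordering would have to live in a structure that supports efficient top-of-queue reads.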

The page object is designed as follows:

class Page(object):

    def __init__(self, url, contents, child_urls, signature):
        self.url = url
        self.contents = contents
        self.child_urls = child_urls
        self.signature = signature

The crawl job is sketched as follows:

class Crawler(object):

    def __init__(self, data_store, reverse_index_queue, doc_index_queue):
        self.data_store = data_store
        self.reverse_index_queue = reverse_index_queue
        self.doc_index_queue = doc_index_queue

    def create_signature(self, page):
        """Create signature based on url and contents."""
        ...

    def crawl_page(self, page):
        # Queue every outgoing link for a later crawl.
        for url in page.child_urls:
            self.data_store.add_link_to_crawl(url)
        # Record the page as crawled under its content signature.
        page.signature = self.create_signature(page)
        self.data_store.remove_link_to_crawl(page.url)
        self.data_store.insert_crawled_link(page.url, page.signature)

    def crawl(self):
        while True:
            page = self.data_store.extract_max_priority_page()
            if page is None:
                # Nothing left to crawl.
                break
            if self.data_store.crawled_similar(page.signature):
                # A similar page was already crawled: lower the priority to avoid cycles.
                self.data_store.reduce_priority_link_to_crawl(page.url)
            else:
                self.crawl_page(page)

Because the data volume is large, this work is done with MapReduce, and duplicate URLs have to be removed from the results:

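from mrjob.job import MRJob


class RemoveDuplicateUrls(MRJob):

    def mapper(self, _, line):
        # Emit each input line (assumed to hold one URL) with a count of 1.
        yield line, 1

    def reducer(self, key, values):
        # Keep only URLs that occur exactly once; repeated URLs are dropped.
        total = sum(values)
        if total == 1:
            yield key, total


if __name__ == '__main__':
    RemoveDuplicateUrls.run()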

Duplicate pages are detected by comparing signatures generated from the page contents. See reference 1 and reference 2.
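
The text does not say how signatures are computed or compared. As one possible sketch, the code below compares pages by word shingles and the Jaccard index, a common set-similarity measure. The function names, the shingle size of 3, and the 0.9 threshold are illustrative assumptions, not values from the article.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Hypothetical helper: True when the shingle sets overlap at or above `threshold`."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy dog"
page_c = "an entirely different page about web crawlers"

print(is_near_duplicate(page_a, page_b))  # True
print(is_near_duplicate(page_a, page_c))  # False
```

Comparing full shingle sets pairwise does not scale to billions of pages; it only illustrates what "similar signature" can mean.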

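Each page needs to record when it was last crawled so that it can be refreshed periodically, at a frequency that depends on its popularity.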
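
A minimal sketch of such a refresh rule follows. The popularity tiers and their intervals are invented for illustration (the article only gives an average of one refresh per week), as are the names `REFRESH_INTERVALS` and `is_due_for_recrawl`.

```python
from datetime import datetime, timedelta

# Hypothetical popularity tiers; only the weekly average comes from the text.
REFRESH_INTERVALS = {
    "hot": timedelta(days=1),
    "normal": timedelta(days=7),
    "cold": timedelta(days=30),
}


def is_due_for_recrawl(last_crawled, popularity, now):
    """Return True if the page's refresh interval has elapsed since `last_crawled`."""
    return now - last_crawled >= REFRESH_INTERVALS[popularity]


now = datetime(2018, 6, 3)
print(is_due_for_recrawl(datetime(2018, 6, 1), "hot", now))     # True
print(is_due_for_recrawl(datetime(2018, 6, 1), "normal", now))  # False
```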

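Use case: support user searches

A query is handled in the following steps:
- Correct typos in the input, split it into terms, and convert it to a normalized query form
- Look up the reverse index for the list of links that match the query
- Use the reverse index results to fetch titles, snippets, actual URLs, and related information from the document service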
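
The three steps can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: the `REVERSE_INDEX` and `DOCUMENTS` dictionaries stand in for the reverse index and document services, the example URLs are made up, and the query parser only lowercases and splits on whitespace, with no typo correction.

```python
# term -> set of page ids (stand-in for the reverse index service)
REVERSE_INDEX = {
    "web": {1, 2},
    "crawler": {1},
    "search": {2},
}

# page id -> metadata (stand-in for the document service)
DOCUMENTS = {
    1: {"title": "Design a web crawler", "snippet": "Crawling at scale",
        "url": "https://example.com/crawler"},
    2: {"title": "Web search basics", "snippet": "How queries are served",
        "url": "https://example.com/search"},
}


def parse_query(raw_query):
    """Normalize the query into lowercase terms (no spelling correction here)."""
    return raw_query.lower().split()


def search(raw_query):
    terms = parse_query(raw_query)
    if not terms:
        return []
    # AND semantics: a page must contain every term.
    matching_ids = set.intersection(*(REVERSE_INDEX.get(t, set()) for t in terms))
    # Fetch display metadata for each match, in a stable order.
    return [DOCUMENTS[page_id] for page_id in sorted(matching_ids)]


results = search("Web Crawler")
print([doc["title"] for doc in results])  # ['Design a web crawler']
```

Requiring every term to match is one design choice among several; ranking and partial matches are left out to keep the sketch short.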

4. Refine the design

Final design diagram
