Scraping Web Pages with Scrapy and Saving the Data to MongoDB

2018-04-22  nextliving

Scrapy is a free, open-source web crawling framework written in Python. Originally built for web scraping, it can also be used to parse data retrieved from APIs. Scrapy was born at Mydeco, a London-based e-commerce company; its first public release came in 2008, and maintenance passed to Scrapinghub in 2011. This article demonstrates how to use Scrapy to scrape the newest questions on Stack Overflow (each question's title and URL) and how to save that data to a MongoDB database.

Installing the required packages

MongoDB

My machine runs Mac OS X. The first step is to install and configure MongoDB; for the installation procedure, see my other article: Installing MongoDB on Mac OS X with Homebrew.

Scrapy

Install it with pip inside a virtualenv-managed Python virtual environment (for setting one up, see my other article on managing Python project environments with virtualenv):


$ pip install Scrapy
$ pip freeze > requirements.txt

PyMongo

Install PyMongo with pip:


$ pip install pymongo
$ pip freeze > requirements.txt

On success, the terminal prints output like the following:


Collecting pymongo
  Downloading pymongo-3.2.2-cp27-none-macosx_10_11_intel.whl (262kB)
    100% |████████████████████████████████| 266kB 411kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.2.2

Creating the Scrapy project

According to the Scrapy documentation, if you want to keep the project code in a particular folder, run the scrapy startproject command inside that folder. My Python virtual environment is named engchen (it lives at /Users/chenxin/ProjectsEnv), and the project gets the same name, engchen (saved under /Users/chenxin/PycharmProjects). The command is:

(engchen) MacBookPro:PycharmProjects chenxin$ scrapy startproject engchen

On success, the terminal prints:


New Scrapy project 'engchen', using template directory '/Users/chenxin/ProjectsEnv/engchen/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/chenxin/PycharmProjects/engchen

You can start your first spider with:
    cd engchen
    scrapy genspider example example.com

Running this command creates a project root directory named engchen, containing the files and folders of the basic project template:


├── scrapy.cfg
└── engchen
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Specifying the data to scrape

The items.py module defines storage containers for the data we are about to scrape.

Open items.py: it contains a class EngchenItem, which, as you can see, inherits from Scrapy's Item base class.

Add the items we want to collect by updating items.py:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class EngchenItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    url = Field()
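An Item works like a dictionary that only accepts its declared fields. As a rough stdlib analogue of that contract (not Scrapy's actual implementation, just a sketch of the behavior):

```python
class Item(dict):
    """Dict-like container that rejects keys outside its declared fields."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("{0} is not a declared field".format(key))
        super(Item, self).__setitem__(key, value)


class EngchenItem(Item):
    fields = ("title", "url")


item = EngchenItem()
item["title"] = "Example question"    # declared field: accepted
item["url"] = "/questions/0/example"
# item["votes"] = 3                   # undeclared field: raises KeyError
```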

Create a file named engchen_spider.py in the "spiders" folder. This is where we direct the spider to find the data we want; the file is specific to a single website and cannot be reused to scrape data from other sites.

Define a class that inherits from Scrapy's Spider base class:


# -*- coding: utf-8 -*-
from scrapy import Spider


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]

A few notes on these attributes: name identifies the spider (it is what scrapy crawl refers to), allowed_domains restricts crawling to the listed domains, and start_urls lists the URLs the spider starts from.

XPath选择器

Scrapy uses XPath selectors to parse data out of web pages; in other words, a given XPath expression retrieves a specific part of the HTML. The Selectors section of the Scrapy documentation introduces XPath as follows:

XPath is a language for selecting nodes in XML documents, which can also be used with HTML

Now open stackoverflow in Chrome and find the XPath we need. Hover over the title of the first question, right-click, and choose Inspect:


Find the XPath for the <div class="summary"> element: //*[@id="question-summary-37872090"]/div[2]. You can check whether this XPath selects the first question by using the $x helper in the Console tab of Chrome's DevTools. Run it in the Console:


As you can see, it selects exactly the first question's code block, and the h3 inside it is the question's title.

How do we adjust the XPath so that it selects every question on the page? Simple: use this XPath: //div[@class="summary"]/h3. What does it mean?

This XPath selects the h3 child of every div whose class is summary.
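The difference between the two XPath expressions can be seen on a simplified page structure. The sketch below uses the standard library's ElementTree, whose limited XPath subset covers these expressions (Scrapy uses a full XPath engine; the markup here is a made-up miniature of the real page):

```python
import xml.etree.ElementTree as ET

# A made-up miniature of the question list page.
PAGE = """
<div id="questions">
  <div class="question-summary" id="question-summary-37872090">
    <div class="summary"><h3>Newest question</h3></div>
  </div>
  <div class="question-summary" id="question-summary-37872089">
    <div class="summary"><h3>Second question</h3></div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# The id-based XPath pins down exactly one question block ...
first = root.findall('.//*[@id="question-summary-37872090"]')

# ... while the class-based XPath matches every question title on the page.
titles = root.findall('.//div[@class="summary"]/h3')
print(len(first), len(titles))  # 1 2
```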

Now update engchen_spider.py:


# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

Extracting the data

Grabbing the questions alone is not enough; we need to pull out each question's title and link. Update engchen_spider.py:


# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector

from engchen.items import EngchenItem


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = EngchenItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
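The extraction logic inside parse() can be tried outside Scrapy on a static snippet. A sketch with the standard library's ElementTree standing in for Scrapy's selectors (the markup is again a made-up miniature of the real page):

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<body>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/1/first">First question</a></h3>
  </div>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/2/second">Second question</a></h3>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)
items = []
for question in root.findall('.//div[@class="summary"]/h3'):
    # A relative path, like 'a[@class=...]' in the spider, searches
    # only inside the current <h3>.
    link = question.find('a[@class="question-hyperlink"]')
    items.append({'title': link.text, 'url': link.get('href')})

print(items[0])  # {'title': 'First question', 'url': '/questions/1/first'}
```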

Testing the spider

In the project folder engchen, run:

$ scrapy crawl engchen

The terminal prints the following:


2016-06-17 12:51:03 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)

2016-06-17 12:51:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}

2016-06-17 12:51:03 [scrapy] INFO: Enabled extensions:

['scrapy.extensions.logstats.LogStats',

 'scrapy.extensions.telnet.TelnetConsole',

 'scrapy.extensions.corestats.CoreStats']

2016-06-17 12:51:03 [scrapy] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

 'scrapy.downloadermiddlewares.retry.RetryMiddleware',

 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',

 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2016-06-17 12:51:03 [scrapy] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

 'scrapy.spidermiddlewares.referer.RefererMiddleware',

 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

 'scrapy.spidermiddlewares.depth.DepthMiddleware']

2016-06-17 12:51:03 [scrapy] INFO: Enabled item pipelines:

[]

2016-06-17 12:51:03 [scrapy] INFO: Spider opened

2016-06-17 12:51:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2016-06-17 12:51:03 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

2016-06-17 12:51:15 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)

2016-06-17 12:51:26 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'how to display picture in picture object using path in crystal report C#.net',

 'url': u'/questions/37873352/how-to-display-picture-in-picture-object-using-path-in-crystal-report-c-net'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u"C language paese file's detail",

 'url': u'/questions/37873351/c-language-paese-files-detail'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Shibboleth custom password flow',

 'url': u'/questions/37873350/shibboleth-custom-password-flow'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Httpurlconnection getresponsecode throws eof exception. Tried the max try method',

 'url': u'/questions/37873348/httpurlconnection-getresponsecode-throws-eof-exception-tried-the-max-try-method'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Is a Spring MVC App with Thymeleaf RESTful?',

 'url': u'/questions/37873347/is-a-spring-mvc-app-with-thymeleaf-restful'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'How to bind an arbitrary number of values with procedural-style MySQLi prepared statement?',

 'url': u'/questions/37873346/how-to-bind-an-arbitrary-number-of-values-with-procedural-style-mysqli-prepared'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Use Ffmpeg on android with linux commands',

 'url': u'/questions/37873345/use-ffmpeg-on-android-with-linux-commands'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Play Framework 2.3.x: Wrap Request object using Scala Oauth in Play Framework',

 'url': u'/questions/37873343/play-framework-2-3-x-wrap-request-object-using-scala-oauth-in-play-framework'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'FTP Support ChromeCast',

 'url': u'/questions/37873341/ftp-support-chromecast'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u"Regex to find matches words that is in a line that doesn't start with",

 'url': u'/questions/37873339/regex-to-find-matches-words-that-is-in-a-line-that-doesnt-start-with'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'ParseEroor in android Volley',

 'url': u'/questions/37873337/parseeroor-in-android-volley'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Rails: maintaining application-dependent data',

 'url': u'/questions/37873334/rails-maintaining-application-dependent-data'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Cannot Enable Autoexposure via V4L2',

 'url': u'/questions/37873331/cannot-enable-autoexposure-via-v4l2'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'how do i use dct to extract features from image?',

 'url': u'/questions/37873327/how-do-i-use-dct-to-extract-features-from-image'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'How to disable/ lock one page in viewpager?',

 'url': u'/questions/37873326/how-to-disable-lock-one-page-in-viewpager'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Need Text finder and convert to PDF file for .DWG files',

 'url': u'/questions/37873324/need-text-finder-and-convert-to-pdf-file-for-dwg-files'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'spring mvc: the difference between DeferredResult and ListenableFuture?',

 'url': u'/questions/37873322/spring-mvc-the-difference-between-deferredresult-and-listenablefuture'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'.AsReadOnly() not included PCL despite it being listed as supported in MSDN',

 'url': u'/questions/37873317/asreadonly-not-included-pcl-despite-it-being-listed-as-supported-in-msdn'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'j-query function to update value onchange',

 'url': u'/questions/37873314/j-query-function-to-update-value-onchange'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Why Are format specifiers used in C',

 'url': u'/questions/37873312/why-are-format-specifiers-used-in-c'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'H5py store list of list of strings',

 'url': u'/questions/37873311/h5py-store-list-of-list-of-strings'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'how to get calculated values(totalPriceAmt) from js to html in by using angular js',

 'url': u'/questions/37873310/how-to-get-calculated-valuestotalpriceamt-from-js-to-html-in-by-using-angular'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Jquery length not working with elements with variable in name',

 'url': u'/questions/37873306/jquery-length-not-working-with-elements-with-variable-in-name'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'get succesor Binary Search Tree c++ Data structure',

 'url': u'/questions/37873303/get-succesor-binary-search-tree-c-data-structure'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u"i have added fragments in android studio but on running it , it's just loading and not showing fragments",

 'url': u'/questions/37873295/i-have-added-fragments-in-android-studio-but-on-running-it-its-just-loading-a'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Easy Share Application - Possibly Null Error?',

 'url': u'/questions/37873293/easy-share-application-possibly-null-error'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'How to retrieve back signed value',

 'url': u'/questions/37873292/how-to-retrieve-back-signed-value'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Uploading the app to AppStore when DataBase is changed(iOS)',

 'url': u'/questions/37873291/uploading-the-app-to-appstore-when-database-is-changedios'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Getting properties object in spring',

 'url': u'/questions/37873289/getting-properties-object-in-spring'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'SSIS make LastModifiedProductVersion from 10.50.1600.1 to 10.50.6000.34',

 'url': u'/questions/37873286/ssis-make-lastmodifiedproductversion-from-10-50-1600-1-to-10-50-6000-34'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Servlet not loading on startup in webphere 8.5.5',

 'url': u'/questions/37873285/servlet-not-loading-on-startup-in-webphere-8-5-5'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'will conversion of timestamp to date in diiferent timezones returns different date and time?',

 'url': u'/questions/37873284/will-conversion-of-timestamp-to-date-in-diiferent-timezones-returns-different-da'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Structure of a Node.Js API with MySQL',

 'url': u'/questions/37873280/structure-of-a-node-js-api-with-mysql'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'WordPress: Too many taxonomies slow down the site',

 'url': u'/questions/37873279/wordpress-too-many-taxonomies-slow-down-the-site'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Team Coding Client + Server',

 'url': u'/questions/37873277/team-coding-client-server'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'separate characters and numbers from a string',

 'url': u'/questions/37873276/separate-characters-and-numbers-from-a-string'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Throttle function for 2 seconds',

 'url': u'/questions/37873275/throttle-function-for-2-seconds'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'How to set customer _id by find other model By strongloop and mongodb',

 'url': u'/questions/37873274/how-to-set-customer-id-by-find-other-model-by-strongloop-and-mongodb'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'com.phonegap.www is already in use by an app owned by another developer',

 'url': u'/questions/37873273/com-phonegap-www-is-already-in-use-by-an-app-owned-by-another-developer'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'PHP :- How to handle Doc files (preview and edit)',

 'url': u'/questions/37873271/php-how-to-handle-doc-files-preview-and-edit'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Laravel ORM Relationship',

 'url': u'/questions/37873270/laravel-orm-relationship'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'reading a json in spark',

 'url': u'/questions/37873269/reading-a-json-in-spark'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Where to find the Android app native code on my test server and how to decompile it to Java?',

 'url': u'/questions/37873267/where-to-find-the-android-app-native-code-on-my-test-server-and-how-to-decompile'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'401\. That\u2019s an error. Error: invalid_client,no registered origin',

 'url': u'/questions/37873266/401-that-s-an-error-error-invalid-client-no-registered-origin'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u"addAllowedApplication(String packageName) method of VPNService.Buildr class doesn't work on api level 14",

 'url': u'/questions/37873265/addallowedapplicationstring-packagename-method-of-vpnservice-buildr-class-does'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'I cant remove an entity JPA, JEE7',

 'url': u'/questions/37873262/i-cant-remove-an-entity-jpa-jee7'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'How to convert this MySQL query to Yii2 ActiveQuery format?',

 'url': u'/questions/37873259/how-to-convert-this-mysql-query-to-yii2-activequery-format'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'CSS style checkbox inside td',

 'url': u'/questions/37873256/css-style-checkbox-inside-td'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u"How do I read Class paths in Java's API documentation?",

 'url': u'/questions/37873254/how-do-i-read-class-paths-in-javas-api-documentation'}

2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'IPBoard hook for IPSMember',

 'url': u'/questions/37873253/ipboard-hook-for-ipsmember'}

2016-06-17 12:51:26 [scrapy] INFO: Closing spider (finished)

2016-06-17 12:51:26 [scrapy] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 512,

 'downloader/request_count': 2,

 'downloader/request_method_count/GET': 2,

 'downloader/response_bytes': 31240,

 'downloader/response_count': 2,

 'downloader/response_status_count/200': 2,

 'finish_reason': 'finished',

 'finish_time': datetime.datetime(2016, 6, 17, 4, 51, 26, 211130),

 'item_scraped_count': 50,

 'log_count/DEBUG': 53,

 'log_count/INFO': 7,

 'response_received_count': 2,

 'scheduler/dequeued': 1,

 'scheduler/dequeued/memory': 1,

 'scheduler/enqueued': 1,

 'scheduler/enqueued/memory': 1,

 'start_time': datetime.datetime(2016, 6, 17, 4, 51, 3, 602022)}

2016-06-17 12:51:26 [scrapy] INFO: Spider closed (finished)

Of course, you can also save the scraped data to a JSON file named question.json:

$ scrapy crawl engchen -o question.json -t json

When the crawl finishes, a question.json file appears in the project folder; opening it shows 50 entries:
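A quick way to sanity-check the export with the standard library, assuming the default JSON feed format (one array of objects with the item fields as keys; the two entries below are illustrative):

```python
import json

# Illustrative contents of question.json (the real file holds 50 entries).
raw = """[
  {"title": "Shibboleth custom password flow",
   "url": "/questions/37873350/shibboleth-custom-password-flow"},
  {"title": "reading a json in spark",
   "url": "/questions/37873269/reading-a-json-in-spark"}
]"""

questions = json.loads(raw)
print(len(questions))            # number of exported questions
print(questions[0]["title"])
```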


Saving the data to MongoDB

Each item, as soon as it is scraped, will be stored in MongoDB.

The first step is to set up the database that will hold the scraped data. Open settings.py, register the pipeline, and add the database settings:


ITEM_PIPELINES = {'engchen.pipelines.MongoDBPipeline': 100}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

Configuring the pipeline

The spider scrapes and parses the data, and the database settings are in place; the pipeline file pipelines.py now connects the two.

First, set up the database connection:


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings.get('MONGODB_SERVER'),
            settings.get('MONGODB_PORT'))
        db = connection[settings.get('MONGODB_DB')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

Then define a method that processes each item:


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings.get('MONGODB_SERVER'),
            settings.get('MONGODB_PORT'))
        db = connection[settings.get('MONGODB_DB')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # check the field's value, not the field name
            if not item.get(data):
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Question Added To MongoDB Successfully!",
                    level=log.DEBUG, spider=spider)
        return item
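The drop-or-save decision in process_item can be exercised on its own. A stripped-down sketch, with plain dicts standing in for Items, a local DropItem stand-in, and the database write left out:

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def validate(item):
    # Mirror the pipeline's check: every field must have a value.
    for field in item:
        if not item.get(field):
            raise DropItem("Missing {0}!".format(field))
    return item


good = {"title": "A question", "url": "/questions/1/a-question"}
bad = {"title": "", "url": "/questions/2/b"}

validate(good)        # passes through: would be inserted into MongoDB
try:
    validate(bad)     # empty title: the item is dropped
except DropItem as exc:
    print(exc)        # Missing title!
```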

Testing that scraped data is saved to MongoDB

First start MongoDB:

$ mongod

Then, in the project folder engchen, run:

$ scrapy crawl engchen

The terminal prints the following:


2016-06-17 14:20:23 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)

2016-06-17 14:20:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}

2016-06-17 14:20:23 [scrapy] INFO: Enabled extensions:

['scrapy.extensions.logstats.LogStats',

 'scrapy.extensions.telnet.TelnetConsole',

 'scrapy.extensions.corestats.CoreStats']

2016-06-17 14:20:23 [scrapy] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

 'scrapy.downloadermiddlewares.retry.RetryMiddleware',

 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',

 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2016-06-17 14:20:23 [scrapy] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

 'scrapy.spidermiddlewares.referer.RefererMiddleware',

 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

 'scrapy.spidermiddlewares.depth.DepthMiddleware']

2016-06-17 14:20:23 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:11: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.

 from scrapy import log

2016-06-17 14:20:23 [scrapy] INFO: Enabled item pipelines:

['engchen.pipelines.MongoDBPipeline']

2016-06-17 14:20:23 [scrapy] INFO: Spider opened

2016-06-17 14:20:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2016-06-17 14:20:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

2016-06-17 14:20:24 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)

2016-06-17 14:20:25 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)

2016-06-17 14:20:25 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:28: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead

 log.msg("Question Added To MongoDB Successfully!",level=log.DEBUG,spider=spider)

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'how to remove all website addresses in bulk using regex',

 'url': u'/questions/37874402/how-to-remove-all-website-addresses-in-bulk-using-regex'}

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>

{'title': u'Dynamic subdomain creation in wp',

 'url': u'/questions/37874401/dynamic-subdomain-creation-in-wp'}

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!

......

Then inspect the freshly stored data with Robomongo, a MongoDB GUI tool:


Conclusion

This article gives a quick introduction to using Scrapy; the project's source code is hosted on GitHub.

