My First Crawler (Python)

2018-03-01  loongod

Why write a crawler?

A couple of days ago I wanted to write a backend API and found Flask very pleasant to use. Then I realized I had no data of my own, so I decided to scrape some with Python to fill up the database.

I searched Zhihu for answers about commonly used Python crawler frameworks, found Scrapy, and worked through the introductory tutorial. It really is convenient and concise. The tutorial's example had a few problems when I tried it, so I worked them out myself and am writing this up, partly to commemorate my first Python crawler.

This post only records the example I used while learning (it is very simple). For installation, project creation, syntax, and so on, please look elsewhere.

Writing the first crawler (Spider)

First, create a crawler project:

scrapy startproject tutorial

This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

What these files do:

scrapy.cfg: the project configuration file
tutorial/: the project's Python module; your code will go here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory that holds the spider code
Create an Item first

In the tutorial/items.py file, add:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

The item is defined by whatever data we want to scrape. Here our target URL is: http://dmoztools.net/Computers/Programming/Languages/Python/Books/

We will scrape the title, link, and desc of each entry under the Sites section.
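
An Item behaves like a dict, so once defined it can be filled and read with ordinary mapping syntax. A quick sketch (the values here are made up for illustration):

>>> item = DmozItem(title=['Example Book'])  # keyword initialization
>>> item['link'] = ['http://example.com/']   # dict-style assignment
>>> dict(item)
{'title': ['Example Book'], 'link': ['http://example.com/']}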

(screenshot: items.png, items.py after adding the Item)
Create the spider file

Create a dmoz_spider.py file in the tutorial/spiders/ directory and add the following code:

# -*- coding:utf-8 -*-

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz" //这个名字必须唯一,是运行爬虫命令的参数
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@id="sites-section"]//div[@class="site-item "]'):
            title = sel.xpath('div[@class="title-and-desc"]//div[@class="site-title"]/text()').extract()
            link = sel.xpath('div[@class="title-and-desc"]/a/@href').extract()  # @href reads the attribute's value
            desc = sel.xpath('div[@class="title-and-desc"]//div[@class="site-descr "]/text()').extract()
            print(title, link, desc)

Run the crawler

Then run the command: scrapy crawl dmoz

(screenshot: crawl.png)

You can see the scraped data printed in the terminal.

When writing the XPath expressions, I found that some class attributes on the target site have a trailing space. It is easy to miss, and then your XPath matches no data at all. 😄 It really had me confused at first, so watch out for it.
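
A standard XPath idiom sidesteps this trap by normalizing the class attribute before matching, so the token is found with or without stray spaces (a sketch; the exact @class= comparisons above also work if you copy the trailing space faithfully). You can try candidate expressions interactively with scrapy shell before putting them in the spider:

scrapy shell "http://dmoztools.net/Computers/Programming/Languages/Python/Books/"
>>> # matches the site-item class token regardless of surrounding whitespace
>>> response.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " site-item ")]')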

Now let's use the item and export the scraped data to a JSON file. Change the code in dmoz_spider.py as follows:

# -*- coding:utf-8 -*-

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@id="sites-section"]//div[@class="site-item "]'):
            item = DmozItem()
            item["title"] = sel.xpath('div[@class="title-and-desc"]//div[@class="site-title"]/text()').extract()
            item["link"] = sel.xpath('div[@class="title-and-desc"]/a/@href').extract()
            item["desc"] = sel.xpath('div[@class="title-and-desc"]//div[@class="site-descr "]/text()').extract()
            yield item

Then run the command: scrapy crawl dmoz -o items.json

This produces the items.json file:

[{"title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "desc": ["\r\n\t\t\t\r\n                                    The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n                                    ", "\r\n                                  "]},
{"title": ["Dive Into Python 3 "], "link": ["http://www.diveintopython.net/"], "desc": ["\r\n\t\t\t\r\n                                    By Mark Pilgrim, Guide to Python 3  and its differences from Python 2. Each chapter starts with a real code sample and explains it fully. Has a comprehensive appendix of all the syntactic and semantic changes in Python 3\r\n\r\n\r\n                                    ", "\r\n                                  "]},
{"title": ["Foundations of Python Network Programming "], "link": ["http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/"], "desc": ["\r\n\t\t\t\r\n                                    This book covers a wide range of topics. From raw TCP and UDP to encryption with TSL, and then to HTTP, SMTP, POP, IMAP, and ssh. It gives you a good understanding of each field and how to do everything on the network with Python.\r\n                                    ", "\r\n                                  "]},
{"title": ["Free Python books "], "link": ["http://www.techbooksforfree.com/perlpython.shtml"], "desc": ["\r\n\t\t\t\r\n                                    Free Python books and tutorials.\r\n                                    ", "\r\n                                  "]},
{"title": ["FreeTechBooks: Python Scripting Language "], "link": ["http://www.freetechbooks.com/python-f6.html"], "desc": ["\r\n\t\t\t\r\n                                    Annotated list of free online books on Python scripting language. Topics range from beginner to advanced.\r\n                                    ", "\r\n                                  "]},
{"title": ["How to Think Like a Computer Scientist: Learning with Python "], "link": ["http://greenteapress.com/thinkpython/"], "desc": ["\r\n\t\t\t\r\n                                    By Allen B. Downey, Jeffrey Elkner, Chris Meyers; Green Tea Press, 2002, ISBN 0971677506. Teaches general principles of programming, via Python as subject language. Thorough, in-depth approach to many basic and intermediate programming topics. Full text online and downloads: HTML, PDF, PS, LaTeX. [Free, Green Tea Press]\r\n                                    ", "\r\n                                  "]},
{"title": ["An Introduction to Python "], "link": ["http://www.network-theory.co.uk/python/intro/"], "desc": ["\r\n\t\t\t\r\n                                    By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161769. Printed edition of official tutorial, for v2.x, from Python.org. [Network Theory, online]\r\n                                    ", "\r\n                                  "]},
{"title": ["Making Use of Python "], "link": ["http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html"], "desc": ["\r\n\t\t\t\r\n                                    By Rashi Gupta; John Wiley and Sons, 2002, ISBN 0471219754. Covers language basics, use for CGI scripting, GUI development, network programming; shows why it is one of more sophisticated of popular scripting languages. [Wiley]\r\n                                    ", "\r\n                                  "]},
{"title": ["Practical Python "], "link": ["http://hetland.org/writing/practical-python/"], "desc": ["\r\n\t\t\t\r\n                                    By Magnus Lie Hetland; Apress LP, 2002, ISBN 1590590066. Readable guide to ideas most vital to new users, from basics common to high level languages, to more specific aspects, to a series of 10 ever more complex programs. [Apress]\r\n                                    ", "\r\n                                  "]},
{"title": ["Pro Python System Administration "], "link": ["http://sysadminpy.com/"], "desc": ["\r\n\t\t\t\r\n                                    By Rytis Sileika, ISBN13: 978-1-4302-2605-5, Uses real-world system administration examples like manage devices with SNMP and SOAP, build a distributed monitoring system, manage web applications and parse complex log files, monitor and manage MySQL databases.\r\n                                    ", "\r\n                                  "]},
{"title": ["Programming in Python 3 (Second Edition) "], "link": ["http://www.qtrac.eu/py3book.html"], "desc": ["\r\n\t\t\t\r\n                                    A Complete Introduction to the Python 3.\r\n                                    ", "\r\n                                  "]},
{"title": ["Python 2.1 Bible "], "link": ["http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html"], "desc": ["\r\n\t\t\t\r\n                                    By Dave Brueck, Stephen Tanner; John Wiley and Sons, 2001, ISBN 0764548077. Full coverage, clear explanations, hands-on examples, full language reference; shows step by step how to use components, assemble them, form full-featured programs. [John Wiley and Sons]\r\n                                    ", "\r\n                                  "]},
{"title": ["Python 3 Object Oriented Programming "], "link": ["https://www.packtpub.com/python-3-object-oriented-programming/book"], "desc": ["\r\n\t\t\t\r\n                                    A step-by-step tutorial for OOP in Python 3, including discussion and examples of abstraction, encapsulation, information hiding, and raise, handle, define, and manipulate exceptions.\r\n                                    ", "\r\n                                  "]},
{"title": ["Python Language Reference Manual "], "link": ["http://www.network-theory.co.uk/python/language/"], "desc": ["\r\n\t\t\t\r\n                                    By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161785. Printed edition of official language reference, for v2.x, from Python.org, describes syntax, built-in datatypes. [Network Theory, online]\r\n                                    ", "\r\n                                  "]},
{"title": ["Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython "], "link": ["http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1"], "desc": ["\r\n\t\t\t\r\n                                    By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website]\r\n                                    ", "\r\n                                  "]},
{"title": ["Sams Teach Yourself Python in 24 Hours "], "link": ["http://www.informit.com/store/product.aspx?isbn=0672317354"], "desc": ["\r\n\t\t\t\r\n                                    By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing]\r\n                                    ", "\r\n                                  "]},
{"title": ["Text Processing in Python "], "link": ["http://gnosis.cx/TPiP/"], "desc": ["\r\n\t\t\t\r\n                                    By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\r\n                                    ", "\r\n                                  "]},
{"title": ["XML Processing with Python "], "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"], "desc": ["\r\n\t\t\t\r\n                                    By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\r\n                                    ", "\r\n                                  "]}
]

Viewed with the JSON-handle browser extension:

(screenshot: crawl-o.png)

In dmoz_spider.py, do not start the inner XPath expressions inside the for-in loop with //. If you do, they match all corresponding data in the whole page, rather than only the data within the sel selector produced by the loop.

For example, here I modify the code in dmoz_spider.py slightly:

    def parse(self, response):
        for sel in response.xpath('//div[@id="sites-section"]//div[@class="site-item "]'):
            item = DmozItem()
            # only this xpath is changed: it now starts with "//"
            item["title"] = sel.xpath('//div[@class="title-and-desc"]//div[@class="site-title"]/text()').extract()
            item["link"] = sel.xpath('div[@class="title-and-desc"]/a/@href').extract()
            item["desc"] = sel.xpath('div[@class="title-and-desc"]//div[@class="site-descr "]/text()').extract()
            yield item

Here is the comparison of the results:

(screenshot: 数据对比.png, comparing the two outputs)

This is a consequence of how XPath location paths work:

A location path can be absolute or relative.
An absolute path starts with a forward slash (/); a relative path does not.

Absolute location path:
/step/step/...

Relative location path:
step/step/... or .//step/step/...

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all of the <div> elements:

>>> divs = response.xpath('//div')

At first you might be tempted to use the following, which is wrong: it extracts all <p> elements from the whole document, not only those inside the <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

Here is the proper way to do it (note the dot prefix in the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is to extract only the direct-child <p> elements:

>>> for p in divs.xpath('p'):
...     print(p.extract())
Writing to SQLite

The table and its column names must be created before the crawler runs.
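
For reference, a minimal one-off setup script (a sketch: it assumes the sinaBook.db file and book table configured in settings.py below, with one TEXT column per BookItem field):

import sqlite3

# run once before crawling: create the database file and the target table
conn = sqlite3.connect('sinaBook.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS book (
        bookName   TEXT,
        bookAuthor TEXT,
        bookType   TEXT,
        bookInfo   TEXT,
        bookLink   TEXT
    )
''')
conn.commit()
conn.close()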

Add the following code to pipelines.py:

import sqlite3


class TutorialPipeline(object):

    def __init__(self, sqlite_file, sqlite_table):
        self.sqlite_file = sqlite_file
        self.sqlite_table = sqlite_table

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            sqlite_file = crawler.settings.get('SQLITE_FILE'),  # read from settings.py
            sqlite_table = crawler.settings.get('SQLITE_TABLE', 'items')
        )

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.sqlite_file)  # connect to the database
        self.cur = self.conn.cursor()  # get a cursor


    def close_spider(self, spider):
        self.conn.close()  # close the database connection

    def process_item(self, item, spider):
        insert_sql = "insert into {0}({1}) values ({2})".format(
            self.sqlite_table,
            ','.join(item.keys()),
            ','.join(['?'] * len(item)))  # one ? placeholder per populated field
        print(insert_sql, list(item.values()))
        self.cur.execute(insert_sql, list(item.values()))
        self.conn.commit()
        return item
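
To make the string formatting concrete: for the book table below, with all five fields populated in the order parse() assigns them, the printed statement would be

insert into book(bookName,bookAuthor,bookType,bookInfo,bookLink) values (?,?,?,?,?)

Passing the values separately through the ? placeholders lets sqlite3 handle quoting and escaping, which is safer than interpolating them into the SQL string.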

Add the following to settings.py:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

SQLITE_FILE = 'sinaBook.db'
SQLITE_TABLE = 'book'

ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300,
}

Also modify the spider's parse method slightly (the complete sinabook.py file is listed at the end of this post):

    def parse(self, response):
        for sel in response.xpath('//div[@class="wm-list-f"]'):
            for bookSel in sel.xpath('div[@class="book_list"]/ul/li'):
                print(bookSel)
                item = BookItem()
                item["bookName"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_name"]/a/text()').extract()[0] // 因为之前返回的数据都是在数组中,这里直接获取数据
                item["bookAuthor"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_author"]/text()').extract()[0]
                item["bookType"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_type"]/a/text()').extract()[0]
                item["bookInfo"] = bookSel.xpath('div[@class="book_info"]//p[@class="info"]/a/text()').extract()[0]
                item["bookLink"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_read"]/a/@href').extract()[0]
                yield item
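
Note that indexing with [0] raises an IndexError whenever an XPath matches nothing. A slightly safer variant (assuming Scrapy >= 1.0, whose selectors provide extract_first()):

                item["bookName"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_name"]/a/text()').extract_first()  # None instead of IndexError on no match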

Then run the crawler: scrapy crawl sinabook

(screenshot: sqlite.png)
Extras

Locating the target markup:

In the Chrome browser, use inspect element: Cmd+Shift+C, or clicking the icon marked below, turns on the picker that lets you select elements directly on the page.

(screenshot: inspecter.png)

Displaying Chinese:
Scraped Chinese data is printed as Unicode escape sequences, but it displays normally once written to the database.

You can also use: scrapy crawl yourSpiderName -o name.json -s FEED_EXPORT_ENCODING=utf-8

That way the JSON output file contains readable Chinese.
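
If you would rather not pass the flag on every run, the same option can live in settings.py (FEED_EXPORT_ENCODING is a standard Scrapy feed-export setting):

FEED_EXPORT_ENCODING = 'utf-8'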

items.py

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class BookItem(scrapy.Item):
    bookName = scrapy.Field()
    bookAuthor = scrapy.Field()
    bookType = scrapy.Field()
    bookInfo = scrapy.Field()
    bookLink = scrapy.Field()

sinabook.py

# -*- coding:utf-8 -*-

import scrapy
from tutorial.items import BookItem

class DmozSpider(scrapy.spiders.Spider):
    name = "sinabook"
    allowed_domains = ["vip.book.sina.com.cn"]
    start_urls = [
        "http://vip.book.sina.com.cn/weibobook/man.php?pos=202058"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="wm-list-f"]'):
            for bookSel in sel.xpath('div[@class="book_list"]/ul/li'):
                print(bookSel)
                item = BookItem()
                item["bookName"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_name"]/a/text()').extract()
                item["bookAuthor"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_author"]/text()').extract()
                item["bookType"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_type"]/a/text()').extract()
                item["bookInfo"] = bookSel.xpath('div[@class="book_info"]//p[@class="info"]/a/text()').extract()
                item["bookLink"] = bookSel.xpath('div[@class="book_info"]//p[@class="book_read"]/a/@href').extract()
                yield item

Run: scrapy crawl sinabook -o items.json -s FEED_EXPORT_ENCODING=utf-8

items.json

[
{"bookAuthor": ["方恨晚"], "bookName": ["逍遥法医"], "bookLink": ["/book/play/5365733-0.html?pos=20057"], "bookType": ["都市校园"], "bookInfo": ["偶得神秘玉佩,开天眼、习医术、修玄功,衰男大翻身,成就一代超级法医,自此窥生死、…"]},
{"bookAuthor": ["一织"], "bookName": ["隔壁的青铜女孩"], "bookLink": ["/book/play/5394624-0.html?pos=20057"], "bookType": ["都市校园"], "bookInfo": ["《隔壁的青铜女孩》简介:\r\n\r\n你会玩LOL吗?\r\n你有遇到过奇怪的小萌妹吗?\r…"]},
{"bookAuthor": ["欧阳晕"], "bookName": ["极限武尊"], "bookLink": ["/book/play/5353290-0.html?pos=20058"], "bookType": ["玄幻奇幻"], "bookInfo": ["武者,罡劲雄浑。\r\n气修,变幻莫测。\r\n陆凡,一名武道与炼气同修之士。\r\n我本平…"]},
{"bookAuthor": ["仙路渺茫"], "bookName": ["滴血认主"], "bookLink": ["/book/play/238973-0.html?pos=20058"], "bookType": ["玄幻奇幻"], "bookInfo": ["无论是在空中高速飞行的飞剑,还是各种各样的灵兽、仙兽,只要你一滴鲜血与其结合,就…"]},
{"bookAuthor": ["番茄死不了"], "bookName": ["法医灵异录"], "bookLink": ["/book/play/5368149-0.html?pos=20059"], "bookType": ["悬疑灵异"], "bookInfo": ["主人公凌凡,一个普通的高中生,无意中拿到法医哥哥凌枫遗留给自己的神秘备忘录,从此…"]},
{"bookAuthor": ["东北神汉"], "bookName": ["出马仙:我当大仙的那几"], "bookLink": ["/book/play/5384290-0.html?pos=20059"], "bookType": ["悬疑灵异"], "bookInfo": ["南茅北马,自古以来以山海关为界,南方属茅山道术,北方则是出马仙马家,出马仙继承了…"]},
{"bookAuthor": ["四关"], "bookName": ["巨匪"], "bookLink": ["/book/play/5357907-0.html?pos=20060"], "bookType": ["历史军事"], "bookInfo": ["他是一个土匪。他不懂何为苍天大义,天下民生,也不甘随波逐流,任乱世沉浮。很多年后…"]},
{"bookAuthor": ["龙竹"], "bookName": ["雄霸楚汉"], "bookLink": ["/book/play/5345927-0.html?pos=20060"], "bookType": ["历史军事"], "bookInfo": ["特种兵王龙天羽,因一次保护神秘皇陵出土的宝物,而意外穿越时空,来到了秦朝末年,此…"]}
]