
Crawler Classroom (Lesson 22) | Extracting Links with LinkExtractor

2018-03-27 · 小怪聊职场

When crawling a website, the data you want is usually not all on one page: each page holds part of the data plus links to other pages. In the earlier lesson on collecting Jianshu article information, for example, the list page only gives you the article title, the article URL and the author's name; to get the article body and its comments you have to follow the link into the article's detail page.
How to extract the data itself was covered in earlier lessons, and we have also used Selector to pull out article URLs, so what makes LinkExtractor special? Why is LinkExtractor so well suited to whole-site crawling? The rest of this lesson introduces it.
1. Basic usage of LinkExtractor
Taking the article list on the Jianshu home page as an example, we use LinkExtractor to extract the links on the page. As shown in Figure 22-1, what we extract are all the links inside the <li> elements under class=note-list.

Figure 22-1
The code is as follows:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor


class Jianshu(scrapy.Spider):
    name = "jianshu_spider"
    allowed_domains = ["jianshu.com"]

    def __init__(self, *args, **kwargs):
        super(Jianshu, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.jianshu.com/']

    def parse(self, response):
        # Extract only the links found inside the <li> elements of the note list
        link = LinkExtractor(restrict_xpaths='//ul[@class="note-list"]/li')
        links = link.extract_links(response)
        if links:
            for link_one in links:
                print(link_one)

1) First import LinkExtractor with from scrapy.linkextractors import LinkExtractor.
2) Create a LinkExtractor object, describing the extraction rule through constructor arguments; here an XPath selector expression is passed to the restrict_xpaths parameter.
3) Call the object's extract_links method with a Response object. The method extracts links from the page wrapped by the Response according to the rule given at construction time and returns a list; every element of the list is a Link object, i.e. one extracted link.

Running the code above gives the following result:

Link(url='https://www.jianshu.com/p/29621e57077f', text='\n      \n    ', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/u/bfb1aa483a03', text='\n        \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/737d34bf3e55', text='\n            \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/29621e57077f#comments', text='\n           338\n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/c9185e45c4e2', text='\n      \n    ', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/u/cbd1e1bd402e', text='\n        \n', fragment='', nofollow=False)
Link(url='http://www.jianshu.com/p/d1d89ed69098', text='\n            \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/c9185e45c4e2#comments', text='\n           80\n', fragment='', nofollow=False)
...
...
Link(url='https://www.jianshu.com/p/1b638533689f', text='\n      \n    ', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/u/5428ad454a2c', text='\n        \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/1b638533689f#comments', text='\n           31\n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/9f18c99fb70c', text='\n      \n    ', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/u/a6c211083de3', text='\n        \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/9f18c99fb70c#comments', text='\n           39\n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/fdc53c576324', text='\n      \n    ', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/u/2b3ad4f2a058', text='\n        \n', fragment='', nofollow=False)
Link(url='https://www.jianshu.com/p/fdc53c576324#comments', text='\n           33\n', fragment='', nofollow=False)
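
Each Link in the output exposes url, text, fragment and nofollow attributes, so the results are easy to post-process. As a minimal sketch (reusing the links list from the parse method above; the '#comments' substring test is only an illustrative filter, not part of the original code), the duplicate comment anchors could be dropped like this:

# Drop the "#comments" anchors, which point at the same articles
article_links = [link_one for link_one in links
                 if '#comments' not in link_one.url]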

In particular, every Link has a url attribute, so the link itself can be read with link.url. In the code below, print(link_one) is changed to print(link_one.url):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor


class Jianshu(scrapy.Spider):
    name = "jianshu_spider"
    allowed_domains = ["jianshu.com"]

    def __init__(self, *args, **kwargs):
        super(Jianshu, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.jianshu.com/']

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="note-list"]/li')
        links = link.extract_links(response)
        if links:
            for link_one in links:
                # Print just the URL of each extracted Link
                print(link_one.url)

Running it produces the following output:

https://www.jianshu.com/p/ee7e0647fb56
https://www.jianshu.com/u/a7f876850fa6
http://www.jianshu.com/p/d1d89ed69098
https://www.jianshu.com/p/ee7e0647fb56#comments
...
...
https://www.jianshu.com/p/2b55d55d2100
https://www.jianshu.com/u/08529139a77b
https://www.jianshu.com/p/2b55d55d2100#comments
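
In a real spider you would normally follow the extracted links instead of printing them. Below is a minimal sketch of how parse might hand each link back to Scrapy (parse_article is a hypothetical callback, not part of the original code):

def parse(self, response):
    link = LinkExtractor(restrict_xpaths='//ul[@class="note-list"]/li')
    for link_one in link.extract_links(response):
        # Schedule each extracted URL for download and parsing
        yield scrapy.Request(link_one.url, callback=self.parse_article)

def parse_article(self, response):
    # Hypothetical detail-page parser: extract the article body, comments, etc.
    pass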

That covers the basic usage of LinkExtractor. Next, let's look at more of its extraction parameters.
2. More extraction parameters

allow: receives a regular expression or a list of regular expressions; only links whose URL matches are extracted.

def parse(self, response):
    pattern = r'/gsschool/.+\.shtml'
    link = LinkExtractor(allow=pattern)
    links = link.extract_links(response)

deny: the opposite of allow; links whose URL matches the expression are excluded.

def parse(self, response):
    pattern = r'/gsschool/.+\.shtml'
    link = LinkExtractor(deny=pattern)
    links = link.extract_links(response)

allow_domains: receives a domain or a list of domains; only links pointing to those domains are extracted.

def parse(self, response):
    domain = ['gaosivip.com', 'gaosiedu.com']
    link = LinkExtractor(allow_domains=domain)
    links = link.extract_links(response)

deny_domains: the opposite of allow_domains; links pointing to those domains are excluded.

def parse(self, response):
    domain = ['gaosivip.com', 'gaosiedu.com']
    link = LinkExtractor(deny_domains=domain)
    links = link.extract_links(response)

restrict_css: like restrict_xpaths, but the region to extract links from is described with a CSS selector.

def parse(self, response):
    link = LinkExtractor(restrict_css='ul.note-list > li')
    links = link.extract_links(response)

tags and attrs: which tags to look at and which attributes to read the link from; the defaults are tags=('a', 'area') and attrs=('href',).

def parse(self, response):
    link = LinkExtractor(tags='a', attrs='href')
    links = link.extract_links(response)
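
Where LinkExtractor really earns its reputation for whole-site crawling is inside a CrawlSpider: each Rule pairs a LinkExtractor with a callback, and Scrapy follows the matching links automatically, page after page. A minimal sketch along the lines of this lesson (parse_article is again a hypothetical callback):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JianshuCrawl(CrawlSpider):
    name = "jianshu_crawl"
    allowed_domains = ["jianshu.com"]
    start_urls = ['https://www.jianshu.com/']

    rules = (
        # Follow every link in the note list, parse each page with parse_article,
        # and keep following matching links found on those pages (follow=True)
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="note-list"]/li'),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        # Hypothetical detail-page parser: extract title, body, comments, etc.
        pass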