scrapy

Scrapy爬虫——xpath与css选择器详解

2017-09-18  本文已影响0人  youyuge

有条件的请支持慕课实战正版课程,本blog仅仅是归纳总结,自用。

一、xpath部分

1.1 xpath简介

xpath简介.png

1.2 xpath语法

xpath语法图

1.3 xpath谓语语法

谓语(Predicates)谓语用来查找某个特定的节点或者包含某个指定的值的节点。谓语被嵌在方括号中。

xpathwei'yu

1.4 xpath其他语法

通配符 描述
* 匹配任何元素节点。
@* 匹配任何属性节点。
node() 匹配任何类型的节点。
xpath其他语法

二、css选择器

css选择器 css选择器2 css选择器3

三、scrapy选择器实战

这里是它的HTML源码:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br />![](image1_thumb.jpg)</a>
   <a href='image2.html'>Name: My image 2 <br />![](image2_thumb.jpg)</a>
   <a href='image3.html'>Name: My image 3 <br />![](image3_thumb.jpg)</a>
   <a href='image4.html'>Name: My image 4 <br />![](image4_thumb.jpg)</a>
   <a href='image5.html'>Name: My image 5 <br />![](image5_thumb.jpg)</a>
  </div>
 </body>
</html>

3.1 构造选择器

首先, 我们打开shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
>>> response.xpath('//title/text()').extract()
[u'Example website']
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

3.2选择器嵌套

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br>![](image1_thumb.jpg)</a>',
 u'<a href="image2.html">Name: My image 2 <br>![](image2_thumb.jpg)</a>',
 u'<a href="image3.html">Name: My image 3 <br>![](image3_thumb.jpg)</a>',
 u'<a href="image4.html">Name: My image 4 <br>![](image4_thumb.jpg)</a>',
 u'<a href="image5.html">Name: My image 5 <br>![](image5_thumb.jpg)</a>']

>>> for index, link in enumerate(links):
        args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
        print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

3.3 结合正则表达式使用选择器(selectors)

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'
上一篇 下一篇

猜你喜欢

热点阅读