Scrapy Study Notes 2: Introduction to XPath

2017-11-10, by 法号无涯

<html>
  <head>
    <title>Title of the page</title>
  </head>
  <body>
    <h1>H1 Tag</h1>
    <h2>H2 Tag with <a href="#">link</a></h2>
    <p>First Paragraph</p>
    <p>Second Paragraph</p>
  </body>
</html>

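Take the simple HTML page above as an example.
Run the following commands in order (the first in a terminal, the second inside the Scrapy shell it opens):
scrapy shell
from scrapy.selector import Selector
Type the following into a text editor and copy it: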

html_doc = '''
<html>
  <head>
    <title>Title of the page</title>
  </head>
  <body>
    <h1>H1 Tag</h1>
    <h2>H2 Tag with <a href="#">link</a></h2>
    <p>First Paragraph</p>
    <p>Second Paragraph</p>
  </body>
</html>
'''

Then paste it into the shell with the %paste command. To parse the contents of html_doc with XPath, enter the following in order:

In [9]: sel = Selector(text=html_doc)

In [10]: sel.extract()
Out[10]: u'<html>\n  <head>\n    <title>Title of the page</title>\n  </head>\n  <body>\n    <h1>H1 Tag</h1>\n    <h2>H2 Tag with <a href="#">link</a></h2>\n    <p>First Paragraph</p>\n    <p>Second Paragraph</p>\n  </body>\n</html>'

In [11]: sel.xpath('/html/head/title')
Out[11]: [<Selector xpath='/html/head/title' data=u'<title>Title of the page</title>'>]

In [12]: sel.xpath('/html/head/title').extract()
Out[12]: [u'<title>Title of the page</title>']

In [13]: sel.xpath('//title').extract()
Out[13]: [u'<title>Title of the page</title>']

In [14]: sel.xpath('//text').extract()
Out[14]: []

In [15]: sel.xpath('//text()').extract()
Out[15]: 
[u'\n  ',
 u'\n    ',
 u'Title of the page',
 u'\n  ',
 u'\n  ',
 u'\n    ',
 u'H1 Tag',
 u'\n    ',
 u'H2 Tag with ',
 u'link',
 u'\n    ',
 u'First Paragraph',
 u'\n    ',
 u'Second Paragraph',
 u'\n  ',
 u'\n']

In [16]: sel.xpath('/html/body/p')
Out[16]: 
[<Selector xpath='/html/body/p' data=u'<p>First Paragraph</p>'>,
 <Selector xpath='/html/body/p' data=u'<p>Second Paragraph</p>'>]

In [17]: sel.xpath('/html/body/p').extract()
Out[17]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']

In [18]: sel.xpath('//p').extract()
Out[18]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']

In [19]: sel.xpath('//p[1]').extract()
Out[19]: [u'<p>First Paragraph</p>']

In [20]: sel.xpath('//p[2]').extract()
Out[20]: [u'<p>Second Paragraph</p>']

In [21]: sel.xpath('//p')[0].extract()
Out[21]: u'<p>First Paragraph</p>'

In [22]: sel.xpath('//p')[1].extract()
Out[22]: u'<p>Second Paragraph</p>'

In [23]: sel.xpath('//p/text()')[1].extract()
Out[23]: u'Second Paragraph'

Some XPath tools:

Viewing an element's XPath in Chrome: https://udemy-images.s3.amazonaws.com/redactor/2017-02-12_18-00-40-6fd2add5705fd0f5dbaf66a16683647d/CopyXPath.png
XPath Helper (Chrome Extension)
FireBug (Firefox Extension)
FirePath (Firefox Extension)
XPath Tester: http://www.freeformatter.com/xpath-tester.html

Exercise:

Question 3:
In the following code, how can you extract the URL only?

<a href="http://www.udemy.com">Udemy Platform</a>
Answer: //a/@href
