Getting Started with Web Scraping

2018-09-10  狂浪的心

Tools

infoLite

Using BeautifulSoup

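from bs4 import BeautifulSoup

# Sample HTML; each trailing backslash continues the string literal onto the next line.
html_simple = '\
<html>\
<body>\
<h1 id="title">Hello World</h1>\
<a href="#" class="link">This is link1</a>\
<a href="#link2" class="link">This is link2</a>\
</body>\
</html>'

# Name the parser explicitly so BeautifulSoup does not warn about guessing one.
soup = BeautifulSoup(html_simple, "html.parser")
print(soup.text)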

Output:

Hello WorldThis is link1This is link2
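The text above runs together because the sample markup has no whitespace between its tags. As a minimal sketch, assuming the same sample markup (with `href` spelled correctly), `get_text` can insert a separator between the pieces. The variable name `spaced` is only for illustration.

```python
from bs4 import BeautifulSoup

# Same sample markup as above, written as adjacent string literals.
html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)

soup = BeautifulSoup(html_simple, "html.parser")

# Join the text of each node with " | " instead of concatenating it directly.
spaced = soup.get_text(separator=" | ", strip=True)
print(spaced)  # Hello World | This is link1 | This is link2
```

For real pages, a newline separator is often more convenient than a visible delimiter.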

Selecting elements by tag name

select

soup = BeautifulSoup(html_simple, "html.parser")
header = soup.select("h1")
print(header)
print(header[0])
print(header[0].text)

alink = soup.select("a")
print(alink)
for link in alink:
    print(link)
    print(link.text)

Output:

[<h1 id="title">Hello World</h1>] // a list
<h1 id="title">Hello World</h1> // the first element
Hello World // its text
[<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
<a class="link" href="#">This is link1</a>
This is link1
<a class="link" href="#link2">This is link2</a>
This is link2
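Since `select` always returns a list, the `[0]` indexing above fails when nothing matches. A small sketch, assuming the same sample markup, of `select_one` (which returns the first match or `None`) and of `find_all`, which finds the same tags for a plain tag name. The names `missing` and `link_texts` are illustrative.

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

# select_one returns the first matching tag directly, so no [0] is needed.
header = soup.select_one("h1")
print(header.text)

# For a selector that matches nothing, select_one gives None
# and select gives an empty list, so check before using the result.
missing = soup.select_one("h2")
print(missing)
print(len(soup.select("h2")))

# find_all("a") finds the same tags as select("a").
link_texts = [a.text for a in soup.find_all("a")]
print(link_texts)
```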

Selecting elements by CSS id and class

header = soup.select("#title")  # prefix an id with #
print(header)
print(header[0])
print(header[0].text)

alink = soup.select(".link")  # prefix a class with .
print(alink)
for link in alink:
    print(link)
    print(link.text)

Output:

[<h1 id="title">Hello World</h1>]
<h1 id="title">Hello World</h1>
Hello World
[<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
<a class="link" href="#">This is link1</a>
This is link1
<a class="link" href="#link2">This is link2</a>
This is link2

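Difference between id and class
id: a unique identifier; it should appear on only one element in a page
class: a reusable label; many elements can share the same class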
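The id and class selectors can also be combined with tag names and attribute tests. The following is a rough sketch, assuming the same sample markup; the selectors shown are standard CSS, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

# An id is unique, so this selector matches exactly one element here.
titles = soup.select("#title")
print(len(titles))

# A class is shared, so this matches every <a> tag that has class "link".
links = soup.select("a.link")
print(len(links))

# An attribute selector narrows the match to a specific href value.
second = soup.select('a[href="#link2"]')
print(second[0].text)
```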

Getting the link from a tag

alink = soup.select(".link")  # prefix a class with .
for link in alink:
    print(link["href"])  # read the href attribute by key

Output:

#
#link2

A tag's attributes are stored in dictionary form, so they can be read by key, as link["href"] does above.
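The dictionary-style access described above can be sketched a little further. This assumes the same sample markup; `first` and `hrefs` are illustrative names. Note that `tag["name"]` raises `KeyError` for a missing attribute, while `tag.get("name")` returns `None`.

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

first = soup.select_one(".link")

# All attributes of the tag are available as a dictionary.
print(first.attrs)

# Key access works like a dict lookup.
print(first["href"])

# .get avoids a KeyError when the attribute is absent.
print(first.get("title"))

# class can hold several values, so BeautifulSoup returns it as a list.
print(first["class"])

# Collect the href of every <a> tag that actually has one.
hrefs = [a["href"] for a in soup.select("a[href]")]
print(hrefs)
```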

ForeSpider crawler software
