Web Scraping 2: BeautifulSoup

2021-04-26, by 若晴y

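The requests library from Level 0 took care of step 0 of scraping: fetching the data. The HTML knowledge from Level 1 is essential background for scraping, and it helps us parse and extract data.
Next, we hand the parsing and extraction over to BeautifulSoup, a flexible and convenient web-page parsing library.
So the goal of this level is: learn to use BeautifulSoup to parse and extract data from web pages.
What does [parsing data] mean?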

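When we browse the web, the browser translates the HTML source code returned by the server into a form we can read, and only then can we interact with the page.
A scraper likewise needs a tool that understands HTML before it can extract the data we want.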

That is what parsing data means.
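
As a minimal sketch of the parsing step, the snippet below hands a small HTML string to BeautifulSoup with Python's built-in `html.parser`. The HTML string is made up for illustration and is not taken from the course page; the point is only that parsing turns plain text into an object whose tags can be navigated.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (an assumption, for illustration only).
html = "<html><body><h1>Hello</h1><p class='intro'>Parsing demo</p></body></html>"

# Parsing turns the raw string into a tree of tags that we can navigate.
soup = BeautifulSoup(html, "html.parser")

print(type(html))  # the raw data is a plain string
print(type(soup))  # the parsed data is a BeautifulSoup object

# Once parsed, a tag can be reached by name and its text read directly.
heading = soup.h1.text
print(heading)
```

Before parsing, `html` is just text; after parsing, `soup` knows where each tag starts and ends.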

[Extracting data] means picking out the data we need from all the data available.
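
A short sketch of extraction, assuming a hand-written HTML snippet whose class names (`books`, `title`) mirror the ones used in this lesson's code; the book names and `example.com` links are invented. It contrasts `find()`, which returns only the first match, with `find_all()`, which returns every match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the class names follow the lesson's page.
html = """
<div class="books"><h2>Science fiction</h2><a class="title" href="https://example.com/a">Book A</a></div>
<div class="books"><h2>History</h2><a class="title" href="https://example.com/b">Book B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
first = soup.find(class_="books")
first_kind = first.find("h2").text

# find_all() returns a list-like collection of every matching tag.
all_books = soup.find_all(class_="books")
titles = [book.find(class_="title").text for book in all_books]
# Attribute values are read with dictionary-style access on the tag.
links = [book.find(class_="title")["href"] for book in all_books]

print(first_kind)
print(titles)
print(links)
```

Note that `class_` carries a trailing underscore because `class` is a reserved word in Python.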
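A reminder from your teacher: parsing and extracting data is both a key point and a difficult point in scraping. Because this level covers two steps, there is more to take in than in the previous two levels, so be prepared to put in more effort as you study.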
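import requests
from bs4 import BeautifulSoup

# Step 0: fetch the page and take its HTML source as a string
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text
# Parse the string into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Extract every element whose class is 'books'
items = soup.find_all(class_='books')
for item in items:
    # Within each block, take the category, the title link, and the summary
    kind = item.find('h2')
    title = item.find(class_='title')
    brief = item.find(class_='info')
    # Print the category, book title, link, and summary
    print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)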
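
The loop above assumes every `books` block contains an `h2`, a `title`, and an `info` element; if one is missing, `find()` returns `None` and `.text` raises an error. As a hedged sketch of one way to guard against that, the hypothetical helper `extract_books` below (not part of the original lesson) collects the same fields into dictionaries and fills in `None` for anything absent. The sample HTML is invented so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup

def extract_books(html):
    """Return one dict per 'books' block; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all(class_="books"):
        kind = item.find("h2")
        title = item.find(class_="title")
        brief = item.find(class_="info")
        records.append({
            # get_text(strip=True) trims surrounding whitespace
            "kind": kind.get_text(strip=True) if kind else None,
            "title": title.get_text(strip=True) if title else None,
            # .get() returns None instead of raising when the attribute is absent
            "href": title.get("href") if title else None,
            "brief": brief.get_text(strip=True) if brief else None,
        })
    return records

# Invented sample: the second block deliberately lacks a title and a summary.
sample = """
<div class="books">
  <h2>Science fiction</h2>
  <a class="title" href="https://example.com/a">Book A</a>
  <p class="info">A short summary.</p>
</div>
<div class="books">
  <h2>History</h2>
</div>
"""

records = extract_books(sample)
for record in records:
    print(record)
```

Returning dictionaries rather than printing directly also makes the results easier to save or process later.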
