
Scraping Series (1): Starting by Scraping Practice Exercises

2018-02-06  by 飘涯

Preface:
This post works through a small example to deepen your understanding of web scraping, built mainly with bs4 (BeautifulSoup).

Main page:

(screenshot)

Detail page:

(screenshot)

from bs4 import BeautifulSoup
import requests

url = "http://www.runoob.com/python/python-100-examples.html"
# send the request and decode the response body
content = requests.get(url).content.decode("utf-8")

html = BeautifulSoup(content, "html.parser")
# collect the <a> tag of every exercise and read its href attribute
a = html.find(id="content").ul.find_all("a")
# build the full URL of each exercise page
url_list = []
for x in a:
    url_list.append("http://www.runoob.com" + x["href"])

for i in range(100):
    dic = {}
    html01 = requests.get(url_list[i]).content.decode("utf-8")
    soup02 = BeautifulSoup(html01, "html.parser")
    dic['title'] = soup02.find(id="content").h1.text
    # problem statement: two siblings past the first <p>, skipping the whitespace text node
    dic['content01'] = soup02.find(id="content").p.next_sibling.next_sibling.text
    # program analysis: two more siblings along
    dic['content02'] = soup02.find(id="content").p.next_sibling.next_sibling.next_sibling.next_sibling.text
    # sample solution: fall back to <pre> when there is no highlighted block
    try:
        dic['content03'] = soup02.find(class_="hl-main").text
    except AttributeError:
        dic['content03'] = soup02.find("pre").text

    with open("100_py.csv", "a+", encoding="utf-8") as file:
        file.write(dic['title'] + "\n")
        file.write(dic['content01'] + "\n")
        file.write(dic['content02'] + "\n")
        file.write(dic['content03'] + "\n")
        file.write("*" * 60 + "\n")

Result:
You can see the output file runs to over four thousand lines.

(screenshot of the output file)
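
To double-check that figure without opening the file, a quick check along these lines works (assuming the script above has already written 100_py.csv):

# count the lines written by the scraper
with open("100_py.csv", encoding="utf-8") as f:
    print(sum(1 for _ in f))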

Afterword:
Locating tags with bs4's find method is too cumbersome; I'd still recommend XPath.
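
For comparison, here is a minimal sketch of the same link extraction done with lxml and XPath. This is not from the original post, and the XPath expression is an assumption based on the page structure used above:

from lxml import etree
import requests

url = "http://www.runoob.com/python/python-100-examples.html"
content = requests.get(url).content.decode("utf-8")

# parse with lxml and grab every exercise link in one expression
# (assumed to match the same elements as find(id="content").ul.find_all("a"))
tree = etree.HTML(content)
hrefs = tree.xpath('//*[@id="content"]/ul//a/@href')
url_list = ["http://www.runoob.com" + h for h in hrefs]

One expression does the work of the chained find calls, which is why the afterword leans toward XPath for this kind of navigation.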

More in this series:
Scraping Series (4): Scraping QQ Music
Scraping Series (3): Scraping Job Listings
Scraping Series (2): Scraping Blog Posts
