Parsing HTML Documents with Python's HTMLParser

2017-09-25 00fce043cf44

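HTMLParser is a module that ships with Python. It is simple to use and makes it easy to analyze HTML files.
This article gives a brief introduction to its usage. Note that the module is named HTMLParser in Python 2 and html.parser in Python 3.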

To use it, define a class that inherits from HTMLParser and override these methods:

handle_starttag(tag, attrs)
handle_startendtag(tag, attrs)
handle_endtag(tag)
handle_data(data)
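
To see when each of these callbacks fires, here is a minimal sketch for Python 3. The class name EventLogger and the sample markup are made up for illustration; only the four handler names come from the list above.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Fired for XHTML-style empty tags such as <br/>
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>hello<br/>world</p>")
logger.close()
for event in logger.events:
    print(event)
```

Each tag and each run of text produces exactly one event, in document order.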

1. Getting tag attributes

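tag is the name of the HTML tag, and attrs is a list of (attribute, value) tuples.

For example, given the tag: <input type="hidden" name="NXX" id="IDXX" value="VXX" />

its attrs list is [('type', 'hidden'), ('name', 'NXX'), ('id', 'IDXX'), ('value', 'VXX')]
HTMLParser automatically converts the tag name and the attribute names to lowercase; attribute values keep their original case.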
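
The lowercasing can be checked with a small Python 3 sketch. The class name AttrCollector is hypothetical, and the tag is the one from the example above with its tag and attribute names deliberately written in mixed case.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: remembers (tag, attrs) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


collector = AttrCollector()
collector.feed('<INPUT Type="hidden" NAME="NXX" id="IDXX" value="VXX" />')
collector.close()

tag, attrs = collector.seen[0]
print(tag)                       # tag name, lowercased by the parser
print(attrs)                     # list of (name, value) tuples
print(dict(attrs).get("name"))   # dict() gives convenient lookup by name
```

Converting attrs with dict() is handy when attribute names are unique; if a tag repeats an attribute, the dict keeps only the last value.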

The following example extracts all links from an HTML document:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []  # href values collected from <a> tags

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased,
        # so "a" and "href" also match <A HREF=...>
        if tag == "a":
            for (variable, value) in attrs:
                if variable == "href":
                    self.links.append(value)


if __name__ == "__main__":
    html_code = """
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)

Output:

['www.google.com', 'www.pythonclub.org', 'www.sina.com.cn']

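To extract image links such as:
<img src="http://www.google.com/intl/zh-CN_ALL/images/logo.gif" />
override handle_startendtag(tag, attrs), which is called for self-closing tags like this one. Its default implementation simply calls handle_starttag() followed by handle_endtag().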
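
As a rough Python 3 sketch of image-link extraction: because the default handle_startendtag() forwards to handle_starttag(), overriding handle_starttag() alone covers both <img ... /> and <img ...>. The class name ImageParser and the second image (logo2.gif) are made up for illustration.

```python
from html.parser import HTMLParser


class ImageParser(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <img ... /> through the default handle_startendtag()
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)


html_code = """
<img src="http://www.google.com/intl/zh-CN_ALL/images/logo.gif" />
<IMG SRC="logo2.gif">
"""
ip = ImageParser()
ip.feed(html_code)
ip.close()
print(ip.images)
```

If handle_startendtag() is overridden without calling handle_starttag(), self-closing tags no longer reach handle_starttag(), so the two forms then have to be handled separately.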

2. Getting tag content

The file test1.html contains:

<html>
<head>
<title> XHTML 与 HTML 4.01 标准没有太多的不同</title>
</head>
<body>
i love you
</body>
</html>

2.1 A first example

# Python 2 code (in Python 3: from html.parser import HTMLParser, and print is a function)
import HTMLParser
 
class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        # Not used yet; section 2.2.1 relies on these two attributes
        self.handledtags = ['title','body']
        self.processing = None
 
    def handle_starttag(self,tag,attrs):
        # Called for every opening tag
        print '--------------'
        print 'handle start func',tag
 
    def handle_endtag(self,tag):
        # Called for every closing tag
        print '======================='
        print 'handle end func',tag
 
if __name__ == '__main__':
    with open('test1.html') as fd:
        tp = TitleParser()
        tp.feed(fd.read())

Output:

--------------
handle start func html
--------------
handle start func head
--------------
handle start func title
=======================
handle end func title
=======================
handle end func head
--------------
handle start func body
=======================
handle end func body
=======================
handle end func html

As the output shows, the parser calls handle_starttag() whenever it meets an opening tag <...> and handle_endtag() whenever it meets a closing tag </...>.

2.2 Adding a handle_data method

# Python 2 code (in Python 3: from html.parser import HTMLParser, and print is a function)
import HTMLParser
 
class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        # Not used yet; section 2.2.1 relies on these two attributes
        self.handledtags = ['title','body']
        self.processing = None
 
    def handle_starttag(self,tag,attrs):
        print '--------------'
        print 'handle start func',tag
 
    def handle_data(self,data):
        # Called for the text found between tags
        print '####'
        print 'handle data func'
        if data == '\n':
            # Make a bare newline visible in the output
            print r'\n'
        else:
            # Trailing comma: print without adding a newline
            print data,
 
    def handle_endtag(self,tag):
        print '======================='
        print 'handle end func',tag
 
if __name__ == '__main__':
    with open('test1.html') as fd:
        tp = TitleParser()
        tp.feed(fd.read())

Output:

--------------
handle start func html
####
handle data func
\n
--------------
handle start func head
####
handle data func
\n
--------------
handle start func title
####
handle data func
 XHTML 与 HTML 4.01 标准没有太多的不同 =======================
handle end func title
####
handle data func
\n
=======================
handle end func head
####
handle data func
\n
--------------
handle start func body
####
handle data func
 
i love you
=======================
handle end func body
####
handle data func
\n
=======================
handle end func html

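Notes:

handle_data() is called for the text between tags, not for the tags themselves. In this file every tag except the final </html> is followed by some text, often just a newline, so a handle_data() call appears after almost every tag, whether opening <...> or closing </...>.
The first and second lines of the HTML file are <html> and <head>. No real data follows them, only a line break, so handle_data() receives that newline and the output shows \n; the same happens after </title>, </head> and </body>.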
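
A small Python 3 sketch makes these whitespace-only calls visible and shows one way to filter them out. The class name DataCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: stores every string passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


doc = "<html>\n<body>\n<p> demo </p>\n</body>\n</html>"
collector = DataCollector()
collector.feed(doc)
collector.close()

# repr() makes the bare newlines visible
print([repr(chunk) for chunk in collector.chunks])

# Keep only chunks that contain something other than whitespace
meaningful = [chunk.strip() for chunk in collector.chunks if chunk.strip()]
print(meaningful)
```

Only one of the five chunks carries real text; the other four are the newlines that separate the tags.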

2.2.1 Extracting only the content we need

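# Python 2 code (in Python 3: from html.parser import HTMLParser, and print is a function)
import HTMLParser
 
class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.handledtags = ['title','body']  # tags whose text we want
        self.processing = None               # tag currently being collected, if any
        self.data = []
 
    def handle_starttag(self,tag,attrs):
        # Start collecting when a wanted tag opens
        if tag in self.handledtags:
            self.processing = tag
 
    def handle_data(self,data):
        # Keep text only while inside a wanted tag
        if self.processing:
            self.data.append(data)
 
    def handle_endtag(self,tag):
        # Stop collecting when that tag closes
        if tag == self.processing:
            self.processing = None
 
if __name__ == '__main__':
    with open('test1.html') as fd:
        tp = TitleParser()
        tp.feed(fd.read())
    for each in tp.data:
        print each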

Output:

XHTML 与 HTML 4.01 标准没有太多的不同

i love you
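
For readers on Python 3, the same idea might look like the sketch below. It is an adaptation, not the original listing: the class name TitleBodyParser and the wanted parameter are made up, the page is fed from an inline string instead of test1.html, and surrounding whitespace is stripped from each chunk.

```python
from html.parser import HTMLParser


class TitleBodyParser(HTMLParser):
    """Hypothetical Python 3 variant: collects text inside the wanted tags."""

    def __init__(self, wanted=("title", "body")):
        super().__init__()
        self.wanted = wanted
        self.processing = None  # tag currently being collected, if any
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.processing = tag

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only chunks and text outside the wanted tags
        if self.processing and text:
            self.data.append(text)

    def handle_endtag(self, tag):
        if tag == self.processing:
            self.processing = None


page = """<html>
<head>
<title> XHTML 与 HTML 4.01 标准没有太多的不同</title>
</head>
<body>
i love you
</body>
</html>"""

parser = TitleBodyParser()
parser.feed(page)
parser.close()
print(parser.data)
```

Like the original, this tracks only one tag at a time, so a wanted tag nested inside another wanted tag would cut the outer one short.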

2.3 Example: parsing Douban's popular movies

from html.parser import HTMLParser
from urllib import request
import ssl

# Disable SSL certificate verification (insecure; only acceptable for a quick demo)
ssl._create_default_https_context = ssl._create_unverified_context


class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    def handle_starttag(self, tag, attrs):
        print('start: <%s> attrs %s' % (tag, attrs))

        def _attr(attrlist, attrname):
            # Return the value of attrname, or None if the attribute is absent
            for each in attrlist:
                if attrname == each[0]:
                    return each[1]
            return None

        # Movie entries are <li> tags that carry data-* attributes
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['actors'] = _attr(attrs, 'data-actors')
            movie['director'] = _attr(attrs, 'data-director')
            # Originally misspelled as 'data-dutation', which always yields None
            # (hence the None column in the output below)
            movie['duration'] = _attr(attrs, 'data-duration')
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            print(_attr(attrs, 'data-actors'))
            self.movies.append(movie)

    def handle_endtag(self, tag):
        print('end: </%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('startendtag: <%s/> attrs %s' % (tag, attrs))

    def handle_data(self, data):
        print('data: %s' % data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    # The next two handlers are only called when the parser is created
    # with convert_charrefs=False
    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)


def movieparser(url):
    # Download the page, feed it to the parser and return the collected movies
    myparser = MyHTMLParser()
    with request.urlopen(url) as f:
        data = f.read().decode('utf-8')
        myparser.feed(data)
        myparser.close()

    return myparser.movies


if __name__ == '__main__':
    url = 'https://movie.douban.com/'
    movies = movieparser(url)
    for each in movies:
        print('%(title)s|%(rate)s|%(actors)s|%(director)s|%(duration)s' % each)

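Output (movie lines only; the last field, duration, is None in this run because the lookup used the misspelled key 'data-dutation'):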

猩球崛起3：终极之战 War for the Planet of the Apes|7.1|安迪·瑟金斯 / 伍迪·哈里森 / 史蒂夫·茨恩|马特·里夫斯|None
王牌保镖 The Hitman's Bodyguard|7.3|瑞恩·雷诺兹 / 塞缪尔·杰克逊 / 加里·奥德曼|帕特里克·休斯|None
羞羞的铁拳||艾伦 / 马丽 / 沈腾|宋阳|None
看不见的客人 Contratiempo|8.7|马里奥·卡萨斯 / 阿娜·瓦格纳 / 何塞·科罗纳|奥里奥尔·保罗|None
缝纫机乐队||大鹏 / 乔杉 / 娜扎|大鹏|None
蜘蛛侠：英雄归来 Spider-Man: Homecoming|7.5|汤姆·霍兰德 / 小罗伯特·唐尼 / 玛丽莎·托梅|乔·沃茨|None
英伦对决 The Foreigner||成龙 / 皮尔斯·布鲁斯南 / 刘涛|马丁·坎贝尔|None
空天猎||李晨 / 范冰冰 / 王千源|李晨|None
追龙 追龍||甄子丹 / 刘德华 / 姜皓文|王晶|None
托马斯大电影之了不起的比赛 Thomas & Friends: The Great Race||马克·莫拉根 / 大卫·拜德拉 / 奥利维娅·科尔曼|大卫·斯特登|None
惊天解密 Unlocked|5.7|劳米·拉佩斯 / 托妮·科莱特 / 奥兰多·布鲁姆|迈克尔·艾普特|None
敦刻尔克 Dunkirk|8.6|菲恩·怀特海德 / 汤姆·格林-卡尼 / 杰克·劳登|克里斯托弗·诺兰|None
战狼2|7.4|吴京 / 弗兰克·格里罗 / 吴刚|吴京|None
捍卫者||白恩 / 吕星辰 / 赫子铭|廖希|None
刀剑神域：序列之争 劇場版 ソードアート・オンライン -オーディナル・スケール|7.2|松冈祯丞 / 户松遥 / 伊藤加奈惠|伊藤智彦|None
极致追击||奥兰多·布鲁姆 / 吴磊 / 任达华|Charles|None
天梯：蔡国强的艺术|8.6|蔡国强 / 蔡文悠 / 蔡文浩|凯文·麦克唐纳|None
昆塔：反转星球||李正翔 / 洪海天 / 陶典|李炼|None
理查大冒险 Richard the Stork||尼科莱特·克雷比茨 / Marc / Jason|托比·格恩科尔|None
银魂 銀魂|7.4|小栗旬 / 菅田将晖 / 桥本环奈|福田雄一|None
画室惊魂||罗翔 / 杨欣 / 陈美行|邢博|None
钢铁飞龙之再见奥特曼||侯勇 / 大张伟 / 金晨|王巍|None
海边的曼彻斯特 Manchester by the Sea|8.6|卡西·阿弗莱克 / 卢卡斯·赫奇斯 / 米歇尔·威廉姆斯|肯尼斯·罗纳根|None
星际特工：千星之城 Valérian and the City of a Thousand Planets|7.2|戴恩·德哈恩 / 卡拉·迪瓦伊 / 克里夫·欧文|吕克·贝松|None
纯洁心灵·逐梦演艺圈|2.0|朱哲健 / 李彦漫 / 陈思瀚|毕志飞|None
建军大业||刘烨 / 朱亚文 / 黄志忠|刘伟强|None
声之形 聲の形|6.9|入野自由 / 早见沙织 / 松冈茉优|山田尚子|None
地球：神奇的一天 Earth: One Amazing Day|8.2|成龙 / 罗伯特·雷德福|理查德·戴尔|None
奋斗|7.5|陈燕燕 / 郑君里 / 袁丛美|史东山|None
谜证||苗侨伟 / 袁嘉敏 / 桑平|汪洋|None
初恋时光||黄又南 / 邓紫衣 / 叶山豪|殷国君|None
疯狂旅程||刘亮 / 白鸽 / 陆进|龙野|None
十万个冷笑话2|7.7|山新 / 郝祥海 / 李姝洁|卢恒宇|None
心理罪|5.4|廖凡 / 李易峰 / 万茜|谢东燊|None
魔都爱之十二星座||李梓溪 / 孙立洋 / 马璐|唐昱|None
请勿靠近||马可 / 李毓芬 / 郑雅文|张显|None
三生三世十里桃花|4.0|刘亦菲 / 杨洋 / 罗晋|赵小丁|None
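
The attribute lookup can also be tried offline, without fetching the live page. The Python 3 sketch below uses dict(attrs) instead of a hand-written search loop. The sample markup, the film values, the FIELDS tuple and the class name MovieParser are all made up for illustration; only the data-* attribute names follow the ones used in the example above, with data-duration assumed to be the intended spelling.

```python
from html.parser import HTMLParser

# Fields to read from the data-* attributes (assumed names, see lead-in)
FIELDS = ("title", "rate", "actors", "director", "duration")


class MovieParser(HTMLParser):
    """Hypothetical helper: builds one dict per <li> that has data-title."""

    def __init__(self):
        super().__init__()
        self.movies = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li" and attr_map.get("data-title"):
            # Missing attributes become None, as in the original _attr helper
            self.movies.append(
                {field: attr_map.get("data-" + field) for field in FIELDS}
            )


sample = """
<ul>
  <li data-title="Film A" data-rate="8.6" data-actors="X / Y"
      data-director="Z" data-duration="106 min">poster</li>
  <li class="nav">not a movie</li>
</ul>
"""

mp = MovieParser()
mp.feed(sample)
mp.close()
for movie in mp.movies:
    print("%(title)s|%(rate)s|%(actors)s|%(director)s|%(duration)s" % movie)
```

Testing against a fixed string like this keeps the parsing logic checkable even when the real page changes its markup.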
