
Pandas.read_html(): Grabbing Table Data from a Static Web Page

2019-04-26  一只小菠菜

Environment: Win10 + Cmder + Python 3.6.5

Requirement

Grab the air quality index table from http://www.air-level.com/air/xian/. Itching to break out the usual three-step scraping routine (request, parse, extract)?

Code

What if I told you three lines of code will do the job? (Straight face):

import pandas as pd
# read_html() returns a list of DataFrames, one per <table> on the page; take the first one
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
df.to_excel('xian_tianqi.xlsx', index=False)  # write it straight to an Excel file

First, the data as shown on the web page:

And now the same data in Excel:


Impressed? Honestly, so was I the first time I saw it...

Explanation

Part of the read_html() source (signature and docstring) looks like this:

# Some code omitted; for the full docstring, run: print(pd.read_html.__doc__)
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=None, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True, displayed_only=True):
    r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.

    Parameters
    ----------
    io : str or file-like
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

    flavor : str or None, container of strings
        The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
        each other, they are both there for backwards compatibility. The
        default of ``None`` tries to use ``lxml`` to parse and if that fails it
        falls back on ``bs4`` + ``html5lib``.

    header : int or list-like or None, optional
        The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
        make the columns headers.
......

As you can see, the io parameter of read_html() accepts several kinds of input, and a URL is one of them. By default the function uses lxml to parse every <table> tag on the page, pulling the data out of each <td>, and it returns a list of DataFrame objects, one per table. You then grab the DataFrame you want by indexing into that list.
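
To make the list behavior concrete, here is a minimal sketch (how many tables this particular page serves, and the 'AQI' match pattern, are assumptions rather than something verified above):

import pandas as pd

url = "http://www.air-level.com/air/xian/"

# read_html() returns one DataFrame per <table> it manages to parse
tables = pd.read_html(url, encoding='utf-8', header=0)
print(len(tables))          # how many tables were found on the page

df = tables[0]              # index into the list to pick the table you want
print(df.columns.tolist())  # column names taken from the header row

# When a page has several tables, match= keeps only those whose text
# matches the given regex (the 'AQI' pattern is just an assumption
# about this page's content):
# tables = pd.read_html(url, match='AQI', encoding='utf-8', header=0)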

Finally

read_html() can only parse static pages. For a dynamically loaded page, fetch the rendered HTML by some other means first, then pass that response.text into read_html() to pull out the table data.
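
For instance, you could fetch the HTML yourself and hand the string to read_html(). The sketch below uses requests only for illustration; for a JavaScript-rendered page you would swap in whatever produces the final HTML, such as Selenium's driver.page_source. (Very recent pandas versions may ask you to wrap the string in io.StringIO first.)

import pandas as pd
import requests

# Fetch the page ourselves, then hand the raw HTML string to read_html().
# For a JavaScript-rendered page you would get the HTML from Selenium
# (driver.page_source) or from the site's underlying API instead.
resp = requests.get("http://www.air-level.com/air/xian/")
resp.encoding = 'utf-8'                    # make sure the text decodes correctly

df = pd.read_html(resp.text, header=0)[0]  # parse the tables out of the HTML
df.to_excel('xian_tianqi.xlsx', index=False)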

Reference: https://mp.weixin.qq.com/s/CuhC7rCD6LPXLO88JVEuJg
