Pandas.read_html(): Grabbing Table Data from Static Web Pages
2019-04-26
一只小菠菜
Environment: Win10 + Cmder + Python 3.6.5
Requirement
Fetch the air quality index table from http://www.air-level.com/air/xian/. Are you already itching to go through the usual three-step scraping routine (request the page, parse the HTML, extract the data)?
Code
Would you believe me if I told you three lines of code are enough? (Serious face):
import pandas as pd
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
df.to_excel('xian_tianqi.xlsx', index=False)
First, here is what the table looks like on the web page:
And here is the data exported to Excel:
Impressed? Honestly, so was I...
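If you would rather sanity-check the result in the console than open Excel, a minimal follow-on sketch (same call as above) is:

import pandas as pd

# read_html returns a list of DataFrames; [0] takes the first table on the page
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
print(df.shape)   # (rows, columns) parsed from the page's <table>
print(df.head())  # first few rows; column names come from the table's header row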
Explanation
An excerpt of the read_html() source is shown below:
# Partial source; for the full docstring, run: print(pd.read_html.__doc__)
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=None, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True, displayed_only=True):
    r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.

    Parameters
    ----------
    io : str or file-like
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

    flavor : str or None, container of strings
        The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
        each other, they are both there for backwards compatibility. The
        default of ``None`` tries to use ``lxml`` to parse and if that fails it
        falls back on ``bs4`` + ``html5lib``.

    header : int or list-like or None, optional
        The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
        make the columns headers.
    ...
As you can see, the io parameter of read_html() accepts input in several forms, and a URL is one of them. By default the function uses lxml to parse the data in each td cell of the table tags, and finally returns a list of DataFrame objects. Just index into that list to get the DataFrame you want.
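To make the "list of DataFrames" point concrete, here is a minimal sketch; note that the match value "AQI" is only an illustrative guess at text appearing in the target table, not something taken from the original post:

import pandas as pd

url = "http://www.air-level.com/air/xian/"
dfs = pd.read_html(url, encoding='utf-8', header=0)  # list of DataFrames, one per parsed <table>
print(len(dfs))  # how many tables were found on the page
df = dfs[0]      # pick the one you want by index

# Optionally, the `match` parameter keeps only tables whose text matches the
# given string/regex; "AQI" below is a hypothetical example value.
# dfs = pd.read_html(url, match="AQI", encoding='utf-8', header=0)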
Finally
read_html() only parses static HTML. For dynamically loaded pages, you can first obtain the rendered page's response.text by other means, then pass that text into read_html() to extract the table data.
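A rough sketch of that workflow, assuming the requests library is available (for pages whose tables are rendered by JavaScript you would need a headless browser such as Selenium to obtain the rendered HTML first):

import requests
import pandas as pd

# Fetch the HTML yourself, then hand the text to read_html();
# read_html() accepts a raw HTML string as its `io` argument.
resp = requests.get("http://www.air-level.com/air/xian/", timeout=10)
resp.encoding = 'utf-8'
dfs = pd.read_html(resp.text, header=0)
df = dfs[0]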