A Basic Introduction to pandas

2022-08-23, by 笔头还没烂

Overview

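Python is a strong choice for data processing: the code is flexible and quick to write. The pandas package in particular is widely used, both in data analysis and in big-data development.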

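pandas is a third-party Python package and one of the most popular toolkits for structured data in business and engineering. It is used for data cleaning, processing, and analysis.
Many pandas features have close counterparts in Spark, and some are used in much the same way, so learning pandas first makes Spark quicker to pick up.
Where pandas fits in a data development workflow: as data moves through a big-data pipeline, the rich pandas API lets you clean and process it flexibly and quickly.
pandas has several distinct advantages for data processing:
(1) It is built on NumPy, so vectorized operations run fast.
(2) It has a dedicated API for handling missing data.
(3) It offers powerful, flexible grouping, aggregation, and transformation.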
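The three points above can be sketched on a tiny table. This is only an illustration: the column names `team` and `score` and all the values are invented here, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [1.0, np.nan, 3.0, 5.0],
})

# (1) A numeric column can be viewed as a NumPy array.
values = df["score"].to_numpy()

# (2) Missing-data helpers: count, fill, or drop NaN values.
n_missing = int(df["score"].isna().sum())
filled = df["score"].fillna(0.0)
dropped = df.dropna(subset=["score"])

# (3) Grouping: aggregate to one value per group, or transform
#     back to the original shape (here, the deviation from the team mean).
team_mean = df.groupby("team")["score"].mean()
centered = df["score"] - df.groupby("team")["score"].transform("mean")

print(n_missing)
print(team_mean)
print(centered)
```

Note that `mean` skips missing values by default, so the missing score in team `a` does not affect that team's average.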

When to use it

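Use pandas when the data is large enough to make Excel lag badly, yet still fits on a single machine.
(1) pandas handles single-machine data, that is, datasets that are small by big-data standards.
Use pandas for the data cleaning and processing steps of big-data ETL and data-warehouse work.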
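A cleaning step of the kind described above might look roughly like the following. It is a minimal sketch: the inline CSV, its columns (`order_id`, `city`, `amount`), and its values are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical raw export; columns and values are invented for illustration.
raw = io.StringIO(
    "order_id,city,amount\n"
    "1, Beijing ,10.5\n"
    "2,Shanghai,not_available\n"
    "2,Shanghai,not_available\n"
    "3,,7\n"
    "4,Guangzhou,8\n"
)

df = pd.read_csv(raw)

cleaned = (
    df.drop_duplicates()                      # remove exact duplicate rows
      .assign(
          # trim stray whitespace around city names
          city=lambda d: d["city"].str.strip(),
          # turn unparseable amounts into NaN instead of raising
          amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
      )
      .dropna(subset=["city", "amount"])      # drop rows missing key fields
      .reset_index(drop=True)
)

print(cleaned)
```

Whether to drop or fill incomplete rows depends on the downstream use; dropping is chosen here only to keep the sketch short.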

Installing pandas

Open a cmd window on Windows, or the Terminal app on macOS, and run the following command to install pandas.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ pandas

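Notes:
(1) https://pypi.tuna.tsinghua.edu.cn/simple/ is the Tsinghua University PyPI mirror. It is recommended when downloading Python packages from within mainland China.
(2) macOS comes with a pip3 tool, which can be found under "/Users/your-username/Library/Python/3.8/bin". If the terminal reports that pip3 or pip cannot be found when you install pandas, add "/Users/your-username/Library/Python/3.8/bin" to the PATH environment variable, replacing your-username with your own macOS username.
To add the directory to PATH: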
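(1) Create a .bash_profile file in your home directory, or open it if it already exists. (If your terminal runs zsh, the default shell on recent macOS versions, edit ~/.zshrc instead, because zsh does not read .bash_profile.) The commands are: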

cd ~
vim .bash_profile

(2) Append the following at the end of .bash_profile, again replacing your-username with your own macOS username:

#PIP_HOME
PATH="/Users/your-username/Library/Python/3.8/bin:${PATH}"
export PATH

(3) Run source so that the environment variable takes effect:

source ~/.bash_profile

You can now use the pip or pip3 command in the terminal.
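To check whether the PATH change is needed, or whether it worked, a short Python script can compare the per-user script directory against PATH. This sketch assumes the macOS/Linux layout in which user-installed scripts go to a "bin" folder under the user base directory; Windows uses a different layout.

```python
import os
import site

# Assumption: on macOS and Linux, user-level installs place scripts in
# "<user base>/bin". Windows uses a different directory layout.
user_bin = os.path.join(site.getuserbase(), "bin")

# Split PATH into its individual directories and look for an exact match.
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
on_path = user_bin in path_dirs

print("user script directory:", user_bin)
print("already on PATH:", on_path)
```

Run it in a new terminal window after editing the profile file, so that the updated PATH is the one being inspected.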

Finally, here is the installation process and its log output on macOS:

a1@1deMac-mini ~ % pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pandas
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c8/85/8afe540bd0299c4d58f0a5b88acc49a8021804abe05a00d2cbc2fccde873/pandas-1.4.3-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.4/11.4 MB 1.7 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/d5/50/54451e88e3da4616286029a3a17fc377de817f66a0f50e1faaee90161724/pytz-2022.2.1-py2.py3-none-any.whl (500 kB)
Collecting python-dateutil>=2.8.1
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting numpy>=1.18.5
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/fc/90/fa2ca0f2fcabbfd970e1e78f820d8639683c36525e1c89d9bd20e69230a7/numpy-1.23.2-cp38-cp38-macosx_10_9_x86_64.whl (18.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.1/18.1 MB 2.6 MB/s eta 0:00:00
Requirement already satisfied: six>=1.5 in /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas) (1.15.0)
Installing collected packages: pytz, python-dateutil, numpy, pandas
  WARNING: The scripts f2py, f2py3 and f2py3.8 are installed in '/Users/a1/Library/Python/3.8/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed numpy-1.23.2 pandas-1.4.3 python-dateutil-2.8.2 pytz-2022.2.1
a1@1deMac-mini ~ %

Seeing "Successfully installed" in the terminal means the installation succeeded.
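As an extra check, the installation can also be confirmed from Python itself. A minimal sketch, using a throwaway one-column table:

```python
import numpy as np
import pandas as pd

# Print the installed versions; these should match the pip log.
print("pandas", pd.__version__)
print("numpy", np.__version__)

# Build a tiny DataFrame to confirm the package works, not just imports.
df = pd.DataFrame({"x": [1, 2, 3]})
total = int(df["x"].sum())
print(total)
```

If the import fails even though pip reported success, pip may have installed into a different Python interpreter than the one running the script.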

A first look at pandas

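Open PyCharm, create an empty project, and configure its environment. In the project directory, create two folders: datas, for the test data (the test data must be read with "GBK" encoding, otherwise the text will be garbled), and src, for the source code. In src, create a Python file (for example test.py) with the following code: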

# 1 - Import pandas
import pandas as pd

# Task: given a CSV file of global GDP by year, select the rows for 1960
# 2 - Read the file (it is GBK-encoded)
df = pd.read_csv('../datas/1960-2019全球GDP数据.csv', encoding='GBK')

# Print the whole file
# print(df)

# 3 - Select the 1960 GDP rows
gdp = df[df.year == 1960]
print(gdp)

The project setup and the run output are shown in the screenshot below:

image.png

As this shows, pandas makes it easy to load the contents of CSV, Excel, and similar files into a table and then analyze that table.
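For readers without the test data, the same read-and-filter flow can be tried on a stand-in file. The sketch below writes a small GBK-encoded CSV to a temporary directory and reads it back. Only the `year` column comes from the code above; the `country` and `gdp` columns, the file name, and every value are placeholders invented for the example.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows: only `year` is known from the original script;
# `country`, `gdp`, and all values are invented for illustration.
rows = pd.DataFrame({
    "year": [1960, 1960, 1961],
    "country": ["甲国", "乙国", "甲国"],
    "gdp": [1.0, 2.0, 1.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gdp_sample.csv")
    # Write with GBK so that reading requires the matching encoding.
    rows.to_csv(path, index=False, encoding="GBK")
    df = pd.read_csv(path, encoding="GBK")

# Boolean indexing: keep only the rows whose year is 1960.
gdp_1960 = df[df["year"] == 1960]
print(gdp_1960)
```

For an actual .xlsx workbook, `pd.read_excel` plays the same role as `pd.read_csv`, though it needs an extra engine such as openpyxl installed.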

If you need the test data, you can download it from the link below:
Link: https://pan.baidu.com/s/1a-KHNrOI2SlpftbdW5uxxA Extraction code: he0j

That's all. Thanks for reading!
