Pandas Crash Course, Part 1

2020-01-07 python测试开发

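Pandas is one of the most popular Python packages and is widely used for data processing. It is powerful and versatile, and it makes data cleaning and wrangling considerably easier.

The Pandas library has contributed a great deal to the Python community and has helped make Python one of the leading languages for data science and analytics. It has become a first choice for data analysts and data scientists who need to analyze and manipulate data.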

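What is Pandas?

The Pandas package provides many functions that are essential for data processing and manipulation. In short, it can perform the following tasks:

Create structured datasets similar to R data frames and Excel spreadsheets.
Read data from various sources, such as CSV, TXT, XLSX, SQL databases, and R.
Select specific rows or columns from a dataset.
Sort data in ascending or descending order.
Filter data based on conditions.
Summarize data by a categorical variable.
Reshape data into wide or long format.
Perform time series analysis.
Merge and concatenate two datasets.
Iterate over the rows of a dataset.
Write or export data in CSV or Excel format.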

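Important Pandas functions

Task	Function
Extract column names	df.columns
Select the first two rows	df.iloc[:2]
Select the first two columns	df.iloc[:, :2]
Select columns by name	df.loc[:, ["col1", "col2"]]
Select a random number of rows	df.sample(n=10)
Select a random fraction of rows	df.sample(frac=0.2)
Rename variables	df.rename()
Set a column as the index	df.set_index()
Drop rows or columns	df.drop()
Sort values	df.sort_values()
Group by variables	df.groupby()
Filter	df.query()
Find missing values	df.isnull()
Drop missing values	df.dropna()
Drop duplicates	df.drop_duplicates()
Create dummy variables	pd.get_dummies()
Rank	df.rank()
Cumulative sum	df.cumsum()
Quantiles	df.quantile()
Select numeric variables	df.select_dtypes()
Concatenate two data frames	pd.concat()
Merge on a common variable	pd.merge()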
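
To make the table concrete, here is a minimal sketch that runs several of these functions on a small data frame. The data and the column names col1, col2, and col3 are hypothetical and were invented for this illustration; they are not part of the income or iris datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c", "b", "a"],
    "col2": [3, 1, 2, None, 5, 3],
    "col3": [1.5, 2.5, 3.5, 4.5, 5.5, 1.5],
})

first_two_rows = df.iloc[:2]            # rows at positions 0 and 1
first_two_cols = df.iloc[:, :2]         # columns at positions 0 and 1
by_name = df.loc[:, ["col1", "col3"]]   # columns selected by label

missing_per_column = df.isnull().sum()  # number of missing values per column
complete = df.dropna()                  # drop rows containing a missing value
deduplicated = df.drop_duplicates()     # drop rows that repeat an earlier row

sorted_df = complete.sort_values("col2", ascending=False)  # largest col2 first
totals = complete.groupby("col1")["col2"].sum()            # sum of col2 per group
filtered = complete.query("col2 > 2")                      # rows where col2 > 2
numeric_only = df.select_dtypes(include="number")          # numeric columns only
dummies = pd.get_dummies(df["col1"])                       # one indicator column per category

print(totals)
print(filtered)
```

Each call returns a new object and leaves df unchanged, which is why the results are assigned to separate names.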

Datasets and quick start

The income and iris datasets can be found in QQ groups 630011153 and 144081101.


>>> import pandas as pd
>>> income = pd.read_csv("/home/andrew/income.csv")
>>> income.head()
  Index       State    Y2002    Y2003    Y2004    Y2005  ...    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015
0     A     Alabama  1296530  1317711  1118631  1492583  ...  1237582  1440756  1186741  1852841  1558906  1916661
1     A      Alaska  1170302  1960378  1818085  1447852  ...  1629616  1230866  1512804  1985302  1580394  1979143
2     A     Arizona  1742027  1968140  1377583  1782199  ...  1300521  1130709  1907284  1363279  1525866  1647724
3     A    Arkansas  1485531  1994927  1119299  1947979  ...  1669295  1928238  1216675  1591896  1360959  1329341
4     C  California  1685349  1675807  1889570  1480280  ...  1624509  1639670  1921845  1156536  1388461  1644607

[5 rows x 16 columns]
>>> income.columns
Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
       'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
      dtype='object')
>>> income.dtypes
Index    object
State    object
Y2002     int64
Y2003     int64
Y2004     int64
Y2005     int64
Y2006     int64
Y2007     int64
Y2008     int64
Y2009     int64
Y2010     int64
Y2011     int64
Y2012     int64
Y2013     int64
Y2014     int64
Y2015     int64
dtype: object
>>> income['State'].dtypes
dtype('O')
>>> income.Y2008 = income.Y2008.astype(float)
>>> income.Y2008.dtypes
dtype('float64')
>>> income.shape
(51, 16)
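
The same inspection steps can be tried without the CSV file. The sketch below assumes a three-row extract copied from the head() output above and reads it from an in-memory string; the names csv_text and sample are illustrative only.

```python
import io

import pandas as pd

# Three rows and four columns copied from the head() output above.
csv_text = """Index,State,Y2002,Y2003
A,Alabama,1296530,1317711
A,Alaska,1170302,1960378
C,California,1685349,1675807
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))

# Convert one column to float; bracket assignment works for new and existing columns.
sample["Y2002"] = sample["Y2002"].astype(float)

print(sample.dtypes)
print(sample.shape)
```

Bracket assignment is used here instead of attribute assignment because attribute assignment only updates a column that already exists and silently sets a plain attribute otherwise.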

>>> income[0:5]
  Index       State    Y2002    Y2003    Y2004    Y2005  ...    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015
0     A     Alabama  1296530  1317711  1118631  1492583  ...  1237582  1440756  1186741  1852841  1558906  1916661
1     A      Alaska  1170302  1960378  1818085  1447852  ...  1629616  1230866  1512804  1985302  1580394  1979143
2     A     Arizona  1742027  1968140  1377583  1782199  ...  1300521  1130709  1907284  1363279  1525866  1647724
3     A    Arkansas  1485531  1994927  1119299  1947979  ...  1669295  1928238  1216675  1591896  1360959  1329341
4     C  California  1685349  1675807  1889570  1480280  ...  1624509  1639670  1921845  1156536  1388461  1644607

[5 rows x 16 columns]
>>> income.iloc[0:5]
  Index       State    Y2002    Y2003    Y2004    Y2005  ...    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015
0     A     Alabama  1296530  1317711  1118631  1492583  ...  1237582  1440756  1186741  1852841  1558906  1916661
1     A      Alaska  1170302  1960378  1818085  1447852  ...  1629616  1230866  1512804  1985302  1580394  1979143
2     A     Arizona  1742027  1968140  1377583  1782199  ...  1300521  1130709  1907284  1363279  1525866  1647724
3     A    Arkansas  1485531  1994927  1119299  1947979  ...  1669295  1928238  1216675  1591896  1360959  1329341
4     C  California  1685349  1675807  1889570  1480280  ...  1624509  1639670  1921845  1156536  1388461  1644607

[5 rows x 16 columns]

>>> s = pd.Series([1,2,3,1,2], dtype="category")
>>> s
0    1
1    2
2    3
3    1
4    2
dtype: category
Categories (3, int64): [1, 2, 3]
>>> income.Index.unique()
array(['A', 'C', 'D', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P',
       'R', 'S', 'T', 'U', 'V', 'W'], dtype=object)
>>> income.Index.nunique()
19
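
As a small sketch of what the calls above return, the example below looks inside a categorical Series and counts distinct values. The letters Series is hypothetical sample data, not the income dataset.

```python
import pandas as pd

# Same categorical Series as in the transcript above.
s = pd.Series([1, 2, 3, 1, 2], dtype="category")
categories = list(s.cat.categories)  # the distinct category values
codes = list(s.cat.codes)            # integer position of each value in the categories

# Hypothetical labels standing in for a column such as income.Index.
letters = pd.Series(["A", "A", "C", "D", "A"])
distinct = list(letters.unique())    # distinct values in order of first appearance
n_distinct = letters.nunique()       # number of distinct values

print(categories, codes)
print(distinct, n_distinct)
```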

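Creating a crosstab

pd.crosstab() creates a bivariate frequency distribution, that is, a table counting how often each pair of values occurs.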

>>> pd.crosstab(income.Index,income.State)
State  Alabama  Alaska  Arizona  Arkansas  California  ...  Virginia  Washington  West Virginia  Wisconsin  Wyoming
Index                                                  ...
A            1       1        1         1           0  ...         0           0              0          0        0
C            0       0        0         0           1  ...         0           0              0          0        0
D            0       0        0         0           0  ...         0           0              0          0        0
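
Because every state appears exactly once, the table above contains only zeros and ones. A crosstab is easier to read on data with repeated pairs, as in this minimal sketch; the people data frame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
people = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "answer": ["yes", "no", "yes", "yes", "no"],
})

# Count how often each (group, answer) pair occurs.
table = pd.crosstab(people["group"], people["answer"])

# margins=True adds row and column totals labelled "All".
table_with_totals = pd.crosstab(people["group"], people["answer"], margins=True)

print(table)
print(table_with_totals)
```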

Frequency distribution

>>> income.Index.value_counts(ascending = True)
G    1
F    1
H    1
P    1
L    1
U    1
R    1
V    2
T    2
K    2
S    2
D    2
C    3
O    3
W    4
A    4
I    4
M    8
N    8
Name: Index, dtype: int64
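
value_counts() can also report proportions instead of raw counts. The following is a minimal sketch on a hypothetical Series rather than the income data.

```python
import pandas as pd

# Hypothetical labels, invented for illustration.
letters = pd.Series(["M", "N", "M", "A", "M", "N"])

counts = letters.value_counts()                # most frequent value first by default
shares = letters.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```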

Sampling

>>> income.sample(n = 10)      # draw 10 rows at random
>>> income.sample(frac = 0.2)  # draw a random 20% of the rows
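
Random sampling returns different rows on each run unless a seed is fixed. The sketch below assumes a hypothetical 20-row data frame and uses the random_state parameter of sample() to make the draw repeatable.

```python
import pandas as pd

# Hypothetical 20-row data frame, invented for illustration.
df = pd.DataFrame({"id": range(20)})

five = df.sample(n=5, random_state=0)        # 5 rows, repeatable because of the seed
tenth = df.sample(frac=0.1, random_state=0)  # 10% of 20 rows, i.e. 2 rows
again = df.sample(n=5, random_state=0)       # same seed, so the same rows as `five`

print(five)
print(len(tenth))
```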

Selecting rows and columns

income["State"]
income.State
income[["Index","State","Y2008"]]
income.loc[:,["Index","State","Y2008"]]
income.loc[0:2,["Index","State","Y2008"]]  #Selecting rows with Index label 0 to 2 & columns
income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
#In the above command both Index and Y2008 are included.
income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included

loc selects rows or columns by their labels in the index, whereas iloc selects them by their position in the index, so it accepts only integers.

>>> import numpy as np
>>> x = pd.DataFrame({"var1" : np.arange(1,20,2)}, index=[9,8,7,6,10, 1, 2, 3, 4, 5])
>>> x
    var1
9      1
8      3
7      5
6      7
10     9
1     11
2     13
3     15
4     17
5     19
>>> x.iloc[:3]
   var1
9     1
8     3
7     5
>>> x.loc[:3]
    var1
9      1
8      3
7      5
6      7
10     9
1     11
2     13
3     15
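
One more difference is worth seeing in isolation: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below uses a hypothetical data frame with string labels so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical data with string labels, invented for illustration.
x = pd.DataFrame({"var1": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = x.loc["a":"c"]   # label slice: the end label "c" is included
by_position = x.iloc[0:2]   # position slice: position 2 is excluded

print(by_label)
print(by_position)
```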

Renaming variables

>>> data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
>>> data
       A          B
0   John      Libra
1   Mary  Capricorn
2  Julia      Aries
3  Kenny    Scorpio
4  Henry   Aquarius
>>> data.columns = ['Names','Zodiac Signs']
>>> data
   Names Zodiac Signs
0   John        Libra
1   Mary    Capricorn
2  Julia        Aries
3  Kenny      Scorpio
4  Henry     Aquarius
>>> data.rename(columns = {"Names":"Cust_Name"},inplace = True)
>>> data
  Cust_Name Zodiac Signs
0      John        Libra
1      Mary    Capricorn
2     Julia        Aries
3     Kenny      Scorpio
4     Henry     Aquarius
>>> income.columns = income.columns.str.replace('Y' , 'Year ')
>>> income.columns
Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
       'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
       'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
      dtype='object')
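
As an alternative to inplace=True, rename() can return a new data frame and can also take a function that is applied to every column name. This is a minimal sketch on a two-row copy of the data above; the names renamed, snake, and years are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    "Names": ["John", "Mary"],
    "Zodiac Signs": ["Libra", "Capricorn"],
})

# Without inplace=True, rename returns a new frame and leaves `data` untouched.
renamed = data.rename(columns={"Names": "Cust_Name"})

# A function is applied to every column name.
snake = data.rename(columns=lambda name: name.lower().replace(" ", "_"))

# regex=False treats the pattern as a literal string rather than a regular expression.
years = pd.Index(["Y2002", "Y2003"]).str.replace("Y", "Year ", regex=False)

print(list(renamed.columns))
print(list(snake.columns))
print(list(years))
```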