pandas使用方法及相关函数记录1
2019-01-23 本文已影响6人
筝韵徽
import pandas as pd
import numpy as np
pandas使用方法记录总结
基本操作记录
- 查看pandas版本
pd.__version__
'0.23.4'
- 查看pandas及相关库版本
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.8.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
- 创建DataFrame及查看DataFrame数据及基本结构信息
df = pd.DataFrame(data,index=labels)
df

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal 10 non-null object
age 8 non-null float64
visits 10 non-null int64
priority 10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
df.describe()

- 返回前三行数据
df.iloc[:3]

df.head(3)

- 选择包含animal和age列的所有数据
df.loc[:,['animal','age']]

df[['animal','age']]

- 选择3,4,6行及animal,age列
df.loc[df.index[[3,4,6]],['animal','age']]

- 找出visits > 3的数据
df.loc[df['visits']>=3]

- 找出为age== Nan的数据
df.loc[df['age'].isnull()]

- 找出animal为cat并且age《3
df.loc[(df['animal']=='cat')&(df['age']<3)]

- 找出age在2,4之间的数据
df.loc[df['age'].between(2,4)]

- 修改f行age的值为1.5
df.loc['f','age']=1.5
df

- 计算一下visits总量
df['visits'].sum()
19
- 计算一下animal平均年龄
df.groupby('animal').agg({'age':'mean'})

df.groupby('animal')['age'].mean()
animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64
- 添加一行
df.loc['k']=[2.3,'dog','no',3]
df

- 删除一行
df=df.drop('k')
df

- 统计个animal的数量
df.animal.value_counts()
cat 4
dog 4
snake 2
Name: animal, dtype: int64
- age降序,visits升序排序
df.sort_values(by=['age','visits'],ascending=[False,True])

- priorty 列值改成True|False
df['priority']=df['priority'].map({'yes':True,'no':False})
df

- animal中将snake改成python
df1=df.copy()
df1['animal']=df1['animal'].replace('snake','python')
df1

df

df2=df.copy()
df2.loc[df2['animal']=='snake','animal']='python'
df2

df

- 数据类型转换,将age,visits转换成数值类型
df['age'] =pd.to_numeric(df['age'],errors='coerce')
df['visits']=pd.to_numeric(df['visits'],errors='coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal 10 non-null object
age 8 non-null float64
visits 10 non-null int64
priority 10 non-null bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 650.0+ bytes
df2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal 10 non-null object
age 8 non-null object
visits 10 non-null object
priority 10 non-null bool
dtypes: bool(1), object(3)
memory usage: 650.0+ bytes
df2['age']=df2[['age']].astype('float')
df2['visits']=df2['visits'].astype('int')
df2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal 10 non-null object
age 8 non-null float64
visits 10 non-null int32
priority 10 non-null bool
dtypes: bool(1), float64(1), int32(1), object(1)
memory usage: 610.0+ bytes
- 选择所有数值列
df.select_dtypes('number')

22.数据透视 查询每种动物每个visits的平均年龄
df.pivot_table(index='animal',columns='visits',values='age',aggfunc='mean')

中级操作记录
- shift函数的使用,该函数是对数据进行移动的操作
df3=pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],'B':[10, 20, 20, 30, 40, 50, 50, 50, 60, 70, 70]})
- 默认向下移动1,内有数据补位的用Nan
df3['A'].shift()
0 NaN
1 1.0
2 2.0
3 2.0
4 3.0
5 4.0
6 5.0
7 5.0
8 5.0
9 6.0
10 7.0
Name: A, dtype: float64
df3['A'].shift(1)
0 NaN
1 1.0
2 2.0
3 2.0
4 3.0
5 4.0
6 5.0
7 5.0
8 5.0
9 6.0
10 7.0
Name: A, dtype: float64
df3['A'].shift(2)
0 NaN
1 NaN
2 1.0
3 2.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 6.0
Name: A, dtype: float64
- 向上移动
df3['A'].shift(-1)
0 2.0
1 2.0
2 3.0
3 4.0
4 5.0
5 5.0
6 5.0
7 6.0
8 7.0
9 7.0
10 NaN
Name: A, dtype: float64
- 左右移动
df3.shift(-1,axis=1)

df3.shift(1,axis=1)

DataFrame.shift(periods=1, freq=None, axis=0)
参数
- periods:类型为int,表示移动的幅度,可以是正数,也可以是负数,默认值是1,1就表示移动一次,注意这里移动的都是数据,而索引是不移动的,移动之后没有对应值的,就赋值为NaN。
- freq: DateOffset, timedelta, or time rule string,可选参数,默认值为None,只适用于时间序列,如果这个参数存在,那么会按照参数值移动时间索引,而数据值没有发生变化。例如现在有df1如下:
df4=pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],'B':[10, 20, 20, 30, 40, 50, 50, 50, 60, 70, 70]},index=pd.date_range('20190101','20190111'))
df4

import datetime
df4.shift(periods=1,freq=datetime.timedelta(2))

df4.shift(periods=-1,freq=datetime.timedelta(2))

- 示例,找出df3,A中不重复的数据
df3.loc[df3['A'].shift()!=df3['A']]

df3.drop_duplicates(subset='A')

df3
