pandas使用方法及相关函数记录1

2019-01-23 本文已影响6人筝韵徽

import pandas as pd
import numpy as np

pandas使用方法记录总结

基本操作记录

查看pandas版本

pd.__version__

'0.23.4'

查看pandas及相关库版本

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

创建DataFrame及查看DataFrame数据及基本结构信息

df = pd.DataFrame(data,index=labels)
df

image.png

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null float64
visits      10 non-null int64
priority    10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

df.describe()

image.png

返回前三行数据

df.iloc[:3]

image.png

df.head(3)

image.png

选择包含animal和age列的所有数据

df.loc[:,['animal','age']]

image.png

df[['animal','age']]

image.png

选择3,4,6行及animal，age列

df.loc[df.index[[3,4,6]],['animal','age']]

image.png

找出visits > 3的数据

df.loc[df['visits']>=3]

image.png

找出为age== Nan的数据

df.loc[df['age'].isnull()]

image.png

找出animal为cat并且age《3

df.loc[(df['animal']=='cat')&(df['age']<3)]

image.png

找出age在2，4之间的数据

df.loc[df['age'].between(2,4)]

image.png

修改f行age的值为1.5

df.loc['f','age']=1.5

df

image.png

计算一下visits总量

df['visits'].sum()

计算一下animal平均年龄

df.groupby('animal').agg({'age':'mean'})

image.png

df.groupby('animal')['age'].mean()

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

添加一行

df.loc['k']=[2.3,'dog','no',3]

df

image.png

删除一行

df=df.drop('k')

df

image.png

统计个animal的数量

df.animal.value_counts()

cat      4
dog      4
snake    2
Name: animal, dtype: int64

age降序，visits升序排序

df.sort_values(by=['age','visits'],ascending=[False,True])

image.png

priorty 列值改成True|False

df['priority']=df['priority'].map({'yes':True,'no':False})
df

image.png

animal中将snake改成python

df1=df.copy()

df1['animal']=df1['animal'].replace('snake','python')

df1

image.png

df

image.png

df2=df.copy()

df2.loc[df2['animal']=='snake','animal']='python'

df2

image.png

df

image.png

数据类型转换，将age，visits转换成数值类型

df['age'] =pd.to_numeric(df['age'],errors='coerce')

df['visits']=pd.to_numeric(df['visits'],errors='coerce')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null float64
visits      10 non-null int64
priority    10 non-null bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 650.0+ bytes

df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null object
visits      10 non-null object
priority    10 non-null bool
dtypes: bool(1), object(3)
memory usage: 650.0+ bytes

df2['age']=df2[['age']].astype('float')

df2['visits']=df2['visits'].astype('int')

df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null float64
visits      10 non-null int32
priority    10 non-null bool
dtypes: bool(1), float64(1), int32(1), object(1)
memory usage: 610.0+ bytes

选择所有数值列

df.select_dtypes('number')

image.png

22.数据透视查询每种动物每个visits的平均年龄

df.pivot_table(index='animal',columns='visits',values='age',aggfunc='mean')

image.png

中级操作记录

shift函数的使用，该函数是对数据进行移动的操作

df3=pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],'B':[10, 20, 20, 30, 40, 50, 50, 50, 60, 70, 70]})

默认向下移动1，内有数据补位的用Nan

df3['A'].shift()

0     NaN
1     1.0
2     2.0
3     2.0
4     3.0
5     4.0
6     5.0
7     5.0
8     5.0
9     6.0
10    7.0
Name: A, dtype: float64

df3['A'].shift(1)

0     NaN
1     1.0
2     2.0
3     2.0
4     3.0
5     4.0
6     5.0
7     5.0
8     5.0
9     6.0
10    7.0
Name: A, dtype: float64

df3['A'].shift(2)

0     NaN
1     NaN
2     1.0
3     2.0
4     2.0
5     3.0
6     4.0
7     5.0
8     5.0
9     5.0
10    6.0
Name: A, dtype: float64

向上移动

df3['A'].shift(-1)

0     2.0
1     2.0
2     3.0
3     4.0
4     5.0
5     5.0
6     5.0
7     6.0
8     7.0
9     7.0
10    NaN
Name: A, dtype: float64

左右移动

df3.shift(-1,axis=1)

image.png

df3.shift(1,axis=1)

image.png

DataFrame.shift(periods=1, freq=None, axis=0)
参数

periods：类型为int，表示移动的幅度，可以是正数，也可以是负数，默认值是1,1就表示移动一次，注意这里移动的都是数据，而索引是不移动的，移动之后没有对应值的，就赋值为NaN。
freq： DateOffset, timedelta, or time rule string，可选参数，默认值为None，只适用于时间序列，如果这个参数存在，那么会按照参数值移动时间索引，而数据值没有发生变化。例如现在有df1如下：

df4=pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],'B':[10, 20, 20, 30, 40, 50, 50, 50, 60, 70, 70]},index=pd.date_range('20190101','20190111'))
df4

image.png

import datetime
df4.shift(periods=1,freq=datetime.timedelta(2))

image.png

df4.shift(periods=-1,freq=datetime.timedelta(2))

image.png

示例，找出df3,A中不重复的数据

df3.loc[df3['A'].shift()!=df3['A']]

image.png

df3.drop_duplicates(subset='A')

image.png

df3

image.png

pandas使用方法及相关函数记录1

pandas使用方法记录总结

基本操作记录

中级操作记录

猜你喜欢

热点阅读