pandas二刷(1)
2023-02-11 本文已影响0人
山猪打不过家猪
1.读取csv文件
1.1读取csv
1.1.1全表读取
df = pd.read_csv('1.csv')
1.1.2根据列的index读取
cols_index= [1] #索引是从0开始,所以这里是第二列的值
df = pd.read_csv('1.csv',usecols=cols_index)
df.head()
![](https://img.haomeiwen.com/i14814834/92b6e9ee15d345ad.png)
1.1.3根据列名读取
cols_name= ['Id','dateTime','name']
df = pd.read_csv('1.csv',usecols=cols_name)
df.head()
1.1.4无表头读取
df = pd.read_csv('1.csv',header = None)
![](https://img.haomeiwen.com/i14814834/fe81caae25f7f899.png)
- 这种方式对于没有表头的数据适应
1.1.5有表头读取
- 如果表头有空行可以使用Index来跳过空行
df = pd.read_csv('1.csv',header=1)
df.head()
1.1.6跳行读取
df = pd.read_csv('1.csv',skiprows=1)
df.head()
1.1.7查看列名
df = pd.read_csv('directory.csv')
df.columns
![](https://img.haomeiwen.com/i14814834/0d8650d3c8eb768a.png)
1.2获取表中数据类型
1.2.1 获取所有列的信息
import pandas as pd
df = pd.read_csv('1.csv',header=None)
df.info()
![](https://img.haomeiwen.com/i14814834/d247d11ec1ad449a.png)
1.2.2获取列数和行数
df = pd.read_csv('directory.csv')
df.shape
![](https://img.haomeiwen.com/i14814834/bd9459847b2e8f5f.png)
1.2.3只查看所有数据类型
df = pd.read_csv('directory.csv')
df.dtypes
![](https://img.haomeiwen.com/i14814834/1c720d0dfe4fd8b4.png)
1.3常用数据查询
1.3.1将指定的列设为索引
-
把日期作为索引
image.png
df = pd.read_csv('weather.csv')
df.set_index('ymd',inplace=True)
df.index#查看索引
1.3.2 获取指定的列
![](https://img.haomeiwen.com/i14814834/a9c664785a82124f.png)
根据列名获取tainqi
这列的所有值
df = pd.read_csv('weather.csv')
df.set_index('ymd',inplace=True)
df['tianqi']
根据索引获取tianqi
所在的第二列所有行数据
df = pd.read_csv('weather.csv')
df.set_index('ymd',inplace=True)
df.iloc[:,2]
1.3.3 根据区间取值
根据列名取第一行的所有数据
df.loc['2017-01-01']
根据行索引区取第一行所有数据
#0是第一行的索引,:表示所有列
df.iloc[0,:]
![](https://img.haomeiwen.com/i14814834/afd363d68fdd79e9.png)
根据列名取指定多列的值
df.loc[['2017-01-01','2017-01-04','2017-10-01'],['bWendu','yWendu']]
根据索引取区间的值df.iloc[行,列]
#取行索引0和3,列索引0和1的数据
df.iloc[[0,3],[0,1]]
![](https://img.haomeiwen.com/i14814834/237e49c13623b800.png)
使用区间查询
- 查询20180101到20180110期间,bwendu到fengxiang列的数据
df.set_index('ymd',inplace=True)
df.loc['2018-01-01':'2018-01-10','bWendu':'fengxiang']
条件查询
df = pd.read_csv('weather.csv')
df.set_index('ymd',inplace=True)
df.loc[:,'bWendu'] = df['bWendu'].str.replace('℃','').astype(int)
df.loc[:,'yWendu'] = df['yWendu'].str.replace('℃','').astype(int)
df.loc[(df['bWendu']<=28)&(df["yWendu"]>=15)]
1.4 Pandas修改数据列
1.4.1 直接修改
将字符串列改为整数
df = pd.read_csv('weather.csv')
df.loc[:,'bWendu'] = df['bWendu'].str.replace('℃','').astype(int)
增加新的列
![](https://img.haomeiwen.com/i14814834/0a69d72543aa8445.png)
- 添加一个新的温差列
wd_type
,由高温减低温获得
df.loc[:,'wencha'] = df['yWendu'] -df['bWendu']
1.4.2 df.apply方法
添加一个新的列
(类似于sql的case when)
大于33高温,低于-10低温,其他常温
![](https://img.haomeiwen.com/i14814834/3eb829541201c2aa.png)
def get_wd_type(x):
if x['bWendu']>33:
return '高温'
elif x['yWendu']<-10:
return '低温'
else:
return '常温'
##axis=1表示为列
df['wd_type'] = df.apply(get_wd_type,axis=1)
df.head()
统计某一列的计数
类似sql的count+ group by
##查看上面分组数据的统计结果
df['wd_type'].value_counts()
![](https://img.haomeiwen.com/i14814834/6239b4bca3a3713b.png)
新增一列按照自定义函数
- 增加一列华氏度
df['yWendu_huashi'] = df.apply(lambda x : x["yWendu"] * 9 / 5 + 32,axis=1)
![](https://img.haomeiwen.com/i14814834/5d2ef887bf1c957e.png)
新增一个空列并赋值
df['wencha_type']= '' #创建一个新列
df.loc[df['bWendu']-df['yWendu']>10,'wencha_type'] = '大温差'
df.loc[df['bWendu']-df['yWendu']<=10,'wencha_type'] = '低温差'
![](https://img.haomeiwen.com/i14814834/e7f1f9145940a9e2.png)
1.5 统计函数
1.5.1 统计所有数据
df = pd.read_csv('weather.csv')
df.set_index('ymd',inplace=True)
df.loc[:,'bWendu'] = df['bWendu'].str.replace('℃','').astype(int)
df.loc[:,'yWendu'] = df['yWendu'].str.replace('℃','').astype(int)
df.describe()
![](https://img.haomeiwen.com/i14814834/67d0e09e29b7a0c6.png)
1.5.2 聚合函数
mean()平均值
df['bWendu'].mean()
max()最大
df['bWendu'].max()
最低min()
df['bWendu'].min()
1.5.3 去重和按值计数
唯一性去重
ls = df['fengxiang'].unique()
![](https://img.haomeiwen.com/i14814834/1be7a4022ae49243.png)
按值计数
df['fengxiang'].value_counts()
![](https://img.haomeiwen.com/i14814834/ab8742c033dbf451.png)
1.6 缺失值处理
![](https://img.haomeiwen.com/i14814834/e78f61decff71e5b.png)
1.6.1 跳过前面的空行
使用header跳过表前面的空行
df = pd.read_excel('student.xlsx',header=2)
使用skiprows跳过所有行
df = pd.read_excel('student.xlsx',skiprows=2)
1.6.4 查看特定行的数据情况
df.loc[df['分数'].notnull(),:]
![](https://img.haomeiwen.com/i14814834/06e43fc4a13b1d01.png)
1.6.3 删除掉全是空值的列和行
df = pd.read_excel('student.xlsx',header=2)
df.dropna(axis=1,how='all',inplace=True)
df.dropna(axis=0,how='all',inplace=True)
![](https://img.haomeiwen.com/i14814834/831ecba3b4189615.png)
1.6.4 填充NAN的值
- 将分数没有的填为0分
df.loc[:,'分数'] = df['分数'].fillna(0)
1.6.5 跨行填充
使用前面的有效值填充ffill
df = pd.read_excel('student.xlsx',header=2)
df.dropna(axis=0,how='all',inplace=True)
df.dropna(axis=1,how='all',inplace=True)
df.loc[:,'姓名'] = df['姓名'].fillna(method='ffill')
![](https://img.haomeiwen.com/i14814834/d1b216b8792de78a.png)
1.6.6 将清洗好的数据保存
df.to_excel('new_student.xlsx',index=False)
1.7 数据排序
1.7.1 Seires的排序
正序
按照aqi
从小到大排列
import pandas as pd
file_path = "./DEdata/weather.csv"
df = pd.read_csv(file_path)
# df.loc[:,"bWendu"] = df["bWendu"].str.replace("℃","").astype(int)
df["bWendu"] = df["bWendu"].str.replace("℃","").astype(int)
df["yWendu"] = df["yWendu"].str.replace("℃","").astype(int)
df["aqi"].sort_values(ascending=True)
倒序
import pandas as pd
file_path = "./DEdata/weather.csv"
df = pd.read_csv(file_path)
# df.loc[:,"bWendu"] = df["bWendu"].str.replace("℃","").astype(int)
df["bWendu"] = df["bWendu"].str.replace("℃","").astype(int)
df["yWendu"] = df["yWendu"].str.replace("℃","").astype(int)
df["aqi"].sort_values(ascending=False)
1.7.2 DataFrame的排序
正序
df.sort_values(by="aqi",ascending=True)
倒序
df.sort_values(by="aqi",ascending=False)
多列排序(正序)
先按照aqiLevel
从低到高,然后在找个排序里分组后,在按照bWendu
从低到高
df.sort_values(by=["aqiLevel","bWendu"])
多列排序(分别指定循序)
先按照aqiLevel
从低到高,然后在这个排序里分组后,根据bWendu
从高到低
df.sort_values(by=["aqiLevel","bWendu"],ascending=[True,False])
![](https://img.haomeiwen.com/i14814834/5f5d321b410d0ef1.png)