Pandas - 5. Handling Missing Values

2022-05-22  陈天睡懒觉

Detecting missing values

import pandas as pd
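from numpy import nan
# Older NumPy releases also exported NaN and NAN as aliases of nan;
# NumPy 2.0 removed the NaN alias, so the aliases are defined locally here
NaN = NAN = nan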
import numpy as np
print(pd.isnull(NaN))
print(pd.isnull(NAN))
print(pd.isnull(nan))
print(pd.isnull(True))
True
True
True
False
print(pd.notnull(NaN))
print(pd.notnull(NAN))
print(pd.notnull(nan))
print(pd.notnull(True))
False
False
False
True
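
The reason a dedicated check is needed is that NaN does not compare equal to itself, so an ordinary == test cannot find it. The following is a minimal sketch of that behavior; the variable names are made up for illustration, and it also shows that pd.isnull accepts None, pd.NaT, and whole Series.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
nan_equals_itself = np.nan == np.nan

# pd.isnull also recognizes None and the datetime marker pd.NaT
none_is_missing = pd.isnull(None)
nat_is_missing = pd.isnull(pd.NaT)

# Applied to a Series, it returns one boolean per element
mask = pd.isnull(pd.Series([1.0, np.nan, None]))

print(nan_equals_itself, none_is_missing, nat_is_missing)
print(mask.tolist())
```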

Missing values produced when reading a file

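Three parameters of pd.read_csv() relate to missing values: na_values (extra strings to treat as missing), keep_default_na (whether the built-in list of missing-value markers is also used), and na_filter (whether missing values are detected at all).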

print(pd.read_csv('data/survey_visited.csv'))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# Load the data without the default missing-value markers
print(pd.read_csv('data/survey_visited.csv',
                 keep_default_na=False))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3            
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# Manually specify which values count as missing
print(pd.read_csv('data/survey_visited.csv',
                  na_values=[''],
                 keep_default_na=False))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
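
To try these options without the data file, a small in-memory CSV can stand in for it. The sketch below is only an illustration: the CSV text and variable names are invented, and it additionally passes na_filter=False, which switches missing-value detection off entirely.

```python
import io
import pandas as pd

# Hypothetical CSV text: one empty field and one literal "NA"
csv_text = (
    "ident,site,dated\n"
    "619,DR-1,1927-02-08\n"
    "752,DR-3,\n"
    "837,NA,1932-01-14\n"
)

# Default behavior: both the empty field and "NA" are read as NaN
default = pd.read_csv(io.StringIO(csv_text))

# Only the empty string counts as missing; "NA" stays a plain string
only_empty = pd.read_csv(io.StringIO(csv_text),
                         na_values=[''],
                         keep_default_na=False)

# Detection disabled: nothing is converted to NaN
no_filter = pd.read_csv(io.StringIO(csv_text), na_filter=False)

for name, frame in [('default', default),
                    ('only_empty', only_empty),
                    ('no_filter', no_filter)]:
    print(name, int(frame.isnull().sum().sum()))
```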

Counting missing values

ebola = pd.read_csv('data/country_timeseries.csv')
# Count the non-missing values in each column
print(ebola.count())
Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64
# Show the number of missing values in each column
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count() # broadcasting: the scalar is combined with every element of the Series
print(num_missing)
Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64
# Count the total number of missing values
print(np.count_nonzero(ebola.isnull()))
# Count the missing values in a single column
print(np.count_nonzero(ebola['Cases_Liberia'].isnull()))
1214
39
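
The same counts can also be obtained by summing the boolean mask from isnull(), since True counts as 1. This is a small sketch on a made-up DataFrame (a hypothetical stand-in for the ebola data, not the real file) that checks the two approaches agree.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ebola data
df = pd.DataFrame({
    'Day': [0, 1, 2, 3],
    'Cases_A': [10.0, np.nan, 12.0, np.nan],
    'Cases_B': [np.nan, np.nan, np.nan, 5.0],
})

# Missing values per column: sum the boolean mask column by column
per_column = df.isnull().sum()

# Equivalent to the "rows minus non-missing count" approach
by_count = df.shape[0] - df.count()

# Total number of missing values in the whole DataFrame
total = int(per_column.sum())

print(per_column)
print(total)
```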

Ways to handle missing values

Replacing: fillna()

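Note: fillna() has an inplace parameter that modifies the original data directly instead of returning a new object. This may save memory on large datasets, although pandas can still make an internal copy, so a speedup is not guaranteed.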
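
As an illustration of replacement, here is a minimal sketch on an invented DataFrame (the data and variable names are assumptions, not from the article). It fills with a constant, with the previous valid value via ffill(), and with each column's mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, np.nan],
    'y': [np.nan, 2.0, np.nan, 4.0],
})

# Replace every missing value with a constant
filled_zero = df.fillna(0)

# Forward fill: carry the last valid value downward
# (a leading NaN has nothing before it and stays missing)
filled_fwd = df.ffill()

# Replace missing values with the mean of their own column
filled_mean = df.fillna(df.mean())

print(filled_zero)
print(filled_fwd)
print(filled_mean)
```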

Interpolating: interpolate()
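
A minimal sketch of interpolation, using a made-up Series: with its default linear method, interpolate() fills a gap with evenly spaced values between the neighboring valid points.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-element gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Default method is linear interpolation between valid neighbors
interpolated = s.interpolate()

print(interpolated.tolist())
```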

Dropping: dropna()

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

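The subset parameter works crosswise to axis. When dropping rows (axis=0), subset names the columns to check for missing values; when dropping columns (axis=1), subset names the rows (index labels) to check.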

# Build a sample dataset
import numpy as np
import pandas as pd
 
m = np.ones((11, 10))
for i in range(len(m)):
    m[i,:i] = np.nan
df = pd.DataFrame(data=m,
                 columns=['A','B','C','D','E','F','G','H','I','J'],
                index=['a','b','c','d','e','f','g','h','i','j','k'])
print(df)
     A    B    C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0
i  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0
j  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0
k  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# Drop the columns that have a missing value in any of the rows 'a', 'b', 'c'
print(df.dropna(axis=1, subset=['a','b','c']))
     C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0
i  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0
j  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0
k  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# Drop the rows that have a missing value in any of the columns 'H', 'I', 'J'
print(df.dropna(axis=0, subset=['H','I','J']))
     A    B    C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0
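
The signature above also lists how and thresh, which the examples so far do not exercise. The following sketch, on a small invented DataFrame, shows one plausible way to compare them: how='any' drops a row with any missing value, how='all' drops only fully missing rows, and thresh keeps rows with at least that many non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0..3 have 3, 2, 1 and 0 non-missing values
df = pd.DataFrame({
    'A': [1.0, np.nan, np.nan, np.nan],
    'B': [1.0, 2.0, np.nan, np.nan],
    'C': [1.0, 2.0, 3.0, np.nan],
})

# Drop rows containing at least one missing value
drop_any = df.dropna(how='any')

# Drop only rows in which every value is missing
drop_all = df.dropna(how='all')

# Keep rows that have at least 2 non-missing values
keep_two = df.dropna(thresh=2)

print(drop_any.index.tolist())
print(drop_all.index.tolist())
print(keep_two.index.tolist())
```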

Calculations with missing values

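If a calculation involves a missing value, the result is usually a missing value as well. Methods such as sum and mean skip missing values by default, and the skipna parameter controls whether they are skipped.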
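
A short sketch of this behavior, using a made-up Series: plain arithmetic propagates NaN, while sum and mean skip it unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([1.0, 2.0, np.nan])

# Element-wise arithmetic propagates the missing value
plus_one = s + 1

# Reductions skip NaN by default
total_default = s.sum()
mean_default = s.mean()

# With skipna=False the missing value makes the result NaN
total_strict = s.sum(skipna=False)

print(plus_one.tolist())
print(total_default, mean_default, total_strict)
```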

Viewing the rows that contain missing values

visited = pd.read_csv('data/survey_visited.csv',
                  na_values=[''],
                 keep_default_na=False)
print(visited)
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# isnull() checks every element of the DataFrame, returning True for a missing value and False otherwise; the result is itself a DataFrame.
miss = visited.isnull()
print(miss)
   ident   site  dated
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False
5  False  False   True
6  False  False  False
7  False  False  False
# With any(axis=1), a row yields True if it contains at least one missing value; use all(axis=1) instead to find rows in which every value is missing.
print(miss.any(axis=1))
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
dtype: bool
print(visited[miss.any(axis=1)])
   ident  site dated
5    752  DR-3   NaN

In one step

print(visited[visited.isnull().any(axis=1)])
   ident  site dated
5    752  DR-3   NaN