数据分析工具pandas快速入门教程5-处理缺失数据
2018-08-24 本文已影响3人
python测试开发
第5章 缺失数据
介绍
很少没有任何缺失值的数据集。 有许多缺失数据的表示。 在数据库中是NULL值,一些编程语言使用NA。缺失值可以是空字符串:''或者甚至是数值88或99等。Pandas显示缺失值为NaN。
本章将涵盖:
- 什么是缺失值
- 如何创建缺失值
- 如何重新编码并使用缺失值进行计算
什么是缺失值
可以从numpy中获得NaN值,在Python中看到缺失值使用几种方式显示:NaN,NAN或nan,他们都是相等的。
NaN不等于0或空字符串''。
In [1]: from numpy import NaN, NAN, nan
In [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')
False|False|False|False
In [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')
False|False|False|False
In [4]: import pandas as pd
In [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')
True|True|True
In [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')
False|True|True
缺失值的来源
来自加载数据或数据处理
- 加载数据
当我们加载数据时,pandas会自动找到该缺少数据的单元格,并填充NaN值。在read_csv函数中,参数na_values, keep_default_na, na_filter用于处理缺失值。比如:na_values=[99]。na_filter设置为False,在读大文件时会提升性能。
5-1.py
import pandas as pd
visited_file = 'data/survey_visited.csv'
print(pd.read_csv(visited_file))
print(pd.read_csv(visited_file, keep_default_na=False))
print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))
执行结果
$ python3 5-1.py
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
- 合并数据
import pandas as pd
visited = pd.read_csv('data/survey_visited.csv')
survey = pd.read_csv('data/survey_survey.csv')
print(visited)
print(survey)
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(vs)
执行结果
$ python3 5-2.py
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
5 734 lake sal 0.05
6 734 pb temp -21.50
7 735 pb rad 7.22
8 735 NaN sal 0.06
9 735 NaN temp -26.00
10 751 pb rad 4.35
11 751 pb temp -18.50
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
ident site dated taken person quant reading
0 619 DR-1 1927-02-08 619 dyer rad 9.82
1 619 DR-1 1927-02-08 619 dyer sal 0.13
2 622 DR-1 1927-02-10 622 dyer rad 7.80
3 622 DR-1 1927-02-10 622 dyer sal 0.09
4 734 DR-3 1939-01-07 734 pb rad 8.41
5 734 DR-3 1939-01-07 734 lake sal 0.05
6 734 DR-3 1939-01-07 734 pb temp -21.50
7 735 DR-3 1930-01-12 735 pb rad 7.22
8 735 DR-3 1930-01-12 735 NaN sal 0.06
9 735 DR-3 1930-01-12 735 NaN temp -26.00
10 751 DR-3 1930-02-26 751 pb rad 4.35
11 751 DR-3 1930-02-26 751 pb temp -18.50
12 751 DR-3 1930-02-26 751 lake sal 0.10
13 752 DR-3 NaN 752 lake rad 2.19
14 752 DR-3 NaN 752 lake sal 0.09
15 752 DR-3 NaN 752 lake temp -16.00
16 752 DR-3 NaN 752 roe sal 41.60
17 837 MSK-4 1932-01-14 837 lake rad 1.46
18 837 MSK-4 1932-01-14 837 lake sal 0.21
19 837 MSK-4 1932-01-14 837 roe sal 22.50
20 844 DR-1 1932-03-22 844 roe rad 11.25
- 用户输入
import pandas as pd
from numpy import NaN, NAN, nan
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)
scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
'Occupation': ['Chemist', 'Statistician'],
'Born': ['1920-07-25', '1876-06-13'],
'Died': ['1958-04-16', '1937-10-16'],
'missing': [NaN, nan]})
print(scientists)
scientists['missing'] = nan
print(scientists)
执行结果
$ python3 5-3.py
amoeba NaN
goat 4.0
dtype: float64
Born Died Name Occupation missing
0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN
1 1876-06-13 1937-10-16 William Gosset Statistician NaN
Born Died Name Occupation missing
0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN
1 1876-06-13 1937-10-16 William Gosset Statistician NaN
- 重新索引
5-4.py
import pandas as pd
from numpy import NaN, NAN, nan
gapminder = pd.read_csv('data/gapminder.tsv', sep='\t')
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
print(life_exp.reindex(range(2000, 2010)))
执行结果
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
year
2000 NaN
2001 NaN
2002 65.694923
2003 NaN
2004 NaN
2005 NaN
2006 NaN
2007 67.007423
2008 NaN
2009 NaN
Name: lifeExp, dtype: float64
处理缺失数据
- 统计缺失数据
5-5.py
import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.head())
print(ebola.count())
num_rows = ebola.shape[0]
print("num_rows")
print(num_rows)
num_missing = num_rows - ebola.count()
print("num_missing:")
print(num_missing)
print(np.count_nonzero(ebola.isnull()))
print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
print(ebola.Cases_Guinea.value_counts(dropna=False).head())
执行结果
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 9633.0
Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \
0 1786.0 NaN 2977.0 NaN
1 1781.0 NaN 2943.0 NaN
2 1767.0 3496.0 2915.0 NaN
3 NaN 3496.0 NaN NaN
4 1739.0 3471.0 2827.0 NaN
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Date 122
Day 122
Cases_Guinea 93
Cases_Liberia 83
Cases_SierraLeone 87
Cases_Nigeria 38
Cases_Senegal 25
Cases_UnitedStates 18
Cases_Spain 16
Cases_Mali 12
Deaths_Guinea 92
Deaths_Liberia 81
Deaths_SierraLeone 87
Deaths_Nigeria 38
Deaths_Senegal 22
Deaths_UnitedStates 18
Deaths_Spain 16
Deaths_Mali 12
dtype: int64
num_rows
122
num_missing:
Date 0
Day 0
Cases_Guinea 29
Cases_Liberia 39
Cases_SierraLeone 35
Cases_Nigeria 84
Cases_Senegal 97
Cases_UnitedStates 104
Cases_Spain 106
Cases_Mali 110
Deaths_Guinea 30
Deaths_Liberia 41
Deaths_SierraLeone 35
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
dtype: int64
1214
29
NaN 29
86.0 3
495.0 2
112.0 2
390.0 2
Name: Cases_Guinea, dtype: int64
- 处理缺失数据
5-6.py
import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.iloc[0:10, 0:5])
print(ebola.fillna(0).iloc[0:10, 0:5])
# 前向填充
print(ebola.fillna(method='ffill').iloc[0:10, 0:5])
# 后向填充
print(ebola.fillna(method='bfill').iloc[0:10, 0:5])
print(ebola.interpolate().iloc[0:10, 0:5])
print(ebola.shape)
ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)
print(ebola_dropna)
ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \
ebola['Cases_SierraLeone']
ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_multiple']]
print(ebola_subset.head(n=10))
print(ebola.Cases_Guinea.sum(skipna = True))
print(ebola.Cases_Guinea.sum(skipna = False))
执行结果
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 NaN 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 NaN 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 0.0 10030.0
1 1/4/2015 288 2775.0 0.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 0.0 8157.0 0.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 0.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 0.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2769.0 8157.0 9722.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 8018.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7977.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 8166.0 10030.0
1 1/4/2015 288 2775.0 8166.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2730.0 8157.0 9633.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 7977.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7862.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2749.5 8157.0 9677.5
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 7997.5 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7919.5 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
(122, 18)
(1, 18)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \
19 11/18/2014 241 2047.0 7082.0 6190.0
Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \
19 20.0 1.0 4.0 1.0 6.0
Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \
19 1214.0 2963.0 1267.0 8.0
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali
19 0.0 1.0 0.0 6.0
Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_multiple
0 2776.0 NaN 10030.0 NaN
1 2775.0 NaN 9780.0 NaN
2 2769.0 8166.0 9722.0 20657.0
3 NaN 8157.0 NaN NaN
4 2730.0 8115.0 9633.0 20478.0
5 2706.0 8018.0 9446.0 20170.0
6 2695.0 NaN 9409.0 NaN
7 2630.0 7977.0 9203.0 19810.0
8 2597.0 NaN 9004.0 NaN
9 2571.0 7862.0 8939.0 19372.0
84729.0
nan
参考资料
- 技术支持qq群144081101 591302926 567351477 钉钉免费群21745728
- 本文最新版本地址
- 本文涉及的python测试开发库 谢谢点赞!
- 本文相关海量书籍下载
- 源码下载
- 本文英文版书籍下载