Python Data Processing (Part 5)
2021-02-06
名本无名
3. DataFrame (continued)
Indexing and selection
The basic indexing syntax is as follows:
Operation | Syntax | Result |
---|---|---|
Select a column | df[col] | Series |
Select a row by label | df.loc[label] | Series |
Select a row by integer position | df.iloc[loc] | Series |
Select rows with a boolean vector | df[bool_vec] | DataFrame |
Slice rows | df[5:10] | DataFrame |
For example, selecting a row returns a Series whose index is the DataFrame's column names:
In [89]: df.loc["b"]
Out[89]:
one 2.0
bar 2.0
flag False
foo bar
one_trunc 2.0
Name: b, dtype: object
In [90]: df.iloc[2]
Out[90]:
one 3.0
bar 3.0
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
Indexing and slicing are covered in more detail in a later chapter on indexing.
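As a quick check of the table above, the sketch below runs each form of selection on a small frame and inspects the type of the result. The frame and its values are made up for illustration and are not the df used in the session above.

```python
import pandas as pd

# Hypothetical small frame, used only to illustrate the indexing table.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "flag": [False, False, True]},
    index=["a", "b", "c"],
)

col = df["one"]              # select a column -> Series
row_by_label = df.loc["b"]   # select a row by label -> Series
row_by_pos = df.iloc[2]      # select a row by integer position -> Series
filtered = df[df["flag"]]    # select rows with a boolean vector -> DataFrame
sliced = df[0:2]             # slice rows by position -> DataFrame

for name, obj in [("col", col), ("row_by_label", row_by_label),
                  ("row_by_pos", row_by_pos), ("filtered", filtered),
                  ("sliced", sliced)]:
    print(name, type(obj).__name__)

# A selected row is indexed by the frame's column names.
print(list(row_by_label.index))
```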
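Data alignment and arithmetic
Operations between DataFrame objects automatically align on both the index and the column labels. The result carries the union of the index and column labels, and any position missing from either operand becomes NaN.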
In [91]: df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
In [92]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])
In [93]: df + df2
Out[93]:
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
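Because the random data above makes the alignment hard to follow, here is a minimal sketch with fixed values. The frames left and right are made up for illustration. The add method with fill_value is not covered in the text above; it is shown only as one possible way to treat a value missing on one side as 0.

```python
import pandas as pd

left = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"A": [100.0, 200.0], "C": [5.0, 6.0]})

# The result has the union of both indexes (0, 1, 2) and columns (A, B, C).
# Any cell that is missing from either operand is NaN.
total = left + right
print(total)

# With fill_value=0, a cell missing from only one operand is treated as 0.
# A cell missing from both operands is still NaN.
filled = left.add(right, fill_value=0)
print(filled)
```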
In operations between a DataFrame and a Series, the default behavior is to align the Series index with the DataFrame columns and then broadcast row by row. For example:
In [94]: df - df.iloc[0]
Out[94]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2 0.253128 0.829678 0.010026 -1.991234
3 -1.311128 0.054325 -1.724913 -1.620544
4 0.573025 1.500742 -0.676070 1.367331
5 -1.741248 0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282 0.000430
7 -0.743894 0.411013 -0.929563 -0.282386
8 -1.194921 1.320690 0.238224 -1.482644
9 2.293786 1.856228 0.773289 -1.446531
What happens if we use a column instead?
>>> df
A B C
0 1 3 4
1 2 5 0
2 3 1 1
3 4 7 6
4 5 2 2
>>> df - df['A']
A B C 0 1 2 3 4
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
The index of the extracted column A is 0-4, which matches none of df's column labels A, B, and C, so every value in the result is NaN.
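If the goal is to subtract column A from every column, one option is the sub method with axis=0, which aligns the Series with the row index instead of the column labels. The sketch below rebuilds the small frame shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [3, 5, 1, 7, 2],
    "C": [4, 0, 1, 6, 2],
})

# axis=0 matches the Series index (0-4) against the DataFrame's row index,
# so the column is subtracted from each column, row by row.
result = df.sub(df["A"], axis=0)
print(result)
```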
Operations with scalars work just as they do for other data structures:
In [95]: df * 5 + 2
Out[95]:
A B C D
0 3.359299 -0.124862 4.835102 3.381160
1 -3.437003 -1.368449 2.568242 -5.392133
2 4.624938 4.023526 4.885230 -6.575010
3 -3.196342 0.146766 -3.789461 -4.721559
4 6.224426 7.378849 1.454750 10.217815
5 -5.346940 3.785103 -1.373001 -6.884519
6 -2.844569 -4.472618 4.068691 3.383309
7 -0.360173 1.930201 0.187285 1.969232
8 -2.615303 6.478587 6.026220 -4.032059
9 14.828230 9.156280 8.701544 -3.851494
In [96]: 1 / df
Out[96]:
A B C D
0 3.678365 -2.353094 1.763605 3.620145
1 -0.919624 -1.484363 8.799067 -0.676395
2 1.904807 2.470934 1.732964 -0.583090
3 -0.962215 -2.697986 -0.863638 -0.743875
4 1.183593 0.929567 -9.170108 0.608434
5 -0.680555 2.800959 -1.482360 -0.562777
6 -1.032084 -0.772485 2.416988 3.614523
7 -2.118489 -71.634509 -2.758294 -162.507295
8 -1.083352 1.116424 1.241860 -0.828904
9 0.389765 0.698687 0.746097 -0.854483
In [97]: df ** 4
Out[97]:
A B C D
0 0.005462 3.261689e-02 0.103370 5.822320e-03
1 1.398165 2.059869e-01 0.000167 4.777482e+00
2 0.075962 2.682596e-02 0.110877 8.650845e+00
3 1.166571 1.887302e-02 1.797515 3.265879e+00
4 0.509555 1.339298e+00 0.000141 7.297019e+00
5 4.661717 1.624699e-02 0.207103 9.969092e+00
6 0.881334 2.808277e+00 0.029302 5.858632e-03
7 0.049647 3.797614e-08 0.017276 1.433866e-09
8 0.725974 6.437005e-01 0.420446 2.118275e+00
9 43.329821 4.196326e+00 3.227153 1.875802e+00
Boolean operators work as well:
In [98]: df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
In [99]: df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)
In [100]: df1 & df2
Out[100]:
a b
0 False False
1 False True
2 True False
In [101]: df1 | df2
Out[101]:
a b
0 True True
1 True True
2 True True
In [102]: df1 ^ df2
Out[102]:
a b
0 True True
1 True False
2 False True
In [103]: -df1
Out[103]:
a b
0 False True
1 True False
2 False False
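Element-wise logical NOT is more commonly written with the ~ operator. The short sketch below applies it to the same df1 as above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)

# ~ flips every boolean value element-wise.
inverted = ~df1
print(inverted)
```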
Transposing
As with a multidimensional array (ndarray), you can transpose a DataFrame using the T attribute or the transpose method:
In [104]: df[:5].T
Out[104]:
0 1 2 3 4
A 0.271860 -1.087401 0.524988 -1.039268 0.844885
B -0.424972 -0.673690 0.404705 -0.370647 1.075770
C 0.567020 0.113648 0.577046 -1.157892 -0.109050
D 0.276232 -1.478427 -1.715002 -1.344312 1.643563
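A small sketch with fixed values, made up for illustration, shows that T and transpose() give the same result and that the row and column labels swap places.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

t = df.T
print(t)

# The former columns become the index, and the former index becomes the columns.
print(list(t.index), list(t.columns))

# The attribute and the method are interchangeable here.
print(t.equals(df.transpose()))
```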
Applying NumPy functions
If a DataFrame holds only numeric data, many NumPy functions can be applied to it directly:
In [105]: np.exp(df)
Out[105]:
A B C D
0 1.312403 0.653788 1.763006 1.318154
1 0.337092 0.509824 1.120358 0.227996
2 1.690438 1.498861 1.780770 0.179963
3 0.353713 0.690288 0.314148 0.260719
4 2.327710 2.932249 0.896686 5.173571
5 0.230066 1.429065 0.509360 0.169161
6 0.379495 0.274028 1.512461 1.318720
7 0.623732 0.986137 0.695904 0.993865
8 0.397301 2.449092 2.237242 0.299269
9 13.009059 4.183951 3.820223 0.310274
In [106]: np.asarray(df)
Out[106]:
array([[ 0.2719, -0.425 , 0.567 , 0.2762],
[-1.0874, -0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 , -1.715 ],
[-1.0393, -0.3706, -1.1579, -1.3443],
[ 0.8449, 1.0758, -0.109 , 1.6436],
[-1.4694, 0.357 , -0.6746, -1.7769],
[-0.9689, -1.2945, 0.4137, 0.2767],
[-0.472 , -0.014 , -0.3625, -0.0062],
[-0.9231, 0.8957, 0.8052, -1.2064],
[ 2.5656, 1.4313, 1.3403, -1.1703]])
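The two calls above return different kinds of objects. The following sketch, using a small made-up frame, shows that an element-wise function such as np.exp returns a labeled DataFrame, while np.asarray returns a plain ndarray without labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, 1.0], "B": [2.0, 3.0]}, index=["x", "y"])

exp_df = np.exp(df)   # still a DataFrame, same index and columns
arr = np.asarray(df)  # plain ndarray, labels are dropped

print(type(exp_df).__name__, list(exp_df.index), list(exp_df.columns))
print(type(arr).__name__, arr.shape)
```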
When several Series are passed to a NumPy universal function (ufunc), they are aligned by label before the function is applied. For example:
In [109]: ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
In [110]: ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])
In [111]: ser1
Out[111]:
a 1
b 2
c 3
dtype: int64
In [112]: ser2
Out[112]:
b 1
a 3
c 5
dtype: int64
In [113]: np.remainder(ser1, ser2)
Out[113]:
a 1
b 0
c 3
dtype: int64
The result carries the union of the two indexes, and labels present in only one input are filled with NaN:
In [114]: ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])
In [115]: ser3
Out[115]:
b 2
c 4
d 6
dtype: int64
In [116]: np.remainder(ser1, ser3)
Out[116]:
a NaN
b 0.0
c 3.0
d NaN
dtype: float64
When a binary ufunc is applied to a Series and an Index, the Series implementation takes precedence and a Series is returned:
In [117]: ser = pd.Series([1, 2, 3])
In [118]: idx = pd.Index([4, 5, 6])
In [119]: np.maximum(ser, idx)
Out[119]:
0 4
1 5
2 6
dtype: int64
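In the session above the Index value wins at every position, so it is hard to tell which input shaped the result. The variation below, with made-up values, lets the Series win in one position and checks the type of the result.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 7, 3])
idx = pd.Index([4, 5, 6])

# The values are compared position by position; the result is a Series
# that keeps the Series' own index (0, 1, 2).
result = np.maximum(ser, idx)
print(result)
print(type(result).__name__)
```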
Console display
Very large DataFrames are truncated when displayed in the console, showing only the first and last few rows:
In [120]: baseball = pd.read_csv("data/baseball.csv")
In [121]: print(baseball)
id player year stint team lg g ab r h ... rbi sb cs bb so ibb hbp sh sf gidp
0 88641 womacto01 2006 2 CHN NL 19 50 6 14 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... .. .. ... .. ... ... ... ... ... .. ... ... ... ... ... ...
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
[100 rows x 23 columns]
Use the info method to display a summary:
In [122]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100 non-null int64
1 player 100 non-null object
2 year 100 non-null int64
3 stint 100 non-null int64
4 team 100 non-null object
5 lg 100 non-null object
6 g 100 non-null int64
7 ab 100 non-null int64
8 r 100 non-null int64
9 h 100 non-null int64
10 X2b 100 non-null int64
11 X3b 100 non-null int64
12 hr 100 non-null int64
13 rbi 100 non-null float64
14 sb 100 non-null float64
15 cs 100 non-null float64
16 bb 100 non-null int64
17 so 100 non-null float64
18 ibb 100 non-null float64
19 hbp 100 non-null float64
20 sh 100 non-null float64
21 sf 100 non-null float64
22 gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB
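The summary can also be captured as a string instead of being printed, which may be handy for logging. The sketch below assumes a tiny made-up frame rather than the baseball data, and uses the buf and memory_usage parameters of info. Passing memory_usage="deep" asks pandas to measure the actual size of object columns; without it the estimate is marked with a trailing +, as in the output above.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the baseball data.
df = pd.DataFrame({"player": ["womacto01", "schilcu01"], "g": [19, 31]})

buffer = io.StringIO()
df.info(buf=buffer, memory_usage="deep")  # write the summary into the buffer
summary = buffer.getvalue()

print(summary)
```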
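By default, a wide DataFrame is printed across multiple lines. The display.width option sets the line width, in characters, used for this. Depending on the environment, the middle columns may instead be elided with ..., as in the output below: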
In [123]: pd.set_option("display.width", 40) # default is 80
In [124]: pd.DataFrame(np.random.randn(3, 12))
Out[124]:
0 1 2 3 4 ... 7 8 9 10 11
0 -2.182937 0.380396 0.084844 0.432390 1.519970 ... 0.274230 0.132885 -0.023688 2.410179 1.450520
1 0.206053 -0.251905 -2.213588 1.063327 1.266143 ... 0.408204 -1.048089 -0.025747 -0.988387 0.094055
2 1.262731 1.289997 0.082423 -0.055758 0.536580 ... -0.034571 -2.484478 -0.281461 0.030711 0.109121
[3 rows x 12 columns]
The display.max_colwidth option sets the maximum width of an individual column; longer values are truncated:
In [125]: datafile = {
.....: "filename": ["filename_01", "filename_02"],
.....: "path": [
.....: "media/user_name/storage/folder_01/filename_01",
.....: "media/user_name/storage/folder_02/filename_02",
.....: ],
.....: }
.....:
In [126]: pd.set_option("display.max_colwidth", 30)
In [127]: pd.DataFrame(datafile)
Out[127]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...
In [128]: pd.set_option("display.max_colwidth", 100)
In [129]: pd.DataFrame(datafile)
Out[129]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02
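Calling pd.set_option changes the setting for the rest of the session. If the change is only needed for one print, pd.option_context can apply it temporarily and restore the previous value afterwards. This is a minimal sketch with a one-row frame made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"path": ["media/user_name/storage/folder_01/filename_01"]})

before = pd.get_option("display.max_colwidth")

# Inside the block the column is limited to 30 characters.
with pd.option_context("display.max_colwidth", 30):
    short = repr(df)

# Inside this block the column may be up to 100 characters wide.
with pd.option_context("display.max_colwidth", 100):
    full = repr(df)

after = pd.get_option("display.max_colwidth")

print(short)
print(full)
print(before == after)  # the original setting is restored on exit
```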
DataFrame column attribute access
If a column label is a valid Python identifier, the column can be accessed like an attribute:
In [130]: df = pd.DataFrame({'foo1': np.random.randn(5),
.....: 'foo2': np.random.randn(5)})
.....:
In [131]: df
Out[131]:
foo1 foo2
0 1.171216 -0.858447
1 0.520260 0.306996
2 -1.197071 -0.028665
3 -1.066969 0.384316
4 -0.303421 1.574159
In [132]: df.foo1
Out[132]:
0 1.171216
1 0.520260
2 -1.197071
3 -1.066969
4 -0.303421
Name: foo1, dtype: float64
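Attribute access has limits worth keeping in mind: a label that clashes with an existing DataFrame method, or one that is not a valid identifier, can only be reached with brackets. The column names in this sketch are made up to show both cases.

```python
import pandas as pd

df = pd.DataFrame({"foo1": [1.0, 2.0], "count": [10, 20], "my col": [5, 6]})

# For an ordinary name, attribute access and bracket access agree.
same = df.foo1.equals(df["foo1"])

# "count" is also a DataFrame method, so the attribute resolves to the method.
is_method = callable(df.count)

# Bracket access always reaches the column.
column = df["count"]

# "my col" is not a valid identifier, so brackets are the only option.
spaced = df["my col"]

print(same, is_method, column.tolist(), spaced.tolist())
```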