Python 数据处理(三十四)—— 重塑数据(续)
4 与统计函数和 groupby 结合使用
将 pivot/stack/unstack
与 groupby
以及基本的统计函数结合使用,可以展现一些非常快速数据操作
In [54]: df
Out[54]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
two -0.076467 -1.187678 1.130127 -1.436737
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux one -0.410001 -0.078638 0.545952 -1.219217
two -1.226825 0.769804 -1.281247 -0.727707
In [55]: df.stack().mean(1).unstack()
Out[55]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
# 使用不同的方法,结果与上面的相同
In [56]: df.groupby(level=1, axis=1).mean()
Out[56]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
In [57]: df.stack().groupby(level=1).mean()
Out[57]:
exp A B
second
one 0.071448 0.455513
two -0.424186 -0.204486
In [58]: df.mean().unstack(0)
Out[58]:
exp A B
animal
cat 0.060843 0.018596
dog -0.413580 0.232430
5 透视表
虽然 pivot()
提供了各种数据类型(字符串、数字等)的通用转换,但 pandas
也提供了 pivot_table()
用于聚合的数值数据的转换
函数 pivot_table()
可用于创建 Excel
电子表格样式的透视表
主要包含如下参数:
考虑下面的数据
In [59]: import datetime
In [60]: df = pd.DataFrame(
....: {
....: "A": ["one", "one", "two", "three"] * 6,
....: "B": ["A", "B", "C"] * 8,
....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
....: "D": np.random.randn(24),
....: "E": np.random.randn(24),
....: "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
....: + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
....: }
....: )
....:
In [61]: df
Out[61]:
A B C D E F
0 one A foo 0.341734 -0.317441 2013-01-01
1 one B foo 0.959726 -1.236269 2013-02-01
2 two C foo -1.110336 0.896171 2013-03-01
3 three A bar -0.619976 -0.487602 2013-04-01
4 one B bar 0.149748 -0.082240 2013-05-01
.. ... .. ... ... ... ...
19 three B foo 0.690579 -2.213588 2013-08-15
20 one C foo 0.995761 1.063327 2013-09-15
21 one A bar 2.396780 1.266143 2013-10-15
22 two B bar 0.014871 0.299368 2013-11-15
23 three C bar 3.357427 -0.863838 2013-12-15
[24 rows x 6 columns]
我们可以很方便的从这个数据中生成透视表
In [62]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[62]:
C bar foo
A B
one A 1.120915 -0.514058
B -0.338421 0.002759
C -0.538846 0.699535
three A -1.181568 NaN
B NaN 0.433512
C 0.588783 NaN
two A NaN 1.000985
B 0.158248 NaN
C NaN 0.176180
In [63]: pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
Out[63]:
A one three two
C bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360
In [64]: pd.pivot_table(
....: df, values=["D", "E"],
....: index=["B"],
....: columns=["A", "C"],
....: aggfunc=np.sum,
....: )
....:
Out[64]:
D E
A one three two one three two
C bar foo bar foo bar foo bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971 2.786113 -0.043211 1.922577 NaN NaN 0.128491
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN 1.368280 -1.103384 NaN -2.128743 -0.194294 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360 -1.976883 1.495717 -0.263660 NaN NaN 0.872482
产生的 DataFrame
在行和列上可能会生成层次索引,如果未指定 values
参数,则默认会把剩余能够整合的列添加到额外的列索引上
In [65]: pd.pivot_table(df, index=["A", "B"], columns=["C"])
Out[65]:
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 NaN 0.961289 NaN
B NaN 0.433512 NaN -1.064372
C 0.588783 NaN -0.131830 NaN
two A NaN 1.000985 NaN 0.064245
B 0.158248 NaN -0.097147 NaN
C NaN 0.176180 NaN 0.436241
也可以为 index
和 columns
参数指定 Grouper
In [66]: pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
Out[66]:
C bar foo
F
2013-01-31 NaN -0.514058
2013-02-28 NaN 0.002759
2013-03-31 NaN 0.176180
2013-04-30 -1.181568 NaN
2013-05-31 -0.338421 NaN
2013-06-30 -0.538846 NaN
2013-07-31 NaN 1.000985
2013-08-31 NaN 0.433512
2013-09-30 NaN 0.699535
2013-10-31 1.120915 NaN
2013-11-30 0.158248 NaN
2013-12-31 0.588783 NaN
你可以通过调用 to_string
将输出渲染为字符串,省略掉缺失的值
In [67]: table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
In [68]: print(table.to_string(na_rep=""))
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 0.961289
B 0.433512 -1.064372
C 0.588783 -0.131830
two A 1.000985 0.064245
B 0.158248 -0.097147
C 0.176180 0.436241
pivot_table
也有对应的实例方法,如 DataFrame.pivot_table()
5.1 添加 margins 参数
如果在 pivot_table
中设置了 margins=True
,会在输出结果中添加名为 All
的行和列,作为分类数据的汇总
In [69]: df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
Out[69]:
D E
C bar foo All bar foo All
A B
one A 1.804346 1.210272 1.569879 0.179483 0.418374 0.858005
B 0.690376 1.353355 0.898998 1.083825 0.968138 1.101401
C 0.273641 0.418926 0.771139 1.689271 0.446140 1.422136
three A 0.794212 NaN 0.794212 2.049040 NaN 2.049040
B NaN 0.363548 0.363548 NaN 1.625237 1.625237
C 3.915454 NaN 3.915454 1.035215 NaN 1.035215
two A NaN 0.442998 0.442998 NaN 0.447104 0.447104
B 0.202765 NaN 0.202765 0.560757 NaN 0.560757
C NaN 1.819408 1.819408 NaN 0.650439 0.650439
All 1.556686 0.952552 1.246608 1.250924 0.899904 1.059389
6 交叉表
使用 crosstab()
可以计算两个或多个因子的交叉表。默认会统计因子的频率,除非传入了一组数组值和聚合函数
参数如下:
除非为交叉表指定了行或列的名称,否则传递的任何 Series
都将使用其 name
属性命名
例如
In [70]: foo, bar, dull, shiny, one, two = "foo", "bar", "dull", "shiny", "one", "two"
In [71]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
In [72]: b = np.array([one, one, two, one, two, one], dtype=object)
In [73]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
In [74]: pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
Out[74]:
b one two
c dull shiny dull shiny
a
bar 1 0 0 1
foo 2 1 1 0
如果 crosstab
只传递了两个 Series
,将会返回一个频数表
In [75]: df = pd.DataFrame(
....: {"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4], "C": [1, 1, np.nan, 1, 1]}
....: )
....:
In [76]: df
Out[76]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
In [77]: pd.crosstab(df["A"], df["B"])
Out[77]:
B 3 4
A
1 1 0
2 1 3
也可以传入 Categorical
类型的数据
In [78]: foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
In [79]: bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])
In [80]: pd.crosstab(foo, bar)
Out[80]:
col_0 d e
row_0
a 1 0
b 0 1
如果你想包含所有的类别信息,即使实际数据不包含任何特定类别的实例,你可以设置 dropna=False
In [81]: pd.crosstab(foo, bar, dropna=False)
Out[81]:
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
6.1 标准化
频率表也可以使用 normalize
参数进行标准化,以显示百分比而不是计数
In [82]: pd.crosstab(df["A"], df["B"], normalize=True)
Out[82]:
B 3 4
A
1 0.2 0.0
2 0.2 0.6
normalize
还可以对每一行或每一列中的值进行标准化
In [83]: pd.crosstab(df["A"], df["B"], normalize="columns")
Out[83]:
B 3 4
A
1 0.5 0.0
2 0.5 1.0
crosstab
还可以传递第三个 Series
和一个聚合函数(aggfunc
),使用前两个 Series
定义的组和对第三个 Series
分组,并在每个分组中应用聚合函数
In [84]: pd.crosstab(df["A"], df["B"], values=df["C"], aggfunc=np.sum)
Out[84]:
B 3 4
A
1 1.0 NaN
2 1.0 2.0
6.2 添加 margins 参数
该函数也支持 margins
参数
In [85]: pd.crosstab(
....: df["A"], df["B"], values=df["C"], aggfunc=np.sum, normalize=True, margins=True
....: )
....:
Out[85]:
B 3 4 All
A
1 0.25 0.0 0.25
2 0.25 0.5 0.75
All 0.50 0.5 1.00
7 分片
cut()
函数计算输入数组值的分组,通常用于将连续变量转化为离散变量或分类变量。
In [86]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
In [87]: pd.cut(ages, bins=3)
Out[87]:
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
如果 bins
关键字是整数,则将形成等宽的 bin
。我们也可以指定自定义边界
In [88]: c = pd.cut(ages, bins=[0, 18, 35, 70])
In [89]: c
Out[89]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
如果关键字 bins
是 IntervalIndex
,那么将对传入的数据进行变换
>>> pd.cut([25, 20, 50], bins=c.categories)
[(18, 35], (18, 35], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
8 计算标记变量
我们可以将类别型的变量转换为一种标记型的 DataFrame
。例如, DataFrame
的一列或一个 Series
对象,包含 k
个不同的值。
使用 get_dummies
可以将其转换为一个包含 k
列 0
、1
值的 DataFrame
In [90]: df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
In [91]: pd.get_dummies(df["key"])
Out[91]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
key
这一列包含 6
个值,有 3
个不重复的值,使用 get_dummies
后将其转换为了 6×3
的 DataFrame
,值为 1
指示了对应位置出现了该元素
可以为这些列名添加一个前缀
In [92]: dummies = pd.get_dummies(df["key"], prefix="key")
In [93]: dummies
Out[93]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [94]: df[["data1"]].join(dummies)
Out[94]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
这个函数经常与 cut
函数一起使用
In [95]: values = np.random.randn(10)
In [96]: values
Out[96]:
array([ 0.4082, -1.0481, -0.0257, -0.9884, 0.0941, 1.2627, 1.29 ,
0.0824, -0.0558, 0.5366])
In [97]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
In [98]: pd.get_dummies(pd.cut(values, bins))
Out[98]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 1 0 0 0 0
8 0 0 0 0 0
9 0 0 1 0 0
get_dummies()
也可以传入一个 DataFrame
。
默认情况下,所有的分类变量都被编码为虚拟变量
In [99]: df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
In [100]: pd.get_dummies(df)
Out[100]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
所有的非对象列都会被原样输出,你可以用 columns
关键字控制被编码的列
In [101]: pd.get_dummies(df, columns=["A"])
Out[101]:
B C A_a A_b
0 c 1 1 0
1 c 2 0 1
2 b 3 1 0
与 Series
一样,可以使用 prefix
和 prefix_sep
设置列名前缀和前缀分隔符。
默认情况下,使用原来的列名作为前缀,使用 _
作为分隔符,支持三种设置方式
- 字符串:每列使用相同的前缀
- 列表:与列名长度一致
- 字典:列名映射到前缀的字典
In [102]: simple = pd.get_dummies(df, prefix="new_prefix")
In [103]: simple
Out[103]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [104]: from_list = pd.get_dummies(df, prefix=["from_A", "from_B"])
In [105]: from_list
Out[105]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [106]: from_dict = pd.get_dummies(df, prefix={"B": "from_B", "A": "from_A"})
In [107]: from_dict
Out[107]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
有时,将结果提交给统计模型时,只要保留分类变量的 k-1
个水平,以避免共线性。
可以使用 drop_first
开启此功能
In [108]: s = pd.Series(list("abcaa"))
In [109]: pd.get_dummies(s)
Out[109]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
In [110]: pd.get_dummies(s, drop_first=True)
Out[110]:
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
当一个列只包含一个分类级别时,它将在结果中被省略
In [111]: df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})
In [112]: pd.get_dummies(df)
Out[112]:
A_a B_a B_b B_c
0 1 1 0 0
1 1 0 1 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
In [113]: pd.get_dummies(df, drop_first=True)
Out[113]:
B_b B_c
0 0 0
1 1 0
2 0 0
3 1 0
4 0 1
默认情况下,新列是 np.uint8
类型,可以使用 dtype
参数设置类型
In [114]: df = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})
In [115]: pd.get_dummies(df, dtype=bool).dtypes
Out[115]:
B float64
A_a bool
A_b bool
A_c bool
dtype: object
9 因子化值
可以使用 factorize()
将 一维的值编码为一个枚举类型
In [116]: x = pd.Series(["A", "A", np.nan, "B", 3.14, np.inf])
In [117]: x
Out[117]:
0 A
1 A
2 NaN
3 B
4 3.14
5 inf
dtype: object
In [118]: labels, uniques = pd.factorize(x)
In [119]: labels
Out[119]: array([ 0, 0, -1, 1, 2, 3])
In [120]: uniques
Out[120]: Index(['A', 'B', 3.14, inf], dtype='object')