整洁数据:汇总多列(melt)&拆分列(pivot)
2018-10-09 本文已影响6人
IntoTheVoid
识别整洁的数据
要使数据整洁,它必须具有:
- 每个变量作为单独的列。
- 每行作为单独的观察。
作为数据科学家,您将遇到以各种不同方式表示的数据,因此在您看到数据时能够识别整洁(或不整洁)数据非常重要。
例如上图中的df2就不是整洁的数据,因为多个变量都挤在了同一列中.
Melt data
DataFrame using pd.melt()
. There are two parameters you should be aware of: id_vars
and value_vars
. The id_vars
represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars
represent the columns you do wish to melt into rows. By default, if no value_vars
are provided, all columns not set in the id_vars
will be melted.
假如想让df1变为df2的样子, 代码如下:
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
# Print the head of airquality_melt
print(airquality_melt.head())
可以通过如下方式给融合的列自定义列名
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
# Print the head of airquality_melt
print(airquality_melt.head())
image.png
Pivot data
pivot data 是melt data 的逆过程
image.png
示例:
# Print the head of airquality_melt
print(airquality_melt.head())
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')
# Print the head of airquality_pivot
print(airquality_pivot.head())
image.png
恢复索引
# Print the index of airquality_pivot
print(airquality_pivot.index)
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
image.png
Pivoting duplicate values
在pivot_table
中传入aggfunc=
Splitting a column with .split() and .get()
对于如下的结构
image.png
如何将Cases_xxxx
和Death_xxxx
拆成两列一列为case, 一列为xxxx
- Melt
ebola
using'Date'
and'Day'
as theid_vars
,'type_country'
as thevar_name
, and'counts'
as thevalue_name
. - Create a column called
'str_split'
by splitting the'type_country'
column ofebola_melt
on'_'
. Note that you will first have to access thestr
attribute oftype_country
before you can use.split()
. - Create a column called
'type'
by using the.get()
method to retrieve index0
of the'str_split'
column ofebola_melt
. - Create a column called
'country'
by using the.get()
method to retrieve index 1 of the'str_split'
column ofebola_melt.
- Print the head of
ebola
.
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str[0]
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str[1]
# Print the head of ebola_melt
print(ebola_melt.head())