整洁数据:汇总多列(melt)&拆分列(pivot)
2018-10-09 本文已影响6人
IntoTheVoid
识别整洁的数据
要使数据整洁,它必须具有:
- 每个变量作为单独的列。
- 每行作为单独的观察。
作为数据科学家,您将遇到以各种不同方式表示的数据,因此在您看到数据时能够识别整洁(或不整洁)数据非常重要。
image.png
例如上图中的df2就不是整洁的数据,因为多个变量都挤在了同一列中.
Melt data
DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted.
假如想让df1变为df2的样子, 代码如下:
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
# Print the head of airquality_melt
print(airquality_melt.head())
可以通过如下方式给融合的列自定义列名
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
# Print the head of airquality_melt
print(airquality_melt.head())
image.png
Pivot data
pivot data 是melt data 的逆过程
image.png
示例:
# Print the head of airquality_melt
print(airquality_melt.head())
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')
# Print the head of airquality_pivot
print(airquality_pivot.head())
image.png
恢复索引
# Print the index of airquality_pivot
print(airquality_pivot.index)
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
image.png
Pivoting duplicate values
在pivot_table中传入aggfunc=
Splitting a column with .split() and .get()
对于如下的结构
image.png
如何将Cases_xxxx和Death_xxxx拆成两列一列为case, 一列为xxxx
- Melt
ebolausing'Date'and'Day'as theid_vars,'type_country'as thevar_name, and'counts'as thevalue_name. - Create a column called
'str_split'by splitting the'type_country'column ofebola_melton'_'. Note that you will first have to access thestrattribute oftype_countrybefore you can use.split(). - Create a column called
'type'by using the.get()method to retrieve index0of the'str_split'column ofebola_melt. - Create a column called
'country'by using the.get()method to retrieve index 1 of the'str_split'column ofebola_melt. - Print the head of
ebola.
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str[0]
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str[1]
# Print the head of ebola_melt
print(ebola_melt.head())