整洁数据:汇总多列(melt)&拆分列(pivot)

2018-10-09 本文已影响6人 IntoTheVoid

识别整洁的数据
要使数据整洁，它必须具有：

每个变量作为单独的列。
每行作为单独的观察。
作为数据科学家，您将遇到以各种不同方式表示的数据，因此在您看到数据时能够识别整洁（或不整洁）数据非常重要。

image.png

例如上图中的df2就不是整洁的数据,因为多个变量都挤在了同一列中.

Melt data

DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted.

假如想让df1变为df2的样子, 代码如下:

# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())

可以通过如下方式给融合的列自定义列名

# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())

image.png

Pivot data

pivot data 是melt data 的逆过程

image.png

示例:

# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())

image.png

恢复索引

# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())

image.png

Pivoting duplicate values

在pivot_table中传入aggfunc=

Splitting a column with .split() and .get()

对于如下的结构

image.png

如何将Cases_xxxx和Death_xxxx拆成两列一列为case, 一列为xxxx

Melt ebola using 'Date' and 'Day' as theid_vars,'type_country' as the var_name, and'counts'as the value_name.
Create a column called'str_split' by splitting the'type_country' column of ebola_melt on'_'. Note that you will first have to access the str attribute of type_country before you can use .split().
Create a column called'type' by using the .get() method to retrieve index 0of the'str_split' column of ebola_melt.
Create a column called'country' by using the.get()method to retrieve index 1 of the'str_split' column of ebola_melt.
Print the head of ebola.

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str[0]

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str[1]

# Print the head of ebola_melt
print(ebola_melt.head())

整洁数据:汇总多列(melt)&拆分列(pivot)

Melt data

Pivot data

Pivoting duplicate values

Splitting a column with .split() and .get()

猜你喜欢

热点阅读