DataCamp Course <Data Manipulation with dplyr> Chapter 2
2021-07-15
Jason数据分析生信教室 (Jason's Data Analysis and Bioinformatics Classroom)
Data Manipulation with dplyr: course contents
Chapter 1. Transforming Data
Chapter 2. Aggregating Data
Chapter 3. Selecting and Transforming Data
Chapter 4. Case Study
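Chapter 2. Aggregating Data
Counting frequencies with count()
Continuing from Chapter 1, first select several variables to form a new dataset, then count how many times each region appears. Setting sort = TRUE orders the result from the most frequent region to the least.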
counties_selected <- counties %>%
  select(county, region, state, population, citizens)
counties_selected %>%
  count(region, sort = TRUE)
# A tibble: 4 x 2
  region            n
  <chr>         <int>
1 South          1420
2 North Central  1054
3 West            447
4 Northeast       217
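To try count() without the course data, here is a minimal sketch on a small made-up table. The toy data frame and its values are invented for illustration and are not part of the counties dataset.

```r
library(dplyr)

# Hypothetical toy data: six counties in three regions (values invented)
toy <- tibble(
  county = c("A", "B", "C", "D", "E", "F"),
  region = c("South", "West", "South", "South", "West", "Northeast")
)

# One row per region; n is the number of rows in that region, largest first
region_counts <- toy %>%
  count(region, sort = TRUE)

print(region_counts)
```

Without sort = TRUE, the rows would come back ordered by region name rather than by n.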
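Set wt to weight the count. Instead of counting rows, count() then sums the wt column within each group, so n is a weighted total. With sort = TRUE, the result is ordered by that total.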
# walk is not among the columns selected above, so start from the full counties table
counties %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)
# A tibble: 50 x 2
   state                n
   <chr>            <dbl>
 1 New York      1237938.
 2 California    1017964.
 3 Pennsylvania   505397.
 4 Texas          430783.
 5 Illinois       400346.
 6 Massachusetts  316765.
 7 Florida        284723.
 8 New Jersey     273047.
 9 Ohio           266911.
10 Washington     239764.
# ... with 40 more rows
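To make the meaning of wt concrete, the sketch below checks on invented data that a weighted count gives the same totals as grouping and summing by hand. The toy table, its values, and the names weighted and by_hand are assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state      = c("X", "X", "Y", "Y", "Y"),
  population = c(1000, 3000, 500, 500, 1000),
  walk       = c(10, 5, 2, 4, 1)   # percent of residents who walk to work
)

toy_walk <- toy %>%
  mutate(population_walk = population * walk / 100)

# Weighted count: n is the sum of population_walk within each state
weighted <- toy_walk %>%
  count(state, wt = population_walk, sort = TRUE)

# The same totals written out with group_by() and summarize()
by_hand <- toy_walk %>%
  group_by(state) %>%
  summarize(n = sum(population_walk)) %>%
  arrange(desc(n))

print(weighted)
print(by_hand)
```

Both tables should list state X with 250 walkers and state Y with 40.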
Combining mutate and count
The example above already combines the two verbs: mutate() adds a new variable, population_walk, and count() then uses it as the weight for each state and sorts the result.
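Descriptive statistics with summarize and group_by
Use group_by to group the data by state, then use summarize to compute descriptive statistics for each group. Here the statistic is the sum of each column of interest.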
# Group by state and find the total area and population
# land_area is not among the columns selected earlier, so start from counties
counties %>%
  select(state, county, population, land_area) %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population))
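The output captured here does not match the grouped call above, which returns one row per state with the columns total_area and total_population. A single row with these column names is what an ungrouped summarize() returns, presumably one that takes a minimum population, a maximum unemployment rate, and a mean income over all counties.
# A tibble: 1 x 3
  min_population max_unemployment average_income
           <dbl>            <dbl>          <dbl>
1             85             29.4         46832.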
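The difference between a grouped and an ungrouped summarize() can be sketched on invented data as follows. The toy table and the names per_state and overall are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state      = c("X", "X", "Y"),
  population = c(100, 300, 50),
  land_area  = c(10, 20, 5)
)

# Grouped: one row per state
per_state <- toy %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population))

# Ungrouped: a single row covering the whole table
overall <- toy %>%
  summarize(min_population = min(population),
            max_population = max(population))

print(per_state)
print(overall)
```

The grouped call returns two rows, one per state, while the ungrouped call collapses the whole table into one row.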
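Now a slightly more complex exercise. First group the data by region and state, then sum population within each group and name the result total_pop. By default, each summarize() call drops the last grouping variable, so after the first call the data is still grouped by region. The second summarize() therefore returns the mean and median of total_pop across the states of each region.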
# Calculate the average_pop and median_pop columns
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))
# A tibble: 4 x 3
  region        average_pop median_pop
  <chr>               <dbl>      <dbl>
1 North Central    5627687.    5580644
2 Northeast        6221058.    3593222
3 South            7370486     4804098
4 West             5722755.    2798636
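Why two summarize() calls in a row produce one row per region can be seen by inspecting the grouping after the first call. The sketch below uses invented data and assumes dplyr 1.0.0 or later, where the .groups argument is available; .groups = "drop_last" only spells out the default behaviour and silences the informational message.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  region     = c("R1", "R1", "R1", "R2", "R2"),
  state      = c("X", "X", "Y", "Z", "Z"),
  population = c(100, 300, 200, 50, 150)
)

# First summarize: one row per (region, state); the state grouping is dropped
state_totals <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population), .groups = "drop_last")

# Remaining grouping variables after the first summarize
print(group_vars(state_totals))

# Second summarize: one row per region
region_stats <- state_totals %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))

print(region_stats)
```

Here region R1 has state totals 400 and 200, so its mean and median are both 300, while R2 has a single state with total 200.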
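Using top_n
After grouping, top_n keeps the rows of each group that rank in the top n on a given variable. n can be any number, depending on what you need.
In the example below, the data is grouped by region and the row with the largest walk value in each region is kept, so n = 1.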
# metro and walk are not among the columns selected earlier, so start from counties
counties %>%
  select(region, state, county, metro, population, walk) %>%
  group_by(region) %>%
  top_n(1, walk)
# A tibble: 4 x 6
# Groups:   region [4]
  region        state        county                 metro    population  walk
  <chr>         <chr>        <chr>                  <chr>         <dbl> <dbl>
1 West          Alaska       Aleutians East Borough Nonmetro       3304  71.2
2 Northeast     New York     New York               Metro       1629507  20.7
3 North Central North Dakota McIntosh               Nonmetro       2759  17.5
4 South         Virginia     Lexington city         Nonmetro       7071  31.7
In the same way, to see the top two rows in each group, set n = 2.
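The n = 2 case can be sketched on invented data as shown below. Since dplyr 1.0.0, top_n() is marked as superseded in favour of slice_max() and slice_min(), although it still works, so the sketch shows both forms. The toy table and the names top2 and top2_slice are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  region = c("R1", "R1", "R1", "R2", "R2", "R2"),
  county = c("a", "b", "c", "d", "e", "f"),
  walk   = c(5, 9, 7, 3, 1, 8)
)

# Two highest walk values in each region, using top_n()
top2 <- toy %>%
  group_by(region) %>%
  top_n(2, walk)

# The same selection with slice_max(), the newer equivalent
top2_slice <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 2)

print(top2)
print(top2_slice)
```

Both calls keep counties b and c from R1 and counties d and f from R2. When values are tied, both functions can return more than n rows per group.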