DataCamp Course <Data Manipulation with dplyr> Chapter 2

2021-07-15  Jason数据分析生信教室

Data Manipulation with dplyr: course outline

Chapter 1. Transforming data
Chapter 2. Aggregating data
Chapter 3. Selecting and transforming data
Chapter 4. Case study

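Chapter 2. Aggregating data

Counting frequencies with count()

Continuing from Chapter 1, first select several variables to form a new dataset, then count how many times each region appears. Setting sort = TRUE orders the result from the most frequent region to the least frequent.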

counties_selected <- counties %>%
  select(county, region, state, population, citizens)

counties_selected %>%
  count(region, sort = TRUE)
# A tibble: 4 x 2
  region            n
  <chr>         <int>
1 South          1420
2 North Central  1054
3 West            447
4 Northeast       217
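
To make it clearer what count() does, here is a minimal sketch on a small made-up tibble. The toy data and its values are hypothetical and are not the course's counties dataset. It checks that count(region, sort = TRUE) gives the same result as grouping, summarizing with n(), and arranging in descending order.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties table
toy <- tibble(
  county = c("A", "B", "C", "D", "E", "F"),
  region = c("South", "West", "South", "Northeast", "South", "West")
)

# count() with sort = TRUE: one row per region, largest n first
by_count <- toy %>%
  count(region, sort = TRUE)

# The same result written out step by step
by_hand <- toy %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

print(by_count)
```

In other words, count() is a shortcut for a very common group_by() plus summarize() pattern.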

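The wt argument weights the count: instead of counting rows, count() sums the wt column within each group, and sort = TRUE orders the result by that weighted total. The walk column is not part of counties_selected as defined above, so this example selects it from counties first.

counties %>%
  select(county, region, state, population, walk) %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)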
# A tibble: 50 x 2
   state                n
   <chr>            <dbl>
 1 New York      1237938.
 2 California    1017964.
 3 Pennsylvania   505397.
 4 Texas          430783.
 5 Illinois       400346.
 6 Massachusetts  316765.
 7 Florida        284723.
 8 New Jersey     273047.
 9 Ohio           266911.
10 Washington     239764.
# ... with 40 more rows
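
As a small check of how wt behaves, the sketch below uses a hypothetical three-state tibble (the names and numbers are invented for illustration) and compares the weighted count with an explicit sum per group. It assumes, as in the example above, that walk is a percentage.

```r
library(dplyr)

# Hypothetical toy data; walk is the percentage of people who walk to work
toy <- tibble(
  state      = c("X", "X", "Y", "Y", "Z"),
  population = c(1000, 3000, 2000, 2000, 500),
  walk       = c(10, 5, 2, 3, 10)
)

# Weighted count: n is the sum of population_walk per state, not a row count
weighted <- toy %>%
  mutate(population_walk = population * walk / 100) %>%
  count(state, wt = population_walk, sort = TRUE)

# Equivalent: sum the weight column within each state, then sort
by_hand <- toy %>%
  mutate(population_walk = population * walk / 100) %>%
  group_by(state) %>%
  summarize(n = sum(population_walk)) %>%
  arrange(desc(n))

print(weighted)
```

State X has two rows but n = 250, which shows that the n column holds the weighted total rather than the number of rows.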

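Using mutate() and count() together

mutate() adds a new variable, and count() then uses that variable as the weight for each state and sorts the result. This is the same pattern as the weighted count above, shown step by step.

counties %>%
  select(county, region, state, population, walk) %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)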

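Descriptive statistics with summarize() and group_by()

group_by() splits the data by state, and summarize() then computes descriptive statistics within each group. Here it returns the total land area and total population of every state.

# Group by state and find the total area and population
# (start from counties: land_area is not in counties_selected as defined earlier)
counties %>%
  select(state, county, population, land_area) %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area), total_population = sum(population))

Without group_by(), summarize() collapses the whole table into a single row. The output below has that shape: judging by its column names, it reports the minimum population, maximum unemployment and average income rather than the per-state totals computed above.

# A tibble: 1 x 3
  min_population max_unemployment average_income
           <dbl>            <dbl>          <dbl>
1             85             29.4         46832.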
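
The difference between a grouped and an ungrouped summarize() is easy to see on a tiny example. The sketch below uses a hypothetical tibble with invented values: the ungrouped call returns one row for the whole table, the grouped call returns one row per state.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties table
toy <- tibble(
  state      = c("X", "X", "Y"),
  land_area  = c(10, 20, 5),
  population = c(100, 300, 50)
)

# Without group_by(): the whole table collapses to a single row
overall <- toy %>%
  summarize(total_area = sum(land_area), total_population = sum(population))

# With group_by(state): one row per state
per_state <- toy %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area), total_population = sum(population))

print(overall)
print(per_state)
```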

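Now a slightly more involved exercise: group the data by region and state, sum the population within each group and name the result total_pop, then compute the mean and median of total_pop. Each summarize() call removes the last grouping level, so after the first call the data is still grouped by region and the second call produces one row per region.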

# Calculate the average_pop and median_pop columns 
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>% 
  summarize(average_pop = mean(total_pop),median_pop=median(total_pop))
# A tibble: 4 x 3
  region        average_pop median_pop
  <chr>               <dbl>      <dbl>
1 North Central    5627687.    5580644
2 Northeast        6221058.    3593222
3 South            7370486     4804098
4 West             5722755.    2798636
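
The two-step summary works because summarize() drops one grouping level at a time. Here is a minimal sketch on hypothetical data (invented states and populations) that makes the remaining grouping visible with group_vars(). The .groups = "drop_last" argument, available in dplyr 1.0.0 and later, only spells out what summarize() does by default here.

```r
library(dplyr)

# Hypothetical toy data: several counties per state, several states per region
toy <- tibble(
  region     = c("South", "South", "South", "South", "West", "West"),
  state      = c("TX", "TX", "FL", "GA", "CA", "WA"),
  population = c(10, 20, 50, 4, 40, 8)
)

# First summarize(): one row per (region, state); the state level is dropped
state_totals <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population), .groups = "drop_last")

# Still grouped by region only
print(group_vars(state_totals))

# Second summarize(): one row per region
region_stats <- state_totals %>%
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))

print(region_stats)
```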

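Using top_n()

top_n() is applied after grouping and keeps, within each group, the rows that rank in the top n for a given variable; n can be any number you need.
In the example below the data is grouped by region and the row with the largest walk value in each region is kept, so n = 1.

# Start from counties: metro and walk are not in counties_selected as defined earlier
counties %>%
  select(region, state, county, metro, population, walk) %>%
  group_by(region) %>%
  top_n(1, walk)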
# A tibble: 4 x 6
# Groups:   region [4]
  region        state        county                 metro    population  walk
  <chr>         <chr>        <chr>                  <chr>         <dbl> <dbl>
1 West          Alaska       Aleutians East Borough Nonmetro       3304  71.2
2 Northeast     New York     New York               Metro       1629507  20.7
3 North Central North Dakota McIntosh               Nonmetro       2759  17.5
4 South         Virginia     Lexington city         Nonmetro       7071  31.7

In the same way, to see the top two rows in each group, set n = 2.
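
As an illustration of n = 2, the sketch below runs top_n() on a hypothetical tibble with invented walk values. It also shows slice_max(), which dplyr 1.0.0 and later offer as the replacement for the superseded top_n(); that part assumes a dplyr version of at least 1.0.0.

```r
library(dplyr)

# Hypothetical toy data: three counties per region
toy <- tibble(
  region = c("South", "South", "South", "West", "West", "West"),
  county = c("a", "b", "c", "d", "e", "f"),
  walk   = c(3.1, 5.0, 7.4, 12.2, 1.9, 8.8)
)

# Keep the two rows with the largest walk in each region;
# top_n() keeps the rows in their original order
top2 <- toy %>%
  group_by(region) %>%
  top_n(2, walk)

# The same selection with slice_max(), which returns the rows
# sorted by walk (largest first) within each region
top2_sorted <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 2)

print(top2)
print(top2_sorted)
```

Both calls keep counties b, c, d and f; only the row order differs.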
