[R] "R for Data Science" Study Notes: Data Exploration

2018-10-26  沈梦圆1993

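The goal of this part is to get comfortable with the tools of data exploration quickly: look at the data, form hypotheses, test them fast, then repeat, repeat, repeat. The more hypotheses you generate, the deeper the exploration goes.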

[Figure: data-science-explore]

Data visualisation

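R has several plotting systems. These notes use ggplot2 for data visualisation because it is one of the most elegant and versatile. To learn the theory behind it, read "The Layered Grammar of Graphics".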

Question

# install.packages("tidyverse")
library(tidyverse)
# -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
# √ ggplot2 2.2.1     √ purrr   0.2.4
# √ tibble  1.3.4     √ dplyr   0.7.4
# √ tidyr   0.7.2     √ stringr 1.2.0
# √ readr   1.1.1     √ forcats 0.2.0
# -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
# x dplyr::filter() masks stats::filter()
# x dplyr::lag()    masks stats::lag()

tidyverse lets us install and load a whole set of data-processing packages in one step, which is quick and convenient.
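To see which packages the tidyverse bundles, the following minimal sketch may help. It assumes the tidyverse package is installed; `core_pkgs` is just an illustrative name for the eight packages listed in the startup message above.

```r
library(tidyverse)

# All packages that belong to the tidyverse (they are installed together).
all_pkgs <- tidyverse_packages()

# The core packages shown in the startup message when tidyverse is attached.
core_pkgs <- c("ggplot2", "tibble", "tidyr", "readr",
               "purrr", "dplyr", "stringr", "forcats")

# Every core package should be part of the full set.
all(core_pkgs %in% all_pkgs)
```

Only the core packages are attached by `library(tidyverse)`; the others are installed but need their own `library()` call.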

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

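This is a scatter plot of engine displacement (displ, in litres) against highway fuel efficiency (hwy, in miles per gallon). The relationship is negative: the larger the engine, the fewer miles per gallon, that is, the more fuel the car uses.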
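As a quick numeric check of what the scatter plot shows, one can look at the sign of the correlation between the two variables. This is only a rough sketch: a single correlation coefficient summarises a linear trend and says nothing about outliers or curvature.

```r
library(tidyverse)

# mpg ships with ggplot2: displ is engine displacement in litres,
# hwy is highway fuel efficiency in miles per gallon.
r <- cor(mpg$displ, mpg$hwy)

# A negative value means larger engines tend to get fewer miles per gallon.
r < 0
```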

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

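The code above is the reusable plotting template: replace the parts in angle brackets with a dataset, a geom function, and a set of mappings.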

Aesthetic

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

facet_wrap() generally makes better use of screen space than facet_grid() because most displays are roughly rectangular.

In facet_grid(class ~ .), a variable on the left of ~ facets by rows; placed on the right, as in facet_grid(. ~ class), it facets by columns.
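The two layouts can be compared side by side with a small sketch like the one below. The object names `base`, `by_row`, and `by_col` are illustrative only.

```r
library(tidyverse)

# Shared base plot: engine displacement against highway mileage.
base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Variable on the left of ~ : one row of panels per class.
by_row <- base + facet_grid(class ~ .)

# Variable on the right of ~ : one column of panels per class.
by_col <- base + facet_grid(. ~ class)

print(by_row)
print(by_col)
```

The `.` is a placeholder meaning "no faceting in this dimension".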

Geometric objects

Stat

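# group = 1 treats all rows as one group, so each bar is that cut's share of all diamonds
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
# Without group = 1, prop is computed within each cut, so every bar has height 1
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))
# fill = color has the same problem: each cut-by-color segment has prop 1
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))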
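To make clear what the `group = 1` version plots, here is a sketch that computes the same proportions by hand and draws them with `geom_col()`. The name `cut_props` is illustrative, and this is a manual equivalent rather than what ggplot2 does internally.

```r
library(tidyverse)

# Proportion of diamonds in each cut, computed over the whole dataset.
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))

# geom_col() uses the precomputed values directly as bar heights.
p <- ggplot(data = cut_props) +
  geom_col(mapping = aes(x = cut, y = prop))

print(p)
```

Because the denominator is the total number of diamonds, the five bars sum to 1, which is the behaviour `group = 1` gives in `geom_bar()`.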

Position

Coordinate systems

Summary

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
[Figures: visualization-grammar-1, visualization-grammar-2, visualization-grammar-3]
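As one possible way to fill in every slot of the template above, the sketch below uses the diamonds dataset. The particular choices (bar chart, dodged position, flipped coordinates, facets by color) are illustrative assumptions, not the only sensible combination.

```r
library(tidyverse)

p <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    stat = "count",       # the default stat for geom_bar()
    position = "dodge"    # place bars side by side instead of stacking
  ) +
  coord_flip() +          # swap the x and y axes
  facet_wrap(~ color)     # one panel per diamond colour grade

print(p)
```

In practice most slots can be omitted, because every geom has a default stat and position and the default coordinate system is Cartesian.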

Workflow: basics

Data transformation

library(nycflights13)
library(tidyverse)

dplyr basics

logical
[Figure: x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.]
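# Flights in November or December: | (or) and %in% select the same rows
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
# Three equivalent ways to keep flights from 2013 or 2014
filter(flights, year %in% c(2013, 2014))
filter(flights, between(year, 2013, 2014))
filter(flights, year == 2013 | year == 2014)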
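The claim that these spellings select the same rows can be checked directly. The sketch below assumes the nycflights13 package is installed; the names `with_or`, `with_in`, and `with_between` are illustrative.

```r
library(nycflights13)
library(tidyverse)

# Three ways to select flights that departed in November or December.
with_or      <- filter(flights, month == 11 | month == 12)
with_in      <- filter(flights, month %in% c(11, 12))
with_between <- filter(flights, between(month, 11, 12))

# Each comparison checks that two spellings return the same rows.
identical(with_or, with_in)
identical(with_or, with_between)
```

`%in%` tends to scale better than a chain of `|` when the list of values grows, while `between()` only fits a contiguous range.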

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

The step-by-step code above is equivalent to the following pipe:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
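The pipe works because `x %>% f(y)` is evaluated as `f(x, y)`. A small sketch of that rule, assuming nycflights13 is installed (the object names are illustrative):

```r
library(nycflights13)
library(tidyverse)

# One step: the left-hand side becomes the first argument of filter().
one_step_piped  <- flights %>% filter(month == 1)
one_step_nested <- filter(flights, month == 1)
identical(one_step_piped, one_step_nested)

# Several steps: the pipe reads top to bottom,
# the nested calls read inside out.
piped <- flights %>%
  group_by(dest) %>%
  summarise(count = n()) %>%
  filter(count > 20)

nested <- filter(summarise(group_by(flights, dest), count = n()), count > 20)

identical(piped, nested)
```

The piped form avoids both the intermediate objects of the step-by-step version and the inside-out reading order of the nested version.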

# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess'

If you really need to get in touch, email me at mengyuanshen@126.com.
You can also follow my WeChat official account: 沈梦圆 (PandaBiotrainee).
