Reproducing a Kaggle case: the worldwide distribution of nuclear power plants

2019-05-04, by 小明的数据分析笔记本

Original notebook:
https://www.kaggle.com/jonathanbouchet/nuclear-power-plant-geo-data
Nuclear Power Plant Locations data

R packages new to me

skimr : skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors.
Put simply, skim() is an upgraded version of summary().
Run help(package="skimr") to see the small examples provided in the help documentation.

>summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

>fivenum(iris$Sepal.Length)
[1] 4.3 5.1 5.8 6.4 7.9

>skim(iris)
Skim summary statistics
 n obs: 150 
 n variables: 5 

-- Variable type:factor --------------------------------------------------------
 variable missing complete   n n_unique                       top_counts ordered
  Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0   FALSE

-- Variable type:numeric -------------------------------------------------------
     variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
 Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
  Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂
 Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9 ▂▇▅▇▆▅▂▂
  Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4 ▁▂▅▇▃▂▁▁
>

lubridate: Functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time. Put simply, it provides functions for handling date and time formats.

> ymd("20110604")
[1] "2011-06-04"
> mdy("06-04-2011")
[1] "2011-06-04"
> dmy("04/06/2011")
[1] "2011-06-04"
>
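The description above also mentions extracting and updating components of a date. A minimal sketch of that, reusing the date parsed in the example (the variable names `d`, `parts` and `d2` are my own, chosen for illustration):

```r
library(lubridate)

# Parse the same date used in the examples above
d <- ymd("20110604")

# Extract individual components with the accessor functions
parts <- c(year = year(d), month = month(d), day = day(d))
print(parts)

# Update one component on a copy; the original `d` is left unchanged
d2 <- d
year(d2) <- 2019
print(d2)
```

The accessors year(), month() and day() read a component, and the same names used on the left of an assignment replace it.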

viridis: color palettes. Use the color scales in this package to make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale.

library(ggplot2)  # scale_color_viridis_d() ships with ggplot2 (3.0.0 and later)
ggplot(mtcars,aes(wt,mpg))+
  geom_point(size=4,aes(colour=factor(cyl)))+
  scale_color_viridis_d()+theme_bw()

Rplot.png

broom: Convert Statistical Analysis Objects into Tidy Tibbles. It turns the results of statistical computations into data frames.

> lmfit<-lm(mpg~wt,mtcars)
> lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  

> summary(lmfit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

> broom::tidy(lmfit)
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10
> broom::glance(lmfit)
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
*     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl>
1     0.753         0.745  3.05      91.4 1.29e-10     2  -80.0  166.
# ... with 3 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>
> broom::augment(lmfit)
# A tibble: 32 x 10
   .rownames   mpg    wt .fitted .se.fit .resid   .hat .sigma .cooksd
 * <chr>     <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 Mazda RX4  21    2.62    23.3   0.634 -2.28  0.0433   3.07 1.33e-2
 2 Mazda RX~  21    2.88    21.9   0.571 -0.920 0.0352   3.09 1.72e-3
 3 Datsun 7~  22.8  2.32    24.9   0.736 -2.09  0.0584   3.07 1.54e-2
 4 Hornet 4~  21.4  3.22    20.1   0.538  1.30  0.0313   3.09 3.02e-3
 5 Hornet S~  18.7  3.44    18.9   0.553 -0.200 0.0329   3.10 7.60e-5
 6 Valiant    18.1  3.46    18.8   0.555 -0.693 0.0332   3.10 9.21e-4
 7 Duster 3~  14.3  3.57    18.2   0.573 -3.91  0.0354   3.01 3.13e-2
 8 Merc 240D  24.4  3.19    20.2   0.539  4.16  0.0313   3.00 3.11e-2
 9 Merc 230   22.8  3.15    20.5   0.540  2.35  0.0314   3.07 9.96e-3
10 Merc 280   19.2  3.44    18.9   0.553  0.300 0.0329   3.10 1.71e-4
# ... with 22 more rows, and 1 more variable: .std.resid <dbl>

Functions new to me

left_join: put simply, it merges two data frames on the columns they share.
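A minimal sketch of how left_join behaves. The two tables and their column names (`plants`, `regions`, `country`, `n_plants`, `continent`) are made-up placeholders for illustration, not the Kaggle data:

```r
library(dplyr)

# Hypothetical toy tables, invented purely for illustration
plants <- data.frame(
  country = c("A", "B", "C"),
  n_plants = c(3, 1, 2),
  stringsAsFactors = FALSE
)
regions <- data.frame(
  country = c("A", "B", "D"),
  continent = c("X", "Y", "Z"),
  stringsAsFactors = FALSE
)

# Keep every row of `plants`; attach `continent` where `country` matches
joined <- left_join(plants, regions, by = "country")
print(joined)
```

Rows from the left table are always kept: country "C" has no match and gets NA in `continent`, while country "D" exists only in the right table and is dropped.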

Calling dplyr::rename raised the error: Error: `petal_length` = Petal.Length must be a symbol or a string, not a formula. Searching for the message turned up a workaround: https://stackoverflow.com/questions/47755534/dplyr-rename-error-new-name-old-name-must-be-a-symbol-or-a-string-not-fo
After I switched from R-3.4.2 to R-3.5.1, the error no longer appeared.
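For reference, a minimal sketch of the call that the error message refers to, as it would look on a working installation. It uses the built-in iris data, and `iris_renamed` is a name I picked for illustration:

```r
library(dplyr)

# rename() takes new_name = old_name pairs
iris_renamed <- rename(iris, petal_length = Petal.Length)
print(names(iris_renamed))

# Checking the installed version can help when chasing errors like the one above
print(packageVersion("dplyr"))
```

Only the named column changes; the other columns and all rows stay as they were.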

fortify(): I have not yet worked out what this function does. Its help page says it may be dropped: fortify may be deprecated in the future. I now recommend using the broom package
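One thing that can be checked directly: as far as I can tell, when fortify() is given a plain data frame it hands the same data back. Since map_data() already returns a data frame, this suggests that wrapping it in fortify() changes nothing. A small sketch under that assumption, using a hand-made polygon table (`square` is a made-up example, not map data):

```r
library(ggplot2)

# A small hand-made polygon table, used only for illustration
square <- data.frame(
  long = c(0, 1, 1, 0),
  lat = c(0, 0, 1, 1),
  group = 1,
  region = "demo",
  stringsAsFactors = FALSE
)

# For a data frame input, fortify() is expected to return the input as is
same <- fortify(square)
print(identical(same, square))
```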

Reproducing the two maps from the original notebook

Drawing a map with ggplot2

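library(rworldmap)
library(ggplot2)   # map_data() also needs the maps package installed
library(ggthemes)  # provides theme_fivethirtyeight()
worldMap <- fortify(map_data("world"), region = "region")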
ggplot() + 
  geom_map(data = worldMap, 
           map = worldMap,aes(x = long, y = lat,
                              map_id = region, 
                              group = group),
           fill = "white", color = "black", size = 0.1) + 
  theme_fivethirtyeight(10)

Rplot01.png

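Worldwide distribution of nuclear power plants
I am skipping the data-wrangling part for now and will come back to it in detail when I have time.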

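library(ggplot2)
library(rworldmap)
library(ggthemes)  # provides theme_fivethirtyeight()
# `res` is the merged data frame built in the original notebook's data-wrangling step (skipped here)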
ggplot(res) + 
  geom_polygon(aes(x=long, y=lat,group=group,fill=totMWe),
               color='white', size=.1) + 
  theme_fivethirtyeight() + 
  theme(panel.grid.major = element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank()) + 
  scale_fill_gradientn(name="",
                       colors = rev(viridis::viridis(50))) + 
  guides(fill = guide_colorbar(barwidth = 20, barheight = .5)) + 
  labs(title="Nuclear power plant landscape in 2019", 
       subtitle='energy produced(MWe) by nuclear source from active powerplant')

Rplot02.png

Conclusions that can be drawn from the map above:
Top 3 producers: United States, France, China
North Korea: no production
