Data Analysis for the Life Sciences

[R] Statistics and R - Week 2

2020-04-10  半为花间酒

Week 2: Random Variables and Central Limit Theorem

Part 2: Central Limit Theorem - t-tests


Reference book: Data Analysis for the Life Sciences

Reference videos:

  1. Data Analysis for the Life Sciences Series - rafalib
  2. Professional Certificate in Data Analysis for Life Sciences (Harvard University) - edX

Central Limit Theorem

The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \bar{Y} of a random sample follows a normal distribution centered at the population average \mu_Y and with standard deviation equal to the population standard deviation \sigma_Y divided by the square root of the sample size N. We refer to the standard deviation of the distribution of a random variable as the random variable's standard error.

Mathematically:

  1. If X is a random variable with mean \mu and a is a constant, the mean of X - a is \mu-a.
  2. If X is a random variable with mean \mu and SD \sigma, and a is a constant, then
    the mean and SD of aX are a \mu and \mid a \mid \sigma respectively.
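These two properties can be checked numerically with simulated data (the values of \mu, \sigma, and a below are arbitrary choices for illustration, not from the book):

```r
# Numerical check of the two properties above
set.seed(1)
x <- rnorm(100000, mean = 5, sd = 2)  # mu = 5, sigma = 2
a <- 3
mean(x - a)   # close to mu - a = 2
mean(a * x)   # close to a * mu = 15
sd(a * x)     # close to |a| * sigma = 6
```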

If we take many samples of size N, then the quantity: \frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}}
is approximated by a normal distribution centered at 0 and with standard deviation 1.
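A quick simulation makes this concrete: standardized sample means are approximately standard normal (the population parameters below are arbitrary):

```r
# Simulate the CLT: standardized sample means are approximately N(0, 1)
set.seed(1)
mu <- 23; sigma <- 3; N <- 50
zs <- replicate(10000, {
  y <- rnorm(N, mu, sigma)            # one sample of size N
  (mean(y) - mu) / (sigma / sqrt(N))  # standardized sample mean
})
mean(zs); sd(zs)  # approximately 0 and 1
qqnorm(zs); abline(0, 1)
```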

The sample variances (whose square roots are the sample standard deviations) are: s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2

So we can redefine our ratio as: \sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}}

if M = N, or in general: \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}}

The null distribution is approximated by a normal distribution with mean 0 and variance 1.
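As a sketch, the general ratio above can be computed for two simulated samples of unequal size (the sample sizes and population parameters below are illustrative, not from the book):

```r
# Two-sample t-statistic for samples of unequal sizes M and N
set.seed(1)
X <- rnorm(12, mean = 24, sd = 3)   # M = 12
Y <- rnorm(25, mean = 24, sd = 3)   # N = 25; same population mean, so the null holds
tstat <- (mean(Y) - mean(X)) / sqrt(var(X) / length(X) + var(Y) / length(Y))
tstat  # under the null, approximately standard normal for large M and N
```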

The t-distribution

When the original population from which a random variable, say Y, is sampled is normally distributed with mean 0, then we can calculate the distribution of:

\sqrt{N} \frac{\bar{Y}}{s_Y}

William Sealy Gosset, an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym "Student". The distribution is therefore called Student's t-distribution.
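The claim can be checked by simulation: for normal data with mean 0, the ratio follows a t-distribution with N - 1 degrees of freedom even when N is small (the sample size below is an arbitrary choice):

```r
# sqrt(N) * mean(Y) / sd(Y) follows a t-distribution with N - 1 df
set.seed(1)
N <- 5
ts <- replicate(10000, {
  y <- rnorm(N)                   # normal population with mean 0
  sqrt(N) * mean(y) / sd(y)
})
qqplot(qt(ppoints(10000), df = N - 1), ts,
       xlab = "Theoretical t quantiles", ylab = "Simulated t-statistics")
abline(0, 1)
```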

dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)

library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist

library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)

We can use qq-plots to confirm that the distributions are relatively close to being normally distributed. A qq-plot compares data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, the data is close to the theoretical distribution.

mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)

Exercises

# data source
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
filename <- basename(url)
download(url, destfile=filename)
dat <- na.omit( read.csv(filename) )

> 1
pnorm(1) - pnorm(-1)
# [1] 0.6826895

> 2
pnorm(2) - pnorm(-2)
# [1] 0.9544997

> 3
pnorm(3) - pnorm(-3)
# [1] 0.9973002

> 4
library(rafalib)
y <- filter(dat,Sex == "M" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist
z <- (y - mean(y)) / popsd(y)
mean(abs(z) < 1)
# [1] 0.6950673

> 5
mean(abs(z) < 2)
# [1] 0.9461883

> 6
mean(abs(z) < 3)
# [1] 0.9910314

> 7
mypar()
qqnorm(z)
abline(0,1)
# C) The mouse weights are well approximated by the normal distribution, although the larger values (right tail) are larger than predicted by the normal. This is consistent with the differences seen between question 3 and 6

> 8
mypar(2,2)
y <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
# B) This just happens to be how nature behaves. Perhaps the result of many biological factors averaging out.

> 9
y <- filter(dat, Sex=="M" & Diet=="chow") %>% 
  select(Bodyweight) %>% 
  unlist
set.seed(1)
avgs <- replicate(10000, mean( sample(y, 25)))
mypar(1,2)
hist(avgs)
qqnorm(avgs)
qqline(avgs)

mean(avgs)
# [1] 30.96856
> 10
# avgs is itself a sample, so use sd rather than popsd
sd(avgs)
# [1] 0.8271233

> 11
sd(avgs)
# [1] 0.8271233
popsd(y)/sqrt(25)
# [1] 0.8841001
# D) popsd(y)/sqrt(25)

> 12
y <- filter(dat, Sex=="M" & Diet=="chow") %>% 
  select(Bodyweight) %>% 
  unlist
set.seed(1)
sds <- replicate(10000, sd(sample(y, 25)))

# standard deviations are always positive, so abs is not needed
mean(sds < 3.5)
# [1] 0.0942

> 13
x <- seq(0.0001, 0.9999, len = 300)
var(qt(x, df = 3))
# [1] 5.512038
var(qt(x, df = 10))
# [1] 1.393565
var(qt(x, df = 30))
# [1] 1.145476
var(qt(x, df = 100))
# [1] 1.080532

var(qnorm(x))
# [1] 1.055202
# C) The t-distribution has larger tails up until 30 degrees of freedom, at which point it is practically the same as the normal distribution.