[R] Statistics and R - Week 2
Week 2: Random Variables and Central Limit Theorem
Part 2: Central Limit Theorem - t-tests
Reference book: Data Analysis for the Life Sciences
Reference videos:
Central Limit Theorem
The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size $N$ is large, the average $\bar{Y}$ of a random sample follows a normal distribution centered at the population average $\mu_Y$ and with standard deviation equal to the population standard deviation $\sigma_Y$ divided by the square root of the sample size $N$. We refer to the standard deviation of the distribution of a random variable as the random variable's standard error.
Mathematically:
- If $X$ is a random variable with mean $\mu_X$ and $a$ is a constant, the mean of $aX$ is $a\mu_X$.
- If $X$ is a random variable with mean $\mu_X$ and SD $\sigma_X$, and $a$ is a constant, then the mean and SD of $aX$ are $a\mu_X$ and $|a|\sigma_X$ respectively.
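These two scaling rules are easy to verify numerically. Below is a minimal sketch in R; the vector `x` and the constant `a` are arbitrary illustrations, not from the course data:

```r
# Sketch: check that mean(a*X) = a*mean(X) and sd(a*X) = |a|*sd(X)
set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)  # arbitrary sample standing in for X
a <- -3                              # arbitrary constant

mean(a * x) - a * mean(x)   # essentially 0
sd(a * x) - abs(a) * sd(x)  # essentially 0
```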
If we take many samples of size $N$, then the quantity:

$$\frac{\bar{Y} - \mu}{\sigma / \sqrt{N}}$$

is approximated with a normal distribution centered at 0 and with standard deviation 1.
The sample standard deviations of two samples $X_1,\dots,X_M$ and $Y_1,\dots,Y_N$ are defined by:

$$s_X^2 = \frac{1}{M-1}\sum_{i=1}^{M} (X_i - \bar{X})^2 \qquad s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

So we can redefine our ratio as:

$$\sqrt{N}\,\frac{\bar{Y} - \bar{X}}{\sqrt{s_X^2 + s_Y^2}}$$

if $M = N$, or in general:

$$\frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_X^2}{M} + \dfrac{s_Y^2}{N}}}$$

Under the null hypothesis, this ratio is approximated by a normal distribution with mean 0 and variance 1.
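The claim that this ratio is approximately standard normal can be checked with a small simulation. This sketch uses an artificial skewed population (the gamma distribution is just an illustrative assumption) to show that the approximation holds even when the population itself is not normal:

```r
# Sketch: the CLT ratio (mean(y) - mu) / (sigma/sqrt(N)) is ~ N(0, 1)
set.seed(1)
population <- rgamma(100000, shape = 2, rate = 1)  # skewed, non-normal population
mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))  # population SD

N <- 50
ratios <- replicate(10000, {
  y <- sample(population, N)
  (mean(y) - mu) / (sigma / sqrt(N))
})

mean(ratios)  # close to 0
sd(ratios)    # close to 1
```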
The t-distribution
When the sample size is too small for the CLT to apply, but the original population from which a random variable, say $Y$, is sampled is normally distributed with mean 0, then we can calculate the distribution of:

$$\sqrt{N}\,\frac{\bar{Y}}{s_Y}$$

William Sealy Gosset, an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym "Student". The distribution is therefore called Student's t-distribution.
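A quick, self-contained simulation (the sample size and tail cutoff below are arbitrary choices for illustration) shows why the t-distribution, rather than the normal, is needed for small samples:

```r
# Sketch: for small N, sqrt(N)*mean(y)/sd(y) has heavier tails than N(0, 1)
set.seed(1)
N <- 5  # small sample, where the t correction matters
tstats <- replicate(10000, {
  y <- rnorm(N)  # population is normal with mean 0
  sqrt(N) * mean(y) / sd(y)
})

mean(abs(tstats) > 2)        # simulated two-sided tail probability
2 * (1 - pt(2, df = N - 1))  # t-distribution prediction: close to the simulation
2 * (1 - pnorm(2))           # normal prediction: noticeably smaller
```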
dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)
We can use qq-plots to confirm that the distributions are relatively close to normally distributed. A qq-plot compares the data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, the data is close to the theoretical distribution.
mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)
- Exercises
# Data source
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
filename <- basename(url)
download(url, destfile=filename)
dat <- na.omit( read.csv(filename) )
> 1
pnorm(1) - pnorm(-1)
# [1] 0.6826895
> 2
pnorm(2) - pnorm(-2)
# [1] 0.9544997
> 3
pnorm(3) - pnorm(-3)
# [1] 0.9973002
> 4
library(rafalib)
y <- filter(dat,Sex == "M" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
z <- (y - mean(y)) / popsd(y)
mean(abs(z) < 1)
# [1] 0.6950673
> 5
mean(abs(z) < 2)
# [1] 0.9461883
> 6
mean(abs(z) < 3)
# [1] 0.9910314
> 7
mypar()
qqnorm(z)
abline(0,1)
# C) The mouse weights are well approximated by the normal distribution, although the largest values (right tail) are larger than predicted by the normal. This is consistent with the differences seen between questions 3 and 6.
> 8
mypar(2,2)
y <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
# B) This just happens to be how nature behaves. Perhaps the result of many biological factors averaging out.
> 9
y <- filter(dat, Sex=="M" & Diet=="chow") %>%
select(Bodyweight) %>%
unlist
set.seed(1)
avgs <- replicate(10000, mean( sample(y, 25)))
mypar(1,2)
hist(avgs)
qqnorm(avgs)
qqline(avgs)
mean(avgs)
# [1] 30.96856
> 10
# since avgs comes from repeated sampling, use sd rather than popsd
sd(avgs)
# [1] 0.8271233
> 11
sd(avgs)
# [1] 0.8271233
popsd(y)/sqrt(25)
# [1] 0.8841001
# D) popsd(y)/sqrt(25)
> 12
y <- filter(dat, Sex=="M" & Diet=="chow") %>%
select(Bodyweight) %>%
unlist
set.seed(1)
sds <- replicate(10000, sd(sample(y, 25)))
# standard deviations are always positive, so abs is not needed
mean(sds < 3.5)
# [1] 0.0942
> 13
x <- seq(0.0001, 0.9999, len = 300)
var(qt(x, df = 3))
# [1] 5.512038
var(qt(x, df = 10))
# [1] 1.393565
var(qt(x, df = 30))
# [1] 1.145476
var(qt(x, df = 100))
# [1] 1.080532
var(qnorm(x))
# [1] 1.055202
# C) The t-distribution has larger tails up until 30 degrees of freedom, at which point it is practically the same as the normal distribution.