Learning data analysis with Genes|Genomes|Genetics: R

2022-07-03 小明的数据分析笔记本

Paper

Sex-Specific Co-expression Networks and Sex-Biased Gene Expression in the Salmonid Brook Charr Salvelinus fontinalis

Public data and code

https://github.com/bensutherland/sfon_wgcna

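The repository also includes the WGCNA code. The paper describes its methods and results in considerable detail, so the WGCNA code can be studied side by side with the paper.

Today's post starts with the differential expression analysis code.

The raw count file provided with the paper covers more than 100 samples, which is a fairly large dataset. Here I use only 20 of them.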

Read the expression matrix

library(readr)
# Raw count table: one row per transcript, one column per sample
my.counts <- read_csv("data/20220623/edgeR_counts.csv")
head(my.counts)
dim(my.counts)

Round the counts to integers

library(tidyverse)
# Move the transcript IDs into row names, then round every count
my.counts.round <- my.counts %>%
  column_to_rownames("transcript.id") %>%
  round()
dim(my.counts.round)
head(my.counts.round)

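Filter the data

I was not entirely clear on the filtering criterion at first. As far as the code goes, it converts a minimum of 10 reads into a CPM threshold using the smallest library, then keeps only transcripts that exceed that CPM in at least 5 samples.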

library(edgeR)
edger.counts <- DGEList(counts = my.counts.round)
# CPM equivalent of 10 reads in the smallest library
min.reads.mapping.per.transcript <- 10
cpm.filt <- min.reads.mapping.per.transcript / min(edger.counts$samples$lib.size) * 1000000
cpm.filt
# Minimum number of samples that must pass the CPM threshold
min.ind <- 5

keep <- rowSums(cpm(edger.counts)>cpm.filt) >= min.ind
table(keep)
# keep.lib.sizes=FALSE recomputes the library sizes after filtering
filtered.counts <- edger.counts[keep, , keep.lib.sizes=FALSE]
filtered.counts %>% class()
dim(filtered.counts)
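
To make the filtering rule concrete, here is a minimal base-R sketch on made-up numbers. The matrix `toy`, its library sizes, and the cutoff of 2 samples are all invented for illustration, and CPM is computed by hand rather than with `cpm()`, so normalization factors are ignored. The threshold is the CPM that 10 reads would represent in the smallest library, and a transcript is kept when it exceeds that CPM in enough samples.

```r
# Toy count matrix: 3 transcripts x 4 samples (hypothetical values)
toy <- matrix(c( 0,  5,  0,  2,
                50, 80, 60, 90,
                30,  0, 40,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("t1", "t2", "t3"), paste0("s", 1:4)))

# Hypothetical library sizes (total reads per sample)
lib.size <- c(s1 = 1e6, s2 = 2e6, s3 = 1.5e6, s4 = 4e6)

# CPM by hand: divide each column by its library size, scale to one million
toy.cpm <- sweep(toy, 2, lib.size, "/") * 1e6

# 10 reads in the smallest library, expressed as CPM
toy.cpm.filt <- 10 / min(lib.size) * 1e6

# Keep transcripts above the threshold in at least 2 samples
toy.min.ind <- 2
toy.keep <- rowSums(toy.cpm > toy.cpm.filt) >= toy.min.ind
toy.keep
```

In this toy case `t1` never reaches the threshold and is dropped, while `t2` (all four samples) and `t3` (two samples) are kept.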

# TMM normalization: adds a norm.factors column to the sample table
filtered.counts <- calcNormFactors(filtered.counts, method = "TMM")
filtered.counts$samples
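
As a small aside on the `norm.factors` column in the sample table: edgeR does not rescale the counts themselves. The TMM factors rescale the library sizes, and the product `lib.size * norm.factors` is the effective library size used downstream. A toy illustration with invented numbers (the data frame `toy.samples` is hypothetical and only mimics the columns of `filtered.counts$samples`):

```r
# Hypothetical sample table with the two columns of interest
toy.samples <- data.frame(
  lib.size     = c(1e6, 2e6, 4e6),
  norm.factors = c(0.8, 1.0, 1.25),
  row.names    = c("s1", "s2", "s3")
)

# Effective library size = raw library size x normalization factor
toy.samples$eff.lib.size <- toy.samples$lib.size * toy.samples$norm.factors
toy.samples

# The invented factors were chosen to multiply to 1, as TMM factors do
prod(toy.samples$norm.factors)
```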

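Combine the data with the sample information

new.group.info <- read_csv("data/20220623/edgeR_group_info.csv")

# Should return TRUE: the metadata rows must be in the same order as the samples
identical(filtered.counts$samples %>% rownames(),
          new.group.info$file.name)
new.group.info$sex <- factor(new.group.info$sex,
                             levels = c("F","M"))
levels(new.group.info$sex)
# The DGEList was created without groups, so samples$group has a single level
# and model.matrix() would fail; store sex as the grouping factor first
filtered.counts$samples$group <- new.group.info$sex
design <- model.matrix(~filtered.counts$samples$group)
design
colnames(design)[2] <- "sex"

# Estimate dispersions with the design so sex differences are not treated as noise
filtered.counts <- estimateDisp(filtered.counts, design = design)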
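
To see what the design matrix in this step encodes, here is a minimal base-R sketch with invented labels (`toy.sex` and `toy.design` are hypothetical names, not objects from the paper's code). With `F` as the first factor level, the second column is 0 for females and 1 for males, so the coefficient tested later is the male-versus-female difference and a positive logFC means higher expression in males.

```r
# Hypothetical sex labels for six samples, F as the reference level
toy.sex <- factor(c("F", "M", "F", "M", "M", "F"), levels = c("F", "M"))

# Intercept plus one indicator column for the non-reference level
toy.design <- model.matrix(~ toy.sex)
colnames(toy.design)        # "(Intercept)" "toy.sexM"

# Rename the indicator column as in the analysis
colnames(toy.design)[2] <- "sex"
toy.design[, "sex"]         # 0 for F (reference), 1 for M
```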

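Differential expression analysis

fit <- glmFit(y = filtered.counts, design = design)
# Likelihood ratio test on the second coefficient (sex, M vs F)
lrt <- glmLRT(fit, coef = 2)

# n = Inf returns every transcript, sorted by p-value
result <- topTags(lrt, n = Inf)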

Volcano plot

# Label each transcript by raw p-value and fold change
result$table %>% 
  mutate(change = case_when(
    PValue < 0.05 & logFC > 2 ~ "UP",
    PValue < 0.05 & logFC < -2 ~ "DOWN",
    TRUE ~ "NOT"
  )) -> DEG

table(DEG$change)
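
The labels above are based on the raw `PValue`. `topTags()` also returns an `FDR` column (Benjamini-Hochberg adjusted by default), which is a common alternative when many transcripts are tested. A minimal sketch on an invented table follows; the data frame `toy.res` is hypothetical, and with real output the existing `FDR` column could be used directly instead of calling `p.adjust()`.

```r
# Hypothetical results table with the two columns used for labelling
toy.res <- data.frame(
  logFC     = c(3.1, -2.6, 0.4, 2.2, -0.1),
  PValue    = c(0.0001, 0.004, 0.5, 0.04, 0.9),
  row.names = paste0("t", 1:5)
)

# Benjamini-Hochberg adjustment of the raw p-values
toy.res$FDR <- p.adjust(toy.res$PValue, method = "BH")

# Same fold-change cutoff as above, but with FDR instead of PValue
toy.res$change <- ifelse(toy.res$FDR < 0.05 & toy.res$logFC >  2, "UP",
                  ifelse(toy.res$FDR < 0.05 & toy.res$logFC < -2, "DOWN", "NOT"))
table(toy.res$change)
```

In this toy table `t4` passes the raw cutoff (p = 0.04) but not the adjusted one, so it is labelled `NOT`.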

library(ggplot2)
ggplot(data=DEG,aes(x=logFC,
                   y=-log10(PValue),
                   color=change))+
  geom_point(alpha=0.8,size=3)+
  labs(x="log2 fold change", y="-log10 p-value")+
  #ggtitle(this_title)+
  theme_bw(base_size = 20)+
  #theme(plot.title = element_text(size=15,hjust=0.5),)+
  # Named colours so each label keeps its colour even if one class is empty
  scale_color_manual(values=c(DOWN='#a121f0',NOT='#bebebe',UP='#ffad21')) -> p1 

# Same plot with dashed lines at the fold-change cutoffs
p1 +
  geom_vline(xintercept = 2,lty="dashed")+
  geom_vline(xintercept = -2,lty="dashed") -> p2

library(patchwork)
pdf(file = "edger_deg.pdf",
    width = 9.4,height = 4,family = "serif")
# print() so the figure is written even when the script is run with source()
print(p1+p2+
  plot_layout(guides = "collect"))
dev.off()


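The example data and code are available by sending 20220625 as a message to my official account (公众号).

You are welcome to follow my official account:

小明的数据分析笔记本

The 小明的数据分析笔记本 official account mainly shares: 1. simple examples of data analysis and data visualization in R and Python; 2. reading notes on transcriptomics, genomics, and population genetics literature for horticultural plants; 3. introductory bioinformatics learning materials and my own study notes.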
