生物信息学

RNA-seq小考核----生信技能树

2019-07-29  本文已影响0人  LiuYueRR

生信技能树----RNA-seq小考核。

题目链接地址为:http://www.bio-info-trainee.com/3920.html

教程对应B站:【生信技能树】转录组测序数据分析

链接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
代码参考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq

Q1: 参考基因组及注释文件下载地址

列出人,小鼠,拟南芥的基因组序列,转录组cDNA序列,基因组注释gtf文件下载地址

参考基因组存储网站三个:ENSEMBL、NCBI、UCSC

mkdir rnaseq_test
cd rnaseq_test
#拟南芥
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz
#人
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
#小鼠
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.toplevel.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz

Q2: 找到文章的测序数据

2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing 使用成熟的单细胞转录组( Smart-seq2 )手段探索了癌相关的成纤维细胞 CAFs的功能和空间异质性。
在文献中找到了:GSE111229
然后进图NCBI-GEO数据库中搜索:GSE111229
下载含有SRR编号的SRR_Acc_List.txt文件
下载方法步骤如下图所示:
1563964898(1).png 1563964927(1).png
1563964959(1).png
1563976393(1).png

Q3:下载测序数据

#写一个while循环或者for循环,批量进行测序数据的下载,三种方式均可进行批量操作,实际上取任何一种方法操作即可。
cat SRR_Acc_List.txt|while read id ;do (prefetch  ${id} );done
for i in `less SRR_Acc_List.txt`;do (prefetch  ${id} ) ;done
for i in SRR679{0711..1478};do ( prefetch ${i} );done

Q4: 任意挑选6个样本走标准的RNA-seq上游流程

即 sra → fastq→bam→counts

注意每个步骤的质控细节,注意每个步骤的文件格式转换背后的生物学意义。
代码参考在:code

sra → fastq→bam→counts
#当前文件夹内容
(rna) yliu 22:35:41 ~/rnaseq_test
$ tree
.
├── sra_data
│     ├── SRR6790711.sra
│     ├── SRR6790712.sra
│     ├── SRR6790713.sra
│     ├── SRR6790714.sra
│     ├── SRR6790715.sra
│     └── SRR6790716.sra
└── SRR_Acc_List.txt

1 directory, 7 files

#sra → fastq
##主要应该注意sratoolkit软件的用法(软件参数的使用,文件的输入和输出等等)
cd sra_data/
ls *.sra|while read id;do (fastq-dump --split-3 --gzip  ${id});done
ls

#fastq→bam
##mapping之前需要对raw_data进行质量控制
##raw_fastq→clean_fastq
##quanlity control
mkdir fastqc
ls *.gz|while read id;do (fastqc  -t 2 -o ~/rnaseq_test/fastqc/ ${id});done
cd fastqc
multiqc ./*.zip

##filter data
##对raw_data进行质量控制后,要对其进行过滤和截断,确保mapping时数据的可靠性
cd ../
mkdir clean_read/
for i in `ls *_1.fastq.gz`; do i=${i/_*/}; echo "trim_galore -q 25 \ 
 --phred33 --length 36 --stringency 3 --paired \
 -o ./clean_read/ ${i}_1.fastq.gz ${i}_2.fastq.gz"; done > ./clean_read/command_trim_galore.sh

less -S ./clean_read/command_trim_galore.sh 
cd  ./clean_read/
ls
chmod 755 command_trim_galore.sh
bash command_trim_galore.sh

##mapping
mapping之前需要先构建参考基因组index
这里使用服务器已经构建好的参考基因组索引
hg38的参考基因组索引位置为:
/teach/rna_testdata/database/index/hisat/hg38/genome
mkdir hisat_mapping
ls *_val_1.fq.gz|while read id; do id=${id%%_*}; echo "hisat2 -p 4 -x /teach/rna_testdata/database/index/hisat/hg38/genome\
  -1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz\
  -S /home/yliu/rnaseq_test/sra_data/hisat_mapping/${id}.sam"; done > command_hisat_list.sh
less command_hisat_list.sh
nohup bash command_hisat_list.sh & 

#sam→bam
##sam文件占据内存非常巨大,因此将sam文件合适转化为bam文件格式(二进制格式),缩小文件的所占的空间大小
cd ../hisat_mapping
ls
ls *.sam|while read id;do echo "samtools sort -O bam -@ 4 -o ${id%%.*}.bam ${id} ";done > command_sam_bam.list
less command_sam_bam.list
nohup bash command_sam_bam.list &
top

#bam→counts
##从bam文件中获取每一个基因的counts值
mkdir ../counts
ls *.sam|while read id;do echo "featureCounts -T 5 -p -t exon -g gene_id -a /teach/rna_testdata/database/gtf/gencode.v25.annotation.gtf.gz -o ../counts/${id%%.*}.count.txt ${id}";done > command.counts.list
less command.counts.list
nohup bash command.counts.list &
top

Q5: 理解RNA-seq上游流程得到的表达矩阵的多种形式

在RNA-seq中,我们通常使用分析得到的count值,
以及RPKM、FPKM、TPM等均一化后的基因表达单位,去表示表达矩阵。
·count:
每个基因上比对到的reads的数量
·RPKM(Reads Per Kilobase Million)
每千个碱基的转录每百万映射读取的reads数
·FPKM(Fragments Per Kilobase Million)
每千个碱基的转录每百万映射读取的fragments数
·TPM(Transcripts Per Million)
每百万条reads的转录本

在RNA-seq数据分析中,对基因或者转录本的read counts数目进行标准化(Normalization)是一个非常重要的步骤。
因为在转录组数据分析中,落在一个基因上面的count值与基因长度和测序深度有关。
一个基因的长度越长,落在基因上面的reads数目就越多;测序深度越多,落在基因上面reads数目就越多。因此在做基因表达差异分析的时候,我们必须要对其进行基因长度和测序深度的标准化。

Q6: 任取6个样本表达矩阵随意分成2组走差异分析代码

这里使用了Bioconductor中的airway数据进行差异表达分析 (limma包)

rm(list = ls())
options(stringsAsFactors = F)
suppressMessages(library(airway))
data("airway")
#获取表达矩阵
exprSet <- assay(airway)
exprSet[1:4,1:4]
#获取分组信息
group_list <- colData(airway)[,3]
group_list
table(group_list)
#得到表达量矩阵后,需要对表达量进行筛选
table(apply(exprSet,1,function(x) sum(x>1)))
table(apply(exprSet,1, function(x) sum(x>1)==0))
exprSet=exprSet[apply(exprSet,1, function(x) sum(x>1) > 4),]
dim(exprSet)
[1] 20384     8
#计算样本间的相关性
library(edgeR)
exprSet1=cpm(exprSet,log = T,prior.count=2)
dim(exprSet1)
#取mad前1000的基因,因为这些基因在转录组的结果分析中,非常重要
tt=sort(apply(exprSet1, 1,mad),decreasing = T)[1:1000]
exprSet1=exprSet1[names(tt),]
cor(exprSet1)
pheatmap::pheatmap(cor(exprSet1))
corplot_pheatmap.png

差异表达分析---limma

suppressMessages(library(limma))
dim(exprSet)
#1.创建分组信息
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
design
           trt untrt
SRR1039508   0     1
SRR1039509   1     0
SRR1039512   0     1
SRR1039513   1     0
SRR1039516   0     1
SRR1039517   1     0
SRR1039520   0     1
SRR1039521   1     0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(group_list)`
[1] "contr.treatment"

#2.处理表达矩阵,过滤表达量太低的基因
dge <- DGEList(counts=exprSet)
dge <- calcNormFactors(dge)
dim(dge)
#3.差异表达分析
comp='trt-untrt'
v <- voom(dge,design,plot=TRUE, normalize="quantile")
fit <- lmFit(v, design)
cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
fit2
#4.分析结束,拿到最终结果
tempOutput = topTable(fit2, coef=comp, n=Inf)
DEG_limma_voom = na.omit(tempOutput)
head(DEG_limma_voom)
nrDEG=DEG_limma_voom[,c(1,4)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DEG_limma_voom, file = "limma_DEG.Rdata")
dim(nrDEG)
[1] 20384     2
write.csv(nrDEG,file = 'limma_DEG.csv')
#查看上调或者下调基因的个数
table(abs(nrDEG[,1]) >1 & nrDEG[,2] < 0.05 )
FALSE  TRUE 
19119  1265 

差异表达分析---DESEQ2

#差异表达基因---DESeq2
suppressMessages(library(DESeq2))
#1.创建分组信息
colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)
colData
#2.按照DESeq2包的方法处理数据
dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,
                              design = ~ group_list)
dds_2 <- DESeq(dds)
dds=dds_2
res <- results(dds,contrast=c("group_list","trt","untrt"))
resOrdered <- res[order(res$padj),]
DEG =as.data.frame(resOrdered)
DESeq2_DEG = na.omit(DEG)
nrDEG=DESeq2_DEG[,c(2,6)]
colnames(nrDEG)=c('log2FoldChange','pvalue')  
save(nrDEG, DESeq2_DEG, file = "DESeq_DEG.Rdata")
head(nrDEG)
table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
FALSE  TRUE 
17502   470 

差异表达分析---edgeR

suppressMessages(library(edgeR))
d <- DGEList(counts=exprSet,group=factor(group_list))
keep <- rowSums(cpm(d)>1) >= 2
table(keep)
d <- d[keep, , keep.lib.sizes=FALSE]
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d)
d$samples
dge=d
design <- model.matrix(~0+factor(group_list))
rownames(design)<-colnames(dge)
colnames(design)<-levels(factor(group_list))
dge=d
dge <- estimateGLMCommonDisp(dge,design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)

fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast=c(1,-1)) 
nrDEG=topTags(lrt, n=nrow(dge))
nrDEG=as.data.frame(nrDEG)
head(nrDEG)
edgeR_DEG =nrDEG 
nrDEG=edgeR_DEG[,c(1,5)]
colnames(nrDEG)=c('log2FoldChange','pvalue') 
save(nrDEG, edgeR_DEG, file = "edgeR_DEG.Rdata")
table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
FALSE  TRUE 
14614   421 

pheatmap 绘制

#pheatmap绘制
rm(list = ls())
options(stringsAsFactors = F)
#加载limma的差异表达分析后的数据
load(file = 'limma_DEG.Rdata')
library(pheatmap)
dim(nrDEG)
choose_name <- head(rownames(nrDEG),100)
choose_matirx <- exprSet[choose_name,]
choose_matirx=t(scale(t(choose_matirx)))
head(choose_matirx)
dim(choose_matirx)
#设置好分组信息
table(group_list)
group_list <- as.data.frame(group_list)
rownames(group_list) <- colnames(choose_matirx)
group_list
pheatmap(choose_matirx,annotation_col = group_list,fontsize_col = 0.5)
pheatmap.png

火山图绘制

#火山图的绘制
rm(list = ls())
options(stringsAsFactors = F)
load(file = 'limma_DEG.Rdata')
library(ggplot2)
nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange >1,ncol(nrDEG)+1]="Up"
nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange < -1,ncol(nrDEG)]="Down"
nrDEG[nrDEG$pvalue>=0.05 | 1 > abs(nrDEG$log2FoldChange),ncol(nrDEG)]="Normal"
colnames(nrDEG)[ncol(nrDEG)]="Regulate"
nrDEG$Regulate=factor(nrDEG$Regulate,levels = c("Up","Down","Normal"),order=T)
head(nrDEG)
col=c("red","lightseagreen", "black")
p_volcano=ggplot(nrDEG,aes(x=log2FoldChange,y=-log10(pvalue)))+
  geom_point(aes(color=nrDEG$Regulate),alpha=0.5)+scale_color_manual(values =col)+
  geom_hline(yintercept=c(-log10(0.05)),linetype=4)+
  geom_vline(xintercept=c(-log2(2),log2(2)),linetype=4)+
  theme(
    legend.text = element_text(size = 10, face = "bold"),
    legend.position = 'right',
    legend.key.size=unit(0.5,'cm'),
    axis.text.x=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
    axis.text.y=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
    axis.title.x = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
    axis.title.y = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
    
    panel.background = element_rect(fill = "transparent",colour = "black"), 
    panel.grid.minor = element_line(color="lightgrey",size=0.1),
    panel.grid.major = element_line(color="lightgrey",size=0.1),
    plot.background = element_rect(fill = "transparent",colour = "black")
  )
p_volcano
volcanno.png

Q7:挑选差异分析结果的统计学显著上调下调基因集

在此使用limma包的差异分析结果进行结果的展示

rm(list = ls())
options(stringsAsFactors = F)
load(file = 'limma_DEG.Rdata')
head(nrDEG)
nrDEG[nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05, ncol(nrDEG)+1]='UP'
nrDEG[nrDEG$log2FoldChange < -1 & nrDEG$pvalue < 0.05, ncol(nrDEG)] = 'DOWN'
nrDEG[abs(nrDEG$log2FoldChange) <1 | nrDEG$pvalue > 0.05,ncol(nrDEG)]='NORMAL'
table(nrDEG$V3)
dim(nrDEG)
head(nrDEG)
write.csv(nrDEG,file = 'limma_DEG_Regulation.csv')

既然得到了上调和下调的差异表达基因,接下来就要对差异表达的基因,进行GO和KEGG的注释分析

#kegg
kk.up <- enrichKEGG(   gene          =  gene_up    ,
                       organism      =  'hsa'      ,
                       universe      =  gene_all   ,
                       pvalueCutoff  =  0.8        ,
                       qvalueCutoff  =  0.8        )
kk.down <- enrichKEGG( gene          =  gene_down  ,
                       organism      =  'hsa'      ,
                       universe      =  gene_all   ,
                       pvalueCutoff  =  0.05       )

head(kk.up)[,1:6]
head(kk.down)[,1:6]
#绘图
suppressMessages(library(ggplot2))
kegg_down_dt <- as.data.frame(kk.down)
kegg_up_dt <- as.data.frame(kk.up)
down_kegg <- kegg_down_dt[kegg_down_dt$pvalue<0.05,]
down_kegg$group=-1
up_kegg <- kegg_up_dt[kegg_up_dt$pvalue<0.05,]
up_kegg$group=1
dat=rbind(up_kegg,down_kegg)
dat$pvalue=-log10(dat$pvalue)
dat$pvalue = dat$pvalue * dat$group
dat = dat[ order( dat$pvalue, decreasing = F ), ]
g_kegg <- ggplot( dat, 
                  aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) + 
  geom_bar( stat = "identity" ) + 
  scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) + 
  scale_x_discrete( name = "Pathway names" ) +
  scale_y_continuous( name = "log10P-value" ) +
  coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
  ggtitle( "Pathway Enrichment" )
g_kegg
g_kegg.png
#go
g_list=list(gene_up=gene_up, gene_down=gene_down, gene_diff=gene_diff)
go_enrich_results <- lapply( g_list , function(gene) {
  lapply( c('BP','MF','CC') , function(ont) {
    cat(paste('Now process ',ont ))
    ego <- enrichGO(gene          = gene,
                    universe      = gene_all,
                    OrgDb         = org.Hs.eg.db,
                    ont           = ont ,
                    pAdjustMethod = "BH",
                    pvalueCutoff  = 0.99,
                    qvalueCutoff  = 0.99,
                    readable      = TRUE)
    
    print( head(ego) )
    return(ego)
  })
})
save(go_enrich_results,file = 'go_enrich_results.Rdata')

n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC') 
for (i in 1:3){
  for (j in 1:3){
    fn=paste0('8.GO_dotplot_',n1[i],'_',n2[j],'.png')
    cat(paste0(fn,'\n'))
    png(fn,res=150,width = 1080)
    print(dotplot(go_enrich_results[[i]][[j]]))
    dev.off()
  }
}
GO_dotplot_gene_diff_BP.png

Q8: 直接对任取6个样本表达矩阵做GSVA分析

最后的GSVA分析,不是很懂这一部分的知识点,要学习一下再回来接着写!

上一篇下一篇

猜你喜欢

热点阅读