RNA-seq小考核----生信技能树
生信技能树----RNA-seq小考核。
题目链接地址为:http://www.bio-info-trainee.com/3920.html
教程对应B站:【生信技能树】转录组测序数据分析
链接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
代码参考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq
Q1: 参考基因组及注释文件下载地址
列出人,小鼠,拟南芥的基因组序列,转录组cDNA序列,基因组注释gtf文件下载地址
参考基因组存储网站三个:ENSEMBL、NCBI、UCSC
mkdir rnaseq_test
cd rnaseq_test
#拟南芥
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz
#人
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
#小鼠
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.toplevel.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz
Q2: 找到文章的测序数据
2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing 使用成熟的单细胞转录组( Smart-seq2 )
手段探索了癌相关的成纤维细胞 CAFs的功能和空间异质性。
在文献中找到了:GSE111229
然后进图NCBI-GEO数据库中搜索:GSE111229
下载含有SRR编号的SRR_Acc_List.txt文件
下载方法步骤如下图所示:
1563964898(1).png
1563964927(1).png
1563964959(1).png
1563976393(1).png
Q3:下载测序数据
#写一个while循环或者for循环,批量进行测序数据的下载,三种方式均可进行批量操作,实际上取任何一种方法操作即可。
cat SRR_Acc_List.txt|while read id ;do (prefetch ${id} );done
for i in `less SRR_Acc_List.txt`;do (prefetch ${id} ) ;done
for i in SRR679{0711..1478};do ( prefetch ${i} );done
Q4: 任意挑选6个样本走标准的RNA-seq上游流程
即 sra → fastq→bam→counts
注意每个步骤的质控细节,注意每个步骤的文件格式转换背后的生物学意义。
代码参考在:code
sra → fastq→bam→counts
#当前文件夹内容
(rna) yliu 22:35:41 ~/rnaseq_test
$ tree
.
├── sra_data
│ ├── SRR6790711.sra
│ ├── SRR6790712.sra
│ ├── SRR6790713.sra
│ ├── SRR6790714.sra
│ ├── SRR6790715.sra
│ └── SRR6790716.sra
└── SRR_Acc_List.txt
1 directory, 7 files
#sra → fastq
##主要应该注意sratoolkit软件的用法(软件参数的使用,文件的输入和输出等等)
cd sra_data/
ls *.sra|while read id;do (fastq-dump --split-3 --gzip ${id});done
ls
#fastq→bam
##mapping之前需要对raw_data进行质量控制
##raw_fastq→clean_fastq
##quanlity control
mkdir fastqc
ls *.gz|while read id;do (fastqc -t 2 -o ~/rnaseq_test/fastqc/ ${id});done
cd fastqc
multiqc ./*.zip
##filter data
##对raw_data进行质量控制后,要对其进行过滤和截断,确保mapping时数据的可靠性
cd ../
mkdir clean_read/
for i in `ls *_1.fastq.gz`; do i=${i/_*/}; echo "trim_galore -q 25 \
--phred33 --length 36 --stringency 3 --paired \
-o ./clean_read/ ${i}_1.fastq.gz ${i}_2.fastq.gz"; done > ./clean_read/command_trim_galore.sh
less -S ./clean_read/command_trim_galore.sh
cd ./clean_read/
ls
chmod 755 command_trim_galore.sh
bash command_trim_galore.sh
##mapping
mapping之前需要先构建参考基因组index
这里使用服务器已经构建好的参考基因组索引
hg38的参考基因组索引位置为:
/teach/rna_testdata/database/index/hisat/hg38/genome
mkdir hisat_mapping
ls *_val_1.fq.gz|while read id; do id=${id%%_*}; echo "hisat2 -p 4 -x /teach/rna_testdata/database/index/hisat/hg38/genome\
-1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz\
-S /home/yliu/rnaseq_test/sra_data/hisat_mapping/${id}.sam"; done > command_hisat_list.sh
less command_hisat_list.sh
nohup bash command_hisat_list.sh &
#sam→bam
##sam文件占据内存非常巨大,因此将sam文件合适转化为bam文件格式(二进制格式),缩小文件的所占的空间大小
cd ../hisat_mapping
ls
ls *.sam|while read id;do echo "samtools sort -O bam -@ 4 -o ${id%%.*}.bam ${id} ";done > command_sam_bam.list
less command_sam_bam.list
nohup bash command_sam_bam.list &
top
#bam→counts
##从bam文件中获取每一个基因的counts值
mkdir ../counts
ls *.sam|while read id;do echo "featureCounts -T 5 -p -t exon -g gene_id -a /teach/rna_testdata/database/gtf/gencode.v25.annotation.gtf.gz -o ../counts/${id%%.*}.count.txt ${id}";done > command.counts.list
less command.counts.list
nohup bash command.counts.list &
top
Q5: 理解RNA-seq上游流程得到的表达矩阵的多种形式
在RNA-seq中,我们通常使用分析得到的count值,
以及RPKM、FPKM、TPM等均一化后的基因表达单位,去表示表达矩阵。
·count:
每个基因上比对到的reads的数量
·RPKM(Reads Per Kilobase Million)
每千个碱基的转录每百万映射读取的reads数
·FPKM(Fragments Per Kilobase Million)
每千个碱基的转录每百万映射读取的fragments数
·TPM(Transcripts Per Million)
每百万条reads的转录本
在RNA-seq数据分析中,对基因或者转录本的read counts数目进行标准化(Normalization)是一个非常重要的步骤。
因为在转录组数据分析中,落在一个基因上面的count值与基因长度和测序深度有关。
一个基因的长度越长,落在基因上面的reads数目就越多;测序深度越多,落在基因上面reads数目就越多。因此在做基因表达差异分析的时候,我们必须要对其进行基因长度和测序深度的标准化。
Q6: 任取6个样本表达矩阵随意分成2组走差异分析代码
这里使用了Bioconductor中的airway数据进行差异表达分析 (limma包)
rm(list = ls())
options(stringsAsFactors = F)
suppressMessages(library(airway))
data("airway")
#获取表达矩阵
exprSet <- assay(airway)
exprSet[1:4,1:4]
#获取分组信息
group_list <- colData(airway)[,3]
group_list
table(group_list)
#得到表达量矩阵后,需要对表达量进行筛选
table(apply(exprSet,1,function(x) sum(x>1)))
table(apply(exprSet,1, function(x) sum(x>1)==0))
exprSet=exprSet[apply(exprSet,1, function(x) sum(x>1) > 4),]
dim(exprSet)
[1] 20384 8
#计算样本间的相关性
library(edgeR)
exprSet1=cpm(exprSet,log = T,prior.count=2)
dim(exprSet1)
#取mad前1000的基因,因为这些基因在转录组的结果分析中,非常重要
tt=sort(apply(exprSet1, 1,mad),decreasing = T)[1:1000]
exprSet1=exprSet1[names(tt),]
cor(exprSet1)
pheatmap::pheatmap(cor(exprSet1))
corplot_pheatmap.png
差异表达分析---limma
suppressMessages(library(limma))
dim(exprSet)
#1.创建分组信息
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
design
trt untrt
SRR1039508 0 1
SRR1039509 1 0
SRR1039512 0 1
SRR1039513 1 0
SRR1039516 0 1
SRR1039517 1 0
SRR1039520 0 1
SRR1039521 1 0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(group_list)`
[1] "contr.treatment"
#2.处理表达矩阵,过滤表达量太低的基因
dge <- DGEList(counts=exprSet)
dge <- calcNormFactors(dge)
dim(dge)
#3.差异表达分析
comp='trt-untrt'
v <- voom(dge,design,plot=TRUE, normalize="quantile")
fit <- lmFit(v, design)
cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
fit2
#4.分析结束,拿到最终结果
tempOutput = topTable(fit2, coef=comp, n=Inf)
DEG_limma_voom = na.omit(tempOutput)
head(DEG_limma_voom)
nrDEG=DEG_limma_voom[,c(1,4)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DEG_limma_voom, file = "limma_DEG.Rdata")
dim(nrDEG)
[1] 20384 2
write.csv(nrDEG,file = 'limma_DEG.csv')
#查看上调或者下调基因的个数
table(abs(nrDEG[,1]) >1 & nrDEG[,2] < 0.05 )
FALSE TRUE
19119 1265
差异表达分析---DESEQ2
#差异表达基因---DESeq2
suppressMessages(library(DESeq2))
#1.创建分组信息
colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)
colData
#2.按照DESeq2包的方法处理数据
dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,
design = ~ group_list)
dds_2 <- DESeq(dds)
dds=dds_2
res <- results(dds,contrast=c("group_list","trt","untrt"))
resOrdered <- res[order(res$padj),]
DEG =as.data.frame(resOrdered)
DESeq2_DEG = na.omit(DEG)
nrDEG=DESeq2_DEG[,c(2,6)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DESeq2_DEG, file = "DESeq_DEG.Rdata")
head(nrDEG)
table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
FALSE TRUE
17502 470
差异表达分析---edgeR
suppressMessages(library(edgeR))
d <- DGEList(counts=exprSet,group=factor(group_list))
keep <- rowSums(cpm(d)>1) >= 2
table(keep)
d <- d[keep, , keep.lib.sizes=FALSE]
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d)
d$samples
dge=d
design <- model.matrix(~0+factor(group_list))
rownames(design)<-colnames(dge)
colnames(design)<-levels(factor(group_list))
dge=d
dge <- estimateGLMCommonDisp(dge,design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast=c(1,-1))
nrDEG=topTags(lrt, n=nrow(dge))
nrDEG=as.data.frame(nrDEG)
head(nrDEG)
edgeR_DEG =nrDEG
nrDEG=edgeR_DEG[,c(1,5)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, edgeR_DEG, file = "edgeR_DEG.Rdata")
table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
FALSE TRUE
14614 421
pheatmap 绘制
#pheatmap绘制
rm(list = ls())
options(stringsAsFactors = F)
#加载limma的差异表达分析后的数据
load(file = 'limma_DEG.Rdata')
library(pheatmap)
dim(nrDEG)
choose_name <- head(rownames(nrDEG),100)
choose_matirx <- exprSet[choose_name,]
choose_matirx=t(scale(t(choose_matirx)))
head(choose_matirx)
dim(choose_matirx)
#设置好分组信息
table(group_list)
group_list <- as.data.frame(group_list)
rownames(group_list) <- colnames(choose_matirx)
group_list
pheatmap(choose_matirx,annotation_col = group_list,fontsize_col = 0.5)
pheatmap.png
火山图绘制
#火山图的绘制
rm(list = ls())
options(stringsAsFactors = F)
load(file = 'limma_DEG.Rdata')
library(ggplot2)
nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange >1,ncol(nrDEG)+1]="Up"
nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange < -1,ncol(nrDEG)]="Down"
nrDEG[nrDEG$pvalue>=0.05 | 1 > abs(nrDEG$log2FoldChange),ncol(nrDEG)]="Normal"
colnames(nrDEG)[ncol(nrDEG)]="Regulate"
nrDEG$Regulate=factor(nrDEG$Regulate,levels = c("Up","Down","Normal"),order=T)
head(nrDEG)
col=c("red","lightseagreen", "black")
p_volcano=ggplot(nrDEG,aes(x=log2FoldChange,y=-log10(pvalue)))+
geom_point(aes(color=nrDEG$Regulate),alpha=0.5)+scale_color_manual(values =col)+
geom_hline(yintercept=c(-log10(0.05)),linetype=4)+
geom_vline(xintercept=c(-log2(2),log2(2)),linetype=4)+
theme(
legend.text = element_text(size = 10, face = "bold"),
legend.position = 'right',
legend.key.size=unit(0.5,'cm'),
axis.text.x=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
axis.text.y=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
axis.title.x = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
axis.title.y = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
panel.background = element_rect(fill = "transparent",colour = "black"),
panel.grid.minor = element_line(color="lightgrey",size=0.1),
panel.grid.major = element_line(color="lightgrey",size=0.1),
plot.background = element_rect(fill = "transparent",colour = "black")
)
p_volcano
volcanno.png
Q7:挑选差异分析结果的统计学显著上调下调基因集
在此使用limma包的差异分析结果进行结果的展示
rm(list = ls())
options(stringsAsFactors = F)
load(file = 'limma_DEG.Rdata')
head(nrDEG)
nrDEG[nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05, ncol(nrDEG)+1]='UP'
nrDEG[nrDEG$log2FoldChange < -1 & nrDEG$pvalue < 0.05, ncol(nrDEG)] = 'DOWN'
nrDEG[abs(nrDEG$log2FoldChange) <1 | nrDEG$pvalue > 0.05,ncol(nrDEG)]='NORMAL'
table(nrDEG$V3)
dim(nrDEG)
head(nrDEG)
write.csv(nrDEG,file = 'limma_DEG_Regulation.csv')
既然得到了上调和下调的差异表达基因,接下来就要对差异表达的基因,进行GO和KEGG的注释分析
#kegg
kk.up <- enrichKEGG( gene = gene_up ,
organism = 'hsa' ,
universe = gene_all ,
pvalueCutoff = 0.8 ,
qvalueCutoff = 0.8 )
kk.down <- enrichKEGG( gene = gene_down ,
organism = 'hsa' ,
universe = gene_all ,
pvalueCutoff = 0.05 )
head(kk.up)[,1:6]
head(kk.down)[,1:6]
#绘图
suppressMessages(library(ggplot2))
kegg_down_dt <- as.data.frame(kk.down)
kegg_up_dt <- as.data.frame(kk.up)
down_kegg <- kegg_down_dt[kegg_down_dt$pvalue<0.05,]
down_kegg$group=-1
up_kegg <- kegg_up_dt[kegg_up_dt$pvalue<0.05,]
up_kegg$group=1
dat=rbind(up_kegg,down_kegg)
dat$pvalue=-log10(dat$pvalue)
dat$pvalue = dat$pvalue * dat$group
dat = dat[ order( dat$pvalue, decreasing = F ), ]
g_kegg <- ggplot( dat,
aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) +
geom_bar( stat = "identity" ) +
scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) +
scale_x_discrete( name = "Pathway names" ) +
scale_y_continuous( name = "log10P-value" ) +
coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
ggtitle( "Pathway Enrichment" )
g_kegg
g_kegg.png
#go
g_list=list(gene_up=gene_up, gene_down=gene_down, gene_diff=gene_diff)
go_enrich_results <- lapply( g_list , function(gene) {
lapply( c('BP','MF','CC') , function(ont) {
cat(paste('Now process ',ont ))
ego <- enrichGO(gene = gene,
universe = gene_all,
OrgDb = org.Hs.eg.db,
ont = ont ,
pAdjustMethod = "BH",
pvalueCutoff = 0.99,
qvalueCutoff = 0.99,
readable = TRUE)
print( head(ego) )
return(ego)
})
})
save(go_enrich_results,file = 'go_enrich_results.Rdata')
n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC')
for (i in 1:3){
for (j in 1:3){
fn=paste0('8.GO_dotplot_',n1[i],'_',n2[j],'.png')
cat(paste0(fn,'\n'))
png(fn,res=150,width = 1080)
print(dotplot(go_enrich_results[[i]][[j]]))
dev.off()
}
}
GO_dotplot_gene_diff_BP.png
Q8: 直接对任取6个样本表达矩阵做GSVA分析
最后的GSVA分析,不是很懂这一部分的知识点,要学习一下再回来接着写!