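Learning the lncRNA identification workflow from a paper
Source paper: Global identification of Arabidopsis lncRNAs reveals the regulation of MAF4 by a natural antisense RNA
https://www.nature.com/articles/s41467-018-07500-7#Sec11
Published in Nature Communications (NC) in 2018. NC papers describe their analysis steps in a lot of detail!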
I. Assembling transcripts
1. QC
Only reads with a quality score of 20 or higher are used for the analysis.
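The notes do not say which tool or which quality metric was used for this cutoff. As a minimal sketch, assuming the threshold applies to the mean Phred score of each read and that qualities are Phred+33 encoded, the filter could look like this (the function names and the example records are hypothetical):

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 is an assumption)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def filter_fastq_records(records, min_quality=20):
    """Keep (header, sequence, quality) records whose mean quality is >= min_quality."""
    return [rec for rec in records if rec[2] and mean_phred(rec[2]) >= min_quality]


# Hypothetical records: 'I' encodes Phred 40, '#' encodes Phred 2.
records = [
    ("@read1", "ACGT", "IIII"),
    ("@read2", "ACGT", "####"),
]
kept = filter_fastq_records(records)
print([rec[0] for rec in kept])
```

In practice a dedicated trimming or filtering tool would normally do this step; the sketch only makes the cutoff concrete.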
2. Alignment to the reference genome
Aligner: TopHat 2.0.10
Reference genome: TAIR10
Parameters:
Parameters were set for strand-specific mapping and up to 5 different alignments were allowed for a given read;
Annotations in TAIR10 served as an additional junction set to facilitate the alignment
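The quoted settings map onto three TopHat options: `--library-type` for strand-specific mapping, `-g` (`--max-multihits`) for the limit of 5 alignments per read, and `-G` (`--GTF`) for supplying the TAIR10 annotation. The sketch below only builds the command line. The library type `fr-firststrand` is an assumption (the notes say only that mapping was strand-specific), and all file names are hypothetical.

```python
import shlex


def build_tophat_command(index_prefix, reads, annotation_gtf, out_dir,
                         library_type="fr-firststrand", max_multihits=5):
    """Build a TopHat command line as an argument list (does not run it)."""
    return [
        "tophat",
        "--library-type", library_type,   # strand-specific mapping (protocol assumed)
        "-g", str(max_multihits),         # up to 5 alignments per read
        "-G", annotation_gtf,             # known junctions from the annotation
        "-o", out_dir,
        index_prefix,
        *reads,
    ]


# Hypothetical paths, for illustration only.
tophat_cmd = build_tophat_command(
    "TAIR10_index", ["sample_1.fastq", "sample_2.fastq"],
    "TAIR10_annotation.gtf", "tophat_out",
)
print(shlex.join(tophat_cmd))
```

The list can be passed to `subprocess.run(tophat_cmd, check=True)` once TopHat and a Bowtie index are available.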
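3. Assembling transcripts
Assembler: Cufflinks 1.3.0
Method:
Reads aligned in the previous step are assembled into transcripts using reference annotation based transcript (RABT) assembly.
Parameters: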
Putative transcripts were retrieved with the parameter ‘--min-frags-per-transfrag 1’.
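In Cufflinks, RABT assembly is requested with `-g` (`--GTF-guide`), and the quoted `--min-frags-per-transfrag 1` is passed through unchanged. A minimal sketch of the command, again with an assumed library type and hypothetical file names:

```python
def build_cufflinks_command(bam_path, guide_gtf, out_dir,
                            library_type="fr-firststrand",
                            min_frags_per_transfrag=1):
    """Build a Cufflinks RABT command line as an argument list (does not run it)."""
    return [
        "cufflinks",
        "-g", guide_gtf,                   # RABT: annotation guides the assembly
        "--library-type", library_type,    # assumed to match the alignment step
        "--min-frags-per-transfrag", str(min_frags_per_transfrag),
        "-o", out_dir,
        bam_path,
    ]


# Hypothetical paths, for illustration only.
cufflinks_cmd = build_cufflinks_command(
    "tophat_out/accepted_hits.bam", "TAIR10_annotation.gtf", "cufflinks_out"
)
print(" ".join(cufflinks_cmd))
```

Note that `-g` differs from `-G`: the latter restricts Cufflinks to the annotated transcripts only, which would prevent the discovery of novel lncRNAs.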
Finally, the assemblies from each dataset are merged into a single transcriptome:
Software: Cuffmerge utility version v1.0.0
Finally, assembled transcripts from each dataset and the reference annotation were merged into a unified transcriptome using Cuffmerge utility version v1.0.0
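Cuffmerge takes a text file listing one assembly GTF per line, plus the reference annotation via `-g` (`--ref-gtf`). The sketch below writes such a list and builds the command. The per-dataset paths are hypothetical.

```python
import os
import tempfile


def write_assembly_list(gtf_paths, list_path):
    """Write one assembly GTF path per line, the input format Cuffmerge expects."""
    with open(list_path, "w") as handle:
        for path in gtf_paths:
            handle.write(path + "\n")
    return list_path


def build_cuffmerge_command(list_path, reference_gtf, out_dir):
    """Build a Cuffmerge command line as an argument list (does not run it)."""
    return ["cuffmerge", "-g", reference_gtf, "-o", out_dir, list_path]


# Hypothetical per-dataset assemblies, for illustration only.
assemblies = ["dataset1/transcripts.gtf", "dataset2/transcripts.gtf"]
list_path = write_assembly_list(
    assemblies, os.path.join(tempfile.mkdtemp(), "assemblies.txt")
)
cuffmerge_cmd = build_cuffmerge_command(list_path, "TAIR10_annotation.gtf", "merged_out")
print(" ".join(cuffmerge_cmd))
```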
II. Identifying lncRNAs
Identification of Arabidopsis lncRNAs
Rough workflow, removing in turn:
known transcripts that are not lncRNAs
unreliable, lowly expressed transcripts
protein-coding RNAs
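Detailed steps:
1. Keep only transcripts with Cufflinks class codes u, x, or i.
2. Remove transcripts shorter than 150 nt or with FPKMmax < 1 (FPKMmax stands for the maximum expression level of a lncRNA from all samples).
3. Remove transcripts that encode proteins.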
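The first two steps can be sketched as a simple predicate over a per-transcript record. The dictionary layout (`class_code`, `length`, `fpkm_by_sample`) is a hypothetical representation chosen for illustration, not a format produced by Cufflinks itself. The protein-coding filter of step 3 is left out here.

```python
KEPT_CLASS_CODES = {"u", "x", "i"}  # class codes retained in step 1


def passes_basic_filters(transcript, min_length=150, min_fpkm_max=1.0):
    """Apply step 1 (class code) and step 2 (length and FPKMmax)."""
    if transcript["class_code"] not in KEPT_CLASS_CODES:
        return False
    if transcript["length"] < min_length:
        return False
    # FPKMmax: the highest expression level across all samples.
    fpkm_max = max(transcript["fpkm_by_sample"].values(), default=0.0)
    return fpkm_max >= min_fpkm_max


# Hypothetical transcripts, for illustration only.
transcripts = [
    {"id": "t1", "class_code": "u", "length": 800, "fpkm_by_sample": {"leaf": 0.4, "root": 2.5}},
    {"id": "t2", "class_code": "=", "length": 900, "fpkm_by_sample": {"leaf": 9.0, "root": 7.0}},
    {"id": "t3", "class_code": "x", "length": 120, "fpkm_by_sample": {"leaf": 3.0, "root": 3.0}},
    {"id": "t4", "class_code": "i", "length": 300, "fpkm_by_sample": {"leaf": 0.2, "root": 0.9}},
]
candidates = [t["id"] for t in transcripts if passes_basic_filters(t)]
print(candidates)
```

Here t2 fails on class code, t3 on length, and t4 on expression, so only t1 remains a candidate.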
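How is a protein-coding RNA defined?
Each transcript is searched with blastx against all plant protein sequences in the Swiss-Prot database,
with a cutoff e-value < 10^-4.
Transcripts with strong hits (alignment length ≥40 aa, percent identity ≥35% and coverage of the alignment region in either query or subject sequence ≥35%) are treated as protein-coding and removed.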
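The strong-hit criteria translate directly into a predicate over one blastx hit. The sketch below assumes the hit has been parsed into a dictionary whose keys mirror BLAST tabular fields (`evalue`, `pident`, `qstart`, `qend`, `qlen`, `sstart`, `send`, `slen`) plus the alignment length in amino acids as `aln_len_aa`. How the paper computed coverage is not stated in these notes, so the aligned-span-over-sequence-length formula is an assumption.

```python
def is_strong_hit(hit, max_evalue=1e-4, min_aln_aa=40,
                  min_identity=35.0, min_coverage=35.0):
    """Return True if a blastx hit meets the strong-hit criteria."""
    if hit["evalue"] >= max_evalue:
        return False
    if hit["aln_len_aa"] < min_aln_aa or hit["pident"] < min_identity:
        return False
    # Coverage (assumed definition): aligned span as a percentage of the full sequence.
    query_cov = 100.0 * (abs(hit["qend"] - hit["qstart"]) + 1) / hit["qlen"]
    subject_cov = 100.0 * (abs(hit["send"] - hit["sstart"]) + 1) / hit["slen"]
    # Coverage of either the query or the subject is enough.
    return query_cov >= min_coverage or subject_cov >= min_coverage


# Hypothetical hits, for illustration only.
strong = {"evalue": 1e-20, "aln_len_aa": 120, "pident": 52.0,
          "qstart": 1, "qend": 360, "qlen": 900,
          "sstart": 10, "send": 129, "slen": 600}
weak = {"evalue": 1e-6, "aln_len_aa": 30, "pident": 60.0,
        "qstart": 1, "qend": 90, "qlen": 900,
        "sstart": 1, "send": 30, "slen": 600}
print(is_strong_hit(strong), is_strong_hit(weak))
```

The first hit passes on query coverage (360 of 900 nt is 40%), while the second fails because its alignment is shorter than 40 aa.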
Confirming the lncRNAs
(2) the CPC (Coding Potential Calculator) score (ref. 71 in the paper), a value to assess protein-coding potential of a transcript based on six biologically meaningful sequence features, was calculated for each transcript. When the CPC score is positive, we considered the transcript to have protein-coding potential. Transcripts that passed the three filtering steps were annotated as lncRNAs.
CPC > 0: the transcript is considered to have coding potential and is removed.
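Putting the last two checks together: a candidate is annotated as a lncRNA only if it has no strong Swiss-Prot hit and its CPC score is not positive. This is a minimal sketch over hypothetical IDs and scores. It does not call CPC itself, and dropping transcripts without a score is a conservative assumption of this sketch rather than something the paper states.

```python
def select_lncrnas(candidate_ids, strong_hit_ids, cpc_scores):
    """Keep candidates with no strong protein hit and a non-positive CPC score."""
    lncrnas = []
    for transcript_id in candidate_ids:
        if transcript_id in strong_hit_ids:
            continue  # similar to a known protein: treated as coding
        score = cpc_scores.get(transcript_id)
        if score is None or score > 0:
            continue  # positive score means coding potential; missing score dropped (assumption)
        lncrnas.append(transcript_id)
    return lncrnas


# Hypothetical inputs, for illustration only.
final_lncrnas = select_lncrnas(
    ["t1", "t2", "t3"],
    strong_hit_ids={"t2"},
    cpc_scores={"t1": -1.1, "t2": -0.5, "t3": 0.8},
)
print(final_lncrnas)
```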
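The workflow is very clear. After the initial filtering, the transcripts are compared against a protein database to remove coding RNAs that meet the strong-hit criteria, and CPC is then applied as a final filter. This is why the paper describes its pipeline as very stringent.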