FPKM_cutoff

2019-05-31  superqun

Original: http://www.dxy.cn/bbs/thread/38676704#38676704

Meaningful FPKM cutoff?

FPKM > 1.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3233386/

We used a stringent 1.0 FPKM cutoff that generated a list of genes with significant base level expression and fewer false positives than a lower expression level threshold. If we had decreased our threshold to 0.1 FPKM, we would have detected 975 more DETs; however, these genes are expressed at an extremely low level and their impact must be weighed against the increase in false positives. We chose a conservative criterion to identify significant and bona fide differentially expressed genes.
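The trade-off in this quote can be explored by counting how many genes pass each candidate cutoff. The following is a minimal sketch, not the authors' pipeline: the gene names and FPKM values are made up, and treating the cutoff as inclusive (>=) is an assumption.

```python
# Hypothetical per-gene FPKM values (illustrative only, not from the paper).
fpkm = {"geneA": 12.4, "geneB": 0.8, "geneC": 0.15, "geneD": 0.02, "geneE": 3.1}


def count_passing(values, cutoff):
    """Count genes whose FPKM is at or above the cutoff (inclusive by assumption)."""
    return sum(1 for v in values.values() if v >= cutoff)


# Compare the stringent cutoff (1.0) with the relaxed one (0.1).
counts = {cutoff: count_passing(fpkm, cutoff) for cutoff in (1.0, 0.1)}
for cutoff, n in counts.items():
    print(f"FPKM >= {cutoff}: {n} genes pass")
```

Genes that appear only under the relaxed cutoff are the ones whose low expression has to be weighed against the extra false positives.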

Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs

In a final step, we removed transcripts expressed at fragments per kilobase of transcript per million mapped reads (FPKM) < 1, a threshold approximately equivalent to one copy per cell [21].
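For reference, the FPKM definition spelled out in this quote can be written as a short calculation. This is a sketch of the standard formula only; the function name and the example numbers are hypothetical, and the "one copy per cell" equivalence is the paper's claim, not something the formula derives.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Equivalent to: fragments / (length_bp / 1e3) / (total_mapped / 1e6).
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 50 fragments on a 2,000 bp transcript
# in a library with 25 million mapped fragments.
value = fpkm(50, 2000, 25_000_000)
print(value)
```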

Burrows–Wheeler transform-based short read aligner analysis workflow

Burrows–Wheeler Transform Aligner (BWA) [45] was used to align RNA-seq reads against the mouse reference genome (build mm9), downloaded and indexed from the University of California Santa Cruz (UCSC) genome browser database [46]. The resulting sequence alignment/map files were imported into Partek Genomics Suite (Partek Inc., St. Louis, MO) to compute raw and fragments per kilobase of exon model per million mapped (FPKM) reads normalized expression values of the transcript isoforms defined in the UCSC refFlat file. A stringent filtering criterion of FPKM value 1.0 (equivalent to one transcript per cell [16]) in at least one out of six samples was used to obtain expressed transcripts. The FPKM values of the filtered transcripts were log-transformed using log2 (FPKM+offset) with an offset=1.0. Analysis of variance (ANOVA) was then performed on the log-transformed data of the two groups (WT and Nrl−/−) to generate fold change and p values for each transcript. Y-chromosome transcripts were filtered out along with non-coding (nc) RNAs, mitochondrial DNA coded genes, pseudogenes, and predicted protein-coding genes. Differentially expressed mRNA isoforms were filtered for a fold change cutoff of 1.5 and p-value cutoff of 0.05. These criteria were implemented to enable a comparison with previous expression studies. Hierarchical clustering was performed using Cluster 3.0 software [47]. We used uncentered correlation as the distance metric. Heatmaps and dendrograms were generated using JavaTreeView software [48]. Aligned reads were visualized using the Integrated Genomics Viewer (IGV) [49].
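The filtering and transformation steps in this workflow (FPKM of 1.0 in at least one of six samples, then log2(FPKM + 1)) can be sketched as below. This is not Partek's implementation: the transcript names and values are made up, the inclusive cutoff is an assumption, and a simple difference of group means in log space stands in for the ANOVA-derived fold change.

```python
import math

# Hypothetical FPKM table: transcript -> six sample values.
# The first three columns are treated as WT, the last three as Nrl-/-.
fpkm_table = {
    "tx1": [0.2, 0.4, 0.1, 0.3, 0.9, 0.5],
    "tx2": [0.0, 0.3, 1.2, 0.1, 0.0, 0.4],
    "tx3": [7.0, 7.0, 7.0, 1.0, 1.0, 1.0],
}

CUTOFF = 1.0  # FPKM threshold from the quoted methods
OFFSET = 1.0  # offset used in log2(FPKM + offset)

# Keep a transcript if it reaches the cutoff in at least one sample.
expressed = {
    tx: vals for tx, vals in fpkm_table.items() if any(v >= CUTOFF for v in vals)
}

# Log-transform the surviving transcripts.
log_expr = {
    tx: [math.log2(v + OFFSET) for v in vals] for tx, vals in expressed.items()
}


def mean(xs):
    return sum(xs) / len(xs)


# log2 fold change (Nrl-/- minus WT) as a difference of group means.
log2_fc = {tx: mean(vals[3:]) - mean(vals[:3]) for tx, vals in log_expr.items()}

for tx, fc in log2_fc.items():
    print(tx, round(fc, 3))
```

Filtering on "at least one sample" rather than "all samples" keeps transcripts that are switched on in only one condition, which are often the ones of interest in a differential expression analysis.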

In the paper "Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses", the authors used both log2(FPKM+0.05) and log2(FPKM+1).
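The choice of offset matters most for lowly expressed transcripts. The sketch below compares the two offsets on a few hypothetical FPKM values; it only illustrates the arithmetic and says nothing about why the authors chose either one.

```python
import math

# Hypothetical FPKM values spanning undetected to highly expressed.
fpkm_values = [0.0, 0.05, 1.0, 100.0]

transformed = {
    offset: [math.log2(v + offset) for v in fpkm_values] for offset in (0.05, 1.0)
}

# Distance between FPKM = 0 and FPKM = 1 on each log scale.
spread = {offset: vals[2] - vals[0] for offset, vals in transformed.items()}

for offset, vals in transformed.items():
    print(f"offset={offset}:", [round(x, 2) for x in vals])
print("log2 gap between FPKM 0 and 1:", {k: round(v, 2) for k, v in spread.items()})
```

With the smaller offset, differences among near-zero values are stretched over several log2 units, while an offset of 1 compresses them toward zero.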

TopHat/Cufflinks-based analysis workflow

Raw reads that passed the chastity filter threshold were mapped using TopHat [50] to identify known and novel splice junctions and to generate read alignments for each sample. Genomic annotations were obtained from Ensembl in gene transfer format (GTF). Splice junctions from the six samples were combined into a master junctions file that was used as an input file for the second iteration of TopHat mapping. The transcript isoform level and gene level counts were calculated and FPKM normalized using Cufflinks. An FPKM filtering cutoff of 1.0 in at least one of the six samples was used to determine expressed transcripts. Differential transcript expression was then computed using Cuffdiff. The resulting lists of differentially expressed isoforms were filtered and sorted into the following categories: protein coding mRNA transcripts and ncRNA transcripts.

FPKM=0

If you actually go look at a few genes with an FPKM of 0 in something like IGV, you'll see that sometimes there are a few reads that align to them. It's just that when you're talking about 100M or more reads per sample, 1 or 2 reads is pretty insignificant. So, while you can't really say it's not expressed with 100% certainty, you can say it was not detected. The other measure you can look at is whether the lower bound of the 95% confidence interval is >0. If it is, you can confidently claim something is expressed.
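Both points in this answer can be illustrated with a small sketch. The first part shows how little 2 fragments contribute at 100M mapped fragments. The second part selects genes whose lower confidence bound is above zero. The column names `FPKM_conf_lo` and `FPKM_conf_hi` are assumed to follow the Cufflinks `fpkm_tracking` layout, and the inline table and gene IDs are made up.

```python
import csv
import io

# Part 1: 2 fragments on a hypothetical 2,000 bp gene, 100M mapped fragments.
tiny_fpkm = 2 * 1e9 / (2000 * 100_000_000)
print("FPKM for 2 fragments:", tiny_fpkm)

# Part 2: keep genes whose 95% CI lower bound is above zero.
# Inline stand-in for a tab-delimited tracking file (values are invented).
tracking = (
    "gene_id\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\n"
    "g1\t5.2\t3.9\t6.5\n"
    "g2\t0.04\t0\t0.21\n"
    "g3\t0\t0\t0\n"
)

rows = list(csv.DictReader(io.StringIO(tracking), delimiter="\t"))
confident = [r["gene_id"] for r in rows if float(r["FPKM_conf_lo"]) > 0]
print("Confidently expressed:", confident)
```

Here `g2` has a non-zero FPKM but an interval that includes zero, so under this criterion it counts as "not confidently detected" rather than "not expressed".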

FPKM > 0.05: Toung et al., "RNA-sequence analysis of human B-cells", Genome Research

Although we do not wish to use an arbitrary FPKM threshold to determine whether a transcript is expressed, analysis of all transcripts with expression levels greater than zero will include FPKM values that are very close to zero (bottom fifth percentile of transcript FPKM values = 0.003). Thus, we set an FPKM value of 0.05 as the lower bound in our subsequent analyses.
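The reasoning in this quote (inspect the bottom of the non-zero FPKM distribution, then apply a small floor) can be sketched as follows. The FPKM values are invented, the nearest-rank percentile is one of several possible definitions, and treating 0.05 as an inclusive lower bound is an assumption.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical transcript FPKM values, including undetected transcripts.
fpkm_values = [
    0.0, 0.0, 0.003, 0.01, 0.02, 0.04, 0.08, 0.2, 0.5, 0.9, 1.3,
    2.0, 3.5, 5.0, 7.2, 9.9, 12.0, 18.0, 25.0, 40.0, 66.0, 120.0,
]

# Look only at transcripts with expression greater than zero.
positive = [v for v in fpkm_values if v > 0]
p5 = percentile_nearest_rank(positive, 5)
print("5th percentile of non-zero FPKM:", p5)

LOWER_BOUND = 0.05  # floor from the quoted study
kept = [v for v in positive if v >= LOWER_BOUND]
print(f"{len(kept)} of {len(positive)} non-zero transcripts kept")
```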
