生物信息学分析

star: 比对RNA-seq到参考基因组获取剪切事件

2023-01-01  本文已影响0人  JeremyL

# 安装

## \## 获取star源代码
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.10b.tar.gz
tar -xzf 2.7.10b.tar.gz
cd STAR-2.7.10b

# Alternatively, get STAR source using git
# git clone https://github.com/alexdobin/STAR.git

## 编译

# Compile
cd STAR/source
make STAR

## 或者使用conda

conda install -c bioconda star

## 检查安装是否成功

STAR
Usage: STAR  [options]... --genomeDir /path/to/genome/index/   --readFilesIn R1.fq R2.fq
Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2022

STAR version=2.7.10b
STAR compilation time,server,dir=2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
For more details see:
<https://github.com/alexdobin/STAR>
<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>

To list all parameters, run STAR --help

# star 基本流程

## 建立基因组索引

## 比对reads到基因组

# 产生基因组索引

## 基本命令:

STAR --runThreadN 30 --runMode genomeGenerate \
 --genomeDir /dataBase/hg38StarIndex \
 --genomeFastaFiles /dataBase/GRCh38.p13.genome.fa \
 --sjdbGTFfile /dataBase/gencode.v42.annotation.gtf \
 --sjdbOverhang ReadLength-1

---runThreadN: 线程
--runMode:genomeGenerate 为产生基因组索引
--genomeDir:index文件输出位置
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ... 基因组fa文件
--sjdbGTFfile:基因组注释文件
--sjdbOverhang ReadLength-1 #已经注释的剪切位点两端的序列长度,一般是read 长度-1,当read 长度是变化的时候,就用最大的read 长度-1

## 基因组序列文件和注释文件的选择

### 基因组序列文件: chromosomes/scaffolds/patches 应该怎么选择?

### 基因组注释文件:

### 使用剪切注释数据

Chr \tab Start \tab End \tab Strand=+/-/.

Start 和End是内含子第一个和最后一个碱基(1-based chromosome coordinates);STAR可以从sjdbFile.txt和GTF文件中获取剪切事件。

### 较小的基因组

对于较小的基因组,参数--genomeSAindexNbases必须按比例缩小,为min(14,log2(GenomeLength)/2-1)。例如,对于1megaBase基基因组,使用9,对于100000基基因组,使用7。

### 基因组有许多序列片段

一个基因组有>5,000 chro- somes/scaffolds时,可能需要减少--genomeChrBinNb位以减少RAM消耗。建议使用以下缩放比例:--genomeChrBinNbits = min(18,log2[max(GenomeLength/NumberOfReferences, ReadLength)])。3 gigaBase基因组,有100000 条chromosomes/scaffolds,,可以设置为15。

# 比对

## 基本命令:

STAR --runThreadN 20 --genomeDir /dataBase/hg38StarIndex \
 --runMode alignReads \
 --quantMode TranscriptomeSAM GeneCounts \
 --readFilesIn  R1.fastq.gz R2.fastq.gz \
 --outSAMtype BAM SortedByCoordinate \
 --readFilesCommand gunzip -c \
 --outFileNamePrefix test

---runThreadN: 线程
--genomeDir:基因组index位置
--runMode:alignReads 比对基因组,与前面的genomeGenerate 不同;
--quantMode:定量类型;TranscriptomeSAM:输出转录组的比对情况到一个单独的文件;GeneCounts:每个基因的reads数目统计
--readFilesIn: 测序数据,可以是fasta和fastq格式的
--outSAMtype: 输出文件格式
--readFilesCommand: 测序文件是压缩文件的话,解压命令
--outFileNamePrefix: 输出文件前缀

# 比对结果

testLog.out: 程序运行日志
testLog.progress.out: 程序运行中实时统计
testLog.final.out: 比对结束之后的,比对情况统计
testAligned.sortedByCoord.out.bam:排序之后的比对结果    
testSJ.out.tab:剪切事件
testAligned.toTranscriptome.out.bam:转录组比对结果  
testReadsPerGene.out.tab: 每个基因read计数情况
Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Jan 01 15:02:35    426.6     8295517       79    35.8%     75.8     0.3%     7.5%     1.9%     0.0%    54.6%     0.2%
ALL DONE!
Started job on |       Jan 01 15:00:50
                             Started mapping on |       Jan 01 15:01:25
                                    Finished on |       Jan 01 15:02:35
       Mapping speed, Million of reads per hour |       426.63

                          Number of input reads |       8295517
                      Average input read length |       79
                                    UNIQUE READS:
                   Uniquely mapped reads number |       2971995
                        Uniquely mapped reads % |       35.83%
                          Average mapped length |       75.76
                       Number of splices: Total |       81528
            Number of splices: Annotated (sjdb) |       35799
                       Number of splices: GT/AG |       41246
                       Number of splices: GC/AG |       1103
                       Number of splices: AT/AC |       277
               Number of splices: Non-canonical |       38902
                      Mismatch rate per base, % |       0.30%
                         Deletion rate per base |       0.03%
                        Deletion average length |       1.67
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.29
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       618387
             % of reads mapped to multiple loci |       7.45%
        Number of reads mapped to too many loci |       154144
             % of reads mapped to too many loci |       1.86%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       4530903
                 % of reads unmapped: too short |       54.62%
                Number of reads unmapped: other |       20088
                     % of reads unmapped: other |       0.24%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
chr1    15463   15623   1       1       0       0       1       21
第1列:染色体号
第2列:内含子第一个碱基 (1-based)
第3列:内含子最后一个碱基 (1-based)
第4列:链, (0: 为定义, 1: +, 2: -)
第5列:内含子基序:0:非规范;1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
第6列:0:未注释的,1:在剪切数据库中有注释的,2,在剪切数据库中有注释的,也在GTF文件中存在;
第7列:剪切事件上只有唯一匹配的reads
第8列:剪切事件上具有多次匹配的reads
第9列:maximum spliced alignment overhang
N_unmapped      4705135 4705135 4705135
N_multimapping  618387  618387  618387
N_noFeature     2046544 2864860 2107430
N_ambiguous     78709   2231    32042
ENSG00000290825.1       0       0       0
ENSG00000223972.6       0       0       0
ENSG00000227232.5       1       0       1
column 1: gene ID
column 2: counts for unstranded RNA-seq
18
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

#原文

Github: Spliced Transcripts Alignment to a Reference
STARmanual

上一篇下一篇

猜你喜欢

热点阅读