测序数据简单分析流程

2021-03-25 本文已影响0人所以suoyi

2021/03/25

一、测序数据下载

NCBI SRA 数据库

sra

wget ftp://ftp.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz
tar -zxvf sratoolkit.current-centos_linux64.tar.gz

wget https://ftp.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -zxvf sratoolkit.current-ubuntu64.tar.gz

./prefetch SRR13948848  # 下载数据
md5sum -b  # md5校验，验证数据有没有下载完全
./fastq-dump -X 5 -Z SRR13948848  # 将sra转换为fastq

【使用说明】

二、fastqc （fastq文件）

（一）、下载和安装

下载

wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
chmod 755 fastqc
export PATH=/home/user/FastQC/:$PATH
source ~/.bashrc
fastqc --help

（二）、在 python 中使用

先写出需要的命令行

-o --outdir       输出路径
-f --format       指定输入文件的格式
--extract         生成的报告默认会打包成1个压缩文件，使用这个参数是让程序不打包
-t --threads      选择程序运行的线程数，每个线程会占用250MB内存
--min_length      设置序列的最小长度
-c --contaminants 污染物选项
Sequence          可能的污染序列
-a --adapters     adpater序列信息，默认通用引物
-q --quiet        安静运行

import os
commond = f'{fastqc_path} --extract -q -o {output_dir} {fastq1_path} {fastq2_path}'
os.system(commond)

生成文件：html 和 zip 文件

三、blat （fasta文件 ------> psl 文件）

下载

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v385/blat/blat

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/blat/blat

用法

blat database query [-ooc=11.ooc] output.psl
输入文件.fa, .nib, .2bit file
-t=type                数据库类型：dna(default)  /  prot(protein)  /  dnax(6种）
-q=type                输入类型：dna  /  rna  /  prot  /  dnax  /  rnax
-prot                  Synonymous with -t=prot -q=prot
-minIdentity=N         最小序列相似度，默认dna/rna 90,蛋白，转录本25
-out=type              输出文件格式：psl(默认) / pslx / axt / maf / sim4 / 
                                    wublast / blast / blast8 / blast9

commond = f'{blat_software} -minIdentity=80 -out=psl {database} {fasta_file} {blat_psl}'

针对psl文件表现出的比对情况，进行整理，过滤（同源性低reads，剪切掉引物等）

''' blat_psl file
match   mismatch   repmatch    N's       Q_gap_count   Q_gap_bases    T_gap_count    T_gap_bases
strand  Q_name     Q_size      Q_start   Q_end         T_name     T_size   T_start T_end   
block_count    blocksizes   qStarts tStarts
'''

四、bwa + sambamba （fastq文件 ------> bam 文件）

wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
tar jxvf bwa-0.7.17.tar.bz2
cd bwa-0.7.17
make

BWA是一个软件包，用于将低差异序列映射到大型参考基因组（例如人类基因组）上，由三种算法组成：BWA-backtrack, BWA-SW 和 BWA-MEM。

bwa mem [ -aCHMpP ] [ -t nThreads ] [ -k minSeedLen ] [ -w bandWidth ] [ -d zDropoff ] 
        [ -r seedSplitRatio ] [ -c maxOcc ] [ -A matchScore ] [ -B mmPenalty ] [ -O gapOpenPen ] 
        [ -E gapExtPen ] [ -L clipPen ] [ -U unpairPen] [ -R RGline ] [ -v verboseLevel ] 
        db.prefix reads.fq [ mates.fq ]

-t  线程数[1]    *
-k  最小种子长度[19]
-w  带宽[100]
-d  离对角线的X-dropoff（Z-dropoff)[100]
-r  值越大，产生的种子越少，导致对齐速度更快，但准确性较低。[1.5]
-c  如果MEM在基因组中发生的次数超过INT，则将其丢弃 。[10000]
-P  在配对端模式下，执行SW仅可挽救丢失的匹配，但不要尝试查找适合合适的匹配的匹配。
-A  匹配分数。[1]
-B  不匹配的罚款[4]
-O  差距开放处罚[6]
-E  间隙延长罚分[1]
-L  罚款[5]
-U  未配对读对的处罚。[9]
-p  假定第一个输入查询文件是交错的成对配对的FASTA / Q。
-R  完成读取组标题行。'\t'可以在STR中使用， 并将在输出SAM中转换为TAB。读取组ID将附加到输出中的每个读取。
    一个示例是"@RG\tID:foo\tSM:bar"。[空值]    *    
-T  不要输出分数低于INT的对齐方式 [30]    *
-a  输出所有发现的单端或未配对的成对末端读段的比对。这些比对将被标记为次要比对。
-C  将FASTA / Q注释附加到SAM输出。此选项可用于将读取的元信息（例如条形码）传输到SAM输出。
    请注意，FASTA / Q注释（标题行中的空格之后的字符串）必须符合SAM规范（例如BC：Z：CGTAC）。
    格式错误的注释会导致不正确的SAM输出。
-H  在SAM输出中使用硬限幅“ H”。映射长重叠群或BAC序列时，此选项可能会大大减少输出的冗余。
-M  将较短的拆分匹配标记为次要（以实现Picard兼容性）。   *
-v  控制输出的详细级别。整个BWA均未完全支持此选项。[3]

sambamba

git clone --recursive https://github.com/lomereiter/sambamba.git 好似不行
wget -c https://github.com/biod/sambamba/archive/v0.6.9.tar.gz
cd sambamba && make sambamba-ldmd2-64

【sambamba-view】从SAM / BAM文件中提取信息的工具

sambamba view  [OPTIONS] <input.bam|input.sam> [region1 [...]]

-F, --filter=FILTER          设置自定义过滤器
-f, --format=FORMAT          输出格式   *
-h, --with-header            打印sam标头
-H， --header                仅打印sam标头
-I， --reference-info        以JSON格式输出stdout参考序列名称和长度    *
-c, --count                  仅将匹配记录的数量输出到stdout，-hHI选项将被忽略
-v, --valid                  仅输出有效reads
-S, --sam-input              指定输入SAM文件，默认情况下，该工具将BAM文件作为输入    *
-p, --show-progress          在终端中显示进度条，仅适用于BAM文件，没有指定区域，仅在读取完整文件时有效
-l, --compression-level=COMPRESSION_LEVEL    设置BAM输出的压缩级别，范围为0到9。
-o, --output-filename=FILENAME        指定输出文件名（默认情况下，所有内容均写入stdout） 
-t, --nthreads=NTHREADS      线程数

【sambamba-sort】用于对BAM文件进行排序的工具

sambamba sort OPTIONS <input.bam>

-m, --memory-limit=LIMIT            设置已用内存的上限   *
--tmpdir=/tmp                     使用TMPDIR输出排序的块。默认行为是使用系统临时目录   *
-o, --out=OUTPUTFILE                输出文件名。如未提供，则结果写入.sorted.bam文件中   *
-n, --sort-by-name                  按读取名称排序，而不是进行坐标排序
-l, --compression-level=COMPRESSION_LEVEL     用于排序的BAM的压缩级别，从0到9
-u, --uncompressed-chunks      将排序的块写为未压缩的BAM。默认使用压缩级别1
-p, --show-progress  显示类似wget的进度条（实际上，其中两个接一个，第一个用于排序，然后另一个用于合并）
-t, --nthreads=NTHREADS    线程数   *

注：在Linux系统中，标准输入/dev/stdin , 标准输出/dev/stdout
    sambamba 把 sam 文件可以输入到 /dev/stdin 文件中，再将文件中的内容整理输出到 bam 文件中

bwa 可以和 sambamba 联合起来用，将 bwa 的结果 sam文件直接用sambamba处理成 bam
bwa_commond = bwa mem  -R "@RG\tID:{sample}\tSM{sample}\tLB:WGS\tPL:Illumina" ' genome_database fastq_file | sambamba view -S -f bam /dev/stdin |sambamba sort --tmpdir=/tmp -o bam_file /dev/stdin'
看情况添加其他选项，包括线程数，占用内存啥的

五、samtools + bcftools （bam文件 ------> pileup/vcf 文件）

samtools

wget https://sourceforge.net/projects/samtools/files/samtools/1.12/samtools-1.12.tar.bz2/download

【samtools】序列比对/图谱（SAM）格式的实用程序

samtools view [*options*] *in.sam*|*in.bam*|*in.cram* [*region*...]

-b                           以BAM格式输出
-o                           输出到文件[stdout]
-u                           输出未压缩的BAM,可节省压缩/解压缩时间，当将输出通过管道传递到另一个samtools命令时，该选项是首选的。

【samtools mpileup】从对齐方式生成“ pileup”文本格式

samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

-A，-count-orphans            不要在变体调用中跳过异常的读取对
-b，-bam-list                 输入的BAM文件列表，每行一个文件[null]
-B，--no-BAQ                  禁用基本比对质量（BAQ）计算，BAQ是读取碱基未对齐的Phred标度概率
-Q，--min-BQ                  最低基准质量
-q，--min-MQ                  最低mapping质量
-f，-fasta-ref                FAID格式 的 faidx索引参考文件
...

Samtools mpileup仍然可以产生VCF和BCF输出（带有 -g或-u），但是不建议使用此功能，并且在将来的版本中将删除该功能。请使用 bcftools mpileup（2021年3月17日发行的samtools-1.12手册页）

例：call SNPs and short indels

samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf  > var.flt.vcf

获取测序深度
samtools depth -r chr12:126073855-126073965  Ip.sorted.bam
chr12    126073855    5
chr12    126073856    15

bcftools

git clone --recurse-submodules git://github.com/samtools/htslib.git
git clone git://github.com/samtools/bcftools.git
cd bcftools
 # The following is optional:
 #   autoheader && autoconf && ./configure --enable-libgsl --enable-perl-filters
make
为了使用BCFtools插件，必须设置此环境变量并指向正确的位置：
export BCFTOOLS_PLUGINS=/path/to/bcftools/plugins

【bcftools】用于变体调用和操纵VCF和BCF的实用程序

bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]

六、freebayes （bam文件 ------> vcf 文件）

git clone --recursive git://github.com/ekg/freebayes.git
make

freebayes [option]  bam.files > vcf.files

-b --bam                            将文件添加到要分析的BAM文件集
-f --fasta-reference                分析所需参考序列
-t --targets                        对bed文件中目标进行分析
-A --cnv-map                        从bed文件中读取cnv  mapping

-v --vcf                            输出vcf格式

-C --min-alternate-count            最少 alt 碱基数，默认2
-F --min-alternate-fraction         最少 alt 碱基比例，双倍体默认 0.2
--genotype-qualities                在输出的vcf文件中有GQ
--min-mapping-quality               默认1
--min-base-quality                  默认0
...
freebayes -h                        查看文档

freebayes --fasta-reference genome_db --bam example.bam  --min-alternate-count 2 --min-alternate-fraction 0.2 --min-base-quality 20 --min-mapping-quality 1 > result.vcf

七、annovar （vcf文件 ------>avinput ------> 不同文件，关于注释）

下载需提供邮箱
tar xvfz annovar.latest.tar.gz

annotate_variation.pl                       主程序，下载数据库，注释
coding_change.pl                            推断蛋白质序列
convert2annovar.pl                          将多种格式转为.avinput
retrieve_seq_from_fasta.pl                  自行建立其他物种的转录本
table_annovar.pl                            注释程序
variants_reduction.pl                       灵活定制过滤注释流程
humandb                                     人类注释数据库

下载数据库https://annovar.openbioinformatics.org/en/latest/user-guide/download/
perl annotate_variation.pl -downdb avdblist -buildver hg38 -webfrom annovar path/to/humandb

-format vcf4                  可用于将VCF文件转换为ANNOVAR输入格式
convert2annovar.pl  -format vcf4    example.vcf > ex2.avinput
chr  start  end  ref  alt

# 基于基因的注释  https://annovar.openbioinformatics.org/en/latest/user-guide/gene/
生成两个输出文件: ex1.variant_function （包含所有变体的注释）
                 ex1.exonic_variant_function（包含由于外显子变体导致的氨基酸变化）
（要更改输出文件名，请使用--outfile参数）
# 基于区域的注释  https://annovar.openbioinformatics.org/en/latest/user-guide/region/
annotate_variation.pl --regionanno
    annotate_variation.pl -downdb phastConsElements46way humandb/
    输出保存在phastConsElements46way文件
# 基于过滤器的注释  https://annovar.openbioinformatics.org/en/latest/user-guide/filter/
annotate_variation.pl  --filter -dbtype database_name  --buildver genome_version -out 输出文件开头

    已知的变体将与等位基因频率一起写入_dropped文件
    没有匹配的数据库条目的变量将被写入_filtered文件

八、Intervar

InterVar是用于临床意义变异解释的python脚本
同作者关于Intervar

python Intervar.py -b genome_version  -i example.bed  --input_type=AVinput -d /InterVar_HumanDB -o output/dir'

九、分析（vcf 及其各项注释）

1、突变频谱图
2、突变热图
3、基因组疤痕评分
4、突变负荷
5、驱动突变
6、罕见突变与融合基因检测
7、突变过程及其突变信号
8、功能富集分析
（1）基因集富集分析
（2）路径分析和可视化
（3）网络分析

基本分析逻辑

测序数据简单分析流程

一、测序数据下载

二、fastqc （fastq文件）

（一）、下载和安装

（二）、在 python 中使用

三、blat （fasta文件 ------> psl 文件）

四、bwa + sambamba （fastq文件 ------> bam 文件）

五、samtools + bcftools （bam文件 ------> pileup/vcf 文件）

六、freebayes （bam文件 ------> vcf 文件）

七、annovar （vcf文件 ------>avinput ------> 不同文件，关于注释）

八、Intervar

九、分析（vcf 及其各项注释）

猜你喜欢

热点阅读

测序数据简单分析流程

一、测序数据下载

二、fastqc （fastq文件）

（一）、下载和安装

（二）、在 python 中 使用

三、blat （fasta文件 ------> psl 文件）

四、bwa + sambamba （fastq文件 ------> bam 文件）

五、samtools + bcftools （bam文件 ------> pileup/vcf 文件）

六、freebayes （bam文件 ------> vcf 文件）

七、annovar （vcf文件 ------>avinput ------> 不同文件，关于注释）

八、Intervar

九、分析 （vcf 及其各项注释）

猜你喜欢

热点阅读

（二）、在 python 中使用

九、分析（vcf 及其各项注释）