BS-seq生信分析工具包生信

nf-core 一个把各个组学分析流程化发了 NBT 的牛逼东西

2020-03-04  本文已影响0人  热衷组培的二货潜

链接:
The nf-core framework for community-curated bioinformatics pipelines

包括哪些流程?

链接:https://nf-co.re/pipelines

分析参考手册:

https://github.com/nf-core/methylseq/blob/master/docs/usage.md

# 输入文件
--reads 'path/to/data/sample_*_{1,2}.fastq' # 双端
--single_end --reads '*.fastq' #单端



# 参考基因组
## 内置的
人类:--genome GRCh37 和 --genome GRCh38
小鼠:--genome GRCm38

## 自己提供参考基因组序列
### Single multifasta for genome
--fasta /path/to/genome.fa

### Bismark index directory
--bismark_index /path/to/ref/BismarkIndex/

### bwa-meth index filename base
### where for example the index files are called:
### /path/to/ref/genome.fa.bwameth.c2t.bwt
--bwa_meth_index /path/to/ref/genome.fa

### Genome Fasta index file
--fasta_index /path/to/genome.fa.fai 

## 保存建立的索引信息
--save_reference



# 接头过滤设置
Parameter   5' R1 Trim  5' R2 Trim  3' R1 Trim  3' R2 Trim
--pbat                 6        9       6       9
--single_cell         6         6   6   6
--epignome        8             8   8   8
--accel           10    15  10  10
--zymo          10  15  10  10
--cegx          6   6   2   2

--rrbs # 征对于 RRBS 建库
--pbat # PBAT 建立,互补链,与参数 --directional (bismark 默认的方式)相反
--non_directional # 输出四种可能的链,注意 --single_cell 和 --zymo 模式自动设置了此参数
--skip_trimming # 如果输入文件为 clean data 可以此参数跳过这一步

--comprehensive # 输出 CHG/CHH 信息

--relax_mismatches and --num_mismatches # Bismark 默认 --num_mismatches 为 0.2,--relax_mismatches 则为 0.6;
# 0.6 will allow a penalty of bp * -0.6 - for 100bp reads, this is -60. Mismatches cost -6, gap opening -5 and gap extension -2. So, -60 would allow 10 mismatches or ~ 8 x 1-2bp indels.

--unmapped # 保存为比对上的 reads
--save_trimmed # 保存过滤掉的 reads

--min_depth # MethylDackel 输出甲基化信息掉的最小覆盖度
--meth_cutoff # Bimark 中 bismark_methylation_extractor 提取甲基化信息的最小覆盖度

--ignore_flags:MethylDackel 处理 BAM 文件时候忽略 SAM 文件 flag 值
--methyl_kit:MethylDackel 输出结果可直接输入 R 包 methylKit 进行后续分析

--known_splices:Bismark 运行 only works with `--aligner bismark_hisat`

--slamseq:Specify to run Bismark with the `--slam` flag to run bismark in SLAM-seq mode (only works with `--aligner bismark_hisat`)

--local_alignment:Specify to run Bismark with the --local flag to allow soft-clipping of reads. This should only be used with care in certain single-cell applications or PBAT libraries, which may produce chimeric read pairs

--max_memory:最大内存,例如 --max_memory '8.GB'
--max_cpus:--max_cpus 1


主要分析步骤提取

参数设置地方:

parameters.settings.json

00、基因组建立 index

# 第一种:Bismark
## $aligner 有两种:'--hisat2' : '--bowtie2',注意这两的 index 不能互用
aligner = params.aligner == 'bismark_hisat' ? '--hisat2' : '--bowtie2' # 比对软件
slam = params.slamseq ? '--slam' : '' # 测序类型

mkdir BismarkIndex
cp $fasta BismarkIndex/
bismark_genome_preparation $aligner $slam BismarkIndex

# 第二种:bwameth
bwameth.py index $fasta

插播:Bismark 比对参数
https://github.com/FelixKrueger/Bismark
Bismark alignment modes

STEP 1 - FastQC

fastqc --quiet --threads $task.cpus $reads

STEP 2 - Trim Galore!

说明书:https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md

5' Trimming can be accomplished with Trim Galore using:

--clip_r1 <NUMBER> (Read 1) or

--clip_r2 <NUMBER> (Read 2)

3' Trimming can be accomplished with Trim Galore using:

--three_prime_clip_r1 <NUMBER> (Read 1) or

--three_prime_clip_r2 <NUMBER> (Read 2).
c_r1 = params.clip_r1 > 0 ? "--clip_r1 ${params.clip_r1}" : ''
c_r2 = params.clip_r2 > 0 ? "--clip_r2 ${params.clip_r2}" : ''
tpc_r1 = params.three_prime_clip_r1 > 0 ? "--three_prime_clip_r1 ${params.three_prime_clip_r1}" : ''
tpc_r2 = params.three_prime_clip_r2 > 0 ? "--three_prime_clip_r2 ${params.three_prime_clip_r2}" : ''
rrbs = params.rrbs ? "--rrbs" : '' # 有几种模式:`--rrbs`/`--non_directional`
# $reads 输入的 fq 文件
# 单端
trim_galore --fastqc --gzip $rrbs $c_r1 $tpc_r1 $reads

# 双端
trim_galore --paired --fastqc --gzip $rrbs $c_r1 $c_r2 $tpc_r1 $tpc_r2 $reads

使用 Bismark 流程

STEP 3.1 - align with Bismark

aligner = params.aligner == "bismark_hisat" ? "--hisat2" : "--bowtie2"
pbat = params.pbat ? "--pbat" : ''
non_directional = params.single_cell || params.zymo || params.non_directional ? "--non_directional" : ''
unmapped = params.unmapped ? "--unmapped" : ''
mismatches = params.relax_mismatches ? "--score_min L,0,-${params.num_mismatches}" : ''
soft_clipping = params.local_alignment ? "--local" : ''
splicesites = params.aligner == "bismark_hisat" && knownsplices.name != 'null' ? "--known-splicesite-infile <(hisat2_extract_splice_sites.py ${knownsplices})" : ''
multicore = "--multicore $ccore" # 这一步有根据的你的服务器计算能力计算,详情看原文, $ccore 表示线程数

bismark $input \\
            $aligner \\
            --bam $pbat $non_directional $unmapped $mismatches $multicore \\
            --genome $index \\
            $reads \\
            $soft_clipping \\
            $splicesites

STEP 4 - Bismark deduplicate

fq_type = params.single_end ? '-s' : '-p' # 单端 `-s`; 双端 `-p`

deduplicate_bismark $fq_type --bam $bam

STEP 5 - Bismark methylation extraction

插播:https://github.com/FelixKrueger/Bismark/tree/master/Docs

comprehensive = params.comprehensive ? '--comprehensive --merge_non_CpG' : ''
meth_cutoff = params.meth_cutoff ? "--cutoff ${params.meth_cutoff}" : ''
# 测试 CPU
ccore = ((task.cpus as int) / 10) as int
multicore = "--multicore $ccore"
# 测试内存
mbuffer = (task.memory as nextflow.util.MemoryUnit) - 2.GB  
buffer = "--buffer_size ${mbuffer.toGiga()}G"   # if( mbuffer.compareTo(4.GB) == 1 )

## 如果是植物请加上 --CX 不然不会输出 CHG\CHH 甲基化信息
# 单端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
                $multicore $buffer \\
                --bedGraph \\
                --counts \\
                --gzip \\
                -s \\
                --report \\
                $bam


# 双端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
                $multicore $buffer \\
                --ignore_r2 2 \\
                --ignore_3prime_r2 2 \\
                --bedGraph \\
                --counts \\
                --gzip \\
                -p \\
                --no_overlap \\
                --report \\
                $bam

STEP 6 - Bismark Sample Report

# set val(name), file(align_log), file(dedup_log), file(splitting_report), file(mbias) from ch_bismark_logs_for_bismark_report

bismark2report \\
            --alignment_report $align_log \\
            --dedup_report $dedup_log \\
            --splitting_report $splitting_report \\
            --mbias_report $mbias

STEP 7 - Bismark Summary Report

bismark2summary # 直接运行此就行

使用 bwa-meth 流程

STEP 3.1 - align with bwa-meth

fasta = bwa_meth_indices[0].toString() - '.bwameth' - '.c2t' - '.amb' - '.ann' - '.bwt' - '.pac' - '.sa'
prefix = reads[0].toString() - ~/(_R1)?(_trimmed)?(_val_1)?(\.fq)?(\.fastq)?(\.gz)?$/

bwameth.py \\
            --threads ${task.cpus} \\
            --reference $fasta \\
            $reads | samtools view -bS - > ${prefix}.bam

STEP 4.- samtools flagstat on samples

samtools sort $bam \\
            -@ ${task.cpus} $sort_mem \\
            -o ${bam.baseName}.sorted.bam

samtools index ${bam.baseName}.sorted.bam
samtools flagstat ${bam.baseName}.sorted.bam > ${bam.baseName}_flagstat_report.txt
samtools stats ${bam.baseName}.sorted.bam > ${bam.baseName}_stats_report.txt

STEP 5 - Mark duplicates with picard

picard -Xmx${avail_mem}g MarkDuplicates \\
                INPUT=$bam \\
                OUTPUT=${bam.baseName}.markDups.bam \\
                METRICS_FILE=${bam.baseName}.markDups_metrics.txt \\
                REMOVE_DUPLICATES=false \\
                ASSUME_SORTED=true \\
                PROGRAM_RECORD_ID='null' \\
                VALIDATION_STRINGENCY=LENIENT

samtools index ${bam.baseName}.markDups.bam

STEP 6 - extract methylation with MethylDackel

all_contexts = params.comprehensive ? '--CHG --CHH' : ''
min_depth = params.min_depth > 0 ? "--minDepth ${params.min_depth}" : ''
ignore_flags = params.ignore_flags ? "--ignoreFlags" : ''
methyl_kit = params.methyl_kit ? "--methylKit" : ''

MethylDackel extract $all_contexts $ignore_flags $methyl_kit $min_depth $fasta $bam
MethylDackel mbias $all_contexts $ignore_flags $fasta $bam ${bam.baseName} --txt > ${bam.baseName}_methyldackel.txt

STEP 8 - Qualimap

samtools sort $bam \\
        -@ ${task.cpus} $sort_mem \\
        -o ${bam.baseName}.sorted.bam

qualimap bamqc $gcref \\
        -bam ${bam.baseName}.sorted.bam \\
        -outdir ${bam.baseName}_qualimap \\
        --collect-overlap-pairs \\
        --java-mem-size=${task.memory.toGiga()}G \\
        -nt ${task.cpus}

STEP 9 - preseq

samtools sort $bam \\
        -@ ${task.cpus} $sort_mem \\
        -o ${bam.baseName}.sorted.bam

preseq lc_extrap -v -B ${bam.baseName}.sorted.bam -o ${bam.baseName}.ccurve.txt

STEP 10 - MultiQC

rtitle = custom_runName ? "--title \"$custom_runName\"" : ''
rfilename = custom_runName ? "--filename " + custom_runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report" : ''

multiqc -f $rtitle $rfilename --config $multiqc_config . \\
        -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

STEP 11 - Output Description HTML

上一篇下一篇

猜你喜欢

热点阅读