nf-core 一个把各个组学分析流程化发了 NBT 的牛逼东西
2020-03-04 本文已影响0人
热衷组培的二货潜
链接:
The nf-core framework for community-curated bioinformatics pipelines
包括哪些流程?
- nascent :Nascent Transcription Processing Pipeline
- proteomicslfq:UNDER CONSTRUCTION - Proteomics label-free quantification (LFQ) analysis pipeline
- RNA-seq:RNA sequencing analysis pipeline using STAR, HISAT2 and Salmon with gene counts and quality control
- ampliseq: 16S rRNA amplicon sequencing analysis workflow using QIIME2
- CAGE-seq: CAGE-sequencing analysis pipeline with trimming, alignment and counting of CAGE tags.
- nanoseq: UNDER CONSTRUCTION: Nanopore demultiplexing, QC and alignment pipeline
- sRNA-seq: A small-RNA sequencing analysis pipeline
- sarek: Analysis pipeline to detect germline or somatic variants from WGS / targeted sequencing
- bacass: Simple bacterial assembly and annotation pipeline
- ChIP-seq: ChIP-seq peak-calling, QC and differential analysis pipeline.
- ATAC-seq: ATAC-seq peak-calling, QC and differential analysis pipeline
- MNase-seq: UNDER CONSTRUCTION: MNase-seq analysis pipeline using BWA and DANPOS2.
- SmartSeq2: A pipeline for processing single cell RNA-seq data generated with the SmartSeq2 protocol.
- scrnaseq: A single-cell RNAseq pipeline for 10X genomics data
- RNAfusion: RNA-seq analysis pipeline for detection gene-fusions
- BS-seq: Methylation (Bisulfite-Sequencing) analysis pipeline using Bismark or bwa-meth + MethylDackel
- lncpipe: UNDER DEVELOPMENT--- Analysis of long non-coding RNAs from RNA-seq datasets
- Hi-C: Analysis of Chromosome Conformation Capture data (Hi-C)
- GUIDE-Seq: UNDER CONSTRUCTION - CRISPR-Cas off-target identification using GUIDE-Seq
- slamseq: UNDER CONSTRUCTION: Slam-seq processing and analysis pipeline
- hlatyping: Precision HLA typing from next-generation sequencing data
-
exoseq : Please consider using/contributing to https://github.com/nf-core/sarek
.....
以上仅列出我想列出的流程,还有很多。下面我拆分一个流程(BS-seq),看看主要步骤:
从上图可以看到主要分为两种方法。一种从 Bismark,一种用 bwa-meth。
分析参考手册:
https://github.com/nf-core/methylseq/blob/master/docs/usage.md
# 输入文件
--reads 'path/to/data/sample_*_{1,2}.fastq' # 双端
--single_end --reads '*.fastq' #单端
# 参考基因组
## 内置的
人类:--genome GRCh37 和 --genome GRCh38
小鼠:--genome GRCm38
## 自己提供参考基因组序列
### Single multifasta for genome
--fasta /path/to/genome.fa
### Bismark index directory
--bismark_index /path/to/ref/BismarkIndex/
### bwa-meth index filename base
### where for example the index files are called:
### /path/to/ref/genome.fa.bwameth.c2t.bwt
--bwa_meth_index /path/to/ref/genome.fa
### Genome Fasta index file
--fasta_index /path/to/genome.fa.fai
## 保存建立的索引信息
--save_reference
# 接头过滤设置
Parameter 5' R1 Trim 5' R2 Trim 3' R1 Trim 3' R2 Trim
--pbat 6 9 6 9
--single_cell 6 6 6 6
--epignome 8 8 8 8
--accel 10 15 10 10
--zymo 10 15 10 10
--cegx 6 6 2 2
--rrbs # 征对于 RRBS 建库
--pbat # PBAT 建立,互补链,与参数 --directional (bismark 默认的方式)相反
--non_directional # 输出四种可能的链,注意 --single_cell 和 --zymo 模式自动设置了此参数
--skip_trimming # 如果输入文件为 clean data 可以此参数跳过这一步
--comprehensive # 输出 CHG/CHH 信息
--relax_mismatches and --num_mismatches # Bismark 默认 --num_mismatches 为 0.2,--relax_mismatches 则为 0.6;
# 0.6 will allow a penalty of bp * -0.6 - for 100bp reads, this is -60. Mismatches cost -6, gap opening -5 and gap extension -2. So, -60 would allow 10 mismatches or ~ 8 x 1-2bp indels.
--unmapped # 保存为比对上的 reads
--save_trimmed # 保存过滤掉的 reads
--min_depth # MethylDackel 输出甲基化信息掉的最小覆盖度
--meth_cutoff # Bimark 中 bismark_methylation_extractor 提取甲基化信息的最小覆盖度
--ignore_flags:MethylDackel 处理 BAM 文件时候忽略 SAM 文件 flag 值
--methyl_kit:MethylDackel 输出结果可直接输入 R 包 methylKit 进行后续分析
--known_splices:Bismark 运行 only works with `--aligner bismark_hisat`
--slamseq:Specify to run Bismark with the `--slam` flag to run bismark in SLAM-seq mode (only works with `--aligner bismark_hisat`)
--local_alignment:Specify to run Bismark with the --local flag to allow soft-clipping of reads. This should only be used with care in certain single-cell applications or PBAT libraries, which may produce chimeric read pairs
--max_memory:最大内存,例如 --max_memory '8.GB'
--max_cpus:--max_cpus 1
主要分析步骤提取
参数设置地方:
00、基因组建立 index
# 第一种:Bismark
## $aligner 有两种:'--hisat2' : '--bowtie2',注意这两的 index 不能互用
aligner = params.aligner == 'bismark_hisat' ? '--hisat2' : '--bowtie2' # 比对软件
slam = params.slamseq ? '--slam' : '' # 测序类型
mkdir BismarkIndex
cp $fasta BismarkIndex/
bismark_genome_preparation $aligner $slam BismarkIndex
# 第二种:bwameth
bwameth.py index $fasta
插播:Bismark 比对参数
https://github.com/FelixKrueger/Bismark
Bismark alignment modes
STEP 1 - FastQC
fastqc --quiet --threads $task.cpus $reads
STEP 2 - Trim Galore!
说明书:https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md
5' Trimming can be accomplished with Trim Galore using:
--clip_r1 <NUMBER> (Read 1) or
--clip_r2 <NUMBER> (Read 2)
3' Trimming can be accomplished with Trim Galore using:
--three_prime_clip_r1 <NUMBER> (Read 1) or
--three_prime_clip_r2 <NUMBER> (Read 2).
c_r1 = params.clip_r1 > 0 ? "--clip_r1 ${params.clip_r1}" : ''
c_r2 = params.clip_r2 > 0 ? "--clip_r2 ${params.clip_r2}" : ''
tpc_r1 = params.three_prime_clip_r1 > 0 ? "--three_prime_clip_r1 ${params.three_prime_clip_r1}" : ''
tpc_r2 = params.three_prime_clip_r2 > 0 ? "--three_prime_clip_r2 ${params.three_prime_clip_r2}" : ''
rrbs = params.rrbs ? "--rrbs" : '' # 有几种模式:`--rrbs`/`--non_directional`
# $reads 输入的 fq 文件
# 单端
trim_galore --fastqc --gzip $rrbs $c_r1 $tpc_r1 $reads
# 双端
trim_galore --paired --fastqc --gzip $rrbs $c_r1 $c_r2 $tpc_r1 $tpc_r2 $reads
使用 Bismark 流程
STEP 3.1 - align with Bismark
aligner = params.aligner == "bismark_hisat" ? "--hisat2" : "--bowtie2"
pbat = params.pbat ? "--pbat" : ''
non_directional = params.single_cell || params.zymo || params.non_directional ? "--non_directional" : ''
unmapped = params.unmapped ? "--unmapped" : ''
mismatches = params.relax_mismatches ? "--score_min L,0,-${params.num_mismatches}" : ''
soft_clipping = params.local_alignment ? "--local" : ''
splicesites = params.aligner == "bismark_hisat" && knownsplices.name != 'null' ? "--known-splicesite-infile <(hisat2_extract_splice_sites.py ${knownsplices})" : ''
multicore = "--multicore $ccore" # 这一步有根据的你的服务器计算能力计算,详情看原文, $ccore 表示线程数
bismark $input \\
$aligner \\
--bam $pbat $non_directional $unmapped $mismatches $multicore \\
--genome $index \\
$reads \\
$soft_clipping \\
$splicesites
STEP 4 - Bismark deduplicate
fq_type = params.single_end ? '-s' : '-p' # 单端 `-s`; 双端 `-p`
deduplicate_bismark $fq_type --bam $bam
STEP 5 - Bismark methylation extraction
comprehensive = params.comprehensive ? '--comprehensive --merge_non_CpG' : ''
meth_cutoff = params.meth_cutoff ? "--cutoff ${params.meth_cutoff}" : ''
# 测试 CPU
ccore = ((task.cpus as int) / 10) as int
multicore = "--multicore $ccore"
# 测试内存
mbuffer = (task.memory as nextflow.util.MemoryUnit) - 2.GB
buffer = "--buffer_size ${mbuffer.toGiga()}G" # if( mbuffer.compareTo(4.GB) == 1 )
## 如果是植物请加上 --CX 不然不会输出 CHG\CHH 甲基化信息
# 单端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
$multicore $buffer \\
--bedGraph \\
--counts \\
--gzip \\
-s \\
--report \\
$bam
# 双端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
$multicore $buffer \\
--ignore_r2 2 \\
--ignore_3prime_r2 2 \\
--bedGraph \\
--counts \\
--gzip \\
-p \\
--no_overlap \\
--report \\
$bam
STEP 6 - Bismark Sample Report
# set val(name), file(align_log), file(dedup_log), file(splitting_report), file(mbias) from ch_bismark_logs_for_bismark_report
bismark2report \\
--alignment_report $align_log \\
--dedup_report $dedup_log \\
--splitting_report $splitting_report \\
--mbias_report $mbias
STEP 7 - Bismark Summary Report
bismark2summary # 直接运行此就行
使用 bwa-meth 流程
STEP 3.1 - align with bwa-meth
fasta = bwa_meth_indices[0].toString() - '.bwameth' - '.c2t' - '.amb' - '.ann' - '.bwt' - '.pac' - '.sa'
prefix = reads[0].toString() - ~/(_R1)?(_trimmed)?(_val_1)?(\.fq)?(\.fastq)?(\.gz)?$/
bwameth.py \\
--threads ${task.cpus} \\
--reference $fasta \\
$reads | samtools view -bS - > ${prefix}.bam
STEP 4.- samtools flagstat on samples
samtools sort $bam \\
-@ ${task.cpus} $sort_mem \\
-o ${bam.baseName}.sorted.bam
samtools index ${bam.baseName}.sorted.bam
samtools flagstat ${bam.baseName}.sorted.bam > ${bam.baseName}_flagstat_report.txt
samtools stats ${bam.baseName}.sorted.bam > ${bam.baseName}_stats_report.txt
STEP 5 - Mark duplicates with picard
picard -Xmx${avail_mem}g MarkDuplicates \\
INPUT=$bam \\
OUTPUT=${bam.baseName}.markDups.bam \\
METRICS_FILE=${bam.baseName}.markDups_metrics.txt \\
REMOVE_DUPLICATES=false \\
ASSUME_SORTED=true \\
PROGRAM_RECORD_ID='null' \\
VALIDATION_STRINGENCY=LENIENT
samtools index ${bam.baseName}.markDups.bam
STEP 6 - extract methylation with MethylDackel
all_contexts = params.comprehensive ? '--CHG --CHH' : ''
min_depth = params.min_depth > 0 ? "--minDepth ${params.min_depth}" : ''
ignore_flags = params.ignore_flags ? "--ignoreFlags" : ''
methyl_kit = params.methyl_kit ? "--methylKit" : ''
MethylDackel extract $all_contexts $ignore_flags $methyl_kit $min_depth $fasta $bam
MethylDackel mbias $all_contexts $ignore_flags $fasta $bam ${bam.baseName} --txt > ${bam.baseName}_methyldackel.txt
STEP 8 - Qualimap
samtools sort $bam \\
-@ ${task.cpus} $sort_mem \\
-o ${bam.baseName}.sorted.bam
qualimap bamqc $gcref \\
-bam ${bam.baseName}.sorted.bam \\
-outdir ${bam.baseName}_qualimap \\
--collect-overlap-pairs \\
--java-mem-size=${task.memory.toGiga()}G \\
-nt ${task.cpus}
STEP 9 - preseq
samtools sort $bam \\
-@ ${task.cpus} $sort_mem \\
-o ${bam.baseName}.sorted.bam
preseq lc_extrap -v -B ${bam.baseName}.sorted.bam -o ${bam.baseName}.ccurve.txt
STEP 10 - MultiQC
rtitle = custom_runName ? "--title \"$custom_runName\"" : ''
rfilename = custom_runName ? "--filename " + custom_runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report" : ''
multiqc -f $rtitle $rfilename --config $multiqc_config . \\
-m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc