Installing and Using KneadData
2021-03-31
斗战胜佛oh
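Introduction
[Note] Raw sequencing data must be quality-controlled before HUMAnN and MetaPhlAn analysis.
KneadData is a quality-control pipeline for metagenomic and metatranscriptomic sequencing data. Its main functions are read trimming with Trimmomatic and removal of host and other contaminant sequences by aligning reads with bowtie2 against the corresponding reference genome databases.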
Installation
## Create a conda environment and install kneaddata
conda create -n kneaddata
conda activate kneaddata
conda install -c biobakery kneaddata
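Before moving on, it can help to confirm that the executables KneadData calls are visible inside the new environment. The sketch below is not part of KneadData: `check_tools` is a hypothetical helper, and the command names (`kneaddata`, `bowtie2`, `trimmomatic`, `trf`) are assumptions about what the conda packages put on your PATH, so adjust them if your installation differs.

```shell
# check_tools: hypothetical helper, reports which commands are missing from PATH.
# Returns 0 when every listed command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names are assumptions; adjust them to your installation.
check_tools kneaddata bowtie2 trimmomatic trf || echo "some tools are not on PATH"
```

If a tool is reported missing, its location can be passed explicitly with the matching path option (`--trimmomatic`, `--bowtie2`, `--trf`) listed in the appendix.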
Downloading the databases
mkdir kneaddata_database
cd kneaddata_database/
kneaddata_database --download human_genome bowtie2 ./
kneaddata_database --download mouse_C57BL bowtie2 ./
kneaddata_database --download human_transcriptome bowtie2 ./
kneaddata_database --download ribosomal_RNA bowtie2 ./
## Extract each database archive into its own directory
mkdir kneaddata_db_DATABASE_NAME
tar -zxvf DATABASE_NAME.tar.gz -C ./kneaddata_db_DATABASE_NAME/
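With several archives, the two commands above can be repeated in a loop. This is a minimal sketch, assuming the downloaded archives sit in one directory and end in `.tar.gz`; `extract_all` is a hypothetical helper name, and the `kneaddata_db_` directory prefix simply follows the naming used above.

```shell
# extract_all: hypothetical helper, unpacks every *.tar.gz in a directory
# into its own kneaddata_db_<archive name> subdirectory.
extract_all() {
    dir=${1:-.}
    for archive in "$dir"/*.tar.gz; do
        [ -e "$archive" ] || continue      # no archives: the glob stays literal, skip it
        name=$(basename "$archive" .tar.gz)
        mkdir -p "$dir/kneaddata_db_$name"
        tar -zxf "$archive" -C "$dir/kneaddata_db_$name"
        echo "extracted $archive -> $dir/kneaddata_db_$name"
    done
}

# Usage (run inside kneaddata_database/):
# extract_all .
```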
Building a custom database (optional)
# bowtie2-build <reference> <db-name>
mkdir kneaddata_db_Rnor_6
cd kneaddata_db_Rnor_6
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0/GCF_000001895.5_Rnor_6.0_genomic.fna.gz
bowtie2-build GCF_000001895.5_Rnor_6.0_genomic.fna.gz Rnor_6 --threads 8
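Index building for a mammalian genome takes a while, so it is worth checking that it finished before pointing kneaddata at the result. The sketch below looks for the six files that bowtie2-build writes for an index prefix (`.bt2`, or `.bt2l` for large indexes). `has_bt2_index` is a hypothetical helper, and the `./Rnor_6` prefix assumes you are still inside kneaddata_db_Rnor_6.

```shell
# has_bt2_index: hypothetical helper, checks that the six Bowtie2 index files
# exist for a given prefix (.bt2 for small indexes, .bt2l for large ones).
has_bt2_index() {
    prefix=$1
    for ext in bt2 bt2l; do
        ok=1
        for part in 1 2 3 4 rev.1 rev.2; do
            [ -f "$prefix.$part.$ext" ] || ok=0
        done
        [ "$ok" -eq 1 ] && return 0
    done
    return 1
}

if has_bt2_index ./Rnor_6; then
    echo "Rnor_6 index looks complete"
else
    echo "Rnor_6 index is missing or incomplete"
fi
```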
Running
## Set the database paths; list them all, or define only the ones you need
KNEADDATA_DB_MOUSE_C57BL_6NJ=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_mouse_C57BL_6NJ/mouse_C57BL_6NJ
KNEADDATA_DB_HUMAN_GENOME=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_human_genome/Homo_sapiens
KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_human_transcriptome/human_hg38_refMrna
KNEADDATA_DB_RIBOSOMAL_RNA=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_ribosomal_rna/SILVA_128_LSUParc_SSUParc_ribosomal_RNA
KNEADDATA_DB_RNOR_6=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_Rnor_6/Rnor_6
## Single-end data
kneaddata -i read-1.fastq.gz -o ./result-1 -t 20 -p 20 -db $KNEADDATA_DB_HUMAN_GENOME
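## Paired-end data
## When filtering against several databases, adding '--serial' is recommended: it chains the databases so that the output of one search becomes the input of the next. Adding '--cat-final-output' is also recommended: it merges the final output files, which simplifies downstream analysis.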
kneaddata -i 36_1.out.fq.gz -i 36_2.out.fq.gz -o ./kneaddata_36 --output-prefix 36 -t 20 -p 20 -db /work_directory/work_result/kneaddata_database/kneaddata_db_DATABASE_NAME --trimmomatic /soft_directory/Software/Trimmomatic-0.39
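The command above uses a single database and does not include `--serial` or `--cat-final-output`. As an illustration of the multi-database advice, the sketch below assembles a command with one `-db` per database plus both flags, and only prints it (a dry run). `build_kneaddata_cmd` is a hypothetical helper; the thread and process counts are copied from the example above, the fallback `/path/to/...` values are placeholders, and paths are assumed to contain no spaces.

```shell
# build_kneaddata_cmd: hypothetical helper that prints a paired-end kneaddata
# command for one sample and any number of databases (paths must not contain spaces).
build_kneaddata_cmd() {
    sample=$1
    r1=$2
    r2=$3
    shift 3
    cmd="kneaddata -i $r1 -i $r2 -o ./kneaddata_$sample --output-prefix $sample -t 20 -p 20"
    for db in "$@"; do
        cmd="$cmd -db $db"                 # one -db per reference database
    done
    echo "$cmd --serial --cat-final-output"
}

# Dry run for sample 36 against two databases; the fallback paths are placeholders.
build_kneaddata_cmd 36 36_1.out.fq.gz 36_2.out.fq.gz \
    "${KNEADDATA_DB_HUMAN_GENOME:-/path/to/human_genome_db}" \
    "${KNEADDATA_DB_RIBOSOMAL_RNA:-/path/to/ribosomal_rna_db}"
```

Once the printed command looks right, it can be executed by piping the output to `sh`.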
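Post-QC statistics
After QC finishes, you can summarize the read counts at each step with kneaddata_read_count_table. Its input is the log file written to the output directory, which records the QC process. For multiple independent samples, collect their log files in a single directory and the tool will summarize every log file in it.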
kneaddata_read_count_table --input log_file/ --output kneaddata_read_count_table.tsv
Result:
Column names containing "pair" give the number of paired reads; "orphan" gives the number of reads left without a mate after filtering.
(Transposed for readability; in the TSV file each sample is one row and each entry below is one column.)
Column                                                          Sample1       Sample2
raw pair1                                                       34911652.0    34680773.0
raw pair2                                                       34911652.0    34680773.0
trimmed pair1                                                   34908958.0    34678094.0
trimmed pair2                                                   34908958.0    34678094.0
trimmed orphan1                                                 400.0         369.0
trimmed orphan2                                                 2294.0        2310.0
decontaminated Homo_sapiens pair1                               34605696.0    34380372.0
decontaminated Homo_sapiens pair2                               34605696.0    34380372.0
decontaminated Rnor_6 pair1                                     34602788.0    34378883.0
decontaminated Rnor_6 pair2                                     34602788.0    34378883.0
decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA pair1    34602777.0    34378878.0
decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA pair2    34602777.0    34378878.0
decontaminated Homo_sapiens orphan1                             111066.0      109089.0
decontaminated Homo_sapiens orphan2                             114822.0      112894.0
decontaminated Rnor_6 orphan1                                   111600.0      109539.0
decontaminated Rnor_6 orphan2                                   115489.0      113322.0
decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA orphan1  111605.0      109542.0
decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA orphan2  115495.0      113324.0
final pair1                                                     34602777.0    34378878.0
final pair2                                                     34602777.0    34378878.0
final orphan1                                                   110989.0      109076.0
final orphan2                                                   114745.0      112872.0
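A quick way to read this table is the share of read pairs that survive the whole pipeline, i.e. "final pair1" divided by "raw pair1" (roughly 99.1% for Sample1 in the numbers above). The sketch below computes that per sample with awk. `retained_fraction` is a hypothetical helper, and it assumes the file is tab-separated with exactly those two header names, as in the output shown here.

```shell
# retained_fraction: hypothetical helper, prints "<sample><TAB><percent>" where
# percent = 100 * (final pair1) / (raw pair1), locating both columns by header name.
retained_fraction() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "raw pair1")   raw = i
                if ($i == "final pair1") fin = i
            }
            next
        }
        raw && fin && $raw > 0 { printf "%s\t%.2f\n", $1, 100 * $fin / $raw }
    ' "$1"
}

# Usage:
# retained_fraction kneaddata_read_count_table.tsv
```

Looking columns up by name rather than by position keeps the helper usable when the number of databases, and therefore the number of "decontaminated" columns, changes.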
Appendix: parameter details
usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR [-db REFERENCE_DB] [--bypass-trim] [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>] [-q {phred33,phred64}] [--run-bmtagger] [--bypass-trf] [--run-fastqc-start]
                 [--run-fastqc-end] [--store-temp-output] [--remove-intermediate-output] [--cat-final-output] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG] [--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
                 [--trimmomatic-options TRIMMOMATIC_OPTIONS] [--sequencer-source {NexteraPE,TruSeq2,TruSeq3}] [--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS] [--no-discordant] [--reorder] [--serial]
                 [--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH] [--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI] [--minscore MINSCORE] [--maxperiod MAXPERIOD] [--fastqc FASTQC_PATH]
KneadData
optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         additional output is printed
global options:
  --version             show program's version number and exit
  -i INPUT, --input INPUT
                        input FASTQ file (add a second argument instance to run with paired input files)
  -o OUTPUT_DIR, --output OUTPUT_DIR
                        directory to write output files
  -db REFERENCE_DB, --reference-db REFERENCE_DB
                        location of reference database (additional arguments add databases)
  --bypass-trim         bypass the trim step
  --output-prefix OUTPUT_PREFIX
                        prefix for all output files
                        [ DEFAULT : $SAMPLE_kneaddata ]
  -t <1>, --threads <1>
                        number of threads
                        [ Default : 1 ]
  -p <1>, --processes <1>
                        number of processes
                        [ Default : 1 ]
  -q {phred33,phred64}, --quality-scores {phred33,phred64}
                        quality scores
                        [ DEFAULT : phred33 ]
  --run-bmtagger        run BMTagger instead of Bowtie2 to identify contaminant reads
  --bypass-trf          option to bypass the removal of tandem repeats
  --run-fastqc-start    run fastqc at the beginning of the workflow
  --run-fastqc-end      run fastqc at the end of the workflow
  --store-temp-output   store temp output files
                        [ DEFAULT : temp output files are removed ]
  --remove-intermediate-output
                        remove intermediate output files
                        [ DEFAULT : intermediate output files are stored ]
  --cat-final-output    concatenate all final output files
                        [ DEFAULT : final output is not concatenated ]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of log messages
                        [ DEFAULT : DEBUG ]
  --log LOG             log file
                        [ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]
trimmomatic arguments:
  --trimmomatic TRIMMOMATIC_PATH
                        path to trimmomatic
                        [ DEFAULT : $PATH ]
  --max-memory MAX_MEMORY
                        max amount of memory
                        [ DEFAULT : 500m ]
  --trimmomatic-options TRIMMOMATIC_OPTIONS
                        options for trimmomatic
                        [ DEFAULT : ILLUMINACLIP:/-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 ]
                        MINLEN is set to 50 percent of total input read length
  --sequencer-source {NexteraPE,TruSeq2,TruSeq3}
                        options for sequencer-source
                        [ DEFAULT : NexteraPE]
bowtie2 arguments:
  --bowtie2 BOWTIE2_PATH
                        path to bowtie2
                        [ DEFAULT : $PATH ]
  --bowtie2-options BOWTIE2_OPTIONS
                        options for bowtie2
                        [ DEFAULT : --very-sensitive ]
  --no-discordant       do not include discordant alignments for pairs (ie one of the two pairs aligns)
                        [ DEFAULT : Discordant alignments are included ]
  --reorder             order the sequences in the same order as the input
                        [ DEFAULT : With discordant paired alignments sequences are not ordered ]
  --serial              filter the input in serial for multiple databases so a subset of reads are processed in each database search
bmtagger arguments:
  --bmtagger BMTAGGER_PATH
                        path to BMTagger
                        [ DEFAULT : $PATH ]
trf arguments:
  --trf TRF_PATH        path to TRF
                        [ DEFAULT : $PATH ]
  --match MATCH         matching weight
                        [ DEFAULT : 2 ]
  --mismatch MISMATCH   mismatching penalty
                        [ DEFAULT : 7 ]
  --delta DELTA         indel penalty
                        [ DEFAULT : 7 ]
  --pm PM               match probability
                        [ DEFAULT : 80 ]
  --pi PI               indel probability
                        [ DEFAULT : 10 ]
  --minscore MINSCORE   minimum alignment score to report
                        [ DEFAULT : 50 ]
  --maxperiod MAXPERIOD
                        maximum period size to report
                        [ DEFAULT : 500 ]
fastqc arguments:
  --fastqc FASTQC_PATH  path to fastqc
                        [ DEFAULT : $PATH ]
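The help text notes that the default MINLEN is 50 percent of the input read length. To see what value that implies for a given file, the sketch below reads the first record of a FASTQ stream and prints half of its sequence length. `half_read_length` is a hypothetical helper; it looks only at the first read, so it assumes uniform read lengths, and it is not necessarily how KneadData itself derives the value.

```shell
# half_read_length: hypothetical helper, reads FASTQ on stdin and prints half
# the length of the first sequence (integer division), then stops.
half_read_length() {
    awk 'NR == 2 { print int(length($0) / 2); exit }'
}

# Usage with a gzip-compressed file:
# gzip -cd read-1.fastq.gz | half_read_length
```

The printed number can be used when writing an explicit `--trimmomatic-options` string, for example to keep the same MINLEN while changing the sliding-window settings.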
References
https://github.com/biobakery/kneaddata
https://github.com/biobakery/biobakery/wiki/kneaddata