khaper: removing redundant sequences from a genome assembly
2022-07-10
斩毛毛
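This tool is mainly used to remove redundant (duplicated) sequences from a genome assembly.
See the project's GitHub repository for details.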
Note
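- The software needs more than 100 GB of memory
- The software requires a genome size below 4 Gb
Quick start
Basic idea: build a k-mer frequency table from Illumina reads at more than 40x coverage, then use that table to remove redundant sequences from the assembly.
1. Download khaper directly from GitHub. It depends on jellyfish, which can be installed with conda: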
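Before building the table it can help to check that the reads really reach the 40x mentioned above. The snippet below is a minimal sketch, not part of khaper: the function names `count_bases` and `estimate_coverage` are made up for illustration, the FASTQ files are assumed to use plain 4-line records, and the genome size is whatever estimate you already have.

```shell
# Sum the bases in one or more gzipped FASTQ files.
# Assumes 4-line records, so the sequence is line 2 of every 4.
count_bases() {
    gzip -dc "$@" | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Coverage = total read bases / estimated genome size (both in bp),
# printed with one decimal place.
estimate_coverage() {
    awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bases / size }'
}
```

For example, `estimate_coverage "$(count_bases reads1.gz reads2.gz)" 500000000` prints the coverage for an assumed 500 Mb genome.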
conda install -c bioconda jellyfish
2. Data preparation
- assemble.fasta
- reads1.gz
- reads2.gz
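Since the k-mer step runs for a long time, a quick sanity check of the three inputs up front can save a failed run. This is only a sketch with a hypothetical helper name (`check_inputs`); it checks that each file exists and is non-empty, and that files ending in `.gz` pass `gzip -t`.

```shell
# Return non-zero if any input is missing, empty, or a corrupt .gz file.
check_inputs() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        elif [ "${f##*.}" = "gz" ] && ! gzip -t "$f" 2>/dev/null; then
            echo "not a valid gzip file: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `check_inputs assemble.fasta reads1.gz reads2.gz && echo "inputs look fine"`.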
3. Build the k-mer frequency table
ls *.gz > fq.lst
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
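# Parameters:
-m minimum number of occurrences for a k-mer to be kept
-i list file of FASTQ files (fq.lst above)
-k k-mer size (use the same value later for --kmer in remDup.pl)
-d output directory
-s steps to run, comma-separated, as listed below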
1: count k-mer by jellyfish
2: record unique k-mer into .h5 file
3: record unique k-mer into .bit file
4: record all k-mer into .h5 file
5: record all k-mer into .bit file
6: record all k-mer into .bit file with -m set to 0.5 x the peak
7: get the genome size, repeat rate and heterozygosity rate
Note:
# For k=17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3,5 -d Kmer_17
# For k>17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 23 -s 1,2,4 -d Kmer_23
#######################################
k=15 is suitable for genomes smaller than 100M.
k=17 is suitable for genomes smaller than 10G.
This version only supports k<=17.
The unique k-mer table built in this step is written to Kmer_17/02.Uinque_bit/kmer_17.bit
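The size guidance above can be written down as a tiny helper. This is a sketch only: `suggest_k` is a hypothetical name, and it assumes "100M" and "10G" mean 100,000,000 bp and 10,000,000,000 bp.

```shell
# Print the k-mer size suggested by the guidance above for a genome size in bp.
# Fails with exit status 1 if the genome is too large for k<=17.
suggest_k() {
    awk -v size="$1" 'BEGIN {
        if (size < 100e6) print 15
        else if (size < 10e9) print 17
        else { print "no suggested k<=17 for this genome size" > "/dev/stderr"; exit 1 }
    }'
}
```

For example, `suggest_k 50000000` prints 15 and `suggest_k 3000000000` prints 17. Whatever value is used for `-k` here is also the value to pass to `--kmer` in the next step.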
4. Remove redundant sequences from the assembly
perl remDup.pl <genome.fa> <outdir> <cutoff:0.7>
Options:
--ref <str> The ref genome to build kbit
--kbit <str> The unique kmer file
--kmer <int> the kmer size [15]
--sort <int> sort seq by length [1]
For example:
perl Bin/remDup.pl --kbit Kmer_17/02.Uinque_bit/kmer_17.bit \
--kmer 17 assemble.fasta Compress 0.3
The result is written to: compress file: Compress/trinity.single.fasta.gz
Note:
a. If the compress file is larger than the estimated genome size, lower the cutoff value
b. If the compress file is smaller than the estimated genome size, raise the cutoff value
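The comparison in the two notes above is easy to script. The following is a rough sketch, not something shipped with khaper: `fasta_bases` and `cutoff_hint` are hypothetical helpers, the output is assumed to be a gzipped FASTA, and the estimated genome size is supplied by you (for example from step 7 of Graph.pl).

```shell
# Total bases in a gzipped FASTA file, ignoring header lines.
fasta_bases() {
    gzip -dc "$1" | awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# Compare the compressed assembly size with the estimated genome size (both in bp)
# and print which way the notes above say to move the cutoff.
cutoff_hint() {
    awk -v asm="$1" -v est="$2" 'BEGIN {
        if (asm > est) print "lower the cutoff"
        else if (asm < est) print "raise the cutoff"
        else print "keep the cutoff"
    }'
}
```

For example, `cutoff_hint "$(fasta_bases Compress/trinity.single.fasta.gz)" 500000000` gives the direction for an assumed 500 Mb genome. In practice the two sizes will never match exactly, so treat a small difference as good enough rather than rerunning remDup.pl indefinitely.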