使用Unicycler进行细菌基因组组装
二代测序技术的迅速发展大幅度的降低了高通量测序价格且提高了测序质量,同时高通量测序相关的数据处理和分析的软件也是越来越好用。SPAdes在二代测序小基因组的组装上表现优异,使用方便,真是爱不释手。Unicycler在二代测序小基因组装上则是依赖于SPAdes,在此基础上进行了优化;此外在二代三代测序数据混合组装和纯三代测序数据组装上也表现较好;该项目在github上被标星210次。今天使用SPAdes和Unicycler对一个350bp小片段建库测序1G数据量的细菌菌株进行组装,熟悉一下流程并且看看两个软件的差别。
1. 数据质控
公司给的建库测序的数据分为raw_data和clean_data, 我先直接使用fastqc对clean_data的数据进行质控。双端测序数据是N46-1_add_BDMS190627350-1a_1.clean.fq.gz和N46-1_add_BDMS190627350-1a_2.clean.fq.gz。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ fastqc '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz' '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz' -o '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/fastqc_out'
正向序列 反向序列
fastqc结果看,测序长度是150bp,序列数量约为8M,数据量约为1.2G,序列GC含量和duplication水平不是特别完美,总体来看双端序列质量很好,数据的质量和数量满足下游分析。
2. 使用spades对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ spades.py -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz --careful -o spades_out -t 12 -m 24
Command line: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz --careful -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out -t 12 -m 24
System information:
SPAdes version: 3.13.1
Python version: 3.7.3
OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid
Output dir: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out
Mode: read error correction and assembling
Debug mode is turned OFF
Dataset parameters:
Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
Reads:
Library number: 1, library type: paired-end
orientation: fr
left reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz']
right reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz']
interlaced reads: not specified
single reads: not specified
merged reads: not specified
Read error correction parameters:
Iterations: 1
PHRED offset will be auto-detected
Corrected reads will be compressed
Assembly parameters:
k: automatic selection based on read length
Repeat resolution is enabled
Mismatch careful mode is turned ON
MismatchCorrector will be used
Coverage cutoff is turned OFF
Other parameters:
Dir for temp files: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/tmp
Threads: 12
Memory limit (in Gb): 24
======= SPAdes pipeline started. Log can be found here: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/spades.log
===== Read error correction started.
== Running read error correction tool: /home/czh/miniconda3/envs/denovo_genome/bin/spades-hammer /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info
0:00:00.000 4M / 4M INFO General (main.cpp : 75) Starting BayesHammer, built from N/A, git revision N/A
0:00:00.000 4M / 4M INFO General (main.cpp : 76) Loading config from /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info
0:00:00.000 4M / 4M INFO General (main.cpp : 78) Maximum # of threads to use (adjusted due to OMP capabilities): 12
0:00:00.000 4M / 4M INFO General (memory_limit.cpp : 49) Memory limit set to 24 Gb
0:00:00.000 4M / 4M INFO General (main.cpp : 86) Trying to determine PHRED offset
0:00:00.000 4M / 4M INFO General (main.cpp : 92) Determined value is 33
耗时大约70分钟。
3. 使用unicycler对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ unicycler -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o unicycler_out -t 12
Starting Unicycler (2019-11-03 22:46:05)
Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you
provided only short reads, Unicycler will essentially function as a SPAdes-
optimiser. It will try many k-mer sizes, choose the best based on contig length
and graph connectivity, and scaffold the graph using SPAdes repeat resolution.
For more information, please see https://github.com/rrwick/Unicycler
Command: /home/czh/miniconda3/envs/denovo_genome/bin/unicycler -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out -t 12
Unicycler version: v0.4.8
Using 12 threads
The output directory already exists:
/home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out
Dependencies:
Program Version Status
spades.py 3.13.1 good
racon not used
makeblastdb 2.9.0+ good
tblastn 2.9.0+ good
bowtie2-build 2.3.5 good
bowtie2 2.3.5 good
samtools 1.9 good
java 11.0.1-internal good
pilon 1.23 good
bcftools not used
SPAdes read error correction (2019-11-03 22:47:16)
Unicycler uses the SPAdes read error correction module to reduce the number
of errors in the short read before SPAdes assembly. This can make the assembly
faster and simplify the assembly graph structure.
Command: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/read_correction --threads 12 --only-error-correction
Corrected reads:
/home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_1.fastq.gz
/home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_2.fastq.gz
Choosing k-mer range for assembly (2019-11-03 23:47:00)
Unicycler chooses a k-mer range for SPAdes based on the length of the input
reads. It uses a wide range of many k-mer sizes to maximise the chance of
finding an ideal assembly.
SPAdes maximum k-mer: 127
Median read length: 150
K-mer range: 27, 47, 63, 77, 89, 99, 107, 115, 121, 127
SPAdes assemblies (2019-11-03 23:48:07)
Unicycler now uses SPAdes to assemble the short reads. It scores the
assembly graph for each k-mer using the number of contigs (fewer is better) and
the number of dead ends (fewer is better). The score function is 1/(c*(d+2)),
where c is the contig count and d is the dead end count.
K-mer Contigs Dead ends Score
27 failed
47 failed
63 failed
77 failed
89 failed
99 failed
107 failed
115 failed
121 failed
127 198 0 2.53e-03 <-best
Read depth filter: removed 216 contigs totalling 72842 bp
Deleting /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/
Determining graph multiplicity (2019-11-04 00:07:15)
Multiplicity is the number of times a sequence occurs in the underlying
sequence. Single-copy contigs (those with a multiplicity of one, occurring only
once in the underlying sequence) are particularly useful.
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/001_best_spades_graph.gfa
Cleaning graph (2019-11-04 00:07:15)
Unicycler now performs various cleaning procedures on the graph to remove
overlaps and simplify the graph structure. The end result is a graph ready for
bridging.
Graph overlaps removed
Removed zero-length segments:
111, 112, 116, 117, 122, 123, 125, 126, 134, 135, 140, 143, 144, 145, 146,
149, 151, 153, 158, 161, 169, 170, 171, 172, 179, 181, 183
Removed zero-length segments:
110, 165
Removed zero-length segments:
195
Merged small segments:
184, 185, 186, 187, 192, 196, 197
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/002_overlaps_removed.gfa
Unicycler now selects a set of anchor contigs from the single-copy contigs.
These are the contigs which will be connected via bridges to form the final
assembly.
43 anchor segments (5,368,430 bp) out of 162 total segments (5,423,141 bp)
Creating SPAdes contig bridges (2019-11-04 00:07:15)
SPAdes uses paired-end information to perform repeat resolution (RR) and
produce contigs from the assembly graph. SPAdes saves the graph paths
corresponding to these contigs in the contigs.paths file. When one of these
paths contains two or more anchor contigs, Unicycler can create a bridge from
the path.
Bridge
Start Path End quality
-35 -58 44 18.2
-19 109 -> 114 -> 95 -> -136 -> 106 -> 112 -> 123 -> 80 -> 130 -> 65 -> 136 -> 110 -> -145 -> -103 -> -109 -> 131 43 12.0
2 -42 37 9.3
3 102 10 62.2
5 -96 -> 152 -> -96 -12 37.8
8 -102 -15 61.8
10 -98 -> 46 -> -98 1 7.3
16 -150 -> 113 -> 133 -> -103 -> -109 -> 131 -> 45 -> 132 -> 109 -> 114 -> 124 -> 129 -> 83 -> 121 -> -117 -> -88 -> 156 -28 7.5
22 -57 -40 11.8
37 -42 4 8.7
Creating loop unrolling bridges (2019-11-04 00:07:15)
When a SPAdes contig path connects an anchor contig with the middle contig
of a simple loop, Unicycler concludes that the sequences are contiguous (i.e.
the loop is not a separate piece of DNA). It then uses the read depth of the
middle and repeat contigs to guess the number of times to traverse the loop and
makes a bridge.
Loop count Loop count Loop Bridge
Start Repeat Middle End by repeat by middle count quality
2 -42 37 4 1.07 0.97 1 40.5
5 -96 152 -12 0.67 0.60 1 36.4
9 97 147 12 0.82 0.71 1 40.1
10 -98 46 1 0.12 0.88 1 37.5
Applying bridges (2019-11-04 00:07:15)
Unicycler now applies to the graph in decreasing order of quality. This
ensures that when multiple, contradictory bridges exist, the most supported
option is used.
Bridge type Start -> end Path Quality
SPAdes 3 -> 10 102 62.210
SPAdes 8 -> -15 -102 61.764
SPAdes 5 -> -12 -96, 152, -96 37.778
SPAdes -35 -> 44 -58 18.167
SPAdes -19 -> 43 109, 114, 95, -136, 106, 112, 123, 80, 11.994
130, 65, 136, 110, -145, -103, -109, 131
SPAdes 22 -> -40 -57 11.829
loop 2 -> 4 -42, 37, -42 40.460
loop 9 -> 12 97, 147, 97 40.120
loop 10 -> 1 -98, 46, -98 37.498
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/003_bridges_applied.gfa
Bridged assembly graph (2019-11-04 00:07:16)
The assembly is now mostly finished and no more structural changes will be
made. Ideally the assembly graph should now have one contig per replicon and no
erroneous contigs (i.e a complete assembly). If there are more contigs, then
the assembly is not complete.
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/004_final_clean.gfa
Component Segments Links Length N50 Longest segment Status
1 131 184 5,430,795 1,298,586 1,645,591 incomplete
Polishing assembly with Pilon (2019-11-04 00:07:16)
Unicycler now conducts multiple rounds of Pilon in an attempt to repair any
remaining small-scale errors with the assembly.
Aligning reads to find appropriate insert size range...
Insert size 1st percentile: 190
Insert size 99th percentile: 513
Pilon polish round 1
Total number of changes: 29
Pilon polish round 2
Total number of changes: 4
Pilon polish round 3
No Pilon changes
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/005_polished.gfa
Assembly complete (2019-11-04 02:13:11)
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.gfa
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.fasta
总共耗时3小时左右,比较慢。
4. 使用QUAST评估两个软件组装结果。
(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta
/home/czh/miniconda3/envs/qc_genome/bin/quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta
Version: 5.0.2
System information:
OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid (linux_64)
Python version: 3.6.7
CPUs number: 16
Started: 2019-11-04 06:14:57
Logging to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log
CWD: /home/czh/Desktop/01_denovo_genome/N46_DENOVO
Main parameters:
MODE: default, threads: 8, minimum contig length: 500, minimum alignment length: 65, \
ambiguity: one, threshold for extensive misassembly size: 1000
Contigs:
Pre-processing...
1 spades_assembly.fasta ==> spades_assembly
2 unicycler_assembly.fasta ==> unicycler_assembly
2019-11-04 06:15:06
Running Basic statistics processor...
Contig files:
1 spades_assembly
2 unicycler_assembly
Calculating N50 and L50...
1 spades_assembly, N50 = 472283, L50 = 5, Total length = 5427777, GC % = 57.74, # N's per 100 kbp = 15.92
2 unicycler_assembly, N50 = 1298586, L50 = 2, Total length = 5419702, GC % = 57.76, # N's per 100 kbp = 0.00
Drawing Nx plot...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/Nx_plot.pdf
Drawing cumulative plot...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/cumulative_plot.pdf
Drawing GC content plot...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/GC_content_plot.pdf
Drawing spades_assembly GC content plot...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_GC_content_plot.pdf
Drawing unicycler_assembly GC content plot...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/unicycler_assembly_GC_content_plot.pdf
Drawing Coverage histogram (bin size: 24x)...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/coverage_histogram.pdf
Drawing spades_assembly coverage histogram (bin size: 24x)...
saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_coverage_histogram.pdf
Done.
NOTICE: Genes are not predicted by default. Use --gene-finding or --glimmer option to enable it.
2019-11-04 06:15:08
Creating large visual summaries...
This may take a while: press Ctrl-C to skip this step..
1 of 2: Creating Icarus viewers...
2 of 2: Creating PDF with all tables and plots...
Done
2019-11-04 06:15:08
RESULTS:
Text versions of total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.txt, report.tsv, and report.tex
Text versions of transposed total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
HTML version (interactive tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.html
PDF version (tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.pdf
Icarus (contig browser) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/icarus.html
Log is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log
Finished: 2019-11-04 06:15:08
Elapsed time: 0:00:11.039108
NOTICEs: 1; WARNINGs: 0; non-fatal ERRORs: 0
Thank you for using QUAST!
quast评估结果可以看出unicycler组装结果优于spades,unicycler应该是矫正和剔除了长度太短和一些冗余的contig,减少了contig的数量,提高了N50的长度,但基因组总长度小于spades组装。
5. unicycler调用spades初步组装后对contigs的处理。
区分单拷贝的contigs和重复的contigs,选择合适的单拷贝contigs进行下一步处理。
剔除contig间的重叠区域,为了尽可能的环化contigs。
单拷贝contigs桥接图
最后使用Pilon将短序列mapping到组装好的基因组上进行矫正。
unicycler相较于spades耗时较长,组装结果较好,这只是一个菌基因组的组装比较结果,可能其他菌株使用unicycler组装并不会有较大提升,反而会降低N50,不过N50不是评估基因组的唯一标准,过于苛求更长的N50可能会导致错误组装,可能会误导后续基因注释和功能研究。