Completed on 5 Oct 2016
I have read Cao and co-workers’ manuscript on npScarf with much interest. It describes an assembly scaffolding and finishing tool for hybrid assembly of Illumina and nanopore sequencing data. The stated motivation of the method is to provide a real-time assembly algorithm that can be used in the field and inform its user when to stop a sequencing experiment.
Although the method provides competitive performance in comparison to present state-of-the-art, in my opinion it fails to deliver on its stated goals.
First, because it builds on assembled Illumina sequencing data, and because that step has to be performed offline, the primary use case of npScarf cannot be a field study.
We agree with the reviewer that the primary use case of npScarf is not in the field. We have revised the manuscript to make clear that our method can scaffold and finish the assemblies concurrently with MinION sequencing, but that the entire workflow, including MiSeq sequencing, cannot be performed in real-time with the current technology. We have also clarified that our primary applications are 1. efficiently completing existing short-read genome assemblies, and 2. controlling MinION sequencing, which is substantially more expensive than short-read sequencing, in new hybrid-assembly projects. [Page 2, column 1; Page 8, column 1; Page 8, column 2]
Second, the authors have the misconception that the MinION instrument returns its output one read at a time. Unfortunately (for the authors, but fortunately for the users) this is not the case. The instrument has many nanopores operating at the same time. So the actual data streaming happens as an asynchronous time series from multiple pores.
npScarf actually retrieves and processes data in a block-wise fashion, i.e., in small batches of reads sequenced within a short block of time. In this sense, npScarf makes use of reads sequenced from multiple pores rather than one read at a time. The length of the time block is a runtime parameter of the algorithm. We have made this clear in the revision. [Page 9, column 1]
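The block-wise behaviour described above can be illustrated with a small sketch. This is not npScarf's actual implementation; the function and parameter names are assumptions for illustration only. Reads arriving asynchronously from many pores are grouped into time blocks, and each block is processed as one batch:

```python
def group_into_blocks(reads, block_seconds):
    """Group (timestamp, read_id) pairs into time blocks of block_seconds.

    Each block may contain reads from many pores operating concurrently;
    the algorithm then processes one block at a time rather than one
    read at a time. Illustrative sketch, not npScarf's API.
    """
    blocks = {}
    for t, read in sorted(reads):
        # Integer division assigns each read to its time block.
        blocks.setdefault(int(t // block_seconds), []).append(read)
    return [blocks[k] for k in sorted(blocks)]
```

For example, with a 60-second block, reads stamped at 0 s and 10 s fall into the first batch and a read at 70 s into the second, regardless of which pore produced them.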
Having said that, I think the study has merit if the authors are willing to reframe the target application of their tool. It performs better than the comparators on the datasets they have tested, in terms of contiguity, quality and run time. (They should also include memory comparisons in their report.)
We have clarified the main use cases for npScarf as outlined above. We emphasise that our tool performs better than others on the tested data sets in terms of assembly contiguity and quality. We also show that our tool requires much less long-read data to complete an assembly, and as such the ability to scaffold and report results on the fly will help to control MinION sequencing in completing existing short-read assemblies as well as in hybrid assembly projects.
[Page 2, column 1; Page 7, column 2; Page 8 column 1]
We have included a brief report on the memory consumption of our method in the manuscript (<4GB, which is a very small memory footprint) [Page 7, column 2]. While we are convinced that our method requires less memory than competing methods, we find that these pipelines have differing computational settings (nanocorr and nas distribute computation to hundreds of compute nodes; scaffolding methods can only be run after the execution of Spades, which can be configured for different memory settings; and Canu and Miniasm require a small memory footprint to assemble the genome, but the polishing step with Pilon consumes substantial amounts of memory), and hence a comparison is not practical. We instead include the memory consumption summary in the supplementary information [Supp. Table 1].
Of particular interest is how the algorithm recovers from misassembled bridges. As the authors have demonstrated (though not explicitly stressed), the assembly problem becomes more challenging as the target genome size and complexity increase. As a result, using more data may help reduce misassemblies. This is reflected in their comparison of real-time and batch assemblies of the S. cerevisiae dataset. My question is, can the algorithm recover from misassemblies in the batch mode? That is, if the algorithm were run in batch mode with several partitions of the dataset (say, two or three), can the later batch runs correct the earlier wrong decisions?
At any one point, our algorithm can provide the most likely assembly given the available data. In real-time mode, it can correct mis-assemblies (which are mainly due to false-positive alignments at earlier stages) in light of evidence from new data. As the reviewer pointed out, using more data can generate more reliable assemblies. In real-time mode, our algorithm evaluates and reports the assembly after one or a few reads are received, whereas in batch mode, this is done after all reads are examined. The number of reads required before a new decision is made is a parameter of the algorithm. Hence the configuration the reviewer suggested can be achieved by setting this number to, say, one half or one third of the data set. The algorithm will run in real-time mode and will correct errors from the early batch when data from a later batch provide sufficient evidence.
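The revocable-decision behaviour described above can be sketched as a toy evidence counter. This is illustrative only: the vote representation, `batch_size` and `min_support` names are assumptions, not npScarf's actual model. Each bridge accumulates supporting (+1) or contradicting (-1) alignment evidence, and the set of accepted bridges is recomputed after every batch, so a bridge accepted early on false-positive alignments can later be dropped:

```python
def incremental_bridges(read_votes, batch_size, min_support=2):
    """Re-evaluate bridge decisions after every batch_size reads.

    read_votes: list of (bridge_id, +1/-1) evidence items in arrival order.
    Returns the accepted-bridge set after each batch, showing how an early
    acceptance can be revoked by later contradicting evidence.
    Toy sketch; names and thresholds are illustrative.
    """
    support = {}
    accepted_history = []
    for i, (bridge, vote) in enumerate(read_votes, 1):
        support[bridge] = support.get(bridge, 0) + vote
        if i % batch_size == 0 or i == len(read_votes):
            accepted_history.append(
                {b for b, s in support.items() if s >= min_support})
    return accepted_history
```

With a small batch size the decision is revisited often, mimicking real-time mode; setting it to half or a third of the data set reproduces the partitioned batch configuration the reviewer suggested.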
I find the way the method determines unique contigs rather simplistic and lacking proper statistical treatment. Coverage depth of a contig depends on several factors, including its length and GC content. In general, the longer a contig, the closer its average fold coverage is to the overall genome fold coverage, and I do realize that the authors are aware of that (they consider “up to 20 of the largest contigs longer than 20 Kb”). They could provide a better justification for these thresholds, or make them runtime parameters, to address this issue to a certain extent. However, that would not alleviate the problem with applying an inferred cut-off across the board, as shorter contigs will have wider fluctuations in their coverage profiles.
We have provided a more detailed description of the method to determine unique contigs. As the reviewer pointed out, the depth-coverage estimate is more accurate with more and longer contigs. Our algorithm in fact takes an iterative approach, using the longest contigs possible so long as their depth coverage does not deviate from the estimate. In our experience, the largest 20 contigs provide a sufficient basis for the estimate, i.e., the estimated coverage does not change significantly beyond the 20 largest contigs. The 20 Kb threshold is another heuristic to ascertain the uniqueness of a contig when domain knowledge is available. The algorithm indeed takes this threshold as a runtime parameter, and considers a contig longer than this threshold unique without further test.
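The procedure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the tolerance value, the use of the median, and all names are our own for this sketch, not npScarf's actual parameters or statistical test. The genome coverage is seeded from the largest contigs; contigs above the length threshold are deemed unique without further test, while shorter ones are accepted only if their depth is consistent with the estimate:

```python
from statistics import median

def classify_unique_contigs(contigs, min_len=20_000, max_seed=20, tol=0.25):
    """Coverage-based uniqueness sketch.

    contigs: list of (length, depth) tuples.
    Returns (coverage_estimate, list_of_unique_contigs).
    tol is a hypothetical relative deviation threshold.
    """
    by_len = sorted(contigs, key=lambda c: c[0], reverse=True)
    # Seed the genome-coverage estimate from up to max_seed of the
    # largest contigs that exceed the length threshold.
    seed = [d for l, d in by_len[:max_seed] if l >= min_len]
    est = median(seed)
    unique = []
    for length, depth in by_len:
        if length >= min_len:
            unique.append((length, depth))   # long contig: unique by assumption
        elif abs(depth - est) / est <= tol:
            unique.append((length, depth))   # depth consistent with a unique copy
    return est, unique
```

A repeat contig typically shows a multiple of the genome coverage, so it fails the deviation test and is excluded from scaffolding anchors.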
[Page 8, column 2]
I think adding some logic to determine the completeness of a bacterial genome/plasmid assembly through key features is a great idea, though it needs to be better explained how this is integrated into the assembly algorithm, including how to invoke it during a run.
The algorithm automatically reports the statistics of the assembly (N50, total length, number of contigs and number of circular contigs) as scaffolding progresses. Determining the completion of an assembly is left to the users, depending on their domain knowledge and the aims of the experiment. For instance, the assembly of a bacterial genome can be considered complete when all contigs are circular. The users can also choose to terminate a sequencing run when all plasmids are completed, if plasmids are the focus of the experiment.
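A user-side stopping rule based on these reported statistics could look like the following sketch. The data representation and function names are assumptions for illustration; npScarf itself only reports the statistics, and the decision logic sits with the user:

```python
def assembly_stats(contigs):
    """Compute the reported statistics for a list of (length, is_circular)
    contigs: N50, total length, contig count and circular-contig count."""
    lengths = sorted((l for l, _ in contigs), reverse=True)
    total = sum(lengths)
    acc, n50 = 0, 0
    for l in lengths:
        acc += l
        if acc * 2 >= total:   # N50: shortest contig covering half the total
            n50 = l
            break
    circular = sum(1 for _, c in contigs if c)
    return {"n50": n50, "total": total,
            "contigs": len(contigs), "circular": circular}

def is_complete(contigs):
    """Example stopping rule: every contig (chromosome/plasmid) is circular."""
    return all(c for _, c in contigs)
```

A plasmid-focused experiment would instead apply `is_complete` only to the subset of contigs identified as plasmids.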
In our current implementation, the statistics are continuously output to the console or to a text file. We are working on a graphical interface for better visualisation of the assembly, and will report this in a manuscript in the near future.