Minh Duc Cao, Devika Ganesamoorthy, Alysha Elliott, Huihui Zhang, Matthew Cooper and Lachlan Coin
Review posted on 10th May 2016
This manuscript has previously been reviewed in the AcademicKarma platform here: http://academickarma.org/review/f4n4ef16fy62
The comments have been re-uploaded and archived below.

Cao et al present streaming algorithms for the identification of pathogens and antibiotic resistance genes using "real time" MinION sequencing. The paper makes some interesting points in terms of proof-of-principle, though a lot of work is needed before this could be implemented in practice.
A major point of discussion needs to be the fact that base-calling, via the cloud-based base-caller Metrichor, is a major bottleneck in the pipeline and can add hours to the process. By using pre-basecalled data the authors bypass this issue; in a real setting, however, online base-calling would be a problem. Rapid matching on "squiggle data" (e.g. http://biorxiv.org/content/early/2016/02/03/038760) could be discussed as a potential solution.
I don't think the "pipeline" itself is described in sufficient detail, and I would suggest a flowchart showing information flow through the pipeline, including software names/versions. I am also unsure the pipeline is genuinely an example of "streaming", which usually means that data are not written to disk but simply piped from one process to another. Here, however, the FAST5 files are written to disk by Metrichor, sequence data are extracted using npReader and (I assume) written to disk, then picked up by BWA, and only the output of BWA is streamed to other processes. The authors may or may not be aware that there is an API to MinKNOW that allows genuine streaming of data from the MinION device. Again, these points should be discussed.
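For clarity, the strict sense of "streaming" I have in mind can be illustrated with a toy sketch (all names are hypothetical; this is not the authors' pipeline): each stage consumes records lazily and passes them on, with nothing written to intermediate files.

```python
# Toy illustration of "streaming" in the strict sense: reads flow from one
# stage to the next without intermediate files. All names are hypothetical.

def basecalled_reads(events):
    """Stage 1: yield one sequence at a time as it becomes available."""
    for event in events:
        yield event.upper()          # stand-in for base-calling

def aligned(reads):
    """Stage 2: consume reads lazily; nothing is written to disk."""
    for read in reads:
        yield (read, len(read))      # stand-in for alignment

# Stages are composed as a pipeline; each read is processed as it arrives,
# which is the property the word "streaming" implies.
events = ["acgt", "ggcc", "ttaa"]
results = list(aligned(basecalled_reads(events)))
```

A pipeline that writes FAST5 and FASTQ files to disk between stages, as described, is batch processing with frequent polling rather than streaming in this sense.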
The authors state:
"We developed a novel strain typing method to identify the bacterial strain from the MinION sequence reads based on patterns of gene presence and absence."
I would like to know how this differs from metagenomic profilers such as Kraken (and many others), and indeed why the authors couldn't use one of these existing pipelines.
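For context, a presence/absence typing scheme can be reduced to profile matching; the toy sketch below (invented strain names and genes, with a simple Jaccard score as a stand-in for whatever scoring the manuscript actually uses) is the kind of baseline against which the novelty claim should be positioned.

```python
# Toy sketch: score observed gene hits against per-strain presence/absence
# profiles. Strain names and genes are invented for illustration only.
profiles = {
    "strain_A": {"geneX", "geneY"},
    "strain_B": {"geneX", "geneZ"},
}

def best_strain(observed_genes, profiles):
    """Return the strain whose gene set best overlaps the observed genes.
    Jaccard similarity is a placeholder for the manuscript's scoring."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(profiles, key=lambda s: jaccard(observed_genes, profiles[s]))

hit = best_strain({"geneX", "geneZ"}, profiles)
```

The manuscript should explain what its method adds beyond this kind of matching, and beyond existing k-mer-based profilers such as Kraken.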
Finally, though the results are interesting, the conclusions are limited because the authors use pure cultures, and I would be very interested to see how the platform performs on genuine clinical samples. The authors should also be aware of Bradley et al (http://www.nature.com/ncomms/2015/151221/ncomms10063/full/ncomms10063.html), which has a section on the use of MinION for AMR typing.
Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included?
If not, please specify what is required in your comments to the authors.
Are the conclusions adequately supported by the data shown?
If not, please explain in your comments to the authors.
Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting?
If not, please specify what is required in your comments to the authors.
Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used?
Yes, and I have assessed the statistics in my report.
Quality of written English
Please indicate the quality of language in the manuscript:
Declaration of competing interests
Please complete a declaration of competing interests, consider the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this manuscript?
If you can answer no to all of the above, write ‘I declare that I have no competing interests’ below. If your reply is yes to any, please give details below.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.
Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0137-2/13742_2016_137_AuthorComment_V1.pdf)
Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T. Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A. Fonseca, Ganeshkumar Ganapathy, Richard A. Gibbs, Sante Gnerre, Élénie Godzaridis, Steve Goldstein, Matthias Haimel, Giles Hall, David Haussler, Joseph B. Hiatt, Isaac Y. Ho, Jason Howard, Martin Hunt, Shaun D. Jackman, David B Jaffe, Erich Jarvis, Huaiyang Jiang, Sergey Kazakov, Paul J. Kersey, Jacob O. Kitzman, James R. Knight, Sergey Koren, Tak-Wah Lam, Dominique Lavenier, François Laviolette, Yingrui Li, Zhenyu Li, Binghang Liu, Yue Liu, Ruibang Luo, Iain MacCallum, Matthew D MacManes, Nicolas Maillet, Sergey Melnikov, Bruno Miguel Vieira, Delphine Naquin, Zemin Ning, Thomas D. Otto, Benedict Paten, Octávio S. Paulo, Adam M. Phillippy, Francisco Pina-Martins, Michael Place, Dariusz Przybylski, Xiang Qin, Carson Qu, Filipe J Ribeiro, Stephen Richards, Daniel S. Rokhsar, J. Graham Ruby, Simone Scalabrin, Michael C. Schatz, David C. Schwartz, Alexey Sergushichev, Ted Sharpe, Timothy I. Shaw, Jay Shendure, Yujian Shi, Jared T. Simpson, Henry Song, Fedor Tsarev, Francesco Vezzi, Riccardo Vicedomini, Jun Wang, Kim C. Worley, Shuangye Yin, Siu-Ming Yiu, Jianying Yuan, Guojie Zhang, Hao Zhang, Shiguo Zhou, Ian F. Korf
Review posted on 11th February 2013
The authors describe the outputs of “Assemblathon 2”, an international effort to compare genome assemblers, assembly strategies and the usefulness of various assembly metrics. The major conclusions appear to be that most assembly strategies produce useful assemblies, not all strategies work on all types of genome and all types of data, and that no single metric can be used to assess assembly quality.
Clearly a lot of work has gone into this manuscript, and a lot of information is presented to the reader. The authors must be congratulated on gathering such a large amount of work into a single, coherent paper.
As a technical paper describing the use of a variety of metrics to assess the quality and diversity of multiple assemblies, the paper is publishable “as is”, and therefore everything is a “discretionary revision”. However, the paper could be improved in several ways.
The first question which occurs to me is this: what was the purpose of this international effort? What were the authors trying to achieve? Was it:
- To catalogue available assemblers?
- To compare available assemblers?
- To develop best practice?
- To develop a set of guidelines (i.e. which assembler should I use on my data)?
- To compare assembly metrics?
- To develop better assembly metrics?
I would encourage the authors to state the aims of the project fully and clearly; having read the manuscript, it remains unclear what exactly the expected outcomes were, and whether they have been achieved.
The structure of the manuscript is complex. We have several factors at play here:
- 3 species
- 43 assemblies
- 10 different metrics
- Contigs vs scaffolds
I found the manuscript quite difficult to read, and a more logical structure might be:
1. Define the different requirements one might have of an assembly (e.g. longest scaffolds; most genes in contigs, etc.)
2. Define (and justify) the metric chosen to answer each question from (1)
3. For each question/metric, for each species, and for both contig and scaffold assemblies, compare the performance of each assembler
I would have liked to see consistently separate analyses of "contig" and "scaffold" assemblies. For example, the "presence of core genes" analysis is performed only on scaffolds, whereas other analyses separate contigs from scaffolds. This separation is important: as a reader, I would want to know if, for example, a given assembler is brilliant at producing contigs but awful at scaffolding.
The choice of bird, fish and snake should be justified. This is very important. For a performance comparison of genome assemblers, the ideal scenario would be to select a range of genomes of varying size and complexity, e.g. haploid, small diploid, large diploid, polyploid, non-repetitive, repetitive, etc. The three chosen genomes are of similar size (1.6Gb, 1.2Gb and 1Gb) and are all diploid (as far as I can tell). What was the justification for choosing these genomes?
The abstract mentions representation of “regulatory sequences” but I am not sure the paper actually addresses presence/absence of these?
There is an issue of "over assembly", for which the authors offer a possible explanation (e.g. that certain teams may have assembled multiple haplotypes of the same locus), but no attempt is made to validate it. Would it be possible to extend the NG50 graphs beyond 100%, to identify assemblers that have "over assembled"?
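To make the suggestion concrete: NG(x) is defined against the genome size rather than the assembly size, so it can be evaluated past x = 100, and a finite value there directly flags over-assembly. A minimal sketch, with illustrative numbers:

```python
def ng(contig_lengths, genome_size, x):
    """NG(x): the length of the contig at which the cumulative sum of
    descending contig lengths first reaches x% of the genome size.
    Returns None if the assembly is too short to reach that point."""
    target = genome_size * x / 100.0
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return None

contigs = [500, 400, 300, 200, 100]   # total assembled = 1500
genome_size = 1000                    # assembly is 150% of the genome

ng50 = ng(contigs, genome_size, 50)    # ordinary NG50
ng150 = ng(contigs, genome_size, 150)  # finite only if over-assembled
```

For a correctly sized assembly, NG(x) is undefined for any x much above 100; plotting the curves to the point where they terminate would show at a glance which assemblies contain excess sequence.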
Should gene-based metrics (“presence of core genes”) be calculated on contig assemblies? The danger is that, in scaffolds, parts of the gene fall in gaps. A very high quality gene-centric assembly would have the majority of genes in contigs. This is an important analysis, I feel.
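Computing the contig-level version of a gene-based metric only requires splitting scaffolds at gap runs before the gene search; a minimal sketch, assuming gaps are encoded as runs of 'N':

```python
import re

def scaffold_to_contigs(scaffold, min_gap=1):
    """Split a scaffold sequence into contigs at runs of >= min_gap 'N's.
    The gap convention (one or more 'N's) is an assumption."""
    pattern = "N{%d,}" % min_gap
    return [c for c in re.split(pattern, scaffold.upper()) if c]

# A gene spanning a gap is intact at scaffold level but broken at contig
# level -- exactly the difference this analysis would measure.
scaffold = "ATGAAATTT" + "N" * 20 + "GGGCCCTAA"
contigs = scaffold_to_contigs(scaffold)
```

Running the core-gene analysis on both the scaffolds and the split contigs would separate genuine sequence contiguity from contiguity created by scaffolding over gaps.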
Ranks are used consistently, but some additional judgement could also be used. For example, in the “COMPASS analysis of VFRs”, it is stated that the Ray assembler ranked 1st for all individual measures except multiplicity, where it ranks 7th. This does not paint the entire picture. Whilst this latter rank of 7th is true, on inspection of Figure 9, it is clear that Ray is still performing very well, and is part of a sub-group of assemblers whose performance is almost equally good. Pointing out that Ray ranks 7th may suggest to the reader that Ray’s performance is bad for multiplicity, when in effect the opposite is true.
Figure 20 shows that for all three species, the z-score correlates very well with N50. The authors point out that only fish and snake are significant, but there certainly seems to be a good relationship in bird also. This is a useful conclusion. Whilst the authors are keen to point out that no single metric captures all information, this graph appears to show that if you have nothing else, N50 can be a useful single metric.
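For readers unfamiliar with the metric: N50, unlike NG50, is computed against the assembly's own total length, which is one reason it cannot by itself capture over- or under-assembly. A minimal sketch:

```python
def n50(contig_lengths):
    """N50: the contig length at which at least half of the total
    assembly length is contained in contigs of that length or longer."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    total = 0
    for length in lengths:
        total += length
        if total >= half:
            return length

# e.g. n50([100, 200, 300, 400]) -> 300 (half-total = 500; 400 + 300 >= 500)
```

Because the denominator is the assembly itself, an over-assembled entry can still post a respectable N50, which is why the z-score comparison in Figure 20 is informative.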
In the discussion it is pointed out that the SOAPdenovo entry used mislabelled mate-pair information and therefore the entry is incorrect. I would recommend either removing the SOAPdenovo data point or repeating the analysis with a new, corrected SOAPdenovo entry.
The discussion of the CoBig and PRICE assemblies seems out of place. The CoBig assembly is not analysed in any way. The PRICE assembly had different aims from every other assembler (to assemble only genic regions), and even in that it failed, with a lower number of core genes than any other assembly. The fact that it would rank first if normalised to the total amount of sequence assembled is irrelevant, as the aim was not to produce a full genome assembly.
One would hope that the outcome of efforts such as Assemblathon would be to create a useful guide for those new to the field, e.g. to answer questions such as:
- If I have a large, diploid genome, which assembler should I use?
- If I have a repeat-rich genome, which assembler should I use?
- If I have a polyploid genome, which assembler should I use?
- Which assembler is “best” at scaffolding?
This relates to the purpose of the project, which I mention above. Whilst the authors have undoubtedly compared a multitude of assembly strategies, using a wide range of metrics, there is still no “best practice” or set of guidelines for choosing the best assembler for a particular biological problem. After such a lot of time and effort, the conclusions and five “practical considerations” seem like a disappointing outcome. I wonder if the authors could put more thought into this? For example, SGA seems very good at assembling the snake genome. Why? What is it about the snake genome that made SGA better than the rest? Was it the repeat structure of the genome? The type and amount of data? If someone comes along with a snake-like genome, should they choose SGA?
Is there any chance the group could take all of the data they have produced, and provide guidance for people who want to assemble genomes but don’t know which assembly strategy is best for their “type” of genome?
Level of interest: An article of importance in its field
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests