Preprint reviews by Keith Robison

Improved de novo Genome Assembly: Linked-Read Sequencing Combined with Optical Mapping Produce a High Quality Mammalian Genome at Relatively Low Cost

David W Mohr, Ahmed Naguib, Neil Weisenfeld, Vijay Kumar, Preyas Shah, Deanna M Church, David Jaffe, Alan F Scott

Review posted on 22nd April 2017

Table 1 -- re-ordering the columns in a logical progression from left-to-right would scan better -- so DISCOVAR, 10X Supernova 1.1, BNG+Supernova 1.0, BNG+Supernova 1.1 (and a bit odd that Supernova 1.0 is omitted)

Table 2 takes a lot of space and isn't really giving more information than Figure 5 -- it would be preferable to have plots like Figure 5 for more chromosomes or a table listing the number of scaffolds & the sizes of scaffolds for each chromosome

The fact that the BNG data greatly reduced the number of scaffolds but had only a modest effect on N50 should be discussed. Is this a limit on scaffolding through centromeres? Do any scaffolds appear to cover an entire chromosome arm? Do any cross a centromere? It might be useful to discuss the known chromosome structure of pinnipeds as described in Beklemisheva 2016 -- adapting their Figure 4 to show how your scaffolds relate to human-seal and dog-seal synteny blocks would be valuable.

Figure 4 -- what region is this? Citation for the fact it is a breakpoint in many genome comparisons? Does this map to a known join vs. human karyotype as described in Beklemisheva? This point would be interesting to see discussed.

p.7 "acrocentric human chromosome which are" -- should identify which acrocentric chromosome(s) are being referred to.

It would be of interest to the genomics community to have a histogram of estimated fragment lengths based on the 10x read clouds and the observed lengths of BNG fragments. It would also be useful to have statistics on anomalously-mapping reads -- those that map outside the scaffold to which the majority of the cloud's reads are assigned. A histogram of number of reads per UMI might also be interesting.
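As an illustration of the last suggestion, a reads-per-barcode histogram takes only a few lines. This is a minimal sketch, assuming the barcode (the "UMI" above) of each read has already been extracted into a list; the function name and the toy barcodes are hypothetical and are not taken from the manuscript or from any 10x tool.

```python
from collections import Counter


def reads_per_barcode_histogram(barcodes):
    """Map 'number of reads' -> 'number of barcodes with that many reads'.

    `barcodes` holds one entry per read: the barcode assigned to that read.
    """
    per_barcode = Counter(barcodes)        # barcode -> read count
    return Counter(per_barcode.values())   # read count -> number of barcodes


# Hypothetical toy input: seven reads spread over four barcodes.
toy_barcodes = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG", "TTT"]
histogram = reads_per_barcode_histogram(toy_barcodes)

for n_reads in sorted(histogram):
    print(n_reads, histogram[n_reads])
```

The same two-level counting would apply to any per-read grouping key, so it could also be pointed at estimated fragment assignments.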

Figure 1 -- move the legend into the plot by labeling the lines -- much easier to read, particularly for the colorblind (red vs. green is never a good choice for that reason)


Discovered a small hitch in one thing I suggested -- the Baikal seal in the Beklemisheva analysis has 2n=32 but Hawaiian monk seals have 2n=34 (Lu et al 2000). According to Arnason 1974 the 2n=34 karyotype is probably ancestral with a single fusion generating the 2n=32 karyotype. Fronicke et al 1997 would make the fused chromosome "S", which is homologous to human chromosomes 17 and 5. Some more musings over on the blog



A portable system for metagenomic analyses using nanopore-based sequencer and laptop computers can realize rapid on-site determination of bacterial compositions

Satomi Mitsuhashi, Kirill Kryukov, So Nakagawa, Junko S Takeuchi, Yoshiki Shiraishi, Koichiro Asano, Tadashi Imanishi

Review posted on 21st January 2017

Figure 4b: please make lines much heavier & consider avoiding color for distinguishing stages - different shapes would work better (e.g. rectangle, hexagon, oval) for individuals with color perception issues.

If you are going to claim portability, then a complete list of equipment is needed, with weights. What are power requirements? Refrigeration requirements?

What is the yield from each run? What fraction of reads in each were unclassifiable? What is expected sensitivity in a more complex sample?

Discussion and legend for 4b should emphasize that you started with purified DNA, not bacteria. So time to lyse & purify DNA would need to be added to running time, and differential lysis/extraction could shift your sensitivity.



INC-Seq: Accurate single molecule reads using nanopore sequencing

Chenhao Li, Kern Rei Chng, Jia Hui Esther Boey, Hui Qi Amanda Ng, Andreas Wilm, Niranjan Nagarajan

Review posted on 22nd April 2016

Section 2.2, line 12 "pooled (by mass)" -- does "by mass" mean normalized to equal masses of each input?

Table 1 -- Defined abundances have only a single significant digit -- is this indicative of the precision of the mixing? The SMRT & ONT 2D are given to two significant figures -- is this appropriate? Table 3 gives the values to 3 significant figures.

Table 2 has 0 for the input abundance of the Klebsiella strains -- is this correct? If so, why was it detected in the community?

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? If not, please specify what is required in your comments to the authors.

Yes.

Are the conclusions adequately supported by the data shown? If not, please explain in your comments to the authors.

Yes.

Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? If not, please specify what is required in your comments to the authors.

Yes.

Are you able to assess any statistics in the manuscript or would you recommend an additional statistical review? If an additional statistical review is recommended, please specify what aspects require further assessment in your comments to the editors.

There are no statistics in the manuscript.

Quality of written English

Please indicate the quality of language in the manuscript:

Acceptable.

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

I have been a participant, via my company, in the Oxford Nanopore MinION Access Program,
which has arguably provided reagents with greater value than the $1000 entrance fee.

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0140-7/13742_2016_140_AuthorComment_V1.pdf)




A single chromosome assembly of Bacteroides fragilis strain BE1 from Illumina and MinION nanopore sequencing data

Judith Risse, Marian Thomson, Garry Blakely, Georgios Koutsovoulos, Mark Blaxter, Mick Watson

Review posted on 31st August 2015

By demonstrating a closed bacterial genome generated by an Oxford Nanopore-Illumina hybrid assembly, the authors reach a useful milestone. However, the presentation has a number of glaring flaws (as well as some smaller ones) which greatly decrease the potential value and impact of this manuscript.

To start with one of the minor sins, the authors describe the competing PacBio sequencing as "uses a modified DNA polymerase and produces long...". This description of the technology relies solely on a factor which is neither unique nor informative: every polymerase-based high throughput sequencing system (and indeed vast amounts of Sanger data) has used modified polymerases. Given the amazing technology of the PacBio -- it, after all, uses optical methods to observe the incorporation kinetics of individual DNA polymerase enzymes -- this isn't a good start, since PacBio is the reigning leader in what the manuscript is doing.

Of note, my office stapler is much larger than a MinION. Giving actual dimensions might be helpful.

The authors mention the prior, all-nanopore, MinION assembly of a similar bacterium and note the error statistics for this assembly, setting up a comparison which unfortunately they fail to follow through with. While a number of figures and paragraphs are spent analyzing the accuracy of the Nanopore reads, only a scant paragraph on final assembly quality exists and this contains no statistics on the assembly. The authors appear to believe that all discrepancies between their assembly and the best available reference, a different strain of the same organism, are true strain differences, but the number of deviations is not specified. It would also be useful to understand the range of Illumina coverage across the final assembly, and in particular how well supported the final joins and gap-fills made during the assembly process are.

It would also be of interest to many readers to understand better how much the MinION data boosted the assembly; comparison with a SPAdes assembly made only with the Illumina data would be most instructive, even if only to give the number of contigs and the NG50, but actually analyzing the nature of what is successfully spanned by the MinION reads would be informative. More analysis of why the initial MinION reads did not lead to a fully closed assembly would also be useful. Analysis of the scaffolding process in this sort of detail -- what couldn't be spanned and why were other programs using the same data able to push through -- would raise this from a routine genome announcement to a useful addition to the genome assembly literature.

Also, it is curious that the authors shade the all-MinION assembly with "However the assembly process was complex"; that paper used an error correction step, assembly and several rounds of polishing. This manuscript uses read trimming (of the Illumina reads), assembly, 2 rounds of scaffolding followed by one round of gap filling. It is difficult to see a significant difference in the level of skill required to implement either of these procedures.

One other minor note, the mention of invertible promoters in the Discussion lacks the citation for this -- yes, it was used previously in the text, but since many readers may first jump in and encounter this item in this space, it is worth repeating the footnotes here.

Level of interest

Please indicate how interesting you found the manuscript:

An article whose findings are important to those with closely related research interests

Quality of written English

Please indicate the quality of language in the manuscript:

Acceptable

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I am, via my employer, also a participant in the Oxford Nanopore MinION Access Program, and as such receive free consumables for the system. However, I feel I can remain an impartial reviewer of data from this system, as I think this review will demonstrate.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

I agree to the open peer review policy of the journal.




Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly.

Review posted on 09th February 2015

The title of this manuscript would lead a reader to expect a careful comparison of two broad strategies for de novo sequence assembly. Unfortunately, what the manuscript delivers is an error-rich and outdated introduction, incompletely defined methods and an extremely limited comparison on a single dataset of two assembly programs.

In their introduction the authors dig far into the history of DNA sequencing, nearly to the very beginning. However, this summary is filled with dubious assertions that lack citations. For example, they credit PCR with boosting Sanger sequencing over Maxam-Gilbert sequencing, but Sanger sequencing had already all-but-extinguished Maxam-Gilbert before PCR had become commonly used in any facet of sequencing, and even today Sanger sequencing does not have a reliance on PCR (the authors may be confusing cycle sequencing, which relies on a linear amplification using thermostable polymerases, with PCR).

The authors present in figure form a comparison of four "next generation sequencing" systems (a term that really should be retired, given the fact that these systems are over a decade old now). The figure is exquisitely badly formatted and nearly unreadable when printed due to using a thin white font on dark backgrounds; if the information was of any value it should have been formatted as a table. Alas, the information in the figure is worse than its formatting, being completely out-of-date.

The statistics given for Illumina sequencing, which name an instrument (the GA) discontinued several years ago, give a cost per base that is roughly right for the MiSeq platform, but the read lengths on that platform are far longer (now 2x300). Several of the other Illumina platforms offer longer read lengths than given, with costs per basepair that are several orders of magnitude lower than given in the figure.

Another quadrant of the figure describes the SOLiD system, which has rarely been used for de novo assembly. In any case, the number of reads per run and read length are both wrong, which leads to the cost per basepair being off by over an order of magnitude.

A third quadrant gives obsolete statistics for the 454 platform, which hit read lengths of over 800 bases (or >2X that given in the figure). However, that really doesn't matter except historically since Roche discontinued the 454 platform in 2014. Even worse is the 4th quadrant, which describes the Helicos sequencer, from a company that went bankrupt in 2011.

The Ion Torrent systems, despite being used frequently for small genome de novo assembly, are not mentioned anywhere. Missing from the figure, but briefly mentioned in the text, is the Pacific Biosciences platform. Given that PacBio has been used extensively for de novo assembly, this is inexcusable. Furthermore, the paper fails to mention that PacBio is very different in its read characteristics, particularly read length.

A section on the experimental workflows for these systems attempts to summarize all of them in one paragraph. There is one serious error here; in library preparation PCR is performed after ligation of adaptors and not before. More seriously, the described workflow does not apply to either of the single molecule systems which they have mentioned; neither uses PCR and Helicos didn't even have a ligation step.

A brief summary of de novo sequence assembly algorithms has a few small errors (for example, while many implementations of overlap-layout-consensus (OLC) use k-mers to speed execution, k-mer analysis is not an inherent facet of the algorithm as the authors suggest). More serious is that only de Bruijn graph and OLC are discussed; string graphs are omitted and would be very relevant to the purported scope of this paper.

For the paper, the authors downloaded a single dataset from the Assemblathon dataset, for Zebrafish (oddly described as "fish species M.zebrafish"). No explanation is given why this dataset was chosen. Some assembly runs involved removing low quality data from the dataset, but no explanation is given as to what criteria were used to define low quality or tools used to remove them. This relates to another gaping hole in the manuscript: numerous approaches for preprocessing data have been described in the literature, including read filtering, read trimming, error correction, paired end merging and k-mer based normalization; none of these topics are broached. This will become clearly unfortunate later in their manuscript.

While the title promises a significant comparison of methods, the manuscript describes using only two programs: Velvet standing in for single compute node de Bruijn graph algorithms and Contrail for distributed computing de Bruijn graph assemblers. While a few other DBG assemblers are mentioned, the existence of other DBG assemblers which can run across multiple compute nodes is not (e.g. Ray, ABySS). Since the manuscript focuses on the Hadoop aspect of Contrail (which is the framework it uses to distribute the computing across multiple nodes), the paper could leave the unfortunate impression that this is the only attempt in the field, rather than one of many mechanisms (e.g. MPI).

The authors begin by trying a sampling of k-mer values for both Velvet and Contrail using a 2X dataset. The method used for downsampling the dataset is not given (while the text promises that the Perl programs used are available on request, this should be seen as an unacceptably inadequate mechanism; at a minimum they must be supplementary materials but better would be deposition in a public code repository). They measure two figures-of-merit (N50 and maximum contig size), but plot only one of them (though this plot is the best single element in the paper). A justification for using a 2X sample, rather than a larger one, is not given. This opens the question whether a larger k-mer length might have worked better on a larger dataset.
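To make concrete what a documented downsampling step could look like, here is a minimal sketch, assuming paired reads are already held in memory as (mate 1, mate 2) tuples. It is not the authors' Perl code, which was not provided; the function name, the seed and the toy read names are all hypothetical.

```python
import random


def subsample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    A fixed seed makes the subsample reproducible, and sampling whole pairs
    keeps mates together so a paired-end assembly mode remains usable.
    """
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < fraction]


# Hypothetical toy input: 1000 read pairs identified by name only.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
quarter = subsample_pairs(pairs, 0.25, seed=42)
print(len(quarter), "of", len(pairs), "pairs kept")
```

Reporting the fraction and the seed (or the equivalent options of whatever tool was actually used) would be enough for a reader to regenerate the 25% and 50% datasets.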

The authors proceed to try both programs on the entire dataset; Contrail succeeds but Velvet fails. Velvet fails again on 50% of the data if in paired end mode (though again, the method of downsampling is not given) but runs on that dataset in unpaired read mode. Velvet is tried also, in both modes, on a 25% dataset and succeeds. The authors present the figures-of-merit as a table, with no apparent order. Since Contrail (at least the version used) had only an unpaired mode, it is run only once. This data would be far more useful plotted as a graph as well, with the table sorted in some order relevant to the user, such as the 2X, 25%, 50% and 100% of dataset.

A serious issue at this point is the authors' choice of N50 and maximum contig length as their sole figures-of-merit, which they mistakenly label as measures of assembly quality. At no point do the authors attempt to assess the correctness of their assemblies, despite this being a standard method in assembler comparisons (such as the Assemblathon from which the authors obtained the data used). Both N50 and maximum contig length can be inflated by overly aggressive assembly that yields misassembly artifacts, and N50 can be inflated by the choice of a minimum contig length cutoff. Indeed, the authors fail to report a genome size for their assemblies, and so these assemblies could represent only a fraction of the target genome.
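The sensitivity of N50 to a minimum contig length cutoff, and the value of measuring against a genome size instead, can be shown with a toy calculation. This is a minimal sketch on made-up contig lengths; the function names, the cutoff of 50 and the genome sizes are illustrative assumptions and have nothing to do with the manuscript's data.

```python
def n50(lengths, min_len=0):
    """N50 of the contigs that are at least `min_len` long.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the summed contig length.
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    running = 0
    for n in kept:
        running += n
        if 2 * running >= total:
            return n
    return 0  # no contigs survived the cutoff


def ng50(lengths, genome_size):
    """Like N50, but the target is half of an (estimated) genome size."""
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= genome_size:
            return n
    return 0  # the assembly covers less than half of the genome


# Hypothetical toy assembly: three long contigs plus thirty short ones.
contigs = [100, 80, 60] + [10] * 30

print(n50(contigs))              # every contig counted
print(n50(contigs, min_len=50))  # short contigs discarded first
print(ng50(contigs, 1000))       # measured against a 1000-base genome
```

On this toy input the N50 rises from 10 to 80 purely because contigs shorter than 50 were discarded, which is why a cutoff, a total assembly size and preferably an NG50 need to be reported alongside any N50.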

The authors observe that the 25% and 50% datasets gave similar results for their figures-of-merit, and observe that less data can give equal or better results. They appear to have not asked if this has been observed before (it has). Nor do they run Contrail on the subsampled datasets to see if the trend holds there as well.

The issue of Velvet crashing on the larger datasets is presented as highly significant; indeed the conclusion is drawn that multi-machine programs such as Contrail are required for this data. This is highly unfortunate on two grounds.

First, as noted before, the authors performed nearly no preprocessing of the data (other than the ill-documented poor quality read removal). Sequencing errors will enlarge the de Bruijn graph, so error correction or read trimming can reduce the memory requirements of an assembler. Paired end merging can similarly reduce memory requirements, albeit at some risk of telescoping small repeats. Merging is particularly relevant for Contrail, since it does not explicitly handle paired ends. K-mer based read normalization can greatly reduce memory requirements for assembly.
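The effect of sequencing errors on graph size can be illustrated by counting distinct k-mers, which are the nodes (or edges, depending on the formulation) of a de Bruijn graph. This is a minimal sketch, assuming a single strand only (real assemblers generally also account for reverse complements) and using a made-up 20-base sequence; none of the names or values come from Velvet, Contrail or the manuscript.

```python
def kmers(seq, k):
    """Set of all k-mers in one sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinct_kmers(reads, k):
    """Distinct k-mers across a collection of reads."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


K = 5
# Hypothetical error-free read and a copy with one substitution at position 10.
clean = "ACGTTGCAAGGCTTACCGAT"
with_error = clean[:10] + "A" + clean[11:]

before = distinct_kmers([clean], K)
after = distinct_kmers([clean, with_error], K)

print(len(before), "distinct k-mers without the erroneous read")
print(len(after), "distinct k-mers with it")
print(len(after - before), "k-mers exist only because of the error")
```

Here a single substitution adds K new k-mers, one for every window that overlaps the erroneous base, so across deep coverage even a low per-base error rate adds up to a great many spurious k-mers. Error correction, trimming and k-mer based normalization all act to shrink that excess before the graph is built.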

Second, a number of programs have demonstrated assembly of vertebrate-scale short read datasets on single machines, indeed single machines with far less memory than the 1 Tbyte found on the compute node used for Velvet in the paper. Examples include Minia, with a de Bruijn graph structure designed to be extremely memory efficient, and Readjoiner, which uses a string graph paradigm (which, as noted above, is a strategy ignored by the paper in the introduction).

Finally, the authors fail to make any attempt to place this in a relevant modern context. Given that short reads from short inserts alone are mathematically incapable of assembling anything but the simplest plasmid or viral genomes, the current thrust in de novo assembly is assembling either entirely from long reads or integrating short reads with long reads or mate pairs to accurately yield long (increasingly, chromosome-scale) scaffolds. Failing to place the very limited findings of this manuscript in such a context could be characterized as a final failing.
