Completed on 25 Jul 2015
The manuscript "Real-time strain typing and analysis of antibiotic resistance potential using Oxford Nanopore MinION sequencing" by Cao et al. describes strain typing and resistance profiling of three Klebsiella strains obtained from ATCC. The main advancement described in this manuscript is the development of a software pipeline that allows rapid analysis of MinION data, and the development of a new approach to strain typing based on the presence/absence of genes, which is better suited to the analysis of MinION data than multi-locus sequence typing because of the relatively high error rate of MinION sequences. Overall, I believe that this manuscript is of interest to a wide audience, and showcases one of the main advantages of the MinION sequencing approach, i.e. the availability of sequencing data in real time. However, there are also some shortcomings with the study: The authors use bacterial cultures as starting material
We have changed the abstract to de-emphasise the clinical application and rewritten the paper to focus on its methodological aspects, which is where the novelty of this work lies.
analysis - it remains to be demonstrated that the method they propose would actually work on clinical samples, where there might be a lot less DNA present, and also abundant irrelevant (e.g. human) DNA present. Also, regarding the strain detection by presence or absence of genes, the authors would do well to include further information in their manuscript.
Specific points that should be addressed are:
As remarked in response to reviewer 1, we now show sequence yield as well as proportions in Figures 4 and 5, and we now report the amount of data generated at all of the critical timepoints as well as the time taken.
1) The authors show all data as a function of sequencing time. This is of course the most intuitive way to present these data; however, given that there will be future changes to sequencing speed (i.e. "fast-mode"), which should increase throughput significantly, it might be helpful to also present these data as a function of reads (aligned to their database). This would also be an important and maybe more objective representation, since the number of reads that align to bacterial genomes will significantly decrease once clinical samples are used, due to lower abundance of bacterial DNA and a presumably high abundance of irrelevant (e.g. human) DNA. Further, this would allow the authors' data to be compared more easily with other studies.
We now include a description of the read quality, read length and yield in the Results section, Figures 2 and 3, and Table 2.
2) Also, in order to allow easier comparison to other studies the authors should report on the overall accuracy of the sequencing data they observed (comparing the aligned individual reads to the reference genome), both for all reads and 2D reads only.
We have tried to make it clear that we use all reads for analysis:
"Our pipeline allows for filtering out 1D reads at multiple stages (including via npReader). All subsequent analyses in this paper used both 1D and 2D reads."
3) It is unclear whether the authors used all reads or 2D reads only for their analysis.
We now provide more information about this approach in the methods section. We have introduced a new sample in this paper, which is a mixture of S. aureus and E. coli, and have compared the number of reads needed for successful strain typing in these species versus K. pneumoniae, and how that may be affected by genome plasticity, as follows:
"Our pipeline identified genes present in the sample from sequence reads as they were generated by the MinION device. It used this information to infer the posterior probability of each of the strain types, as well as the 95% confidence intervals in this estimate (see Methods). For our three K. pneumoniae samples, we successfully identified the corresponding strain types from the sequence data with 95% confidence within 10 minutes of sequencing time and with as few as 200 sequencing reads (Figures 4a, 4b and 4c). We streamed sequence reads from the mixture sample through the strain typing systems for E. coli and S. aureus, and in both cases the correct strain types of the two species in the sample were also recovered. The correct type for the E. coli strain in the 75%/25% E. coli/S. aureus mixture was recovered after 25 minutes of sequencing with about 1000 total reads (or approximately 750 E. coli-derived reads) (Figure 4d). The pipeline was able to correctly predict the S. aureus strain (which is known to have much less gene content variation) in this mixture sample after two hours of sequencing with about 2,800 total reads (or approximately 700 S. aureus-derived reads)."
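To make the idea of inferring a posterior over strain types from gene presence concrete, the following is a minimal sketch only, not the authors' model. It assumes a uniform prior over strains and a simple mixture likelihood in which each gene hit comes from the strain's own gene set with probability `1 - eps` and from any gene in the database with probability `eps`. The function name, the `eps` value and the toy gene profiles are all hypothetical, and the confidence intervals mentioned in the quoted passage are not computed here.

```python
import math


def strain_posteriors(profiles, observed_genes, eps=0.2):
    """Posterior probability of each strain under a uniform prior.

    profiles: dict mapping strain name -> set of gene ids it carries.
    observed_genes: list of gene ids, one per gene hit seen in the reads.
    eps: assumed fraction of hits that are noise (hypothetical value).
    """
    all_genes = set().union(*profiles.values())
    log_like = {}
    for strain, genes in profiles.items():
        total = 0.0
        for gene in observed_genes:
            # Noise component: uniform over every gene in the database.
            p = eps / len(all_genes)
            # Signal component: uniform over the genes this strain carries.
            if gene in genes:
                p += (1.0 - eps) / len(genes)
            total += math.log(p)
        log_like[strain] = total
    # Normalise in log space to avoid underflow with many hits.
    top = max(log_like.values())
    weights = {s: math.exp(v - top) for s, v in log_like.items()}
    norm = sum(weights.values())
    return {s: w / norm for s, w in weights.items()}


# Toy example with two made-up strains and three made-up gene hits.
toy_profiles = {"strain_A": {"g1", "g2", "g3"}, "strain_B": {"g1", "g4"}}
toy_posterior = strain_posteriors(toy_profiles, ["g1", "g2", "g3"])
```

In this toy setting, hits on genes carried by only one strain shift the posterior towards that strain, while hits on shared genes carry little information. This is in line with the response's remark that a species with less gene content variation needs more reads before its strain can be called.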
4) The authors propose a new approach to strain typing based on the presence and absence of genes. However, they provide little information about this approach. Questions they should address: How much difference is there in the presence of individual genes between different Klebsiella strains? Does each strain actually have a unique set of present genes that allows clear identification? How many present genes have to be identified to allow for reliable strain typing? Based on their experiences, how many sequencing reads does this equate to?
This is an excellent suggestion, and since both the MLST typing methodology and the strain typing methodology are likelihood based, this would certainly be possible; however, we feel it is beyond the scope of this work.
5) Can the authors combine their new typing approach with MLST typing, to reach a better overall typing performance?
We have sequenced a mixed sample of S. aureus (25%) and E. coli (75%) and we have applied our pipelines to this dataset to demonstrate the robustness of the inference in the presence of multiple species.
6) What happens to their workflow if contaminant DNA sequences are introduced (e.g. human sequences, other bacterial sequences)? Of course best would be to experimentally use clinical samples as starting material, but this might go beyond the scope of this paper. However, MinION data from other publications, including those from human genes, are available. Would it be possible for the authors to at least use these data to introduce "contaminations" in silico?
We use BWA-MEM instead of lastal or marginAlign because this allows streaming analysis. We adjusted the parameters of BWA-MEM to achieve similar performance to lastal and marginAlign, and we make this clear in the methods section. We use kalign2 for the multiple alignment step in order to call a consensus sequence for both the MLST typing step and the antibiotic resistance gene detection, and this is also now made clearer in the methods section.
7) The authors use kalign2 for their analysis. How does this compare to other aligners often used for analysis of MinION data, e.g. lastal or marginAlign, which was specifically developed to process MinION data?
Our pipeline does indeed allow for real-time analysis. Because the datasets we report in the manuscript were sequenced before we developed the pipeline, we analysed them retrospectively by emulating the sequencing timing (that is, we presented the reads to the pipeline in the exact timing and order in which they were sequenced). We have subsequently run the pipeline successfully in genuine real time alongside MinION sequencing on several datasets (not presented in the paper). In order to examine the actual computational timing (and hence the analysis throughput of our pipeline), we emulated a throughput over 100 times higher than the one we obtained, and our pipeline could analyse the data on a 16-CPU computer. We have added the following in the section 'Computational time':
"In our analyses, sequence reads were streamed through the pipeline in the exact order and timing as they were generated. Analysis results were generated periodically (every minute for species typing and strain typing and every five minutes for resistance gene identification). We examined the scalability of the pipeline to higher throughput by running the pipeline on a single computer equipped with 16 CPUs and streaming all sequence reads from the highest yield run (185Mb from sample K. pneumoniae ATCC BAA-2146) through the pipeline at 120 times higher speed than they were generated (e.g., data sequenced in 2 minutes were streamed within 1 second). Analysis results were generated every 5 seconds for typing and every one minute for gene resistance analysis. With this hypothetical throughput, our pipeline correctly identified the species and strain of the sample in less than 20 seconds, upon which we could terminate the typing analyses. The pipeline then reported all the resistance genes in five minutes, which corresponded to the data generated in the first 10 hours of actual sequencing. This demonstrates the scalability of our pipeline to higher throughput sequencing."
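The timing emulation described in this response can be pictured with a short sketch. It is an illustration under stated assumptions rather than the pipeline's actual code: reads are assumed to be available as `(timestamp_in_seconds, read_id)` pairs, and the function name `replay_reads` and its `speedup` and `sleep` parameters are hypothetical. Gaps between consecutive reads are divided by the speed-up factor, so with a factor of 120 two minutes of sequencing is replayed in one second.

```python
import time


def replay_reads(reads, speedup=1.0, sleep=time.sleep):
    """Yield reads in sequencing order, pausing between them.

    reads: iterable of (timestamp_in_seconds, read_id) pairs.
    speedup: factor by which the original gaps are shortened.
    sleep: function used to wait; injectable so the replay can be tested
           without real delays.
    """
    previous = None
    for timestamp, read_id in sorted(reads):
        if previous is not None:
            # Wait for the original inter-read gap, compressed by speedup.
            sleep((timestamp - previous) / speedup)
        previous = timestamp
        yield timestamp, read_id


# Example: three made-up reads two minutes apart, replayed 120 times faster.
# The waits are recorded in a list instead of actually sleeping.
recorded_waits = []
toy_reads = [(120.0, "read_2"), (0.0, "read_1"), (240.0, "read_3")]
replayed = list(replay_reads(toy_reads, speedup=120.0, sleep=recorded_waits.append))
```

A downstream analysis step would consume this generator as if it were the live read stream, so the same code path could be exercised at the original speed (`speedup=1.0`) or at a higher emulated throughput.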
8) While the authors claim real-time sequencing, it appears to me that they analyzed the data after the run was completed, using npReader to extract read-times, and then sorting reads based on this information into 5 minute segments. How long does the actual analysis take? Does their approach really allow for real-time analysis? Alternatively, if I misunderstood their manuscript in this respect, the authors would do well to specify this more clearly.
REVIEWER'S RESPONSES TO QUESTIONS
Does the work include all necessary controls?
If not, please comment on the additional controls that are required.
Are the conclusions drawn adequately supported by the data shown?
If not, please explain.
Are sufficient details provided to allow replication and comparison with related analyses that may have been performed?
No: While the authors provide most of the required details, a few pieces of information are missing, which they could easily provide. This is further outlined in the comments to the authors below.
We have explained above how we have addressed these points. In general much more detail has been provided.
Does the manuscript adhere to the field standards for experimentation, nomenclature and public availability of data (or any other significant standards)? Is the software freely available, open source and with an appropriate free-to-use license?
Does the method perform better than existing methods (as demonstrated by direct comparison with available methods)?
Is the method likely to be of broad utility? Is any software component easy to install and use?
Please indicate briefly the novel features and/or advantages of the method, and/or please reference the relevant publications and which methods, if any, it should be compared with.
The main novel feature of the method is the rapid pipeline for species detection and genotyping, as well as the novel approach for strain detection by presence or absence of genes. These would be of broad utility in my assessment. As is currently common for analysis of MinION data, the pipeline appears to be a patchwork of different software tools that requires significant (bio-)informatics skills to install and use. A more detailed review of the pipeline was impossible, as the web site http://www.genomicsresearch.org/public/researcher/npAnalysis/ where the software is supposed to be provided could not be accessed at the time of writing this review.
There was a problem with the service hosting this website, which has been rectified, and a backup solution has also been put in place to ensure 100% uptime. We now host the scripts, documents and links to data in a GitHub repository (https://github.com/mdcao/npAnalysis).
Is the paper of broad interest to others in the field, or of outstanding interest to a broad audience of biologists?
If yes, please explain why.
Yes: A rapid species detection and genotyping pipeline would be of considerable interest for clinicians, but could also easily be adapted to a wide range of other applications, e.g. environmental testing, forensic analysis. As such I believe this paper to be of interest to a very wide audience of biologists.