Review for "Updating the 97% identity threshold for 16S ribosomal RNA OTUs"

Completed on 11 Oct 2017 by Patrick Schloss. Sourced from https://www.biorxiv.org/content/early/2017/09/21/192211.

Comments to author

The preprint by Robert Edgar sets out to take on the issue of what similarity threshold should be used to delineate bacterial species using partial and full-length 16S rRNA gene sequences. This is well-covered territory and I'm not sure that many people would defend to the death the assertion that a 97% cutoff describes species-level taxa. It is helpful to have a discussion about the various thresholds people use to bin sequences into OTUs. I think that the broader discussion and the discussion in this specific preprint in favor of a high threshold (e.g. 99.9 or 100%) have come off as being rather dogmatic. My comments below include suggestions for taking a more nuanced view. Ultimately, I think Edgar's and others' goal of pushing the field to a high threshold is an attempt to get a tool to do something it is not capable of doing. Specifically, 16S rRNA gene fragment sequences cannot delineate bacterial species and cannot tell us about phenotype. If scientists have these types of questions, there are far more powerful tools at their disposal than debating the appropriate threshold for defining OTUs.

To be transparent, much of the material that Edgar uses as a point of contrast to his work consists of papers that I have published over the past few years, and I am the creator of mothur. As of writing this review, I have not been asked to review this manuscript for a journal, but would be happy for any editor to use my comments. Judging from the style of writing, my sense is that this preprint is unlikely to have already been submitted to a journal.

Major comments.

1. The general approach Edgar has taken is to use a variety of metrics to compare the composition of operational taxonomic units (OTUs) generated by database-independent approaches to the taxonomic assignment for those sequences. By identifying the distance threshold that optimizes these metrics, he arrives at the conclusion that the widely used 97% threshold is too low. Although this approach may be new, this conclusion is not (see the numerous papers published by [Tiedje and Konstantinidis](https://www.ncbi.nlm.nih.go.... I have significant concerns about his method and do not think Edgar has appropriately described the limitations of his approach. His is a problematic approach because systematists are inconsistent in how they lump and split strains into bacterial species. From the perspective of the 16S rRNA gene, some species are finely split (e.g. Bacillus cereus, subtilis, anthracis) and others are lumped (e.g. Pseudomonas putida). There is broad consensus within microbiology that the 16S rRNA gene is unable to delineate bacterial species or phenotype. Furthermore, a 250 nt region of that gene is even less able to delineate a species. Considering that only a minority of bacteria have actually been assigned a species-level classification, using taxonomy as the ground truth for assessing a threshold is problematic. Previous attempts have replaced the DNA-DNA hybridization approach of Stackebrandt and Goebel with genome-scale phylogenies and attempted to correlate that structure with 16S rRNA gene sequence diversity. These caveats, as well as a more thorough review of attempts to find a better cutoff, are warranted in a revised manuscript.

2. One of the reasons to favor a less restrictive threshold (e.g. 97%) is that there is considerable intragenomic variation in addition to considerable intraspecies variation. Using a higher threshold risks splitting sequences from the same genome into different OTUs. Previously, Edgar has indicated that he thinks this variation is the result of sequencing artifacts or contamination (see bottom of page 9, https://doi.org/10.1101/081... they are not. As an example of intragenomic variation, E. coli ATCC 70096 has 7 copies of the 16S rRNA gene and 6 of these are different from each other in the full-length version of the gene. Fortunately, within the V4 region the 7 copies are identical. As a second example, Staphylococcus aureus ATCC BAA-1718 and Staphylococcus epidermidis ATCC 12228 both have 5 copies of the 16S rRNA gene. Considering the V4 region of these species, 4 of the 5 copies in each genome are identical between the two species. The remaining S. aureus copy is 1 nt different from the other S. aureus copies; however, the remaining S. epidermidis copy is 1.7 and 2.0% different from the other S. epidermidis and S. aureus copies. The less restrictive threshold would lump the two species together; however, the more restrictive threshold suggested by Edgar would generate 3 OTUs. Neither outcome reflects the biology he claims, and the method would split sequences from the same strain into different OTUs. Given the ubiquity of these strains in skin-associated communities, it would make sense to offer a more guarded recommendation than to make dogmatic pronouncements about using high thresholds. In the Discussion, Edgar brushes off intraspecies variation concerns and seems to ignore the case where an investigator would like to make an inference regarding the association between the relative abundance of individual OTUs and different treatment groups. Furthermore, he seems to think it would be possible to correct for the inflated alpha diversity metrics obtained by splitting sequences from the same species into different OTUs; it seems reasonable to say the same about a lower threshold. Although Edgar's Pcs calculations seem to account for intraspecies variation, they do not seem to factor in intrastrain variation.
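To make the Staphylococcus example concrete, here is a minimal sketch of how the number of OTUs changes with the cutoff. Several things in it are assumptions for illustration only: the three unique V4 sequences are given made-up labels, the 1 nt difference is converted to a fraction using an assumed 253 nt V4 fragment, the 2.0% figure is assumed to be the distance between the two variant copies, and the grouping is plain single-linkage clustering rather than the algorithm used by mothur or by Edgar.

```python
# Illustrative pairwise distances between the three unique V4 sequences.
# All labels and the 253 nt fragment length are assumptions, not data
# taken from the preprint or the genomes themselves.
distances = {
    ("shared", "aureus_variant"): 1 / 253,          # 1 nt difference
    ("shared", "epidermidis_variant"): 0.017,       # 1.7% different
    ("aureus_variant", "epidermidis_variant"): 0.020,  # assumed 2.0%
}


def count_otus(distances, cutoff):
    """Count single-linkage clusters at the given distance cutoff."""
    names = sorted({name for pair in distances for name in pair})
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links until reaching the cluster representative.
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), d in distances.items():
        if d <= cutoff:
            # Merge the two clusters when the pair is within the cutoff.
            parent[find(a)] = find(b)

    return len({find(name) for name in names})


# Cutoffs of 0.03, 0.01 and 0.0 correspond to 97%, 99% and 100% identity.
otus_by_cutoff = {c: count_otus(distances, c) for c in (0.03, 0.01, 0.0)}
print(otus_by_cutoff)  # {0.03: 1, 0.01: 2, 0.0: 3}
```

Under these assumptions the 97% cutoff lumps both species into one OTU, while a 100% cutoff yields 3 OTUs from two genomes.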

3. Edgar states "Also, state-of-the-art denoisers have been shown to accurately recover biological sequences from 454 and Illumina amplicon reads (Quince et al., 2009; Callahan et al., 2016; Edgar, 2016) suggesting that the best strategy for amplicon reads is to cluster denoised sequences, in which case the clustering problem is well-modeled by error-free sequences from known species." Again, I would encourage caution in pushing these methods as the strengths and weaknesses of the approaches are not well established. Some of the methods are aggressive in removing rare sequences that may be true sequences, others seem to overfit complicated models, and as described above, others may be splitting 16S rRNA genes from the same genome into different OTUs. Furthermore, the lack of randomness in sequencing errors has not been addressed thoroughly, which creates the possibility that a spurious sequence with sufficient sequencing coverage could be treated as a new OTU rather than be folded into a similar OTU. Finally, these methods have not been well validated for the breadth of sequencing platforms that people are using. I am far more confident in the quality of sequences generated from fully overlapping 250 nt MiSeq reads for the V4 region than I am for single HiSeq reads of the V4 region. There is a trend for people to push the length of the region and throughput at the expense of quality. In short, I agree that a species likely requires a very high threshold for 16S rRNA gene sequences; however, I am not convinced by the papers he has cited that the data accumulated in the literature is of sufficient quality to trust OTUs generated with high thresholds. Combined with the reality of intragenomic variation, I see value in having a more nuanced recommendation.

4. I am happy to receive Edgar's critique regarding the methods used in mothur. I do not see how his sections comparing mothur and pairwise alignments or discussing adverse triplets help make his points about the OTU threshold. I would suggest removing these sections unless he can find a way to tie them in better to his bigger claims - I certainly wouldn't lead off the Discussion with a critique of my use of the Matthews correlation coefficient. That is a weak way to summarize his story. The following two comments address these specific sections, which, again, I do not feel have a direct connection to the goal of the paper.

A. The comparison between NAST-based profile alignments and pairwise alignments has previously been published. We too saw that pairwise alignments yielded smaller distances than profile alignments (doi: 10.1371/journal.pcbi.1000844 and doi: 10.1371/journal.pone.0008230). By definition, a pairwise alignment optimizes the similarity between the two sequences. In contrast, a profile-based alignment, where the reference is aligned to the secondary structure of the 16S rRNA molecule, incorporates additional information, and this frequently increases the distance between sequences. I have also addressed this previously in the literature (doi: 10.1038/ismej.2012.102). I agree that the example Edgar shows is a problem. It is a well-known issue with profile alignments - if there are problems in the reference, there will be problems with the alignment. When using the SILVA reference alignment, such errors can be corrected by fixing the reference alignment. Furthermore, I would point out that an advantage of using a profile alignment like the NAST aligner in mothur is that it is considerably faster than a pairwise alignment. Generating pairwise alignments for N sequences would take on the order of N times longer than a profile alignment (i.e. profile alignments scale linearly while pairwise alignments scale quadratically). With large datasets pairwise alignments can be prohibitive, whereas a profile alignment takes only seconds.
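The scaling argument can be sketched by simply counting alignments. This is a back-of-the-envelope illustration, assuming each individual alignment costs roughly the same and ignoring the later distance calculation; the function names are hypothetical and do not correspond to anything in mothur.

```python
def profile_alignment_count(n):
    """Each of n sequences is aligned once against the reference."""
    return n


def pairwise_alignment_count(n):
    """Every unordered pair of n sequences is aligned once."""
    return n * (n - 1) // 2


for n in (1_000, 100_000):
    profile = profile_alignment_count(n)
    pairwise = pairwise_alignment_count(n)
    # The ratio is (n - 1) / 2, so it grows in proportion to n.
    print(n, profile, pairwise, pairwise / profile)
```

For 100,000 sequences this is 100,000 profile alignments versus roughly 5 billion pairwise alignments.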

B. Regarding the section, "Comments on the MCCsw metric"... I readily acknowledge that because evolution does not care to conform to a similarity threshold when creating species, there will be "adverse triplets" around any threshold. As I've pointed out above, there are adverse triplets in the case of S. epidermidis V4 sequences and full-length E. coli 16S rRNA gene sequences. In fact, this is why we developed the MCC metric. It evaluates how well an algorithm balances the need to split and lump similar 16S rRNA gene sequences when assigning sequences into a bin. We have used MCC in a fundamentally different way than Edgar has in this paper. We used it assuming that the taxonomic databases are not helpful. He uses it assuming that they are the ground truth. Perhaps there is room for both views, but given the points I raised above, I am happy to stick with my approach over Edgar's.
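For readers unfamiliar with how MCC can be applied without a taxonomic reference, the following is a minimal sketch. It assumes the pair-based definitions I understand to be the usual ones: a pair within the cutoff and placed in the same OTU is a true positive, a pair beyond the cutoff and placed in different OTUs is a true negative, and the two mismatched cases are false positives and false negatives. The four sequences, their distances, and the function name are made up for illustration; they include one adverse triplet (a is close to both b and c, but b and c are far apart).

```python
from math import sqrt


def pairwise_mcc(distances, otu_of, cutoff):
    """Matthews correlation coefficient over all sequence pairs."""
    tp = tn = fp = fn = 0
    for (a, b), d in distances.items():
        close = d <= cutoff
        together = otu_of[a] == otu_of[b]
        if close and together:
            tp += 1  # similar pair, correctly lumped
        elif not close and not together:
            tn += 1  # dissimilar pair, correctly split
        elif together:
            fp += 1  # dissimilar pair, wrongly lumped
        else:
            fn += 1  # similar pair, wrongly split
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical distances; cutoff 0.03 corresponds to 97% identity.
distances = {
    ("a", "b"): 0.01, ("a", "c"): 0.02, ("a", "d"): 0.05,
    ("b", "c"): 0.04, ("b", "d"): 0.06, ("c", "d"): 0.01,
}

# Two candidate assignments of the same four sequences to OTUs.
lumped = {"a": 1, "b": 1, "c": 1, "d": 2}
split = {"a": 1, "b": 1, "c": 2, "d": 2}

mcc_lumped = pairwise_mcc(distances, lumped, 0.03)
mcc_split = pairwise_mcc(distances, split, 0.03)
print(round(mcc_lumped, 3), round(mcc_split, 3))  # 0.333 0.707
```

Neither assignment reaches an MCC of 1 because the adverse triplet cannot be satisfied, but the metric still ranks the two assignments using only the distances and the cutoff, with no taxonomy involved.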