Completed on 7 Mar 2017 by Patrick Schloss . Sourced from http://biorxiv.org/content/early/2017/02/28/112391.

Login to endorse this review.

Author response is in blue.

The preprint from Herren and McMahon describes a new metric - cohesion - to describe the overall connectedness within a community using temporal data. I was excited to see this preprint because I am familiar with McMahon's long history of developing rich time series data for microbial communities in Wisconsin lakes. I also have a lot of my own time series data from humans and mice where we struggle to incorporate time into the analysis to understand the interactions between bacterial populations.

A significant struggle in analyzing time course community data is the ability to synthesize observations for large numbers of taxa over time. Many of the existing methods people use attempt to adapt methods from cross sectional studies. For example, a study may sample a large number of lakes, people, soils, etc and characterize their microbial communities. They'll then calculate correlations across those samples based on the relative abundance of the populations. Alternatively, they'll used presence/absence data to generate co-occurrence matrices. The problem with these studies is that the next step is to often infer something about the interactions between the populations - even if the populations would never possibly co-occur. Herren and McMahon's efforts to study the connectedness of individual populations and their cohesion is very welcome because it has the potential to get us closer to describing the actual interactions between populations.

To briefly summarize the approach, the method starts by calculating the Pearson correlation between all pairs of populations across time and then discounts the correlation that would be expected if all interactions were random. This is important because of the compositional nature of the data and the effects of different population sizes. Next, the method calculates the average positive and negative corrected correlation for each population. These become the positive and negative connectedness values for each population. Finally the positive and negative cohesion values for each community is calculated by determining the sum of the product of the connectedness value and the relative abundance for that population.

The following are general critiques and questions, which I appreciate may be beyond the scope of the current manuscript (note, I am not a reviewer for the manuscript at a journal):

1. To develop the cohesion metric for a community, the authors sum over all of the populations in the community. This raised three questions for me. First, independent of the relative abundances in each sample, is the *number* of positive and negative connections for each population relevant? It might be worthwhile exploring which populations have more positive/negative connections than others. What does that distribution look like? Second, does the connectedness metric value itself have any value? What are the populations that are highly connected with other populations. Finally, the method generates a cohesion value for each time point. If I think of Lake Mendota as a community that was sampled over time, it would be interesting to know whether it has been more cohesive than Lake Monona over the 19 years of sampling. Thinking of my own work, I would be interested in knowing whether mice that are more susceptible to C. difficile colonization are less cohesive than those that are resistant. Again, this would require a composite score, not individual scores for each time point.

2. Continuing on my self-serving thread, I wonder how sensitive the method is to the time interval between samples and the number of samples. In my experiments I may have 20 daily samples from a mouse - is this sufficient? What if we miss a day - how will having a jump between points affect the metrics? As the authors state, the Lake Mendota dataset has 293 samples collected over 19 years (e.g. 1.3 samples/month). This is a very unique dataset that is unlikely to be repeated elsewhere. What if we were to get more frequent samples? What if they were more spaced out? What if we only had a year's worth of data? It would be interesting to see the authors describe how their cohesion values change when they subset the dataset to simulate more realistic sampling schemes.

3. A significant challenge in developing these types of metrics is not knowing what the true value of the metric is in nature. I appreciate Herren and McMahon's effort to validate the metrics by comparing their results to count data and to explaining the variation in Bray-Curtis distances. The manuscript reads almost like they want their method to recapitulate what is seen with those distances. But we already have Bray-Curtis distances, if that's the goal, then why do we need the cohesion metric? It would be interesting to see the authors simulate data from communities with varying levels of cohesion and abundance to see that the method gets back the expected cohesion value. Perhaps it would be possible to generate an ODE-based model to generate the data instead of variance/covariance data. There is one simulation described at the end of the Results (L300); however, it is unclear whether the lack of a meaningful R-squared value was the expected result or not.

4. Throughout the manuscript, the authors make use of parametric statistics such as Pearson's correlation coefficients and the arithmetic mean. Given that relative abundance data are generally not normally distributed and are likely zero-inflated, I wonder why the authors made these choices. I would encourage the authors to instead use Spearman correlation coefficients and median values. Related to this point, a concern with using these correlation coefficients is the problem of double zeros where two populations may be absent from the same communities. These will appear to be more correlated with each other than they really are, which is why we don't use these metrics for community comparison - we use things like Bray-Curtis. I wonder whether subtracting the null model counteracts the problem of double zeroes.

5. The authors translate their count data into relative abundance data before calculating their correlation and Bray-Curtis values. I wonder if the authors subsampled or rarefied their data to a common number of individuals. Both of these metrics are sensitive to uneven sampling. Even if the counts are converted to relative abundances, this would not remove the effects. For example, if one sample has 1000 individuals and another has 100, the limit of detection on the first would be 10-fold higher than the second. There may be populations that represent 0.5% of both communities that would not be seen in the second. If they haven't already, I would encourage the authors to subsample their dataset to a common number of individuals.

6. The "Description of datasets" section of the Methods describes the various datasets in general terms, but what is the nature of the data? How were the phytoplankton counted? How many individuals were sampled from each sample?

7. It would be great to have the code that was used made publicly available on GitHub

8. The authors present the material in a format that I have not previously seen in the microbial ecology literature (i.e. ISMEJ where this appears to be destined for review). The authors flip back and forth between presenting a different stage of the algorithm and validating that step. I think this is a bit more aligned with how one would present the material in a talk than in a paper. I've seen similar methods development described before where there might be a methods section on algorithm development and then the results section would test the assumptions and performance of the algorithm. I'm curious to see whether this structure persists through the editorial process.

Thank you for taking the time to carefully read our manuscript and provide constructive feedback. Your comments are logical follow-ups to the manuscript, and we considered many of your points when devising this workflow. We hope to make our workflow and metrics as widely accessible as possible, and appreciate the opportunity to clear up any questions about our analyses. We have also started thinking about how to incorporate user feedback into later versions of the script (e.g. adding in further options and/or parameters). If you have advice on this topic, it would be welcome. And, if you have any questions while using our method, we would be happy to answer those, as well.

Here are our responses, corresponding to your numbered comments:

1. We agree that calculating the connectedness values is one of the most critical parts of the workflow, and we considered many ways of doing this. The main alternative algorithms are discussed in the “Null Models and Sensitivity Analysis” supplementary file. We also thought it was possible that the number of positive or negative connections might influence the results. For this reason, we tried calculating the positive or negative connectedness values by dividing the sum of positive or negative correlations by the total number of pairwise correlations. This version of the connectedness algorithm could work better than the one presented in some cases. Our main evaluation criteria for settling on a connectedness algorithm were 1) whether connectedness values in the absolute and relative datasets showed strong correspondence 2) if cohesion metrics calculated from the connectedness metrics had strong predictive power of Bray-Curtis dissimilarity (please also see response to #4). The algorithm presented in the manuscript had the best performance for these two criteria when applied to many different datasets. We acknowledge your later point that microbial ecologists may be interested in response variables other than BC dissimilarity, and thus other versions of this workflow might be optimal for these different variables. For the curious user who is familiar with R, these changes are very easy to make in the cohesion R script. This kind of feedback (i.e. information on what worked best for your analysis) would be welcome input as we selectively include additional options and parameters in future script versions.

As for the question of whether connectedness has any value on its own, this is included in an upcoming second manuscript! Briefly, yes, we hypothesize that highly connected taxa act as keystone taxa in these systems. We also plan on discussing which taxa are highly connected in our study systems in future analyses. In the Mendota and Monona phytoplankton datasets, it is mainly cyanobacteria. Even though the same taxa occur in many of these lakes, they have different connectedness values among the different datasets. We interpret this to mean that ecosystem context contributes to the strength of taxon interactions.

It is more straightforward to compare connectedness values within datasets than between datasets. If datasets contain enough of the same OTUs and are from comparable environments, comparing cohesion values between datasets might make sense. However, without having analyzed such datasets, it is difficult to definitively answer this question. We are primarily concerned with how differences in evenness and richness between datasets would affect connectedness and cohesion metrics. But, we have run our analysis on the Human Microbiome Project dataset (again, with results contained in a forthcoming second manuscript). In this case, we calculated correlation values using samples from different people. Many body sites showed patterns similar to the ones presented in this manuscript, with greater cohesion corresponding to lower compositional change.

We considered some possible approaches to test whether mice susceptible to C. diff have microbial communities with lower cohesion. One method would be to calculate correlations/connectedness values using samples from many different mice, including both susceptible and non-susceptible mice. Then, cohesion values could be calculated for each mouse. However, this raises the question of whether it is appropriate to calculate correlations by pooling samples from two inherently different groups (susceptible and not susceptible). We have not tested our workflow on datasets like this, and we would need to run some diagnostics before saying whether this method would be valid. Alternately, cohesion values could be calculated for each group (susceptible and not susceptible) to see whether cohesion is a significant predictor of some dynamic in one group, but not the other. A related approach would be to calculate connectedness values from a control group of “healthy” mice, and then to calculate cohesion values for mice subjected to different experimental treatments. Finally, if long enough time series datasets are available, it would be possible to calculate connectedness and cohesion for each mouse (but see next comment for caveats).

2. This is an excellent question that we did not have space to address in the manuscript. We had conducted a sensitivity analysis that varied the persistence cutoff variable, the time elapsed between samples, and used 2 different null model randomizations. We’ll get this up on bioRxiv soon! The time elapsed between samples does affect the outcome. Phytoplankton in Peter, Paul, and Tuesday lakes were sampled at (almost exactly) weekly intervals. We ran our cohesion analysis modeling Bray-Curtis dissimilarity between points that were 2 weeks apart, 4 weeks apart, and 6 weeks apart. The highest model R2 value occurred at 6 weeks. (Model R2 values declined at time intervals > 60 days.) There are multiple possible interpretations of this result. First, Bray-Curtis dissimilarities tended to increase with time; thus, BC values were greater and had larger variance at 6 weeks than at 2 weeks. Assuming that sampling error is constant, samples taken 6 weeks apart should have a greater signal to noise ratio than samples taken 2 weeks apart. Stated another way, you might expect 2 samples taken at the same time to have a Bray-Curtis dissimilarity of ~0.1 – 0.2 due to sampling and processing stochasticity. This variability in BC values will not be explained by our cohesion variable or environmental variables. If most BC values for samples separated by 2 weeks hover around 0.3 (which is the case for the Mendota samples), then the sampling error is a substantial portion of the BC values. However, for samples with greater dissimilarity, a larger amount of variance in the values would be due to real environmental or biotic drivers, rather than sampling error. Thus, our cohesion metrics would be expected to be better predictors of these community differences. Second, one inherent assumption of this method is that enough time passes between samples for biotic interactions to affect population growth rates. It is possible that the effect of biotic interactions is only noticeable over several generations, and so the effect of cohesion is more apparent at longer time scales.

As for your dataset, we would first ask what the between-replicate variability is, in comparison to the between-day variability. If there is much greater variability between days than between replicate samples, then the cohesion metrics would have greater potential application. However, another issue that arises in time series data with closely spaced samples is autocorrelation. What is the degree of autocorrelation in OTU abundances between days? Is there autocorrelation in the BC values, such that the BC value from Day N to Day N+1 is correlated with the BC value from Day N+1 and Day N+2? If so, this might be a problem. This is discussed in the CohesionScriptReadme uploaded as supplementary material. We had started analyzing a daily time series of human stool samples, but found autocorrelation to be a sufficiently large problem to put this question on hold. However, you might have different enough data and/or experimental design to make this question tractable!

3. We chose Bray-Curtis dissimilarity as a response variable for two reasons: 1) it is very common as a response variable in microbial ecology, with many researchers are interested in predicting it, and 2) we had an a priori hypothesis that it would be related to cohesion, based on the theoretical ecology literature. It is interesting that our manuscript could be interpreted as recapitulating BC dissimilarity, because that was not the intent. We would appreciate if you could point out places where our wording was unclear, so we could clarify this during the review process.

We also considered using ODE/difference equation models to simulate datasets with known interaction strengths. However, we saw several hurdles to this approach. These included: assuming constant interactions between taxa, uncertainty about how to include environmental perturbations, high computational requirements, questions about how to specify taxon growth rates (From a normal distribution? Related to interaction rates? If so, in what ratios?), and not being able to mimic population distributions seen in microbial datasets. As a side note, this point about mimicking population distributions is particularly interesting; it is very difficult to coax a model to change its mean-variance scaling for populations. Thus, we would have needed to do post-hoc manipulation of the simulation data to match some important features of the real datasets. We decided that the uncertainties associated with these models outweighed the benefits of having specified interaction strengths.

Finally, we had hypothesized that the model R2 value would be much lower in simulated datasets, where there were no biotic interactions between taxa. In this case, the cohesion metrics would be capturing correlations between taxa that were purely noise and/or driven by environmental factors. Hopefully we will have the chance to clarify the expected result during the review process.

4. We appreciate your attention to using appropriate statistical tools for non-normal data. There are two possible routes to satisfying your questions: 1) exploring the conditions under which parametric, non-parametric, and data transformation approaches work best or 2) showing that our workflow run on relative abundance data shows strong correspondence with connectedness/cohesion values calculated from absolute abundance data. The answer via route #1 varies by characteristics of each dataset (although we have been investigating this as a part of other data analysis projects). Briefly, we tried Spearman correlations, and they worked quite poorly and gave inconsistent results across datasets. Changing the correlation method to Spearman (in both absolute and relative connectedness calculations) could give negative relationships between absolute and relative connectedness values. This is very undesirable. Re-running the Lake Mendota analysis presented in the manuscript with Spearman correlations gave the result that cohesion was a poor predictor of Bray-Curtis dissimilarity, with a model R2 value of 0.10. For Lake Monona, the same analysis gave a model R2 of 0.04.

For highly skewed and zero-inflated response variables, a generalized linear model (glm) with quasipoisson distribution would be one common option. However, this only handles skewness and zeroes in the response variable, rather than both variables. We did not want to log-transform the relative abundances because this approach would have required adding pseudo-counts to transform the zeroes, which can have a significant effect on the results. We also did not want each user to have to optimize the pseudo-counts for their dataset. As a part of a different data analysis project, we had modeled OTU abundances as a function of environmental data using either glms or linear models. For large enough sample sizes, the results of the two analyses were qualitatively similar, suggesting that the problem of non-normal residuals was mitigated when enough data were used. Thus, we decided to use Pearson correlations, acknowledging that this approach requires more samples to generate robust correlations. Finally, we considered using either median values or mean values of correlations for calculating connectedness values. We chose to use arithmetic means for connectedness because mean values of non-normally distributed data converge upon a normal distribution, if enough values are used in the averaging (per the central limit theorem). In other words, there are enough pairwise correlations that the average pairwise correlation should be a reliable indicator and not prone to skewness. Connectedness values are not normally distributed because the underlying correlation distribution of each taxon is different. Conversely, we used median values for the null model because pairwise correlations between taxa in the null model were highly skewed. (This is an expected pattern. Similarly, you can show that the null distribution for model R2 values when there is no relationship between variables is a beta distribution, which is also right-skewed. A square root transform of this beta distribution – which is the distribution for the Pearson correlations -- will also be right-skewed.) However, for all this testing, using the median versus the mean often did not have a strong effect on analysis results. Using the median versus the mean could change connectedness values slightly (usually by shifting all connectedness values slightly higher or lower), but the most highly connected taxa were reliably identified when using either median or mean values for connectedness and null model calculations.

As for satisfying your question via route #2, we checked that our workflow for the relative abundance dataset generated connectedness values that were highly correlated (either Pearson or Spearman) to connectedness values from the absolute abundance dataset. We believe that the connectedness values from the absolute dataset represent an objective standard to compare against. We investigated whether correcting the correlations using the null model improved the correspondence between the two sets of connectedness values. The null model we chose showed improvement in correspondence in some datasets and no effect in others. (This was not surprising, as the effect of relativizing data depends strongly on the evenness of the samples. Pearson correlations are expected to be more biased when samples have lower richness and/or are more uneven.) Some null models showed no effect or made the correspondence between connectedness values worse. We also evaluated how the null model affected the model R2 values in our BC dissimilarity analyses, reasoning that an appropriate null model would give a cleaner signal of connectivity and improve the model’s explanatory power (and hence the model R2 value). Still, we realize that there are datasets that differ from those we had access to, and we are open to including other options in the workflow if users find that our metrics perform poorly on their data.

Again, we stress that the averaging step in calculating connectedness is very important in the workflow. Using mean correlations assumes only that highly interacting taxa have higher correlations on average. This avoids the common problem that individual pairwise correlations are unreliable indicators of interactions.

5. We agree that rarefying affects correlations between taxa and the Bray-Curtis dissimilarities between samples. However, we argue that rarefying a dataset introduces error to both correlations and Bray-Curtis dissimilarities.

For evaluating the effect of rarefying on pairwise correlations, we used another dataset from our lab that has replicated samples (the bog lake dataset, available online at the Earth Microbiome Project website). Replicates were obtained by taking two separate filters from the same homogenized water sample. We analyzed two versions of the dataset: one that was rarefied to 5000 sequences, and one that was not rarefied. We subset each dataset to only samples that were replicated. Then, we looked at Pearson correlations for the same OTU between sample replicates. Correlations of the same population between replicate samples should, in theory, be very high; variability in OTU abundances should only arise from stochastic differences in abundances of taxa captured on the two filters and from technical processing error. We found that rarefying the dataset led to lower correlations of the same OTU in replicate samples. This effect was strongly related to mean abundance; rarefying led to weaker correlations for less abundant taxa. In the Trout Bog epilimnion dataset rarefied to 5000 sequences (and with OTUs present in fewer than 5% of samples removed), the 100 most abundant OTUs had a mean correlation of 0.93. The 101st to 200th most abundant OTUs had a mean correlation of 0.74. The 201st to 300th most abundant OTUs had a mean correlation of 0.59. In the non-rarefied dataset (with OTUs present in fewer than 5% of samples removed), the 100 most abundant OTUs had a mean correlation of 0.97. The 101st to 200th most abundant OTUs had a mean correlation of 0.87. The 201st to 300th most abundant OTUs had a mean correlation of 0.78.

The negative effect of rarefying on correlation accuracy is also expected theoretically. Rarefying to 5000 sequences (with replacement) is equivalent to drawing 5000 samples from a multinomial distribution with probabilities defined by the original (non-rarefied) taxon proportions. The variance of a given OTU (say, OTU1) in repeated rarefied samples from the same original sample is n*p*(1 – p) where n is the number of total sequences subsampled and p is the proportion of OTU1 in the original non-rarefied sample. The mean, or expected, abundance of OTU1 in a rarefied sample is n*p. Thus, the added variability due to rarefying for any OTU is directly related to its original relative abundance. The denominator in the equation for calculating a Pearson correlation between two populations (say, OTU1 and OTU2) is the product of the standard deviations of the two populations. Rarefying increases the standard deviation of OTU distributions because each OTU abundance value is drawn from a multinomial distribution, which adds in further variability. Thus, this product of the two OTU standard deviations increases, leading to a greater denominator in the Pearson correlation calculation. You can investigate this with the following code, which gives the standard deviation of an original vector and the standard deviation of that vector after being “rarefied.” We draw from a binomial distribution because each variable sampled from a multinomial distribution will be binomially distributed.

#orig.prop is the original relative abundance of an OTU that is Poisson distributed and occurs in samples where the total number of individuals (or cells or sequences) is also Poisson distributed. This could easily be done with a normal or lognormal distribution instead.

orig.prop <- rpois(100, 20) / rpois(100, 2000)

sd(orig.prop)

seq.number <- 1000

rare.sample <- rbinom(100, size = seq.number, prob = orig.prop)

rare.prop <- rare.sample / seq.number

sd(rare.prop)

Meanwhile, the numerator in the Pearson correlation coefficient (the covariance between OTU1 and OTU2) is largely unaffected by rarefying. This is because 1) the expected relative abundance of each taxon remains the same between the original and rarefied datasets and 2) the population dynamics of OTU1 and OTU2 do not become any more synchronous due to rarefying.

Rarefying without replacement results in individual taxon abundances being hypergeometrically distributed. The same principles of increased taxon variability and a larger effect on less abundant taxa can be demonstrated with the hypergeometric distribution, as well. And, when many sequences are subsampled, the hypergeometric distribution converges to the binomial distribution.

These expected patterns in pairwise covariances and taxon standard deviations are easily investigated in empirical data. We calculated the pairwise covariances of taxa in the original, non-rarefied Mendota phytoplankton dataset from the Mendota dataset rarefied to the minimum observed cell density. The ratios of the pairwise covariances calculated from the original and rarefied datasets were approximately symmetrically distributed around 1, indicating no systematic bias in pairwise covariances from rarefying. Conversely, ratios of taxon standard deviations (sd of taxon1 in the original dataset / sd of taxon1 in the rarefied dataset) showed a skewed distribution toward larger standard deviations in the rarefied dataset. Thus, Pearson correlations decrease in rarefied datasets because the denominator in the equation is expected to increase, while the numerator is unchanged. The stochastic nature of rarefying can cause correlations to increase, but the overall trend is for correlations to weaken in rarefied datasets. Thus, knowing that rarefying decreases strong correlations between taxa and has disproportionate effects on less abundant taxa, we chose to not rarefy in our workflow.

This same issue of increased population variability when rarefying also comes up with Bray-Curtis dissimilarity. Rarefying is inherently stochastic, so two rarefied samples from the same original sample will have a BC dissimilarity > 0. This expected BC dissimilarity is largely dependent on the depth of rarefying (please see citation 1 below). Two rarefied samples can easily have a greater Bray-Curtis dissimilarity to each other than between the rarefied sample and the original sample. When BC dissimilarities are calculated on the relative abundances, a rare taxon appearing in a sample with millions of sequences will have a miniscule effect on BC dissimilarity. Thus, we believe that the variability introduced into BC dissimilarities as a result of rarefying outweighed the benefit of removing rare taxa from samples with greater cell counts. There are some analyses for which we would advocate for rarefying; such instances include comparing alpha diversity, analyses that assume constant sampling effort, and analyses using presence/absence data.

With enough samples, these issues are largely moot. We re-ran our Lake Mendota analysis after rarefying the phytoplankton dataset to the minimum observed cell concentration, and our final analysis modeling Bray-Curtis dissimilarity as a function of cohesion was nearly identical. The model R2 value was within 0.01 of the original model R2. Again, negative cohesion was still extremely significant, while positive cohesion was not significant. We repeated this analysis several times to account for the stochastic nature of rarefying, and results were consistent.

1: Gülay, Arda, and Barth F. Smets. "An improved method to set significance thresholds for β diversity testing in microbial community comparisons." Environmental microbiology 17.9 (2015): 3154-3167.

Trina says she would love to hear more of your thoughts about rarefying, in the context of the above.

6. This is the link to the phytoplankton counting protocols:

https://lter.limnology.wisc...

In the protocols described, phytoplankton samples were mounted permanently on microscope slides. Random fields were counted until a minimum of 400 “natural units” were encountered. However, the protocol states that their tiered counting method should result in counts of well over 400 natural units. Variability in cell density is also factored into the protocol; an additional criterion is that enough fields are counted such that the standard error of the mean number of units per field is below 10%. We would have liked to give more details about the cell counting methods, but we already slightly exceeded the maximum advised word count.

7. This is in the works! Currently, the simulation code, cohesion script, and the phytoplankton table used in the analysis are included as supplementary material.

We also realized that we will likely update the cohesion script after getting user feedback. We would appreciate any advice you have on using Github to obtain and incorporate suggestions.

8. Yes, the format is slightly unusual. We decided on this layout because we viewed the validation steps as separate analyses, rather than simply testing assumptions.