Irineo Cabreros, John D. Storey

Review posted on 06th April 2018

The main contribution of this paper

is to introduce an intuitive, fast, and relatively simple algorithm

for fitting the "Structure" (or "Admixture") model from population genetics.

I found the paper very interesting. I also congratulate the authors

on the R package, which is generally well documented and

I was able to get running easily.

However, I also found aspects of the presentation rather confusing

(starting with the title!)

and the comparisons need some work to make this publishable.

Major Comments:

1. Presentation:

The paper calls this a "nonparametric" estimator of population structure,

that "unifies" admixture models (AMs) and PCA. I think this misses the mark on

many fronts. Effectively the algorithms described here fit the AM,

using a least squares formulation rather than the usual binomial likelihood.

For example, note that the parameters P and Q being estimated here have the same interpretation,

and satisfy the same constraints, as the P and Q parameters in the AM.

There is nothing new on unification with PCA here - the unifying view

that both PCA and AM can be viewed as matrix factorization

problems has already been pointed out by previous authors (eg Engelhardt and Stephens,

cited here). The authors should remove the expository material related to the "unification"

(none of which is key to the paper), and just cite previous work as needed.

This will provide them the chance to really emphasize what I think

is the true (and important) contribution

here, which is a novel (and fast, and simple) algorithm for fitting the

AM by least squares.

(I understand that, computationally,

PCA plays a role in the actual algorithm presented. But its use here is

simply a computational device, as a step toward fitting the AM.

Claiming that this unifies the two is just confusing.)

2. Comparisons

Given the above comments, it is natural to ask how the algorithm here

compares with previous algorithms. And indeed, the current Results section

focusses on exactly this question, in terms of both estimates of the parameters

and in terms of speed. But there is one part missing: an assessment of

how the algorithms compare in the value of the objective functions achieved.

This is complicated slightly by there being 2 different objective functions:

one based on the Frobenius norm ||X-2PQ||_2 and another based on the

binomial likelihood. But this is easily addressed by examining

both objective functions.

It would be particularly interesting to see whether the algorithm here can actually

achieve a better binomial likelihood value than the approaches that are

directly based on optimizing the binomial likelihood. But in any case,

one would hope that the algorithm here will be better at least in Frobenius

norm.

3. The final algorithm is a two stage algorithm, that first does a dimension

reduction (yielding Fhat) and then uses a "projected-ALS"

algorithm to minimize ||Fhat - PQ||_2 subject to

constraints on P and Q.

My comment here is that the paper spends a lot of space and statistical

discussion on the first step, obtaining Fhat,

and yet the need for it is far from obvious.

Why is this better than directly applying the projected ALS algorithm to

minimize ||X-2PQ||_2 ? (which already provides a dimension reduction in itself)

I did some very preliminary numerical simulations

using the authors software, which you can see at

https://stephens999.github.io/misc/ALStruct.html

In these limited investigations I found the two

approaches gave similar values for ||X-2PQ||_2, but

the dimension reduction step seemed to improve speed quite a bit in one simulation.

It was not clear to me if that is expected generally, and from the paper

it does not come across as the main motivation for the dimension

reduction.

I think the authors need to be clear about why the need

for the dimension reduction - if it is primarily computational

speed then a lot of the discussion given seems unnecessary,

and needs replacing with a demonstration that indeed the speed is improved.

If it is to improve estimation accuracy compared with directly

minimizing ||X-2PQ||_2 then the improvement in accuracy should be

demonstrated numerically and some additional explanation for

why it improves accuracy given.

Other comments:

-p5: I suggest presenting the admixture model as being based on E(X) = 2PQ,

This leads naturally to a least-squares formulation

which, I believe, is what your algorithm essentially targets.

-p6: the description here needs rewriting. You do not

assume Fhat = PQ, because you use a least squares formulation (15)

that does not impose this assumption.

And describing (5) as a constraint is unhelpful because you

do not know Q so it cannot be applied! If anything, (5) is an assumption

rather than a constraint. But I'm not sure you even need it.

My suggestion is to rephrase your approach. Motivate it by introducing

the following problem: minimize ||X-2PQ|| subject to the

constraints (2)-(4). Then outline how this leads to your 2-step algorithm

which first replaces X with Fhat, and then minimizes ||Fhat-2PQ||.

-p10: what is variance(f_{ij})? Isn't f a constant in your treatment,

not a random variable? I'm confused.

-p13: you should make it clear that convergence to a stationary point

is far from a guarantee that it converges to a good solution -

the surface can be quite multi-modal and badly behaved.

- Fig 7: Focussing on just the first 2 rows of Q seems dangerous.

Why not just show the usual structure barplots here, which summarize

all of Q.

-p24: The sentences "As such... constraints" and "It is a simple, but central..."

seem like they relate to a different paper! The replacement of

the equality constraint (13) with a least squares criterion (15) makes both

these statements false, surely?

Details:

- p4 "not supported..." there are probabilistic interpretations of PCA

(eg Tipping and Bishop) that seem to contradict this.

- p5: I find introducing F as the "individual-specific allele frequencies"

confusing. I suggest first introducing F as the binomial parameter,

and then - if you need to -

saying that F can be thought of as the "individual-specific allele frequencies"

(although I'm not sure it is that helpful?)

-p5: the footnote appears false; PSD includes more general

non-uniform priors on both P and Q (eg see their equation 7)

- p6: "frequentist likelihood-based" is confusing. If you mean maximum

likelihood, just say maximum likelihood (which is not an especially

frequentist procedure.) By the same token "in the frequentist setting"

can be removed.

-p12: "In the case of anchor SNPs" - here you should rephrase to make

clear that you are defining what an anchor SNP is.

-p14: "Unconstrained ALS" - this is not unconstrained! How about "Projected ALS"?

Software:

In the function alstructure you have parameters P_init and Q_init that are not used.

It says "Only available for cALS method." but this is not available as a method

in this function? (I think it would be useful to be able to pass in initial values like this)

Maxwell Wing Libbrecht , Oscar Rodriguez , Zhiping Weng , Michael Hoffman , Jeffrey Bilmes and William Stafford Noble

Review posted on 16th December 2016

This paper presents a new strategy for annotating the human genome into different “chromatin states” (quiescent, promoter, enhancer etc), and visualizing the results. The new strategy involves first applying a segmentation algorithm, here the Segway algorithm, separately to each available cell type. The Segway algorithm uses molecular data on a cell type (eg histone marks, DNase hypersensitivity) to segment the genome into what might be called ``Segway states”, essentially by clustering together regions that show similar patterns in the molecular data. The paper then uses a random forest classifier to map the Segway states in each cell type to a set of canonical annotations (quiescent, promoter, enhancer etc), based on how well the data in each Segway state match these canonical annotations from previous human-guided annotations. The result is that each cell type produces an annotation of the genome that uses the same set of canonical annotations, which facilitates pooling the results across all cell types.

A key innovation is to automate the mapping from Segway states to canonical annotations, using a random forest. In previous applications of these kinds of method this mapping step has been done by human interpreters. The automation of this step greatly facilitates analysis of many cell types (here 164). Another innovation is the introduction of what the paper calls a “functionality score” to visually weight some of the canonical annotations more heavily than others, depending on how evolutionarily conserved those annotations tend to be. This results in visual upweighting of promoter-annotated regions compared with other annotations for example.

The automated interpretation strategy is simple and potentially useful. And the production of large scale functional genomic annotations, and ways to visualize them, could certainly be a useful resource for the community. However, the utility of the resource is inevitably dependent on the quality of the final annotations, and the main limitation of this work is that it provides very little data to support or validate the quality of the annotations. The enrichment of GWAS hits within regions with high functionality score is the only objective assessment provided, and this assessment is fairly cursory.

To be fair, objective assessment of annotations is not easy. Nonetheless, without this it is hard to judge how useful the annotations are. The GWAS comparisons could usefully be expanded. For example, comparisons with other annotations would help: GWAS hits are also enriched in high DNase regions and near genes; do these annotations have a higher or lower enrichment than functionality score? Can you support your claim that “the functionality score (and the Segway encyclopedia) is the most effective tool for understanding what the function of a known important variant (such as a disease allele).”, by giving an example? Perhaps data on eQTLs could also help support the usefulness of annotations - for example, in cases where annotations differ among cell types, are these predictive of differences in eQTL behaviour? Or perhaps it could be demonstrated that differences in annotations among cell types are predictive of expression. In Figure 4a some cell types have the gene body annotated as “transcribed” whereas others have it annotated as “quiescent”. Is this reflected in differences in expression among cell types? It would greatly improve the impact of the work if the quality and utility of the annotations were better supported.

Detailed comments:

- The various ways that the paper switches between terminologies like “labels”, “states”, “annotations” etc is quite confusing. It would greatly help presentation to be consistent about the terms used. For example, to use “states” to refer to the output of Segway, and “annotations” to refer to the canonical annotations (promoter, enhancer, transcribed etc).

- The paper introduces modifications to the existing Segway algorithm, which the authors presumably believe to be improvements, but no data is provided to support this. For example, almost no motivation is provided for switching to a mixture of Gaussians from a single Gaussian, except that the single Gaussian “artifactually divided” weak transcription vs. transcription. Can you demonstrate this? Is it really an artifact? It would seem to us that there will in practice be a quantitative continuum of transcription states and not simply transcribed/not-transcribed. The mini-batch idea makes sense, but again it would seem helpful to demonstrate the benefit (with only 25 iterations it seems that you will use at most 25% of the data; does more iterations help?).

The definition of “Encyclopedia segment” seems extremely subjective (the parameters were chosen so that “resulting segments matched our intuition about the size and frequency of functional regulatory elements”), and consequently of questionable utility.

- While annotating each cell type separately is certainly convenient, it also has some disadvantages. It will result in less accurate annotations than a joint approach when the annotations are actually consistent across cell types. And any errors in annotations will tend to artificially inflate differences among cell types, which users may be tempted to overinterpret. Some discussion of these issues, and the difficulty of assessing significance and reliability of differences, seems in order.

- The claim that “the functionality score (and the Segway encyclopedia) is the most effective tool for understanding the function of a known important variant (such as a disease allele).” needs support.

- The internet Encyclopedia interface did not work for us: the first example we tried was chr10:73576055-73611126 and this gave an error “Error: Traceback (most recent call last): File "functionality_score_plot.py", line 108, in ann_labels_frames = ann_labels_mat.reshape(num_frames,resolution,len(ann_celltypes))[:,0,:].astype("int8") ValueError: total size of new array must be unchanged”. Did we do something wrong or is this a bug?

- The Encyclopedia interface would also benefit from being more flexible. For example allowing to search by gene name, or putting in positions in one string (chr10:73576055-73611126) rather than three separate boxes.

Minor comments:

- The schematic pipeline in Figure 1 is hard to follow. What does "compare to known phenomena" mean and what are "label features" here? Why do you use Exon as an example annotation when that is not one of the annotations you consider? (And why did you decide not to annotate exons as exons? That would seem like a useful annotation.)

- The definitions of LowConfidence vs. Unclassified and the distinction is unclear.

- The description of the categories in section 2.3 is a quite long and largely restates previous findings.

- It seems that the annotations chosen are not mutually exclusive. For example, the first exon is highly enriched with promoter, enhancer, and bivalent categories (figure 3a), but is also presumably transcribed.

- The classifier used 17 features, but why we have only 16 features In figure 2b?

- How are the columns in figure 3b ordered?

- Excluding RNA-seq data because you are interested in transcriptional regulation seems like a weird justification (p3).

- P4, it is unclear where the number 294 came from here.

- How was panel a of Figure 3 computed?

- If the functionality score is the 75th percentile of a distribution, then it should just be a number. How does it itself have a distribution (Figure 3d)? Is this the distribution across the cell types? Please clarify.

- “GWAS cannot disentangle genetic linkage” - this seems misleading (by the classical definition of linkage) because LD actually decays much faster than linkage, and indeed its increase resolution is one of the advantages of GWAS over genetic linkage analysis.

This review is signed:

Matthew Stephens

Kevin (Kaixuan) Luo