Completed on 6 Apr 2018 by Matthew Stephens .

Login to endorse this review.

The main contribution of this paper

is to introduce an intuitive, fast, and relatively simple algorithm

for fitting the "Structure" (or "Admixture") model from population genetics.

I found the paper very interesting. I also congratulate the authors

on the R package, which is generally well documented and

I was able to get running easily.

However, I also found aspects of the presentation rather confusing

(starting with the title!)

and the comparisons need some work to make this publishable.

Major Comments:

1. Presentation:

The paper calls this a "nonparametric" estimator of population structure,

that "unifies" admixture models (AMs) and PCA. I think this misses the mark on

many fronts. Effectively the algorithms described here fit the AM,

using a least squares formulation rather than the usual binomial likelihood.

For example, note that the parameters P and Q being estimated here have the same interpretation,

and satisfy the same constraints, as the P and Q parameters in the AM.

There is nothing new on unification with PCA here - the unifying view

that both PCA and AM can be viewed as matrix factorization

problems has already been pointed out by previous authors (eg Engelhardt and Stephens,

cited here). The authors should remove the expository material related to the "unification"

(none of which is key to the paper), and just cite previous work as needed.

This will provide them the chance to really emphasize what I think

is the true (and important) contribution

here, which is a novel (and fast, and simple) algorithm for fitting the

AM by least squares.

(I understand that, computationally,

PCA plays a role in the actual algorithm presented. But its use here is

simply a computational device, as a step toward fitting the AM.

Claiming that this unifies the two is just confusing.)

2. Comparisons

Given the above comments, it is natural to ask how the algorithm here

compares with previous algorithms. And indeed, the current Results section

focusses on exactly this question, in terms of both estimates of the parameters

and in terms of speed. But there is one part missing: an assessment of

how the algorithms compare in the value of the objective functions achieved.

This is complicated slightly by there being 2 different objective functions:

one based on the Frobenius norm ||X-2PQ||_2 and another based on the

binomial likelihood. But this is easily addressed by examining

both objective functions.

It would be particularly interesting to see whether the algorithm here can actually

achieve a better binomial likelihood value than the approaches that are

directly based on optimizing the binomial likelihood. But in any case,

one would hope that the algorithm here will be better at least in Frobenius

norm.

3. The final algorithm is a two stage algorithm, that first does a dimension

reduction (yielding Fhat) and then uses a "projected-ALS"

algorithm to minimize ||Fhat - PQ||_2 subject to

constraints on P and Q.

My comment here is that the paper spends a lot of space and statistical

discussion on the first step, obtaining Fhat,

and yet the need for it is far from obvious.

Why is this better than directly applying the projected ALS algorithm to

minimize ||X-2PQ||_2 ? (which already provides a dimension reduction in itself)

I did some very preliminary numerical simulations

using the authors software, which you can see at

https://stephens999.github.io/misc/ALStruct.html

In these limited investigations I found the two

approaches gave similar values for ||X-2PQ||_2, but

the dimension reduction step seemed to improve speed quite a bit in one simulation.

It was not clear to me if that is expected generally, and from the paper

it does not come across as the main motivation for the dimension

reduction.

I think the authors need to be clear about why the need

for the dimension reduction - if it is primarily computational

speed then a lot of the discussion given seems unnecessary,

and needs replacing with a demonstration that indeed the speed is improved.

If it is to improve estimation accuracy compared with directly

minimizing ||X-2PQ||_2 then the improvement in accuracy should be

demonstrated numerically and some additional explanation for

why it improves accuracy given.

Other comments:

-p5: I suggest presenting the admixture model as being based on E(X) = 2PQ,

This leads naturally to a least-squares formulation

which, I believe, is what your algorithm essentially targets.

-p6: the description here needs rewriting. You do not

assume Fhat = PQ, because you use a least squares formulation (15)

that does not impose this assumption.

And describing (5) as a constraint is unhelpful because you

do not know Q so it cannot be applied! If anything, (5) is an assumption

rather than a constraint. But I'm not sure you even need it.

My suggestion is to rephrase your approach. Motivate it by introducing

the following problem: minimize ||X-2PQ|| subject to the

constraints (2)-(4). Then outline how this leads to your 2-step algorithm

which first replaces X with Fhat, and then minimizes ||Fhat-2PQ||.

-p10: what is variance(f_{ij})? Isn't f a constant in your treatment,

not a random variable? I'm confused.

-p13: you should make it clear that convergence to a stationary point

is far from a guarantee that it converges to a good solution -

the surface can be quite multi-modal and badly behaved.

- Fig 7: Focussing on just the first 2 rows of Q seems dangerous.

Why not just show the usual structure barplots here, which summarize

all of Q.

-p24: The sentences "As such... constraints" and "It is a simple, but central..."

seem like they relate to a different paper! The replacement of

the equality constraint (13) with a least squares criterion (15) makes both

these statements false, surely?

Details:

- p4 "not supported..." there are probabilistic interpretations of PCA

(eg Tipping and Bishop) that seem to contradict this.

- p5: I find introducing F as the "individual-specific allele frequencies"

confusing. I suggest first introducing F as the binomial parameter,

and then - if you need to -

saying that F can be thought of as the "individual-specific allele frequencies"

(although I'm not sure it is that helpful?)

-p5: the footnote appears false; PSD includes more general

non-uniform priors on both P and Q (eg see their equation 7)

- p6: "frequentist likelihood-based" is confusing. If you mean maximum

likelihood, just say maximum likelihood (which is not an especially

frequentist procedure.) By the same token "in the frequentist setting"

can be removed.

-p12: "In the case of anchor SNPs" - here you should rephrase to make

clear that you are defining what an anchor SNP is.

-p14: "Unconstrained ALS" - this is not unconstrained! How about "Projected ALS"?

Software:

In the function alstructure you have parameters P_init and Q_init that are not used.

It says "Only available for cALS method." but this is not available as a method

in this function? (I think it would be useful to be able to pass in initial values like this)