Denes Szucs, John PA Ioannidis

Review posted on 26th August 2016

Could you please clarify how exactly your effect-size-extraction algorithm used the text of any given manuscript to calculate a standard effect size (i.e. which parameters and values were put into which formula/s), particularly given the following arguments?

As you state in the Introduction section, power estimates and NHST are somewhat mutilated constructs that came out of Fisher's significance testing approach and the Neyman-Pearson theory. In most of the (current) neuroscience literature, significance is not merely assessed by the (raw) statistic, but instead a "combined height-and-size thresholding" procedure is applied.

This combined approach typically results in a "significant finding" if (and only if) a contiguous set of voxels (i.e. a cluster) in its entirety surpasses a "cluster-forming threshold" (uncorrected p-value, typically set between one-tailed p<0.01 and one-tailed p<0.001, including those boundary values) *AND* the cluster is at least of a certain size (typically measured in number of voxels). That minimum size is determined in one of several ways, including Gaussian Random Fields theory (e.g. in SPM), simulating noise data with a separately estimated smoothness (e.g. in AFNI's AlphaSim), or permutation testing (non-parametrically, e.g. in SnPM).
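To make the combined criterion concrete, here is a minimal one-dimensional sketch. It is a toy illustration only, not how SPM, AFNI or SnPM implement it (those operate on 3D volumes with their own connectivity rules). The function name `significant_clusters`, the parameter names and the t-values in `t_map` are all made up for this example.

```python
from itertools import groupby


def significant_clusters(t_values, cdt, min_size):
    """Return (start_index, length) for every contiguous run of values that
    all exceed the cluster-forming threshold `cdt` and that is at least
    `min_size` elements long (1D toy version of height-and-size thresholding).
    """
    clusters = []
    index = 0
    # groupby splits the sequence into consecutive runs that are either
    # entirely above or entirely not above the height threshold.
    for above, run in groupby(t_values, key=lambda t: t > cdt):
        length = len(list(run))
        if above and length >= min_size:
            clusters.append((index, length))
        index += length
    return clusters


# Hypothetical t-values along a line of "voxels" (illustrative numbers only).
t_map = [0.2, 3.9, 4.1, 3.6, 0.5, 3.8, 1.0, 3.7, 3.6, 3.9, 4.4, 0.1]

# Height threshold t > 3.505 and a toy extent threshold of 3 elements.
print(significant_clusters(t_map, cdt=3.505, min_size=3))
```

Note that the isolated value 3.8 passes the height threshold but is discarded because its run is too short; only the extent criterion decides which supra-threshold regions are reported.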

The relevant statistic to assess its "significance" (related to the likelihood of detection and power!), really, is then the p-value of a cluster (not its peak or average t-value). For example, if any given test (such as a two-sample t-test with group sizes N1=12 and N2=12, for a d.f. of 22) is performed (and a whole-brain search is reported, i.e. no spatial prior is used!), a typical manuscript would then contain a table of all clusters (and their peak coordinates and sizes) that reach this combined threshold. As an example, the authors may have chosen to apply a cluster-forming threshold (CDT) of one-tailed p<0.001 (t[d.f.=22] > 3.505). With an assumed isotropic voxel size of 3mm (27 cubic mm/voxel) and a smoothness estimate of 13.5mm, an application of the AlphaSim algorithm would lead to a required cluster size of approximately 73 voxels (i.e. if in a contiguous region of space spanning at least 73 3mm-cubed voxels inside the brain mask all voxels surpass a t-threshold of 3.505, this region would be considered "significantly different" between the groups).

Please then be aware that for this "activation difference", the p-value is still *just* 5% (i.e. the chance-level of finding such a cluster size given the t-threshold in noise conditions is 1 in 20). That being said, I think it is then equally fair to ACTUALLY compute the standardized effect size for such a cluster from a t-value of 1.717 (d.f.=22; one-tailed p<0.05).
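To illustrate how much the choice of t-value matters, the sketch below converts both t-values from my example into a standardized effect size. It assumes the common conversion for an independent two-sample t-test, d = t * sqrt(1/N1 + 1/N2); whether your algorithm uses this or another formula is exactly what I am asking about, so please read the function `cohens_d_from_t` as my assumption, not as a description of your method.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Standardized mean difference implied by an independent two-sample
    t-statistic with group sizes n1 and n2 (assumed conversion formula)."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Group sizes from the example above (N1 = N2 = 12, d.f. = 22).
n1, n2 = 12, 12

# t-value at the cluster-forming threshold (one-tailed p < 0.001).
d_cdt = cohens_d_from_t(3.505, n1, n2)

# t-value matching the cluster-level p-value (one-tailed p < 0.05).
d_cluster = cohens_d_from_t(1.717, n1, n2)

print(f"d from CDT t-value:           {d_cdt:.2f}")
print(f"d from cluster-level t-value: {d_cluster:.2f}")
```

Under this assumption the voxel-level threshold alone implies d of about 1.43, whereas the t-value corresponding to the cluster-level p-value implies d of about 0.70, i.e. roughly half.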

In case your algorithm uses the (maximum or mean) t-value reported for the cluster, it is only natural that the standardized effect size distribution MUST be skewed towards larger values than those seen in traditional psychological research (given the massive application of family-wise-error correction procedures due to the mass-univariate testing approach, still dominant in the field).

As a summary prescription and recommendation for the literature/field, I would consider urging authors of future publications to always report cluster-level p-value estimates (which can be assessed in both GRF and simulation methods by comparing the observed cluster size of any cluster reaching significance with the distribution of cluster sizes under the NULL), such that effect sizes, at least when it comes to using them for the purpose of power estimates, can be computed appropriately.
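For the simulation-based route, such a cluster-level p-value could be estimated along the following lines. This is a minimal sketch under my own assumptions: `null_max_sizes` stands for the maximum cluster size recorded in each noise simulation (or permutation) at the same cluster-forming threshold, and the numbers below are placeholders, not real simulation output. The function name `cluster_p_value` is hypothetical.

```python
def cluster_p_value(observed_size, null_max_sizes):
    """Proportion of null simulations whose largest cluster is at least as
    large as the observed cluster (empirical cluster-level p-value)."""
    if not null_max_sizes:
        raise ValueError("need at least one null simulation")
    exceed = sum(1 for size in null_max_sizes if size >= observed_size)
    return exceed / len(null_max_sizes)


# Placeholder null distribution of maximum cluster sizes (in voxels).
# A real analysis would use thousands of simulations, not ten.
null_max_sizes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(cluster_p_value(73, null_max_sizes))
```

Some permutation frameworks add one to both numerator and denominator so that the estimate can never be exactly zero; the plain proportion is used here only to keep the idea visible.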

Thank you so much for your consideration and efforts!