Michael V Lombardo, Meng-Chuan Lai, Bonnie Auyeung, Rosemary Holt, Carrie Allison, Paula Smith, Bhismadev Chakrabarti, Amber Ruigrok, John Suckling, Edward Bullmore, MRC AIMS Consortium, Christine Ecker, Michael Craig, Declan Murphy, Francesca Happé, Simon Baron-Cohen
Thanks for sharing this. I'm really glad that autism researchers are starting to (a) look at cognitive heterogeneity; and (b) use preprint servers!
First, this is a bugbear of mine, but I find it quite unhelpful to talk about the RMET as a measure of mentalizing. In truth, it's a (relatively difficult) 4AFC test of emotion recognition. We can argue about whether learning the meanings of certain emotion words used in the test is contingent on having a fully functioning "theory of mind". But it's clear that the RMET is measuring something very different to other "mentalizing" tests in which the participant infers mental states based on the protagonist's behaviour and/or events that are witnessed or described.
Second, I agree that there's potentially useful information at the item level that is lost by just totting up the number of correct items. But it's not clear to me that your study is demonstrating this to be true. In other words, what does subdividing the ASD group into "impaired" and "unimpaired" subgroups based on the clustering algorithm tell us that we wouldn't get by subdividing them according to some cut-off based on raw score? We learn that the "unimpaired" group have higher overall scores and higher VIQs, but we kind of know that already.
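To make the question concrete, here is a minimal sketch of the comparison I have in mind: take the cluster-derived labels, find the single raw-score cut-off that best reproduces them, and report the agreement. The totals and labels below are invented purely for illustration (they are not data from the preprint), and the function name `best_raw_score_cutoff` is hypothetical.

```python
def best_raw_score_cutoff(totals, cluster_labels):
    """Find the raw-score cut-off that best reproduces the cluster labels.

    A participant is called "unimpaired" (1) if their total is >= cutoff,
    otherwise "impaired" (0). Returns (cutoff, proportion_agreement).
    """
    best_cutoff, best_agreement = None, -1.0
    for cutoff in sorted(set(totals)):
        predicted = [1 if t >= cutoff else 0 for t in totals]
        matches = sum(p == c for p, c in zip(predicted, cluster_labels))
        agreement = matches / len(totals)
        if agreement > best_agreement:  # strict: keep the lowest cut-off on ties
            best_cutoff, best_agreement = cutoff, agreement
    return best_cutoff, best_agreement


# Hypothetical data: total correct per participant and the subgroup
# assigned by clustering (1 = "unimpaired", 0 = "impaired").
totals = [12, 15, 17, 19, 22, 24, 26, 28, 30, 33]
cluster_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

cutoff, agreement = best_raw_score_cutoff(totals, cluster_labels)
print(f"best cut-off: {cutoff}, agreement with clusters: {agreement:.0%}")
```

If agreement on the real data is close to 100%, the clustering adds little beyond the total score; the interesting cases are the participants on whom the two partitions disagree.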
Third, related to the previous point, you show that a classifier trained on your subgroups in one dataset does a good job of predicting subgroup in an independent dataset; but how much of this "replicability" is driven by differences in overall performance? It would be helpful to get some more explicit details of what went into the classifier, but I assume that it's essentially providing a threshold on a weighted sum of all the items in the test. You've already shown that your subgroups (on which the classifier is trained) differ in overall performance (i.e. the unweighted sum of all the items). So it would be pretty odd if the classifier *didn't* perform well in a replication sample where subgroups also differed in overall performance. Indeed, in the TD group, where there aren't huge differences in overall performance, the classifier doesn't translate to the replication sample.
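A toy simulation illustrates the worry. This is a sketch under stated assumptions, not a reconstruction of your classifier: I assume 36 items, two subgroups that differ only in overall probability of a correct response (0.8 versus 0.5, identical across items, so there is no item-level pattern at all), and 200 simulated participants per subgroup. All names and numbers are hypothetical. A cut-off on the unweighted total, fitted in a "discovery" sample, is then applied to an independent "replication" sample.

```python
import random

N_ITEMS = 36  # assumption: number of test items


def simulate(n, p_correct, rng):
    """Simulate n participants; every item has the same success probability."""
    return [[1 if rng.random() < p_correct else 0 for _ in range(N_ITEMS)]
            for _ in range(n)]


def make_dataset(rng, n_per_group=200):
    """Two subgroups differing ONLY in overall ability (assumed 0.8 vs 0.5)."""
    responses = simulate(n_per_group, 0.8, rng) + simulate(n_per_group, 0.5, rng)
    labels = [1] * n_per_group + [0] * n_per_group  # 1 = "unimpaired"
    return responses, labels


def fit_cutoff(responses, labels):
    """Place the cut-off midway between the two subgroups' mean totals."""
    totals = [sum(r) for r in responses]
    hi = [t for t, l in zip(totals, labels) if l == 1]
    lo = [t for t, l in zip(totals, labels) if l == 0]
    return (sum(hi) / len(hi) + sum(lo) / len(lo)) / 2


def accuracy(responses, labels, cutoff):
    """Proportion of participants whose total falls on the correct side."""
    predicted = [1 if sum(r) > cutoff else 0 for r in responses]
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
discovery = make_dataset(rng)
replication = make_dataset(rng)

cutoff = fit_cutoff(*discovery)
acc = accuracy(*replication, cutoff)
print(f"cut-off on raw total: {cutoff:.1f}, replication accuracy: {acc:.1%}")
```

Under these assumptions the raw-score cut-off transfers to the replication sample with high accuracy even though the items carry no subgroup-specific information. The useful control would therefore be to show that your classifier beats this kind of total-score baseline.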
Hopefully my comment will help you clarify the article. I really like the approach of digging into the item-level data. At the very least I think it tells us something useful about the structure of the RMET - and which items are discriminating well between people who do versus do not have difficulties with labelling complex emotions. I'm just not convinced (yet) of some of the bolder claims you're making!
Finally, some references you may find useful:
Roach, N. W., Edwards, V. T., & Hogben, J. H. (2004). The tale is in the tail: An alternative hypothesis for psychophysical performance variability in dyslexia. Perception, 33(7), 817-830.
Towgood, K. J., Meuwese, J. D., Gilbert, S. J., Turner, M. S., & Burgess, P. W. (2009). Advantages of the multiple case series approach to the study of cognitive deficits in autism spectrum disorder. Neuropsychologia, 47(13), 2981-2988.
Brock, J. (2011). Commentary: Complementary approaches to the developmental cognitive neuroscience of autism – reflections on Pelphrey et al. (2011). Journal of Child Psychology and Psychiatry, 52(6), 645-646.