Review for "Science with no fiction: measuring the veracity of scientific reports by citation analysis"

Completed on 11 Sep 2017 by Naomi Penfold. Sourced from http://www.biorxiv.org/content/early/2017/08/09/172940.



Comments to author

The author response follows the review below, with the reviewer's comments quoted in each reply.

The idea of classifying hypotheses as supported or refuted by ongoing works, as a means to identify "strongly supported" or "strongly refuted" claims is an interesting one. I would like to see further discussion of how this could be applied.

Namely, it seems the R-factor is something that should be applied to a specific scientific claim, as opposed to a whole research article. Being able to quickly identify the evidence that supports (green), refutes (red) or relates unclearly (yellow) to a claim, directly from the claim in said literature, could aid comprehension (not to mention discoverability) of the surrounding literature, and highlight claims that are well-supported or lacking in independent replications. Do the authors feel that one paper is sufficiently related to one central claim for application of the R-factor to the paper? Alternatively, I would argue that judging the "veracity" of component evidence presented within an article could be more informative.

Further, limiting these data to the "cited by" literature from that paper could skew the perspective, depending on which article you are viewing the claim in - to understand the overall "veracity" of a claim, it seems the reader would need to navigate back to the first mention of that claim in order to find the longest chain of evidence. Instead, I would be interested to explore the feasibility of a claim-centric (as opposed to paper-centric) count, and to understand whether this is already achieved by existing practices (such as meta-analyses of the literature). Perhaps an alternative approach would be to ensure that meta-analyses that include an article are more clearly visible from that article (e.g. highlighted in a "cited-by" section), and an extension to that would be to link that more recent work to the specific assertions that it relates to in the current article.

I would also be interested in whether the authors have any thoughts on the reporting bias towards positive results (it may be hard to judge replicability if failed replications remain in desk drawers), as well as on more nuanced evaluations of related evidence: is some evidence stronger than others? Is it feasible to define a scientific claim, or is it dependent on context/species/other factors?

Finally, I would be concerned about applying such a metric to individual researchers. An examination of unintended consequences for such a metric would be useful to discuss.

Competing interests: None identified.
Full disclosure: I am Innovation Officer at eLife. My academic background is not in meta-research of reproducibility.



Dear Naomi,

Thank you very much for your insightful comment and for your interest in the R-factor.

Please let us reply point by point.

“The idea of classifying hypotheses as supported or refuted by ongoing works, as a means to identify "strongly supported" or "strongly refuted" claims is an interesting one. I would like to see further discussion of how this could be applied.”

Thank you for considering our idea interesting! We will be happy to discuss it.

We would prefer to avoid qualifiers such as “strongly”, because what is strong evidence to one scientist can be nonsense to another, as many a scientific discussion or set of opposing reviews will testify.

“Namely, it seems the R-factor is something that should be applied to a specific scientific claim, as opposed to a whole research article. Being able to quickly identify the evidence that supports (green), refutes (red) or relates unclearly (yellow) to a claim, directly from the claim in said literature, could aid comprehension (not to mention discoverability) of the surrounding literature, and highlight claims that are well-supported or lacking in independent replications. Do the authors feel that one paper is sufficiently related to one central claim for application of the R-factor to the paper? Alternatively, I would argue that judging the "veracity" of component evidence presented within an article could be more informative.”

We agree completely and tried to emphasize the focus on a claim as the unit of evaluation in our preprint. We will update the preprint to articulate this focus explicitly. Whether an article contains one claim or several depends on the report; in the latter case, applying the R-factor to each claim would be reasonable.

In the examples mentioned in our preprint and in our current research, we deal with the main claims because these are commonly articulated by the authors in the titles of their articles. This choice minimizes the possibility of misunderstanding what the authors concluded and facilitates automating the identification of what an article claims.
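For illustration only, below is a minimal sketch (in Python) of how such a claim-level tally could be computed. The labels, the handling of inconclusive reports, and the exact ratio are assumptions made for this sketch, not the definition given in our preprint.

# Illustrative sketch only: tally the citing reports that tested a claim
# and compute a simple confirmation ratio. Label names and the formula
# are assumptions for this example, not the preprint's definition.
from collections import Counter

def r_factor(citing_labels):
    """One label per citing report that addressed the claim:
    'confirming', 'refuting', or 'inconclusive'."""
    counts = Counter(citing_labels)
    tested = counts["confirming"] + counts["refuting"]
    if tested == 0:
        return None  # the claim has not yet been tested by the citing literature
    # How 'inconclusive' reports enter the ratio is a design choice
    # deliberately left out of this sketch.
    return counts["confirming"] / tested

# Example: a claim addressed by eight citing reports
labels = ["confirming"] * 5 + ["refuting"] * 2 + ["inconclusive"]
print(r_factor(labels))  # 5 / 7 ≈ 0.71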

“Further, limiting these data to the "cited by" literature from that paper could skew the perspective, depending on which article you are viewing the claim in - to understand the overall "veracity" of a claim, it seems the reader would need to navigate back to the first mention of that claim in order to find the longest chain of evidence. Instead, I would be interested to explore the feasibility of a claim-centric (as opposed to paper-centric) count, and to understand whether this is already achieved by existing practices (such as meta-analyses of the literature). Perhaps an alternative approach would be to ensure that meta-analyses that include an article are more clearly visible from that article (e.g. highlighted in a "cited-by" section), and an extension to that would be to link that more recent work to the specific assertions that it relates to in the current article.”

The point about the “trees” of evidence for a claim is indeed excellent! We envision that these trees will be extractable from the R-factor resource and will be one of its most powerful features, enabling the user to quickly grasp the history of a claim and thus the novelty, or lack thereof, of the articles referring to it. We are beginning to build a prototype of the “tree viewer”, which we call the Linker: http://bit.do/mock_up (you can zoom in and out, click on the links and nodes, and move around the graph). We keep in mind the century-long history of ignoring the claim that ulcer disease is caused by a bacterium as an example of how a timely reconstruction of the “trees” of evidence could help accelerate discovery.
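As a rough illustration of how such a tree could be represented as data, each claim can be a node whose children are the citing reports that tested it, each tagged with its outcome. The field names and placeholder identifiers below are assumptions for this sketch, not the Linker's actual schema.

# Illustrative sketch only: a toy evidence "tree" for a claim, with
# placeholder identifiers. Not the Linker's actual data model.
evidence_tree = {
    "claim": "Main claim articulated in the title of the original report",
    "reports": [
        {"id": "report-A", "outcome": "confirming"},
        {"id": "report-B", "outcome": "refuting"},
        {
            "id": "report-C",
            "outcome": "confirming",
            # later work that in turn tested the claim of report-C
            "reports": [{"id": "report-D", "outcome": "confirming"}],
        },
    ],
}

def walk(node, depth=0):
    # Print the chain of evidence, indenting one level per citation step.
    for report in node.get("reports", []):
        print("  " * depth + report["id"] + ": " + report["outcome"])
        walk(report, depth + 1)

walk(evidence_tree)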

“I would also be interested in whether the authors have any thoughts on the reporting bias towards positive results (it may be hard to judge replicability if failed replications remain in desk drawers), as well as on more nuanced evaluations of related evidence: is some evidence stronger than others? Is it feasible to define a scientific claim, or is it dependent on context/species/other factors?”

Indeed, the R-factor can only reflect what scientists have published, which means that the results now sitting in desk drawers would not be considered. However, we anticipate that the use of the R-factor and the increasing popularity of preprints can change this. Currently, “negative” results stay in the drawers because the value of reporting them is uncertain while the effort of reporting them is substantial. We think that the opportunity to affect the R-factor of a praised paper that everyone in the field knows is wrong, combined with the ease of reporting the results through a preprint service like bioRxiv, would help keep the drawers empty.

We would like to emphasize that the R-factor of a claim does not measure the replicability of the study that reported it, but whether the claim has been confirmed. For example, testing a claim in a different experimental system, which is a common practice, is by definition not a replication; the result of such testing would be missed by replication approaches but would be included in the R-factor. Likewise, a replication study can test whether a reported result is reproducible, but not whether it is misinterpreted. The R-factor evaluates the chance that a scientific claim, which is an interpretation of the results, is correct, irrespective of whether the claim is based on valid results, a guess, or a misunderstanding, all of which have their role in science.

“Finally, I would be concerned about applying such a metric to individual researchers. An examination of unintended consequences for such a metric would be useful to discuss.”

We welcome this discussion, but the R-factor of researchers will, by extension, be derived from the R-factors of their reports. We do not see how this extension can be blocked and question whether it should be. We would suggest that an open and transparent score could be better than a reputation based on the grapevine, membership in old boys' clubs, or unqualified praise in the media. We envision that once the dust settles, people will see in numbers what they already know intuitively: that no one is perfect in their scientific judgment and that there are outliers on both sides of the distribution. We would also like to emphasize that the R-factor will be only one of the measures used to evaluate scientists, and we hope that non-quantifiable evaluation criteria will also stay in place.

Thank you again for your insightful comments, which made us think and which we would be glad to discuss further.

Best regards,

The Verum team.