Data police force will help clean up research

An independent service to check data and algorithms is the only way to resolve the research replication crisis, say Christophe Pérignon and Christophe Hurlin

August 26, 2019

In many scientific papers, the core of the analysis is computational. Researchers spend months – sometimes years – collecting and cleaning data, writing and debugging computer code, and then running and rerunning their work. Yet those data and code never enter the peer-review process. No wonder, you might argue, that reproducibility is not the norm in modern science.

When reviewing a manuscript, journal editors and referees have traditionally had to assume that the results outlined are the genuine output from running the researchers’ computer code on their data. Over the past decade, some journals have begun to instruct authors to upload their code and data to dedicated online repositories after the acceptance of the paper, so that, in principle, other researchers can download all the necessary resources to redo their analysis. However, such initiatives have been only partially successful in improving transparency.

There are two main reasons. First, the posted code and data are not checked systematically. Their quality, therefore, is sometimes low – particularly because researchers lack time and incentives to prepare them properly. This makes it hard even for specialists to redo the analysis and fully reproduce an original study.

Second, an increasing number of academic papers rely on confidential data relating to individuals; examples include data on income, employment, taxes and health. These are available only to accredited users within a secure computing environment and cannot be shared. In some cases, an anonymised version of the data can be made public, but recent evidence suggests that this approach cannot yet guarantee that privacy is preserved. A paper recently published in Nature Communications estimates that 99.98 per cent of Americans could be correctly re-identified in any anonymised dataset with 15 demographic attributes, such as gender, zip code or marital status.
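To make the re-identification risk concrete, here is a minimal sketch in Python that measures how many records in a dataset become unique once a few attributes are combined. It is a simple uniqueness count, not the statistical model used in the Nature Communications paper, and the records, attribute names and the function `unique_fraction` are all hypothetical.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes is unique."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical "anonymised" records: no names, only demographic attributes.
records = [
    {"gender": "F", "zip": "10001", "marital": "single"},
    {"gender": "F", "zip": "10001", "marital": "married"},
    {"gender": "M", "zip": "10001", "marital": "single"},
    {"gender": "M", "zip": "10001", "marital": "single"},
]

share_gender_only = unique_fraction(records, ["gender"])
share_all_three = unique_fraction(records, ["gender", "zip", "marital"])
```

In this toy data, nobody is unique on gender alone, but half of the records are unique once the three attributes are combined. Each added attribute tends to push that share upwards, which is why removing names is not enough to protect privacy.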

That well-trained researchers are sometimes unable to replicate the results of papers published in their field is a serious concern and calls for action. Some academic journals take the issue very seriously and rerun authors’ code on their data to check for reproducibility. The journal Biostatistics has been implementing such a verification process for several years, and the American Economic Review recently announced that it is about to do the same. Many journals, however, lack the time or specialised staff to deal with numerous software and data sources.

As an alternative, we advocate an external solution provided by a specialised certification agency, acting as a trusted third party. To this end, we recently launched cascad, the Certification Agency for Scientific Code and Data, as a non-profit academic initiative.

When a researcher requests a reproducibility certificate, a cascad reviewer runs their code on their data to verify that the output corresponds to the results presented in the tables and figures in their manuscript. The certificate can then be submitted to journals alongside the manuscript, giving the editor and reviewers confidence that the paper is all that it seems.
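The core of such a check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not cascad's actual tooling: the function `compare_results`, the result names, the numbers and the tolerance are all hypothetical.

```python
import math


def compare_results(reported, reproduced, rel_tol=1e-6):
    """Return (name, reported, reproduced) for every value that does not match."""
    mismatches = []
    for name, expected in reported.items():
        actual = reproduced.get(name)
        # A result missing from the rerun counts as a mismatch.
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append((name, expected, actual))
    return mismatches


# Hypothetical numbers: values printed in the manuscript versus values
# obtained by rerunning the authors' code on their data.
reported = {"table1_coef": 0.412, "table1_se": 0.057, "table2_r2": 0.83}
reproduced = {"table1_coef": 0.412, "table1_se": 0.061, "table2_r2": 0.83}

mismatches = compare_results(reported, reproduced)
```

Here the rerun disagrees with the manuscript on one standard error, so `mismatches` holds a single entry for `table1_se`. A relative tolerance is used because small numerical differences between machines and software versions are common; how much deviation is acceptable is a judgement for the reviewer.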

Another key advantage of a trusted third party is its ability to certify the reproducibility of research based on confidential data. For instance, as shown in a recent publication in Science, cascad recently partnered with France’s Secure Data Access Centre, a public body that allows researchers to access and work with confidential governmental data under secure conditions. The centre creates a virtual machine allowing researchers to remotely access the specific datasets needed for their projects, as well as the required statistical software. The cascad reproducibility reviewer then accesses a virtual machine that is a clone of the one used by the author (same data, same code), and the whole process is conducted entirely within the secure computing environment.

Making research reproducible calls for more joint efforts such as this between academic journals, researchers and data providers. Given researchers’ relatively low reproducibility literacy, it is also vital to train them – especially the next generation – to understand and comply with the main principles of reproducible research.

Taking reproducibility seriously is a prerequisite for making science trustworthy and useful to society.

Christophe Pérignon is professor of finance and associate dean for research at HEC Paris, and Christophe Hurlin is professor of economics at the University of Orléans, France. They are co-founders of cascad, the Certification Agency for Scientific Code and Data.

POSTSCRIPT:

Print headline: A badge that gives assurance

Reader's comments (1)

This is a super interesting initiative. But what happens when, in order to reproduce my analysis, you need to run something for a month on 400 CPUs and use 30TB of disk space?
