Data police force will help clean up research

An independent service to check data and algorithms is the only way to resolve the research replication crisis, say Christophe Pérignon and Christophe Hurlin

Published on August 26, 2019

Last updated August 27, 2019

Christophe Pérignon and Christophe Hurlin

In many scientific papers, the core of the analysis is computational. Researchers spend months – sometimes years – collecting and cleaning data, writing and debugging computer code, and then running and rerunning their work. Yet those data and code never enter the peer-review process. No wonder, you might argue, that reproducibility is not the norm in modern science.

When reviewing a manuscript, journal editors and referees have traditionally had to assume that the results outlined are the genuine output from running the researchers’ computer code on their data. Over the past decade, some journals have begun to instruct authors to upload their code and data to dedicated online repositories after the acceptance of the paper, so that, in principle, other researchers can download all the necessary resources to redo their analysis. However, such initiatives have been only partially successful in improving transparency.

There are two main reasons. First, the posted code and data are not checked systematically. Their quality, therefore, is sometimes low – particularly because researchers lack time and incentives to prepare them properly. This makes it hard even for specialists to redo the analysis and fully reproduce an original study.

Second, an increasing number of academic papers rely on confidential data relating to individuals; examples include data on income, employment, taxes and health. These are available only to accredited users within a secure computing environment and cannot be shared. In some cases, an anonymised version of the data can be made public, but recent evidence suggests that this approach cannot yet guarantee that privacy is preserved. A paper recently published in Nature Communications estimates that 99.98 per cent of Americans could be re-identified in any anonymised dataset using as few as 15 attributes, such as gender, zip code or marital status.
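To see why a handful of attributes can single people out, consider the following minimal sketch. It is not the method used in the Nature Communications paper; it simply counts, in a small made-up dataset, the share of records that become unique once several attributes are combined. The function name `unique_fraction` and all the records are hypothetical and chosen for illustration only.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes is unique."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical "anonymised" records: no names, only demographic attributes.
records = [
    {"gender": "F", "zip": "10001", "marital": "single"},
    {"gender": "F", "zip": "10001", "marital": "married"},
    {"gender": "M", "zip": "10001", "marital": "single"},
    {"gender": "M", "zip": "10002", "marital": "single"},
    {"gender": "F", "zip": "10002", "marital": "married"},
    {"gender": "F", "zip": "10002", "marital": "married"},
]

# Each extra attribute makes more records unique, and so easier to re-identify.
for attrs in (["gender"], ["gender", "zip"], ["gender", "zip", "marital"]):
    print(attrs, round(unique_fraction(records, attrs), 3))
```

In this toy dataset, no record is unique on gender alone, but two-thirds are unique once all three attributes are combined. Real datasets with many more attributes behave in the same direction.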

That well-trained researchers are sometimes unable to replicate the results of papers published in their field is a serious concern and calls for action. Some academic journals take the issue very seriously and rerun authors’ code on their data to check for reproducibility. The journal Biostatistics has been implementing such a verification process for several years, and the American Economic Review recently announced that it is about to do the same. Many journals, however, lack the time or specialised staff to deal with numerous software and data sources.

As an alternative, we advocate an external solution provided by a specialised certification agency, acting as a trusted third party. To this end, we recently launched cascad, the Certification Agency for Scientific Code and Data, as a non-profit academic initiative.

When a researcher requests a reproducibility certificate, a cascad reviewer runs their code on their data to verify that the output corresponds to the results presented in the tables and figures in their manuscript. The certificate can then be submitted to journals alongside the manuscript, giving the editor and reviewers confidence that the paper is all that it seems.
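The core of such a check can be sketched in a few lines. This is an illustration under stated assumptions, not a description of cascad's actual tooling: it assumes the numbers reported in the manuscript and the numbers produced by rerunning the code have both been collected into dictionaries keyed by a label. The function `compare_results`, the labels and the values are all hypothetical.

```python
import math


def compare_results(reported, reproduced, rel_tol=1e-6):
    """Return the labels whose reproduced value is missing or differs from the reported one."""
    mismatches = []
    for label, expected in reported.items():
        actual = reproduced.get(label)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatches.append(label)
    return mismatches


# Hypothetical values as printed in the manuscript's tables.
reported = {
    "table1.mean_return": 0.0421,
    "table1.std_dev": 0.1873,
    "table2.beta": 1.12,
}

# Hypothetical values obtained by rerunning the authors' code on their data.
reproduced = {
    "table1.mean_return": 0.0421,
    "table1.std_dev": 0.1873,
    "table2.beta": 1.15,
}

mismatches = compare_results(reported, reproduced)
print("Not reproduced:", mismatches)
```

A tolerance is used rather than exact equality because floating-point results can differ slightly across machines and software versions. In practice the tolerance would need to reflect the rounding used in the published tables.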

Another key advantage of a trusted third party is its ability to certify the reproducibility of research based on confidential data. For instance, as shown in a recent publication in Science, cascad recently partnered with France’s Secure Data Access Centre, a public body that allows researchers to access and work with confidential governmental data under secure conditions. The centre creates a virtual machine allowing researchers to remotely access the specific datasets needed for their projects, as well as the required statistical software. The cascad reproducibility reviewer then accesses a virtual machine that is a clone of the one used by the author (same data, same code), and the whole process is conducted entirely within the secure computing environment.

Making research reproducible calls for more joint efforts such as this between academic journals, researchers and data providers. Given researchers’ relatively low reproducibility literacy, it is also vital to train them – especially the next generation – to understand and comply with the main principles of reproducible research.

Taking reproducibility seriously is a prerequisite for making science trustworthy and useful to society.

Christophe Pérignon is professor of finance and associate dean for research at HEC Paris, and Christophe Hurlin is professor of economics at the University of Orléans, France. They are co-founders of cascad, the Certification Agency for Scientific Code and Data.

Read more about:

Academic publishing

Science, technology, engineering and mathematics (STEM)

POSTSCRIPT:

Print headline: A badge that gives assurance

Reader's comments (1)

#1 Submitted by i.... on August 27, 2019 - 1:59pm

This is a super interesting initiative. But what happens when, in order to reproduce my analysis, you need to run something for a month on 400 CPUs and use 30TB of disk space?