Integrating Replication Tools with Data Repositories

Presentation Date: Wednesday, August 29, 2018


Pre-APSA Data-PASS workshop, Boston, MA

Talk given by Matthew Lau and Mercè Crosas (Harvard University) at the pre-conference APSA workshop organized by Data-PASS.

Recent findings of low levels of reproducibility in research have been a wake-up call to scientists. In addition to the challenges of making study details, data, and metadata available and accessible, the rapid rise of custom analytical software (such as R, MATLAB, and Python scripts) is quickly becoming a significant challenge as well. The analytical scripts used in scientific research are often informal and written without following software best practices, which is leading to a proliferation of irreproducible software. Given the realities of the demands placed on scientists, we have investigated the use of "data provenance" (i.e., a formalized record of a computational process) to produce tools that help researchers improve the transparency and reproducibility of the analytical software associated with a research project. This talk will present the concept of data provenance and how it has been applied to create tools, such as an automatic project "capsule" creation program (encapsulator) and a code cleaning package for R (Rclean), to aid in the process of sharing research through public project repositories like Dataverse.