Publications

2021
Trisovic A, Lau MK, Pasquier T, Crosas M. A large-scale study on research code quality and execution. Arxiv. 2021. Publisher's VersionAbstract
This article presents a study on the quality and execution of research code from publicly-available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. Common coding errors were identified, and some of them were solved with automatic code cleaning to aid code execution. We find that 74\% of R files crashed in the initial execution, while 56\% crashed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals' collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
Trisovic A, Mika K, Boyd C, Feger S, Crosas M. Repository Approaches to Improving the Quality of Shared Data and Code. MDPI Data. 2021. Publisher's VersionAbstract
Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.
2020
Aggregated mobility data could help fight COVID-19
Buckee C, Balsari S, Chan J, Crosas M, Dominici F, Gasser U, Grad Y, Grenfell B, Halloran ME, Kraemer M, et al. Aggregated mobility data could help fight COVID-19. Science. 2020;23 March (Letters). Publisher's Version
Qualitative data sharing and synthesis for sustainability science
Alexander S, Jones K, Bennet N, Buden A, Cox M, Crosas M, Game E, Geary J, Hardy D, Johnson J, et al. Qualitative data sharing and synthesis for sustainability science. Nature Sustainability. 2020;(3) :81-88. Publisher's VersionAbstract
Socio–environmental synthesis as a research approach contributes to broader sustainability policy and practice by reusing data from disparate disciplines in innovative ways. Synthesizing diverse data sources and types of evidence can help to better conceptualize, investigate and address increasingly complex socio–environmental problems. However, sharing qualitative data for re-use remains uncommon when compared to sharing quantitative data. We argue that qualitative data present untapped opportunities for sustainability science, and discuss practical pathways to facilitate and realize the benefits from sharing and reusing qualitative data. However, these opportunities and benefits are also hindered by practical, ethical and epistemological challenges. To address these challenges and accelerate qualitative data sharing, we outline enabling conditions and suggest actions for researchers, institutions, funders, data repository managers and publishers.
FAIR principles: Interpretations and implementation considerations
Jacobsen A, de Azevedo RM, Juty N, Batista D, Coles S, Cornet R, Courtot M, Crosas M, Dumontier M, et al. FAIR principles: Interpretations and implementation considerations. Data Intelligence. 2020;2 (1-2) :10-29. Publisher's VersionAbstract
The FAIR principles have been widely cited, endorsed and adopted by a broad range of stakeholders since their publication in 2016. By intention, the 15 FAIR guiding principles do not dictate specific technological implementations, but provide guidance for improving Findability, Accessibility, Interoperability and Reusability of digital resources. This has likely contributed to the broad adoption of the FAIR principles, because individual stakeholder communities can implement their own FAIR solutions. However, it has also resulted in inconsistent interpretations that carry the risk of leading to incompatible implementations. Thus, while the FAIR principles are formulated on a high level and may be interpreted and implemented in different ways, for true interoperability we need to support convergence in implementation choices that are widely accessible and (re)-usable. We introduce the concept of FAIR implementation considerations to assist accelerated global participation and convergence towards accessible, robust, widespread and consistent FAIR implementations. Any self-identified stakeholder community may either choose to reuse solutions from existing implementations, or when they spot a gap, accept the challenge to create the needed solution, which, ideally, can be used again by other communities in the future. Here, we provide interpretations and implementation considerations (choices and challenges) for each FAIR principle.
2019
Crosas M. Whitepaper on Creating a Platform for the Sharing of Sensitive Online Data. 2019. Publisher's VersionAbstract
Dataverse, DataTags, and a Decade Building a Widely-Used Data Repository Platform
Evaluating FAIR maturity through a scalable, automated, community-governed framework
Wilkinson MD, Dumontier M, Sansone S-A, Olavo L, Prieto M, Batista D, McQuilton P, Kuhn T, Rocca-Serra P, Crosas M, et al. Evaluating FAIR maturity through a scalable, automated, community-governed framework. Nature-Springer Scientific Data. 2019;6 (174). Publisher's VersionAbstract
Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain relevant community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators – community-authored specifications that delimit a specific automatically-measurable FAIR behavior; (2) Compliance Tests – small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine “sees” when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how this translates to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to assist data stewards to incrementally and realistically improve the FAIRness of their resources.
A Data Citation Roadmap for Scholarly Data Repositories
Fenner M, Crosas M, Grethe J, Kennedy D, Hermjakob H, Rocca-Serra P, Durand G, Berjon R, Karcher S, Martone M, et al. A Data Citation Roadmap for Scholarly Data Repositories. Nature-Springer Scientific Data. 2019;6 (28). Publisher's VersionAbstract
This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Expert Group, as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH-funded BioCADDIE (https://biocaddie.org) project. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories. We describe the early adoption of these recommendations 18 months after they have first been published, looking specifically at implementations of machine-readable metadata on dataset landing pages.
2018
Crosas M, Gautier J, Karcher S, Kirilova D, Otalora G, Schwartz A. Data policies of highly-ranked social science journals. SocArXiv. 2018;March.Abstract

By encouraging and requiring that authors share their data in order to publish articles, scholarly journals have become an important actor in the movement to improve the openness of data and the reproducibility of research. But how many social science journals encourage or mandate that authors share the data supporting their research findings? How does the share of journal data policies vary by discipline? What influences these journals’ decisions to adopt such policies and instructions? And what do those policies and instructions look like? We discuss the results of our analysis of the instructions and policies of 291 highly-ranked journals publishing social science research, where we studied the contents of journal data policies and instructions across 14 variables, such as when and how authors are asked to share their data, and what role journal ranking and age play in the existence and quality of data policies and instructions. We also compare our results to the results of other studies that have analyzed the policies of social science journals, although differences in the journals chosen and how each study defines what constitutes a data policy limit this comparison. We conclude that a little more than half of the journals in our study have data policies. A greater share of the economics journals have data policies and mandate sharing, followed by political science/international relations and psychology journals. Finally, we use our findings to make several recommendations: Policies should include the terms “data,” “dataset” or more specific terms that make it clear what to make available; policies should include the benefits of data sharing; journals, publishers, and associations need to collaborate more to clarify data policies; and policies should explicitly ask for qualitative data.

This paper has won the IASSIST & Carto 2018 Best Paper award.

data_policies_of_highly-ranked_social_science_journals.pdf
Pasquier T, Lau MK, Han X, Fong E, Lerner BS, Boose E, Crosas M, Ellison A, Seltzer M. Sharing and Preserving Computational Analyses for Posterity with Encapsulator. Computing in Science and Engineering (CiSE), IEEE. 2018;May. Preprint VersionAbstract
Open data and open-source software may be part of the solution to sciences reproducibility crisis, but they are insufficient to guarantee reproducibility. Requiring minimal end-user expertise, encapsulator creates a “time capsule” with reproducible code in a self-contained computational environment. encapsulator provides end-users with a fully-featured desktop environment for reproducible research. 
cise-2018.pdf
2017
If These Data Could Talk
Pasquier T, Lau M, Trisovic A, Boose E, Couturierer B, Crosas M, Ellison A, Gibson V, Jones C, Seltzer M. If These Data Could Talk. Nature Scientific Data. 2017. Publisher's VersionAbstract
In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of reproducibility. Although there are many dimensions to this issue, we believe that there is a lack of formalism used when describing end-to-end published results, from the data source to the analysis to the final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids both reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications and researchers.
sdata2017114.pdf
Cloud Dataverse: A Data Repository Platform for the Cloud
Crosas M. Cloud Dataverse: A Data Repository Platform for the Cloud. CIO Review. 2017. Publisher's Version
Data Authorship as an Incentive to Data Sharing
Bierer BE, Crosas M, Pierce HH. Data Authorship as an Incentive to Data Sharing. New England Journal of Medicine. 2017;Sounding Board (March 29, 2017). Publisher's Version
2016
A Data Citation Roadmap for Scholarly Data Repositories
Fenner M, Crosas M, Grethe J, Kennedy D, Hermjakob H, Roca-Serra P, Berjon R, Martone M, Clark T. A Data Citation Roadmap for Scholarly Data Repositories. BioArxiv [preprint]. 2016. Publisher's VersionAbstract

This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group, 2014), a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Early Adopters Expert Group, part of the Data Citation Implementation Pilot (DCIP) project (FORCE11, 2015), an initiative of FORCE11.org and the NIH BioCADDIE (2016) program. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories.

datacitationroadmap-097196.pdf
Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets
McKinney B, Meyer P, Crosas M, Sliz P. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets. The Annals of the New York Academy of Sciences. 2016. Publisher's VersionAbstract

Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of file system structure within Dataverse—which is essential for both in-place computation and supporting non-HTTP data transfers.

Meyer P, et al. Data publication with the structural biology data grid supports live analysis. Nature Communications. 2016;(10882). Publisher's VersionAbstract

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

ncomms10882.pdf
Bar-Sinai M, Sweeney L, Crosas M. DataTags, Data Handling Policy Spaces and the Tags Language. 2016 IEEE Security and Privacy Workshops (SPW). 2016. Publisher's VersionAbstract

Widespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, The policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, which are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper.

iwpe16-20.pdf
Wilkinson M, et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data. 2016;(160018). Publisher's VersionAbstract

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

sdata201618.pdf
2015
Altman M, Borgman C, Crosas M, Martone M. An Introduction to the Joint Principles of Data Citation. Bulletin of the Association for Information Science and Technology. 2015;41 (3) :43-44. Publisher's VersionAbstract

Data citation is rapidly emerging as a key practice supporting data access, sharing and reuse, as well as sound and reproducible scholarship. Consensus data citation principles, articulated through the Joint Declaration of Data Citation Principles, represent an advance in the state of the practice and a new consensus on citation.

febmar15_rdap_altman_etal.pdf
Altman M, Castro E, Crosas M, Durbin P, Garnett A, Whitney J. Open Journal Systems and Dataverse Integration-- Helping Journals to Upgrade Data Publication for Reusable Research. Code4Lib Journal. 2015;(Issue 30). Publisher's VersionAbstract

This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact.

codeforlib.pdf

Pages