Publications

Forthcoming
Jones K, Alexander SM, Bennett N, Bishop L, Budden A, Cox M, Crosas M, Game E, Geary J, Hahn C, et al. Qualitative data sharing and re-use for socio-environmental systems research: A synthesis of opportunities, challenges, resources and approaches. [Internet]. Forthcoming.

Researchers in many disciplines, both social and natural sciences, have a long history of collecting and analyzing qualitative data to answer questions that have many dimensions, to interpret other research findings, and to characterize processes that are not easily quantified. Qualitative data is increasingly being used in socio-environmental systems research and related interdisciplinary efforts to address complex sustainability challenges. There are many scientific, descriptive and material benefits to be gained from sharing and re-using qualitative data, some of which reflect the broader push toward open science, and some of which are unique to qualitative research traditions. However, although open data availability is increasingly becoming an expectation in many fields and methodological approaches that work on socio-environmental topics, there remain many challenges associated with the sharing and re-use of qualitative data in particular. This white paper discusses opportunities, challenges, resources and approaches for qualitative data sharing and re-use for socio-environmental research. The content and findings of the paper are a synthesis and extension of discussions that began during a workshop funded by the National Socio-Environmental Synthesis Center (SESYNC) and held at the Center Feb. 28-March 2, 2017. The structure of the paper reflects the starting point for the workshop, which focused on opportunities, challenges and resources for qualitative data sharing, and presents as well the workshop outputs focused on developing a novel approach to qualitative data sharing considerations and creating recommendations for how a variety of actors can further support and facilitate qualitative data sharing and re-use.
The white paper is organized into five sections to address the following objectives: (1) Define qualitative data and discuss the benefits of sharing it along with its role in socio-environmental synthesis; (2) Review the practical, epistemological, and ethical challenges regarding sharing such data; (3) Identify the landscape of resources available for sharing qualitative data including repositories and communities of practice; (4) Develop a novel framework for identifying levels of processing and access to qualitative data; and (5) Suggest roles and responsibilities for key actors in the research ecosystem that can improve the longevity and use of qualitative data in the future.

2017
Pasquier T, Lau M, Trisovic A, Boose E, Couturier B, Crosas M, Ellison A, Gibson V, Jones C, Seltzer M. If These Data Could Talk. Nature Scientific Data [Internet]. 2017.
In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of reproducibility. Although there are many dimensions to this issue, we believe that there is a lack of formalism used when describing end-to-end published results, from the data source to the analysis to the final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications and researchers.
Crosas M. Cloud Dataverse: A Data Repository Platform for the Cloud. CIO Review [Internet]. 2017.
Bierer BE, Crosas M, Pierce HH. Data Authorship as an Incentive to Data Sharing. New England Journal of Medicine [Internet]. 2017;Sounding Board (March 29, 2017).
2016
Fenner M, Crosas M, Grethe J, Kennedy D, Hermjakob H, Roca-Serra P, Berjon R, Martone M, Clark T. A Data Citation Roadmap for Scholarly Data Repositories. bioRxiv [preprint] [Internet]. 2016.

This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group, 2014), a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Early Adopters Expert Group, part of the Data Citation Implementation Pilot (DCIP) project (FORCE11, 2015), an initiative of FORCE11.org and the NIH BioCADDIE (2016) program. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories.

McKinney B, Meyer P, Crosas M, Sliz P. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets. The Annals of the New York Academy of Sciences [Internet]. 2016.

Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of file system structure within Dataverse—which is essential for both in-place computation and supporting non-HTTP data transfers.

Meyer P, et al. Data publication with the structural biology data grid supports live analysis. Nature Communications [Internet]. 2016;(10882).

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

Bar-Sinai M, Sweeney L, Crosas M. DataTags, Data Handling Policy Spaces and the Tags Language. 2016 IEEE Security and Privacy Workshops (SPW) [Internet]. 2016.

Widespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, the policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, who are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper.

Wilkinson M, et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data [Internet]. 2016;(160018).

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

2015
Altman M, Borgman C, Crosas M, Martone M. An Introduction to the Joint Principles of Data Citation. Bulletin of the Association for Information Science and Technology [Internet]. 2015;41 (3) :43-44.

Data citation is rapidly emerging as a key practice supporting data access, sharing and reuse, as well as sound and reproducible scholarship. Consensus data citation principles, articulated through the Joint Declaration of Data Citation Principles, represent an advance in the state of the practice and a new consensus on citation.

Altman M, Castro E, Crosas M, Durbin P, Garnett A, Whitney J. Open Journal Systems and Dataverse Integration-- Helping Journals to Upgrade Data Publication for Reusable Research. Code4Lib Journal [Internet]. 2015;(Issue 30).

This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact.

Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: the DataTags System. Technology Science [Internet]. 2015.

Society generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

Sweeney L, Crosas M. An Open Science Platform for the Next Generation of Data. arXiv.org Computer Science, Computers and Society [Internet]. 2015.

Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable. (This manuscript was originally conceived in 2013.)

Starr J, Castro E, Crosas M, Dumontier M, Downs RR, Duerr R, Haak L, Haendel M, Herman I, Hodson S, et al. Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Computer Science [Internet]. 2015.
Crosas M, Honaker J, King G, Sweeney L. Automating Open Science for Big Data. ANNALS of the American Academy of Political and Social Science [Internet]. 2015;659 (1) :260-273.

The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

2014
Pepe A, Goodman A, Muench A, Crosas M, Erdmann C. How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. PLoS ONE. 2014;9.
Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, Crosas M, Di Stefano R, Gil Y, Groth P, Hedstrom M. Ten simple rules for the care and feeding of scientific data. PLoS Computational Biology. 2014;10.
2013
Crosas M. A data sharing story. Journal of eScience Librarianship [Internet]. 2013;1 :7.
Altman M, Crosas M. The evolution of data citation: From principles to implementation. IASSIST Quarterly [Internet]. 2013;37.
Rajasekar A, Sankaran S, Lander H, Carsey T, Crabtree J, Crosas M, King G, Kum H-C, Zhan J. Sociometric Methods for Relevancy Analysis of Long Tail Science Data, in Social Computing (SocialCom), 2013 International Conference on. IEEE ; 2013 :1–6.
