I’m a data technologist and researcher, currently holding two roles at Harvard University: University Research Data Management Officer with Harvard University Information Technology (HUIT), and Chief Data Science and Technology Officer at Harvard's Institute for Quantitative Social Science. My career has included research in astrophysics, design and implementation of software for astronomical observations, development of learning and data management systems for education and biotechnologies, and now leadership of software platforms and tools for research data sharing and analysis across all research fields.

What am I interested in? Advancing open science to facilitate access to and reuse of research data and code while preserving privacy, building software to enhance the quality and productivity of scientific outcomes, improving research data management, and establishing data-centric multidisciplinary collaborations with the aid of technology and a human touch.

Recent Publications

Trisovic A, Lau MK, Pasquier T, Crosas M. A large-scale study on research code quality and execution. arXiv [Internet]. 2021.
This article presents a study on the quality and execution of research code from publicly available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. Common coding errors were identified, and some of them were solved with automatic code cleaning to aid code execution. We find that 74% of R files crashed in the initial execution, while 56% crashed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals' collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
Trisovic A, Mika K, Boyd C, Feger S, Crosas M. Repository Approaches to Improving the Quality of Shared Data and Code. MDPI Data [Internet]. 2021.
Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.
Alexander S, Jones K, Bennet N, Buden A, Cox M, Crosas M, Game E, Geary J, Hardy D, Johnson J, et al. Qualitative data sharing and synthesis for sustainability science. Nature Sustainability [Internet]. 2020;(3):81-88.
Socio–environmental synthesis as a research approach contributes to broader sustainability policy and practice by reusing data from disparate disciplines in innovative ways. Synthesizing diverse data sources and types of evidence can help to better conceptualize, investigate and address increasingly complex socio–environmental problems. However, sharing qualitative data for re-use remains uncommon when compared to sharing quantitative data. We argue that qualitative data present untapped opportunities for sustainability science, and discuss practical pathways to facilitate and realize the benefits from sharing and reusing qualitative data. However, these opportunities and benefits are also hindered by practical, ethical and epistemological challenges. To address these challenges and accelerate qualitative data sharing, we outline enabling conditions and suggest actions for researchers, institutions, funders, data repository managers and publishers.

Recent Presentations

OpenDP, an open-source suite of tools for deploying differential privacy, at MLSE 2020, Monday, December 14, 2020:

Since it was introduced in 2006, differential privacy (DP) has become accepted as a gold standard for ensuring that individual-level information is not leaked through statistical analyses or machine learning on sensitive datasets. OpenDP comes at a time when computation and...
