Publications

2023
Charles Tapley Hoyt, Amelia Hoyt, and Benjamin M. Gyori. 3/14/2023. “Prediction and Curation of Missing Biomedical Identifier Mappings with Biomappings.” Bioinformatics, Pp. btad130. Publisher's Version
Rudolf T Pillich, Jing Chen, Christopher Churras, Dylan Fong, Trey Ideker, Sophie N Liu, Benjamin M Gyori, Klas Karis, Keiichiro Ono, Alexander Pico, and Dexter Pratt. 3/6/2023. “NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange.” Bioinformatics, Pp. btad118. Publisher's Version
John A. Bachman, Benjamin M Gyori, and Peter K. Sorger. 3/2023. “Automated assembly of molecular mechanisms at scale from text mining and curated databases.” Molecular Systems Biology, Pp. e11325. Publisher's Version
Abstract:
The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge we show it is possible to extend protein–protein interaction databases and explain co-dependencies in the Cancer Dependency Map.
Sebastian Lobentanzer, Patrick Aloy, Jan Baumbach, Balazs Bohar, Pornpimol Charoentong, Katharina Danhauser, Tunca Doğan, Johann Dreo, Ian Dunham, Adrià Fernandez-Torras, Benjamin M. Gyori, Michael Hartung, Charles Tapley Hoyt, Christoph Klein, Tamas Korcsmaros, Andreas Maier, Matthias Mann, David Ochoa, Elena Pareja-Lorente, Ferdinand Popp, Martin Preusse, Niklas Probul, Benno Schwikowski, Bünyamin Sen, Maximilian T. Strauss, Denes Turei, Erva Ulusoy, Judith Andrea Heidrun Wodke, and Julio Saez-Rodriguez. 2023. “Democratising Knowledge Representation with BioCypher.” arXiv preprint. Publisher's Version
Aakanksha Jain, Benjamin M. Gyori, Sara Hakim, Samuel Bunga, Daniel G Taub, Mari Carmen Ruiz-Cantero, Candace Tong-Li, Nicholas Andrews, Peter K Sorger, and Clifford J Woolf. 2023. “Nociceptor neuroimmune interactomes reveal cell type- and injury-specific inflammatory pain pathways.” bioRxiv. Publisher's Version
2022
Charles Tapley Hoyt, Meghan Balk, Tiffany J Callahan, Daniel Domingo-Fernández, Melissa A. Haendel, Harshad B. Hegde, Daniel S. Himmelstein, Klas Karis, John Kunze, Tiago Lubiana, Nicolas Matentzoglu, Julie McMurry, Sierra Moxon, Christopher J. Mungall, Adriano Rutz, Deepak R. Unni, Egon Willighagen, Donald Winston, and Benjamin M. Gyori. 11/2022. “Unifying the Identification of Biomedical Entities with the Bioregistry.” Scientific Data, 9, 1, Pp. 714. Publisher's Version
Laura M. Doherty, Caitlin E. Mills, Sarah A. Boswell, Xiaoxi Liu, Charles Tapley Hoyt, Benjamin M. Gyori, Sara J. Buhrlage, and Peter K. Sorger. 6/23/2022. “Integrating multi-omics data reveals function and therapeutic potential of deubiquitinating enzymes.” eLife, 11, e72879. Publisher's Version
Benjamin M. Gyori, Charles Tapley Hoyt, and Albert Steppi. 5/2022. “Gilda: biomedical entity text normalization with machine-learned disambiguation as a service.” Bioinformatics Advances, vbac034. Publisher's Version
Abstract:
Summary: Gilda is a software tool and web service which implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species-prioritization in case of ambiguity.
Availability: The Gilda web service is available at http://grounding.indra.bio with source code, documentation and tutorials available via https://github.com/indralab/gilda.
Contact: benjamin_gyori@hms.harvard.edu
Competing Interest Statement: The authors have declared no competing interest.
Bernice Scholten, Laura Guerrero Simón, Shaji Krishnan, Roel Vermeulen, Anjoeka Pronk, Benjamin M. Gyori, John A. Bachman, Jelle Vlaanderen, and Rob Stierum. 3/3/2022. “Automated Network Assembly of Mechanistic Literature for Informed Evidence Identification to Support Cancer Risk Assessment.” Environmental Health Perspectives, 130, 3, Pp. 037002. Publisher's Version
Benjamin M. Gyori and Charles Tapley Hoyt. 2022. “PyBioPAX: biological pathway exchange in Python.” Journal of Open Source Software, 7, 71, Pp. 4136. Publisher's Version
Valérie de Crécy-Lagard, Rocio Amorin de Hegedus, Cecilia Arighi, Jill Babor, Alex Bateman, Ian Blaby, Crysten Blaby-Haas, Alan J Bridge, Stephen K Burley, Stacey Cleveland, Lucy J Colwell, Ana Conesa, Christian Dallago, Antoine Danchin, Anita de Waard, Adam Deutschbauer, Raquel Dias, Yousong Ding, Gang Fang, Iddo Friedberg, John Gerlt, Joshua Goldford, Mark Gorelik, Benjamin M Gyori, Christopher Henry, Geoffrey Hutinet, Marshall Jaroch, Peter D Karp, Liudmyla Kondratova, Zhiyong Lu, Aron Marchler-Bauer, Maria-Jesus Martin, Claire McWhite, Gaurav D Moghe, Paul Monaghan, Anne Morgat, Christopher J Mungall, Darren A Natale, William C Nelson, Seán O’Donoghue, Christine Orengo, Katherine H O’Toole, Predrag Radivojac, Colbie Reed, Richard J Roberts, Dmitri Rodionov, Irina A Rodionova, Jeffrey D Rudolf, Lana Saleh, Gloria Sheynkman, Francoise Thibaud-Nissen, Paul D Thomas, Peter Uetz, David Vallenet, Erica Watson Carter, Peter R Weigele, Valerie Wood, Elisha M Wood-Charlson, and Jin Xu. 2022. “A roadmap for the functional annotation of protein families: a community perspective.” Database, 2022. Publisher's Version
Abstract:
Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3–4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
John Hungerford, Yee Seng Chan, Jessica MacBride, Benjamin M. Gyori, Andrew Zupon, Zheng Tang, Egoitz Laparra, Haoling Qiu, Bonan Min, Yan Zverev, Caitlin Hilverman, Max Thomas, Walt Andrews, Keith Alcock, Zeyu Zhang, Michael Reynolds, Mihai Surdeanu, Steve Bethard, and Rebecca Sharp. 2022. “Taxonomy Builder: a Data-driven and User-centric Tool for Streamlining Taxonomy Construction.” In NAACL HCI+NLP. Seattle, Washington.
Charles Tapley Hoyt, Max Berrendorf, Mikhail Galkin, Volker Tresp, and Benjamin M. Gyori. 2022. “A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs.” Publisher's Version
John A. Bachman, Peter K. Sorger, and Benjamin M. Gyori. 2022. “Assembling a phosphoproteomic knowledge base using ProtMapper to normalize phosphosite information from databases and text mining.” bioRxiv. Publisher's Version
Benedek Rozemberczki, Charles Tapley Hoyt, Anna Gogleva, Piotr Grabowski, Klas Karis, Andrej Lamov, Andriy Nikolov, Sebastian Nilsson, Michael Ughetto, Yu Wang, Tyler Derr, and Benjamin M Gyori. 2022. “ChemicalX: A Deep Learning Library for Drug Pair Scoring.” In KDD 2022. Publisher's Version
Helena Balabin, Charles Tapley Hoyt, Benjamin M. Gyori, John Bachman, Alpha Tom Kodamullil, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. 2022. “ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs.” In SWAT4HCLS 2022, Pp. 103–107. Publisher's Version
Abstract:
While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, their union can better exploit the advantages of such approaches, ultimately improving representations of biology. Using multimodal transformers for such purposes can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs presents an extension to our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, resulting in improved F1 scores by up to 0.066 (i.e., from 0.204 to 0.270) in several tasks such as predicting protein interactions in several contexts. Our work demonstrates how multimodal transformers can be used to integrate heterogeneous sources of information, paving the foundation for future approaches that use multiple modalities for biomedical applications.
Nicolas Matentzoglu, James P. Balhoff, Susan M. Bello, Chris Bizon, Matthew H. Brush, Tiffany J. Callahan, Christopher G. Chute, William D. Duncan, Chris T. A. Evelo, Davera Gabriel, John Graybeal, Alasdair J. G. Gray, Benjamin M. Gyori, Melissa A. Haendel, Henriette Harmse, Nomi L. Harris, Ian Harrow, Harshad Hegde, Amelia L. Hoyt, Charles Tapley Hoyt, Dazhi Jiao, Ernesto Jiménez-Ruiz, Simon Jupp, Hyeongsik Kim, Sebastian Köhler, Thomas Liener, Qinqin Long, James Malone, James A. McLaughlin, Julie A. McMurry, Sierra A. T. Moxon, Monica C. Munoz-Torres, David Osumi-Sutherland, James A. Overton, Bjoern Peters, Tim E. Putman, Núria Queralt-Rosinach, Kent A. Shefchek, Harold Solbrig, Anne E. Thessen, Tania Tudorache, Nicole A. Vasilevsky, Alex H. Wagner, and Christopher J. Mungall. 2022. “A Simple Standard for Sharing Ontological Mappings (SSSOM).” Database, 2022, baac035. Publisher's Version
Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, Benjamin M Gyori, John Bachman, Alpha Tom Kodamullil, Paul G Plöger, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. 2022. “STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs.” Bioinformatics. Publisher's Version
Abstract:
The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.
Competing Interest Statement: Daniel Domingo-Fernandez received salary from Enveda Biosciences.
Anna Niarakis, Marek Ostaszewski, Alexander Mazein, .., Benjamin M Gyori, .., Reinhard Schneider, and the COVID-19 Disease Map Community. 2022. “A versatile and interoperable computational framework for the analysis and modeling of COVID-19 disease mechanisms.” bioRxiv. Publisher's Version
Abstract:
The COVID-19 Disease Map project is a large-scale community effort uniting 277 scientists from 130 institutions around the globe. We use high-quality, mechanistic content describing SARS-CoV-2-host interactions and develop interoperable bioinformatic pipelines for novel target identification and drug repurposing. Community-driven and highly interdisciplinary, the project is collaborative and supports community standards, open access, and the FAIR data principles. The coordination of community work allowed for an impressive step forward in building interfaces between Systems Biology tools and platforms. Our framework links key molecules highlighted from broad omics data analysis and computational modeling to dysregulated pathways in a cell-, tissue- or patient-specific manner. We also employ text mining and AI-assisted analysis to identify potential drugs and drug targets and use topological analysis to reveal interesting structural features of the map. The proposed framework is versatile and expandable, offering a significant upgrade in the arsenal used to understand virus-host interactions and other complex pathologies.
Competing Interest Statement: A. Niarakis collaborates with SANOFI-AVENTIS R&D via a public private partnership grant (CIFRE contract, no 2020/0766). D. Maier and A. Bauch are employed at Biomax Informatics AG and will be affected by any effect of this publication on the commercial version of the AILANI software. J.A. Bachman and B. Gyori received consulting fees from Two Six Labs, LLC. T. Helikar has served as a shareholder and has consulted for Discovery Collective, Inc. R. Balling and R. Schneider are founders and shareholders of MEGENO S.A. and ITTM S.A. J. Saez-Rodriguez receives funding from GSK and Sanofi and consultant fees from Travere Therapeutics. Janet Pinero and Laura I. Furlong are employees and shareholders of MedBioinformatics Solutions SL. The remaining authors have declared that they have no conflict of interest.
2021
Jeffrey Wong, Max Franz, Metin Can Siper, Dylan Fong, Funda Durupinar, Christian Dallago, Augustin Luna, John M. Giorgi, Igor Rodchenkov, Özgün Babur, John A. Bachman, Benjamin M. Gyori, Emek Demir, Gary Bader, and Chris Sander. 12/3/2021. “Author-sourced capture of pathway knowledge in computable form using Biofactoid.” eLife, 10, Pp. e68292. Publisher's Version
