Charles Tapley Hoyt, Meghan Balk, Tiffany J Callahan, Daniel Domingo-Fernández, Melissa A. Haendel, Harshad B. Hegde, Daniel S. Himmelstein, Klas Karis, John Kunze, Tiago Lubiana, Nicolas Matentzoglu, Julie McMurry, Sierra Moxon, Christopher J. Mungall, Adriano Rutz, Deepak R. Unni, Egon Willighagen, Donald Winston, and Benjamin M. Gyori. 7/2022. “The Bioregistry: Unifying the Identification of Biomedical Entities through an Integrative, Open, Community-driven Metaregistry.” bioRxiv, 2022.07.08.499378.
Laura M. Doherty, Caitlin E. Mills, Sarah A. Boswell, Xiaoxi Liu, Charles Tapley Hoyt, Benjamin M. Gyori, Sara J. Buhrlage, and Peter K. Sorger. 6/23/2022. “Integrating multi-omics data reveals function and therapeutic potential of deubiquitinating enzymes.” eLife, 11, e72879.
Benjamin M. Gyori, Charles Tapley Hoyt, and Albert Steppi. 5/2022. “Gilda: biomedical entity text normalization with machine-learned disambiguation as a service.” Bioinformatics Advances, vbac034.
Summary: Gilda is a software tool and web service which implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species-prioritization in case of ambiguity. Availability: The Gilda web service is available at with source code, documentation and tutorials are available via benjamin_gyori@hms.harvard.edu. Competing Interest Statement: The authors have declared no competing interest.
Bernice Scholten, Laura Guerrero Simón, Shaji Krishnan, Roel Vermeulen, Anjoeka Pronk, Benjamin M. Gyori, John A. Bachman, Jelle Vlaanderen, and Rob Stierum. 3/3/2022. “Automated Network Assembly of Mechanistic Literature for Informed Evidence Identification to Support Cancer Risk Assessment.” Environmental Health Perspectives, 130, 3, Pp. 037002.
Benedek Rozemberczki, Charles Tapley Hoyt, Anna Gogleva, Piotr Grabowski, Klas Karis, Andrej Lamov, Andriy Nikolov, Sebastian Nilsson, Michael Ughetto, Yu Wang, Tyler Derr, and Benjamin M Gyori. 2022. “ChemicalX: A Deep Learning Library for Drug Pair Scoring.”
Benjamin M. Gyori and Charles Tapley Hoyt. 2022. “PyBioPAX: biological pathway exchange in Python.” Journal of Open Source Software, 7, 71, Pp. 4136.
Charles Tapley Hoyt, Max Berrendorf, Mikhail Galkin, Volker Tresp, and Benjamin M. Gyori. 2022. “A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs.”
Helena Balabin, Charles Tapley Hoyt, Benjamin M. Gyori, John Bachman, Alpha Tom Kodamullil, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. 2022. “ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs.” In SWAT4HCLS 2022, Pp. 103–107.
While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, their union can better exploit the advantages of such approaches, ultimately improving representations of biology. Using multimodal transformers for such purposes can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs presents an extension to our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, resulting in improved F1 scores by up to 0.066 (i.e., from 0.204 to 0.270) in several tasks such as predicting protein interactions in several contexts. Our work demonstrates how multimodal transformers can be used to integrate heterogeneous sources of information, paving the foundation for future approaches that use multiple modalities for biomedical applications.
Nicolas Matentzoglu, James P. Balhoff, Susan M. Bello, Chris Bizon, Matthew H. Brush, Tiffany J. Callahan, Christopher G. Chute, William D. Duncan, Chris T. A. Evelo, Davera Gabriel, John Graybeal, Alasdair J. G. Gray, Benjamin M. Gyori, Melissa A. Haendel, Henriette Harmse, Nomi L. Harris, Ian Harrow, Harshad Hegde, Amelia L. Hoyt, Charles Tapley Hoyt, Dazhi Jiao, Ernesto Jiménez-Ruiz, Simon Jupp, Hyeongsik Kim, Sebastian Köhler, Thomas Liener, Qinqin Long, James Malone, James A. McLaughlin, Julie A. McMurry, Sierra A. T. Moxon, Monica C. Munoz-Torres, David Osumi-Sutherland, James A. Overton, Bjoern Peters, Tim E. Putman, Núria Queralt-Rosinach, Kent A. Shefchek, Harold Solbrig, Anne E. Thessen, Tania Tudorache, Nicole A. Vasilevsky, Alex H. Wagner, and Christopher J. Mungall. 2022. “A Simple Standard for Sharing Ontological Mappings (SSSOM).” Database, 2022, baac035.
Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, Benjamin M Gyori, John Bachman, Alpha Tom Kodamullil, Paul G Plöger, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. 2022. “STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs.” Bioinformatics.
The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at and Interest Statement: Daniel Domingo-Fernandez received salary from Enveda Biosciences.
Jeffrey Wong, Max Franz, Metin Can Siper, Dylan Fong, Funda Durupinar, Christian Dallago, Augustin Luna, John M. Giorgi, Igor Rodchenkov, Özgün Babur, John A. Bachman, Benjamin M. Gyori, Emek Demir, Gary Bader, and Chris Sander. 12/3/2021. “Author-sourced capture of pathway knowledge in computable form using Biofactoid.” eLife, 10, Pp. e68292.
Benjamin M. Gyori and John A. Bachman. 2021. “From knowledge to models: Automated modeling in systems and synthetic biology.” Current Opinion in Systems Biology, 28, Pp. 100362.
Building computational models of biological mechanisms involves collecting and synthesizing knowledge about the underlying system and encoding it in an appropriate mathematical form. While this process typically requires substantial manual effort from human experts, key aspects of the modeling process are increasingly being automated or augmented by software tools, allowing for the efficient creation of large models or model ensembles. In this review, we introduce a framework for discussing modeling automation by positioning recent work into three ‘levels’, with the human and the machine taking on different responsibilities at each level. We outline the strengths and weaknesses of current modeling approaches at the different levels and discuss the prospect of fully automated fit-to-purpose modeling of biological systems.
Benjamin M. Gyori, John A Bachman, and Diana Kolusheva. 2021. “A self-updating causal model of COVID-19 mechanisms built from the scientific literature.” In Proceedings of the BioCreative VII Challenge Evaluation Workshop.
Marek Ostaszewski, Anna Niarakis, .., Benjamin M Gyori, .., and Reinhard Schneider. 2021. “COVID-19 Disease Map, a computational knowledge repository of SARS-CoV-2 virus-host interaction mechanisms.” Molecular Systems Biology, 17, 10, Pp. e10387.
Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and L. Stirling Churchman. 2021. “GeneWalk identifies relevant gene functions for a biological context using network representation learning.” Genome Biology, 22, 55.

A bottleneck in high-throughput functional genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Gene Ontology (GO) enrichment methods provide insight at the gene set level. Here, we introduce GeneWalk that identifies individual genes and their relevant functions critical for the experimental setting under examination. After the automatic assembly of an experiment-specific gene regulatory network, GeneWalk uses representation learning to quantify the similarity between vector representations of each gene and its GO annotations, yielding annotation significance scores that reflect the experimental context. By performing gene- and condition-specific functional analysis, GeneWalk converts a list of genes into data-driven hypotheses.

Albert Steppi, Benjamin Gyori, and John Bachman. 2020. “Adeft: Acromine-based Disambiguation of Entities from Text with applications to the biomedical literature.” Journal of Open Source Software, 5, 45, Pp. 1708.
Nienke Moret, Changchang Liu, Benjamin M. Gyori, John A. Bachman, Albert Steppi, Rahil Taujale, Liang-Chin Huang, Clemens Hug, Matt Berginski, Shawn Gomez, Natarajan Kannan, and Peter K. Sorger. 2020. “Exploring the understudied human kinome for research and therapeutic opportunities.” bioRxiv.
The functions of protein kinases have been heavily studied and inhibitors for many human kinases have been developed into FDA-approved therapeutics. A substantial fraction of the human kinome is nonetheless understudied. In this paper, members of the NIH Understudied Kinome Consortium mine public data on “dark” kinases to estimate the likelihood that they are functional. We start with a re-analysis of the human kinome and describe the criteria for creation of an inclusive set of 710 kinase domains and a curated set of 557 protein kinase like (PKL) domains. Nearly all PKLs are expressed in one or more CCLE cell lines and a substantial number are also essential in the Cancer Dependency Map. Dark kinases are frequently differentially expressed or mutated in The Cancer Genome Atlas and other disease databases and investigational and approved kinase inhibitors appear to inhibit them as off-target activities. Thus, it seems likely that the dark human kinome contains multiple biologically important genes, a subset of which may be viable drug targets.
Kee-Myoung Nam, Benjamin M. Gyori, Silviana V. Amethyst, Daniel J. Bates, and Jeremy Gunawardena. 2020. “Robustness and parameter geography in post-translational modification systems.” PLOS Computational Biology, 16, 5, Pp. 1-50.
Author summary: Biological organisms are often said to have robust properties but it is difficult to understand how such robustness arises from molecular interactions. Here, we use a mathematical model to study how the molecular mechanism of protein modification exhibits the property of multiple internal states, which has been suggested to underlie memory and decision making. The robustness of this property is revealed by the size and shape, or “geography,” of the parametric region in which the property holds. We use advances in reducing model complexity and in rapidly solving the underlying equations, to extensively sample parameter points in an 8-dimensional space. We find that under realistic molecular assumptions the size of the region is surprisingly small, suggesting that generating multiple internal states with such a mechanism is much harder than expected. While the shape of the region appears straightforward, we find surprising complexity in how the region grows with increasing amounts of the modified substrate. Our approach uses statistical analysis of data generated from a model, rather than from experiments, but leads to precise mathematical conjectures about parameter geography and biological robustness.
Rebecca Sharp, Adarsh Pyarelal, Benjamin M. Gyori, et al. 6/2019. “Eidos, INDRA & Delphi: From Free Text to Executable Causal Models.” In NAACL, Pp. 42-47. Association for Computational Linguistics.
Building causal models of complicated phenomena such as food insecurity is currently a slow and labor-intensive manual process. In this paper, we introduce an approach that builds executable probabilistic models from raw, free text. The proposed approach is implemented through three systems: Eidos, INDRA, and Delphi. Eidos is an open-domain machine reading system designed to extract causal relations from natural language. It is rule-based, allowing for rapid domain transfer, customizability, and interpretability. INDRA aggregates multiple sources of causal information and performs assembly to create a coherent knowledge base and assess its reliability. This assembled knowledge serves as the starting point for modeling. Delphi is a modeling framework that assembles quantified causal fragments and their contexts into executable probabilistic models that respect the semantics of the original text, and can be used to support decision making.
John A. Bachman, Benjamin M. Gyori, and Peter K. Sorger. 2019. “Assembling a phosphoproteomic knowledge base using ProtMapper to normalize phosphosite information from databases and text mining.” bioRxiv.
A major challenge in analyzing large phosphoproteomic datasets is that information on phosphorylating kinases and other upstream regulators is limited to a small fraction of phosphosites. One approach to addressing this problem is to aggregate and normalize information from all available information sources, including both curated databases and large-scale text mining. However, when we attempted to aggregate information on post-translational modifications (PTMs) from six databases and three text mining systems, we found that a substantial proportion of phosphosites were positioned on non-canonical residue positions. These errors were attributable to the use of residue numbers from non-canonical isoforms, mouse or rat proteins, post-translationally processed proteins and also from errors in curation and text mining. Published mass spectrometry datasets from large-scale efforts such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) also localize many PTMs to non-canonical sequences, precluding their accurate annotation. To address these problems, we developed ProtMapper, an open-source Python tool that automatically normalizes site positions to human protein reference sequences using data from PhosphoSitePlus and Uniprot. ProtMapper identifies valid reference positions with high precision and reasonable recall, making it possible to filter out machine reading errors from text mining and thereby assemble a corpus of 29,400 regulatory annotations for 13,668 sites, a 2.8-fold increase over PhosphoSitePlus, the current gold standard. To our knowledge this corpus represents the most comprehensive source of literature-derived information about phosphosite regulation currently available and its assembly illustrates the importance of sequence normalization. Combining the expanded corpus of annotations with normalization of CPTAC data nearly doubled the number of CPTAC annotated sites and the mean number of annotations per site. 
ProtMapper is available under an open source BSD 2-clause license at, and the corpus of phosphosite annotations is available as Supplementary Data with this paper under a CC-BY-NC-SA license. All results from the paper are reproducible from code available at
Summary: Phosphorylation is a type of chemical modification that can affect the activity, interactions, or cellular location of proteins. Experimentally measured patterns of protein phosphorylation can be used to infer the mechanisms of cell behavior and disease, but this type of analysis depends on the availability of functional information about the regulation and effects of individual phosphorylation sites. In this study we show that inconsistent descriptions of the physical locations of phosphorylation sites on proteins present a barrier to the functional analysis of phosphorylation data. These inconsistencies are found in both pathway databases and text mining results and often come from the underlying scientific publications. We describe a method to normalize phosphosite locations to standard human protein sequences and use this method to robustly aggregate information from many sources. The result is a large body of functional annotations that increases the proportion of phosphosites with known regulators in two large experimental surveys of phosphorylation in cancer.