Albert Steppi, Benjamin Gyori, and John Bachman. 2020. “Adeft: Acromine-based Disambiguation of Entities from Text with applications to the biomedical literature.” Journal of Open Source Software, 5, 45, Pp. 1708.
Nienke Moret, Changchang Liu, Benjamin M. Gyori, John A. Bachman, Albert Steppi, Rahil Taujale, Liang-Chin Huang, Clemens Hug, Matt Berginski, Shawn Gomez, Natarajan Kannan, and Peter K. Sorger. 2020. “Exploring the understudied human kinome for research and therapeutic opportunities.” bioRxiv.
The functions of protein kinases have been heavily studied and inhibitors for many human kinases have been developed into FDA-approved therapeutics. A substantial fraction of the human kinome is nonetheless understudied. In this paper, members of the NIH Understudied Kinome Consortium mine public data on “dark” kinases to estimate the likelihood that they are functional. We start with a re-analysis of the human kinome and describe the criteria for creation of an inclusive set of 710 kinase domains and a curated set of 557 protein kinase like (PKL) domains. Nearly all PKLs are expressed in one or more CCLE cell lines and a substantial number are also essential in the Cancer Dependency Map. Dark kinases are frequently differentially expressed or mutated in The Cancer Genome Atlas and other disease databases and investigational and approved kinase inhibitors appear to inhibit them as off-target activities. Thus, it seems likely that the dark human kinome contains multiple biologically important genes, a subset of which may be viable drug targets.
Kee-Myoung Nam, Benjamin M. Gyori, Silviana V. Amethyst, Daniel J. Bates, and Jeremy Gunawardena. 2020. “Robustness and parameter geography in post-translational modification systems.” PLOS Computational Biology, 16, 5, Pp. 1-50.
Author summary: Biological organisms are often said to have robust properties but it is difficult to understand how such robustness arises from molecular interactions. Here, we use a mathematical model to study how the molecular mechanism of protein modification exhibits the property of multiple internal states, which has been suggested to underlie memory and decision making. The robustness of this property is revealed by the size and shape, or “geography,” of the parametric region in which the property holds. We use advances in reducing model complexity and in rapidly solving the underlying equations to extensively sample parameter points in an 8-dimensional space. We find that under realistic molecular assumptions the size of the region is surprisingly small, suggesting that generating multiple internal states with such a mechanism is much harder than expected. While the shape of the region appears straightforward, we find surprising complexity in how the region grows with increasing amounts of the modified substrate. Our approach uses statistical analysis of data generated from a model, rather than from experiments, but leads to precise mathematical conjectures about parameter geography and biological robustness.
Rebecca Sharp, Adarsh Pyarelal, Benjamin M. Gyori, et al. 6/2019. “Eidos, INDRA & Delphi: From Free Text to Executable Causal Models.” In NAACL, Pp. 42-47. Association for Computational Linguistics.
Building causal models of complicated phenomena such as food insecurity is currently a slow and labor-intensive manual process. In this paper, we introduce an approach that builds executable probabilistic models from raw, free text. The proposed approach is implemented through three systems: Eidos, INDRA, and Delphi. Eidos is an open-domain machine reading system designed to extract causal relations from natural language. It is rule-based, allowing for rapid domain transfer, customizability, and interpretability. INDRA aggregates multiple sources of causal information and performs assembly to create a coherent knowledge base and assess its reliability. This assembled knowledge serves as the starting point for modeling. Delphi is a modeling framework that assembles quantified causal fragments and their contexts into executable probabilistic models that respect the semantics of the original text, and can be used to support decision making.
John A. Bachman, Benjamin M. Gyori, and Peter K. Sorger. 2019. “Assembling a phosphoproteomic knowledge base using ProtMapper to normalize phosphosite information from databases and text mining.” bioRxiv.
A major challenge in analyzing large phosphoproteomic datasets is that information on phosphorylating kinases and other upstream regulators is limited to a small fraction of phosphosites. One approach to addressing this problem is to aggregate and normalize information from all available information sources, including both curated databases and large-scale text mining. However, when we attempted to aggregate information on post-translational modifications (PTMs) from six databases and three text mining systems, we found that a substantial proportion of phosphosites were positioned on non-canonical residue positions. These errors were attributable to the use of residue numbers from non-canonical isoforms, mouse or rat proteins, post-translationally processed proteins and also from errors in curation and text mining. Published mass spectrometry datasets from large-scale efforts such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) also localize many PTMs to non-canonical sequences, precluding their accurate annotation. To address these problems, we developed ProtMapper, an open-source Python tool that automatically normalizes site positions to human protein reference sequences using data from PhosphoSitePlus and Uniprot. ProtMapper identifies valid reference positions with high precision and reasonable recall, making it possible to filter out machine reading errors from text mining and thereby assemble a corpus of 29,400 regulatory annotations for 13,668 sites, a 2.8-fold increase over PhosphoSitePlus, the current gold standard. To our knowledge this corpus represents the most comprehensive source of literature-derived information about phosphosite regulation currently available and its assembly illustrates the importance of sequence normalization. Combining the expanded corpus of annotations with normalization of CPTAC data nearly doubled the number of CPTAC annotated sites and the mean number of annotations per site. 
ProtMapper is available under an open source BSD 2-clause license, and the corpus of phosphosite annotations is available as Supplementary Data with this paper under a CC-BY-NC-SA license. All results from the paper are reproducible from the available code.

Summary

Phosphorylation is a type of chemical modification that can affect the activity, interactions, or cellular location of proteins. Experimentally measured patterns of protein phosphorylation can be used to infer the mechanisms of cell behavior and disease, but this type of analysis depends on the availability of functional information about the regulation and effects of individual phosphorylation sites. In this study we show that inconsistent descriptions of the physical locations of phosphorylation sites on proteins present a barrier to the functional analysis of phosphorylation data. These inconsistencies are found in both pathway databases and text mining results and often come from the underlying scientific publications. We describe a method to normalize phosphosite locations to standard human protein sequences and use this method to robustly aggregate information from many sources. The result is a large body of functional annotations that increases the proportion of phosphosites with known regulators in two large experimental surveys of phosphorylation in cancer.
Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and L. Stirling Churchman. 2019. “GeneWalk identifies relevant gene functions for a biological context using network representation learning.” bioRxiv.
The primary bottleneck in high-throughput genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Existing methods such as Gene Ontology (GO) enrichment analysis provide insight at the gene set level. For individual genes, GO annotations are static and biological context can only be added by manual literature searches. Here, we introduce GeneWalk, a method that identifies individual genes and their relevant functions under a particular experimental condition. After automatic assembly of an experiment-specific gene regulatory network, GeneWalk quantifies the similarity between vector representations of each gene and its GO annotations through representation learning, yielding annotation significance scores that reflect their functional relevance for the experimental context. We demonstrate the use of GeneWalk analysis of RNA-seq and nascent transcriptome (NET-seq) data from human cells and mouse brains, validating the methodology. By performing gene- and condition-specific functional analysis that converts a list of genes into data-driven hypotheses, GeneWalk accelerates the interpretation of high-throughput genetics experiments.
Petar V. Todorov, Benjamin M. Gyori, John A. Bachman, and Peter K. Sorger. 2019. “INDRA-IPM: interactive pathway modeling using natural language with automated assembly.” Bioinformatics, Pp. btz289.

INDRA-IPM (Interactive Pathway Map) is a web-based pathway map modeling tool that combines natural language processing with automated model assembly and visualization. INDRA-IPM contextualizes models with expression data and exports them to standard formats.

Availability and implementation

INDRA-IPM, its source code, and the underlying web service API are available online.

Supplementary information

Supplementary data are available at Bioinformatics online.

Charles Hoyt, Daniel Domingo-Fernandez, Rana Aldisi, Lingling Xu, Kristian Kolpeja, Sandra Spalek, Esther Wollert, John Bachman, Benjamin Gyori, Patrick Greene, and Martin Hofmann-Apitius. 2019. “Re-curation and Rational Enrichment of Knowledge Graphs in Biological Expression Language.” Database (in print), 2019, Pp. baz068.
The rapid accumulation of new biomedical literature not only causes curated knowledge graphs (KGs) to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich KGs. We have developed two workflows: one for re-curating a given KG to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the KGs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made this workflow freely available.
Bing Liu, Benjamin M. Gyori, and P. S. Thiagarajan. 2019. “Statistical Model Checking based Analysis of Biological Networks.” In Automated Reasoning for Systems Biology and Medicine, Pp. 63-92. Springer.
We introduce a framework for analyzing ordinary differential equation (ODE) models of biological networks using statistical model checking (SMC). A key aspect of our work is the modeling of single-cell variability by assigning a probability distribution to intervals of initial concentration values and kinetic rate constants. We propagate this distribution through the system dynamics to obtain a distribution over the set of trajectories of the ODEs. This in turn opens the door for performing statistical analysis of the ODE system’s behavior. To illustrate this, we first encode quantitative data and qualitative trends as bounded linear time temporal logic (BLTL) formulas. Based on this, we construct a parameter estimation method using an SMC-driven evaluation procedure applied to the stochastic version of the behavior of the ODE system. We then describe how this SMC framework can be generalized to hybrid automata by exploiting the given distribution over the initial states and the much more sophisticated system dynamics to associate a Markov chain with the hybrid automaton. We then establish a strong relationship between the behaviors of the hybrid automaton and its associated Markov chain. Consequently, we sample trajectories from the hybrid automaton in a way that mimics the sampling of the trajectories of the Markov chain. This enables us to verify approximately that the Markov chain meets a BLTL specification with high probability. We have applied these methods to ODE-based models of Toll-like receptor signaling and the crosstalk between autophagy and apoptosis, as well as to systems exhibiting hybrid dynamics including the circadian clock pathway and cardiac cell physiology. We present an overview of these applications and summarize the main empirical results. These case studies demonstrate that our methods can be applied in a variety of practical settings.
Somponnat Sampattavanich, Bernhard Steiert, Bernhard A. Kramer, Benjamin M. Gyori, John G. Albeck, and Peter K. Sorger. 2018. “Encoding Growth Factor Identity in the Temporal Dynamics of FOXO3 under the Combinatorial Control of ERK and AKT Kinases.” Cell Systems, 6, 6, Pp. 664-678.
John A Bachman, Benjamin M Gyori, and Peter K Sorger. 2018. “FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining.” BMC Bioinformatics, 19, 248.

Background: For automated reading of scientific publications to
extract useful information about molecular mechanisms it is critical that
genes, proteins and other entities be correctly associated with uniform
identifiers, a process known as named entity linking or "grounding." Correct
grounding is essential for resolving relationships among mined information,
curated interaction databases, and biological datasets. The accuracy of this
process is largely dependent on the availability of machine-readable resources
associating synonyms and abbreviations commonly found in biomedical literature
with uniform identifiers.

Results: In a task involving automated reading of ~215,000
articles using the REACH event extraction software we found that grounding was
disproportionately inaccurate for multi-protein families (e.g., "AKT") and
complexes with multiple subunits (e.g., "NF-kappaB"). To address this
problem we constructed FamPlex, a manually curated resource defining protein
families and complexes as they are commonly encountered in biomedical text. In
FamPlex the gene-level constituents of families and complexes are defined in a
flexible format allowing for multi-level, hierarchical membership. To create
FamPlex, text strings corresponding to entities were identified empirically
from literature and linked manually to uniform identifiers; these identifiers
were also mapped to equivalent entries in multiple related databases. FamPlex
also includes curated prefix and suffix patterns that improve named entity
recognition and event extraction. Evaluation of REACH extractions on a test
corpus of ~54,000 articles showed that FamPlex significantly increased
grounding accuracy for families and complexes (from 15% to 71%). The
hierarchical organization of entities in FamPlex also made it possible to
integrate otherwise unconnected mechanistic information across families,
subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM
reading system and the Biocreative VI Bioentity Normalization Task dataset
demonstrated the utility of FamPlex in other settings.

Conclusion: FamPlex is an effective resource for improving named
entity recognition, grounding, and relationship resolution in automated reading
of biomedical text. The content in FamPlex is available in both tabular and
Open Biomedical Ontology formats under the Creative Commons CC0
license and has been integrated into the TRIPS/DRUM and REACH reading systems.

Benjamin M. Gyori, John A. Bachman, Kartik Subramanian, Jeremy L. Muhlich, Lucian Galescu, and Peter K. Sorger. 11/2017. “From word models to executable models of signaling networks using automated assembly.” Molecular Systems Biology, 13, 11, Pp. 954.
Benjamin M Gyori and Daniel Paulin. 2015. “Hypothesis testing for Markov chain Monte Carlo.” Statistics and Computing, Pp. 1–12.
R. Ramanathan, Yan Zhang, Jun Zhou, Benjamin M. Gyori, Weng-Fai Wong, and P. S. Thiagarajan. 2015. “Parallelized Parameter Estimation of Biological Pathway Models.” Hybrid Systems Biology, 9271, Pp. 37-57.
Yan Zhang, Sriram Sankaranarayanan, and Benjamin M Gyori. 2015. “Simulation-Guided Parameter Synthesis for Chance-Constrained Optimization of Control Systems.” In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Pp. 208–215. IEEE Press.
Benjamin M Gyori, Bing Liu, Soumya Paul, R Ramanathan, and PS Thiagarajan. 2015. “Approximate probabilistic verification of hybrid systems.” Hybrid Systems Biology, 9271, Pp. 96–116.
Benjamin M. Gyori, Gireedhar Venkatachalam, P. S. Thiagarajan, David Hsu, and Marie-Veronique Clement. 2014. “OpenComet: an automated tool for comet assay image analysis.” Redox Biology, 2, Pp. 457-465.
Benjamin M Gyori, Daniel Paulin, and Sucheendra K Palaniappan. 2014. “Probabilistic verification of partially observable dynamical systems.” arXiv preprint arXiv:1411.0976.
Benjamin M. Gyori. 2014. “Probabilistic Approaches to Modeling Uncertainty in Biological Pathway Dynamics.” National University of Singapore.
Sucheendra K. Palaniappan, Benjamin M. Gyori, Bing Liu, David Hsu, and P. S. Thiagarajan. 2013. “Statistical model checking based calibration and analysis of bio-pathway models.” In Proceedings of the 11th International Conference on Computational Methods in Systems Biology, CMSB 13. Austria.