Building computational models of biological mechanisms involves collecting and synthesizing knowledge about the underlying system and encoding it in an appropriate mathematical form. While this process typically requires substantial manual effort from human experts, key aspects of the modeling process are increasingly being automated or augmented by software tools, allowing for the efficient creation of large models or model ensembles. In this review, we introduce a framework for discussing modeling automation by positioning recent work into three ‘levels’, with the human and the machine taking on different responsibilities at each level. We outline the strengths and weaknesses of current modeling approaches at the different levels and discuss the prospect of fully automated fit-to-purpose modeling of biological systems.
Summary: Gilda is a software tool and web service that implements a scored string-matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes, and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species prioritization in cases of ambiguity.
Availability: The Gilda web service is available at http://grounding.indra.bio; source code, documentation, and tutorials are available via https://github.com/indralab/gilda.
Contact: benjamin_gyori@hms.harvard.edu
Competing Interest Statement: The authors have declared no competing interest.
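The core idea of scored string matching can be illustrated with a small sketch. This is not Gilda's actual scoring algorithm or lexicon format (both the score values and the `lexicon` structure here are hypothetical); it only shows how a raw text string can be ranked against candidate identifiers via exact, case-insensitive, and normalized matches:

```python
# Illustrative sketch of scored string matching for grounding. The scoring
# tiers and lexicon format are hypothetical, not Gilda's actual algorithm.

def score_match(query: str, entry_text: str) -> float:
    """Score a query against one lexical entry: exact match scores highest,
    then case-insensitive match, then dash/space-insensitive match."""
    if query == entry_text:
        return 1.0
    if query.lower() == entry_text.lower():
        return 0.9
    norm = lambda s: s.lower().replace("-", "").replace(" ", "")
    if norm(query) == norm(entry_text):
        return 0.8
    return 0.0

def ground(query: str, lexicon: dict) -> list:
    """Return (identifier, score) pairs sorted by descending score.
    `lexicon` maps identifiers to lists of names and synonyms."""
    results = []
    for identifier, synonyms in lexicon.items():
        best = max((score_match(query, s) for s in synonyms), default=0.0)
        if best > 0:
            results.append((identifier, best))
    return sorted(results, key=lambda r: -r[1])

# Toy lexicon with two ERK-family kinases
lexicon = {
    "HGNC:6871": ["MAPK1", "ERK2", "ERK-2"],
    "HGNC:6877": ["MAPK3", "ERK1", "ERK-1"],
}
print(ground("Erk2", lexicon))  # best match via case-insensitive "ERK2"
```

In the real system, ambiguity between candidates with similar scores is then resolved by the machine-learned disambiguation models using surrounding text.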
Deubiquitinating enzymes (DUBs) are proteases that remove ubiquitin conjugates from proteins, thereby regulating protein turnover. Inhibition of DUBs promises to make classically undruggable targets such as the tumor suppressor TP53 and oncogene c-Myc amenable to regulation by small molecules. However, the majority of substrates and pathways regulated by DUBs remain unknown, impeding efforts to prioritize specific enzymes for research and drug development. To assemble a knowledgebase of DUB activities, co-dependent genes, and substrates, we combined targeted experiments using CRISPR libraries and inhibitors with systematic mining of functional genomic databases. Analysis of the Dependency Map, Connectivity Map, Cancer Cell Line Encyclopedia, and protein-protein interaction databases yielded specific hypotheses about DUB function, a subset of which were confirmed in follow-on experiments. The data in this paper, which are browsable online via the DUB Portal, promise to improve understanding of DUBs as a family as well as the activities of specific DUBs such as USP14, UCHL5, and USP7, which have been targeted with investigational cancer therapeutics.
Competing Interest Statement: PKS is a member of the SAB or BOD of Applied Biomath, RareCyte Inc., and Glencoe Software; PKS is also a member of the NanoString and Montain Health SABs. In the last five years the Sorger lab has received research funding from Novartis and Merck. Sorger declares that none of these relationships have influenced the content of this manuscript. SJB is a member of the SAB of the Adenoid Cystic Carcinoma Foundation. In the last five years the Buhrlage lab has received research funding from AbbVie and in-kind resources from the Novartis Institutes for Biomedical Research. Buhrlage declares that none of these relationships have influenced the content of this manuscript. SAB is currently an employee of Ginkgo Bioworks; she declares no conflicts of interest. The other authors declare no outside interests.
The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA), consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on tasks with larger numbers of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model, as well as the model architecture, can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.
Competing Interest Statement: Daniel Domingo-Fernandez received salary from Enveda Biosciences.
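A minimal sketch of what a combined text + KG-triple input sequence might look like (the token scheme here is hypothetical and only illustrates the idea of sharing one sequence across modalities; the model's actual tokenizer and embedding scheme differ):

```python
# Illustrative sketch of a multimodal input sequence: text tokens and a
# linearized (subject, relation, object) triple share one sequence,
# separated by marker tokens. Token names are hypothetical.

def build_input(text_tokens, triple):
    """Concatenate text tokens and a KG triple into one input sequence
    for a multimodal Transformer."""
    subj, rel, obj = triple
    return ["[CLS]"] + text_tokens + ["[SEP]", subj, rel, obj, "[SEP]"]

seq = build_input(["EGF", "activates", "EGFR", "signaling"],
                  ("EGF", "increases_activity", "EGFR"))
print(seq)
```

Pre-training on millions of such text-triple pairs lets the model learn joint representations that neither modality supports alone.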
A bottleneck in high-throughput functional genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Gene Ontology (GO) enrichment methods provide insight at the gene set level. Here, we introduce GeneWalk (github.com/churchmanlab/genewalk), which identifies individual genes and their functions relevant to the experimental setting under examination. After the automatic assembly of an experiment-specific gene regulatory network, GeneWalk uses representation learning to quantify the similarity between vector representations of each gene and its GO annotations, yielding annotation significance scores that reflect the experimental context. By performing gene- and condition-specific functional analysis, GeneWalk converts a list of genes into data-driven hypotheses.
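The gene-annotation comparison at GeneWalk's core can be sketched with cosine similarity between node vectors. The embeddings below are toy values for illustration; in the actual method they are learned from random walks on the experiment-specific network, and the similarities are further converted into significance scores:

```python
# Sketch of comparing a gene's vector to its GO annotations' vectors.
# The embeddings are hypothetical stand-ins for learned representations.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings for one gene and two of its GO annotations
gene_vec = [0.9, 0.1, 0.2]
go_vecs = {
    "GO:0006915 (apoptotic process)": [0.8, 0.2, 0.1],
    "GO:0007049 (cell cycle)": [0.1, 0.9, 0.3],
}
for go_term, vec in go_vecs.items():
    print(go_term, round(cosine_similarity(gene_vec, vec), 3))
```

A higher similarity indicates that the annotation is more relevant to the gene's role in the experimental context captured by the network.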
The functions of protein kinases have been heavily studied, and inhibitors for many human kinases have been developed into FDA-approved therapeutics. A substantial fraction of the human kinome is nonetheless understudied. In this paper, members of the NIH Understudied Kinome Consortium mine public data on "dark" kinases to estimate the likelihood that they are functional. We start with a re-analysis of the human kinome and describe the criteria for creating an inclusive set of 710 kinase domains and a curated set of 557 protein kinase-like (PKL) domains. Nearly all PKLs are expressed in one or more CCLE cell lines, and a substantial number are also essential in the Cancer Dependency Map. Dark kinases are frequently differentially expressed or mutated in The Cancer Genome Atlas and other disease databases, and investigational and approved kinase inhibitors appear to inhibit them as off-target activities. Thus, it seems likely that the dark human kinome contains multiple biologically important genes, a subset of which may be viable drug targets.
Author summary: Biological organisms are often said to have robust properties, but it is difficult to understand how such robustness arises from molecular interactions. Here, we use a mathematical model to study how the molecular mechanism of protein modification exhibits the property of multiple internal states, which has been suggested to underlie memory and decision making. The robustness of this property is revealed by the size and shape, or "geography," of the parametric region in which the property holds. We use advances in reducing model complexity and in rapidly solving the underlying equations to extensively sample parameter points in an 8-dimensional space. We find that under realistic molecular assumptions the size of the region is surprisingly small, suggesting that generating multiple internal states with such a mechanism is much harder than expected. While the shape of the region appears straightforward, we find surprising complexity in how the region grows with increasing amounts of the modified substrate. Our approach uses statistical analysis of data generated from a model, rather than from experiments, but leads to precise mathematical conjectures about parameter geography and biological robustness.
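The basic sampling strategy behind measuring a region's "size" can be sketched as a Monte Carlo estimate. The property tested below is a toy stand-in (a 100-fold separation between two rate constants), not the paper's multistability condition, and the parameter ranges are hypothetical:

```python
# Illustrative Monte Carlo estimate of parameter-region size: sample points
# log-uniformly in an 8-dimensional parameter box and record the fraction
# where a property holds. The property here is a toy stand-in.
import math
import random

random.seed(0)

def property_holds(params):
    # Hypothetical condition: the first two rate constants differ by at
    # least 100-fold, mimicking a stringent kinetic requirement.
    return params[0] / params[1] >= 100 or params[1] / params[0] >= 100

def region_fraction(n_samples, dim=8, lo=1e-3, hi=1e3):
    """Fraction of sampled points at which the property holds."""
    hits = 0
    for _ in range(n_samples):
        # log-uniform sampling across six orders of magnitude per dimension
        point = [10 ** random.uniform(math.log10(lo), math.log10(hi))
                 for _ in range(dim)]
        if property_holds(point):
            hits += 1
    return hits / n_samples

print(region_fraction(10000))
```

The fraction of hits, combined with the geometry of where they cluster, is what characterizes the region's size and shape.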
Building causal models of complicated phenomena such as food insecurity is currently a slow and labor-intensive manual process. In this paper, we introduce an approach that builds executable probabilistic models from raw, free text. The proposed approach is implemented through three systems: Eidos, INDRA, and Delphi. Eidos is an open-domain machine reading system designed to extract causal relations from natural language. It is rule-based, allowing for rapid domain transfer, customizability, and interpretability. INDRA aggregates multiple sources of causal information and performs assembly to create a coherent knowledge base and assess its reliability. This assembled knowledge serves as the starting point for modeling. Delphi is a modeling framework that assembles quantified causal fragments and their contexts into executable probabilistic models that respect the semantics of the original text, and can be used to support decision making.
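A toy version of rule-based causal extraction can be sketched with a single pattern. This is far simpler than Eidos' actual grammar (which handles negation, hedging, and quantification); it only illustrates how a rule maps a phrase to a (cause, effect) pair that a downstream assembler like INDRA could aggregate:

```python
# Toy rule-based extraction of causal relations from free text.
# A single hypothetical pattern; real open-domain readers use rich grammars.
import re

PATTERN = re.compile(
    r"(?P<cause>[\w\s]+?)\s+(?:causes|leads to|results in)\s+(?P<effect>[\w\s]+)")

def extract(sentence):
    """Return a (cause, effect) pair if the sentence matches, else None."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return (m.group("cause").strip(), m.group("effect").strip())

print(extract("Drought leads to reduced crop yields"))
```

Because the rules are explicit, they can be inspected and adapted to a new domain quickly, which is the interpretability and customizability advantage the abstract describes.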
A major challenge in analyzing large phosphoproteomic datasets is that information on phosphorylating kinases and other upstream regulators is limited to a small fraction of phosphosites. One approach to addressing this problem is to aggregate and normalize information from all available information sources, including both curated databases and large-scale text mining. However, when we attempted to aggregate information on post-translational modifications (PTMs) from six databases and three text mining systems, we found that a substantial proportion of phosphosites were positioned on non-canonical residue positions. These errors were attributable to the use of residue numbers from non-canonical isoforms, mouse or rat proteins, and post-translationally processed proteins, as well as to errors in curation and text mining. Published mass spectrometry datasets from large-scale efforts such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) also localize many PTMs to non-canonical sequences, precluding their accurate annotation. To address these problems, we developed ProtMapper, an open-source Python tool that automatically normalizes site positions to human protein reference sequences using data from PhosphoSitePlus and Uniprot. ProtMapper identifies valid reference positions with high precision and reasonable recall, making it possible to filter out machine reading errors from text mining and thereby assemble a corpus of 29,400 regulatory annotations for 13,668 sites, a 2.8-fold increase over PhosphoSitePlus, the current gold standard. To our knowledge this corpus represents the most comprehensive source of literature-derived information about phosphosite regulation currently available, and its assembly illustrates the importance of sequence normalization. Combining the expanded corpus of annotations with normalization of CPTAC data nearly doubled the number of CPTAC annotated sites and the mean number of annotations per site.
ProtMapper is available under an open source BSD 2-clause license at https://github.com/indralab/protmapper, and the corpus of phosphosite annotations is available as Supplementary Data with this paper under a CC-BY-NC-SA license. All results from the paper are reproducible from code available at https://github.com/johnbachman/protmapper_paper.
Author Summary: Phosphorylation is a type of chemical modification that can affect the activity, interactions, or cellular location of proteins. Experimentally measured patterns of protein phosphorylation can be used to infer the mechanisms of cell behavior and disease, but this type of analysis depends on the availability of functional information about the regulation and effects of individual phosphorylation sites. In this study we show that inconsistent descriptions of the physical locations of phosphorylation sites on proteins present a barrier to the functional analysis of phosphorylation data. These inconsistencies are found in both pathway databases and text mining results and often come from the underlying scientific publications. We describe a method to normalize phosphosite locations to standard human protein sequences and use this method to robustly aggregate information from many sources. The result is a large body of functional annotations that increases the proportion of phosphosites with known regulators in two large experimental surveys of phosphorylation in cancer.
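The normalization problem can be sketched as follows. The mapping table and reference sequence below are toy stand-ins (ProtMapper's actual mappings come from PhosphoSitePlus and UniProt data), but the rat-to-human MAPK1 T183/T185 offset is a real example of the isoform/species numbering issue the paper describes:

```python
# Illustrative sketch of phosphosite normalization: a site reported with
# non-canonical numbering is remapped to the human reference sequence.
# The mapping table and sequence below are toy stand-ins.

# Hypothetical table: (gene, residue, reported_position) -> reference position
SITE_MAP = {
    ("MAPK1", "T", 183): 185,  # rat Erk2 T183 corresponds to human T185
}

def map_to_reference(gene, residue, position, ref_sequences):
    """Return the reference position for a reported site: keep it if the
    residue matches the canonical sequence, else consult the mapping table."""
    seq = ref_sequences.get(gene, "")
    # 1-based indexing: the site is valid if the residue matches in place
    if 0 < position <= len(seq) and seq[position - 1] == residue:
        return position
    return SITE_MAP.get((gene, residue, position))

# Toy reference sequence with T at position 185 (a stand-in, not real MAPK1)
ref = {"MAPK1": "X" * 184 + "T" + "X" * 20}
print(map_to_reference("MAPK1", "T", 183, ref))  # remapped via table
print(map_to_reference("MAPK1", "T", 185, ref))  # already canonical
```

Sites that can be neither validated nor remapped (the function returns `None` here) are the candidates for filtering out as curation or machine-reading errors.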
INDRA-IPM (Interactive Pathway Map) is a web-based pathway map modeling tool that combines natural language processing with automated model assembly and visualization. INDRA-IPM contextualizes models with expression data and exports them to standard formats.
The rapid accumulation of new biomedical literature not only causes curated knowledge graphs (KGs) to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich KGs. We have developed two workflows: one for re-curating a given KG to assure its syntactic and semantic quality, and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the KGs encoded in Biological Expression Language from the NeuroMMSig database, using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made these workflows freely available at https://github.com/bel-enrichment/bel-enrichment.
We introduce a framework for analyzing ordinary differential equation (ODE) models of biological networks using statistical model checking (SMC). A key aspect of our work is the modeling of single-cell variability by assigning a probability distribution to intervals of initial concentration values and kinetic rate constants. We propagate this distribution through the system dynamics to obtain a distribution over the set of trajectories of the ODEs. This in turn opens the door for performing statistical analysis of the ODE system’s behavior. To illustrate this, we first encode quantitative data and qualitative trends as bounded linear time temporal logic (BLTL) formulas. Based on this, we construct a parameter estimation method using an SMC-driven evaluation procedure applied to the stochastic version of the behavior of the ODE system. We then describe how this SMC framework can be generalized to hybrid automata by exploiting the given distribution over the initial states and the—much more sophisticated—system dynamics to associate a Markov chain with the hybrid automaton. We then establish a strong relationship between the behaviors of the hybrid automaton and its associated Markov chain. Consequently, we sample trajectories from the hybrid automaton in a way that mimics the sampling of the trajectories of the Markov chain. This enables us to verify approximately that the Markov chain meets a BLTL specification with high probability. We have applied these methods to ODE-based models of Toll-like receptor signaling and the crosstalk between autophagy and apoptosis, as well as to systems exhibiting hybrid dynamics including the circadian clock pathway and cardiac cell physiology. We present an overview of these applications and summarize the main empirical results. These case studies demonstrate that our methods can be applied in a variety of practical settings.
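The SMC workflow for the ODE case can be sketched end-to-end with a toy model: sample initial states from the given distribution, simulate, check a bounded-time property on each trajectory, and estimate the probability that the property holds. The decay model, property, and thresholds below are illustrative choices, not the models or specifications from the paper:

```python
# Minimal sketch of statistical model checking for an ODE with a
# distribution over initial states. Toy model and property for illustration.
import random

random.seed(1)

def simulate(x0, k=0.5, dt=0.01, t_max=10.0):
    """Forward-Euler trajectory of dx/dt = -k*x, as a list of (t, x)."""
    traj, x, t = [], x0, 0.0
    while t <= t_max:
        traj.append((t, x))
        x += dt * (-k * x)
        t += dt
    return traj

def property_holds(traj, threshold=1.0, deadline=5.0):
    """Bounded 'eventually': x drops below `threshold` by `deadline`,
    in the spirit of a BLTL formula F[0,5](x < 1)."""
    return any(x < threshold for t, x in traj if t <= deadline)

def estimate_probability(n_samples=1000):
    """Monte Carlo estimate of P(property) under the initial-state
    distribution x0 ~ Uniform(2, 20)."""
    hits = 0
    for _ in range(n_samples):
        x0 = random.uniform(2.0, 20.0)
        if property_holds(simulate(x0)):
            hits += 1
    return hits / n_samples

print(estimate_probability())
```

In the full framework, the number of samples is chosen by a statistical test so the estimate meets a prescribed confidence level, rather than being fixed in advance.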
Background: For automated reading of scientific publications to extract useful information about molecular mechanisms, it is critical that genes, proteins, and other entities be correctly associated with uniform identifiers, a process known as named entity linking or "grounding." Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of machine-readable resources associating synonyms and abbreviations commonly found in biomedical literature with uniform identifiers.
Results: In a task involving automated reading of ~215,000 articles using the REACH event extraction software, we found that grounding was disproportionately inaccurate for multi-protein families (e.g., "AKT") and complexes with multiple subunits (e.g., "NF-kappaB"). To address this problem, we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex, the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ~54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15% to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the BioCreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings.
Conclusion: FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.
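The multi-level membership idea can be sketched with a toy relation table. The entries and the recursive expansion below are illustrative (FamPlex's actual content distinguishes `isa` and `partof` relations and covers hundreds of families), but they show how a family-level grounding expands to gene-level members:

```python
# Illustrative sketch of hierarchical family membership in the FamPlex
# style. Relation entries are toy examples, not the curated resource.

# Relation table: (child, relation, parent)
RELATIONS = [
    ("HGNC:391 (AKT1)", "isa", "FPLX:AKT"),
    ("HGNC:392 (AKT2)", "isa", "FPLX:AKT"),
    ("HGNC:393 (AKT3)", "isa", "FPLX:AKT"),
    ("FPLX:AKT", "isa", "FPLX:AGC_kinase"),
]

def members(family, relations):
    """Recursively collect leaf-level (gene) members of a family; an
    entity with no children is its own sole member."""
    children = [c for c, rel, p in relations if p == family]
    if not children:
        return [family]
    result = []
    for child in children:
        result.extend(members(child, relations))
    return result

print(members("FPLX:AKT", RELATIONS))
print(members("FPLX:AGC_kinase", RELATIONS))  # multi-level expansion
```

This is what lets a mention grounded only to the family level (e.g., "AKT" in text) be connected to mechanistic information curated at the level of individual genes.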