%0 Journal Article
%J Bioinformatics
%D 2023
%T Prediction and Curation of Missing Biomedical Identifier Mappings with Biomappings
%A Hoyt, Charles Tapley
%A Amelia Hoyt
%A Benjamin M. Gyori
%B Bioinformatics
%P btad130
%8 2022/12/02
%G eng
%U https://doi.org/10.1093/bioinformatics/btad130
%0 Journal Article
%J Bioinformatics
%D 2023
%T NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange
%A Rudolf T Pillich
%A Jing Chen
%A Christopher Churras
%A Dylan Fong
%A Ideker, Trey
%A Sophie N Liu
%A Gyori, Benjamin M
%A Karis, Klas
%A Keiichiro Ono
%A Pico, Alexander
%A Dexter Pratt
%B Bioinformatics
%P btad118
%G eng
%U https://doi.org/10.1093/bioinformatics/btad118
%0 Journal Article
%J Molecular Systems Biology
%D 2023
%T Automated assembly of molecular mechanisms at scale from text mining and curated databases
%A John A. Bachman
%A Gyori, Benjamin M
%A Peter K. Sorger
%X The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge we show it is possible to extend protein–protein interaction databases and explain co-dependencies in the Cancer Dependency Map.
%B Molecular Systems Biology
%I Cold Spring Harbor Laboratory
%P e11325
%G eng
%U https://doi.org/10.15252/msb.202211325
%R 10.1101/2022.08.30.505688
%0 Journal Article
%J arXiv preprint
%D 2023
%T Democratising Knowledge Representation with BioCypher
%A Lobentanzer, Sebastian
%A Aloy, Patrick
%A Baumbach, Jan
%A Bohar, Balazs
%A Charoentong, Pornpimol
%A Danhauser, Katharina
%A Doğan, Tunca
%A Dreo, Johann
%A Dunham, Ian
%A Fernandez-Torras, Adrià
%A Benjamin M. Gyori
%A Hartung, Michael
%A Hoyt, Charles Tapley
%A Klein, Christoph
%A Korcsmaros, Tamas
%A Andreas Maier
%A Mann, Matthias
%A Ochoa, David
%A Pareja-Lorente, Elena
%A Popp, Ferdinand
%A Preusse, Martin
%A Probul, Niklas
%A Schwikowski, Benno
%A Sen, Bünyamin
%A Strauss, Maximilian T.
%A Turei, Denes
%A Ulusoy, Erva
%A Wodke, Judith Andrea Heidrun
%A Saez-Rodriguez, Julio
%K FOS: Biological sciences
%K Molecular Networks (q-bio.MN)
%B arXiv preprint
%I arXiv
%G eng
%U https://arxiv.org/abs/2212.13543
%R 10.48550/ARXIV.2212.13543
%0 Journal Article
%J bioRxiv
%D 2023
%T Nociceptor neuroimmune interactomes reveal cell type- and injury-specific inflammatory pain pathways
%A Jain, Aakanksha
%A Benjamin M. Gyori
%A Hakim, Sara
%A Bunga, Samuel
%A Taub, Daniel G
%A Ruiz-Cantero, Mari Carmen
%A Tong-Li, Candace
%A Andrews, Nicholas
%A Sorger, Peter K
%A Woolf, Clifford J
%B bioRxiv
%I Cold Spring Harbor Laboratory
%G eng
%U https://www.biorxiv.org/content/early/2023/02/03/2023.02.01.526526
%R 10.1101/2023.02.01.526526
%0 Journal Article
%J Scientific Data
%D 2022
%T Unifying the Identification of Biomedical Entities with the Bioregistry
%A Hoyt, Charles Tapley
%A Meghan Balk
%A Callahan, Tiffany J
%A Domingo-Fernández, Daniel
%A Melissa A. Haendel
%A Harshad B. Hegde
%A Daniel S. Himmelstein
%A Karis, Klas
%A John Kunze
%A Tiago Lubiana
%A Nicolas Matentzoglu
%A Julie McMurry
%A Sierra Moxon
%A Christopher J. Mungall
%A Adriano Rutz
%A Deepak R. Unni
%A Egon Willighagen
%A Donald Winston
%A Benjamin M. Gyori
%B Scientific Data
%V 9
%P 714
%8 July 2022
%G eng
%U https://www.nature.com/articles/s41597-022-01807-3
%N 1
%0 Journal Article
%J eLife
%D 2022
%T Integrating multi-omics data reveals function and therapeutic potential of deubiquitinating enzymes
%A Laura M. Doherty
%A Caitlin E. Mills
%A Sarah A. Boswell
%A Xiaoxi Liu
%A Hoyt, Charles Tapley
%A Benjamin M. Gyori
%A Sara J. Buhrlage
%A Peter K. Sorger
%B eLife
%I Cold Spring Harbor Laboratory
%V 11
%G eng
%U https://doi.org/10.7554/eLife.72879
%N e72879
%R 10.1101/2021.08.06.455458
%0 Journal Article
%J Bioinformatics Advances
%D 2022
%T Gilda: biomedical entity text normalization with machine-learned disambiguation as a service
%A Benjamin M. Gyori
%A Hoyt, Charles Tapley
%A Steppi, Albert
%X Summary: Gilda is a software tool and web service that implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species-prioritization in case of ambiguity. Availability: The Gilda web service is available at http://grounding.indra.bio, with source code, documentation and tutorials available via https://github.com/indralab/gilda. Contact: benjamin_gyori@hms.harvard.edu. Competing Interest Statement: The authors have declared no competing interest.
%B Bioinformatics Advances
%I Cold Spring Harbor Laboratory
%G eng
%U https://doi.org/10.1093/bioadv/vbac034
%N vbac034
%R 10.1101/2021.09.10.459803
%0 Journal Article
%J Environmental Health Perspectives
%D 2022
%T Automated Network Assembly of Mechanistic Literature for Informed Evidence Identification to Support Cancer Risk Assessment
%A Bernice Scholten
%A Laura Guerrero Simón
%A Shaji Krishnan
%A Vermeulen, Roel
%A Anjoeka Pronk
%A Benjamin M. Gyori
%A John A. Bachman
%A Jelle Vlaanderen
%A Rob Stierum
%B Environmental Health Perspectives
%V 130
%P 037002
%G eng
%U https://doi.org/10.1289/EHP9112
%N 3
%0 Journal Article
%J Journal of Open Source Software
%D 2022
%T PyBioPAX: biological pathway exchange in Python
%A Benjamin M. Gyori
%A Hoyt, Charles Tapley
%B Journal of Open Source Software
%I The Open Journal
%V 7
%P 4136
%G eng
%U https://doi.org/10.21105/joss.04136
%N 71
%R 10.21105/joss.04136
%0 Journal Article
%J Database
%D 2022
%T A roadmap for the functional annotation of protein families: a community perspective
%A de Crécy-Lagard, Valérie
%A Amorin de Hegedus, Rocio
%A Arighi, Cecilia
%A Babor, Jill
%A Bateman, Alex
%A Blaby, Ian
%A Blaby-Haas, Crysten
%A Bridge, Alan J
%A Burley, Stephen K
%A Cleveland, Stacey
%A Colwell, Lucy J
%A Conesa, Ana
%A Christian Dallago
%A Danchin, Antoine
%A de Waard, Anita
%A Deutschbauer, Adam
%A Dias, Raquel
%A Ding, Yousong
%A Fang, Gang
%A Friedberg, Iddo
%A Gerlt, John
%A Goldford, Joshua
%A Gorelik, Mark
%A Gyori, Benjamin M
%A Henry, Christopher
%A Hutinet, Geoffrey
%A Jaroch, Marshall
%A Karp, Peter D
%A Kondratova, Liudmyla
%A Lu, Zhiyong
%A Marchler-Bauer, Aron
%A Martin, Maria-Jesus
%A McWhite, Claire
%A Moghe, Gaurav D
%A Monaghan, Paul
%A Morgat, Anne
%A Mungall, Christopher J
%A Natale, Darren A
%A Nelson, William C
%A O’Donoghue, Seán
%A Orengo, Christine
%A O’Toole, Katherine H
%A Radivojac, Predrag
%A Reed, Colbie
%A Roberts, Richard J
%A Rodionov, Dmitri
%A Rodionova, Irina A
%A Rudolf, Jeffrey D
%A Saleh, Lana
%A Sheynkman, Gloria
%A Thibaud-Nissen, Francoise
%A Thomas, Paul D
%A Uetz, Peter
%A Vallenet, David
%A Carter, Erica Watson
%A Weigele, Peter R
%A Wood, Valerie
%A Wood-Charlson, Elisha M
%A Xu, Jin
%X Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3–4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
%B Database
%V 2022
%8 08
%G eng
%U https://doi.org/10.1093/database/baac062
%R 10.1093/database/baac062
%0 Conference Paper
%B NAACL HCI+NLP
%D 2022
%T Taxonomy Builder: a Data-driven and User-centric Tool for Streamlining Taxonomy Construction
%A John Hungerford
%A Yee Seng Chan
%A Jessica MacBride
%A Benjamin M. Gyori
%A Andrew Zupon
%A Zheng Tang
%A Egoitz Laparra
%A Haoling Qiu
%A Bonan Min
%A Yan Zverev
%A Caitlin Hilverman
%A Max Thomas
%A Walt Andrews
%A Keith Alcock
%A Zeyu Zhang
%A Reynolds, Michael
%A Mihai Surdeanu
%A Steve Bethard
%A Rebecca Sharp
%B NAACL HCI+NLP
%C Seattle, Washington
%G eng
%0 Generic
%D 2022
%T A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs
%A Hoyt, Charles Tapley
%A Berrendorf, Max
%A Galkin, Mikhail
%A Tresp, Volker
%A Benjamin M. Gyori
%K Artificial Intelligence (cs.AI)
%K FOS: Computer and information sciences
%K Machine Learning (cs.LG)
%I arXiv
%G eng
%U https://arxiv.org/abs/2203.07544
%R 10.48550/ARXIV.2203.07544
%0 Journal Article
%J bioRxiv
%D 2022
%T Assembling a phosphoproteomic knowledge base using ProtMapper to normalize phosphosite information from databases and text mining
%A John A. Bachman
%A Peter K. Sorger
%A Benjamin M. Gyori
%B bioRxiv
%I Cold Spring Harbor Laboratory
%G eng
%U https://www.biorxiv.org/content/10.1101/822668v4
%R 10.1101/822668
%0 Conference Paper
%B KDD 2022
%D 2022
%T ChemicalX: A Deep Learning Library for Drug Pair Scoring
%A Rozemberczki, Benedek
%A Hoyt, Charles Tapley
%A Gogleva, Anna
%A Grabowski, Piotr
%A Karis, Klas
%A Lamov, Andrej
%A Nikolov, Andriy
%A Nilsson, Sebastian
%A Ughetto, Michael
%A Yu Wang
%A Derr, Tyler
%A Gyori, Benjamin M
%K Artificial Intelligence (cs.AI)
%K FOS: Computer and information sciences
%K Machine Learning (cs.LG)
%B KDD 2022
%G eng
%U https://doi.org/10.1145/3534678.3539023
%R 10.48550/ARXIV.2202.05240
%0 Conference Paper
%B SWAT4HCLS 2022
%D 2022
%T ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs
%A Balabin, Helena
%A Hoyt, Charles Tapley
%A Benjamin M. Gyori
%A John Bachman
%A Kodamullil, Alpha Tom
%A Martin Hofmann-Apitius
%A Daniel Domingo-Fernández
%X While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, their union can better exploit the advantages of such approaches, ultimately improving representations of biology. Using multimodal transformers for such purposes can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs presents an extension to our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, resulting in improved F1 scores by up to 0.066 (i.e., from 0.204 to 0.270) in several tasks such as predicting protein interactions in several contexts. Our work demonstrates how multimodal transformers can be used to integrate heterogeneous sources of information, paving the foundation for future approaches that use multiple modalities for biomedical applications.
%B SWAT4HCLS 2022
%P 103–107
%G eng
%U https://nbn-resolving.org/urn:nbn:de:hbz:1044-opus-62113
%0 Journal Article
%J Database
%D 2022
%T A Simple Standard for Sharing Ontological Mappings (SSSOM)
%A Nicolas Matentzoglu
%A James P. Balhoff
%A Susan M. Bello
%A Chris Bizon
%A Matthew H. Brush
%A Tiffany J. Callahan
%A Chute, Christopher G.
%A William D. Duncan
%A Chris T. A. Evelo
%A Gabriel, Davera
%A John Graybeal
%A Alasdair J. G. Gray
%A Benjamin M. Gyori
%A Melissa A. Haendel
%A Henriette Harmse
%A Nomi L. Harris
%A Ian Harrow
%A Harshad Hegde
%A Amelia L. Hoyt
%A Hoyt, Charles Tapley
%A Jiao, Dazhi
%A Ernesto Jiménez-Ruiz
%A Simon Jupp
%A Hyeongsik Kim
%A Sebastian Köhler
%A Thomas Liener
%A Qinqin Long
%A Malone, James
%A James A. McLaughlin
%A Julie A. McMurry
%A Sierra A. T. Moxon
%A Monica C. Munoz-Torres
%A David Osumi-Sutherland
%A James A. Overton
%A Peters, Bjoern
%A Tim E. Putman
%A Núria Queralt-Rosinach
%A Kent A. Shefchek
%A Solbrig, Harold
%A Anne E. Thessen
%A Tania Tudorache
%A Nicole A. Vasilevsky
%A Alex H. Wagner
%A Christopher J. Mungall
%B Database
%V 2022
%G eng
%U https://doi.org/10.1093/database/baac035
%N baac035
%0 Journal Article
%J Bioinformatics
%D 2022
%T STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs
%A Balabin, Helena
%A Hoyt, Charles Tapley
%A Birkenbihl, Colin
%A Gyori, Benjamin M
%A John Bachman
%A Kodamullil, Alpha Tom
%A Plöger, Paul G
%A Martin Hofmann-Apitius
%A Domingo-Fernández, Daniel
%X The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k. Competing Interest Statement: Daniel Domingo-Fernandez received salary from Enveda Biosciences.
%B Bioinformatics
%I Cold Spring Harbor Laboratory
%G eng
%U https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac001/6497782
%R 10.1101/2021.08.17.456616
%0 Journal Article
%J bioRxiv
%D 2022
%T A versatile and interoperable computational framework for the analysis and modeling of COVID-19 disease mechanisms
%A Anna Niarakis
%A Marek Ostaszewski
%A Mazein, Alexander
%A ...
%A Gyori, Benjamin M
%A ...
%A Schneider, Reinhard
%A the COVID-19 Disease Map Community
%X The COVID-19 Disease Map project is a large-scale community effort uniting 277 scientists from 130 Institutions around the globe. We use high-quality, mechanistic content describing SARS-CoV-2-host interactions and develop interoperable bioinformatic pipelines for novel target identification and drug repurposing. Community-driven and highly interdisciplinary, the project is collaborative and supports community standards, open access, and the FAIR data principles. The coordination of community work allowed for an impressive step forward in building interfaces between Systems Biology tools and platforms. Our framework links key molecules highlighted from broad omics data analysis and computational modeling to dysregulated pathways in a cell-, tissue- or patient-specific manner. We also employ text mining and AI-assisted analysis to identify potential drugs and drug targets and use topological analysis to reveal interesting structural features of the map. The proposed framework is versatile and expandable, offering a significant upgrade in the arsenal used to understand virus-host interactions and other complex pathologies. Competing Interest Statement: A. Niarakis collaborates with SANOFI-AVENTIS R&D via a public private partnership grant (CIFRE contract, no 2020/0766). D. Maier and A. Bauch are employed at Biomax Informatics AG and will be affected by any effect of this publication on the commercial version of the AILANI software. J.A. Bachman and B. Gyori received consulting fees from Two Six Labs, LLC. T. Helikar has served as a shareholder and has consulted for Discovery Collective, Inc. R. Balling and R. Schneider are founders and shareholders of MEGENO S.A. and ITTM S.A. J. Saez-Rodriguez receives funding from GSK and Sanofi and consultant fees from Travere Therapeutics. Janet Pinero and Laura I. Furlong are employees and shareholders of MedBioinformatics Solutions SL. The remaining authors have declared that they have no Conflict of interest.
%B bioRxiv
%I Cold Spring Harbor Laboratory
%G eng
%U https://www.biorxiv.org/content/early/2022/12/19/2022.12.17.520865
%R 10.1101/2022.12.17.520865
%0 Journal Article
%J eLife
%D 2021
%T Author-sourced capture of pathway knowledge in computable form using Biofactoid
%A Jeffrey Wong
%A Max Franz
%A Metin Can Siper
%A Dylan Fong
%A Funda Durupinar
%A Christian Dallago
%A Augustin Luna
%A John M. Giorgi
%A Igor Rodchenkov
%A Özgün Babur
%A John A. Bachman
%A Benjamin M. Gyori
%A Emek Demir
%A Gary Bader
%A Sander, Chris
%B eLife
%V 10
%P e68292
%G eng
%U https://elifesciences.org/articles/68292
%0 Conference Paper
%B Proceedings of the BioCreative VII Challenge Evaluation Workshop
%D 2021
%T A self-updating causal model of COVID-19 mechanisms built from the scientific literature
%A Benjamin M. Gyori
%A Bachman, John A
%A Diana Kolusheva
%B Proceedings of the BioCreative VII Challenge Evaluation Workshop
%G eng
%U https://biocreative.bioinformatics.udel.edu/media/store/files/2021/Track4_pos_5_BC7_submission_195-3.pdf
%0 Journal Article
%J Molecular Systems Biology
%D 2021
%T COVID-19 Disease Map, a computational knowledge repository of SARS-CoV-2 virus-host interaction mechanisms
%A Marek Ostaszewski
%A Anna Niarakis
%A ...
%A Gyori, Benjamin M
%A ...
%A Schneider, Reinhard
%B Molecular Systems Biology
%V 17
%P e10387
%G eng
%U https://doi.org/10.1101/2020.10.26.356014
%N 10
%0 Journal Article
%J Current Opinion in Systems Biology
%D 2021
%T From knowledge to models: Automated modeling in systems and synthetic biology
%A Benjamin M. Gyori
%A John A. Bachman
%K Automated modeling
%K Dynamical modeling
%K modeling
%K Rule-based modeling
%K synthetic biology
%K Systems Biology
%K Text mining
%X Building computational models of biological mechanisms involves collecting and synthesizing knowledge about the underlying system and encoding it in an appropriate mathematical form. While this process typically requires substantial manual effort from human experts, key aspects of the modeling process are increasingly being automated or augmented by software tools, allowing for the efficient creation of large models or model ensembles. In this review, we introduce a framework for discussing modeling automation by positioning recent work into three ‘levels’, with the human and the machine taking on different responsibilities at each level. We outline the strengths and weaknesses of current modeling approaches at the different levels and discuss the prospect of fully automated fit-to-purpose modeling of biological systems.
%B Current Opinion in Systems Biology
%V 28
%P 100362
%G eng
%U https://www.sciencedirect.com/science/article/pii/S2452310021000561
%R 10.1016/j.coisb.2021.100362
%0 Journal Article
%J Genome Biology
%D 2021
%T GeneWalk identifies relevant gene functions for a biological context using network representation learning
%A Ietswaart, Robert
%A Benjamin M. Gyori
%A John A. Bachman
%A Peter K. Sorger
%A L. Stirling Churchman
%X A bottleneck in high-throughput functional genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Gene Ontology (GO) enrichment methods provide insight at the gene set level. Here, we introduce GeneWalk (github.com/churchmanlab/genewalk) that identifies individual genes and their relevant functions critical for the experimental setting under examination. After the automatic assembly of an experiment-specific gene regulatory network, GeneWalk uses representation learning to quantify the similarity between vector representations of each gene and its GO annotations, yielding annotation significance scores that reflect the experimental context. By performing gene- and condition-specific functional analysis, GeneWalk converts a list of genes into data-driven hypotheses. Summary: INDRA-IPM (Interactive Pathway Map) is a web-based pathway map modeling tool that combines natural language processing with automated model assembly and visualization. INDRA-IPM contextualizes models with expression data and exports them to standard formats. Availability and implementation: INDRA-IPM is available at: http://pathwaymap.indra.bio. Source code is available at http://github.com/sorgerlab/indra_pathway_map. The underlying web service API is available at http://api.indra.bio:8000. Supplementary information: Supplementary data are available at Bioinformatics online.
Background: For automated reading of scientific publications to
extract useful information about molecular mechanisms it is critical that
genes, proteins and other entities be correctly associated with uniform
identifiers, a process known as named entity linking or "grounding." Correct
grounding is essential for resolving relationships among mined information,
curated interaction databases, and biological datasets. The accuracy of this
process is largely dependent on the availability of machine-readable resources
associating synonyms and abbreviations commonly found in biomedical literature
with uniform identifiers.
Results: In a task involving automated reading of ~215,000
articles using the REACH event extraction software we found that grounding was
disproportionately inaccurate for multi-protein families (e.g., "AKT") and
complexes with multiple subunits (e.g., "NF-kappaB"). To address this
problem we constructed FamPlex, a manually curated resource defining protein
families and complexes as they are commonly encountered in biomedical text. In
FamPlex the gene-level constituents of families and complexes are defined in a
flexible format allowing for multi-level, hierarchical membership. To create
FamPlex, text strings corresponding to entities were identified empirically
from literature and linked manually to uniform identifiers; these identifiers
were also mapped to equivalent entries in multiple related databases. FamPlex
also includes curated prefix and suffix patterns that improve named entity
recognition and event extraction. Evaluation of REACH extractions on a test
corpus of ~54,000 articles showed that FamPlex significantly increased
grounding accuracy for families and complexes (from 15% to 71%). The
hierarchical organization of entities in FamPlex also made it possible to
integrate otherwise unconnected mechanistic information across families,
subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM
reading system and the Biocreative VI Bioentity Normalization Task dataset
demonstrated the utility of FamPlex in other settings.
Conclusion: FamPlex is an effective resource for improving named
entity recognition, grounding, and relationship resolution in automated reading
of biomedical text. The content in FamPlex is available in both tabular and
Open Biomedical Ontology formats at
https://github.com/sorgerlab/famplex under the Creative Commons CC0
license and has been integrated into the TRIPS/DRUM and REACH reading systems.