Software

I believe that scientific research benefits greatly from open software that is developed, tested, and documented according to good software development practices. I also strongly believe that the scientific community gains more from working, robust implementations of methods, available as reusable software, than from descriptions of methods in manuscripts. In this spirit, I aim to make as much of my research as possible available and reusable in the form of open-source software, and have added more than 12,000 GitHub commits since 2016, while also mentoring a team of scientific software developers and reviewing their contributions. Some of the systems I have developed or co-developed are as follows:

  • INDRA (indra.bio, https://github.com/sorgerlab/indra) draws on natural language processing systems and structured databases to collect mechanistic and causal assertions, represents them in a standardized form, and assembles them into various modeling formalisms including causal graphs and dynamical models. As its lead developer, I have contributed almost 5,000 commits to INDRA since 2015. During this period, it has become the basis of an ecosystem of other applications.
  • OpenComet (cometbio.org, https://github.com/bgyori/opencomet) is an open-source software tool providing automated analysis of comet assay images. It has been downloaded thousands of times and cited in hundreds of scientific publications for measuring DNA damage in cells.
  • Bioagents (https://github.com/sorgerlab/bioagents) are problem-solving agents that integrate with human-machine dialogue systems, including the dialogue.bio system. Their capabilities range from incremental automated model building and automated model simulation and analysis to drug-target-disease associations and mechanism searches in the scientific literature, all in the context of real-time human-machine dialogue.
  • Gilda (https://github.com/indralab/gilda, grounding.indra.bio) is a Python package and REST service that grounds (i.e., finds appropriate identifiers in namespaces for) named entities in biomedical text. It includes more than 1,000 machine-learned disambiguation models to choose appropriate senses for ambiguous biomedical entities based on the text context they appear in. It is also very fast, able to ground between 1,000 and 10,000 strings per second.
  • FamPlex (https://github.com/sorgerlab/famplex) implements a collection of resources for human protein families and complexes, synonyms by which they often appear in the scientific literature, and their taxonomical relationships to specific proteins.
  • Protmapper (https://github.com/indralab/protmapper) maps references to sites on human proteins to the human reference sequence based on UniProt, PhosphoSitePlus, and manual curation. It can be used as a Python package, as well as a CLI and a REST service.
  • GeneWalk (https://github.com/churchmanlab/genewalk) determines for individual genes the functions that are relevant in a particular biological context and experimental condition. It uses a random-walk-based deep learning method to capture the context in which individual genes appear in a network of mechanisms provided by INDRA and other systems.
  • EMMAA (https://github.com/indralab/emmaa) implements an Ecosystem of Machine-maintained Models with Automated Analysis, deployed at emmaa.indra.bio. EMMAA evolved from my implementation of The RAS Machine, a framework (https://github.com/sorgerlab/indra/tree/master/indra/tools/machine) for incremental model building by monitoring the literature, originally part of INDRA.
  • PyBioPAX (https://github.com/indralab/pybiopax) implements the BioPAX level 3 object model as a set of Python classes. It exposes API functions to read OWL files into this object model, and to dump OWL files from this object model. This allows for the processing and creation of BioPAX models natively in Python.
  • PyKQML (https://github.com/bgyori/pykqml) is a Python implementation of the Knowledge Query and Manipulation Language, a language and protocol for communication among software agents and knowledge-based systems.
  • INDRA World Modelers service (https://github.com/indralab/indra_wm_service, wm.indra.bio) implements a set of services and assembly procedures to integrate INDRA with machine reading systems and graphical user interfaces in the context of the DARPA World Modelers program.
  • INDRA Interactive Pathway Map (https://github.com/pvtodorov/indra_pathway_map), deployed at http://pathwaymap.indra.bio/, builds on INDRA to allow users to build, contextualize, and share biological pathway models by describing them in natural language.
  • Non-asymptotic confidence intervals for Markov-chain Monte Carlo (https://github.com/bgyori/mcmc_nonasym)
  • Approximate probabilistic verification of hybrid systems (https://github.com/bgyori/hybrid)