Publications

Journal Article
Břinda, K., Callendrello, A., Ma, K. C., MacFadden, D. R., Charalampous, T., Lee, R. S., Cowley, L., et al. (2020). Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature Microbiology , (5), 455–464. Publisher's VersionAbstract
Surveillance of drug-resistant bacteria is essential for healthcare providers to deliver effective empiric antibiotic therapy. However, traditional molecular epidemiology does not typically occur on a timescale that could impact patient treatment and outcomes. Here we present a method called ‘genomic neighbor typing’ for inferring the phenotype of a bacterial sample by identifying its closest relatives in a database of genomes with metadata. We show that this technique can infer antibiotic susceptibility and resistance for both S. pneumoniae and N. gonorrhoeae. We implemented this with rapid k-mer matching, which, when used on Oxford Nanopore MinION data, can run in real time. This resulted in determination of resistance within ten minutes (sens/spec 91%/100% for S. pneumoniae and 81%/100% N. gonorrhoeae from isolates with a representative database) of sequencing starting, and for clinical metagenomic sputum samples (75%/100% for S. pneumoniae), within four hours of sample collection. This flexible approach has wide application to pathogen surveillance and may be used to greatly accelerate appropriate empirical antibiotic treatment.
2020-genomic-neighbor-typing.pdf
MacFadden, D. R., Coburn, B., Břinda, K., Corbeil, A., Daneman, N., Fisman, D., Lee, R., et al. (2020). Using genetic distance from archived samples for the prediction of antibiotic resistance in Escherichia coli. Antimicrobial Agents and Chemotherapy. Publisher's VersionAbstract

Background. Rising antibiotic resistance increasingly compromises empiric treatment. Knowing the antibiotic susceptibility of a pathogen's close genetic relative(s) may improve empiric antibiotic selection.

Methods. Using genomic and phenotypic data from three separate clinically-derived databases of Escherichia coli isolates, we evaluated multiple genomic methods and statistical models for predicting antibiotic susceptibility, focusing on potentially rapidly available information such as lineage or genetic distance from archived isolates. We applied these methods to derive and validate prediction of antibiotic susceptibility to common antibiotics.

Results. We evaluated 968 separate episodes of suspected and confirmed infection with Escherichia coli from three geographically and temporally separated databases in Ontario, Canada, from 2010-2018. Across all approaches, model performance (AUC) ranges for predicting antibiotic susceptibility were greatest for ciprofloxacin (0.76-0.97), and lowest for trimethoprim-sulfamethoxazole (0.51-0.80). When a model predicted a susceptible isolate, the resulting (post-test) probabilities of susceptibility were sufficient to warrant empiric therapy for most antibiotics (mean 92%). An approach combining multiple models could permit the use of narrower-spectrum oral agents in 2 out of every 3 patients while maintaining high treatment adequacy (∼90%).

Conclusions. Methods based on genetic relatedness to archived samples in E. coli could be used to predict antibiotic resistance and improve antibiotic selection.

2020-using-genetic-distance.pdf
Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., et al. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods , 15 (7), 475–476. Publisher's VersionAbstract
We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges.
2018-bioconda.pdf
Břinda, K., Boeva, V., & Kucherov, G. (2015). RNF: a general framework to evaluate NGS read mappers. Bioinformatics , 32 (1), 136-139. Publisher's VersionAbstract

MOTIVATION: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created.In default of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads.

RESULTS: To solve this obstacle, we have created a generic format Read Naming Format (Rnf) for assigning read names with encoded information about original positions. Futhermore, we have developed an associated software package RnfTools containing two principal components. MIShmash applies one of popular read simulating tools (among DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into Rnf format. LAVEnder evaluates then a given read mapper using simulated reads in Rnf format. A special attention is payed to mapping qualities that serve for parametrization of Roc curves, and to evaluation of the effect of read sample contamination.

AVAILABILITY AND IMPLEMENTATION: RnfTools: http://karel-brinda.github.io/rnftools Spec. of Rnf: http://karel-brinda.github.io/rnf-spec.

2015-rnftools.pdf
Břinda, K., Sykulski, M., & Kucherov, G. (2015). Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics , 31 (22), 3584–3592. Publisher's VersionAbstract

MOTIVATION: Metagenomics is a powerful approach to study genetic content of environmental samples, which has been strongly promoted by next-generation sequencing technologies. To cope with massive data involved in modern metagenomic projects, recent tools rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes.

RESULTS: Within this general framework, we show that spaced seeds provide a significant improvement of classification accuracy, as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects.

Availability and implementation, Supplementary information: Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.

2015-spaced-seeds-metagenomics.pdf
Břinda, K., Pelantová, E., & Turek, O. (2014). Balances of m-bonacci words. Fundamenta informaticae , 132 (1), 33-61. Publisher's VersionAbstract
The m-bonacci word is a generalization of the Fibonacci word to the m-letter alphabet A = {0, . . . , m − 1}. It is the unique fixed point of the Pisot–type substitution ϕm : 0 → 01, 1 → 02, . . . , (m − 2) → 0(m − 1), and (m − 1) → 0. A result of Adamczewski implies the existence of constants c (m) such that the m-bonacci word is c (m) -balanced, i.e., numbers of letter a occurring in two factors of the same length differ at most by c (m) for any letter a ∈ A. The constants c (m) have been already determined for m = 2 and m = 3. In this paper we study the bounds c (m) for a general m ≥ 2. We show that the m-bonacci word is (⌊κm⌋ + 12)-balanced, where κ ≈ 0.58. For m ≤ 12, we improve the constant c (m) by a computer numerical calculation to the value ⌈ m+1 2 ⌉.
2014-balances-m-bonacci.pdf
Balková, Ľ., Břinda, K., & Turek, O. (2011). Abelian complexity of infinite words associated with quadratic Parry numbers. Theoretical Computer Science , 412 (45), 6252–6260. Publisher's VersionAbstract
We derive an explicit formula for the Abelian complexity of infinite words associated with quadratic Parry numbers.
2011-ac-quadratic-parry.pdf
Manuscript
Břinda, K., Baym, M., & Kucherov, G. (2020). Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv. Publisher's VersionAbstract

Motivation De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes.

Results We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.

Availability ProphAsm is written in C++ and is available under the MIT license from http://github.com/prophyle/prophasm.

Břinda, K., Boeva, V., & Kucherov, G. (2018). Ococo: an online variant and consensus caller. arxiv:1712.01146v2. Publisher's VersionAbstract

Motivation: Identifying genomic variants is an essential step for connecting genotype and phenotype. The usual approach consists of statistical inference of variants from alignments of sequencing reads. State-of-the-art variant callers can resolve a wide range of different variant types with high accuracy. However, they require that all read alignments be available from the beginning of variant calling and be sorted by coordinates. Sorting is computationally expensive, both memory- and speed-wise, and the resulting pipelines suffer from storing and retrieving large alignments files from external memory. Therefore, there is interest in developing methods for resource-efficient variant calling.

Results: We present Ococo, the first program capable of inferring variants in a real-time, as read alignments are fed in. Ococo inputs unsorted alignments from a stream and infers single-nucleotide variants, together with a genomic consensus, using statistics stored in compact several-bit counters. Ococo provides a fast and memory-efficient alternative to the usual variant calling. It is particularly advantageous when reads are sequenced or mapped progressively, or when available computational resources are at a premium.

2018-ococo.pdf
Břinda, K., Boeva, V., & Kucherov, G. (2016). Dynamic read mapping and online consensus calling for better variant detection. arXiv:1605.09070. Publisher's VersionAbstract
Variant detection from high-throughput sequencing data is an essential step in identification of alleles involved in complex diseases and cancer. To deal with these massive data, elaborated sequence analysis pipelines are employed. A core component of such pipelines is a read mapping module whose accuracy strongly affects the quality of resulting variant calls. We propose a dynamic read mapping approach that significantly improves read alignment accuracy. The general idea of dynamic mapping is to continuously update the reference sequence on the basis of previously computed read alignments. Even though this concept already appeared in the literature, we believe that our work provides the first comprehensive analysis of this approach. To evaluate the benefit of dynamic mapping, we developed a software pipeline (http://github.com/karel-brinda/dymas) that mimics different dynamic mapping scenarios. The pipeline was applied to compare dynamic mapping with the conventional static mapping and, on the other hand, with the so-called iterative referencing - a computationally expensive procedure computing an optimal modification of the reference that maximizes the overall quality of all alignments. We conclude that in all alternatives, dynamic mapping results in a much better accuracy than static mapping, approaching the accuracy of iterative referencing. To correct the reference sequence in the course of dynamic mapping, we developed an online consensus caller named OCOCO (http://github.com/karel-brinda/ococo). OCOCO is the first consensus caller capable to process input reads in the online fashion. Finally, we provide conclusions about the feasibility of dynamic mapping and discuss main obstacles that have to be overcome to implement it. We also review a wide range of possible applications of dynamic mapping with a special emphasis on variant detection.
2016-dynamic-mapping.pdf
Miscellaneous
Červenka, P., Břinda, K., Hanousková, M., Hofman, P., & Seifert, R. (2016). Blind Friendly Maps: Tactile Maps for the Blind as a Part of the Public Map Portal (Mapy.cz). K. Miesenberger, C. Bühler, & P. Peňáz (Ed.), Computers Helping People with Special Needs. ICCHP 2016. Lecture Notes in Computer Science, vol 9759 , 131–138 . Springer International Publishing. Publisher's VersionAbstract
Blind people can now use maps located at Mapy.cz, thanks to the long-standing joint efforts of the ELSA Center at the Czech Technical University in Prague, the Teiresias Center at Masaryk University, and the company Seznam.cz. Conventional map underlays are automatically adjusted so that they could be read through touch after being printed on microcapsule paper, which opens a whole new perspective in the use of tactile maps. Users may select an area of their choice in the Czech Republic (only within its boundaries, for the time being) and also the production of tactile maps, including the preparation of the map underlays, takes no more than several minutes.
2016-blind-friendly-maps.pdf
Břinda, K. (2014). Languages of lossless seeds. Z. Ésik & Z. Fülöp (Ed.), Proceedings of the 14th International Conference on Automata and Formal Languages , 151, 139–150. Publisher's VersionAbstract
Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that lossless seeds coincide with languages of certain sofic subshifts, hence they can be recognized by finite automata. Moreover, we show that these subshifts are fully given by the number of allowed errors k and the seed margin ℓ. We also show that for a fixed k, optimal seeds must asymptotically satisfy ℓ ∈ Θ(m^(k/(k+1))).
2014-languages-lossless-seeds.pdf
Thesis
Břinda, K. (2016). Novel computational techniques for mapping and classifying Next-Generation Sequencing data . Université Paris-Est. Publisher's VersionAbstract

Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing.

In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.

An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk.

Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.

2016-phd-thesis.pdf
Břinda, K. (2013). Lossless seeds for approximate string matching . Czech Technical University in Prague. Publisher's VersionAbstract
The thesis deals with lossless seeds, which were originally proposed for the purposes of the filtration phase in the DNA similarity search. The case of designing seeds for one error is studied first and sufficiently solved using the greedy algorithm. Obtained results are generalized for the case of two errors, nevertheless, it is shown that the same algorithm does not provide asymptotically optimal seeds in this case. Further on, an idea of seed design based on the so-called cyclic rulers is introduced. Possibilities of generalization for the case of more errors are mentioned. At the end the software created for the purposes of this thesis, which is given on the enclosed CD, is described.
2013-ms-thesis.pdf
Břinda, K. (2011). Abelian complexity of infinite words . Czech Technical University in Prague. Publisher's VersionAbstract
This thesis deals with Abelian complexity of infinite words, i.e., function describing complexity of a given infinite word using sets of Parikh vectors of finite factors of the word. Some of already known results for words generated by primitive substitutions (including connection between Abelian complexity, balance function, discrepancy function and matrix of a primitive substitution) are summarized. Then an explicit formula for Abelian complexity of infinite words associated with quadratic Parry numbers is derived. In addition, a new toolkit TigrWords, which facilitates study of infinite words approximated by their prefixes and may serve to discover of new hypotheses, is described.
2011-bs-thesis.pdf