Publications

2018
Břinda, K., et al., 2018. Lineage calling can identify antibiotic resistant clones within minutes. bioRxiv 403204. Publisher's VersionAbstract
Surveillance of circulating drug resistant bacteria is essential for healthcare providers to deliver effective empiric antibiotic therapy. However, the results of surveillance may not be available on a timescale that is optimal for guiding patient treatment. Here we present a method for inferring characteristics of an unknown bacterial sample by identifying the presence of sequence variation across the genome that is linked to a phenotype of interest, in this case drug resistance. We demonstrate an implementation of this principle using sequence k-mer content, matched to a database of known genomes. We show this technique can be applied to data from an Oxford Nanopore device in real time and is capable of identifying the presence of a known resistant strain in 5 minutes, even from a complex metagenomic sample. This flexible approach has wide application to pathogen surveillance and may be used to greatly accelerate diagnoses of resistant infections.
2018-lineage-calling-resistance.pdf
Grüning, B., et al., 2018. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods , 15 (7) , pp. 475–476. Publisher's VersionAbstract
We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges.
2018-bioconda.pdf
Břinda, K., Boeva, V. & Kucherov, G., 2018. Ococo: an online variant and consensus caller. arxiv:1712.01146v2. Publisher's VersionAbstract

Motivation: Identifying genomic variants is an essential step for connecting genotype and phenotype. The usual approach consists of statistical inference of variants from alignments of sequencing reads. State-of-the-art variant callers can resolve a wide range of different variant types with high accuracy. However, they require that all read alignments be available from the beginning of variant calling and be sorted by coordinates. Sorting is computationally expensive, both memory- and speed-wise, and the resulting pipelines suffer from storing and retrieving large alignments files from external memory. Therefore, there is interest in developing methods for resource-efficient variant calling.

Results: We present Ococo, the first program capable of inferring variants in a real-time, as read alignments are fed in. Ococo inputs unsorted alignments from a stream and infers single-nucleotide variants, together with a genomic consensus, using statistics stored in compact several-bit counters. Ococo provides a fast and memory-efficient alternative to the usual variant calling. It is particularly advantageous when reads are sequenced or mapped progressively, or when available computational resources are at a premium.

2018-ococo.pdf
2016
Červenka, P., et al., 2016. Blind Friendly Maps: Tactile Maps for the Blind as a Part of the Public Map Portal (Mapy.cz). In K. Miesenberger, C. Bühler, & P. Peňáz, ed. Computers Helping People with Special Needs. ICCHP 2016. Lecture Notes in Computer Science, vol 9759. Springer International Publishing, pp. 131–138. Publisher's VersionAbstract
Blind people can now use maps located at Mapy.cz, thanks to the long-standing joint efforts of the ELSA Center at the Czech Technical University in Prague, the Teiresias Center at Masaryk University, and the company Seznam.cz. Conventional map underlays are automatically adjusted so that they could be read through touch after being printed on microcapsule paper, which opens a whole new perspective in the use of tactile maps. Users may select an area of their choice in the Czech Republic (only within its boundaries, for the time being) and also the production of tactile maps, including the preparation of the map underlays, takes no more than several minutes.
2016-blind-friendly-maps.pdf
Břinda, K., Boeva, V. & Kucherov, G., 2016. Dynamic read mapping and online consensus calling for better variant detection. arXiv:1605.09070. Publisher's VersionAbstract
Variant detection from high-throughput sequencing data is an essential step in identification of alleles involved in complex diseases and cancer. To deal with these massive data, elaborated sequence analysis pipelines are employed. A core component of such pipelines is a read mapping module whose accuracy strongly affects the quality of resulting variant calls. We propose a dynamic read mapping approach that significantly improves read alignment accuracy. The general idea of dynamic mapping is to continuously update the reference sequence on the basis of previously computed read alignments. Even though this concept already appeared in the literature, we believe that our work provides the first comprehensive analysis of this approach. To evaluate the benefit of dynamic mapping, we developed a software pipeline (http://github.com/karel-brinda/dymas) that mimics different dynamic mapping scenarios. The pipeline was applied to compare dynamic mapping with the conventional static mapping and, on the other hand, with the so-called iterative referencing - a computationally expensive procedure computing an optimal modification of the reference that maximizes the overall quality of all alignments. We conclude that in all alternatives, dynamic mapping results in a much better accuracy than static mapping, approaching the accuracy of iterative referencing. To correct the reference sequence in the course of dynamic mapping, we developed an online consensus caller named OCOCO (http://github.com/karel-brinda/ococo). OCOCO is the first consensus caller capable to process input reads in the online fashion. Finally, we provide conclusions about the feasibility of dynamic mapping and discuss main obstacles that have to be overcome to implement it. We also review a wide range of possible applications of dynamic mapping with a special emphasis on variant detection.
2016-dynamic-mapping.pdf
Břinda, K., 2016. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. Université Paris-Est. Publisher's VersionAbstract

Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing.

In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.

An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk.

Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.

2016-phd-thesis.pdf
2015
Břinda, K., Boeva, V. & Kucherov, G., 2015. RNF: a general framework to evaluate NGS read mappers. Bioinformatics , 32 (1) , pp. 136-139. Publisher's VersionAbstract

MOTIVATION: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created.In default of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads.

RESULTS: To solve this obstacle, we have created a generic format Read Naming Format (Rnf) for assigning read names with encoded information about original positions. Futhermore, we have developed an associated software package RnfTools containing two principal components. MIShmash applies one of popular read simulating tools (among DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into Rnf format. LAVEnder evaluates then a given read mapper using simulated reads in Rnf format. A special attention is payed to mapping qualities that serve for parametrization of Roc curves, and to evaluation of the effect of read sample contamination.

AVAILABILITY AND IMPLEMENTATION: RnfTools: http://karel-brinda.github.io/rnftools Spec. of Rnf: http://karel-brinda.github.io/rnf-spec.

2015-rnftools.pdf
Břinda, K., Sykulski, M. & Kucherov, G., 2015. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics , 31 (22) , pp. 3584–3592. Publisher's VersionAbstract

MOTIVATION: Metagenomics is a powerful approach to study genetic content of environmental samples, which has been strongly promoted by next-generation sequencing technologies. To cope with massive data involved in modern metagenomic projects, recent tools rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes.

RESULTS: Within this general framework, we show that spaced seeds provide a significant improvement of classification accuracy, as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects.

Availability and implementation, Supplementary information: Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.

2015-spaced-seeds-metagenomics.pdf
2014
Břinda, K., Pelantová, E. & Turek, O., 2014. Balances of m-bonacci words. Fundamenta informaticae , 132. Publisher's VersionAbstract
The m-bonacci word is a generalization of the Fibonacci word to the m-letter alphabet A = {0, . . . , m − 1}. It is the unique fixed point of the Pisot–type substitution ϕm : 0 → 01, 1 → 02, . . . , (m − 2) → 0(m − 1), and (m − 1) → 0. A result of Adamczewski implies the existence of constants c (m) such that the m-bonacci word is c (m) -balanced, i.e., numbers of letter a occurring in two factors of the same length differ at most by c (m) for any letter a ∈ A. The constants c (m) have been already determined for m = 2 and m = 3. In this paper we study the bounds c (m) for a general m ≥ 2. We show that the m-bonacci word is (⌊κm⌋ + 12)-balanced, where κ ≈ 0.58. For m ≤ 12, we improve the constant c (m) by a computer numerical calculation to the value ⌈ m+1 2 ⌉.
2014-balances-m-bonacci.pdf
Břinda, K., 2014. Languages of lossless seeds. In Z. Ésik & Z. Fülöp, ed. Proceedings of the 14th International Conference on Automata and Formal Languages. pp. 139–150. Publisher's VersionAbstract
Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that lossless seeds coincide with languages of certain sofic subshifts, hence they can be recognized by finite automata. Moreover, we show that these subshifts are fully given by the number of allowed errors k and the seed margin ℓ. We also show that for a fixed k, optimal seeds must asymptotically satisfy ℓ ∈ Θ(m^(k/(k+1))).
2014-languages-lossless-seeds.pdf
2013
Břinda, K., 2013. Lossless seeds for approximate string matching. Czech Technical University in Prague. Publisher's VersionAbstract
The thesis deals with lossless seeds, which were originally proposed for the purposes of the filtration phase in the DNA similarity search. The case of designing seeds for one error is studied first and sufficiently solved using the greedy algorithm. Obtained results are generalized for the case of two errors, nevertheless, it is shown that the same algorithm does not provide asymptotically optimal seeds in this case. Further on, an idea of seed design based on the so-called cyclic rulers is introduced. Possibilities of generalization for the case of more errors are mentioned. At the end the software created for the purposes of this thesis, which is given on the enclosed CD, is described.
2013-ms-thesis.pdf
2012
Břinda, K., 2012. On the balance of d-bonacci word. In V. Halava, J. Karhumäki, & Y. Matiyasevich, ed. RuFiDiM II, Proceedings of the Second Russian Finnish Symposium on Discrete Mathematics 2012. Turku. Turku. Publisher's Version
2011
Břinda, K., 2011. Abelian complexity of infinite words. Czech Technical University in Prague. Publisher's VersionAbstract
This thesis deals with Abelian complexity of infinite words, i.e., function describing complexity of a given infinite word using sets of Parikh vectors of finite factors of the word. Some of already known results for words generated by primitive substitutions (including connection between Abelian complexity, balance function, discrepancy function and matrix of a primitive substitution) are summarized. Then an explicit formula for Abelian complexity of infinite words associated with quadratic Parry numbers is derived. In addition, a new toolkit TigrWords, which facilitates study of infinite words approximated by their prefixes and may serve to discover of new hypotheses, is described.
2011-bs-thesis.pdf
Balková, Ľ., Břinda, K. & Turek, O., 2011. Abelian complexity of infinite words associated with quadratic Parry numbers. Theoretical Computer Science , 412 (45) , pp. 6252–6260. Publisher's VersionAbstract
We derive an explicit formula for the Abelian complexity of infinite words associated with quadratic Parry numbers.
2011-ac-quadratic-parry.pdf