Volume 20   Issue 2   Year 2025
Application of Suffix Arrays to Detect Repeats in Genomic Sequences

Nazipova N.N.

Institute of Mathematical Problems of Biology, Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, Pushchino, Russia
 
Abstract. In recent years, gapless genomes of higher eukaryotes have become available to researchers. These are the so-called T2T (telomere-to-telomere) assemblies. Compared with the reference genomes studied before the T2T era, they contain very few undefined regions or none at all. Comprehensive information on telomeres, centromeres, ribosomal RNA, complex chromosomal regions and transposable elements has now become available. New computational approaches are needed to find all repeats in genomic sequences without exception. In this article, we describe some of the algorithmic problems that we solved while developing a technology for de novo detection of long fuzzy repeat fragments in the huge symbol arrays that represent genomic sequences. Approaches based on the k-mer composition of the analyzed sequence are promising. The article discusses the importance of choosing an adequate value of k, since the result of the analysis depends on this key parameter of the algorithm.
 
Key words: DNA repeats, suffix and prefix trees, suffix array, Burrows-Wheeler transform, k-mer dictionary
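
As an illustration of why the choice of k matters, the following minimal Python sketch builds a suffix array and an LCP array for a toy sequence and lists the k-mers that occur more than once. It is not the method of the article: it finds exact repeats only (the article addresses fuzzy repeats), it uses a naive suffix array construction that would not scale to a genome, and the function names and the toy sequence are assumptions introduced here.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai's algorithm: lcp[i] is the length of the longest common prefix
    # of the suffixes starting at sa[i - 1] and sa[i]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_kmers(s, k):
    # Map each k-mer that occurs at least twice to its sorted start positions.
    # Adjacent suffixes in the suffix array that share a prefix of length >= k
    # start with the same k-mer.
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    repeats = {}
    for i in range(1, len(s)):
        if lcp[i] >= k:
            kmer = s[sa[i]:sa[i] + k]
            repeats.setdefault(kmer, set()).update((sa[i - 1], sa[i]))
    return {kmer: sorted(pos) for kmer, pos in repeats.items()}


seq = "ACGTACGTTT"  # hypothetical toy sequence
for k in (2, 3, 4, 5):
    print(k, repeated_kmers(seq, k))
```

On this toy input, a small k reports several short repeated k-mers, k = 4 reports a single repeat, and k = 5 reports none, which shows how strongly the result depends on k.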
 
Original Article
Nazipova N.N. Application of Suffix Arrays to Detect Repeats in Genomic Sequences. Mathematical Biology and Bioinformatics. 2025;20(2):348-362. doi: 10.17537/2025.20.348
(published in Russian)


  Copyright IMPB RAS © 2005-2026