Application of Suffix Arrays to Detect Repeats in Genomic Sequences
Nazipova N.N.
Institute of Mathematical Problems of Biology, Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, Pushchino, Russia
Abstract. In recent years, gapless genome assemblies of higher eukaryotes have become available to researchers. These are the so-called T2T (telomere-to-telomere) assemblies. Compared to the reference genomes studied before the T2T era, they contain very few undefined regions or none at all. Comprehensive information on telomeres, centromeres, ribosomal RNA, complex chromosomal regions and transposable elements has now become available. New computational approaches are needed to find all repeats in genomic sequences without exception. In this article, we describe some of the algorithmic problems that we solved while developing a technology for de novo detection of long fuzzy repeats in the huge symbol arrays that represent genomic sequences. Approaches that use the k-mer composition of the analyzed sequence are promising. The article discusses the importance of choosing an adequate value of k, since the result of the analysis depends on this key parameter of the algorithm.
Key words: DNA repeats, suffix and prefix trees, suffix array, Burrows–Wheeler transform, k-mer dictionary
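To make the role of the parameter k concrete, the following minimal Python sketch builds a suffix array and an LCP array for a toy sequence and uses them to list every k-mer that occurs more than once. It is an illustration under stated assumptions, not the method of this article: it finds exact repeats only (not fuzzy ones), the suffix array is built by naive sorting, which does not scale to genome-sized input, and the function names `build_suffix_array`, `build_lcp` and `repeated_kmers` are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; acceptable for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    # Kasai's algorithm: lcp[r] is the length of the longest common prefix
    # of the suffixes starting at sa[r - 1] and sa[r]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_kmers(text, k):
    # Suffixes that share a k-mer prefix are adjacent in the suffix array,
    # so a run of LCP values >= k marks all occurrences of one repeated k-mer.
    sa = build_suffix_array(text)
    lcp = build_lcp(text, sa)
    repeats = {}
    for r in range(1, len(sa)):
        if lcp[r] >= k:
            kmer = text[sa[r]:sa[r] + k]
            positions = repeats.setdefault(kmer, [sa[r - 1]])
            positions.append(sa[r])
    return {kmer: sorted(pos) for kmer, pos in repeats.items()}


if __name__ == "__main__":
    sequence = "ACGTACGTTACG"  # toy sequence, not real genomic data
    for k in (3, 4, 5):
        print(k, repeated_kmers(sequence, k))
```

On this toy sequence, the set of reported repeats shrinks as k grows and is empty at k = 5, which illustrates why the outcome of a k-mer based analysis depends on the chosen k.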