Principal Component Analysis in Targeted Approach to Coronavirus Genus Recognition
Chaley M.B.¹, Kutyrkin V.A.²
¹Institute of Mathematical Problems of Biology RAS, Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences, Pushchino, Russia
²Moscow State Technical University n.a. N.E. Bauman, Moscow, Russia
Abstract. An original approach to coronavirus classification is proposed. The gene under analysis (the N-gene, which encodes the nucleocapsid protein) is represented by a vector of amino acid codon frequencies, and this vector is compared with the vector of averaged codon frequencies for the known N-genes of a viral taxon (one of the four coronavirus genera). Principal component analysis is used in a non-standard way to determine whether the analyzed frequency vector belongs to one of the taxa under consideration. The method was tested on 5769 N-genes from the four coronavirus genera and recognized the genus with a reliability above 95%. By considering only the most significant amino acid codon frequencies in the N-gene, the proposed approach allows the dimension of the codon frequency vector to be reduced to 28 components without loss of reliability. The approach belongs to the alignment-free methods, which have become increasingly popular for virus classification over the last decade.
Key words: N-gene, coronaviruses, classification, alignment-free methods, principal component analysis
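The vector representation described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation. The function names (codon_frequencies, mean_vector, nearest_taxon), the use of all 64 codons in a fixed order, and the toy sequences are assumptions introduced here for illustration. A plain Euclidean nearest-mean rule stands in for the non-standard use of principal component analysis, whose details the abstract does not give.

```python
from itertools import product
from math import dist

# Assumption: all 64 codons in a fixed alphabetical order define the vector layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequencies(seq):
    """Return a 64-component codon frequency vector for an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = [0] * len(CODONS)
    for i in range(0, len(seq) - 2, 3):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:  # codons with ambiguous bases are skipped
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(CODONS)


def mean_vector(seqs):
    """Average the codon frequency vectors of the known genes of one taxon."""
    vectors = [codon_frequencies(s) for s in seqs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_taxon(seq, centroids):
    """Stand-in decision rule: pick the taxon whose mean vector is closest (Euclidean)."""
    v = codon_frequencies(seq)
    return min(centroids, key=lambda name: dist(v, centroids[name]))


# Toy data only: these are not real N-gene sequences or real genus labels.
centroids = {
    "toy_A": mean_vector(["AAAAAA", "AAAAAG"]),
    "toy_B": mean_vector(["GGGGGG"]),
}
print(nearest_taxon("AAAAAAAAA", centroids))
```

In this sketch each taxon is summarized by a single averaged vector, mirroring the comparison with averaged codon frequencies mentioned in the abstract. Reducing the vector to the most significant components would amount to keeping only a chosen subset of indices before computing the distance.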