Detection of Mammalian Coding Sequences Using a Hybrid Approach of Chaos Game Representation and Machine Learning
Detection of mammalian protein-coding sequences has a wide range of applications in biodiversity research, evolutionary studies, and the understanding of genomic features. Representing genomic sequences with Chaos Game Representation (CGR) helps reveal hidden features in DNA sequences, because CGR represents a sequence both numerically and graphically. Machine learning approaches can automatically detect hidden patterns in CGR images and accurately classify them as protein-coding or non-coding. Here, we propose a pipeline that automatically detects coding (exon) and non-coding (intron) sequences in mammalian genomes. We collected coding and non-coding sequences from 20 mammalian-specific genes (PRM3, CSN1S1, LCE6A, IL2, MUC7, NNAT, IGIP, SMCP, DCD, and MYEOV), and CGR images were generated from these genes. Five supervised machine learning classifiers (Naive Bayes, Logistic Regression, K-Nearest Neighbor, Perceptron, and support vector machine (SVM)) were evaluated, using features extracted from the CGR images as input. In summary, an ensemble model combining Logistic Regression, Perceptron, and SVM achieved the highest accuracy, with performance reported as mean and standard deviation. Based on these findings, we recommend an ensemble of Logistic Regression, Perceptron, and SVM for classifying coding and non-coding sequences in future mammalian CGR-based genomic studies. © 2020 IEEE.
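
As an illustration of the CGR step described above, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used corner layout (A, C, G, T at the corners of the unit square) and a hypothetical grid resolution parameter `k`; the function names `cgr_points` and `cgr_features` are introduced here for illustration only. Each nucleotide moves the current point halfway toward its corner, and the resulting points are binned into a grid whose normalised counts serve as a feature vector.

```python
# Illustrative sketch only: the corner layout and grid size are assumptions,
# not details taken from the paper.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_features(seq, k=3):
    """Bin CGR points into a 2^k x 2^k grid and return normalised counts."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    points = cgr_points(seq)
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    total = len(points) or 1  # avoid division by zero for empty input
    return [count / total for row in grid for count in row]
```

Feature vectors of this kind could then be passed to classifiers such as those named in the abstract. The grid resolution trades off detail against sparsity for short sequences.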