Improvement of SSR Redundancy Identification by Machine Learning Approach Using Dataset from Cotton Marker Database

Pengfei Xuan 1,2, Feng Luo 2, Albert Abbott 1, Don Jones 3, and Anna Blenda 1

1 Department of Genetics and Biochemistry, Clemson University, Biosystems Research Complex, 51 New Cherry Street, Clemson, SC, 29634, USA
2 School of Computing, Clemson University, 100 McAdams, Clemson, SC, 29634, USA
3 Cotton Incorporated, 6399 Weston Parkway, Cary, NC, 27513, USA

INTRODUCTION

Microsatellites, or simple sequence repeats (SSRs), are molecular markers with wide-ranging applications in cotton molecular breeding. The Cotton Marker Database (CMD) provides centralized access to publicly available cotton molecular data. In collaboration with the contributing researchers, we have summarized and provided high-quality data for 11,938 SSRs displayed through CMD. However, SSR redundancy is a common and inevitable issue when projects come from different research groups. Detecting SSR redundancy by aligning SSR-containing sequences gives a high number of false positives even under stringent parameters, because similarity is judged from sequence comparison alone. To improve the accuracy of redundant SSR detection and reduce the cost of expert intervention in polymorphism discovery, we proposed a machine learning approach based on the Support Vector Machine (SVM) algorithm [1, 2].

MATERIALS AND METHODS

The CMD SSR dataset (847 markers) was used as the training, testing, and prediction sets for the SVM algorithm (Figure 1). We chose 4 important SSR features:

1. Percent match of primer sequences. The SSR primer sequence is an important reference factor in genetic research; it is used to isolate targeted sections of DNA for amplification in PCR. The primer sequence alignment can be calculated with the CD-HIT program.
2. Primer match type. Type 1: forward to forward match, reverse to reverse match. Type 2: forward to reverse match, or reverse to reverse match.
3. Motif similarity. SSR motif similarity is another important factor reflecting the degree of SSR redundancy.
4. Percent match of SSR-containing sequences. A BLAST search compares a query sequence with a library or database of sequences and identifies library sequences that resemble the query sequence above a certain threshold.

SSR genetic map position. Based on this feature, the training data were manually selected and the final results were evaluated.

Figure 1. The machine learning workflow.
[Workflow diagram. Boxes: SSR Training Data (4 features), 99 (Positive) & 106 (Negative), Expert Decision; SVM Program (LIBSVM): Parameter Scaling, Cross-validation, Grid-search, Kernel Functions, Best Parameters; Classification Model; SSR Testing Data, 100 (Positive) & 119 (Negative), Performance Verification; SSR Prediction Data, 648 (Positive), Model Prediction.]

RESULTS

SVMs with different kernel functions were applied to develop a method for accurate detection of SSR redundancy. The best results were obtained with the sigmoid kernel, whose sensitivity and F-score values were higher than those of the other kernel functions tested (Table 1). These results indicate that the SVM-based method identifies true SSR redundancy with high accuracy.

Table 1. Evaluation of results obtained for the tested data (SSR training and testing data).

Kernel function | TP* | FP* | TN* | FN* | Sensitivity % | Specificity % | Precision % | Accuracy % | F-score %
linear | | | | | | | | | 98.00
polynomial | | | | | | | | | ---
radial basis | | | | | | | | | 97.31
sigmoid | | | | | | | | | 98.02

* TP: true positive; FP: false positive; TN: true negative; FN: false negative.

EXAMPLES of SSR Prediction Data

Example A. Similarity of 2 SSRs based on initial sequence alignment, but disagreeing with the SVM results. The genetic map positions of the 2 SSRs do not match, which indicates that the SVM prediction is correct.

Redundancy pair | Primer match | Match type | Motif similarity | SSR Blast
NAU864 - MUSS298 | 96% | Forward – Forward | 100% | 812

Marker name | Ch/LG | Position (cM) | Cross | Map
NAU864 | | | RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) | 2006
MUSS298 | | | RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) | 2006

Example B. SSR similarity based on initial sequence alignment and confirmed by SVM.

Redundancy pair | Primer match | Match type | Motif similarity | SSR Blast
BNL3031 - BNL1672 | | Reverse – Reverse | 100% | 670

Marker name | Ch/LG | Position (cM) | Cross | Map
BNL3031 | 9/LG | | BC1-RIL: ("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)); BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]; BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]; BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2"); BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") | 2004
BNL1672 | 9/LG | | BC1-RIL: ("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)); BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]; BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]; BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2"); BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") | 2004

DISCUSSION

Our experiment showed that this machine learning approach, based on the 4 selected features, gives high sensitivity and specificity. After the first step of SSR redundancy detection based on SSR-containing sequence alignment, it can be used either to identify questionable similarity results (Example A) or to confirm the initial SSR similarity (Example B). The SVM algorithm can subsequently be used to filter directly the data generated by the BLAST alignment program.

ACKNOWLEDGMENT

We thank Cotton Incorporated for funding the CMD project and related research.

REFERENCES

1. R.-E. Fan, P.-H. Chen, C.-J. Lin. Working set selection using the second order information for training SVM. Journal of Machine Learning Research.
2. Lakshmi K, John J. Application of machine learning in SNP discovery. BMC Bioinformatics.
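
The evaluation measures named in Table 1 (sensitivity, specificity, precision, accuracy, F-score) can all be derived from the four counts TP, FP, TN, and FN. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes the F-score is the usual harmonic mean of precision and sensitivity, which the poster does not state explicitly. The function name `evaluation_measures` and the counts in the usage lines are hypothetical; the counts are chosen only so that they sum to the 100 positive and 119 negative testing pairs, and they are not the reported results.

```python
from typing import Dict


def evaluation_measures(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Return the Table 1 measures, in percent, from the four confusion counts."""
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")

    def pct(numerator: int, denominator: int) -> float:
        # Report 0.0 when a measure is undefined (empty denominator).
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "sensitivity": pct(tp, tp + fn),              # share of truly redundant pairs found
        "specificity": pct(tn, tn + fp),              # share of non-redundant pairs rejected
        "precision": pct(tp, tp + fp),                # share of predicted redundancies that are real
        "accuracy": pct(tp + tn, tp + fp + tn + fn),  # share of all pairs classified correctly
        # Assumed definition: harmonic mean of precision and sensitivity,
        # which simplifies to 2*TP / (2*TP + FP + FN).
        "f_score": pct(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for illustration only (NOT the poster's results):
# 100 positive pairs (tp + fn) and 119 negative pairs (tn + fp).
example = evaluation_measures(tp=90, fp=9, tn=110, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}%")
```

Returning all five measures from one function keeps them consistent with a single set of counts, which makes it straightforward to compare kernel functions row by row in the manner of Table 1.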