Optimization of SVM Parameters for Promoter Recognition in DNA Sequences

Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų, Kaunas, Lithuania
Continuous Optimization and Knowledge-Based Technologies – EurOPT’

Genetic (DNA) sequence data

- Data: genetic (DNA) sequences
- Meaning: genetic information stored in the DNA molecule, in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information
  - protein-coding genes
  - regulatory sequences
  - mRNA sequences responsible for protein structure
  - directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is “junk DNA” (its biological function is not fully understood)
- Aim: identify structural parts of DNA: introns, exons, promoters, splice sites, etc.
What are promoters?

- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites or “boxes” – short DNA subsequences, which are (usually) conserved

[Figure: gene structure – promoter upstream of the gene, followed by alternating exons and introns between the transcription start and stop sites]
Promoter recognition problem

- Multitude of promoter “boxes” (nucleotide patterns): TATA, Pribnow, Gilbert, DPE, E-box, Y-box, …
- “Boxes” within a species are conserved, but there are many exceptions to this rule
- Matching modes:
  (a) Exact pattern = TACACC
      CAATGCAGGA TACACC GATCGGTA
  (b) Pattern with mismatches = TACACC + 1 mismatch
      CAATGCAGGA TTCACC GATCGGTA
  (c) Degenerate pattern = TASDCC (S = {C,G}, D = {A,G,T})
      CAATGCAGGA TAGTCC GATCGGTA
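The three matching modes above can be sketched in a few lines of code. This is a minimal illustration, not code from the talk; the function names and the degenerate-code table are assumptions made for the example.

```python
# Degenerate (IUPAC-style) codes used in the example pattern TASDCC
DEGENERATE = {"A": "A", "C": "C", "G": "G", "T": "T",
              "S": "CG", "D": "AGT"}

def matches_exact(window, pattern):
    return window == pattern

def matches_with_mismatches(window, pattern, k=1):
    # Allow up to k single-nucleotide mismatches
    return sum(a != b for a, b in zip(window, pattern)) <= k

def matches_degenerate(window, pattern):
    # Each pattern letter may stand for a set of nucleotides
    return all(w in DEGENERATE[p] for w, p in zip(window, pattern))

def find_box(sequence, pattern, match=matches_exact):
    # Return all start positions where the pattern matches
    n = len(pattern)
    return [i for i in range(len(sequence) - n + 1)
            if match(sequence[i:i + n], pattern)]

print(find_box("CAATGCAGGATACACCGATCGGTA", "TACACC"))  # exact match at 10
print(find_box("CAATGCAGGATTCACCGATCGGTA", "TACACC", matches_with_mismatches))
print(find_box("CAATGCAGGATAGTCCGATCGGTA", "TASDCC", matches_degenerate))
```

Each call finds the “box” at position 10 of the corresponding example sequence from the slide.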
Support Vector Machine (SVM)

The SVM decision function is

  f(x) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )

where xᵢ are training data vectors, x are unknown data vectors, yᵢ ∈ {−1, +1} is the target space, and K(xᵢ, x) is the kernel function.
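The decision function can be made concrete with a toy example. This is an illustrative sketch with hand-picked support vectors and multipliers, not a trained model; `svm_decision` is a name invented for the example.

```python
# Toy illustration of the SVM decision function
#   f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def svm_decision(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1

# Two support vectors separating points by the sign of the first coordinate
svs    = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]

print(svm_decision((2.0, 3.0), svs, alphas, labels, b=0.0))   # → 1
print(svm_decision((-2.0, 1.0), svs, alphas, labels, b=0.0))  # → -1
```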
Quality of classification

- Training data: size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space: orthogonal, single nucleotide, nucleotide grouping, …
- Selection of an optimal kernel function: linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters: regularization parameter, cost factor
- Selection of SVM parameter values is an optimization problem
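The “orthogonal” feature mapping named above (and used later in the experiments) encodes each nucleotide as a one-hot 4-vector, so a sequence of length n becomes a 4n-dimensional feature vector. A minimal sketch:

```python
# Orthogonal (one-hot) encoding of a DNA sequence
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def orthogonal_encode(sequence):
    features = []
    for nucleotide in sequence:
        features.extend(ONE_HOT[nucleotide])
    return features

print(orthogonal_encode("ACG"))
# → [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

A 300 bp sequence, as in the datasets below, thus maps to a 1200-dimensional vector.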
SVM optimization strategies

- Kernel optimization
  - Adding extra parameters
  - Designing new kernels
- Parameter optimization
  - Learning parameters only
  - Kernel parameters only
  - Learning & kernel parameters
- Optimization decisions
  - Optimization method
  - Objective function
SVM (hyper)parameters

- Kernel parameters
- Learning parameters
SVM parameter optimization methods

Method        | Advantages                                                 | Disadvantages
--------------|------------------------------------------------------------|--------------
Random search | Simplicity.                                                | Depends on the selection of random points and their distribution. Very slow as the size of the parameter space increases.
Grid search   | Simplicity. A starting point is not required.              | Box constraints for the grid are necessary. No optimality criteria for the solution. Computationally expensive for a large number of parameters. Solution depends on the coarseness of the grid.
Nelder–Mead   | Few function evaluations. Good convergence and stability.  | Can fail if the initial simplex is too small. No proof of convergence.
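Grid search, the simplest method in the table, can be sketched as follows. The objective here is a dummy stand-in for the real one (e.g. cross-validated classification accuracy from an SVM run); the parameter names and grids are illustrative assumptions.

```python
# Minimal grid search over two SVM hyperparameters (C and a kernel width)
import itertools

def evaluate(c, gamma):
    # Dummy objective with a known optimum at (c, gamma) = (10, 0.1);
    # in practice this would train and test the SVM.
    return -((c - 10) ** 2 + 100 * (gamma - 0.1) ** 2)

c_grid     = [0.1, 1, 10, 100]      # box constraints are required
gamma_grid = [0.001, 0.01, 0.1, 1]

best = max(itertools.product(c_grid, gamma_grid),
           key=lambda p: evaluate(*p))
print(best)  # → (10, 0.1)
```

The cost grows with the product of the grid sizes and the result is only as fine as the grid — exactly the disadvantages listed in the table.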
Dataset

Drosophila sequence datasets:
- Promoter dataset: 1842 sequences, each 300 bp long, from −250 bp to +50 bp with respect to the gene transcription start site
- Intron dataset: 1799 sequences, each 300 bp long
- Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long

Datasets for the SVM classifier:
- Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
- Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)

Datasets are imbalanced:
- 29.5% promoters vs. 70.5% non-promoters in the training dataset
- 28.3% promoters vs. 71.7% non-promoters in the test dataset
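The quoted imbalance percentages follow directly from the example counts above:

```python
# Class-balance figures from the dataset counts on the slide
train_promoters, train_total = 372, 1260
test_promoters,  test_total  = 1842, 6500

print(round(100 * train_promoters / train_total, 1))  # → 29.5
print(round(100 * test_promoters / test_total, 1))    # → 28.3
```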
Classification requisites

- Feature mapping: orthogonal
- Kernel function: power series kernel
- Metrics: specificity (SPC), sensitivity (TPR)
- SVM classifier: SVMlight
- SVM parameter optimization method: modified Nelder–Mead (downhill simplex)
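The slide names a power series kernel but does not give its formula. One common reading, used here purely for illustration, is a weighted sum of powers of the dot product, K(u, v) = Σₖ cₖ (u · v)ᵏ, with the coefficients cₖ as the tunable kernel parameters (so “Power series (2)” in the results table would have two such terms):

```python
# Illustrative power-series kernel: K(u, v) = sum_k c_k * (u . v)^k
# The exact form used in the talk is an assumption here.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def power_series_kernel(u, v, coeffs):
    d = dot(u, v)
    # coeffs[k-1] is the weight of (u . v)^k
    return sum(c * d ** k for k, c in enumerate(coeffs, start=1))

u, v = (1.0, 2.0), (3.0, 1.0)                 # dot product = 5
print(power_series_kernel(u, v, [1.0, 0.5]))  # → 1.0*5 + 0.5*25 = 17.5
```

More coefficients mean more kernel parameters for the optimizer to tune, which is the property the conclusions highlight.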
Modification of Nelder–Mead

Optimization time problem:
- Each call to the SVM training and testing function is very time-costly for large datasets
- The method requires many evaluations of the objective function

Modifications:
- Function value caching
- Normalization after the reflection step
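The function value caching idea can be sketched as a memoizing wrapper: Nelder–Mead may revisit simplex vertices, and each objective evaluation means an expensive SVM train/test run, so repeated points are served from a cache. `expensive_objective` below is a stand-in for the real SVM call.

```python
# Sketch of function value caching for an expensive objective
calls = 0

def expensive_objective(params):
    global calls
    calls += 1                     # counts real (uncached) evaluations
    return sum(p * p for p in params)

cache = {}

def cached_objective(params):
    key = tuple(round(p, 12) for p in params)  # tolerate float noise
    if key not in cache:
        cache[key] = expensive_objective(params)
    return cache[key]

cached_objective((1.0, 2.0))
cached_objective((1.0, 2.0))       # second call served from the cache
print(calls)  # → 1
```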
Classification results

Kernel           | No. of optimized parameters | Type of optimized parameters | Specificity (SPC) | Sensitivity (TPR)
-----------------|-----------------------------|------------------------------|-------------------|------------------
Linear           | –                           | none                         | 84.83%            | 58.25%
Linear           | 3                           | learning                     | 91.23%            | 81.38%
Polynomial       | –                           | none                         | 81.81%            | 44.90%
Polynomial       | 6                           | learning + kernel            | 87.64%            | 67.48%
Power series (2) | 3                           | kernel                       | 94.85%            | 89.69%
Power series (3) | 4                           | kernel                       | 94.92%            | 89.95%
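The two metrics reported in the table are standard confusion-matrix quantities; the counts below are made up for illustration only:

```python
# Specificity (SPC) and sensitivity (TPR) from confusion-matrix counts

def sensitivity(tp, fn):
    return tp / (tp + fn)     # TPR: fraction of promoters correctly found

def specificity(tn, fp):
    return tn / (tn + fp)     # SPC: fraction of non-promoters correctly rejected

tp, fn = 90, 10   # promoters: correctly found / missed
tn, fp = 95, 5    # non-promoters: correctly rejected / false alarms

print(sensitivity(tp, fn))    # → 0.9
print(specificity(tn, fp))    # → 0.95
```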
ROC plot

[Figure: ROC plot of the evaluated classifiers]
Conclusions

- The SVM classifier alone cannot achieve satisfactory classification results for a complex imbalanced dataset
- SVM parameter optimization can improve classification results significantly
- The best results are achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters
Ongoing work and future research

- Application of SVM parameter optimization to the splice site recognition problem [presented at CISIS’2008]
- Selection of rules for optimal DNA sequence mapping to the feature space [accepted to WCSB’2008]
- Analysis of the relationships between data characteristics and classifier behavior [accepted to IS’2008]
- Automatic derivation of formal grammar rules [accepted to KES’2008]
- Structural analysis of sequences using SVM with grammar inference [accepted to ITA’2008]
Thank You. Any questions?