1
SVM: Non-coding Neutral Sequences Vs Regulatory Modules Ying Zhang, BMB, Penn State Ritendra Datta, CSE, Penn State Bioinformatics – I Fall 2005
2
Outline Background: Machine Learning & Bioinformatics Data Collection and Encoding Distinguish sequences using SVM Results Discussion
3
Regulation: A Recurring Challenge Expression of genes is under regulation: right protein, right time, right amount, right location… Regulation: cis-element vs trans-element. Cis-element: non-coding functional sequence. Trans-element: proteins that interact with cis-elements. Predicting cis-regulatory elements remains a challenge: significant effort has been put in over the years. Current trends: TFBS clusters, pattern analysis.
4
Alignments and Sequences: The Data Information: Sequence. Genetic information is encoded in the DNA sequence. Typical information: codon, binding site, … Codon: ATG (Met), CGT (Arg), … Binding sites: A/TGATAA/G (Gata1), … Evolutionary Information: Aligned Sequence. Similarity between species; Conservation ~ Function. Human: TCCTTATCAGCCATTACC Mouse: TCCTTATCAGCCACCACC
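As a small illustration of the two kinds of information on this slide, the sketch below scans both strands of the human and mouse fragments for the Gata1 pattern and computes the fraction of identical alignment columns. Reading A/TGATAA/G as the regular expression `[AT]GATA[AG]` is an assumption, and the helper names are hypothetical; this is not the authors' code.

```python
import re

HUMAN = "TCCTTATCAGCCATTACC"
MOUSE = "TCCTTATCAGCCACCACC"

# A/TGATAA/G written as a regular expression (assumed reading of the slide notation).
GATA1 = re.compile(r"[AT]GATA[AG]")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq):
    """List (strand, start, match) for motif hits on both strands.

    Start positions on the '-' strand are relative to the reverse complement.
    """
    hits = [("+", m.start(), m.group()) for m in GATA1.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [("-", m.start(), m.group()) for m in GATA1.finditer(rc)]
    return hits


def identity(a, b):
    """Fraction of identical columns in an ungapped pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


human_sites = find_sites(HUMAN)
mouse_sites = find_sites(MOUSE)
conservation = identity(HUMAN, MOUSE)
```

On these two fragments the motif appears on the reverse strand of both species (the forward-strand TTATCA), and 16 of the 18 columns are identical.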
5
Problem Given the genome sequence information, is it possible to automatically distinguish Regulatory Regions from other genomic non-coding Neutral sequences using machine learning ?
6
Machine Learning: The Tool Sub-field of A.I. Computer programs “learn” from experience, i.e. by analyzing data and corresponding behavior. Confluence of Statistics, Mathematical Logic, Numerical Optimization. Applied in Information Retrieval, Financial Analysis, Computer Vision, Speech Recognition, Robotics, Bioinformatics, etc. [Figure: M.L. at the intersection of Statistics, Optimization, and Logic; example applications: predicting genes, analyzing stocks, personalized WWW search]
7
Machine Learning: Types of Learning Supervised Learning Learning statistical models from past sample-label data pairs, e.g. Classification Unsupervised Learning Building models to capture the inherent organization in data, e.g. Clustering Reinforcement Learning Building models from interactive feedback on how well the current model is doing, e.g. Robotic learning
8
Machine Learning and Bioinformatics: The Confluence Learning problems in Bioinformatics [ICML ’03] Protein folding and protein structure prediction Inference of genetic and molecular networks Gene-protein interactions Data mining from micro arrays Functional and comparative genomics, etc.
9
Machine Learning and Bioinformatics: Sample Publications Identification of DNaseI Hypersensitive Sites in the human genome (may disclose the location of cis-regulatory sequences): W.S. Noble et al., “Predicting the in vivo signature of human gene regulatory sequences,” Bioinformatics, 2005. Functionally classifying genes based on gene expression data from DNA microarray hybridization experiments using SVMs: M. P. S. Brown et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” PNAS, 2000. Using log-odds ratios from Markov models for identifying regulatory regions in DNA sequences: L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003. Selection of informative genes using an SVM-based feature selection algorithm: I. Guyon et al., “Gene selection for cancer classification using support vector machines,” Machine Learning, 2002.
10
Machine Learning and Bioinformatics: Books
11
Support Vector Machines: A Powerful Statistical Learning Technique Which of the linear separators is optimal?
12
Support Vector Machines: A Powerful Statistical Learning Technique Choose the one that maximizes the margin between the classes [Figure: separating hyperplane with its margin; points labeled with slack variables ξ_i]
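The margin-maximization idea, together with the slack variables ξ_i that appear on the slide, is usually written as the optimization problem below. This is the standard soft-margin formulation from the textbook literature, not something taken from the slides; the symbols w, b, C, x_i, y_i and n are the usual notation and are assumptions here.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n .
\end{aligned}
```

The margin width is 2/||w||, so minimizing ||w|| widens the margin, while C sets how heavily margin violations (ξ_i > 0) are penalized.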
13
Support Vector Machines: A Powerful Statistical Learning Technique The classes in these datasets separate linearly with ease. What about these datasets? [Figures: one-dimensional datasets plotted along the x axis]
14
Support Vector Machines: A Powerful Statistical Learning Technique Solution: Kernel Trick! [Figure: the one-dimensional data replotted with axes x and x^2]
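A minimal sketch of the idea, assuming the figure shows one-dimensional points lifted to (x, x^2): the toy data below cannot be split by a threshold on x, but can be split by a threshold on x^2. The data, the feature map `phi`, and the helper names are all invented for illustration.

```python
# Toy 1-D data: class +1 sits on both sides of class -1, so no single
# threshold on x separates the classes.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]


def phi(x):
    """Explicit feature map x -> (x, x^2)."""
    return (x, x * x)


def kernel(x, y):
    """Equals the dot product phi(x) . phi(y), computed without calling phi."""
    return x * y + (x * y) ** 2


def separable_by_threshold(values, classes):
    """True if one threshold puts all +1 values on one side and all -1 on the other."""
    pos = [v for v, c in zip(values, classes) if c > 0]
    neg = [v for v, c in zip(values, classes) if c < 0]
    return max(neg) < min(pos) or max(pos) < min(neg)


in_input_space = separable_by_threshold(points, labels)
in_feature_space = separable_by_threshold([phi(x)[1] for x in points], labels)
```

The `kernel` function shows the "trick" itself: a learner that only needs dot products can work in the lifted space without ever constructing `phi(x)`.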
15
Experiments: Overview Classification in Question: Regulatory regions (REG) vs Ancestral Repeats (AR). Two types of experiments: Nucleotide sequences – ATCG; Alignments (reduced 5-symbol) – SWVIG (S: match involving G & C, W: match involving A & T, G: gap, V: transversion, I: transition). Two datasets: Elnitski et al. dataset; Dataset from PennState CCGB. Mapping Sequences/Alignments → Real Numbers: Frequencies of short-length K-mers (K = 1, 2, 3). Normalizing factor – sequence length (ambiguous for K > 1). Stability of variance – equal-length sequences (whenever possible).
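The mapping described above can be sketched as follows: each column of a pairwise alignment is reduced to one of the five symbols, and the resulting string is turned into K-mer frequencies. Several details are assumptions rather than the authors' choices: the gap character `-`, input restricted to A, C, G, T and gaps, and normalization by the number of K-mer windows (the slide notes that the normalizing factor is ambiguous for K > 1).

```python
PURINES = frozenset("AG")
GAP = "-"  # assumed gap character


def encode_column(a, b):
    """Map one pairwise alignment column to the reduced alphabet S, W, V, I, G."""
    if a == GAP or b == GAP:
        return "G"                          # gap in either species
    if a == b:
        return "S" if a in "GC" else "W"    # match on G/C vs match on A/T
    # Mismatch: transition if both purines or both pyrimidines, else transversion.
    return "I" if (a in PURINES) == (b in PURINES) else "V"


def encode_alignment(seq_a, seq_b):
    """Encode two aligned, equal-length sequences as a 5-symbol string."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(encode_column(a, b) for a, b in zip(seq_a.upper(), seq_b.upper()))


def kmer_frequencies(s, k):
    """Relative frequency of each k-mer observed in s, over len(s) - k + 1 windows."""
    windows = len(s) - k + 1
    if windows < 1:
        raise ValueError("sequence shorter than k")
    counts = {}
    for i in range(windows):
        kmer = s[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: c / windows for kmer, c in counts.items()}


# Human/mouse fragment from the earlier "Alignments and Sequences" slide.
symbols = encode_alignment("TCCTTATCAGCCATTACC", "TCCTTATCAGCCACCACC")
features = {k: kmer_frequencies(symbols, k) for k in (1, 2, 3)}
```

The same `kmer_frequencies` function applies unchanged to raw nucleotide strings, which is how the sequence-only experiments would be fed.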
16
Experiments: Feature Selection Total number of features: Sequences: 4 + 4^2 + 4^3 = 84; Alignments: 5 + 5^2 + 5^3 = 155. Relatively high dimensionality: Curse of dimensionality: convergence of estimators is very slow. Over-fitting: poor generalization performance. Solutions: Dimension Reduction – e.g., PCA; Feature Selection – e.g., Forward Selection, Backward Elimination.
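Greedy forward selection can be sketched in a few lines. The version below is a generic illustration, not the procedure used in the experiments: `score` is a caller-supplied function (for example, cross-validated accuracy of a classifier trained on the given feature subset), and `toy_score` is an invented stand-in so that the example runs on its own.

```python
def forward_selection(n_features, score, max_features):
    """Greedy forward selection.

    score(subset) returns a number to maximize for a list of feature indices.
    Features are added one at a time until no candidate improves the score.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Invented scorer: features 2 and 5 help, every feature costs a small penalty."""
    useful = {2: 0.20, 5: 0.10}
    return 0.5 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


chosen, chosen_score = forward_selection(n_features=84, score=toy_score, max_features=10)
```

Backward elimination mirrors this loop: start from the full set and repeatedly drop the feature whose removal hurts the score least.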
17
Experiments: Training and Validation Training Set: Elnitski et al. dataset. Sequences: 300 samples of 100 bp from each class (REG and AR). Alignments: 300 samples of length 100 from each class. SVM setup: RBF Kernel: k(x_1, x_2) = exp(-δ ||x_1 - x_2||^2). Implementation: LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Validation: N-fold cross-validation, used in feature selection, parameter tuning, and testing.
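The two ingredients of this setup can be sketched in plain Python. The kernel below assumes the usual Gaussian form with a negative exponent and squared Euclidean distance, which is the form LibSVM implements; the fold count of 10 and the function names are assumptions, since the slide only says "N-fold".

```python
import math
import random


def rbf_kernel(x1, x2, delta):
    """k(x1, x2) = exp(-delta * ||x1 - x2||^2) for equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-delta * sq_dist)


def n_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_test_splits(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = n_fold_indices(n_samples, n_folds, seed)
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


# 600 samples (300 REG + 300 AR); 10 folds is an assumed value of N.
splits = list(train_test_splits(600, 10))
```

Each sample lands in exactly one test fold, so averaging the per-fold accuracies uses every sample for testing once and for training N - 1 times.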
18
Results: The Elnitski et al. dataset Parameter selection SVM Parameters: δ and C Feature Selection Assessing Feature Importance G-C Normalization Sequences: 10 out of 84 Symbols: 10 out of 155 Accuracy scores Overall Ancestral Repeats (AR) Regulatory Regions (Reg)
19
Results: SVM Parameter Selection Iterative selection procedure Coarse selection – Initial neighborhood Fine-grained selection - Brute force Validation Set from data Within-loop CV Chosen Parameters: δ = 1.6 C = 1.5
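A coarse-then-fine search of this kind might look like the sketch below. The powers-of-two coarse grid, the width of the fine grid, and all function names are assumptions; `toy_cv_accuracy` is an invented surrogate, peaked near the reported parameters only so the example has a known answer, and stands in for training and cross-validating an SVM at each grid point.

```python
import math


def grid_search(score, deltas, costs):
    """Return (best_score, delta, C) over the Cartesian grid."""
    return max((score(d, c), d, c) for d in deltas for c in costs)


def coarse_to_fine(score, steps=10):
    """Coarse pass on powers of two, then a brute-force pass around the winner."""
    coarse = [2.0 ** e for e in range(-5, 6)]
    _, d0, c0 = grid_search(score, coarse, coarse)
    # Fine pass: linear grid spanning the coarse neighbours of the coarse winner.
    fine_d = [d0 / 2 + i * (2 * d0 - d0 / 2) / steps for i in range(steps + 1)]
    fine_c = [c0 / 2 + i * (2 * c0 - c0 / 2) / steps for i in range(steps + 1)]
    return grid_search(score, fine_d, fine_c)


def toy_cv_accuracy(delta, c):
    """Invented surrogate for cross-validated accuracy, peaked at (1.6, 1.5)."""
    return 0.86 - 0.05 * (math.log(delta / 1.6) ** 2 + math.log(c / 1.5) ** 2)


best_acc, best_delta, best_c = coarse_to_fine(toy_cv_accuracy)
```

The fine grid only recovers the optimum up to its step size, which is why a second, denser pass (or more `steps`) is sometimes added.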
20
Results: Feature Selection - Sequence Distribution of Nucleotide frequencies of the top 9 most significant k-mers Chosen by One-dimensional SVMs
21
Results: Feature Selection - Symbol Distribution of 5-symbol frequencies of the top 9 most significant k-mers Chosen by One-dimensional SVMs
22
Results: Feature Selection Procedure: Greedy Forward Selection + Backward Elimination. Chosen Features: Sequence: [5 68 3 20 63 4 16 10 1 22] (0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, etc.) [AT, CAA, C, AAA, GGC, AA, CA, TG, T, AAG] Symbol: [3 5 4 18 24 124 17 143 19 95 103] (0 = G, 1 = V, 2 = W, 3 = S, 4 = I, 5 = GG, 6 = GV, etc.) [S, GG, I, WS, SI, SIG, WW, IWI, WI, WSV, WII]
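The index convention for the sequence features (0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, ...) can be reproduced by listing all 1-mers, then all 2-mers, then all 3-mers in alphabet order. The sketch below does this and decodes the chosen sequence indices; the function and variable names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ATGC"  # order given on the slide: 0 = A, 1 = T, 2 = G, 3 = C


def kmer_table(alphabet, max_k=3):
    """All k-mers for k = 1..max_k, shorter first, each block in alphabet order."""
    return ["".join(p) for k in range(1, max_k + 1) for p in product(alphabet, repeat=k)]


SEQUENCE_FEATURES = kmer_table(NUCLEOTIDES)

chosen_indices = [5, 68, 3, 20, 63, 4, 16, 10, 1, 22]
chosen_kmers = [SEQUENCE_FEATURES[i] for i in chosen_indices]
```

The table has 4 + 16 + 64 = 84 entries, and the decoded list matches the k-mers shown for the sequence experiment.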
23
Results: Accuracy Scores
Experiment | Type | Overall Accuracy | Reg Precision | AR Precision
Elnitski et al. | 5-symbol | ≈ 74.7% | 78.49% | 73%
Elnitski et al. | Hexamers | ≈ 75% | 81.4% | 72.5%
Sequences only | 1-mers | 78.33% | 76.54% | 80.54%
Sequences only | 2-mers | 77.67% | 72.84% | 82.97%
Sequences only | 3-mers | 80.17% | 83.67% | 77.21%
Sequences only | Selection | 80.33% | 80.87% | 79.63%
Symbols only | 1-mers, 2-mers, 3-mers, Selection | 84.33% 85.17% 86.00% 79.39% 77.53% 78.83% 80.58% 90.03% 90.96% 92.42% 91.54% (11 of 12 values, in column order; cell assignment unclear)
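For reference, the three reported quantities can be computed from a two-class confusion table as sketched below. This assumes "precision" carries its standard meaning (correct predictions of a class divided by all predictions of that class); the slides do not define it. The counts in the example are hypothetical and are not results from the experiments.

```python
def summarize(reg_as_reg, reg_as_ar, ar_as_ar, ar_as_reg):
    """Overall accuracy and per-class precision from confusion counts.

    reg_as_ar means: true REG samples that were classified as AR, and so on.
    """
    total = reg_as_reg + reg_as_ar + ar_as_ar + ar_as_reg
    return {
        "overall_accuracy": (reg_as_reg + ar_as_ar) / total,
        "reg_precision": reg_as_reg / (reg_as_reg + ar_as_reg),
        "ar_precision": ar_as_ar / (ar_as_ar + reg_as_ar),
    }


# Hypothetical counts for 300 REG and 300 AR samples.
example = summarize(reg_as_reg=240, reg_as_ar=60, ar_as_ar=270, ar_as_reg=30)
```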
24
Results: Laboratory Data Training: SVM models built using Elnitski et al. data; same parameters, same selected features. Data: 9 candidate cis-regulatory regions predicted by RP score; 1 negative control, based on the definition. 5 of the 9 candidates tested positive in current biological testing. Accuracy: Classification result for sequence (1-, 2-, 3-mer): 1 negative control; 4 out of 5 positive elements + 3 out of 4 “negative” elements. Classification result for alignment (1-, 2-, 3-mer): 1 negative control; 9 original candidates.
25
Discussion High validation rate for Ancestral repeat The structure of selected training set is not that diverse Ancestral repeat tends to be AT-rich AR: LINE, SINE etc. SVM performs a little better than RP scores in training set Statistically more powerful RP: Markov model for pattern recognition SVM: Hyper-plane in high-dimensional feature space Feature selection using wrapper method possible
26
Discussion (cont’d) Performance degradation in Lab Data classification: No improvement in SVM classification compared to RP score. Features identified from the Elnitski et al. data may have some bias – other features may be more informative on the Lab data. Sequence classification vs Alignment (Accuracy Table): SVM yields higher overall cross-validation accuracy for aligned symbol sequences compared to nucleotide sequences. Gained accuracy rate: Ancestral Repeat driven. No improvement for aligned symbol sequence: In Lab data classification, sequence classification is better than aligned symbol sequence. No information gained from evolutionary history! Alphabet reduction not optimal. Assumption wrong!
27
Summary Generally, SVM is a powerful tool for classification. Performance better than RP in distinguishing the AR training set from the Reg training set. SVM: answers a “yes or no” question. RP: probabilistic method, can generate a quantitative measurement genome-wide. SVM: results can be extended using probabilistic forms of SVM. SVM can reveal potentially interesting biological features, e.g. the transcription regulation scheme.
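One common probabilistic extension is Platt scaling, which fits a sigmoid to the SVM decision value f(x). It is given here only as an example of what "probabilistic forms of SVM" can mean; the slides do not say which method was intended, and the symbols A and B are the usual notation for the two fitted parameters.

```latex
P(y = 1 \mid x) \;=\; \frac{1}{1 + \exp\bigl(A\,f(x) + B\bigr)}
```

A and B are estimated by maximum likelihood on held-out data, which turns the "yes or no" output into a score that can be ranked or thresholded.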
28
Future Directions: Possible extensions Explore more complex features. Refine models for neutral non-coding genomic segments. Utilize multi-species alignment for the classification. Combine sequence and alignment information to build more robust multi-classifiers – “Committee of Experts”. Pattern recognition for more accurate prediction.
29
Questions and recommendations? Using original alignment features, 20 columns. Other lab data (avoiding the possible bias of RP preselection) for SVM performance testing.
30
References L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003. Machine Learning Group, University of Texas at Austin, “Support Vector Machines,” http://www.cs.utexas.edu/~ml/. N. Cristianini, “Support Vector and Kernel Methods for Pattern Recognition,” http://www.support-vector.net/tutorial.html.
31
Acknowledgement Dr. Webb Miller Dr. Francesca Chiaromonte David King