Predicting E. Coli Promoters Using SVM


Predicting E. Coli Promoters Using SVM DIEP MAI (dmai@wisc.edu) Course: CS/ECE 539 – Fall 2008 Instructor: Prof. Yu Hen Hu

Purpose
Build and train an SVM system to predict E. Coli promoters from given gene sequences.
Example: given the gene sequence
aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc
is it an E. Coli promoter?
For more theoretical information about E. Coli promoters: http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt

Dataset
Data file is obtained from http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt
Dataset information:
- Number of instances: 106
- Attributes: 57 per instance, non-numeric nominal values (A, C, G, or T)
- Classes: 2, Positive (+1) or Negative (-1)
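As a rough illustration of getting the raw data into memory, here is a minimal loader sketch in Python. The slides do not describe the exact layout of data.txt, so the assumed line format ("<label> <57-character sequence>" per instance) and the helper name load_promoter_data are hypothetical and may need adjusting.

```python
# Hypothetical loader sketch: the real layout of data.txt is not given in the
# slides. This assumes one instance per line, "<label> <sequence>", and should
# be adapted to the actual file format.
def load_promoter_data(path):
    labels, sequences = [], []
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:          # skip blank or malformed lines
                continue
            label, seq = parts
            labels.append(+1 if label in ("+", "+1", "1") else -1)
            sequences.append(seq.lower())
    return labels, sequences
```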

Data preprocessing
Randomly partition the dataset into TRAINSET and TESTSET, with ratio = TESTSET / (TRAINSET + TESTSET).
Encode the non-numeric attributes as 4-bit binary codes:
A → 0001₂ = 1₁₀
C → 0010₂ = 2₁₀
G → 0100₂ = 4₁₀
T → 1000₂ = 8₁₀
Scale each feature to [-1, 1] so that large values do not dominate small ones. (A sketch of this step follows below.)
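The following is a minimal sketch of the encoding and scaling step; the helper names are mine, not from the slides, and the sequences are assumed to have already been loaded as strings.

```python
# Sketch of the preprocessing step: map each nucleotide to the 4-bit code from
# the slide (interpreted as an integer), then scale each feature column to [-1, 1].
import numpy as np

NUCLEOTIDE_CODE = {"a": 1, "c": 2, "g": 4, "t": 8}   # A->0001, C->0010, G->0100, T->1000

def encode_sequence(seq):
    """Turn a 57-character gene sequence into a numeric feature vector."""
    return np.array([NUCLEOTIDE_CODE[base] for base in seq.lower()], dtype=float)

def scale_to_unit_range(X):
    """Scale each feature column to [-1, 1] so large values do not dominate."""
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero
    return 2.0 * (X - col_min) / span - 1.0

# Example on a (truncated) sequence from the Purpose slide:
x = encode_sequence("aagcaaagaaatgcttgactc")
print(x[:5])   # [1. 1. 4. 2. 1.]
```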

Approach
The RBF kernel is used → "good" values of the C (cost) and G (gamma) parameters need to be found.
Parameter scanning:
- Set the range of C to [2^-15, 2^5] and of G to [2^-15, 2^2].
- For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates.
- This process is repeated a few times; the pair that "often" produces high accuracy rates is more likely to be selected.
Training/Testing:
- Use the selected parameters and the whole TRAINSET to train the system.
- Use the trained system to predict the TRAINSET (preferred accuracy rate = 100%).
- Use the trained system to predict the TESTSET.
(A sketch of the parameter scan follows below.)
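A minimal sketch of the (C, G) scan using scikit-learn, which is my choice of library rather than one named in the slides. The random placeholder arrays stand in for the encoded, scaled TRAINSET so the snippet runs on its own; replace them with the real data.

```python
# Sketch of the RBF-kernel parameter scan with leave-one-out cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Placeholder data (85 training instances, 57 features scaled to [-1, 1]);
# replace with the real encoded TRAINSET.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(85, 57))
y_train = rng.choice([-1, 1], size=85)

param_grid = {
    "C": 2.0 ** np.arange(-15, 6),      # C in [2^-15, 2^5]
    "gamma": 2.0 ** np.arange(-15, 3),  # gamma in [2^-15, 2^2]
}
# Leave-one-out: one instance held out per fold; the full scan is slow but
# feasible for a dataset of this size.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=LeaveOneOut(), n_jobs=-1)
search.fit(X_train, y_train)
print("best (C, gamma):", search.best_params_, "LOO accuracy:", search.best_score_)

# Retrain on the whole TRAINSET with the selected parameters, then predict TESTSET.
final_model = SVC(kernel="rbf", **search.best_params_).fit(X_train, y_train)
```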

Results
Configuration:
- Ratio of partitioning the dataset = 1/5 (split the dataset into 5 roughly equal sets; one is preserved as TESTSET)
- K-fold = 15 (15 folds in total)
- Number of repetitions to select parameters = 10
Training results after running the system several times:
- Accuracy rate (Avg., Best): 84.35%, 84.23%, 88.82%, 83.52%, 85.88%, 90.59%
- "Best" C values: 0.7071, 1.1892, 1.0000
- "Best" G values: 0.0371, 0.0313, 0.0743, 0.0625
Accuracy rate of the testing process, by occurrence frequency of the selected (C, G) (Often / Sometimes / Rare):
- TRAINSET: 85/85 = 100%
- TESTSET: 19/21 = 90.48%, 18/21 = 85.71%, 20/21 = 95.23%, 21/21 = 100%, 15/21 = 71.43%

Observation/Conclusion
SVM:
- For this dataset the number of attributes is not large, so the RBF kernel seems appropriate for mapping the features into a higher-dimensional space.
- Scanning (C, G) takes a large amount of time. One way to speed up this process (sketched below): split the range into "large" equal intervals, pick the interval that yields high accuracy rates, divide that interval into smaller equal intervals, and repeat.
K-fold method:
- The larger the number of folds, the more time the process requires.
- For this dataset the number of instances is not large, so large numbers of folds seem to work well.
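Here is a minimal sketch of that coarse-to-fine idea. The refine_search helper and its arguments are mine, not from the slides, and it assumes an evaluate(C, gamma) function that returns the leave-one-out accuracy on the TRAINSET.

```python
# Coarse-to-fine (C, gamma) search sketch: scan a coarse log2 grid, then
# repeatedly shrink the search box around the best pair found so far.
import numpy as np

def refine_search(evaluate, c_exp=(-15.0, 5.0), g_exp=(-15.0, 2.0),
                  rounds=3, points=5):
    """evaluate(C, gamma) is assumed to return the LOO accuracy on the TRAINSET."""
    best = (0.0, 0.0, -np.inf)                      # (log2 C, log2 gamma, score)
    for _ in range(rounds):
        for c in np.linspace(*c_exp, points):
            for g in np.linspace(*g_exp, points):
                score = evaluate(2.0 ** c, 2.0 ** g)
                if score > best[2]:
                    best = (c, g, score)
        # Halve the box width and re-centre it on the current best exponents.
        c_half = (c_exp[1] - c_exp[0]) / 4.0
        g_half = (g_exp[1] - g_exp[0]) / 4.0
        c_exp = (best[0] - c_half, best[0] + c_half)
        g_exp = (best[1] - g_half, best[1] + g_half)
    return 2.0 ** best[0], 2.0 ** best[1], best[2]

# Toy usage with a dummy score peaked near C = 2^0, gamma = 2^-3;
# replace toy_score with the real leave-one-out evaluation.
toy_score = lambda C, gamma: -abs(np.log2(C)) - abs(np.log2(gamma) + 3.0)
print(refine_search(toy_score))
```

With 3 rounds of a 5 × 5 grid this evaluates 75 pairs instead of the full 21 × 18 grid, which is the kind of saving the slide points at.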