Predicting E. Coli Promoters Using SVM


Predicting E. Coli Promoters Using SVM DIEP MAI (dmai@wisc.edu) Course: CS/ECE 539 – Fall 2008 Instructor: Prof. Yu Hen Hu

Purpose
Build and train an SVM system to predict E. Coli promoters from given gene sequences.
Example: given the gene sequence
aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc
is it an E. Coli promoter?
For more theoretical information about E. Coli promoters: http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt

Dataset
Data file is obtained from http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt
Dataset information:
- Number of instances: 106
- Attributes: 57 per instance, non-numeric nominal values (A, C, G, or T)
- Classes: 2, Positive (+1) or Negative (-1)
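As a rough illustration of getting the raw data into memory, here is a minimal loader sketch in Python. The slides do not describe the exact layout of data.txt, so the assumed line format ("<label> <57-character sequence>" per instance) and the helper name load_promoter_data are hypothetical and may need adjusting.

```python
# Hypothetical loader sketch: the real layout of data.txt is not given in the
# slides. This assumes one instance per line, "<label> <sequence>", and should
# be adapted to the actual file format.
def load_promoter_data(path):
    labels, sequences = [], []
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:          # skip blank or malformed lines
                continue
            label, seq = parts
            labels.append(+1 if label in ("+", "+1", "1") else -1)
            sequences.append(seq.lower())
    return labels, sequences
```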

Data preprocessing
Randomly partition the dataset into TRAINSET and TESTSET, with ratio = TESTSET / (TRAINSET + TESTSET).
Encode the non-numeric attributes as 4-bit binary codes:
A → 0001₂ = 1₁₀
C → 0010₂ = 2₁₀
G → 0100₂ = 4₁₀
T → 1000₂ = 8₁₀
Scale each feature to [-1, 1] so that large values do not dominate small ones. (A sketch of this step follows below.)
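The following is a minimal sketch of the encoding and scaling step; the helper names are mine, not from the slides, and the sequences are assumed to have already been loaded as strings.

```python
# Sketch of the preprocessing step: map each nucleotide to the 4-bit code from
# the slide (interpreted as an integer), then scale each feature column to [-1, 1].
import numpy as np

NUCLEOTIDE_CODE = {"a": 1, "c": 2, "g": 4, "t": 8}   # A->0001, C->0010, G->0100, T->1000

def encode_sequence(seq):
    """Turn a 57-character gene sequence into a numeric feature vector."""
    return np.array([NUCLEOTIDE_CODE[base] for base in seq.lower()], dtype=float)

def scale_to_unit_range(X):
    """Scale each feature column to [-1, 1] so large values do not dominate."""
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero
    return 2.0 * (X - col_min) / span - 1.0

# Example on a (truncated) sequence from the Purpose slide:
x = encode_sequence("aagcaaagaaatgcttgactc")
print(x[:5])   # [1. 1. 4. 2. 1.]
```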

Approach
The RBF kernel is used → "good" values of the C (cost) and G (gamma) parameters need to be found.
Parameter scanning:
- Set the range of C to [2^-15, 2^5] and of G to [2^-15, 2^2].
- For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates.
- This process is repeated a few times; the pair that "often" produces high accuracy rates is more likely to be selected.
Training/Testing:
- Use the selected parameters and the whole TRAINSET to train the system.
- Use the trained system to predict the TRAINSET (preferred accuracy rate = 100%).
- Use the trained system to predict the TESTSET.
(A sketch of the parameter scan follows below.)
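A minimal sketch of the (C, G) scan using scikit-learn, which is my choice of library rather than one named in the slides. The random placeholder arrays stand in for the encoded, scaled TRAINSET so the snippet runs on its own; replace them with the real data.

```python
# Sketch of the RBF-kernel parameter scan with leave-one-out cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Placeholder data (85 training instances, 57 features scaled to [-1, 1]);
# replace with the real encoded TRAINSET.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(85, 57))
y_train = rng.choice([-1, 1], size=85)

param_grid = {
    "C": 2.0 ** np.arange(-15, 6),      # C in [2^-15, 2^5]
    "gamma": 2.0 ** np.arange(-15, 3),  # gamma in [2^-15, 2^2]
}
# Leave-one-out: one instance held out per fold; the full scan is slow but
# feasible for a dataset of this size.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=LeaveOneOut(), n_jobs=-1)
search.fit(X_train, y_train)
print("best (C, gamma):", search.best_params_, "LOO accuracy:", search.best_score_)

# Retrain on the whole TRAINSET with the selected parameters, then predict TESTSET.
final_model = SVC(kernel="rbf", **search.best_params_).fit(X_train, y_train)
```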

Results
Configuration:
- Ratio of partitioning the dataset = 1/5 (split the dataset into 5 roughly equal sets; one is preserved as TESTSET)
- K-fold = 15 (15 folds in total)
- Number of repetitions to select parameters = 10
Training results after running the system several times:
- Accuracy rate (Avg., Best): 84.35%, 84.23%, 88.82%, 83.52%, 85.88%, 90.59%
- "Best" C values: 0.7071, 1.1892, 1.0000
- "Best" G values: 0.0371, 0.0313, 0.0743, 0.0625
Accuracy rate of the testing process, by occurrence frequency of the selected (C, G) (Often / Sometimes / Rare):
- TRAINSET: 85/85 = 100%
- TESTSET: 19/21 = 90.48%, 18/21 = 85.71%, 20/21 = 95.23%, 21/21 = 100%, 15/21 = 71.43%

Observation/Conclusion
SVM:
- For this dataset the number of attributes is not large, so the RBF kernel seems appropriate for mapping the features into a higher-dimensional space.
- Scanning (C, G) takes a large amount of time. One way to speed up this process (sketched below): split the range into "large" equal intervals, pick the interval that yields high accuracy rates, divide that interval into smaller equal intervals, and repeat.
K-fold method:
- The larger the number of folds, the more time the process requires.
- For this dataset the number of instances is not large, so large numbers of folds seem to work well.
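Here is a minimal sketch of that coarse-to-fine idea. The refine_search helper and its arguments are mine, not from the slides, and it assumes an evaluate(C, gamma) function that returns the leave-one-out accuracy on the TRAINSET.

```python
# Coarse-to-fine (C, gamma) search sketch: scan a coarse log2 grid, then
# repeatedly shrink the search box around the best pair found so far.
import numpy as np

def refine_search(evaluate, c_exp=(-15.0, 5.0), g_exp=(-15.0, 2.0),
                  rounds=3, points=5):
    """evaluate(C, gamma) is assumed to return the LOO accuracy on the TRAINSET."""
    best = (0.0, 0.0, -np.inf)                      # (log2 C, log2 gamma, score)
    for _ in range(rounds):
        for c in np.linspace(*c_exp, points):
            for g in np.linspace(*g_exp, points):
                score = evaluate(2.0 ** c, 2.0 ** g)
                if score > best[2]:
                    best = (c, g, score)
        # Halve the box width and re-centre it on the current best exponents.
        c_half = (c_exp[1] - c_exp[0]) / 4.0
        g_half = (g_exp[1] - g_exp[0]) / 4.0
        c_exp = (best[0] - c_half, best[0] + c_half)
        g_exp = (best[1] - g_half, best[1] + g_half)
    return 2.0 ** best[0], 2.0 ** best[1], best[2]

# Toy usage with a dummy score peaked near C = 2^0, gamma = 2^-3;
# replace toy_score with the real leave-one-out evaluation.
toy_score = lambda C, gamma: -abs(np.log2(C)) - abs(np.log2(gamma) + 3.0)
print(refine_search(toy_score))
```

With 3 rounds of a 5 × 5 grid this evaluates 75 pairs instead of the full 21 × 18 grid, which is the kind of saving the slide points at.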