Protein Fold Recognition as a Data Mining Coursework Project
Badri Adhikari, Department of Computer Science, University of Missouri-Columbia



Overview
- Problem Description
- Data Description and Specification
- Data Preprocessing
- Methodology
- Discussion of Results

The Problem
Identify whether two proteins belong to the same fold [Binary Classification].
[Diagram: proteins are drawn from a protein structure database (Protein 1, Protein 2, Protein 3, ...); each pair, e.g. Protein1-Protein2, is described by features (Feature 1, Feature 2, Feature 3), and the classifier answers "Same fold? Y/N".]

The Problem
Identify whether two proteins belong to the same fold [Binary Classification].
[Diagram: the protein fold recognition task (protein pair + features -> Same fold? Y/N) shown alongside an analogous task, recognizing potential new customers (customer features -> Potential customer? Y/N).]

Data Specification
- File size: 1.5 GB
- Examples count:
- Positive (+1) labels: 7438
- Negative (-1) labels:
- Number of features: 84

Data Description
#119l-d119l 1alo-d1alo_1  -1  1:1.62 2:1.13 3:…
#119l-d119l 1chka-d1chka  +1  1:1.62 2:2.38 3:…
#119l-d119l 1fcdc-d1fcdc1 -1  1:1.62 2:0.8  3:…
#119l-d119l 1gal-d1gal_1  -1  1:1.62 2:3.22 3:…
#119l-d119l 1gbs-d1gbs    +1  1:1.62 2:1.85 3:…
#119l-d119l 1iov-d2dln_2  -1  1:1.62 2:2.1  3:…
#119l-d119l 1kte-d1kte    -1  1:1.62 2:1.05 3:…
#119l-d119l 1sly-d1sly_2  +1  1:1.62 2:1.68 3:…
#119l-d119l 2baa-d2baa    +1  1:1.62 2:2.43 3:…
#119l-d119l 6lyt-d193l    +1  1:1.62 2:1.29 3:…

Data Description
Each row has three parts:
- Example id: the protein query-target pair (e.g. #119l-d119l 1alo-d1alo_1)
- Label: +1 if the pair shares a fold, -1 otherwise
- Feature values for each example, as index:value pairs (e.g. 1:1.62 2:1.13)
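A row in this format can be parsed into its three parts with a short helper (a sketch; the function name is my own, and the column order is assumed from the sample rows as displayed above):

```python
def parse_example(line):
    """Parse one row: '<query pair> <target pair> <label> 1:v1 2:v2 ...'.

    The example id spans the first two tokens,
    e.g. '#119l-d119l 1alo-d1alo_1'.
    """
    tokens = line.split()
    example_id = " ".join(tokens[:2])
    label = int(tokens[2])
    features = {}
    for pair in tokens[3:]:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return example_id, label, features
```

Returning features as an index-to-value dict keeps the sparse representation: indices absent from the row are simply absent from the dict.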

Preprocessing Task 1: Group related data rows
Problem: The records are not independent of each other; rows sharing the same query protein are correlated.
Solution: Group records with the same query together, so that each group lands entirely in either the training set or the test set.

#119l-d119l 1alo-d1alo_1  -1  1:1.62 2:1.13 3:…
#119l-d119l 1chka-d1chka  +1  1:1.62 2:2.38 3:…
#1aab-d1aab 1cyx-d1cyx    -1  1:0.83 2:1.58 3:…
#119l-d119l 1fcdc-d1fcdc1 -1  1:1.62 2:0.8  3:…
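The grouping step can be sketched in a few lines of Python (a sketch; the field layout is assumed from the sample rows, and the function names are my own):

```python
from collections import defaultdict

def group_by_query(lines):
    """Group rows by their query protein (the first token, e.g. '#119l-d119l')."""
    groups = defaultdict(list)
    for line in lines:
        groups[line.split()[0]].append(line)
    return groups

def split_groups(groups, test_fraction=0.2):
    """Assign whole groups to train or test so no query straddles the split."""
    train, test = [], []
    budget = test_fraction * sum(len(rows) for rows in groups.values())
    taken = 0
    for query in sorted(groups):
        rows = groups[query]
        if taken < budget:
            test.extend(rows)
            taken += len(rows)
        else:
            train.extend(rows)
    return train, test
```

Because whole groups move together, no information about a test query can leak into training through sibling rows.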

Preprocessing Task 2: Balance the positive and negative data
Problem: The dataset has just 0.78% positive examples.
Solution: Keep all positive examples and randomly select an equal number of negative examples to form a balanced set; the remaining negative examples are used only for testing.
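The balancing step above can be sketched as a simple downsampling of the negatives (a sketch; the example representation is my own assumption):

```python
import random

def balance(examples, seed=0):
    """Downsample negatives to match the number of positives.

    Each example is assumed to be (label, features) with labels +1 / -1.
    Returns (balanced, leftover_negatives); the leftovers are kept
    for testing only, as in the scheme described above.
    """
    positives = [e for e in examples if e[0] == +1]
    negatives = [e for e in examples if e[0] == -1]
    rng = random.Random(seed)          # seeded for reproducibility
    rng.shuffle(negatives)
    kept = negatives[:len(positives)]
    return positives + kept, negatives[len(positives):]
```

In practice this would be combined with the query grouping from Task 1, sampling negatives after the train/test split so the leftover negatives stay on the test side.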

Methodology
SVM-light as the mining tool. SVM-light is an implementation of Support Vector Machines (SVMs) in C.

$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction

Methodology: Deciding the Kernel Function
Many different kernels are available: linear, polynomial, radial basis function (RBF), or user defined. Consider the RBF kernel:

K(x, y) = exp(-γ ||x − y||²)

Parameters to consider:
- -m  memory size of the cache for kernel evaluations
- -g  gamma value for the RBF kernel
- -c  trade-off between training error and margin

Use cross-validation to find the best parameters C and γ, then use the best C and γ to train on the whole training set.
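The RBF kernel value for two feature vectors can be computed directly from the formula above (a minimal sketch, independent of SVM-light):

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical vectors give K = 1, and the value decays toward 0 as the squared distance grows; gamma controls how fast that decay happens, which is why it must be tuned.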

Methodology: Parameter Determination
Determining the gamma parameter:
- Ran training and testing for 100 gamma values between 0 and 1
- Found gamma = 0.15 as the best value
- Ran again over 120 values from 0 to 0.3 to find a more precise gamma
- Found the best gamma value to be 0.1
Used the default C value.
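The two-pass gamma sweep above can be sketched as a simple grid search (a sketch; `evaluate` is my own placeholder standing in for the cross-validated score of an SVM-light run at a given gamma):

```python
def grid_search(evaluate, gammas):
    """Return (best_gamma, best_score) over the candidate gamma values.

    `evaluate(gamma)` is assumed to return a score such as
    cross-validated accuracy; higher is better.
    """
    best_gamma, best_score = None, float("-inf")
    for gamma in gammas:
        score = evaluate(gamma)
        if score > best_score:
            best_gamma, best_score = gamma, score
    return best_gamma, best_score

# Coarse pass: 100 values in (0, 1]; a second, finer pass over a
# narrower range (here 0 to 0.3) then refines the first estimate.
coarse_gammas = [(i + 1) / 100 for i in range(100)]
```

The coarse-then-fine pattern keeps the total number of training runs small while still locating a precise optimum.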

Evaluation with 10-fold cross-validation
An ROC curve was drawn by varying the decision threshold: for each threshold, the average sensitivity and specificity were computed from the values in each fold.
[Table: Threshold, Sensitivity, Specificity, FPR, Accuracy, Precision at each threshold.]
[Figure: ROC curve; axes labeled specificity and sensitivity.]
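The per-threshold metrics in the table can be computed from counts of true/false positives and negatives (a minimal sketch using the standard definitions; the prediction rule assumed here is score >= threshold):

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when predicting +1 iff score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == +1 and predicted_positive:
            tp += 1
        elif label == +1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Sensitivity (TPR), specificity, FPR, accuracy, precision."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 1.0 - specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }
```

Sweeping the threshold and plotting sensitivity against 1 − specificity (the FPR) traces out the ROC curve described above.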

References
- Jianlin Cheng and Pierre Baldi. A machine learning information retrieval approach to protein fold recognition.
- Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification.
- Payam Refaeilzadeh, Lei Tang, and Huan Liu. Cross-Validation.
- Classroom slides.

Thank you for your time. Questions and comments are welcome.