Machine Learning Bioinformatics Data Analysis and Tools

Lecture 6: Machine Learning - Bioinformatics Data Analysis and Tools (elena@few.vu.nl)

Supervised Learning: a supervisor provides the property of interest for observations of an (unknown) system; an ML algorithm is trained on this dataset to produce a model, and the model is then used to predict the property for a new observation. This is classification.

Unsupervised Learning: ML for unsupervised learning attempts to discover interesting structure in the available data (data mining, clustering).

What is your question?
- What are the target genes for my knock-out gene? Look for genes that have different time profiles between different cell types. Gene discovery, differential expression.
- Is a specified group of genes all up-regulated in a specified condition? Gene set, differential expression.
- Can I use the expression profiles of cancer patients to predict survival? Can I identify groups of genes that are predictive of a particular class of tumors? Class prediction, classification.
- Are there tumor sub-types not previously identified? Are there groups of co-expressed genes? Class discovery, clustering.
- Detection of gene regulatory mechanisms: do my genes group into previously undiscovered pathways? Clustering. Often expression data alone is not enough; we need to incorporate sequence and other information.

Basic principles of discrimination: each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements, X = (X1, …, XG). Aim: predict Y from X. Example: objects are described by the feature vector X = {colour, shape}; given a new object with X = {red, square}, the classification rule must predict its class label Y ∈ {1, 2, …, K}.

Discrimination and Prediction: a learning set (data with known classes) is fed to a classification technique, which produces a classification rule (discrimination); the rule is then applied to data with unknown classes to assign each object to a class (prediction).

Example: a classification problem. Categorize images of fish, say "Atlantic salmon" vs. "Pacific salmon", using features such as length, width, lightness, fin shape and number, mouth position, etc. Steps: preprocessing (e.g., background subtraction), feature extraction, classification. (Example from Duda & Hart.)

Classification in Bioinformatics:
- Computational diagnostics: early cancer detection
- Tumor biomarker discovery
- Protein folding prediction
- Protein-protein binding site prediction
- Gene function prediction
- …

Example: breast cancer prognosis. The objects are arrays, the feature vectors are gene expression profiles, and the predefined classes are clinical outcomes (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs). A classification rule learned from this set is used to predict the outcome (e.g., good prognosis, metastasis beyond 5 yrs?) for a new array. Reference: L. van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

Classification techniques: K-nearest-neighbor classifier, Support Vector Machines, …

Instance-Based Learning. Key idea: just store all training examples <xi, f(xi)>. Nearest neighbor: given a query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn). K-nearest neighbor: given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued), or take the mean of the f values of the k nearest neighbors (if real-valued): f(xq) = (1/k) Σ_{i=1..k} f(xi). A minimal sketch of this rule follows below.
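
A minimal sketch of the k-nearest-neighbor rule described above, assuming Euclidean distance as the similarity measure (NumPy-based; the toy data are hypothetical):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, classification=True):
    """Predict the target for x_query from its k nearest training examples."""
    # Euclidean distance from the query to every stored training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest neighbors
    nn_idx = np.argsort(dists)[:k]
    neighbor_targets = y_train[nn_idx]
    if classification:
        # Discrete-valued target: majority vote among the k neighbors
        values, counts = np.unique(neighbor_targets, return_counts=True)
        return values[np.argmax(counts)]
    # Real-valued target: mean of the f values of the k neighbors
    return neighbor_targets.mean()

# Toy usage (hypothetical data): two expression features, two classes
X_train = np.array([[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```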

K-Nearest Neighbor: a lazy learner. Issues: how many neighbors? Which similarity measure?

Which similarity or dissimilarity measure? A metric is a measure of the similarity or dissimilarity between two data objects. Two main classes of metric:
- Correlation coefficients (similarity): compare the shape of expression curves. Types of correlation: centered, un-centered, rank correlation.
- Distance metrics (dissimilarity): City Block (Manhattan) distance, Euclidean distance.

Correlation (a measure between -1 and 1). Pearson Correlation Coefficient (centered correlation): r(x, y) = (1/n) Σ_{i=1..n} (xi - x̄)(yi - ȳ) / (sx sy), where sx is the standard deviation of x and sy the standard deviation of y. You can use the absolute correlation to capture both positive and negative correlation.
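
A small sketch of the centered Pearson correlation as a similarity measure between two expression profiles (NumPy; the profiles are hypothetical):

```python
import numpy as np

def pearson_correlation(x, y):
    """Centered Pearson correlation between two profiles, in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Center by the means and scale by the standard deviations sx, sy
    xc = (x - x.mean()) / x.std()
    yc = (y - y.mean()) / y.std()
    return float(np.mean(xc * yc))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]                      # same shape, different scale
print(pearson_correlation(x, y))              # 1.0 (positive correlation)
print(abs(pearson_correlation(x, y[::-1])))   # absolute value also captures negative correlation
```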

Potential pitfalls: two expression profiles can have correlation = 1 and yet differ substantially in absolute level, because correlation compares only the shape of the curves.

Distance metrics, where gene X = (x1, …, xn) and gene Y = (y1, …, yn):
- City Block (Manhattan) distance: d(X, Y) = Σi |xi - yi|, the sum of differences across dimensions. Less sensitive to outliers; yields diamond-shaped clusters.
- Euclidean distance: d(X, Y) = √(Σi (xi - yi)²), the most commonly used distance. It corresponds to the geometric distance in the multidimensional space and yields sphere-shaped clusters.
A short sketch of both metrics follows below.
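
A short sketch of the two distance metrics applied to a pair of gene profiles (NumPy; the example vectors are hypothetical):

```python
import numpy as np

def manhattan_distance(x, y):
    """City Block distance: sum of absolute differences across dimensions."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

def euclidean_distance(x, y):
    """Geometric distance in the multidimensional expression space."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

gene_x = np.array([1.0, 0.5, 2.0])   # expression of gene X over three conditions
gene_y = np.array([0.0, 1.5, 2.5])   # expression of gene Y over the same conditions
print(manhattan_distance(gene_x, gene_y))  # 2.5
print(euclidean_distance(gene_x, gene_y))  # 1.5
```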

Euclidean vs Correlation (I): [figure contrasting Euclidean distance and correlation as measures between expression profiles]

When to consider nearest neighbors: instances map to points in R^N, there are fewer than 20 attributes per instance, and there is plenty of training data.
Advantages: training is very fast; complex target functions can be learned; no information is lost.
Disadvantages: slow at query time; easily fooled by irrelevant attributes.

Voronoi Diagram: the 1-nearest-neighbor rule partitions the instance space into cells around the training examples; a query point is assigned the class of its nearest neighbor.

3-Nearest Neighbors: among the 3 nearest neighbors of the query point the vote is 2 x vs. 1 o.

7-Nearest Neighbors: among the 7 nearest neighbors of the query point the vote is 3 x vs. 4 o, so the predicted class changes with k.

Nearest Neighbor (continuous): [figures illustrating nearest-neighbor prediction of a continuous, real-valued target]

Nearest Neighbor: approximate the target function f(x) only at the single query point x = xq. Locally weighted regression is a generalization of instance-based learning (IBL); a sketch is given below.
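
A minimal sketch of locally weighted regression, which fits a linear model around the query point with nearby training examples weighted more heavily (NumPy; the kernel width tau is a hypothetical tuning parameter):

```python
import numpy as np

def locally_weighted_regression(X_train, y_train, x_query, tau=1.0):
    """Predict f(x_query) from a linear fit weighted by proximity to the query."""
    # Gaussian kernel weights: nearby examples influence the local fit more
    w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2 * tau ** 2))
    # Add an intercept column and solve the weighted least-squares problem
    A = np.hstack([np.ones((len(X_train), 1)), X_train])
    W = np.diag(w)
    beta = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ y_train)
    return float(np.concatenate([[1.0], x_query]) @ beta)
```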

Curse of Dimensionality. Imagine instances are described by 20 attributes but only 10 are relevant to the target function. Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional. One approach: weight the features according to their relevance. Stretch the j-th axis by a weight zj, where z1, …, zn are chosen to minimize the prediction error; use cross-validation to choose the weights z1, …, zn automatically. Note that setting zj to zero eliminates that dimension altogether (feature subset selection). A sketch of the axis stretching is shown below.
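
A small sketch of the axis-stretching idea: the distances used by the nearest-neighbor rule are computed after multiplying each feature j by its weight zj (choosing the weights by cross-validation is not shown here):

```python
import numpy as np

def weighted_distances(X_train, x_query, z):
    """Euclidean distances after stretching axis j by weight z[j] (z[j] = 0 drops the feature)."""
    return np.linalg.norm((X_train - x_query) * z, axis=1)

# Hypothetical weights: the second feature is considered irrelevant and eliminated
z = np.array([1.0, 0.0, 2.5])
X_train = np.array([[0.1, 5.0, 0.2], [0.9, -3.0, 0.8]])
print(weighted_distances(X_train, np.array([0.2, 0.0, 0.3]), z))
```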

Practical implementations: Weka (IBk), TiMBL (optimized).

Example: Tumor Classification.
- Reliable and precise classification is essential for successful cancer treatment.
- Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables.
- Uncertainties in diagnosis remain; it is likely that existing classes are heterogeneous.
- Characterize molecular variations among tumors by monitoring gene expression (microarrays).
- Hope: microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes).

Tumor Classification Using Gene Expression Data. Three main types of ML problems are associated with tumor classification:
- Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning: clustering).
- Classification of malignancies into known classes (supervised learning: discrimination).
- Identification of "marker" genes that characterize the different tumor classes (feature or variable selection).
These problems are relevant to other types of classification problems, not just tumors.

Example: leukemia classification. The objects are arrays, the feature vectors are gene expression profiles, and the predefined classes are tumor types (B-ALL, T-ALL, AML). A classification rule learned from this set is used to predict the tumor type of a new array. Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

Nearest neighbor rule

SVM. SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc. The most popular optimization algorithms for SVMs are SMO [Platt '99] and SVMlight [Joachims '99]; both use decomposition to hill-climb over a subset of the αi's at a time. Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a trial-and-error manner.

SVM: in order to discriminate between two classes, given a training dataset, map the data to a higher-dimensional space (the feature space) and separate the two classes using an optimal linear separator.

Feature Space Mapping: map the original data to some higher-dimensional feature space where the training set is linearly separable, Φ: x → φ(x).

The “Kernel Trick”. The linear classifier relies on the inner product between vectors, K(xi, xj) = xiᵀxj. If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)ᵀφ(xj). A kernel function is a function that corresponds to an inner product in some expanded feature space. Example: for 2-dimensional vectors x = [x1 x2], let K(xi, xj) = (1 + xiᵀxj)². We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2].
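
The kernel identity above can be checked numerically; a small sketch (NumPy, with hypothetical 2-dimensional points):

```python
import numpy as np

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi.xj)^2, computed directly in the input space."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print(poly_kernel(xi, xj))   # 4.0
print(phi(xi) @ phi(xj))     # 4.0 -- the same inner product, without ever forming phi in the kernel version
```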

Linear Separators

Optimal hyperplane: the support vectors uniquely characterize the optimal hyperplane; ρ denotes the margin.

Optimal hyperplane: geometric view

Soft Margin Classification: what if the training set is not linearly separable? Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

Weakening the constraints: allow objects not to strictly obey the constraints by introducing 'slack' variables.

Influence of C: erroneous objects can still have a (large) influence on the solution. The parameter C controls how heavily slack (constraint violations) is penalized relative to the width of the margin; an illustration follows below.
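
A hedged illustration of the role of C using scikit-learn's SVC (this assumes scikit-learn is available; the data are synthetic and hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, partly overlapping classes in 2-D (hypothetical data)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: more slack tolerated, wider margin; large C: violations penalized heavily
    print(f"C={C:<7} number of support vectors = {clf.support_vectors_.shape[0]}")
```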

SVM.
Advantages: SVMs maximize the margin between the two classes in the feature space characterized by a kernel function, and they are robust with respect to high input dimension.
Disadvantages: it is difficult to incorporate background knowledge, and SVMs are sensitive to outliers.

SVM and outliers: [figure illustrating the effect of a single outlier on the SVM solution]

Classifying new examples: given a new point x, its class membership is sign[f(x, α*, b*)], where f(x, α*, b*) = Σi αi* yi (xiᵀx) + b*. Data enters only in the form of dot products! In general the dot product is replaced by a kernel function: f(x, α*, b*) = Σi αi* yi K(xi, x) + b*. A sketch of this decision rule is given below.
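
A minimal sketch of the decision rule above, assuming the multipliers α*, the bias b*, the support vectors and their labels have already been produced by training (all names here are illustrative):

```python
import numpy as np

def svm_decision(x, support_vectors, sv_labels, alphas, b, kernel):
    """Return sign(f(x)) with f(x) = sum_i alpha_i* y_i K(x_i, x) + b*."""
    f = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b
    return np.sign(f)

# Linear kernel: the data enter only through dot products
linear_kernel = lambda u, v: float(np.dot(u, v))

# Hypothetical trained quantities (normally produced by the SVM optimizer)
support_vectors = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
sv_labels = [+1, -1]
alphas = [0.5, 0.5]
b = 0.0
print(svm_decision(np.array([2.0, 0.5]), support_vectors, sv_labels, alphas, b, linear_kernel))  # 1.0
```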

Classification: CV error. With N samples one can distinguish the training (empirical) error, the error on an independent test set (test error), and the cross-validation (CV) error. n-fold CV: split the data into n parts, use 1/n of the samples for testing and the remaining (n-1)/n for training, count the errors on the held-out part, and summarize over all folds as the CV error rate. Leave-one-out (LOO) is the special case in which each held-out part contains a single sample. A sketch follows below.
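
A sketch of n-fold cross-validation error estimation for any classifier exposing a knn_predict-style (X_train, y_train, x_query) interface like the one sketched earlier (the helper below is hypothetical; leave-one-out is obtained with n_folds equal to the number of samples):

```python
import numpy as np

def cv_error(X, y, predict_fn, n_folds=10, seed=0):
    """Estimate the classification error rate by n-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)        # the remaining samples are used for training
        for i in fold:                         # 1/n of the samples are held out for testing
            y_hat = predict_fn(X[train], y[train], X[i])
            errors += int(y_hat != y[i])       # count errors on the held-out fold
    return errors / len(y)                     # summarize as the CV error rate

# Leave-one-out (LOO) error:
# loo_error = cv_error(X, y, knn_predict, n_folds=len(y))
```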