1 Classifying Lymphoma Dataset Using Multi-class Support Vector Machines
INFS-795 Advanced Data Mining
Prof. Domeniconi
Presented by Hong Chai

2 Agenda
(1) Lymphoma Dataset Description
(2) Data Preprocessing
    - Formatting
    - Dealing with Missing Values
    - Gene Selection
(3) Multi-class SVM Classification
    - 1-against-all
    - 1-against-1
(4) Tools
(5) References

3 Lymphoma Dataset
Alizadeh et al. (2000), "Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling"
Publicly available online
In microarray data, expression profiles of genes are measured in rows; samples are columns

4 Lymphoma Dataset
96 samples of lymphocytes (instances)
4026 human genes (features)
9 classes of lymphoma: DLBCL, GCB, NIL, ABB, RAT, TCL, FL, RBB, CLL
Glimpse of data: rows are genes (e.g. GENE2406X, GENE3689X, GENE3133X, GENE1008X), columns are samples (e.g. DLCL, CLL-60, CLL-68, FL-10, FL-11, GCB, NIL-IgM, CLL-65)
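To make this layout concrete, here is a minimal Python sketch of loading such a matrix and transposing it so that samples become the instances; the file name lymphoma.txt, the tab-delimited layout, and the idea of parsing the class from the sample name are illustrative assumptions, not part of the original dataset description.

```python
# Sketch only: load a matrix with genes in rows and samples in columns, then
# transpose it so that each of the 96 samples becomes one instance.
# "lymphoma.txt" and the label-parsing rule are hypothetical.
import pandas as pd

expr = pd.read_csv("lymphoma.txt", sep="\t", index_col=0)   # 4026 genes x 96 samples (assumed layout)
X = expr.T                                                  # rows: samples, columns: genes (features)
y = [name.split("-")[0] for name in X.index]                # assumed: class encoded in the sample name
print(X.shape)                                              # expected (96, 4026)
```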

5 Lymphoma Dataset

6 Goal
Task: classification. Assign each patient sample to one of 9 categories, e.g. Diffuse Large B-cell Lymphoma (DLBCL) or Chronic Lymphocytic Leukemia (CLL).
Microarray data classification is an alternative to the current classification of malignancies, which relies on morphological or clinical variables.
Medical implications:
- Precise categorization of cancers; more relevant diagnosis
- More accurate assignment of cases to high-risk or low-risk categories, allowing more targeted therapies
- Improved predictability of outcome

7 Data Preprocessing: Missing Value Imputation
3% of the gene expression data are missing
1980 of the 4026 genes have missing values, i.e. 49.1% of the features are involved
Some of these genes may be highly informative for classification
Need to deal with missing values before applying SVM

8 Missing Value Approaches
- Instance or feature deletion: only if the dataset is large enough and deletion does not distort the distribution
- Replace with a randomly drawn observed value: shown to work well
- EM algorithm
- Global mode or mean substitution: will distort the distribution
- Local mode or mean substitution with the KNN algorithm (Prof. Domeniconi)

9 Local Mean Imputation (KNN)
1. Partition the data set D into two sets. Let the first set, D_m, contain instances with missing value(s); the other set, D_c, contains instances with complete values.
2. For each instance vector x ∈ D_m:
   - Divide the vector into observed and missing parts as x = [x_o; x_m].
   - Calculate the distance between x_o and every instance y ∈ D_c, using only those features that are observed in x.
   - From the K closest y's (instances in D_c), calculate the mean of the feature for which x has missing value(s).
   - Substitute this local mean for the missing value.
(Note: for nominal features use the mode; not applicable in microarray data.)
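A minimal NumPy sketch of the local-mean (KNN) imputation above; missing entries are assumed to be encoded as NaN, and the function name and K=10 are illustrative choices.

```python
import numpy as np

def knn_impute(X, k=10):
    """Local-mean imputation: X is (n_instances, n_features) with NaN for missing values."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)            # D_c: instances with complete values
    Dc = X[complete]
    for i in np.where(~complete)[0]:               # each x in D_m
        x = X[i]
        obs = ~np.isnan(x)                         # observed part x_o
        # distance between x_o and every complete instance, on observed features only
        d = np.sqrt(((Dc[:, obs] - x[obs]) ** 2).sum(axis=1))
        nearest = Dc[np.argsort(d)[:k]]            # the K closest instances in D_c
        x[~obs] = nearest[:, ~obs].mean(axis=0)    # substitute the local mean
    return X
```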

10 Data Preprocessing: Feature Selection Motivations
- The number of features is large, the number of instances is small
- Reduce dimensionality to overcome overfitting
- A small number of discriminant "marker" genes may characterize different cancer classes
Example: Guyon et al. identified 2 genes that yield zero leave-one-out error on the leukemia dataset, and 4 genes on the colon cancer dataset that give 98% accuracy (Guyon et al., Gene Selection for Cancer Classification using SVM, 2002).

11 Feature Selection: Discriminant Score Ranking
Which gene is more informative in the 2-class case?
(Figure: class-conditional expression distributions for Gene 1 and Gene 2)

12 Separation Score
Gene 1 is more discriminant. Criteria:
- Large difference between μ+ and μ-
- Small variance within each class
Score function: F(g_j) = | (μ_j+ - μ_j-) / (σ_j+ + σ_j-) |
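As a sketch, this two-class score can be computed for every gene at once; X, y and the small epsilon guarding against zero variance are assumptions for illustration.

```python
import numpy as np

def separation_scores(X, y):
    """F(g_j) = |(mu_j+ - mu_j-) / (sigma_j+ + sigma_j-)| for each gene j.
    X: (n_samples, n_genes); y: NumPy array of 0/1 class labels."""
    pos, neg = X[y == 1], X[y == 0]
    diff = pos.mean(axis=0) - neg.mean(axis=0)            # mu_j+ - mu_j-
    spread = pos.std(axis=0) + neg.std(axis=0) + 1e-12    # sigma_j+ + sigma_j- (epsilon avoids /0)
    return np.abs(diff / spread)                          # one score per gene; rank descending
```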

13 Separation Score
In multi-class cases, rank genes that are discriminant among multiple classes
(Figure: class-conditional distributions for C_1, C_2 and C_3, with a gap Δ between C_2 and C_3)
A gene may functionally relate to several cancer classes, such as C_1 and C_2

14 Separation Score
Proposing an adapted score function. For each gene j:
- Calculate μ_i in each class C_i
- Sort the μ_i in descending order
- Find the cutoff point with the largest difference between adjacent sorted means
- μ+ ← mean of the expression values left of the cutoff; σ+ ← their standard deviation
- μ- ← mean of the expression values right of the cutoff; σ- ← their standard deviation
- F(g_j) = | (μ_j+ - μ_j-) / (σ_j+ + σ_j-) |
Rank genes by F(g_j) and select the top genes via a threshold
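A sketch of this adapted score, under the reading that the cutoff is placed at the largest gap between adjacent sorted class means; the epsilon is only there to avoid division by zero.

```python
import numpy as np

def multiclass_separation_scores(X, y):
    """Adapted separation score: X is (n_samples, n_genes), y holds the class labels."""
    classes = np.unique(y)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        g = X[:, j]
        means = np.array([g[y == c].mean() for c in classes])
        order = np.argsort(means)[::-1]              # class means in descending order
        gaps = means[order][:-1] - means[order][1:]  # differences between adjacent means
        cut = np.argmax(gaps) + 1                    # cutoff after the largest gap
        left = np.isin(y, classes[order[:cut]])      # samples of classes left of the cutoff
        num = g[left].mean() - g[~left].mean()
        den = g[left].std() + g[~left].std() + 1e-12
        scores[j] = abs(num / den)
    return scores                                    # rank genes by score and keep the top ones
```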

15 Separation Score
Disadvantages:
- Does not yield more compact gene sets; still abundant
- Does not consider mutual information between genes

16 Feature Selection: Recursive Feature Elimination (RFE/SVM)
1. Fit a linear SVM on the full feature set: sign(w·x + b), where w is a vector of weights (one per feature), x is an input instance, and b is a threshold. If w_i = 0, feature X_i does not influence classification and can be eliminated from the feature set.

17 RFE/SVM
2. Once w is computed for the full feature set, sort the features in descending order of their weights. The lower half is eliminated.
3. A new linear SVM is built using the reduced feature set. Repeat the process until the set of predictors can no longer be divided by two.
4. The best feature subset is chosen.
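One possible way to run this recursion is scikit-learn's RFE with a linear SVM (the slides themselves use Weka; this is only an equivalent sketch). Keeping 50 genes and dropping half of the remaining features each round (step=0.5) are illustrative choices, and the data below is a random placeholder.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X_imp = rng.normal(size=(96, 500))        # placeholder for the imputed expression matrix
y = rng.integers(0, 9, size=96)           # placeholder labels for the 9 lymphoma classes

svm = SVC(kernel="linear", C=1.0)         # the linear SVM supplies the weight vector w
rfe = RFE(estimator=svm, n_features_to_select=50, step=0.5)  # step=0.5: eliminate half each round
rfe.fit(X_imp, y)
selected = rfe.get_support(indices=True)  # indices of the surviving genes
```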

18 Feature Selection: PCA
Comment: not common for microarray data.
Disadvantage: none of the original inputs can be discarded.
We want to retain a minimal subset of informative genes that achieves the best classification performance.

19 Multi-class SVM

20 Multi-class SVM Approaches
1-against-all: each of the SVMs separates a single class from all remaining classes (Cortes and Vapnik, 1995)
1-against-1: pairwise; k(k-1)/2 SVMs are trained, where k is the number of classes, and each SVM separates a pair of classes (Friedman, 1996)
Performance is similar in some experiments (Nakajima, 2000)
Time complexity is similar: k evaluations in 1-against-all, k-1 in 1-against-1

21 1-against-All
Also called "one-against-rest"; a tree algorithm
Decomposes the problem into a collection of binary classifications
k decision functions, one for each class: (w^k)^T φ(x) + b^k, k ∈ Y
Each classifier constructs a hyperplane between one class and the k-1 other classes
Class of x = argmax_i { (w^i)^T φ(x) + b^i }
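A minimal sketch of this rule: train one binary SVM per class and assign x to the class with the largest decision value. Library multi-class SVMs do this internally; the helper names here are hypothetical and the linear kernel is only a placeholder.

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_all_fit(X, y):
    classes = np.unique(y)
    models = [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes]
    return classes, models

def one_vs_all_predict(X, classes, models):
    # class of x = argmax_i (w_i . phi(x) + b_i)
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]
```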

22 1-against-1
k(k-1)/2 classifiers, each trained on data from two classes
For training data from the i-th and j-th classes, run a binary classification
Voting strategy: if sign((w^ij)^T φ(x) + b^ij) says x is in class i, add 1 to class i's vote; otherwise add 1 to class j's vote
Assign x to the class with the largest vote ("max wins")
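A matching sketch of 1-against-1 with max-wins voting; again the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def one_vs_one_fit(X, y):
    classes = np.unique(y)
    models = {}
    for ci, cj in combinations(classes, 2):          # k(k-1)/2 pairwise classifiers
        mask = np.isin(y, [ci, cj])
        models[(ci, cj)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return classes, models

def one_vs_one_predict(X, classes, models):
    votes = np.zeros((len(X), len(classes)))
    index = {c: i for i, c in enumerate(classes)}
    for m in models.values():
        for row, winner in enumerate(m.predict(X)):  # each pairwise SVM votes for one class
            votes[row, index[winner]] += 1
    return classes[np.argmax(votes, axis=1)]         # "max wins"
```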

23 Kernels to Experiment With
Polynomial kernel: K(X_i, X_j) = (X_i · X_j + 1)^d
Gaussian kernel: K(X_i, X_j) = e^(-||X_i - X_j||² / σ²)
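Written out explicitly (d and σ are the hyperparameters to tune; the Gaussian form below uses the squared norm, which is the conventional definition):

```python
import numpy as np

def polynomial_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / sigma ** 2)
```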

24 SVM Tools - Weka
Data preprocessing: convert the data to ARFF format, then import the file

25 SVM Tools - Weka
Feature selection using SVM: Select attributes → SVMAttributeEval

26 SVM Tools - Weka
Multi-class classifier: Classify → Meta → MultiClassClassifier (handles multi-class datasets with 2-class classifiers)

27 SVM Tools - Weka
Multi-class SVM: Classify → Functions → SMO (Weka's SVM)

28 SVM Tools - Weka
Multi-class SVM options: Method → 1-against-1 or 1-against-all
Kernel options not found

29 Multi-class SVM Tools
Other tools include:
- SVMTorch (1-against-all)
- LibSVM (1-against-1)
- SVMlight

30 References
- Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", 2000
- Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines, 2000
- Ben-Dor et al., "Scoring Genes for Relevance", 2000
- Franc and Hlavac, "Multi-class Support Vector Machines"
- Furey et al., "Support vector machine classification and validation of cancer tissue samples using microarray expression data", 2000
- Guyon et al., "Gene Selection for Cancer Classification using Support Vector Machines", 2002
- Selikoff, "The SVM-Tree Algorithm: A New Method for Handling Multi-class SVM", 2003
- Shipp et al., "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning", 2002
- Weston, "Multi-class Support Vector Machines", Technical Report, 1998