Max-Margin Classification of Data with Absent Features, by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008. Presented by Chunping Wang, Machine Learning Group, Duke University, July 3, 2008.

Outline Introduction Standard SVM Max-Margin Formulation for Missing Features Three Algorithms Experimental Results Conclusions

Introduction (1) Patterns of missing features: due to measurement noise or corruption, the features exist but their values are unknown; due to inherent properties of the instances, the features are non-existing. Example 1: two subpopulations of instances (animals and buildings) share few overlapping features (body parts vs. architectural aspects). Example 2: in a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; a particular page, however, may have no such parents.

Introduction (2) Common methods for handling missing features (assuming the features exist but their values are unknown): single imputation with zeros, the mean, or kNN; imputation by building probabilistic generative models. Proposed method (assuming the features are structurally absent): each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features. We maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.

Standard SVM (1) Binary classification: real-valued predictors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{-1, +1\}$, $i = 1, \dots, n$. A classifier can be defined as $\hat{y} = \mathrm{sign}(f(x))$ based on a linear function $f(x) = w^T x + b$, with parameters $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.

Standard SVM (2) Functional margin for each instance: $\hat{\gamma}_i = y_i (w^T x_i + b)$. Geometric margin for each instance: $\gamma_i = y_i (w^T x_i + b) / \|w\|$. Geometric margin of a hyperplane: $\gamma = \min_i \gamma_i$. SVM: by fixing the functional margin to 1, i.e., $\min_i y_i (w^T x_i + b) = 1$, maximizing the geometric margin gives
$$\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$
where the $\xi_i$'s are slack variables and $C$ is the cost; this is a quadratic program (QP).
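For concreteness, a minimal numerical sketch of this soft-margin QP (not from the slides; cvxpy and the synthetic data are illustrative choices):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, C = 40, 5, 1.0                                 # instances, features, cost
X = rng.normal(size=(n, d))                          # synthetic predictors
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))      # labels in {-1, +1}

w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n, nonneg=True)                     # slack variables

# Soft-margin SVM primal: 1/2 ||w||^2 + C * sum(xi)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("geometric margin:", 1.0 / np.linalg.norm(w.value))
```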

Max-Margin Formulation for Missing Features (1) [Figure: a 2-D case with missing data, comparing the margin measured in the instance's own subspace with the margin measured in the full feature space.] Key point: the margin of instances with missing features is underestimated when measured in the full feature space.

Max-Margin Formulation for Missing Features (2) The margin of instance $i$ is measured in its own subspace: $\rho_i(w, b) = y_i (w_{(i)}^T x_i + b) / \|w_{(i)}\|$, where $w_{(i)}$ denotes the restriction of $w$ to the features present in instance $i$. Optimization problem: $\max_{w, b} \min_i \rho_i(w, b)$. The instance margin is non-convex in $w$; moreover, $\|w_{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization. It is difficult to solve this optimization problem directly.

Three Algorithms (1) A convex formulation for the linearly separable case. Introduce a lower bound $\gamma$ for the instance margins: maximize $\gamma$ subject to $y_i (w_{(i)}^T x_i + b) \ge \gamma \|w_{(i)}\|$ for all $i$. For a given $\gamma$, this is a second-order cone program (SOCP), which is convex and can be solved efficiently. To find the optimal $\gamma$, do a bisection search over $\gamma$. Unfortunately, extending this formulation to the non-separable case is difficult.
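A rough sketch of the bisection-over-$\gamma$ idea in this separable formulation (illustrative cvxpy code under an assumed NaN/mask data convention, not the authors' implementation):

```python
import numpy as np
import cvxpy as cp

def margin_feasible(X, y, M, gamma):
    """Feasibility check for margin gamma.
    X: (n, d) data with absent entries zero-filled; M: (n, d) boolean mask of present features."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    cons = [cp.norm(w, 2) <= 1.0]            # bound the scale so the search over gamma is meaningful
    for i in range(n):
        idx = np.where(M[i])[0]
        w_i = w[idx]                          # restriction of w to instance i's present features
        cons.append(gamma * cp.norm(w_i, 2) <= y[i] * (X[i, idx] @ w_i + b))
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status == cp.OPTIMAL

def max_margin_bisection(X, y, M, lo=0.0, hi=10.0, tol=1e-3):
    # Bisection search over gamma: the largest feasible value approximates the optimal margin
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if margin_feasible(X, y, M, mid) else (lo, mid)
    return lo
```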

Three Algorithms (2) Averaged norm: a convex approximation for the non-separable case. Define a single averaged norm from the instance-specific norms $\|w_{(i)}\|$ and use it in place of each $\|w_{(i)}\|$, which gets rid of the instance dependence in the denominator. The resulting non-separable problem then has the same form as the standard soft-margin SVM.

Three Algorithms (3) Geometric margin: an exact non-convex approach for the non-separable case. Define $s_i = \|w_{(i)}\| / \|w\|$, the fraction of the norm of $w$ lying in instance $i$'s subspace, so the instance margin becomes $y_i (w_{(i)}^T x_i + b) / (s_i \|w\|)$. For a given set of $s_i$'s (equivalently, after rescaling each instance by $1 / s_i$), the non-separable problem reduces to a standard soft-margin QP.

Three Algorithms (4) Pseudo-code for the geometric margin, the exact non-convex approach for the non-separable case: iterate between solving the QP for a fixed set of $s_i$'s and updating the $s_i$'s from the resulting $w$. Convergence is not always guaranteed; cross-validation is used to choose an early stopping point.
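A minimal sketch of that iterative procedure (illustrative code, not the authors' pseudo-code; absent entries are assumed to be stored as NaN and zero-filled so they do not contribute to $w^T x$):

```python
import numpy as np
import cvxpy as cp

def solve_soft_margin(X, y, C=1.0):
    """Standard soft-margin SVM QP; returns (w, b)."""
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n, nonneg=True)
    obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cons = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(obj, cons).solve()
    return w.value, b.value

def geom_margin_svm(X_nan, y, C=1.0, n_iter=10):
    M = ~np.isnan(X_nan)                   # mask of present features
    X = np.nan_to_num(X_nan)               # absent entries contribute 0 to w.x
    s = np.ones(len(y))                    # per-instance scaling, s_i = ||w_(i)|| / ||w||
    for _ in range(n_iter):                # convergence not guaranteed; early stopping in practice
        w, b = solve_soft_margin(X / s[:, None], y, C)
        norms = np.sqrt((M * w**2).sum(axis=1))            # ||w_(i)|| for every instance
        s = np.clip(norms / np.linalg.norm(w), 1e-6, None)
    return w, b
```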

Experimental Results (1) Methods compared:
Zero: missing values were set to zero.
Mean: missing values were set to the average value of the feature over all data.
Flag: additional features ("flags") were added, explicitly denoting whether a feature is missing for a given instance.
kNN: missing features were set to the mean value over the K nearest neighbor instances.
EM: a Gaussian mixture model is learned by iterating between (1) learning a GMM on the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.
Averaged norm (avg |w|): the proposed approximate convex approach.
Geometric margin (geom): the proposed exact non-convex approach.
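The single-imputation baselines are easy to reproduce; a hedged sketch with scikit-learn (the EM/GMM refilling loop is omitted):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# X_nan: data matrix with absent entries stored as NaN (illustrative convention)
X_nan = np.array([[1.0, np.nan, 3.0],
                  [2.0, 0.5, np.nan],
                  [np.nan, 1.5, 0.0]])

X_zero = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X_nan)  # "Zero"
X_mean = SimpleImputer(strategy="mean").fit_transform(X_nan)                      # "Mean"
X_knn  = KNNImputer(n_neighbors=2).fit_transform(X_nan)                           # "kNN"

# "Flag": append indicator columns marking which entries were missing
X_flag = np.hstack([X_zero, np.isnan(X_nan).astype(float)])
```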

Experimental Results (2) UCI data sets (missing at random): 90% of the features of each sample were removed at random. Digits 5 & 6 from MNIST: a patch covering 25% of the pixels was removed, with the location of the patch uniformly sampled.
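A small sketch of how such corruption can be generated (illustrative only; the exact protocol may differ from the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_random_features(x, frac=0.9):
    """Mark a random fraction of the features of one sample as absent (NaN)."""
    x = x.astype(float).copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x[idx] = np.nan
    return x

def drop_patch(img, frac=0.25):
    """Remove a square patch covering `frac` of the pixels, at a uniformly sampled location."""
    img = img.astype(float).copy()
    h, w = img.shape
    ph, pw = int(round(h * np.sqrt(frac))), int(round(w * np.sqrt(frac)))
    r, c = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    img[r:r + ph, c:c + pw] = np.nan
    return img
```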

Experimental Results (3) Visual object recognition. Task: determine whether an automobile is present in a given image. Pipeline: local edge information → generative model → likelihood of patches matching each of 19 landmarks → threshold to keep candidate patches (21-by-21 pixels, up to 10 per landmark) → PCA: first 10 principal components of each patch → concatenate into a feature vector (up to 1,900 features). If the number of candidates for a given landmark is less than ten, the remaining entries are considered structurally absent.
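A rough sketch of how such a feature vector with structurally absent blocks could be assembled (the names, shapes, and fitted PCA objects are illustrative assumptions, not the paper's code):

```python
import numpy as np

N_LANDMARKS, MAX_CAND, N_PCS = 19, 10, 10     # up to 19 * 10 * 10 = 1900 features

def build_feature_vector(candidates_per_landmark, pca_models):
    """candidates_per_landmark: list of 19 arrays, each (k_i, 21*21) with k_i <= 10.
    pca_models: one fitted PCA-like object per landmark, with a .transform method."""
    blocks = []
    for lm in range(N_LANDMARKS):
        patches = candidates_per_landmark[lm]
        block = np.full((MAX_CAND, N_PCS), np.nan)   # missing candidates stay NaN (structurally absent)
        if len(patches) > 0:
            block[:len(patches)] = pca_models[lm].transform(patches)[:, :N_PCS]
        blocks.append(block.ravel())
    return np.concatenate(blocks)                    # length 1900; NaN marks absent entries
```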

Experimental Results (4) An example image: the best 5 candidates matched to the front windshield landmark

Experimental Results (5) [Results figure.]

Experimental Results (6) Metabolic pathway reconstruction. [Figure: a fragment of the full metabolic pathway network; arrows denote chemical reactions, purple boxed names denote enzymes.]

Experimental Results (7) Three types of neighborhood relations between enzyme pairs:
- Linear chains (ARO7, PHA2)
- Forks (TRP2, ARO7): same input, different outputs
- Funnels (ARO9, PHA2): same output, different inputs
One feature vector represents an enzyme and concatenates features for its linear-chain neighbor, its fork neighbor, and its funnel neighbor. A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors; e.g., PHA2 does not have a neighbor of type fork.

Experimental Results (8) Task: identify whether a candidate enzyme is in the right "neighborhood". Data creation:
- Positive samples: from reactions with known enzymes (enzymes in the right "neighborhood").
- Negative samples: for each positive sample, replace the true enzyme with a random impostor, uniformly chosen from the set of other enzymes, and calculate the features in this wrong "neighborhood".
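A toy sketch of this data-creation scheme (the `featurize` routine and data structures are placeholders, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_dataset(known_reactions, all_enzymes, featurize):
    """known_reactions: list of (enzyme, neighborhood) pairs with known enzymes.
    featurize(enzyme, neighborhood) -> feature vector, NaN where a neighbor type is missing."""
    X, y = [], []
    for enzyme, nbhd in known_reactions:
        X.append(featurize(enzyme, nbhd)); y.append(+1)                  # positive: true enzyme
        impostor = rng.choice([e for e in all_enzymes if e != enzyme])   # uniform over other enzymes
        X.append(featurize(impostor, nbhd)); y.append(-1)                # negative: impostor in wrong neighborhood
    return np.array(X), np.array(y)
```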

Experimental Results (9) [Results figure.]

Conclusions
1. The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.
2. Instances are classified directly by skipping the non-existing features, rather than filling them with hypothetical values.
3. The proposed model was competitive with a range of single-imputation approaches when tested in missing-at-random (MAR) settings.
4. One variant (geometric margin) significantly outperformed other methods in two real problems with non-existing features.