Bell Laboratories
Intrinsic Complexity of Classification Problems
Tin Kam Ho
With contributions from Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia

Supervised Learning: Many Methods, Data-Dependent Performances
- Bayesian classifiers
- logistic regression
- linear & polynomial discriminators
- nearest neighbors
- decision trees & forests
- neural networks
- support vector machines
- ensemble methods, ...
There are no clear winners good for all problems; often, accuracy reaches a limit on a practical problem even with the best known method (a sketch of this data dependence follows).
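The transcript carries no code, but the point is easy to demonstrate. Below is a minimal scikit-learn sketch, where the two datasets are my illustrative stand-ins rather than examples from the talk; the ranking of methods typically changes from one problem to the next.

```python
# A minimal sketch, assuming scikit-learn; the datasets are stand-ins.
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "1-NN":     KNeighborsClassifier(n_neighbors=1),
    "logistic": LogisticRegression(max_iter=5000),
    "forest":   RandomForestClassifier(n_estimators=100, random_state=0),
}
for data_name, loader in [("breast_cancer", load_breast_cancer),
                          ("digits", load_digits)]:
    X, y = loader(return_X_y=True)
    for clf_name, clf in classifiers.items():
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{data_name:14s} {clf_name:9s} accuracy = {acc:.3f}")
```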

Accuracy Depends on the Goodness of Match between Classifiers and Problems
[Figure: on Problem A, NN (error = 0.06%) outperforms XCS (error = 1.9%); on Problem B, XCS (error = 0.6%) outperforms NN (error = 0.7%).]

Measuring Geometrical Complexity of Classification Problems
Our goal: tools and languages for studying
- characteristics of the geometry & topology of high-dimensional data sets,
- how they change with feature transformations and sampling,
- how they interact with classifier geometry.
We want to know:
- What are real-world problems like?
- What is my problem like?
- What can be expected of a method on a specific problem?

Parameterization of Data Complexity

Some Useful Measures of Geometric Complexity
- Fisher's Discriminant Ratio: a classical measure of class separability; maximize over all features to find the most discriminating one.
- Degree of Linear Separability: find a separating hyperplane by linear programming; error counts and distances to the plane measure separability.
- Length of Class Boundary: compute the minimum spanning tree and count class-crossing edges.
- Shapes of Class Manifolds: cover same-class points with maximal balls; the ball counts describe the shape of the class manifold.
(A sketch of the first and third measures follows this list.)
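Two of these measures are simple enough to sketch. The following is my own rough implementation of the Fisher-ratio and MST-boundary ideas (commonly labeled F1 and N1 in this literature), not the authors' code, for a two-class problem with labels 0 and 1.

```python
# A rough sketch of two complexity measures, assuming numpy/scipy.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def fisher_ratio(X, y):
    """F1: max over features of (mu0 - mu1)^2 / (var0 + var1)."""
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return float(np.max(num / (den + 1e-12)))

def boundary_fraction(X, y):
    """N1-style measure: fraction of minimum-spanning-tree edges that
    join points of opposite classes (one common formulation)."""
    y = np.asarray(y)
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    return float(np.mean(y[mst.row] != y[mst.col]))
```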

Using Complexity Measures to Study Problem Distributions
Real-world data sets: benchmarking data from the UC Irvine archive; 844 two-class problems, of which 452 are linearly separable and 392 are not.
Synthetic data sets: random labeling of randomly located points; 100 problems across a range of dimensions (the baseline is sketched below).
[Figure: scatter of problems over two complexity metrics, with random labelings, linearly separable real-world data, and linearly non-separable real-world data forming distinct groups.]
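The random-labeling baseline is easy to reproduce in spirit; here is a sketch reusing fisher_ratio and boundary_fraction from above. The sample sizes and dimensions are my guesses, since the slide omits them.

```python
# Synthetic baseline sketch: randomly located points with random labels.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 4, 8, 16):             # the slide does not list the exact dimensions
    X = rng.uniform(size=(200, dim))  # randomly located points
    y = rng.integers(0, 2, size=200)  # random class labels
    print(dim, fisher_ratio(X, y), boundary_fraction(X, y))
```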

Measures of Geometrical Complexity

Distribution of Problems in Complexity Space
[Figure: problems arranged along a continuum from linearly separable, to linearly non-separable, to randomly labeled.]

The First 6 Principal Components

Interpretation of the First 4 PCs
- PC 1 (50% of variance): linearity of the boundary and proximity of the opposite-class neighbor.
- PC 2 (12% of variance): balance between within-class scatter and between-class distance.
- PC 3 (11% of variance): concentration & orientation of intrusion into the opposite class.
- PC 4 (9% of variance): within-class scatter.
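The analysis behind these numbers is a standard PCA of the problems-by-measures matrix; here is a sketch in which the random matrix is only a placeholder for the real 844-problem measurements.

```python
# PCA sketch over a (problems x measures) matrix, assuming scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
M = rng.normal(size=(844, 9))             # placeholder for the real measure matrix
M = (M - M.mean(axis=0)) / M.std(axis=0)  # standardize each measure first
pca = PCA(n_components=6).fit(M)
print(pca.explained_variance_ratio_)      # the talk reports ~50/12/11/9% for PCs 1-4
```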

Problem Distribution in the 1st & 2nd Principal Components
- A continuous distribution.
- Known easy & difficult problems occupy opposite ends: linearly separable problems at one extreme, random labelings at the other.
- Few outliers; some empty regions.

Relating Classifier Behavior to Data Complexity

Class Boundaries Inferred by Different Classifiers
[Figure: boundaries learned by XCS (a genetic-algorithm-based classifier), the nearest-neighbor classifier, and a linear classifier.]
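Plots like these can be reproduced with any off-the-shelf toolkit. A sketch with scikit-learn follows, where a decision tree stands in for XCS (which has no standard scikit-learn implementation) and the moons data is an invented toy problem.

```python
# Decision-boundary sketch on a toy problem, assuming matplotlib/scikit-learn.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.3, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 2, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, clf in zip(axes, [DecisionTreeClassifier(max_depth=5),  # XCS stand-in
                          KNeighborsClassifier(n_neighbors=1),
                          LogisticRegression()]):
    Z = clf.fit(X, y).predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)        # inferred class regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)  # training points
    ax.set_title(type(clf).__name__)
plt.show()
```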

Domains of Competence of Classifiers
Which classifier works best for a given classification problem? Can data complexity give us a hint?
[Figure: a two-metric complexity space with candidate regions for NN, LC, XCS, and a decision forest.]

Domain of Competence Experiment
Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim.
Characterize 392 two-class problems from the UCI data, all shown to be linearly non-separable.
Evaluate 6 classifiers (a skeleton of the evaluation follows this list):
- NN (1-nearest neighbor)
- LP (linear classifier by linear programming)
- Odt (oblique decision tree)
- Pdfc (random subspace decision forest), an ensemble method
- Bdfc (bagging-based decision forest), an ensemble method
- XCS (a genetic-algorithm-based classifier)
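A skeleton of the evaluation loop follows. This is my reconstruction: LinearSVC, DecisionTreeClassifier, RandomForestClassifier, and BaggingClassifier stand in for LP, Odt, Pdfc, and Bdfc, and XCS is omitted for lack of a standard implementation.

```python
# Skeleton of the domain-of-competence evaluation, assuming scikit-learn.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "nn":   KNeighborsClassifier(n_neighbors=1),
    "lp":   LinearSVC(),               # stand-in for the LP-trained linear classifier
    "odt":  DecisionTreeClassifier(),  # axis-parallel, not truly oblique
    "pdfc": RandomForestClassifier(),  # stand-in for the subspace decision forest
    "bdfc": BaggingClassifier(),
}

def best_classifier(X, y):
    """Name of the candidate with the highest cross-validated accuracy."""
    scores = {name: cross_val_score(clf, X, y, cv=10).mean()
              for name, clf in candidates.items()}
    return max(scores, key=scores.get)
```

Pairing best_classifier(X, y) with each problem's complexity vector yields the kind of map shown on the next two slides.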

Identifiable Domains of Competence by NN and LP
[Figure: best classifier for the benchmarking data plotted in complexity space, with clearly identifiable NN and LP regions.]

Less Identifiable Domains of Competence
Regions in complexity space where the best classifier is one of nn, lp, odt versus an ensemble technique.
[Figure: three projections (Boundary vs. NonLinNN, IntraInter vs. Pretop, MaxEff vs. VolumeOverlap) with ensemble winners interleaved among nn/lp/odt winners.]

Difficulties in Estimating Data Complexity

Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density
[Figure: the same problem sampled with 2, 10, 100, 500, and 1000 points.]
A problem may appear deceptively simple or complex with small samples.
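The effect is easy to see numerically: subsample one problem at roughly the sizes on the slide and watch the boundary-length estimate drift. This sketch reuses boundary_fraction from the earlier sketch; the problem itself is invented.

```python
# Sampling-density sketch: apparent complexity of one problem vs. sample size.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 2))
y = (X[:, 1] > 0.3 * np.sin(6 * X[:, 0]) + 0.5).astype(int)  # a wavy true boundary
for n in (10, 100, 500, 1000):
    idx = rng.choice(len(X), size=n, replace=False)
    print(n, round(boundary_fraction(X[idx], y[idx]), 3))
```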

Uncertainty of Estimates at Two Levels
- Sparse training data in each problem, combined with complex geometry, makes the class boundaries ill-posed (uncertainty in feature space).
- A sparse sample of problems makes regions of dominant competence hard to identify (uncertainty in complexity space).

Complexity Estimates and Dimensionality Reduction
Feature selection/transformation may change the difficulty of a classification problem by
- widening the gap between classes,
- compressing the discriminatory information,
- removing irrelevant dimensions.
It is often unclear to what extent these happen; we seek a quantitative description of such changes (see the sketch below).
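One way to quantify this is to track a complexity measure along a forward-selection path. Here is a sketch using scikit-learn's SequentialFeatureSelector; the dataset is an illustrative stand-in, and fisher_ratio comes from the earlier sketch.

```python
# Forward-selection sketch: how a separability measure moves with subset size.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
for k in (2, 5, 10):
    sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=1),
                                    n_features_to_select=k,
                                    direction="forward").fit(X, y)
    print(k, fisher_ratio(sfs.transform(X), y))
```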

Spread of classification accuracy and geometrical complexity due to forward feature selection

Conclusions

Summary: Early Discoveries
- Problems distribute in a continuum in complexity space.
- Several key measures provide independent characterizations.
- There exist identifiable domains of classifiers' dominant competence.
- Sparse sampling, feature selection, and feature transformation induce variability in complexity estimates.

For the Future
- Further progress in statistical learning will need systematic, scientific evaluation of the algorithms, with problems that are difficult for different reasons.
- A "problem synthesizer" will be useful to provide a complete evaluation platform and reveal the "blind spots" of current learning algorithms.
- Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty and determine the applicability of data complexity methods.
Ongoing:
- DCoL: the Data Complexity Library
- ICPR 2010 Contest on Domain of Dominant Competence