Data Complexity Analysis: Linkage between Context and Solution in Classification
Tin Kam Ho
Bell Laboratories
With contributions from Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia

All Rights Reserved © Alcatel-Lucent
Pattern Recognition: Research vs. Practice
Steps to solve a practical pattern recognition problem:
[Diagram: Data Collection, Sensory Data, Feature Extraction, Feature Vectors, Classifier Training, Classifier, Classification, Decision]
Practical focus: study of the problem context.
Research focus: study of the mathematical solution.
Danger of disconnection.

Reconnecting Context and Solution
Data Complexity Analysis: analysis of the properties of feature vectors, linking the study of the problem context with the study of the mathematical solution.
To understand how such properties may impact the classification solution.
To understand how changes in the problem set-up and data collection procedures may affect such properties.
[Diagram labels: Feature Vectors; Improvements; Limitations; Expectations]

Focus is on Boundary Complexity
Kolmogorov complexity: a trivial description is to list all points and class labels. Is there a shorter description?
Boundary length can be exponential in dimensionality.

Early Discoveries
Problems distribute in a continuum in complexity space.
Several key measures provide independent characterization.
There exist identifiable domains of each classifier's dominant competency.
Feature selection and transformation induce variability in complexity estimates.

Parameterization of Data Complexity

Complexity Classes vs. Complexity Scales
Study is driven by observed limits in classifier accuracy, even with new, sophisticated methods (e.g., ensembles, SVM, …).
Analysis is needed for each instance of a classification problem, not just the worst case of a family of problems.
Linear separability: the earliest attempt to address classification complexity.
Observed in real-world problems: different degrees of linear non-separability.
A continuous scale is needed.

Some Useful Measures of Geometric Complexity
Fisher's Discriminant Ratio: classical measure of class separability; maximize over all features to find the most discriminating.
Degree of Linear Separability: find a separating hyperplane by linear programming; error counts and distances to the plane measure separability.
Length of Class Boundary: compute the minimum spanning tree; count class-crossing edges.
Shapes of Class Manifolds: cover same-class points with maximal balls; ball counts describe the shape of the class manifold.
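Two of the measures above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in the talk: it assumes a two-class problem, the ratio form (difference of class means squared over the sum of class variances), Euclidean distances, and a raw count of class-crossing edges with no normalization. The function names `fisher_ratio`, `mst_edges`, and `boundary_length` are hypothetical.

```python
import math

def fisher_ratio(X, y):
    """Maximum over features of (mean0 - mean1)^2 / (var0 + var1), two classes only."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"
    best = 0.0
    for j in range(len(X[0])):
        stats = []
        for c in classes:
            vals = [x[j] for x, lab in zip(X, y) if lab == c]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            stats.append((mean, var))
        (m0, v0), (m1, v1) = stats
        if v0 + v1 > 0:
            best = max(best, (m0 - m1) ** 2 / (v0 + v1))
    return best

def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges (i, j)."""
    n = len(X)
    in_tree = [False] * n
    dist = [math.inf] * n
    parent = [-1] * n
    dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < dist[v]:
                    dist[v], parent[v] = d, u
    return edges

def boundary_length(X, y):
    """Number of MST edges whose endpoints carry different class labels."""
    return sum(1 for i, j in mst_edges(X) if y[i] != y[j])

# Illustrative data (an assumption): two compact clusters of four points each.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y_easy = [0, 0, 0, 0, 1, 1, 1, 1]   # one class per cluster
y_hard = [0, 1, 0, 1, 0, 1, 0, 1]   # classes interleaved within clusters
print(fisher_ratio(X, y_easy), boundary_length(X, y_easy))
print(fisher_ratio(X, y_hard), boundary_length(X, y_hard))
```

With one class per cluster, the Fisher ratio is large and only the edge linking the two clusters crosses classes; with interleaved labels the ratio drops and more edges cross.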

Continuous Distributions in Complexity Space
Real-world data sets: benchmarking data from the UC-Irvine archive; 844 two-class problems, of which 452 are linearly separable and 392 non-separable.
Synthetic data sets: random labeling of randomly located points; 100 problems in dimensions
[Figure: Complexity Metric 1 vs. Metric 2; legend: random labeling, linearly separable real-world data, linearly non-separable real-world data]

Measures of Geometrical Complexity

The First 6 Principal Components

Interpretation of the First 4 PCs
PC 1 (50% of variance): linearity of boundary and proximity of opposite-class neighbor.
PC 2 (12% of variance): balance between within-class scatter and between-class distance.
PC 3 (11% of variance): concentration and orientation of intrusion into the opposite class.
PC 4 (9% of variance): within-class scatter.

Problem Distribution in 1st & 2nd Principal Components
Continuous distribution.
Known easy and difficult problems occupy opposite ends.
Few outliers.
Empty regions.
[Figure legend: random labels; linearly separable]

Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density
[Figure panels: 2 points, 10 points, 100 points, 500 points, 1000 points]
A problem may appear deceptively simple or complex with small samples.
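The small-sample effect can be illustrated numerically. The sketch below is an assumption-laden toy, not an experiment from the talk: it fixes a one-feature problem with classes drawn from N(0, 1) and N(1, 1), re-estimates Fisher's discriminant ratio on repeated samples, and reports how much the estimate scatters at two sample sizes. The names `fisher_ratio_1d` and `spread_of_estimate` are hypothetical.

```python
import random
import statistics

def fisher_ratio_1d(a, b):
    """Fisher's discriminant ratio for one feature: (mean_a - mean_b)^2 / (var_a + var_b)."""
    gap = statistics.fmean(a) - statistics.fmean(b)
    return gap ** 2 / (statistics.pvariance(a) + statistics.pvariance(b))

def spread_of_estimate(n_per_class, trials=200, seed=0):
    """Standard deviation of the estimated ratio over repeated samples of one fixed problem."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]  # class 0 ~ N(0, 1)
        b = [rng.gauss(1.0, 1.0) for _ in range(n_per_class)]  # class 1 ~ N(1, 1)
        estimates.append(fisher_ratio_1d(a, b))
    return statistics.pstdev(estimates)

small = spread_of_estimate(5)     # 5 points per class
large = spread_of_estimate(500)   # 500 points per class
print(f"spread with 5 points/class:   {small:.3f}")
print(f"spread with 500 points/class: {large:.3f}")
```

The underlying problem is identical in both runs; only the sampling density changes, yet the estimate from five points per class scatters far more widely around the population value of 0.5.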

Observations
Problems distribute in a continuum in complexity space.
Several key measures/dimensions provide independent characterization.
Further analysis is needed on the uncertainty in complexity estimates due to small-sample-size effects.

Relating Classifier Behavior to Data Complexity

Class Boundaries Inferred by Different Classifiers
[Figure panels: XCS (a genetic algorithm); nearest neighbor classifier; linear classifier]

Accuracy Depends on the Goodness of Match between Classifiers and Problems
[Figure: NN and XCS compared on Problem A and Problem B; reported errors are 0.06%, 1.9%, 0.6%, and 0.7%, with the better classifier marked in each comparison]

Domains of Competence of Classifiers
Given a classification problem, we want to determine which classifier is the best for it.
Can data complexity give us a hint?
[Figure: complexity space (Complexity Metric 1 vs. Metric 2) with regions labeled NN, LC, XCS, and Decision Forest; a query point marked "?" and "Here is my problem!"]

Domain of Competence Experiment
Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim.
Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable.
Evaluate 6 classifiers:
NN (1-nearest neighbor)
LP (linear classifier by linear programming)
Odt (oblique decision tree)
Pdfc (random subspace decision forest)
Bdfc (bagging-based decision forest)
XCS (a genetic-algorithm-based classifier)
Slide annotation: ensemble methods.
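As a rough illustration of one step in such an experiment, the sketch below evaluates the 1-nearest-neighbor classifier on a problem and computes a points-per-dimension ratio. The slides do not say how error rates were estimated, so leave-one-out is an assumption here, as are the Euclidean metric, the reading of Npts/Ndim as the number of points divided by the number of features, and the names `loo_1nn_error` and `points_per_dimension`.

```python
import math

def loo_1nn_error(X, y):
    """Leave-one-out error rate of the 1-nearest-neighbor rule (Euclidean distance)."""
    errors = 0
    for i, xi in enumerate(X):
        # Nearest neighbor of point i among all the other points.
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(xi, X[j]))
        if y[nearest] != y[i]:
            errors += 1
    return errors / len(X)

def points_per_dimension(X):
    """Number of points divided by number of features."""
    return len(X) / len(X[0])

# Illustrative data (an assumption): two compact clusters, one class each.
X_easy = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y_easy = [0, 0, 0, 0, 1, 1, 1, 1]
# Alternating labels along a line: every nearest neighbor has the other label.
X_hard = [(0,), (1,), (2,), (3,)]
y_hard = [0, 1, 0, 1]

print(loo_1nn_error(X_easy, y_easy), points_per_dimension(X_easy))
print(loo_1nn_error(X_hard, y_hard), points_per_dimension(X_hard))
```

Repeating this for each classifier and each problem, and recording which classifier has the lowest error alongside the problem's complexity measures, would give the kind of table from which domains of competence can be read off.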

Identifiable Domains of Competence by NN and LP
[Figure: Best Classifier for Benchmarking Data]

Less Identifiable Domains of Competence
Regions in complexity space where the best classifier is (nn, lp, or odt) vs. an ensemble technique.
[Figure panels: Boundary vs. NonLinNN; IntraInter vs. Pretop; MaxEff vs. VolumeOverlap. Legend: ensemble; + nn, lp, odt]

Uncertainty of Estimates at Two Levels
Sparse training data in each problem and complex geometry cause ill-posedness of class boundaries (uncertainty in feature space).
A sparse sample of problems causes difficulty in identifying regions of dominant competence (uncertainty in complexity space).

Complexity and Data Dimensionality: Class Separability after Dimensionality Reduction
Feature selection/transformation may change the difficulty of a classification problem:
Widening the gap between classes
Compressing the discriminatory information
Removing irrelevant dimensions
It is often unclear to what extent these happen.
We seek a quantitative description of such changes.
[Diagram labels: Feature selection; Discrimination]

Spread of classification accuracy and geometrical complexity due to forward feature selection

Designing a Strategy for Classifier Evaluation

A Complete Platform for Evaluating Learning Algorithms
To facilitate progress on learning algorithms, we need a way to systematically create learning problems that:
Provide a complete coverage of the complexity space
Are representative of all the known problems
That is, every classification problem arising in the real world should have a close neighbor representing it in the complexity space.
Is this possible?

Ways to Synthesize Classification Problems
Synthesizing data with targeted levels of complexity:
e.g., compute an MST over a uniform point distribution, then assign class-crossing edges randomly [Macia et al. 2008]
or, create partitions with increasing resolution
This can create a continuous cover of complexity space.
But are the data similar to those arising from reality?
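The MST-based idea can be sketched as follows. This is one possible reading of the slide, not the procedure of Macia et al.: it assumes uniform points in the unit square, two classes, and that "assigning class-crossing edges" means choosing a fixed number of MST edges at random and flipping the label across each of them. The names `mst_edges` and `synthesize` are hypothetical.

```python
import math
import random
from collections import defaultdict

def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges (i, j)."""
    n = len(X)
    in_tree = [False] * n
    dist = [math.inf] * n
    parent = [-1] * n
    dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < dist[v]:
                    dist[v], parent[v] = d, u
    return edges

def synthesize(n_points, n_crossing, seed=0):
    """Uniform points labeled so that exactly n_crossing MST edges join opposite classes."""
    rng = random.Random(seed)
    X = [(rng.random(), rng.random()) for _ in range(n_points)]
    edges = mst_edges(X)
    crossing = set(rng.sample(edges, n_crossing))  # edges chosen to cross classes
    adjacency = defaultdict(list)
    for i, j in edges:
        flip = (i, j) in crossing
        adjacency[i].append((j, flip))
        adjacency[j].append((i, flip))
    # Walk the tree from point 0, flipping the label across each chosen edge.
    y = [None] * n_points
    y[0] = 0
    stack = [0]
    while stack:
        u = stack.pop()
        for v, flip in adjacency[u]:
            if y[v] is None:
                y[v] = y[u] ^ int(flip)
                stack.append(v)
    return X, y, edges

X, y, edges = synthesize(n_points=50, n_crossing=10)
print(sum(1 for i, j in edges if y[i] != y[j]))  # number of class-crossing MST edges
```

Because the labels are propagated along the tree itself, the number of class-crossing MST edges equals the requested value, so the boundary-length measure can be dialed up or down directly.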

Ways to Synthesize Classification Problems
Synthesizing data to simulate natural processes, e.g., the Neyman-Scott process:
How many such processes have explicit models?
How many are needed to cover all real-world problems?
Systematically degrade real-world datasets: increase noise, reduce image resolution, …
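For readers unfamiliar with the Neyman-Scott process, here is a minimal sketch of one common variant (Gaussian displacement of offspring around Poisson-distributed parents, often called a Thomas process). How the talk would attach class labels to such a process is not stated; assigning one class per cluster is purely an assumption for illustration, as are the parameter values and the names `poisson` and `neyman_scott`.

```python
import math
import random

def poisson(rng, lam):
    """Poisson sample by Knuth's method: multiply uniforms until below exp(-lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def neyman_scott(mean_parents, mean_offspring, sigma, n_classes=2, seed=0):
    """Clustered points in the plane: only offspring are kept, parents are discarded."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(poisson(rng, mean_parents)):
        cx, cy = rng.random(), rng.random()        # parent location in the unit square
        label = rng.randrange(n_classes)           # assumption: one class per cluster
        for _ in range(poisson(rng, mean_offspring)):
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            labels.append(label)
    return points, labels

points, labels = neyman_scott(mean_parents=20, mean_offspring=10, sigma=0.02, seed=1)
print(len(points), sorted(set(labels)))
```

Increasing `sigma` makes clusters of different classes overlap more, which is one way such a generator could produce problems of graded difficulty.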

Simplification of Class Geometry

Manifold Learning and Dimensionality Reduction
Manifold learning techniques highlight intrinsic dimensions.
But the class boundary may not follow the intrinsic dimensions.

Manifold Learning and Dimensionality Reduction
Supervised manifold learning: seek mappings that exaggerate class separation [de Ridder et al., 2003].
Ideally, the mapping should be sought to directly minimize some measure of data complexity.

Seeking Optimizations Upstream
Back to the application context: use data complexity measures for guidance.
Change the setup or the definition of the classification problem.
Collect more samples, in finer resolution; extract more features …
Alternative representations: dissimilarity-based? [Pekalska & Duin 2005]
Data complexity gives an operational definition of learnability.
Optimization in the upstream: formalize the intuition of seeking invariance, and systematically optimize the problem setup and data acquisition scenario to reduce data complexity.

Recent Examples from the Internet

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
Also known as Reverse Turing Test or Human Interactive Proofs [von Ahn et al., CMU 2000].
Exploits limitations in the accuracy of machine pattern recognition.

The Netflix Challenge
$1 million prize for the first team to improve by 10% over the company's own recommender system.
But is the goal achievable? Do the training data support such a possibility?

Amazon's Mechanical Turk
"Crowd-sourcing" tedious human intelligence (pattern recognition) tasks.
Which ones are doable by machines?

Conclusions

Summary
Automatic classification is useful, but can be very difficult. We know the key steps and many promising methods, but we have not fully understood how they work or what else is needed.
We found measures of geometric complexity that are useful for characterizing the difficulties of classification problems and classifiers' domains of competence.
Better understanding of how data and classifiers interact can guide practice and re-establish the linkage between context and solution.

For the Future
Further progress in statistical and machine learning will need systematic, scientific evaluation of the algorithms with problems that are difficult for different reasons.
A "problem synthesizer" will be useful to provide a complete evaluation platform and reveal the "blind spots" of current learning algorithms.
Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty and determine the applicability of data complexity methods.