Design of Hierarchical Classifiers for Efficient and Accurate Pattern Classification
M N S S K Pavan Kumar
Advisor: Dr. C. V. Jawahar

Pattern Classification
Given a sample x, find the label corresponding to it. A classifier is an algorithm that takes x and returns a label between 1 and N.
Binary classification: N = 2
Multiclass classification: N > 2
Evaluation is usually reported as the probability of correct classification.

Multiclass Classification
Many standard approaches:
Neural networks, decision trees
Direct extensions of binary classifiers
Combinations of component (binary) classifiers

Decision Directed Acyclic Graph
[Figure: a DDAG over 5 classes with pairwise nodes (1,5), (2,5), (1,4), (3,5), (2,4), (1,3), (4,5), (3,4), (2,3), (1,2) and leaves 1-5]
Example: a sample x from class 3 is routed from the root node (1,5) down to the leaf for class 3.
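The routing rule can be made concrete with a small sketch (not the thesis code): a DDAG over N classes keeps a list of surviving classes and lets each pairwise node eliminate one of them, so a sample passes through exactly N - 1 nodes. The binary rule `decide(i, j, x)` below is a hypothetical stand-in for the trained pairwise classifiers.

```python
# Minimal sketch of DDAG evaluation; `decide(i, j, x)` is a hypothetical
# stand-in for a trained binary classifier that returns the winning label.

def ddag_classify(x, class_list, decide):
    """Route x through a DDAG built over `class_list`.

    Maintains the list of surviving classes; each pairwise node compares
    the first and last surviving classes and eliminates the loser, so
    exactly len(class_list) - 1 decisions are made.
    """
    survivors = list(class_list)
    while len(survivors) > 1:
        i, j = survivors[0], survivors[-1]
        winner = decide(i, j, x)
        if winner == i:
            survivors.pop()       # eliminate j
        else:
            survivors.pop(0)      # eliminate i
    return survivors[0]

if __name__ == "__main__":
    # Toy 1-D example: the pairwise rule picks whichever class "centre"
    # (here simply the class index) is closer to x.
    decide = lambda i, j, x: i if abs(x - i) < abs(x - j) else j
    print(ddag_classify(3.2, [1, 2, 3, 4, 5], decide))   # -> 3
```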

Decision Directed Acyclic Graph
[Figure: the same 5-class DDAG, now tracing the path for a sample x from class 5]

Decision Directed Acyclic Graph
[Figure: the same 5-class DDAG, now tracing the path for a sample x from class 4]

Decision Directed Acyclic Graph
[Figure: the same 5-class DDAG]
There are multiple paths by which a sample can reach the correct leaf.

Decision Directed Acyclic Graph
[Figure: the same 5-class DDAG]
A DDAG can be improved by improving its individual (pairwise) nodes.

Decision Directed Acyclic Graph
[Figure: the same 5-class DDAG]
A DDAG can be improved by improving individual nodes; the architecture is fixed for a given sequence of classes.

Decision Directed Acyclic Graph
[Figure: the DDAG rebuilt with the class order changed]
A DDAG can also be improved by changing the class order.

Features at Each Node
Images as features: computer vision problems involve a large number of features.
Principal Component Analysis (PCA) projects the data onto the axes that preserve maximum variance.
PCA is good for representation but not for discrimination.

Features at Each Node
Pairwise Linear Discriminant Analysis (LDA) is more effective: Fisher linear discriminant, optimal discriminant vectors.
However, it requires a large number of feature extractions and a large number of projection matrices to be stored.
LDA performs better than PCA, but is computationally expensive.
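As a concrete illustration of what one node's pairwise discriminant looks like, here is a minimal Fisher LDA sketch for a single class pair, assuming the two classes' samples are given as rows of Xi and Xj. The thesis also uses optimal discriminant vectors; this is only the textbook two-class case.

```python
import numpy as np

def pairwise_fisher_direction(Xi, Xj, reg=1e-6):
    """Fisher linear discriminant for one DDAG node (class i vs class j).

    Returns the direction w = Sw^{-1} (mi - mj), where Sw is the pooled
    within-class scatter.  `reg` adds a small ridge so the solve is
    well-conditioned even for high-dimensional image features.
    """
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    Sw = np.cov(Xi, rowvar=False) * (len(Xi) - 1) \
       + np.cov(Xj, rowvar=False) * (len(Xj) - 1)
    Sw += reg * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mi - mj)
    return w / np.linalg.norm(w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xi = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
    Xj = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
    w = pairwise_fisher_direction(Xi, Xj)
    # The 1-D projections of the two classes are well separated along w.
    print((Xi @ w).mean(), (Xj @ w).mean())
```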

Solution
[Figure: a 4-class DDAG with pairwise nodes (1,4), (2,4), (1,3), (3,4), (2,3), (1,2), built up step by step, each node (i,j) getting its own pairwise LDA transform Mij]

Solution
[Figure: the 4-class DDAG with all six transforms M12, M13, M14, M23, M24, M34 attached]
4 classes, 6 pairwise classifiers, 6 dimensionality reductions.
Total number of features extracted for one evaluation: (N - 1) * reduced_dimension.

Solution
Example: 400 classes with 400 features reduced to 50 per pair results in 399000 projections overall, and 19950 for a single DDAG evaluation.

Solution
Pairwise LDA is effective, but highly complex in space and time.

Solution
Stack all the pairwise transformations into a single matrix:
M = [M12; M13; M14; M23; M24; M34]

Solution
This stacked matrix M is rank deficient, so a reduced representation can be used.

Solution
M is rank deficient and has many similar rows; clustering, SVD, etc. may be used to compress it.
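A hedged sketch of the compression idea: stack the per-pair projection matrices into M and keep a small set of shared directions via an SVD. The energy threshold, the synthetic data, and the function name are illustrative assumptions, not the thesis's exact procedure (which also considers clustering of similar rows).

```python
import numpy as np

def compress_stacked_projections(transforms, energy=0.99):
    """Stack per-pair projection matrices into one matrix M and compress it.

    `transforms` is a list of (d_pair, D) matrices, one per class pair.
    The stacked M is typically rank deficient (many pairs share similar
    discriminant directions), so the SVD keeps only as many right singular
    vectors as needed to retain `energy` of the spectrum.  The returned
    (k, D) matrix is a single shared transform: features are extracted
    once per sample instead of once per pair.
    """
    M = np.vstack(transforms)                      # shape (sum of d_pair, D)
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1
    return Vt[:k]                                  # compressed projection

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = 50
    # Fake pairwise transforms that all live near a 5-D subspace.
    basis = rng.normal(size=(5, D))
    transforms = [rng.normal(size=(3, 5)) @ basis
                  + 0.01 * rng.normal(size=(3, D))
                  for _ in range(45)]               # e.g. 10 classes -> 45 pairs
    W = compress_stacked_projections(transforms)
    print(W.shape)   # roughly (5, 50): one small shared projection
```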

Remarks
Only one-time feature extraction.
Results in a reduced LDA matrix that retains the discriminant capacity.

Accuracy (%):
Dataset     Compressed LDA   Boosted
Pendigits   95.99            97.24
Optdigits   98.34            98.92

Feature counts and accuracies (Direct, PCA, LDA, Compressed LDA features):
Dataset     Feat   Acc
Pendigits   16     95.77
            13     95.46
            225    95.99
            11
Optdigits   64     97.9
            41     98.63
            36     98.34

Motivating Example
Priors: {0.3, 0.1, 0.2, 0.4}; all pairwise classifiers are 90% correct.
Expected accuracy: 0.3*(0.9)^3 + 0.1*(0.5)*(0.9)^2 + 0.2*(0.5)*(0.9)^2 + 0.4*(0.9)^3
[Figure: two 4-class DDAGs, one for the original class order and one for the reordered list 2, 1, 4, 3]
Accuracy: 80.28% for the original order vs. 88.92% after reordering, a 43.8% reduction in error!
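The exact analytical expression behind the two accuracy figures is only partially visible on the slide, so the sketch below estimates DDAG accuracy by Monte-Carlo simulation under two stated assumptions: every pairwise node is correct with probability q when the true class is involved, and eliminates either of its two (wrong) classes with probability 0.5 otherwise. It illustrates the same effect (high-prior classes near the centre help) but will not reproduce the 80.28% / 88.92% figures exactly.

```python
import random

def ddag_expected_accuracy(order, priors, q, trials=200_000, seed=0):
    """Monte-Carlo estimate of DDAG accuracy for a given class order.

    Assumptions (for illustration only, not from the thesis): each
    pairwise node is correct with probability q when the true class is
    one of its two classes, and eliminates either class with probability
    0.5 when the true class is not involved.
    """
    rng = random.Random(seed)
    classes = list(priors)
    correct = 0
    for _ in range(trials):
        true = rng.choices(classes, weights=[priors[c] for c in classes])[0]
        survivors = list(order)
        while len(survivors) > 1:
            i, j = survivors[0], survivors[-1]
            if true in (i, j):
                wrong = j if true == i else i
                loser = wrong if rng.random() < q else true
            else:
                loser = i if rng.random() < 0.5 else j
            survivors.remove(loser)
        correct += survivors[0] == true
    return correct / trials

priors = {1: 0.3, 2: 0.1, 3: 0.2, 4: 0.4}
print(ddag_expected_accuracy([1, 2, 3, 4], priors, q=0.9))  # high-prior classes at the ends
print(ddag_expected_accuracy([2, 1, 4, 3], priors, q=0.9))  # high-prior classes at the centre
```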

Formulation
Number of classes: N; priors: P_i; error at each node: q.
Relevant path length for the class at position i: max(N - i, i - 1).
Number of relevant paths of length l to node r: N_rl.
The optimal ordering maximizes the expected probability of correct classification; high-prior classes should prefer central positions in the list.

Disadvantage of a DDAG
A DDAG can provide only a class label.
A new DDAG classification protocol is proposed; the previous formulation is insufficient for it.

Maximizing DDAG Accuracy
[Figure: a 4-class DDAG with a generic pairwise node (i, j) highlighted]

DDAG Design is NP-Hard
Constructing an optimal decision tree is NP-hard, and DDAG design is related to optimal decision tree construction by reduction.
Approximate algorithms are therefore the only practical resort.

Proposed Algorithms
Three greedy algorithms:
Prefer high-prior classes at the centre of the DDAG.
Prefer high-performance classifiers at the root nodes of the DDAG.
Prefer high-error classes at the centre of the DDAG.
Empirical results show that the approximation error is close to half that of the optimal graph.
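A sketch of the first heuristic only (put high-prior classes at the centre), not the thesis's exact algorithm:

```python
def centre_high_priors(priors):
    """Greedy ordering sketch: place high-prior classes near the centre.

    Sorts classes by descending prior and fills the ordering from the
    centre outwards, alternating left/right, so the largest priors get
    the shortest relevant path lengths.
    """
    ranked = sorted(priors, key=priors.get, reverse=True)
    order = []
    for k, c in enumerate(ranked):
        if k % 2 == 0:
            order.append(c)        # grow to the right of the centre
        else:
            order.insert(0, c)     # grow to the left of the centre
    return order

print(centre_high_priors({1: 0.3, 2: 0.1, 3: 0.2, 4: 0.4}))   # -> [2, 1, 4, 3]
```

For the example priors {0.3, 0.1, 0.2, 0.4} this yields the order 2, 1, 4, 3, the reordering shown in the motivating example.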

Complexities of Classification
Classifier    Space     T-Best    T-Worst   T-Avg
1 vs Rest     O(N)
1 vs 1        O(N^2)
DDAG          O(N^2)    O(N)
BHC           O(N)      O(1)      O(log(N))

Binary Hierarchical Classifiers
[Figure: an example BHC over 5 classes; the root separates {1,4,5} vs {2,3}, then 4 vs {1,5} and 2 vs 3, and finally 1 vs 5]

Graph Partitioning
[Figure: the same 5-class data shown as raw data and as a similarity graph, with two candidate root-node partitions: {1,4} vs {2,3,5} and {1,2,4,5} vs {3}]
We prefer linear cuts with a large margin.
The two views come with different objectives: compact clusters vs. maximizing the cut.
None of the partitioning schemes is universally good for all problems (No Free Lunch theorem).

Graph Partitioning
[Figure: the graph-based and data-based partitions side by side]
Simple workaround: use the locally best partition at each node.
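To show the recursive structure of building a BHC, here is a sketch that splits the set of classes with a deliberately simple stand-in criterion (the largest gap along the top principal direction of the class means); the thesis's graph-cut and large-margin partitions would replace this criterion.

```python
import numpy as np

def build_bhc(class_means, classes=None):
    """Recursively partition a set of classes into a binary tree (sketch).

    `class_means` maps class label -> mean feature vector.  Returns a
    nested tuple (left_subtree, right_subtree) with class labels at the
    leaves.  The split criterion here is only a stand-in: project the
    class means onto their top principal direction and cut at the
    largest gap.
    """
    if classes is None:
        classes = list(class_means)
    if len(classes) == 1:
        return classes[0]
    M = np.array([class_means[c] for c in classes])
    M = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    proj = M @ Vt[0]
    ranked = [c for _, c in sorted(zip(proj, classes))]
    gaps = np.diff(sorted(proj))
    cut = int(np.argmax(gaps)) + 1            # split at the largest gap
    return (build_bhc(class_means, ranked[:cut]),
            build_bhc(class_means, ranked[cut:]))

if __name__ == "__main__":
    means = {1: np.array([0.0, 0.0]), 2: np.array([0.2, 0.1]),
             3: np.array([5.0, 5.0]), 4: np.array([5.2, 4.9]),
             5: np.array([0.1, 5.0])}
    print(build_bhc(means))
```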

Margin Improvement
[Figure: removing class 2 from one side of the partition improves the margin]
Let some classes appear on both sides; do not insist on mutually exclusive partitions.

Trees with Overlapping Partitions
[Figure: an example tree over 6 classes with overlapping partitions; the root split 1,2 – 3 – 4,5,6 places class 3 on both sides, and the splits below (1,2 – 3; 3,4 – 5 – 6; 3,4 – 5) overlap similarly]

Comments
The classification complexity remains O(log(N)).
A different criterion is used for removing bad classes from a partition.

Configurable Hybrid Classifiers
DDAG: high accuracy, large size.
BHC: moderate accuracy, small size.
Take advantage of both: if the classification at a node is easy, use a BHC; otherwise use a DDAG.

Results on OCR datasets

Classifiability
Use the expected error to select appropriate classifiers: how easy or difficult is it to classify a given set of classes?
Computable from co-occurrence matrices.
We proposed a pairwise classifiability measure:
L_pairwise = (2 / (N(N - 1))) ∑_{i<j} L_ij
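A small sketch of the measure as written on the slide, assuming the per-pair scores L_ij (e.g. pairwise error estimates derived from a confusion or co-occurrence matrix) are already available:

```python
from itertools import combinations

def pairwise_classifiability(L, classes):
    """L_pairwise = 2 / (N (N - 1)) * sum over pairs i < j of L_ij.

    `L` maps an unordered class pair to its per-pair score (e.g. an
    estimated pairwise error).  The result is the average pairwise
    difficulty of the class set, usable for deciding whether the group
    is "easy" (a BHC split suffices) or "hard" (spend a DDAG on it).
    """
    n = len(classes)
    total = sum(L[frozenset(p)] for p in combinations(classes, 2))
    return 2.0 * total / (n * (n - 1))

if __name__ == "__main__":
    # Hypothetical pairwise error estimates for 4 classes.
    L = {frozenset(p): e for p, e in
         {(1, 2): 0.02, (1, 3): 0.10, (1, 4): 0.01,
          (2, 3): 0.05, (2, 4): 0.03, (3, 4): 0.12}.items()}
    print(pairwise_classifiability(L, [1, 2, 3, 4]))   # mean pairwise error
```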

Generalization Capacity of Proposed Algorithms
Generalization refers to the probability of error a classifier makes on unseen samples.
Large margin: better features in a DDAG, better partitions in a BHC.
Use a classifier of only the required complexity at each step (Occam's Razor): efficient feature representations require less complex classifiers, and simpler partitions in a BHC require less complex classifiers.
Architecture-level generalization: hybrid classifiers use an architecture of the required complexity at each node, thereby improving generalization.
We have demonstrated the generalization of the algorithms empirically.

Conclusions
Formulation, analysis, and algorithms are presented:
to design DDAGs using robust feature representations,
to design DDAGs using node reordering,
to design hierarchical classifiers with better generalization,
to design hybrid hierarchical classifiers.

Future Work
Design based on simple algorithms may improve the current "high-performance" classifiers.
Promising directions:
Feature-based partitioning vs. class-based partitioning,
Trees with overlapping partitions,
Efficient DDAG design algorithms,
Configurability in classifier design.

Thank You