Thesis title: “Studies in Pattern Classification – Biological Modeling, Uncertainty Reasoning, and Statistical Learning”. Three parts: (1) Handwritten Digit Recognition with a Vision-Based Model (part in CVPR-2000); (2) An Uncertainty Framework for Classification (UAI-2000); (3) Selection of Support Vector Kernel Parameters (ICML-2000).
Handwritten Digit Recognition with a Vision-Based Model Loo-Nin Teow & Kia-Fock Loe School of Computing National University of Singapore
OBJECTIVE To develop a vision-based system that extracts features for handwritten digit recognition based on the following principles: –Biological Basis; –Linear Separability; –Clear Semantics.
Developing the model 2 main modules: Feature extractor –generates feature vector from raw pixel map. Trainable classifier –outputs the class based on the feature vector.
General System Structure
Handwritten Digit Recognizer: Raw Pixel Map → Feature Extractor → Feature Vector → Feature Classifier → Digit Class
The Biological Visual System
Visual pathway: Eye → Optic nerve → Optic chiasm → Optic tract → Lateral geniculate nucleus → Optic radiation → Primary visual cortex
Receptive Fields
Each visual cell responds only to input from its receptive field in the visual map, producing output activations.
Simple Cell Receptive Fields
Simple Cell Responses Cases with activation Cases without activation
Hypercomplex Receptive Fields
Hypercomplex Cell Responses Cases without activation Cases with activation
Biological Vision Local spatial features; Edge and corner orientations; Dual-channel (bright/dark; on/off); Non-hierarchical feature extraction.
The Feature Extraction Process
I (2 channels of 36×36) → Selective Convolution → Q (32 maps of 32×32) → Feature Aggregation → F (32 maps of 9×9)
Dual Channel
On-channel = intensity-normalize(image)
Off-channel = complement(on-channel)
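As a hedged illustration of this step (the function name and the min–max normalization are assumptions, not necessarily the authors' exact procedure), the dual-channel split can be sketched as:

```python
import numpy as np

def dual_channel(image):
    """Sketch of the dual-channel idea: the on-channel is the image with
    intensities normalized to [0, 1]; the off-channel is its complement,
    so bright-on-dark and dark-on-bright structure are represented
    symmetrically (as in the bright/dark visual channels)."""
    img = np.asarray(image, dtype=float)
    rng = img.max() - img.min()
    on = (img - img.min()) / rng if rng > 0 else np.zeros_like(img)
    off = 1.0 - on          # complement of the on-channel
    return on, off
```

Both channels are then fed through the same convolution stage, so each mask template is effectively applied to bright and dark structure alike.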
Selective Convolution Local receptive fields –same spatial features at different locations. Truncated linear halfwave rectification –strength of feature’s presence. “Soft” selection based on central pixel –reduce false edges and corners.
Selective Convolution (formulae)
Q_k(x, y) = s(I(x, y)) · r( Σ_{i,j} M_k(i, j) I(x + i, y + j) )
where r(u) = min(max(u, 0), θ) is the truncated linear halfwave rectification and s(·) is the soft selection weight derived from the central pixel.
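The slide's exact formulae were not preserved, so the following is only a sketch of selective convolution as described in the bullet points; the rectification threshold `theta` and the form of the central-pixel gating are assumptions:

```python
import numpy as np

def selective_convolution(channel, mask, theta=1.0):
    """Sketch of selective convolution: each output is the mask response at
    a local receptive field, passed through truncated linear halfwave
    rectification r(u) = min(max(u, 0), theta), and 'softly' gated by the
    central pixel so features fire only where ink is present (reducing
    false edges and corners)."""
    h, w = mask.shape
    H, W = channel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = channel[y:y + h, x:x + w]
            resp = float(np.sum(patch * mask))
            resp = min(max(resp, 0.0), theta)   # truncated halfwave rectification
            center = patch[h // 2, w // 2]      # soft selection by central pixel
            out[y, x] = center * resp
    return out
```

With a 36×36 input and 5×5 masks this would yield the 32×32 feature maps shown in the pipeline; one output map per mask template.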
Convolution Mask Templates Simplified models of the simple and hypercomplex receptive fields. Detect edges and end-stops of various orientations. Corners - more robust than edges –On-channel end-stops : convex corners; –Off-channel end-stops : concave corners.
Some representatives of the 16 mask templates used in the feature extraction
Feature Aggregation Similar to subsampling: –reduces number of features; –reduces dependency on features’ positions; –local invariance to distortions and translations. Different from subsampling: –magnitude-weighted averaging; –detects presence of feature in window; –large window overlap.
Feature Aggregation (formulae)
Magnitude-Weighted Average: F = Σ_w Q(x, y)·Q(x, y) / Σ_w Q(x, y)
where the sums run over the aggregation window w and Q is the rectified (non-negative) feature map, so strong responses dominate the average.
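The aggregation formula itself did not survive extraction; a minimal sketch, assuming the magnitude-weighted average uses the (non-negative, rectified) responses as their own weights:

```python
import numpy as np

def magnitude_weighted_average(window):
    """Sketch of magnitude-weighted averaging over one aggregation window:
    each response is weighted by its own magnitude, sum(Q*Q) / sum(Q), so
    a strong response anywhere in the window dominates -- the window
    signals the *presence* of the feature, unlike plain subsampling."""
    w = np.asarray(window, dtype=float)
    total = w.sum()
    return (w * w).sum() / total if total > 0 else 0.0
```

Applied over large, overlapping windows this maps each 32×32 feature map down to 9×9 while staying locally invariant to small distortions and translations.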
Classification
Linear discrimination systems:
– Single-layer perceptron network: minimize the cross-entropy cost function.
– Linear support vector machines: maximize the interclass margin width.
k-nearest neighbor:
– Euclidean distance;
– Cosine similarity.
Multiclass Classification Schemes for linear discrimination systems:
– One-per-class (1 vs 9)
– Pairwise (1 vs 1)
– Triowise (1 vs 2)
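The number of output units each scheme requires for 10 digit classes follows from simple counting (the factor of 3 for triowise is an assumption, but it is consistent with the 360 perceptron units reported later):

```python
from math import comb

n_classes = 10
one_per_class = n_classes          # one output unit per class
pairwise = comb(n_classes, 2)      # one 1-vs-1 unit per unordered pair
triowise = 3 * comb(n_classes, 3)  # three 1-vs-2 units per unordered trio
print(one_per_class, pairwise, triowise)  # 10 45 360
```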
Experiments
MNIST database of handwritten digits: 60,000 training samples, 10,000 test samples.
36×36 input image; 32 feature maps of 9×9.
Preliminary Experiments
Compared on 60,000 training and 10,000 test samples:
– Perceptron network: 1-per-class; pairwise (hard/soft voting); triowise (hard/soft voting)
– Linear SVMs: pairwise (hard/soft voting); triowise (hard/soft voting)
– k-nearest neighbor: Euclidean distance (k = 3); cosine similarity (k = 3)
Experiments on Deslanted Images
– Perceptron network: pairwise (hard/soft voting); triowise (hard/soft voting)
– Linear SVMs: pairwise (hard/soft voting); triowise (hard/soft voting)
– Best result: linear SVMs, triowise, soft voting: 0.00% train error / 0.59% test error*
Misclassified Characters
Comparison with Other Models
Classifier model: test error (%)
– Boosted LeNet-4 [distort]: 0.70
– LeNet-5 [distort]: 0.80
– Tangent distance: 1.10
– Virtual SVM: 0.80
– Our model [deslant]: 0.59*
Conclusion Our model extracts features that are –biologically plausible; –linearly separable; –semantically clear. Needs only a linear classifier –relatively simple structure; –trains fast; –gives excellent classification performance.
Hierarchy of Features?
Idea originated from Hubel & Wiesel: LGN → simple → complex → hypercomplex; later studies show these stages to be parallel rather than strictly hierarchical.
A hierarchy implies too many feature combinations; it is simpler to have only one convolution layer.
Linear Discrimination
Output: y = g(f(x))
where f defines a hyperplane: f(x) = w·x + b
and g is the activation function: g(u) = 1 / (1 + e^(−u)) (logistic) or g(u) = u (linear)
One-per-class Classification: the unit with the largest output value indicates the class of the character: class(x) = argmax_c y_c(x)
Pairwise Classification
Soft Voting: class(x) = argmax_c Σ_{c'≠c} y_{c,c'}(x)
Hard Voting: class(x) = argmax_c Σ_{c'≠c} step(y_{c,c'}(x))
where y_{c,c'} is the output of the discriminant trained to separate class c from class c', and step(u) = 1 if u > 0, else 0.
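A hedged sketch of pairwise voting (the dictionary layout and the threshold at zero are assumptions): soft voting sums the raw discriminant outputs, hard voting sums thresholded {0, 1} votes.

```python
import numpy as np

def pairwise_vote(scores, n_classes, hard=False):
    """scores[(i, j)] is the output of the discriminant trained to separate
    class i (positive) from class j (negative), for i < j.  Each discriminant
    contributes to both classes' tallies; the class with the largest tally
    wins."""
    tally = np.zeros(n_classes)
    for (i, j), y in scores.items():
        if hard:
            tally[i] += y > 0       # one vote to the winner of the pair
            tally[j] += y <= 0
        else:
            tally[i] += y           # signed output counts for i ...
            tally[j] += -y          # ... and against it for j
    return int(np.argmax(tally))
```

The triowise scheme follows the same tally pattern, with each 1-vs-2 unit voting for its positive class across all trios containing it.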
Triowise Classification
Soft Voting: class(x) = argmax_c Σ_t y_{t,c}(x), summed over all trios t containing class c
Hard Voting: class(x) = argmax_c Σ_t step(y_{t,c}(x)), where step(u) = 1 if u > 0, else 0.
k-Nearest Neighbor
Euclidean Distance: d(x, z) = ‖x − z‖
Cosine Similarity: s(x, z) = x·z / (‖x‖ ‖z‖)
where x is the input feature vector and z a stored training vector.
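A self-contained sketch of the k-NN classifier with the two measures above (the function name and tie-breaking by `np.unique` order are assumptions):

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=3, metric="euclidean"):
    """Classify x by majority vote among its k nearest training vectors,
    using either Euclidean distance (smaller is closer) or cosine
    similarity (larger is closer)."""
    x = np.asarray(x, dtype=float)
    if metric == "euclidean":
        d = np.linalg.norm(train_X - x, axis=1)
        idx = np.argsort(d)[:k]        # k smallest distances
    else:  # cosine similarity (assumes no zero vectors)
        sims = train_X @ x / (np.linalg.norm(train_X, axis=1) * np.linalg.norm(x))
        idx = np.argsort(-sims)[:k]    # k largest similarities
    labels, counts = np.unique(train_y[idx], return_counts=True)
    return labels[np.argmax(counts)]   # majority label among the k neighbors
```

With the extracted 9×9 feature maps flattened into vectors, this is the k = 3 configuration reported in the experiment tables.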
Confusion Matrix (triowise SVMs / soft voting / deslanted)
Number of iterations to convergence for the perceptron network
Scheme: # units / # epochs
– 1-per-class: 10 / –
– Pairwise: 45 / –
– Triowise: 360 / 147