Prototype-based models


Prototype-based models in machine learning
Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen
www.cs.rug.nl/biehl

Review: WIREs Cognitive Science (2016)

Overview
1. Introduction / Motivation: prototypes, exemplars; neural activation / learning
2. Unsupervised Learning: Vector Quantization (VQ); Competitive Learning in VQ and Neural Gas; Kohonen's Self-Organizing Map (SOM)
3. Supervised Learning: Learning Vector Quantization (LVQ); Adaptive distances and Relevance Learning (examples: three bio-medical applications)
4. Summary

1. Introduction
Prototypes, exemplars: representation of information in terms of typical representatives (e.g. of a class of objects); a much-debated concept in cognitive psychology.
Neural activation / learning: an external stimulus reaches a network of neurons; each neuron responds according to its weights (expected inputs); the best matching unit (and its neighbors) respond most strongly; learning leads to an even stronger response to the same stimulus in the future; the weights represent different expected stimuli (prototypes).

Even independent of the above, prototype-based models offer an attractive framework for machine-learning-based data analysis:
- the trained system is parameterized in the feature space (the space of the data), which facilitates discussions with domain experts
- transparent (white box): provides insight into the applied criteria (classification, regression, clustering, etc.)
- easy to implement, computationally efficient
- versatile, successfully applied in many different application areas

2. Unsupervised Learning
Some potential aims:
dimension reduction: compression; visualization for human insight; principal / independent component analysis
exploration / structure detection: clustering; similarities / dissimilarities; source identification; density estimation; neighborhood relations, topology
pre-processing for further analysis: supervised learning, e.g. classification, regression, prediction

Vector Quantization (VQ)
Vector Quantization: identify a (small) set of typical representatives of the data which capture its essential features.
VQ system: set of prototypes w_k; data: set of feature vectors x^μ.
Assignment based on a dissimilarity/distance measure d(w, x): given a vector x^μ, determine the winner w* = argmin_k d(w_k, x^μ) and assign x^μ to it.
One popular example: the (squared) Euclidean distance d(w, x) = (w − x)^2.

Competitive learning
Initialization: randomized w_k, e.g. placed in randomly selected data points.
Random sequential (repeated) presentation of the data; the winner takes it all: only the winning prototype is updated, w* ← w* + η (x^μ − w*), with learning rate η (< 1) as the step size of the update.
Comparison with K-means: K-means updates all prototypes and considers all data at a time (it corresponds to EM for Gaussian mixtures in the limit of zero width); competitive VQ updates only the winner, with random sequential presentation of single examples (stochastic gradient descent).
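A minimal NumPy sketch of the procedure described above (winner-takes-all updates with the squared Euclidean distance); the function name and parameter values are illustrative, not part of the original slides:

    import numpy as np

    def competitive_vq(X, n_prototypes, eta=0.05, n_epochs=50, seed=None):
        # X: (P, N) array of feature vectors; returns (n_prototypes, N) prototype array
        rng = np.random.default_rng(seed)
        # initialize prototypes in randomly selected data points
        W = X[rng.choice(len(X), size=n_prototypes, replace=False)].copy()
        for _ in range(n_epochs):
            for mu in rng.permutation(len(X)):               # random sequential presentation
                x = X[mu]
                k = np.argmin(((W - x) ** 2).sum(axis=1))    # winner: smallest squared distance
                W[k] += eta * (x - W[k])                     # only the winner moves towards x
        return W

With a fixed small learning rate η < 1 this is exactly the stochastic, single-example counterpart of K-means mentioned above.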

Quantization error
Competitive VQ (and K-means) aim at optimizing a cost function, here based on the Euclidean distance: assign each data point to its closest prototype and measure the corresponding (squared) distance. The quantization error (sum over all data points) measures the quality of the representation and defines one criterion to evaluate / compare the quality of different prototype configurations.
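Written out (a reconstruction consistent with the definitions above, assuming P data points and K prototypes):

    H_{VQ} \;=\; \sum_{\mu=1}^{P} \min_{k=1,\dots,K} d(\vec{w}_k, \vec{x}^{\mu})
           \;=\; \sum_{\mu=1}^{P} \left(\vec{w}_{*(\mu)} - \vec{x}^{\mu}\right)^2 ,

where w_{*(μ)} denotes the winning prototype for example x^μ and d is the squared Euclidean distance.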

VQ and clustering
Remark 1: VQ ≠ clustering. In general, VQ yields a representation of the observations in feature space; prototypes coincide with cluster centers only in the ideal scenario of well-separated, spherical clusters. Minimizing the quantization error is sensitive to cluster shapes and to coordinate transformations (even linear ones), and small clusters are nearly irrelevant with respect to the quantization error.

VQ and clustering
Remark 2: clustering is an ill-defined problem. For the same data set, one observer may see "obviously three clusters" while another says "well, maybe only two?". Does our criterion settle it, i.e. does lower H_VQ mean "better clustering" than higher H_VQ???

VQ and clustering
Extreme cases: K = 1 gives the simplest "clustering", while with K = 60 one can reach H_VQ = 0 → "the best clustering"? H_VQ (and similar criteria) only allow comparison of VQ configurations with the same K! More generally, one needs a heuristic compromise between "error" and "simplicity".

Competitive learning
Practical issues of VQ training: poorly placed initial prototypes can become "dead units" that never win and are never updated. One solution: rank-based updates (winner, second, third, …). More generally, the quantization error has local minima, so the outcome of training is initialization-dependent.

Neural Gas (NG) [Martinetz, Berkovich, Schulten, IEEE Trans. Neural Netw. 1993]
Many prototypes (a "gas") represent the density of the observed data. NG introduces rank-based neighborhood cooperativeness: upon presentation of x^μ, determine the rank of every prototype according to its distance from x^μ, then update all prototypes with a neighborhood function of the rank and a rank-based range λ; λ is potentially annealed from large to smaller values during training.
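A minimal sketch of a single NG update step in NumPy, following the rank-based rule described above; the neighborhood function h_λ(rank) = exp(−rank/λ) is a standard choice, and the parameter values are illustrative:

    import numpy as np

    def neural_gas_step(W, x, eta=0.05, lam=2.0):
        # W: (K, N) prototypes, x: (N,) example; all prototypes are updated
        d = ((W - x) ** 2).sum(axis=1)        # squared Euclidean distances to x
        ranks = np.argsort(np.argsort(d))     # rank 0 = closest prototype
        h = np.exp(-ranks / lam)              # rank-based neighborhood function
        W += eta * h[:, None] * (x - W)       # every prototype moves, weighted by its rank
        return W

In practice λ is annealed from a large value (all prototypes move) towards small values (essentially winner-takes-all).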

Self-Organizing Map (SOM) [T. Kohonen, Self-Organizing Maps, Springer, 2nd edition 1997]
Neighborhood cooperativeness on a predefined low-dimensional lattice A of neurons (i.e. prototypes). Upon presentation of x^μ, determine the winner (best matching unit) at position s in the lattice, then update the winner and its neighborhood, with the range ρ defined with respect to distances in the lattice A.
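A minimal sketch of one SOM update step (NumPy), assuming the prototypes sit on a 2-dim. lattice with known coordinates; the Gaussian neighborhood function and the parameter values are illustrative choices, not prescribed by the slides:

    import numpy as np

    def som_step(W, grid, x, eta=0.1, rho=1.5):
        # W: (K, N) prototypes, grid: (K, 2) lattice coordinates of the K units, x: (N,) example
        s = np.argmin(((W - x) ** 2).sum(axis=1))          # winner (best matching unit)
        lat_d2 = ((grid - grid[s]) ** 2).sum(axis=1)       # squared distances *in the lattice* A
        h = np.exp(-lat_d2 / (2 * rho ** 2))               # neighborhood w.r.t. lattice distance
        W += eta * h[:, None] * (x - W)                    # winner and lattice neighbors move towards x
        return W

Both η and the range ρ are typically decreased in the course of training.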

Self-Organizing Map
The lattice deforms to reflect the density of the observations (illustration: © Wikipedia). The SOM provides a topology-preserving low-dimensional representation, e.g. for inspection and visualization of structured datasets.

Self-Organizing Map
Illustration: the Iris flower data set [Fisher, 1936], 4 numerical features representing Iris flowers from 3 different species. A SOM (4×6 prototypes in a 2-dim. grid) is trained on the 150 samples (without class label information). Component planes: 4 arrays representing the prototype values, one per feature.

Self-Organizing Map
U-Matrix: element U_r is the average distance d(w_r, w_s) of prototype w_r to the prototypes w_s of the nearest-neighbor lattice sites. Post-labelling: each prototype is assigned to the majority class of the data it wins (here: Setosa, Versicolor, Virginica, or undefined). The U-matrix reflects the cluster structure, with larger U at cluster borders; here, Setosa is well separated from Virginica/Versicolor.
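A possible implementation of the U-matrix computation described above, assuming a rectangular rows × cols lattice and a 4-neighborhood; names are illustrative:

    import numpy as np

    def u_matrix(W, rows, cols):
        # W: (rows*cols, N) prototypes in row-major lattice order
        Wg = W.reshape(rows, cols, -1)
        U = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                nbrs = [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= r + dr < rows and 0 <= c + dc < cols]
                # average distance of this prototype to its lattice neighbors' prototypes
                U[r, c] = np.mean([np.linalg.norm(Wg[r, c] - Wg[i, j]) for i, j in nbrs])
        return U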

Vector Quantization: remarks
- The approaches were not presented in historical order.
- There are many extensions of the basic concept, e.g. the Generative Topographic Map (GTM), a probabilistic formulation of the mapping to a low-dim. lattice [Bishop, Svensén, Williams, 1998].
- SOM and NG have been adapted to specific types of data: time series, "non-vectorial" relational data, graphs and trees.

3. Supervised Learning
Potential aims:
- classification: assign observations (data) to categories or classes, as inferred from labeled training data
- regression: assign a continuous target value to an observation, likewise inferred from labeled training data
- prediction: predict the evolution of a time series (sequence), inferred from observations of its history

Distance-based classification
Assignment of data (objects, observations, ...) to one or several classes (categories, labels), crisp or soft, based on comparison with reference data (samples, prototypes) in terms of a distance measure (dissimilarity, metric).
Representation of the data is a key step! Options include: a collection of qualitative/quantitative descriptors; vectors of numerical features; sequences, graphs, functional data; relational data, e.g. in terms of pairwise (dis-)similarities.

K-NN classifier
A simple distance-based classifier: store a set of labeled examples and classify a query according to the label of its nearest neighbor (or the majority among its K nearest neighbors). The local decision boundary follows from (e.g.) Euclidean distances; the piece-wise linear class borders are parameterized by all stored examples.
+ conceptually simple, no training required, only one parameter (K)
− expensive storage and computation; sensitivity to "outliers"; can result in overly complex decision boundaries
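A minimal K-NN sketch in NumPy matching the description above (Euclidean distance, majority vote); function and variable names are illustrative:

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, x, K=3):
        # X_train: (P, N) stored examples, y_train: (P,) labels, x: (N,) query
        d = ((X_train - x) ** 2).sum(axis=1)      # squared Euclidean distances to all stored examples
        nn = np.argsort(d)[:K]                    # indices of the K nearest neighbors
        return Counter(y_train[nn]).most_common(1)[0][0]   # majority label among the K neighbors

Note that all training examples enter the decision boundary, which is what makes storage and query-time computation expensive.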

Prototype-based classification
A prototype-based classifier [Kohonen 1990, 1997]: represent the data by one or several prototypes per class and classify a query according to the label of the nearest prototype (or alternative schemes). Local decision boundaries follow from (e.g.) Euclidean distances; the piece-wise linear class borders are parameterized by the prototypes.
+ less sensitive to outliers; lower storage needs; little computational effort in the working phase
− a training phase is required in order to place the prototypes; model selection problem: number of prototypes per class, etc.

Nearest Prototype Classifier (NPC)
Given a set of prototypes carrying class labels and a dissimilarity/distance measure, the NPC determines, for a query x, the winning prototype and assigns x to its class. The distance measure should satisfy some reasonable requirements; the most prominent example is the (squared) Euclidean distance.
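In formulas (a reconstruction consistent with the VQ winner rule above): given labeled prototypes {(w_k, c_k)}, a query x is assigned to class c_{k*} with

    k^{*} \;=\; \arg\min_{k} \, d(\vec{w}_k, \vec{x}), \qquad \text{e.g.} \quad d(\vec{w},\vec{x}) = (\vec{w}-\vec{x})^2 .

The slide does not spell out the "reasonable requirements" on d; typically one asks for d(w, x) ≥ 0, d(x, x) = 0, and differentiability with respect to w (cf. the GLVQ slides below).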

Learning Vector Quantization (LVQ)
N-dimensional data, feature vectors; identification of prototype vectors from labeled example data; distance-based classification (e.g. Euclidean).
Heuristic scheme LVQ1 [Kohonen, 1990, 1997]:
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (closest prototype)
• move the winner closer towards the data point (same class) or away from it (different class)

Learning Vector Quantization
∙ identification of prototype vectors from labeled example data
∙ distance-based classification [here: Euclidean distances]
∙ tessellation of the feature space [piece-wise linear]
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
∙ generalization ability: correct classification of new data

LVQ1
Iterative training procedure: randomized initial prototypes, e.g. close to the class-conditional means; sequential presentation of labelled examples; the winner takes it all. The LVQ1 update step moves the winning prototype by a step of size η (learning rate) towards or away from the example, depending on whether the labels agree. Many heuristic variants/modifications exist [Kohonen, 1990, 1997]: learning-rate schedules η(t), updating more than one prototype per step, etc.
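A minimal sketch of the LVQ1 update step in NumPy, following the verbal description above (winner attracted for matching labels, repelled otherwise); names and the value of η are illustrative:

    import numpy as np

    def lvq1_step(W, c, x, y, eta=0.05):
        # W: (M, N) prototypes, c: (M,) prototype labels, x: (N,) example with label y
        k = np.argmin(((W - x) ** 2).sum(axis=1))    # winner-takes-all
        psi = 1.0 if c[k] == y else -1.0             # +1: attraction (same class), -1: repulsion
        W[k] += eta * psi * (x - W[k])
        return W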

LVQ1
LVQ1-like update for a generalized distance measure d(w, x): w* ← w* ∓ η ∂d(w*, x^μ)/∂w*, with the minus sign if the classes coincide and the plus sign if they differ. Requirement: the update decreases the distance if the classes coincide and increases it if they are different.

Generalized LVQ (GLVQ)
One example of cost-function-based training: GLVQ [Sato & Yamada, 1995]. The cost function is defined in terms of the two winning prototypes (the closest prototype with the correct class and the closest prototype with a wrong class) and is minimized. A linear scaling function makes E favor large-margin separation of the classes; a sigmoidal scaling function (linear for small arguments) makes E approximate the number of misclassifications.
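The cost function referred to above can be written as (reconstruction of the standard GLVQ formulation):

    E \;=\; \sum_{\mu=1}^{P} \Phi\!\left( \frac{d_J(\vec{x}^{\mu}) - d_K(\vec{x}^{\mu})}{d_J(\vec{x}^{\mu}) + d_K(\vec{x}^{\mu})} \right),

where d_J is the distance to the closest prototype with the correct class, d_K the distance to the closest prototype with a wrong class, and Φ is the monotonic scaling function (linear or sigmoidal, as discussed above). The argument of Φ lies in [−1, 1] and is negative exactly for correctly classified examples.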

GLVQ
Training = optimization with respect to the prototype positions, e.g. by single-example presentation: a stochastic sequence of examples, with an update of two prototypes per step, based on any non-negative, differentiable distance measure.

GLVQ
For the (squared) Euclidean distance, the update moves the two winning prototypes towards / away from the sample, with prefactors determined by the two distances (see the sketch below).
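A minimal sketch of one GLVQ gradient step for the squared Euclidean distance and Φ = identity (an assumption made for brevity); the prefactors 4·d_K/(d_J+d_K)² and 4·d_J/(d_J+d_K)² follow from differentiating the cost term above:

    import numpy as np

    def glvq_step(W, c, x, y, eta=0.05):
        # W: (M, N) prototypes, c: (M,) labels, x: (N,) example with label y
        d = ((W - x) ** 2).sum(axis=1)
        correct = (c == y)
        J = np.flatnonzero(correct)[np.argmin(d[correct])]     # closest prototype, same class
        K = np.flatnonzero(~correct)[np.argmin(d[~correct])]   # closest prototype, different class
        dJ, dK = d[J], d[K]
        denom = (dJ + dK) ** 2
        W[J] += eta * (4 * dK / denom) * (x - W[J])   # attracted towards the example
        W[K] -= eta * (4 * dJ / denom) * (x - W[K])   # repelled from the example
        return W

The sketch assumes that at least one prototype of the correct class and one of another class exist.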

Prototype/distance-based classifiers
+ intuitive interpretation: prototypes are defined in the feature space
+ natural for multi-class problems
+ flexible, easy to implement
+ frequently applied in a variety of practical problems
− often based on purely heuristic arguments, or on cost functions with an unclear relation to the classification error
− model/parameter selection (# of prototypes, learning rate, …)
Important issue: which is the 'right' distance measure? Features may scale differently, be of completely different nature, or be highly correlated / dependent … is the simple Euclidean distance appropriate?

Distance measures
Fixed distance measures:
- select the distance measure according to prior knowledge, or make a data-driven choice in a preprocessing step
- determine prototypes for the given distance
- compare the performance of various measures
Example: divergence-based LVQ.

Relevance Matrix LVQ [Schneider et al., 2009]
A generalized quadratic distance is used in LVQ, with a normalization of the relevance matrix (see below). Variants: one global matrix, or several local, class-wise relevance matrices → piecewise quadratic decision boundaries; diagonal matrices: single feature weights [Bojer et al., 2001; Hammer et al., 2002]; rectangular matrices: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]. Possible constraints: rank control, sparsity, …
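The generalized quadratic distance and normalization mentioned above are, in the standard GMLVQ formulation,

    d_{\Lambda}(\vec{w}, \vec{x}) \;=\; (\vec{x}-\vec{w})^{\top} \Lambda \, (\vec{x}-\vec{w}),
    \qquad \Lambda = \Omega^{\top}\Omega ,
    \qquad \sum_{i} \Lambda_{ii} = \sum_{i,j} \Omega_{ij}^{2} = 1 ,

so that Λ is positive semi-definite by construction and d_Λ(w, x) = ‖Ω(x − w)‖² can be read as a squared Euclidean distance of linearly transformed features.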

Generalized Relevance Matrix LVQ (GMLVQ)
Joint optimization of the prototypes and the distance measure: both the prototype positions and the matrix Ω are updated along the gradients of the GLVQ cost function.

Heuristic interpretation
The measure corresponds to the standard Euclidean distance for linearly transformed features. The diagonal element Λ_jj summarizes the contribution of the original dimension j, i.e. the relevance of that original feature for the classification. This interpretation implicitly assumes that the features are of equal order of magnitude, e.g. after a z-score transformation (zero mean, unit variance, with averages taken over the data set).

Relevance Matrix LVQ
Iris flower data revisited (supervised analysis by GMLVQ); the figure shows the GMLVQ prototypes and the resulting relevance matrix.

Relevance Matrix LVQ
Empirical observation / theory: the relevance matrix becomes singular, dominated by very few eigenvectors. This prevents over-fitting in high-dimensional feature spaces and facilitates discriminative visualization of datasets. For the Iris data it confirms that Setosa is well separated from Virginica / Versicolor.

A multi-class example
Classification of coffee samples based on hyperspectral data (256-dim. feature vectors) [U. Seiffert et al., IFF Magdeburg]. The figure shows the prototypes and the data projected onto the first and second eigenvectors of the relevance matrix.

Relevance Matrix LVQ
Optimization of prototype positions and distance measure(s) in one training process (≠ pre-processing). Motivation:
- improved performance: weighting of features and of pairs of features
- simplified classification schemes: elimination of non-informative, noisy features; discriminative low-dimensional representation
- insight into the data / classification problem: identification of the most discriminative features; intrinsic low-dim. representation, visualization

Related schemes
Linear Discriminant Analysis (LDA): one prototype per class plus a global matrix, but a different objective function!
Relevance learning in related supervised schemes: RBF networks [Backhaus et al., 2012], Neighborhood Component Analysis [Goldberger et al., 2005], Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010], and many more.
Relevance LVQ variants: local, rectangular, structured, restricted relevance matrices; relevance matrices for visualization, functional data, texture recognition, etc.; relevance learning in Robust Soft LVQ, Supervised NG, etc.; combination of distances for mixed data, ...

Links
Matlab code, Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ): http://matlabserver.cs.rug.nl/gmlvqweb/web/
A no-nonsense beginners' tool for GMLVQ: http://www.cs.rug.nl/~biehl/gmlvq (see also: Tutorial, Thursday 9:30)
Pre- and re-prints etc.: http://www.cs.rug.nl/~biehl/

Questions?