Dynamics of Learning VQ and Neural Gas. Aree Witoelar, Michael Biehl. Mathematics and Computing Science, University of Groningen, Netherlands.

Presentation transcript:

Dynamics of Learning VQ and Neural Gas. Aree Witoelar, Michael Biehl, Mathematics and Computing Science, University of Groningen, Netherlands; in collaboration with Barbara Hammer (Clausthal) and Anarta Ghosh (Groningen).

Outline
- Vector Quantization (VQ)
- Analysis of VQ Dynamics
- Learning Vector Quantization (LVQ)
- Summary

Vector Quantization
Objective: represent (many) data points with (few) prototype vectors. Each data point ξ^μ is assigned to its nearest prototype vector w_j (with respect to a distance measure, e.g. the Euclidean distance), which groups the data into clusters, e.g. for classification. The quantization error sums, over all data, the distance to the nearest prototype; the goal is to find the set of prototypes W with the lowest quantization error.
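As a concrete illustration (not from the original slides), a minimal NumPy sketch of the nearest-prototype assignment and of the resulting quantization error; the function name and the mean-squared-distance convention are assumptions:

    import numpy as np

    def quantization_error(X, W):
        # X: (P, N) data points, W: (K, N) prototype vectors
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (P, K)
        winners = d2.argmin(axis=1)                                # nearest prototype per data point
        return winners, d2[np.arange(len(X)), winners].mean()     # assignments and quantization error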

Example: Winner Takes All (WTA)
- initialize K prototype vectors
- present a single example
- identify the closest prototype, i.e. the so-called winner
- move the winner even closer towards the example
This amounts to stochastic gradient descent with respect to a cost function; the prototypes end up in areas with a high density of data. A minimal sketch of one such update step follows below.
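Illustrative sketch of one WTA update step (the learning-rate name eta is an assumption):

    import numpy as np

    def wta_step(W, xi, eta):
        # move only the closest prototype (the winner) towards the example xi
        d2 = ((W - xi) ** 2).sum(axis=1)   # squared distances of all prototypes to the example
        s = d2.argmin()                    # index of the winner
        W[s] += eta * (xi - W[s])          # move the winner towards the example
        return W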

Problems
Winner Takes All is sensitive to initialization. Alternative: "winner takes most", i.e. update all prototypes according to their distance "rank", e.g. Neural Gas. Is this less sensitive to initialization?

(L)VQ algorithms
- intuitive
- fast, powerful algorithms
- flexible
- limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.
Analysis of VQ Dynamics
- exact mathematical description in very high dimensions
- study of typical learning behavior

Model: two Gaussian clusters of high-dimensional data
Random vectors ξ ∈ ℝ^N are drawn according to the prior probabilities p_+, p_- with p_+ + p_- = 1; classes σ = {+1, -1}; cluster centers B_+, B_- ∈ ℝ^N; variances υ_+, υ_-; separation ℓ. The clusters are separable in the projection onto the (B_+, B_-) plane, but not in projections onto other planes: the data are only separable in 2 of the N dimensions. A simple model, but not trivial.
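For concreteness, a small sketch of how data from this model could be generated (the form ξ = ℓ·B_σ plus isotropic noise of variance υ_σ, and the choice of B_± as unit vectors, are assumptions consistent with the description above, not code from the slides):

    import numpy as np

    def generate_data(P, N, p_plus, ell, var_plus, var_minus, seed=0):
        rng = np.random.default_rng(seed)
        B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}    # orthonormal cluster centers B_+ and B_-
        var = {+1: var_plus, -1: var_minus}
        sigmas = rng.choice([+1, -1], size=P, p=[p_plus, 1 - p_plus])   # labels drawn with priors p_+, p_-
        X = np.array([ell * B[s] + rng.normal(0.0, np.sqrt(var[s]), size=N) for s in sigmas])
        return X, sigmas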

Online learning
Prototypes w_s ∈ ℝ^N are updated from a sequence of independent random data. Each update of a prototype vector moves it towards the current data point; the learning rate (step size) and a modulation function f_s[…] set the strength and direction of the update. The function f_s[…] describes the algorithm used and may depend on the prototype class, the data class, whether the prototype is the "winner", etc.
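Putting these labelled pieces together, the generic online update takes the form below (following the JMLR reference cited in the Summary; the arguments of the modulation function are only sketched):

\[
\mathbf{w}_s^{\mu} = \mathbf{w}_s^{\mu-1} + \frac{\eta}{N}\, f_s[\ldots]\,\bigl(\boldsymbol{\xi}^{\mu} - \mathbf{w}_s^{\mu-1}\bigr),
\]

where η is the learning rate, the factor (ξ^μ − w_s^{μ-1}) moves the prototype towards the current data point, and f_s[…] encodes the algorithm (WTA, Neural Gas, LVQ1, …).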

Analysis of the dynamics:
1. Define a few characteristic quantities of the system: the projections of the prototypes onto the cluster centers, and the lengths and mutual overlaps of the prototypes (written out below); the random vector ξ^μ enters only through its projections.
2. Derive recursion relations of these quantities for new input data.
3. Calculate the averaged recursions.
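Written out, these are the order parameters used in the JMLR reference:

\[
R_{s\sigma} = \mathbf{w}_s \cdot \mathbf{B}_\sigma, \qquad Q_{st} = \mathbf{w}_s \cdot \mathbf{w}_t,
\]

i.e. the projections of the prototypes onto the cluster centers and the lengths and mutual overlaps of the prototypes; the data enter only through the projections w_s · ξ^μ and B_σ · ξ^μ.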

In the thermodynamic limit N → ∞ ...
- the averages over examples can be performed: the projections become correlated Gaussian quantities, completely specified in terms of their first and second moments
- the characteristic quantities self-average with respect to the random sequence of data (fluctuations vanish)
- a continuous learning time is defined: μ is discrete (1, 2, …, P), t = μ/N is continuous

4. Derive ordinary differential equations.
5. Solve for R_sσ(t), Q_st(t):
- dynamics and asymptotic behavior (t → ∞)
- quantization/generalization error
- sensitivity to initial conditions, learning rates, structure of the data

Results: VQ with 2 prototypes
Numerical integration of the ODEs (w_s(0) ≈ 0, p_+ = 0.6, ℓ = 1.0, υ_+ = 1.5, υ_- = 1.0, η = 0.01). [Plots: the characteristic quantities R_1+, R_1-, R_2+, R_2-, Q_11, Q_12, Q_22 and the quantization error E(W) as functions of t.]
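Curves of this kind come from integrating the averaged ODEs numerically; a minimal Euler skeleton is sketched here (the right-hand side ode_rhs is a placeholder, since the actual equations are derived in the JMLR reference):

    import numpy as np

    def integrate_odes(ode_rhs, y0, t_max, dt=0.01):
        # y collects the order parameters (R_1+, R_1-, R_2+, R_2-, Q_11, Q_12, Q_22)
        steps = int(t_max / dt)
        ys = np.empty((steps + 1, len(y0)))
        ys[0] = np.asarray(y0, dtype=float)
        for i in range(steps):
            ys[i + 1] = ys[i] + dt * ode_rhs(ys[i], i * dt)   # explicit Euler step
        return ys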

Projections of the prototypes onto the (B_+, B_-) plane at t = 50, shown for 2 prototypes and for 3 prototypes. Since p_+ > p_-, with 3 prototypes two of them move to the stronger cluster. [Plots: prototype positions in the R_S+ vs. R_S- plane, with cluster centers B_+, B_- at separation ℓ.]

Neural Gas: a winner-takes-most algorithm (3 prototypes)
The update strength decreases exponentially with the distance rank; the range λ(t) is large initially and is decreased over time (here λ_i = 2, λ_f = 10^-2). For λ(t) → 0 the algorithm becomes identical to WTA. [Plots: prototype projections R_S+ vs. R_S- at t = 0 and t = 50, and the quantization error E(W) versus t.]
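A minimal sketch of the rank-based update just described (the exp(-rank/λ) weighting and the exponential annealing of λ(t) are the standard Neural Gas choices; treat the details as assumptions):

    import numpy as np

    def neural_gas_step(W, xi, eta, lam):
        # every prototype moves towards xi, weighted by its distance rank
        d2 = ((W - xi) ** 2).sum(axis=1)
        ranks = d2.argsort().argsort()        # rank 0 = closest prototype
        h = np.exp(-ranks / lam)              # rank-dependent update strength
        W += eta * h[:, None] * (xi - W)
        return W

    def annealed_lambda(t, t_max, lam_i=2.0, lam_f=1e-2):
        # exponential interpolation from the initial to the final neighborhood range
        return lam_i * (lam_f / lam_i) ** (t / t_max)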

Sensitivity to initialization
Neural Gas is more robust w.r.t. initialization. WTA (eventually) reaches the minimum of E(W), but the learning time depends on the initialization and can become very large: the dynamics may get stuck on a "plateau" where ∇H_VQ ≈ 0. [Plots: prototype projections R_S+ vs. R_S- at t = 0 and t = 50 for Neural Gas and WTA, and E(W) versus t.]

Learning Vector Quantization (LVQ)
Objective: classification of data using prototype vectors. Assign data {ξ, σ}, ξ ∈ ℝ^N, to the nearest prototype vector (with respect to a distance measure, e.g. Euclidean); a data point is misclassified if its nearest prototype carries the wrong class. Find the optimal set W for the lowest generalization error.

LVQ1
The winner w_s is updated towards the data point if its class label agrees with the data class, and away from it otherwise (factor ±1); there is no cost function related to the generalization error. With two prototypes the classes are c = {+1, -1}; with three prototypes, which class should the third prototype get: c = {+1, +1, -1} or c = {+1, -1, -1}? [Plots: prototype trajectories in the R_S+ vs. R_S- plane for the two- and three-prototype configurations.]
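A minimal sketch of the LVQ1 step just described (illustrative; eta and the ±1 label convention as above):

    import numpy as np

    def lvq1_step(W, c, xi, sigma, eta):
        # W: (K, N) prototypes, c: (K,) prototype labels in {+1, -1}, sigma: class of xi
        d2 = ((W - xi) ** 2).sum(axis=1)
        s = d2.argmin()                                           # the winner
        W[s] += eta * (1 if c[s] == sigma else -1) * (xi - W[s])  # towards if correct, away if wrong
        return W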

Generalization error: the rate of misclassified data. [Plot: ε_g versus t for each prototype configuration, with p_+ = 0.6, p_- = 0.4, υ_+ = 1.5, υ_- = 1.0.]

Optimal decision boundary
The optimal decision boundary is the (hyper)surface where the weighted class densities are equal (see the condition below). For equal variances (υ_+ = υ_-) it is a linear decision boundary, a hyperplane; for unequal variances (υ_+ > υ_-) it is curved, and K = 3 prototypes are optimal rather than K = 2: more prototypes give a better approximation to the optimal decision boundary. [Figure: clusters around B_+ (p_+ > p_-) and B_- (p_-) at separation ℓ, with the decision boundary at distance d.]
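For reference, the Bayes-optimal boundary in such a two-Gaussian model is the set of points where the weighted class densities coincide (a standard result, not spelled out on the slide):

\[
p_+\, p(\boldsymbol{\xi} \mid +1) = p_-\, p(\boldsymbol{\xi} \mid -1),
\]

which reduces to a hyperplane for υ_+ = υ_- and to a curved (quadratic) surface for υ_+ ≠ υ_-.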

Asymptotic ε_g for υ_+ > υ_- (here υ_+ = 0.81, υ_- = 0.25)
- c = {+1, +1, -1}: the optimal classifier with K = 3 is better than with K = 2, and LVQ1 with K = 3 is better as well; it is best to place more prototypes on the class with the larger variance.
- c = {+1, -1, -1}: the optimal classifier with K = 3 is only equal to K = 2, and LVQ1 with K = 3 is worse.
Conclusion: more prototypes are not always better for LVQ1. [Plots: ε_g(t → ∞) versus p_+ for both configurations.]

Summary
- dynamics of (Learning) Vector Quantization for high-dimensional data
- Neural Gas: more robust w.r.t. initialization than WTA
- LVQ1: more prototypes not always better
Outlook
- study different algorithms, e.g. LVQ+/-, LFM, RSLVQ
- more complex models
- multi-prototype, multi-class problems
Reference: M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8 (2007).

Questions?


Example: LVQ1
- initialize K prototype vectors with classes ς
- present a single example
- identify the closest prototype, i.e. the winner
- move the winner towards/away from the example if the prototype class is correct/incorrect

Central Limit Theorem
Let x_1, x_2, …, x_N be independent random numbers drawn from an arbitrary probability distribution with finite mean and variance. The distribution of the average of the x_j approaches a normal distribution as N becomes large. [Figure: a non-normal distribution p(x_j) and the distribution of the average of the x_j for N = 1, 2, 5, 50.]
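A quick Monte Carlo sketch that reproduces such a picture (hypothetical illustration; the exponential source distribution is just one example of a non-normal distribution):

    import numpy as np

    def averaged_samples(N, trials=100_000, seed=0):
        # Monte Carlo samples of the average of N draws from a non-normal (exponential) distribution
        rng = np.random.default_rng(seed)
        return rng.exponential(scale=1.0, size=(trials, N)).mean(axis=1)

    # histograms of averaged_samples(N) for N = 1, 2, 5, 50 approach a Gaussian shape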

Self-averaging
Monte Carlo simulations over 100 independent runs: the fluctuations decrease with a larger number of degrees of freedom N. As N → ∞, the fluctuations vanish (the variance becomes zero).

"LVQ+/-"
Update both the correct and the incorrect winner: the closest prototype of the correct class, d_s = min{d_k} with c_s = σ^μ, is moved towards the data, and the closest prototype of a wrong class, d_t = min{d_k} with c_t ≠ σ^μ, is moved away from it. The dynamics are strongly divergent: for p_+ >> p_- there is a strong repulsion by the stronger class. To overcome the divergence one can use, e.g., early stopping, i.e. stop at ε_g(t) = ε_g,min (difficult in practice). [Plot: ε_g(t) for LVQ+/-.]
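A sketch of this two-winner update (illustrative only; the simultaneous attraction/repulsion is the defining feature, details such as the learning-rate handling are assumptions):

    import numpy as np

    def lvq_plus_minus_step(W, c, xi, sigma, eta):
        # attract the closest prototype of the correct class, repel the closest one of a wrong class
        d2 = ((W - xi) ** 2).sum(axis=1)
        correct = np.where(c == sigma)[0]
        wrong = np.where(c != sigma)[0]
        s = correct[d2[correct].argmin()]
        t = wrong[d2[wrong].argmin()]
        W[s] += eta * (xi - W[s])      # attraction towards the data
        W[t] -= eta * (xi - W[t])      # repulsion away from the data (source of the divergence)
        return W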

Comparison of LVQ1 and LVQ+/-
- For υ_+ = υ_- = 1.0, LVQ1 outperforms LVQ+/- with early stopping.
- For υ_+ = 0.81, υ_- = 0.25 (c = {+1, +1, -1}), LVQ+/- with early stopping outperforms LVQ1 in a certain interval of p_+.
The performance of LVQ+/- depends on the initial conditions. [Plots: asymptotic ε_g versus p_+ for both settings.]