Dynamics of Learning VQ and Neural Gas
Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal) and Anarta Ghosh (Groningen)
Outline
Vector Quantization (VQ)
Analysis of VQ Dynamics
Learning Vector Quantization (LVQ)
Summary
Vector Quantization
Objective: representation of (many) data points by (few) prototype vectors.
Assign each datum ξ^μ to the nearest prototype vector w_j (by a distance measure, e.g. Euclidean), grouping the data into clusters, e.g. for classification.
The quantization error measures the distance of the data to their nearest prototypes; find the optimal set W of prototypes for the lowest quantization error.
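A standard way of writing the quantization error (a sketch in the slides' notation; d(·,·) is the squared Euclidean distance and Θ the Heaviside step function, so only the winner contributes for each example):

    E(W) = Σ_{μ=1..P} Σ_{s=1..K} d(w_s, ξ^μ) Π_{t≠s} Θ( d(w_t, ξ^μ) − d(w_s, ξ^μ) )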
Example: Winner-Takes-All (WTA)
initialize K prototype vectors
present a single example
identify the closest prototype, i.e. the so-called winner
move the winner even closer towards the example
This is stochastic gradient descent with respect to a cost function; prototypes end up in areas with a high density of data.
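A minimal sketch of one online WTA step (squared Euclidean distance; the function name and learning rate eta are illustrative, not from the slides):

    import numpy as np

    def wta_update(prototypes, xi, eta=0.01):
        """One Winner-Takes-All step: move the closest prototype
        towards the example xi. `prototypes` has shape (K, N)."""
        # squared Euclidean distances to all K prototypes
        dists = np.sum((prototypes - xi) ** 2, axis=1)
        winner = np.argmin(dists)  # index of the closest prototype
        prototypes[winner] += eta * (xi - prototypes[winner])
        return prototypes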
Problems of Winner-Takes-All: sensitive to initialization.
Alternative: "winner takes most", update according to "rank" (e.g. Neural Gas); less sensitive to initialization?
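For reference, the rank-based Neural Gas update takes the following form (a sketch; the exponential rank weighting h_λ is the standard choice from the Neural Gas literature):

    Δw_s = (η/N) h_λ(k_s) (ξ^μ − w_s),  with  h_λ(k) = exp(−k/λ)

where k_s ∈ {0, …, K−1} is the rank of w_s by distance to ξ^μ (the winner has k = 0); for λ → 0 the update reduces to WTA.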
(L)VQ algorithms
intuitive
fast, powerful algorithms
flexible
limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.
Analysis of VQ Dynamics
exact mathematical description in very high dimensions
study of typical learning behavior
Model: two Gaussian clusters of high-dimensional data
Random vectors ξ ∈ ℝ^N are generated according to the prior probabilities p_+, p_- with p_+ + p_- = 1.
cluster centers: B_+, B_- ∈ ℝ^N; variances: υ_+, υ_-; separation: ℓ; classes: σ = {+1, −1}
The clusters are separable in the projection onto the (B_+, B_-) plane, but not on other planes: the data are separable in only 2 of the N dimensions.
A simple model, but not trivial.
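A minimal sketch of sampling from this model (assuming orthonormal cluster centers and isotropic Gaussian noise, as in the JMLR reference cited at the end; names and parameter defaults are illustrative):

    import numpy as np

    def sample_data(P, N, ell=1.0, p_plus=0.6, v_plus=1.5, v_minus=1.0, seed=0):
        """Draw P examples xi in R^N from the two-cluster model:
        xi = ell * B_sigma + Gaussian noise with variance v_sigma."""
        rng = np.random.default_rng(seed)
        B_plus, B_minus = np.eye(N)[0], np.eye(N)[1]  # orthonormal cluster centers
        labels = rng.choice([+1, -1], size=P, p=[p_plus, 1.0 - p_plus])
        X = np.empty((P, N))
        for i, s in enumerate(labels):
            center = B_plus if s == +1 else B_minus
            var = v_plus if s == +1 else v_minus
            X[i] = ell * center + np.sqrt(var) * rng.standard_normal(N)
        return X, labels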
Online learning
Prototype vectors w_s ∈ ℝ^N are updated from a sequence of independent random data.
The update of a prototype vector moves it towards the current data point; it is controlled by the learning rate (step size) η and by a modulation f_s[…] which sets the strength and direction of the update ("winner", prototype class vs. data class, etc.) and describes the algorithm used.
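In the notation of the JMLR reference cited at the end, the generic online update step reads:

    w_s^μ = w_s^{μ−1} + (η/N) f_s[…] (ξ^μ − w_s^{μ−1})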
1. Define a few characteristic quantities of the system: the projections of the prototypes onto the cluster centers, and the lengths and mutual overlaps of the prototypes; the random vector ξ^μ enters only through its projections.
2. Derive recursion relations of these quantities for new input data.
3. Calculate the averaged recursions.
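The characteristic quantities and projections, in the notation of the reference:

    R_{sσ} = w_s · B_σ,  Q_{st} = w_s · w_t  (order parameters)
    h_s^μ = w_s^{μ−1} · ξ^μ,  b_σ^μ = B_σ · ξ^μ  (projections of the data)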
In the thermodynamic limit N → ∞ …
the characteristic quantities self-average with respect to the random sequence of data (fluctuations vanish)
the projections become correlated Gaussian quantities, completely specified in terms of their first and second moments, so the average over examples can be performed
define a continuous learning time t: μ discrete (1, 2, …, P), t continuous
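Concretely (assuming orthonormal cluster centers, as in the reference), with the continuous learning time t = μ/N, the conditional moments within cluster σ are:

    ⟨b_τ⟩_σ = ℓ δ_{τσ},  ⟨h_s⟩_σ = ℓ R_{sσ}
    ⟨h_s h_t⟩_σ − ⟨h_s⟩_σ ⟨h_t⟩_σ = υ_σ Q_{st}

with the analogous mixed (h, b) and (b, b) covariances given by υ_σ R_{sτ} and υ_σ δ_{ρτ}.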
4. Derive ordinary differential equations for the averaged quantities.
5. Solve for R_{sσ}(t), Q_{st}(t):
dynamics and asymptotic behavior (t → ∞)
quantization/generalization error
sensitivity to initial conditions, learning rates, structure of the data
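Step 5 amounts to standard numerical integration of the coupled ODEs; a generic sketch of that step (the function rhs below is only a placeholder for the algorithm-specific averaged terms, which are derived in the reference and not reproduced here):

    import numpy as np
    from scipy.integrate import solve_ivp

    def rhs(t, y, eta=0.01):
        # Placeholder right-hand side: y packs the order parameters,
        # e.g. (R_{1+}, R_{1-}, R_{2+}, R_{2-}, Q_11, Q_12, Q_22) for
        # two prototypes. The true expressions are the averaged update
        # terms of the chosen algorithm.
        return -eta * y  # illustrative relaxation only

    y0 = 1e-3 * np.ones(7)  # w_s(0) ~ 0: order parameters start near zero
    sol = solve_ivp(rhs, (0.0, 50.0), y0, dense_output=True)
    print(sol.y[:, -1])  # order parameters at t = 50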
Dagstuhl Seminar, Q 11 Q 22 Q 12 Results VQ 2 prototypes R 1+ R 2- R 2+ R 1- w s winner Numerical integration of the ODEs ( w s (0)≈0 p + =0.6, ℓ=1.0, υ + =1.5, υ - =1.0, =0.01) E(W) t characteristic quantities quantization error
Dagstuhl Seminar, B+B+ B-B- ℓ R S+ R S- 2 prototypes Projections of prototypes on the B+,B- plane at t=50 R S+ R S- p + > p - Two prototypes move to the stronger cluster 3 prototypes
Neural Gas: a winner-takes-most algorithm (3 prototypes)
the update strength decreases exponentially with the rank
λ(t) is large initially and is decreased over time (here λ_i = 2, λ_f = 10^-2); for λ(t) → 0 the algorithm is identical to WTA
[Plots: R_{S+} vs. R_{S-} at t = 0 and t = 50; quantization error E(W) vs. t]
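A common annealing schedule (the standard choice in the Neural Gas literature; the slide itself only fixes the endpoints λ_i and λ_f):

    λ(t) = λ_i (λ_f / λ_i)^{t/t_f}

with t_f the final learning time (t_f = 50 in the plots).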
Sensitivity to initialization
[Plots: prototype projections R_{S+} vs. R_{S-} at t = 0 and at t = 50 for WTA and for Neural Gas; E(W) vs. t]
WTA eventually reaches the minimum of E(W), but depends on the initialization: learning times can become large on a "plateau" where ∇H_VQ ≈ 0.
Neural Gas: more robust w.r.t. initialization.
Learning Vector Quantization (LVQ)
Objective: classification of data using prototype vectors.
Assign data {ξ, σ}, ξ ∈ ℝ^N, to the nearest prototype vector (by a distance measure, e.g. Euclidean); data are misclassified if the nearest prototype belongs to the wrong class.
Find the optimal set W for the lowest generalization error.
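The generalization error can be written as the class-weighted probability of misclassification (a sketch; ⟨·⟩_σ denotes the average over cluster σ):

    ε_g = p_+ ⟨Θ_mis⟩_+ + p_- ⟨Θ_mis⟩_-

where Θ_mis = 1 if the nearest prototype carries a class label different from σ, and 0 otherwise.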
Dagstuhl Seminar, w s winner ±1 LVQ1 c={+1, -1} R S+ R S- two prototypes c={+1,+1,-1} R S+ R S- three prototypes c={+1,-1,-1} R S+ R S- which class to add the 3 rd prototype? update winner towards/ away from data no cost function related to generalization error
Generalization error
[Plot: ε_g vs. t for p_+ = 0.6, p_- = 0.4, υ_+ = 1.5, υ_- = 1.0; misclassified data indicated per class]
Optimal decision boundary: the (hyper)plane or surface where the prior-weighted class densities are equal (see below).
equal variances (υ_+ = υ_-): linear decision boundary
unequal variances (υ_+ > υ_-): the optimal boundary is curved; more prototypes (K = 3 instead of K = 2) give a better approximation to the optimal decision boundary
[Sketch: clusters B_+ (prior p_+ > p_-) and B_- (prior p_-), separation ℓ, boundary at distance d]
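The defining condition of the boundary, written out for this model (a sketch using the Gaussian class densities from the model slide):

    p_+ p(ξ | +1) = p_- p(ξ | −1),  with  p(ξ | σ) = (2π υ_σ)^{−N/2} exp( −(ξ − ℓ B_σ)² / (2 υ_σ) )

For υ_+ = υ_- the quadratic terms cancel and the boundary is a plane; for υ_+ ≠ υ_- it is curved.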
Asymptotic generalization error ε_g(t → ∞) for υ_+ > υ_- (υ_+ = 0.81, υ_- = 0.25)
c = {+1, +1, −1}: optimal classifier with K = 3 is better; LVQ1 with K = 3 is better
c = {+1, −1, −1}: optimal classifier with K = 3 equals K = 2; LVQ1 with K = 3 is worse
best: place more prototypes on the class with the larger variance; more prototypes are not always better for LVQ1
[Plots: ε_g(t → ∞) vs. p_+ for both choices of c]
Summary
dynamics of (Learning) Vector Quantization for high-dimensional data
Neural Gas: more robust w.r.t. initialization than WTA
LVQ1: more prototypes are not always better
Outlook
study of different algorithms, e.g. LVQ+/-, LFM, RSLVQ
more complex models; multi-prototype and multi-class problems
Reference
M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8 (2007).
Questions?
Example: LVQ1
initialize K prototype vectors with classes ς
present a single example (of the correct or an incorrect class)
identify the closest prototype, i.e. the winner
move the winner towards/away from the example if the prototype class is correct/wrong
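A minimal sketch of one online LVQ1 step (squared Euclidean distance; names and the learning rate eta are illustrative):

    import numpy as np

    def lvq1_update(prototypes, proto_classes, xi, sigma, eta=0.01):
        """One LVQ1 step: attract the winner if its class matches
        the data label sigma, repel it otherwise."""
        dists = np.sum((prototypes - xi) ** 2, axis=1)
        winner = np.argmin(dists)  # closest prototype
        sign = 1.0 if proto_classes[winner] == sigma else -1.0
        prototypes[winner] += eta * sign * (xi - prototypes[winner])
        return prototypes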
Central Limit Theorem
Let x_1, x_2, …, x_N be independent random numbers drawn from an arbitrary probability distribution with a given mean and finite variance. The distribution of the average of the x_j approaches a normal distribution as N becomes large.
[Example plots: a non-normal distribution p(x_j) and the distribution of the average for N = 1, 2, 5, 50]
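A quick numerical illustration of the slide's example (the exponential distribution is an arbitrary non-normal choice; the slide does not specify which distribution was used):

    import numpy as np

    # Averages of N exponential (non-normal) variates: the distribution
    # of the average tightens and becomes Gaussian-like as N grows.
    rng = np.random.default_rng(0)
    for N in (1, 2, 5, 50):
        means = rng.exponential(size=(100_000, N)).mean(axis=1)
        print(f"N={N:3d}  mean={means.mean():.3f}  std={means.std():.3f}")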
Self-Averaging
Monte Carlo simulations over 100 independent runs: the fluctuations decrease with a larger number of degrees of freedom N.
For N → ∞ the fluctuations vanish (the variance becomes zero).
"LVQ+/-": update the correct and the incorrect winner
d_s = min{d_k} among prototypes with c_s = σ^μ (closest correct prototype)
d_t = min{d_k} among prototypes with c_t ≠ σ^μ (closest incorrect prototype)
strongly divergent! For p_+ >> p_-: strong repulsion of the weaker class by the stronger class
to overcome the divergence: e.g. early stopping, i.e. stop at ε_g(t) = ε_g,min (difficult in practice)
[Plot: ε_g(t)]
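In the notation of the reference, the two winners are updated in opposite directions (a sketch):

    Δw_s = +(η/N) (ξ^μ − w_s)  (closest prototype with the correct class)
    Δw_t = −(η/N) (ξ^μ − w_t)  (closest prototype with an incorrect class)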
Comparison of LVQ1 and LVQ+/-
equal variances (υ_+ = υ_- = 1.0): LVQ1 outperforms LVQ+/- with early stopping
unequal variances (υ_+ = 0.81, υ_- = 0.25), c = {+1, +1, −1}: LVQ+/- with early stopping outperforms LVQ1 in a certain interval of p_+
LVQ+/- performance depends on the initial conditions
[Plots: asymptotic ε_g vs. p_+ for both settings]