Incremental Methods for Machine Learning Problems
Aristidis Likas
Department of Computer Science, University of Ioannina

Outline
– Machine Learning: data modeling + optimization
– The incremental machine learning framework
– Global k-means (PR, IEEE TNN)
– Greedy EM (NPL 2002, Bioinformatics)
– Incremental Bayesian GMM learning (IEEE TNN)
– Dip-means
– Incremental Bayesian supervised learning (IEEE TNN)
– Current research problems
– Matlab code available for all methods

Machine Learning Problems
– Unsupervised learning: clustering, density estimation, dimensionality reduction
– Supervised learning: classification, regression
Also considered as data mining or pattern recognition problems.

Machine Learning as Optimization
To solve a machine learning problem we need:
– a dataset X of training examples
– a parametric data model f(x; Θ) that ‘explains’ the data, where Θ is the set of parameters estimated during training
– an objective function L(X; Θ)
Model training is achieved through optimization of the objective function. This is usually a non-convex optimization problem with many local optima, so we search for a ‘near-optimal’ solution.

Machine Learning as Optimization
Local search algorithms (gradient descent, BFGS, EM, k-means): performance depends on the initialization of the parameters.
Typical solution: multiple (random) restarts
– multiple local search runs from (random) initializations
– keep the solution of the best run
Weaknesses:
– poor solutions for large models
– how many runs?
– how to initialize?
– non-determinism: non-repeatability, difficulty in comparing different methods
An alternative approach (in some cases): incremental model training.

Building Blocks Formulation
Many popular data models can be written as a combination (or simply as a set) of “building blocks” (BBs):
– number of BBs = model order
The combination function may also include parameters (w_1,…,w_M).
Set of model parameters: Θ_M = {θ_1,…,θ_M, w_1,…,w_M}.
Examples:
– k-means clustering: B = cluster centers, L = clustering error
– mixture models: B = component densities, L = likelihood
– feed-forward neural networks: B = sigmoidal or RBF hidden units, L = least-squares error
– kernel models: B = basis functions (kernels), L = loss functions
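For concreteness, a common special case of the building-block formulation is an additive combination of parameterized basis functions (written here in the slide's notation; the exact combination function g depends on the model):

\[
f(x;\Theta_M) = g\big(\varphi(x;\theta_1),\dots,\varphi(x;\theta_M);\, w_1,\dots,w_M\big),
\qquad \text{e.g.} \quad
f(x;\Theta_M) = \sum_{j=1}^{M} w_j\, \varphi(x;\theta_j).
\]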

Building Blocks
In some models the building blocks are fixed a priori:
– only optimization w.r.t. the combination weights w_i is required (a convex problem in many cases, e.g. SVM).
In the general case all the BB parameters θ_i must be learnt: a non-convex optimization problem
– many local optima
– local search methods
– dependence on the initialization of Θ_M
Resort to incremental training.

Incremental training The incremental (greedy) approach can offer a simple and effective solution to the random restarts problem in training ML models. Incremental methods are based on the following assumption: – We can obtain a ‘near-optimal’ model with k BBs by exploiting a ‘near-optimal’ model with (k-1) BBs. Method: Starting with k=1 BB, incremental methods sequentially add one BB at each step until M BBs have been added.
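A minimal Python sketch of this generic greedy loop; the routines fit_new_bb and refine are hypothetical placeholders for the model-specific steps described in the following slides:

```python
def incremental_training(X, M, fit_new_bb, refine):
    """Generic incremental (greedy) model building: start with one
    building block (BB) and add one BB per step until M BBs are reached.
    `fit_new_bb(X, model)` and `refine(X, model)` are hypothetical
    model-specific routines (e.g. k-means runs, partial/full EM)."""
    model = fit_new_bb(X, model=None)      # near-optimal model with k = 1 BB
    for k in range(2, M + 1):
        model = fit_new_bb(X, model)       # add the k-th BB, exploiting the (k-1)-BB solution
        model = refine(X, model)           # optional full-model training (local search)
    return model
```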

Incremental Training Approaches
1. Fast approach: optimize only w.r.t. θ_k of the k-th BB, keeping θ_1,…,θ_{k-1} fixed to the solution of the (k-1)-BB model.
– exhaustive enumeration (deterministic)
– multiple restarts, but the search space is much smaller
2. Fast approach followed by full model training (once).

Incremental Training
3. Full model training with multiple restarts:
– initializations based on the (k-1)-BB model
Deterministic search is preferable (avoids randomness).
Incremental methods also provide solutions for all intermediate models with k = 1,…,M BBs.

Prototype-Based Clustering
Partition a dataset X of N vectors x_i into M subsets (clusters) C_k such that the intra-cluster variance is minimized.
Intra-cluster variance: average distance from the cluster prototype m_k.
k-means: prototype = cluster center.
Finds local minima w.r.t. the clustering error (sum of intra-cluster variances).
Highly dependent on the initial positions of the centers m_k.
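Written out, the clustering error minimized by k-means (standard definition, consistent with the slide):

\[
E(m_1,\dots,m_M) \;=\; \sum_{k=1}^{M} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^2 .
\]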

Global k-means
Incremental, deterministic clustering algorithm that runs k-means several times.
Finds near-optimal solutions w.r.t. the clustering error.
Idea: a near-optimal solution for k clusters can be obtained by running k-means from an initial state where
– the k-1 centers are initialized from a near-optimal solution of the (k-1)-clustering problem
– the k-th center is initialized at some data point x_n (which one?)
Consider all possible initializations (one for each x_n).

Global k-means
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem (trivial).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
   – Execute k-means N times; at the n-th run (n = 1,…,N) the k-1 centers are initialized from the (k-1)-clustering solution and the k-th center is initialized at data point x_n.
   – Keep the solution of the run with the lowest clustering error as the solution with k clusters.
   – Set k := k+1.
3. Repeat step 2 until k = M.
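A minimal Python sketch of this procedure (using scikit-learn's KMeans rather than the authors' Matlab code; shown for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    """Global k-means: grow the model from 1 to M clusters, trying every
    data point as the initial position of the newly added center."""
    centers = X.mean(axis=0, keepdims=True)              # 1-clustering solution (trivial)
    for k in range(2, M + 1):
        best_error, best_centers = np.inf, None
        for x_n in X:                                     # one k-means run per candidate point
            init = np.vstack([centers, x_n])
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if km.inertia_ < best_error:                  # inertia_ = clustering error
                best_error, best_centers = km.inertia_, km.cluster_centers_
        centers = best_centers
    return centers
```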

[Figure: best initial position of the newly added center m_2, m_3, m_4, m_5 at successive steps of global k-means]

Fast Global k-Means How is the complexity reduced? – We select the initial state that provides the greatest reduction in clustering error in the first iteration of k-means (reduction can be computed analytically) – k-means is executed only once from this state
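For reference, the error reduction achieved in the first k-means iteration can be bounded analytically; in the notation of the original fast global k-means analysis (stated here from memory, so treat the exact form as an assumption), the guaranteed reduction when placing the new center at candidate point x_n is

\[
b_n \;=\; \sum_{j=1}^{N} \max\!\big(d_j^{\,k-1} - \lVert x_n - x_j \rVert^2,\; 0\big),
\]

where d_j^{k-1} is the squared distance of x_j to its closest center in the (k-1)-clustering solution; the point x_n maximizing b_n is selected as the initial position of the k-th center.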

Kernel-Based Clustering (non-linear separation)
– Given a set of objects and the kernel matrix K = [K_ij] containing the similarities between each pair of objects.
– Goal: partition the dataset into subsets (clusters) C_k such that the intra-cluster similarity is maximized.
– Kernel trick: data points are mapped from input space to a higher-dimensional feature space through a transformation φ(x).
– The kernel function corresponds to the inner product in feature space: K_ij = φ(x_i)^T φ(x_j).
– Kernel k-means ≡ k-means in feature space.

Kernel k-Means
Kernel k-means = k-means in feature space
– minimizes the clustering error in feature space
Differences from k-means:
– cluster centers m_k in feature space cannot be computed explicitly
– each cluster C_k is explicitly described by its data objects
– distances from the centers are computed in feature space using only the kernel matrix (see below)
Finds local minima; strong dependence on the initial partition.
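The feature-space distance can be expressed entirely in terms of kernel values (standard kernel k-means identity):

\[
\lVert \varphi(x_i) - m_k \rVert^2 \;=\; K_{ii} \;-\; \frac{2}{|C_k|}\sum_{x_j \in C_k} K_{ij} \;+\; \frac{1}{|C_k|^2}\sum_{x_j \in C_k}\sum_{x_l \in C_k} K_{jl},
\qquad
m_k = \frac{1}{|C_k|}\sum_{x_j \in C_k} \varphi(x_j).
\]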

Global Kernel k-Means
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem with kernel k-means (trivial solution).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
   a) Let {C_1,…,C_{k-1}} denote the solution to the (k-1)-clustering problem.
   b) Execute kernel k-means N times; during the n-th run the first k-1 clusters are initialized to {C_1,…,C_{k-1}} and the k-th cluster is initialized as {x_n}.
   c) Keep the run with the lowest clustering error as the solution with k clusters.
   d) Set k := k+1.
3. Repeat step 2 until k = M.
The fast global kernel k-means variant can also be applied.

[Figure: best initial clusters C_2, C_3, C_4; the empty circles mark the optimal initialization of the cluster to be added]

Global Kernel k-Means – Applications
– MRI image segmentation
– Key-frame extraction / shot clustering

Mixture Models
Probability density estimation: estimate the density function model f(x) that generated a given dataset X = {x_1,…,x_N}.
Mixture models:
– M pdf components φ_j(x)
– mixing weights π_1, π_2,…,π_M (priors)
Gaussian mixture model (GMM): φ_j = N(μ_j, Σ_j)
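Written out in the slide's notation (standard mixture-density definition):

\[
f(x;\Theta) \;=\; \sum_{j=1}^{M} \pi_j\, \varphi_j(x),
\qquad \pi_j \ge 0,\;\; \sum_{j=1}^{M}\pi_j = 1,
\qquad \text{GMM: } \varphi_j(x) = \mathcal{N}(x;\mu_j,\Sigma_j).
\]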

GMM (graphical model)
[Figure: graphical model with the hidden component-indicator variable, the mixing weights π_j, and the observation x]

GMM examples
GMMs can be used for density estimation (like histograms) or for clustering via the cluster membership probabilities.

Mixture Model Training
Given a dataset X = {x_1,…,x_N} and a GMM f(x;Θ):
– Likelihood of the data (see below).
– GMM training: log-likelihood maximization.
– Expectation-maximization (EM) algorithm: applicable when the posterior P(Z|X) can be computed.
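The log-likelihood being maximized (standard form, using the mixture density defined above):

\[
L(\Theta) \;=\; \ln p(X\,|\,\Theta) \;=\; \sum_{n=1}^{N} \ln \sum_{j=1}^{M} \pi_j\, \varphi_j(x_n).
\]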

EM for Mixture Models
E-step: compute the expectation of the hidden variables given the observations (the component responsibilities).
M-step: maximize the expected complete-data log-likelihood.
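In standard notation, with z_nj the hidden indicator that x_n was generated by component j:

\[
\text{E-step: } \gamma(z_{nj}) = \frac{\pi_j\,\varphi_j(x_n)}{\sum_{l=1}^{M}\pi_l\,\varphi_l(x_n)},
\qquad
\text{M-step: } \Theta^{\text{new}} = \arg\max_{\Theta}\; \mathbb{E}_{Z}\big[\ln p(X,Z\,|\,\Theta)\big].
\]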

EM for GMM (M-step): update equations for the means, covariances, and mixing weights (see below).
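The standard GMM M-step updates, using the responsibilities γ(z_nj) from the E-step:

\[
N_j = \sum_{n=1}^{N}\gamma(z_{nj}),\qquad
\mu_j = \frac{1}{N_j}\sum_{n=1}^{N}\gamma(z_{nj})\,x_n,\qquad
\Sigma_j = \frac{1}{N_j}\sum_{n=1}^{N}\gamma(z_{nj})\,(x_n-\mu_j)(x_n-\mu_j)^{T},\qquad
\pi_j = \frac{N_j}{N}.
\]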

EM Local Maxima

Greedy EM for GMM
Start with k = 1: f_1(x) = N(μ_1, Σ_1), μ_1 = mean(X), Σ_1 = cov(X).
Let f_k be the GMM solution with k components.
Let φ(x;μ,Σ) be the (k+1)-th component to be added.
Refine f_{k+1}(x) using EM → final GMM with k+1 components.
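The new component enters as a two-component mixture of the old model and the new Gaussian (the greedy-EM formulation; the mixing weight a and the new component's parameters are first optimized by partial EM, then the full model is refined):

\[
f_{k+1}(x) \;=\; (1-a)\, f_k(x) \;+\; a\, \varphi(x;\mu,\Sigma), \qquad 0 < a < 1.
\]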

Greedy EM for GMM
Spherical covariance Σ = σI: given θ = (μ,σ), the optimal mixing weight α* can be computed analytically.
– Remark: the new component should be placed in a data region.
– Deterministic approach.

Greedy-EM Applications
– Image modeling for content-based retrieval and relevance feedback
– Motif discovery in sequences (discrete data, mixture of multinomials)
– Time-series clustering (mixture of regression models)

Bayesian GMM Typical approach: Priors on all GMM parameters
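Typical conjugate choices, written out as an illustrative assumption (they match the priors mentioned later in the talk: a Dirichlet prior on the mixing weights and a Wishart prior on the precisions):

\[
\pi \sim \mathrm{Dir}(\alpha_0),\qquad
T_j \sim \mathcal{W}(\nu, V),\qquad
\mu_j \,|\, T_j \sim \mathcal{N}\!\big(m_0, (\beta_0 T_j)^{-1}\big),
\]

where T_j = Σ_j^{-1} is the precision matrix of component j.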

Bayesian GMM Training
The parameters Θ become (hidden) RVs: H = {Z, Θ}.
Objective: compute the posteriors P(Z|X), P(Θ|X) (intractable).
Approximations:
– sampling (RJMCMC)
– MAP approach
– variational approach
MAP approximation: find the mode Θ_MAP of the posterior P(Θ|X) (MAP-EM), then compute P(Z|X,Θ_MAP).

Variational Inference (no parameters)
Computes an approximation q(H) of the true posterior P(H|X).
For any pdf q(H), the log-evidence decomposes into a variational bound F(q) plus a KL term (see below).
Maximizing the variational bound F under the mean-field approximation yields a system of coupled equations.
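The standard decomposition and mean-field solution referred to above:

\[
\ln P(X) \;=\; F(q) + \mathrm{KL}\big(q(H)\,\|\,P(H|X)\big),
\qquad
F(q) = \int q(H)\,\ln\frac{P(X,H)}{q(H)}\,dH,
\]
\[
\text{mean field: } q(H) = \prod_i q_i(H_i)
\;\;\Rightarrow\;\;
q_i^{*}(H_i) \propto \exp\!\Big(\mathbb{E}_{j\neq i}\big[\ln P(X,H)\big]\Big).
\]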

Variational Inference (with parameters)
X data, H hidden RVs, Θ parameters.
For any pdf q(H;Θ): maximization of the variational bound F leads to variational EM, which alternates a VE-step and a VM-step (see below).
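The variational EM iteration in standard form:

\[
\text{VE-step: } q^{(t+1)} = \arg\max_{q} F\big(q, \Theta^{(t)}\big),
\qquad
\text{VM-step: } \Theta^{(t+1)} = \arg\max_{\Theta} F\big(q^{(t+1)}, \Theta\big).
\]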

Bayesian GMM Training
The mean-field variational approximation for Bayesian GMMs:
– tackles the covariance singularity problem
– requires specifying the parameters of the priors
Estimating the number of components:
– start with a large number of components
– let the training process prune redundant components (π_j = 0)
– a Dirichlet prior on π_j prevents component pruning

Bayesian GMM Without Prior on π
Mixing weights π_j are treated as parameters (the Dirichlet prior is removed).
Training using variational EM.
Method (C-B):
– start with a large number of components
– perform variational maximization of the marginal likelihood
– redundant components are pruned (π_j = 0)
– only components that fit the data well are finally retained

Bayesian GMM (C-B)
C-B method: results depend on
– the number of initial components
– the initialization of the components
– the specification of the scale matrix V of the Wishart prior p(T)

Incremental Bayesian GMM
A modification of the Bayesian GMM is needed:
– divide the components into ‘fixed’ and ‘free’
– prior on the weights of ‘fixed’ components (retained)
– no prior on the weights of ‘free’ components (may be eliminated)
– pruning restricted to the ‘free’ components
Solution: incremental training using component splitting.
Local scale matrix V: based on the variance of the component being split.

Incremental Bayesian GMM

Start with k = 1 component. At each step:
– select a component j
– split component j into two subcomponents
– set the scale matrix V based on Σ_j
– apply variational EM, treating the two subcomponents as free and the remaining components as fixed
– either both subcomponents are retained and adjusted, or one of them is eliminated and the other recovers the original component (before the split)
Repeat until all components have been unsuccessfully tested for splitting.

Incremental Bayesian GMM: image segmentation; the number of segments is determined automatically.


Relevance Vector Machine
RVM model (Tipping 2001):
– φ_i(x) = K(x, x_i) (the same kernel function ‘centered’ on training example x_i)
– fixed pool of N basis functions; initially M = N basis functions
– Bayesian inference with a sparse prior on w → redundant basis functions are pruned
– only a few basis functions are retained (the relevance vectors)
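The RVM prediction model in standard form (linear in the weights, with one kernel basis function per training example):

\[
y(x;w) \;=\; \sum_{i=1}^{N} w_i\, K(x, x_i) \;+\; w_0 \;=\; \sum_{i=1}^{N} w_i\, \varphi_i(x) + w_0 .
\]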

Relevance Vector Machine
Likelihood: Gaussian noise model on the targets.
Sparse prior on w:
– a separate precision α_i for each weight w_i
Marginal weight prior p(w): Student's t (enforces sparsity).
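In the standard RVM formulation for regression (targets t_n, noise precision β as used later in the talk):

\[
p(t\,|\,w,\beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \,\big|\, y(x_n;w),\, \beta^{-1}\big),
\qquad
p(w\,|\,\alpha) = \prod_{i} \mathcal{N}\big(w_i \,\big|\, 0,\, \alpha_i^{-1}\big);
\]

integrating out the α_i under Gamma hyperpriors gives a Student's-t marginal prior on each w_i.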

RVM Training
Maximize the marginal likelihood using an expectation-maximization (EM) algorithm:
– E-step: compute the posterior over the weights w.
– M-step: re-estimate the precisions α_i (and the noise precision β).
Sparsity: most α_i → ∞, so the corresponding weights are pruned.
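A standard way to write these steps (design matrix Φ with Φ_ni = φ_i(x_n), A = diag(α); the α update below is the EM form of Tipping's re-estimation, stated here from memory):

\[
\text{E-step: } \Sigma = \big(\beta\,\Phi^{T}\Phi + A\big)^{-1},\quad \mu = \beta\,\Sigma\,\Phi^{T} t;
\qquad
\text{M-step: } \alpha_i^{\text{new}} = \frac{1}{\mu_i^2 + \Sigma_{ii}} .
\]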

RVM example

RVM Incremental Training
Incrementally add basis functions, starting with an empty model (Faul & Tipping 2003).
Optimization w.r.t. a single parameter α_i; the optimal α_i can be estimated analytically (see below).
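The analytical solution from the fast marginal-likelihood analysis, in the usual sparsity/quality-factor notation (stated from memory, so treat the exact symbols as an assumption):

\[
\alpha_i \;=\; \frac{s_i^{2}}{q_i^{2} - s_i} \quad \text{if } q_i^{2} > s_i,
\qquad
\alpha_i = \infty \quad \text{otherwise},
\]

where s_i and q_i are the sparsity and quality factors of basis function φ_i computed from the current model.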

RVM Incremental Training
At each iteration of the training algorithm:
– compute the optimal α_i for all N basis functions
– select the best basis function φ_i(x) from the pool of N candidates
– perform one of the following:
  – add this basis function to the current model
  – update α_i (if it is already included in the model)
  – remove this basis function (if it is included in the model and α_i = ∞)

RVM Limitations
How to specify the kernel parameters? (e.g. the scale of an RBF kernel)
– Typical solution: cross-validation
  – computationally expensive
  – cannot be used when many parameters must be adjusted
How to model non-stationary functions?
– RVM uses the same kernel over the whole input space.

Adaptive RVM with Kernel Learning (aRVM)
Assume different parameters θ_i for each basis function φ(x;θ_i).
RBF kernel: the center m_i and scale h_i are parameters; in general m_i differs from the training points x_n.
Employ incremental RVM training:
– typical incremental RVM: select the best basis function to add from a fixed set of N basis functions
– aRVM: select the basis function φ(x;θ_i) to add by optimizing the marginal likelihood sl(α_i,θ_i) w.r.t. (α_i,θ_i)
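One common RBF parameterization consistent with this description (an illustrative assumption; the exact form used in aRVM may differ):

\[
\varphi(x;\theta_i) \;=\; \exp\!\left(-\frac{\lVert x - m_i\rVert^{2}}{2h_i^{2}}\right),
\qquad \theta_i = (m_i, h_i).
\]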

Sparsity-Controlling Prior
The aRVM model is more flexible than the typical RVM, so a “stronger” prior on the weights is employed to enforce sparsity: the “sparsity-controlling prior” [Schmolck & Everson 2007].
– c = 0: typical RVM
– c = log(N): typical value in the experiments
The prior can be written in a form parameterized by c, and the likelihood is modified accordingly due to the new prior.

Learning α_i, θ_i
Maximize w.r.t. α_i and θ_i using alternating maximization steps.
Optimal α_i (for fixed θ_i): computed analytically; for c = 0 we obtain the incremental RVM update.

Learning α_i, θ_i
Maximize w.r.t. θ_i (with α_i fixed):
– use the quasi-Newton BFGS method (analytical derivatives)
– perform multiple restarts from several initial values of θ_i and keep the solution with the best likelihood sl

aRVM Learning Algorithm
Start from an empty model. At each iteration:
1. Optimize the parameters (α_i, θ_i) of a new basis function φ(x;θ_i) and add it to the model.
2. Train the current model:
   a) update the parameters θ_i of all current basis functions (BFGS updates)
   b) update the parameters α_i and β (noise precision)
   c) delete redundant basis functions (α_i > 10^12)
3. Repeat steps 1-2 until convergence.
The method can be used with any differentiable form of basis function φ(x;θ_i).

aRVM Example (RBF kernel)
[Demos and result tables not reproduced in the transcript]


Incremental Bayesian MLP Training
For sigmoidal basis functions we obtain the multilayer perceptron (MLP) with one hidden layer.
It is straightforward to apply the incremental kernel-learning algorithm.
Tackles the model selection problem (number of hidden units) in MLP neural networks.
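For concreteness, with sigmoidal basis functions the model becomes a one-hidden-layer MLP (standard form; the exact parameterization of θ_i is illustrative):

\[
\varphi(x;\theta_i) = \sigma\big(a_i^{T}x + b_i\big),\quad
\sigma(t) = \frac{1}{1+e^{-t}},
\qquad
y(x;w) = \sum_{i=1}^{M} w_i\,\sigma\big(a_i^{T}x + b_i\big) + w_0,
\quad \theta_i = (a_i, b_i).
\]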

Incremental Learning: Current Research
The dynamic nature of incremental training methods makes them particularly suitable for machine learning on streaming data.
– Sparse and high-dimensional data: text clustering
– Multi-view clustering
Theoretical support for the successful empirical performance: submodular cost functions.
– The addition of a building block to a model M provides a greater cost improvement than adding the same building block to a larger model M’ that includes M (diminishing returns).
– For submodular functions the simple greedy heuristic performs ‘surprisingly’ well (at least (1 − 1/e) ≈ 0.63 of the maximum).
Challenge: prove that machine learning objective functions are (approximately) submodular (proved for k-medoids, feature selection, dictionary learning).

Thank you
Collaborators:
– N. Vlassis (global k-means, greedy EM)
– G. Tzortzis (global kernel k-means)
– C. Constantinopoulos (Bayesian GMM)
– A. Kalogeratos (dip-means)
– D. Tzikas, N. Galatsanos (aRVM)
Matlab code available for all methods.