Virtual Vector Machine for Bayesian Online Classification. Yuan (Alan) Qi, CS & Statistics, Purdue. June 2009. Joint work with T.P. Minka and R. Xiang.

Motivation Ubiquitous data streams: emails, stock prices, satellite images, video surveillance. How can we process a data stream using a small memory buffer and still make accurate predictions?

Outline Introduction Virtual Vector Machine Experimental Results Summary

Introduction Online learning: – Update model and make predictions based on data points received sequentially – Use a fixed-size memory buffer

Classical online learning Classification: – Perceptron Linear regression: – Kalman filtering

Bayesian treatment Monte Carlo methods (e.g., particle filters): – Difficult for classification models due to high dimensionality Deterministic methods: – Assumed-density filtering, e.g., for Gaussian process classification models (Csato, 2002).

Virtual Vector Machine preview Two parts: – Gaussian approximation factors – Virtual points for non-Gaussian factors Virtual points summarize multiple real data points, allow flexible functional forms, and are stored in a data cache of user-defined size.

Outline Introduction Virtual Vector Machine Experimental Results Summary

Online Bayesian classification Model parameters: $\mathbf{w}$. Data from time 1 to $T$: $\mathcal{D}_T = \{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$. Likelihood function at time $t$: $p(y_t \mid \mathbf{x}_t, \mathbf{w})$. Prior distribution: $p_0(\mathbf{w})$. Posterior at time $T$: $p(\mathbf{w} \mid \mathcal{D}_T) \propto p_0(\mathbf{w}) \prod_{t=1}^{T} p(y_t \mid \mathbf{x}_t, \mathbf{w})$.
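Online learning exploits the recursive structure of this posterior: each incoming likelihood factor multiplies the previous posterior, so the update at time $t$ needs only the running approximation, not the raw data history:

\[ p(\mathbf{w} \mid \mathcal{D}_t) \;\propto\; p(\mathbf{w} \mid \mathcal{D}_{t-1})\; p(y_t \mid \mathbf{x}_t, \mathbf{w}). \]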

Flipping noise model Labeling error rate $\epsilon$: each observed label is flipped with probability $\epsilon$, giving the likelihood $p(y_t \mid \mathbf{x}_t, \mathbf{w}) = \epsilon + (1 - 2\epsilon)\,\Theta(y_t\,\mathbf{w}^\top \mathbf{x}_t)$, where $\Theta$ is the step function; equivalently, the feature vector is scaled by $1$ or $-1$ depending on the label. Posterior distribution: planes cutting a sphere in the 3-D case.
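A minimal sketch of this flipping-noise likelihood, assuming the step-function form used in Bayes point machines (the function name is illustrative):

import numpy as np

def flipping_noise_likelihood(y, x, w, eps=0.1):
    """p(y | x, w) = eps + (1 - 2*eps) * step(y * w.x).
    y: label in {-1, +1}; x: feature vector; w: classifier weights;
    eps: labeling error rate (probability the observed label was flipped)."""
    correct_side = float(y * np.dot(w, x) > 0)  # step function on the margin
    return eps + (1.0 - 2.0 * eps) * correct_side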

Gaussian approximation by EP EP approximates each likelihood factor $p(y_t \mid \mathbf{x}_t, \mathbf{w})$ by a Gaussian factor $\tilde{f}_t(\mathbf{w})$. Both $\tilde{f}_t(\mathbf{w})$ and the prior $p_0(\mathbf{w})$ have the form of a Gaussian; therefore, the approximate posterior $q(\mathbf{w}) \propto p_0(\mathbf{w}) \prod_t \tilde{f}_t(\mathbf{w})$ is a Gaussian.

VVM enlarges the approximation family: $q(\mathbf{w}) \propto g(\mathbf{w}) \prod_i f(\mathbf{v}_i, \mathbf{w})$, where $\mathbf{v}_i$ is a virtual point, $f(\mathbf{v}_i, \mathbf{w})$ has the exact form of the original likelihood function (it could be more flexible), and $g(\mathbf{w})$ is a Gaussian residue.

Reduction to Gaussian From the augmented representation, we can reduce $q(\mathbf{w})$ to a Gaussian by EP smoothing on the virtual points with the residue $g(\mathbf{w})$ as the prior; the resulting approximation is a Gaussian too.

Cost function for finding virtual points Minimizing a cost function in the ADF spirit: the KL divergence from the tilted distribution, which contains one more nonlinear factor than the current approximation. In practice, we maximize a surrogate function: keep the informative (non-Gaussian) information in the virtual points. Exact optimization is still computationally intractable…

Two basic operations Searching over all possible locations for the virtual points is computationally expensive! For efficiency, we consider only two operations to generate virtual points: – Eviction: delete the least informative point – Merging: merge two similar points into one

Eviction After adding the new point into the virtual point set: – Select the virtual point $\mathbf{v}_j$ whose exact factor is best replaced by a Gaussian (in the version-space picture below, the point with the largest margin) – Remove $\mathbf{v}_j$ from the cache – Update the residual by absorbing the evicted factor: $g_{\text{new}}(\mathbf{w}) \propto g(\mathbf{w})\,\tilde{f}_j(\mathbf{w})$, where $\tilde{f}_j$ is a moment-matched Gaussian approximation of $f(\mathbf{v}_j, \mathbf{w})$

Version space for the 3-D case [figure]: the version space (brown area), the EP approximation (red ellipse), and four data points drawn as hyperplanes; the panel shows the version space with three points after deleting the point with the largest margin.

Merging Remove $\mathbf{v}_i$ and $\mathbf{v}_j$ from the cache, insert the merged point $\mathbf{v}$ into the cache, and update the residual via $g_{\text{new}}(\mathbf{w}) \propto g(\mathbf{w})\, r(\mathbf{w})$, where the Gaussian residual term $r(\mathbf{w})$ captures the information lost from the original two factors. This is equivalent to replacing $f(\mathbf{v}_i, \mathbf{w})\, f(\mathbf{v}_j, \mathbf{w})$ by $r(\mathbf{w})\, f(\mathbf{v}, \mathbf{w})$.
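Writing the last point as an equation (a reconstruction consistent with the inverse-ADF slide below), the merged point and the Gaussian residual are chosen so that

\[ r(\mathbf{w})\, f(\mathbf{v}, \mathbf{w}) \;\approx\; f(\mathbf{v}_i, \mathbf{w})\, f(\mathbf{v}_j, \mathbf{w}), \]

where $\approx$ means matching the first two moments under the current posterior approximation.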

Version space for the 3-D case [figure]: the version space (brown area), the EP approximation (red ellipse), and four data points drawn as hyperplanes; the panel shows the version space with three points after merging two similar points.

Compute residue term Inverse ADF: match the moments of the left and right distributions. Because each factor depends on $\mathbf{w}$ only through a scalar projection, this is efficiently solved by a Gauss-Newton method as a one-dimensional problem.
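A generic numerical sketch of such a one-dimensional moment match; the function names, the grid-based moments, and the Levenberg-Marquardt solver (a damped Gauss-Newton method) are illustrative assumptions, not the paper's exact procedure:

import numpy as np
from scipy.optimize import least_squares

def match_moments_1d(target_mean, target_var, log_factor, grid):
    """Find a Gaussian N(m, s^2) such that N(m, s^2) * exp(log_factor)
    reproduces the target mean and variance on a 1-D grid."""
    def residuals(params):
        m, log_s = params
        s = np.exp(log_s)
        w = np.exp(-0.5 * ((grid - m) / s) ** 2 + log_factor(grid))
        w = w / w.sum()                       # normalize the product density
        mean = np.sum(w * grid)
        var = np.sum(w * (grid - mean) ** 2)
        return [mean - target_mean, var - target_var]
    sol = least_squares(residuals,
                        x0=[target_mean, 0.5 * np.log(target_var)],
                        method="lm")          # damped Gauss-Newton
    m, log_s = sol.x
    return m, np.exp(log_s)

# Example: match moments against a step-like (flipping-noise) factor
grid = np.linspace(-5.0, 5.0, 2001)
step = lambda u: np.log(0.1 + 0.8 * (u > 0))
print(match_moments_1d(0.5, 0.8, step, grid))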

Algorithm Summary
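The summary figure on this slide is not reproduced in the transcript. Below is a minimal, illustrative Python sketch of one online step assembled from the preceding slides; the largest-margin eviction rule, the zero-noise ADF absorption, and all names are simplifying assumptions, not the paper's exact algorithm.

import numpy as np
from scipy.stats import norm

def adf_absorb(m, V, x, y):
    """Absorb one step-function factor step(y * w.x) into the Gaussian
    N(m, V) by moment matching on the 1-D margin (ADF projection)."""
    s = y * x
    mu = s @ m                        # margin mean
    var = s @ V @ s                   # margin variance
    r = mu / np.sqrt(var)
    lam = norm.pdf(r) / norm.cdf(r)   # truncated-Gaussian correction
    Vs = V @ s
    m_new = m + Vs * (lam / np.sqrt(var))
    V_new = V - np.outer(Vs, Vs) * (lam * (lam + r) / var)
    return m_new, V_new

def vvm_step(cache, m, V, x, y, budget=30):
    """One VVM-style step: keep exact factors for cached points; when the
    cache overflows, evict the most confidently classified point
    (largest margin under the current mean) and absorb its factor
    into the Gaussian part."""
    cache.append((x, y))
    if len(cache) > budget:
        margins = [yi * (m @ xi) for xi, yi in cache]
        xe, ye = cache.pop(int(np.argmax(margins)))
        m, V = adf_absorb(m, V, xe, ye)
    return cache, m, V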

Classification with random features Random feature expansion (Rahimi & Recht, 2007): for RBF kernels, we use random Fourier features, where the random frequencies are sampled from a particular Gaussian distribution (the Fourier transform of the RBF kernel). This turns nonlinear classification into linear classification in the random feature space.
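A standard implementation of this feature map, assuming the kernel k(x, y) = exp(-gamma * ||x - y||^2); the dimension-100 expansion used in the experiments below would correspond to n_features=100:

import numpy as np

def random_fourier_features(X, n_features=100, gamma=1.0, seed=0):
    """z(x) = sqrt(2/D) * cos(W x + b), with rows of W ~ N(0, 2*gamma*I)
    and b ~ Uniform[0, 2*pi], so that z(x).z(y) ~= exp(-gamma*||x-y||^2)
    in expectation (Rahimi & Recht, 2007)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)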

Outline Introduction Virtual Vector Machine Experimental Results Summary

Estimation accuracy of posterior mean Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.

Online classification (1) Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The virtual point set used by VVM has size 30, while the online Gaussian process model has 143 basis points.

Online nonlinear classification (2) Cumulative prediction error rates of VVM and the competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier feature expansion for a Gaussian kernel (dimension 100). NG and VVM each use a buffer caching 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.

Online nonlinear classification (3) Cumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier feature expansion for a Gaussian kernel (dimension 100). NG and VVM each use a buffer caching 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.

Summary Efficient Bayesian online classification A small, constant space cost A smooth trade-off between prediction accuracy and computational cost Improved prediction accuracy over alternative methods Future directions: more flexible functional forms for virtual points, and other applications