
1 Bayesian Kernel Methods for Binary Classification and Online Learning Problems
Theodore Trafalis
Industrial Engineering, College of Engineering
Workshop on Clustering and Search Techniques in Large Scale Networks, LATNA, Nizhny Novgorod, Russia, November 4, 2014

2 Part I. Bayesian Kernel Methods for Gaussian Processes

3 Why Bayesian Learning?
Returns a probability
Incorporates the power of kernel methods with the advantages of Bayesian updating
Can incorporate prior knowledge into the estimation
Can “learn” fairly quickly if the model is a Gaussian process
Can be used for regression or classification

4 Outline 1. Bayesics 2. Relevance Vector Machine 3. Laplace Approximation 4. Results

5 Bayes’ Rule
p(α | y) = p(y | α) p(α) / p(y), where p(α | y) is the posterior, p(y | α) the likelihood, p(α) the prior, and p(y) a calculated (normalizing) value

6 Logistic Likelihood
p(y_i = 1 | t(x_i)) = 1 / (1 + e^(-t(x_i)))
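
A minimal sketch of this likelihood in code, assuming labels y_i in {0, 1} and latent values t_i = t(x_i):

```python
import numpy as np

def logistic_likelihood(y, t):
    """p(y_i | t_i) under the logistic model, with y_i in {0, 1}."""
    p_one = 1.0 / (1.0 + np.exp(-t))        # p(y_i = 1 | t_i)
    return np.where(y == 1, p_one, 1.0 - p_one)
```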

7 Prior
Assume t(x) = (t(x_1), …, t(x_m)) is a Gaussian process (normally distributed)
Let t = Kα

8 Maximize Posterior Goal: Find optimal values for α

9 Minimize Negative Log
Maximizing the posterior is equivalent to minimizing its negative log. The indicator terms [y_i = 0] and [y_i = 1] select the matching likelihood factor; with t = Kα this gives
min over α of: Σ_i [ log(1 + e^(t_i)) - y_i t_i ] + (1/2) α^T K α
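
A minimal sketch of this objective in code, assuming t = Kα, the logistic likelihood, and the Gaussian process prior above (so the prior term reduces to (1/2) α^T K α up to a constant):

```python
import numpy as np

def neg_log_posterior(alpha, K, y):
    """Negative log posterior (up to a constant) for the logistic
    likelihood with t = K @ alpha and GP prior t ~ N(0, K)."""
    t = K @ alpha
    # -log likelihood: sum_i [log(1 + exp(t_i)) - y_i * t_i], stable form
    nll = np.sum(np.logaddexp(0.0, t) - y * t)
    # -log prior: 0.5 * alpha^T K alpha, since t^T K^{-1} t = alpha^T K alpha
    return nll + 0.5 * alpha @ K @ alpha
```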

11 Relevance Vector Machine
Combines the Bayesian approach with the sparseness of the support vector machine
Previously a single prior governed α; the RVM gives each weight its own hyperparameter s_i = 1/Variance(α_i)

12 Non-Informative (Flat) Prior
Place a Gamma(a, b) hyperprior on each precision s_i and let a = b ≈ 0
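
In symbols, assuming the usual Gamma hyperprior on each precision s_i (the standard RVM construction from the references):

```latex
p(s_i) = \mathrm{Gamma}(s_i \mid a, b) \propto s_i^{\,a-1} e^{-b s_i},
\qquad a = b \approx 0 \;\Rightarrow\; p(s_i) \propto s_i^{-1}
```

That limit is a prior that is uniform in log s_i.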

13 Maximize Posterior

14 Laplace Approximation
Newton-Raphson method, with first-derivative terms c_i and second-derivative terms C_ii

15 Iteration
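
A sketch of one Newton-Raphson iteration on the objective above (same assumptions: logistic likelihood, t = Kα, GP prior):

```python
import numpy as np

def newton_step(alpha, K, y):
    """One Newton-Raphson update for the negative log posterior above."""
    t = K @ alpha
    p = 1.0 / (1.0 + np.exp(-t))       # p(y_i = 1 | t_i)
    grad = K @ (p - y) + K @ alpha     # gradient: likelihood + prior terms
    W = np.diag(p * (1.0 - p))         # logistic curvature
    hessian = K @ W @ K + K            # Hessian of the objective
    return alpha - np.linalg.solve(hessian, grad)
```

Iterating until α stops changing gives the posterior mode; the Laplace approximation then treats the posterior as a Gaussian centered at that mode with covariance equal to the inverse Hessian.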

16 Optimizing Hyperparameter
Need a closed-form expression for the marginal likelihood p(y | s)
If α | y, s were normally distributed, the optimal s could be found in closed form
Use a Gaussian approximation

17 SVM and RVM Comparison Similar accuracy with fewer “support” vectors

18 Conclusion
Posterior ∝ Likelihood × Prior
Gaussian process
▫ Makes the math easier
▫ Assumes that the density is centered around the mode
Relevance Vector Machine
▫ Similar accuracy to the Support Vector Machine
▫ Fewer data points needed for the RVM compared to the SVM
In Part 2 we discuss
▫ Non-Gaussian processes
▫ A Markov Chain Monte Carlo solution

19 References
B. Schölkopf and A.J. Smola, 2002. “Chapter 16: Bayesian Kernel Methods.” Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press.
C.M. Bishop and M.E. Tipping, 2003. “Bayesian Regression and Classification.” In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, eds., Advances in Learning Theory: Methods, Models and Applications. Amsterdam: IOS Press.

21 Likelihood for Classification
Logistic: p(y_i = 1 | t_i) = 1 / (1 + e^(-t_i))
Probit: p(y_i = 1 | t_i) = Φ(t_i), where Φ is the standard normal CDF

22 Likelihood for Regression
Gaussian noise model: p(y_i | t_i) = N(y_i; t_i, σ²)

23 RVM for Regression
y_i = Σ_j α_j k(x_i, x_j) + ε_i, where ε_i ~ N(0, σ²)

24 Incremental Updating for Regression

25 Part II. Bayesian Kernel Methods Using Beta Distributions
Theodore Trafalis
Industrial Engineering, College of Engineering
Workshop on Clustering and Search Techniques in Large Scale Networks, LATNA, Nizhny Novgorod, Russia, November 4, 2014

26 Summary of Part 1
Bayesian method: Posterior ∝ Likelihood × Prior
Gaussian process
▫ Makes the math easier
▫ Assumes that the density is centered around the mode
Relevance Vector Machine
Solution concept: posterior maximization

27 Current Bayesian Kernel Methods
Combine Bayesian probability with kernel methods
n data points, m attributes per data point
X is an n × m matrix
y is an n × 1 vector of 0s and 1s
A function of X is used to predict y
Posterior ∝ Likelihood × Prior

28 Support Vector Machines
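
As background, a standard form of the kernel SVM decision function, with labels y_i in {-1, +1} and dual coefficients α_i ≥ 0:

```latex
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, k(x, x_i) + b \Big)
```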

29 What’s new in Part 2
Beta distributions as priors
Adaptation of the beta-binomial updating formula
Comparison of beta kernel classifiers with existing SVM classifiers
Online learning

30 Outline 1. Beta Distribution 2. Other Priors 3. Markov Chain Monte Carlo 4. Test Case

31 Likelihood
Logistic likelihood: p(y_i = 1 | t_i) = 1 / (1 + e^(-t_i))
Bernoulli likelihood: p(y_i | θ_i) = θ_i^(y_i) (1 - θ_i)^(1 - y_i)

32 Beta Distribution Prior
p(θ_i) = θ_i^(α_i - 1) (1 - θ_i)^(β_i - 1) / B(α_i, β_i)

33 Shape of beta distribution
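
A short sketch of how the beta density changes shape with its parameters (the (a, b) pairs below are illustrative choices, not values from the slide):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 200)
for a, b in [(0.5, 0.5), (1, 1), (2, 5), (5, 2)]:   # illustrative shapes
    plt.plot(theta, stats.beta(a, b).pdf(theta), label=f"Beta({a}, {b})")
plt.xlabel("theta"); plt.ylabel("density"); plt.legend()
plt.show()
```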

34 Beta-binomial conjugate
Prior: θ ~ Beta(α, β)
Likelihood: k ones in n trials, k ~ Binomial(n, θ)
Posterior: θ | k, n ~ Beta(α + k, β + n - k)
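
A worked instance of this update (the prior and counts below are made-up illustration values):

```python
from scipy import stats

a, b = 1.0, 1.0          # uniform Beta(1, 1) prior
n, k = 10, 7             # 10 trials, 7 ones observed (hypothetical)
posterior = stats.beta(a + k, b + n - k)   # Beta(8, 4)
print(posterior.mean())  # posterior mean = 8 / 12 ≈ 0.667
```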

35 α and β
Let α_i and β_i each be a function of x_i, under assumed functional forms (slides 35–38)

39 Applying beta-binomial to data mining
The prior Beta(α_i, β_i) is updated to a posterior using kernel-weighted counts of ones and zeros from the training set, with a weight γ on the number of zeros in the training set as a parameter to be tuned

40 Classification Rule
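
A minimal sketch of this rule in code, assuming an RBF kernel and classification by thresholding the posterior mean of θ at 0.5 (both are illustrative assumptions, not necessarily the paper’s exact choices):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel (an assumed choice of kernel)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def classify(x, X_train, y_train, a=1.0, b=1.0, gamma=1.0, sigma=1.0):
    """Beta kernel classifier sketch.

    X_train: (n, m) array; y_train: (n,) array of 0s and 1s.
    Returns the predicted label and the posterior mean of theta.
    """
    k = np.array([rbf_kernel(x, xi, sigma) for xi in X_train])
    k_pos = k[y_train == 1].sum()    # kernel-weighted count of ones
    k_neg = k[y_train == 0].sum()    # kernel-weighted count of zeros
    theta = (a + k_pos) / (a + k_pos + b + gamma * k_neg)  # posterior mean
    return int(theta > 0.5), theta
```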

41 Testing on data sets
Beta prior is uniform: α = 1, β = 1
Rates represent mean values of the percent of ones or zeros correctly classified

42 Online learning
Updated probabilities for one data point from tornado data, for y = 0 and y = 1
Each trial uses 100 data points to update the prior
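
A sketch of this online step, reusing rbf_kernel from the classifier sketch above; the posterior from one batch becomes the prior for the next:

```python
import numpy as np

def online_update(a, b, x, X_batch, y_batch, gamma=1.0, sigma=1.0):
    """Fold one batch of labeled points into the Beta parameters for
    point x; rbf_kernel is as defined in the classifier sketch."""
    k = np.array([rbf_kernel(x, xi, sigma) for xi in X_batch])
    a_new = a + k[y_batch == 1].sum()           # kernel-weighted ones
    b_new = b + gamma * k[y_batch == 0].sum()   # weighted zeros
    return a_new, b_new
```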

43 Conclusions
Adapting the beta-binomial updating rule to a kernel-based classifier can create a fast and accurate data mining algorithm
The user can set the prior and weights to reflect imbalanced data sets
Results are comparable to the weighted SVM
Online learning combines previous and current information

44 Options for Prior Distributions
α and β must be greater than 0
Assume α and β are independent
Several choices of prior distribution are possible

45 Kernel Function
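
One common choice, and the one assumed in the code sketches above, is the Gaussian RBF kernel:

```latex
k(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right)
```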

46 Directed Acyclic Graph of the model, with nodes μ_α, σ_α, μ_β, σ_β, s, r, γ, x, K+, K-, α, β, and θ

47 Markov Chain Monte Carlo (MCMC)
A simulation tool for calculating posterior distributions
Gibbs sampler: iterates using the conditional distributions

50 MCMC Software
▫ Bayesian Inference Using Gibbs Sampling (BUGS)
▫ Just Another Gibbs Sampler (JAGS)
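
A toy illustration of the Gibbs sampler idea, drawing from a bivariate normal with correlation ρ by alternating its two conditional distributions (a standard textbook example, unrelated to the tornado data):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, seed=0):
    """Gibbs sampler: alternate draws from x | y and y | x."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))   # x | y
        y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))   # y | x
        samples[i] = x, y
    return samples
```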

51 Toy Example

52 Parameters for Priors

53 Large Gamma Needed

54 Results

55 Test Data Automatically Calculated

56 Comparison

57 Conclusion
Advantages of the Beta-Bayesian Method
▫ Incorporates a non-Gaussian process
▫ Results on the example equal to the SVM
▫ Testing data automatically calculated with MCMC
Disadvantages
▫ MCMC is a slow algorithm
▫ An analytical solution may not be possible
▫ Difficult to determine prior distributions
Future Work
▫ Real data
▫ More comparisons with existing methods

58 References
C.A. MacKenzie, T.B. Trafalis, and K. Barker, 2014. “A Bayesian Beta Kernel Model for Binary Classification and Online Learning Problems.” Statistical Analysis and Data Mining (in press).

