Introduction to Gaussian Processes
CS 478 / CS 778
Chris Tensmeyer


What Topic?
Machine Learning → Regression → Bayesian ML → Bayesian Regression → Bayesian Non-parametrics → Gaussian Process (GP) → GP Regression → GP Regression with a Squared Exponential Kernel

Regression

Paper Titles
Gaussian process classification for segmenting and annotating sequences
Discovering hidden features with Gaussian processes regression
Active learning with Gaussian processes for object categorization
Semi-supervised learning via Gaussian processes
Discriminative Gaussian process latent variable model for classification
Computation with infinite neural networks
Fast sparse Gaussian process methods: The informative vector machine
Approximate Dynamic Programming with Gaussian Processes

Paper Titles (continued)
Learning to control an octopus arm with Gaussian process temporal difference methods
Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp
Gaussian process latent variable models for visualization of high dimensional data
Bayesian unsupervised signal classification by Dirichlet process mixtures of Gaussian processes

Bayesian ML

Running Example

Prior (general form vs. running example)

Data Likelihood (general form vs. running example)

Posterior (general form vs. running example)

Posterior Predictive (general form vs. running example)

Other Regression Methods

Other Regression Methods (continued)

Other Regression Methods (continued)

Gaussian Process
A GP is a stochastic process in which any finite collection of the random variables is jointly Gaussian distributed. For example, at the fixed input x = 1, the value f(1) has a Gaussian distribution.
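To make the definition concrete, here is a minimal sketch (not from the slides) of drawing sample functions from a zero-mean GP prior: the function values at any finite set of inputs are drawn from a multivariate normal whose covariance matrix comes from the covariance function. A squared exponential kernel is assumed here.

```python
# Sketch: sampling a zero-mean GP prior at a finite set of inputs.
# Any finite set of function values drawn this way is jointly Gaussian.
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = xa[:, None] - xb[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(-5, 5, 100)                 # finite set of input locations
K = sq_exp_kernel(x, x)                     # covariance of f(x) at those locations
K += 1e-8 * np.eye(len(x))                  # small jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)  # 3 sample functions
```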

Gaussian Process - MVN

Gaussian Process - Background
First theorized in the 1940s.
Used in geostatistics and meteorology in the 1970s for time-series prediction, under the name kriging.

Smooth Functions

Distribution over Functions

Covariance Function

Covariance Functions

Covariance Functions (continued)

Hyperparameters: Length-Scale
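As a small illustration (a squared exponential kernel is assumed here, not taken from the slide), the length-scale hyperparameter controls how quickly the correlation between function values decays with input distance; larger length-scales give smoother functions.

```python
# Sketch: effect of the length-scale on the SE covariance between two fixed points.
import numpy as np

def sq_exp(x1, x2, length_scale, signal_var=1.0):
    return signal_var * np.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

for ell in (0.5, 1.0, 2.0):
    print(f"length-scale {ell}: k(0, 1) = {sq_exp(0.0, 1.0, ell):.3f}")
# length-scale 0.5 -> 0.135, 1.0 -> 0.607, 2.0 -> 0.882
```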

Inference/Prediction
The joint distribution of the training targets and the test target is MVN.
We condition the MVN on the observed training targets.
The resulting conditional distribution is Gaussian.
Use that distribution's mean and covariance for prediction.

Magic Math (from Wikipedia)
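For reference, the standard MVN conditioning result behind GP prediction (as given on Wikipedia and in Rasmussen & Williams) is:

```latex
% Joint Gaussian over training targets y and a test value f_*,
% then condition on y to get the GP predictive distribution.
\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix}\right)
\quad\Longrightarrow\quad
\begin{aligned}
\bar{f}_* &= \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\,\mathbf{y} \\
\operatorname{var}(f_*) &= k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\,\mathbf{k}_*
\end{aligned}
```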

Discussion
Pros:
- The learned function is smooth but not constrained to a particular form
- Confidence intervals come for free
- Good prior for other models (e.g., bias n-degree polynomials to be smooth)
- Can handle multiple test points that co-vary appropriately
Cons:
- The math/optimization is messy and uses lots of approximations
- Requires an N x N matrix inversion
- Numerical stability issues
- Scalability issues in the number of instances, like SVMs
- High-dimensional data is challenging

Implementation
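A minimal sketch of how GP regression prediction can be implemented, following the standard Cholesky-based recipe from Rasmussen & Williams (Algorithm 2.1); the squared exponential kernel and the noise level here are assumptions, not the slide's actual code.

```python
# Sketch of basic GP regression prediction via a Cholesky factorization.
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    d = A[:, None] - B[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-6, **kern_args):
    K = sq_exp_kernel(X_train, X_train, **kern_args) + noise_var * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test, **kern_args)        # k_* for each test point
    K_ss = sq_exp_kernel(X_test, X_test, **kern_args)         # k_**
    L = np.linalg.cholesky(K)                                  # avoids an explicit inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # (K + noise)^-1 y
    mean = K_s.T @ alpha                                       # posterior mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                                       # posterior covariance
    return mean, cov
```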

Example
Training inputs X and targets f(X), with two query points to predict.

Example – Covariance Matrix

Example – Matrix Inversion
Next step: invert K.

Example – Multiply by Targets

Example – Predicting f(2)
Now we need k_* for f(2); k_** = 1 (no observation noise).

Example – Predicting f(2) (continued)

Example – Predicting f(4)

Example – Plot of X vs. f(X)
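A sketch of the same step-by-step computation with hypothetical training data (the actual X and f(X) values from the example are not used here): build the covariance matrix K, invert it, multiply by the targets, then form k_* for a query point, with k_** = 1 (no observation noise).

```python
# Hypothetical data standing in for the example's training table.
import numpy as np

X = np.array([1.0, 3.0, 5.0])          # hypothetical training inputs
y = np.array([0.5, -0.2, 0.7])         # hypothetical training targets f(X)

def k(a, b, length_scale=1.0):
    return np.exp(-0.5 * ((a - b) / length_scale) ** 2)

K = k(X[:, None], X[None, :])          # step 1: covariance matrix of training inputs
K_inv = np.linalg.inv(K)               # step 2: invert K (fine at this tiny size)
alpha = K_inv @ y                      # step 3: multiply by the targets

x_star = 2.0                           # query point, as in "Predicting f(2)"
k_star = k(X, x_star)                  # covariances between training inputs and x_*
k_ss = 1.0                             # k_** = k(x_*, x_*) = 1 with no obs noise
mean = k_star @ alpha                  # predictive mean for f(2)
var = k_ss - k_star @ K_inv @ k_star   # predictive variance for f(2)
```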

My Experiments

Results - Linear
Model   MSE
GP      0.009
Lasso   0.019
KNN     0.058

Results - Cubic
Model   MSE
GP      0.046
Lasso   0.037
KNN     0.101

Results - Periodic
Model   MSE
GP      0.087
Lasso   1.444
KNN     0.350

Results - Sigmoid
Model   MSE
GP      0.034
Lasso   0.045
KNN     0.060

Results - Complex
Model   MSE
GP      0.194
Lasso   0.574
KNN     0.327

Real Dataset: Concrete Slump Test
How fast does concrete flow given its materials?
103 instances; 7 input features; 3 outputs; all real-valued.
Input features are kg per cubic meter of concrete: Cement, Slag, Fly Ash, Water, SP, Coarse Aggregate, Fine Aggregate.
Output variables are Slump, Flow, and 28-day Compressive Strength.
Ran 10-fold CV with our three models on each output variable.
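The exact model settings are not specified on the slide; below is a sketch of how such a comparison could be set up with scikit-learn. The kernel choice, hyperparameter values, and the placeholder arrays standing in for the loaded dataset are all assumptions.

```python
# Sketch: comparing GP regression, Lasso, and KNN with 10-fold cross-validation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# X: (103, 7) mix quantities, y: one of the three output columns.
# Loading of the UCI Concrete Slump Test data is omitted; placeholders used here.
X, y = np.random.rand(103, 7), np.random.rand(103)

models = {
    "GP": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True),
    "Lasso": Lasso(alpha=0.1),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    print(f"{name}: {mse.mean():.3f} +- {mse.std():.3f}")
```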

Concrete Results
Slump:    Avg MSE ± Std for GP, Lasso, KNN
Flow:     Avg MSE ± Std for GP, Lasso, KNN
Strength: Avg MSE ± Std for GP, Lasso, KNN

Learning Hyperparameters
Find the hyperparameters θ that maximize the (log) probability of the data, log p(y | X, θ).
Can use gradient descent (on the negative log likelihood) by computing
∂/∂θ_j log p(y | X, θ) = (1/2) tr( (α α^T − K⁻¹) ∂K/∂θ_j ),  with α = K⁻¹ y,
where ∂K/∂θ_j is just the elementwise derivative of the covariance matrix, i.e., the derivative of the covariance function evaluated at each pair of inputs.

Learning Hyperparameters
For a squared exponential covariance function in the form
k(x_p, x_q) = σ_f² exp( −(x_p − x_q)² / (2ℓ²) ) + σ_n² δ_pq
we get the following derivatives:
∂k/∂ℓ   = σ_f² exp( −(x_p − x_q)² / (2ℓ²) ) · (x_p − x_q)² / ℓ³
∂k/∂σ_f = 2 σ_f exp( −(x_p − x_q)² / (2ℓ²) )
∂k/∂σ_n = 2 σ_n δ_pq
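As a sketch (assumptions: zero mean, the parameterization above, and a plain matrix inverse rather than a Cholesky factorization), the negative log marginal likelihood and its length-scale gradient can be computed as follows and handed to any gradient-based optimizer.

```python
# Sketch: negative log marginal likelihood and its length-scale gradient
# for a zero-mean GP with a squared exponential kernel plus noise.
import numpy as np

def nll_and_lengthscale_grad(X, y, length_scale, signal_var, noise_var):
    r = X[:, None] - X[None, :]                        # pairwise input differences
    E = np.exp(-0.5 * (r / length_scale) ** 2)
    K = signal_var * E + noise_var * np.eye(len(X))    # covariance matrix
    K_inv = np.linalg.inv(K)                           # fine for small N
    alpha = K_inv @ y
    nll = (0.5 * y @ alpha
           + 0.5 * np.linalg.slogdet(K)[1]
           + 0.5 * len(X) * np.log(2 * np.pi))
    dK = signal_var * E * r ** 2 / length_scale ** 3   # elementwise dK / d(length-scale)
    grad = -0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)  # d(nll)/d(length-scale)
    return nll, grad
```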

Covariance Function Tradeoffs

Classification
Many regression methods become binary classifiers by passing their output through a sigmoid, softmax, or other threshold (e.g., MLP, SVM, logistic regression).
The same is true of GPs, but with scarier math.

Classification Math
p(y_* = 1 | X, y, x_*) = ∫ σ(f_*) p(f_* | X, y, x_*) df_*
Here σ(f_*) is the vote for class 1 and p(f_* | X, y, x_*) is the weight for that vote.
The weight comes from GP regression on the latent function f, using the GP prior p(f | X) and the likelihood p(y | f), with the normalizing constant ignored.
With a non-Gaussian likelihood these integrals are intractable, so approximations are needed.

Bells and Whistles
- Multiple-output regression: assume correlation between the outputs
- Correlated noise: especially useful for time series
- Training on function values plus derivatives (a max or min occurs at …)
- Noise in the x-direction
- Mixtures (ensembles) of local GP experts
- Integral approximation with GPs
- Covariance over structured inputs (strings, trees, etc.)

Bibliography
Wikipedia: Gaussian Process; Multivariate Normal Distribution
Scikit-Learn documentation
C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006, ISBN 0-262-18253-X. © 2006 Massachusetts Institute of Technology.
S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd Edition, CRC Press, 2015.