Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach
Andreas Krause, Carlos Guestrin
Carnegie Mellon University

River monitoring
Want to monitor the ecological condition of a river. Need to decide where to make observations!
[Figure: mixing zone of the San Joaquin and Merced rivers; NIMS sensing platform (Kaiser et al., UCLA)]

Observation selection for spatial prediction
Gaussian processes allow prediction at unobserved locations (regression) and allow estimating the uncertainty in the prediction.
[Figure: pH observations along a horizontal transect, with the unobserved process, the GP prediction, and its confidence bands]

Mutual Information [Caselton & Zidek 1984]
Finite set of possible locations V. For any subset A ⊆ V, can compute
MI(A) = H(X_{V\A}) − H(X_{V\A} | X_A),
the entropy of the uninstrumented locations before sensing minus their entropy after sensing.
Want: A* = argmax MI(A) subject to |A| ≤ k.
Finding A* is an NP-hard optimization problem.
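For a GP both entropy terms are Gaussian, so MI(A) has a closed form in the covariance matrix. A minimal NumPy sketch (the function name, interface, and the lack of a jitter term for ill-conditioned matrices are my choices, not from the paper):

```python
import numpy as np

def gp_mutual_information(K, A):
    r"""MI(A) = H(X_{V \ A}) - H(X_{V \ A} | X_A) for a GP with covariance K.

    K: (n, n) covariance matrix over all candidate locations V = {0, ..., n-1}.
    A: list of selected location indices.
    """
    n = K.shape[0]
    B = [i for i in range(n) if i not in A]  # uninstrumented locations V \ A
    if not A:
        return 0.0  # nothing observed yet, so no entropy reduction
    K_BB = K[np.ix_(B, B)]
    K_AA = K[np.ix_(A, A)]
    K_BA = K[np.ix_(B, A)]
    # Posterior covariance of X_B given X_A (Schur complement).
    K_post = K_BB - K_BA @ np.linalg.solve(K_AA, K_BA.T)
    # Gaussian entropy difference: the (2*pi*e)^{|B|} factors cancel.
    _, logdet_prior = np.linalg.slogdet(K_BB)
    _, logdet_post = np.linalg.slogdet(K_post)
    return 0.5 * (logdet_prior - logdet_post)
```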

The greedy algorithm
Want to find: A* = argmax_{|A|=k} MI(A).
Greedy algorithm:
Start with A = ∅.
For i = 1 to k:
  s* := argmax_s MI(A ∪ {s})
  A := A ∪ {s*}
Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]: MI(A_greedy) ≥ (1 − 1/e) MI(A*), i.e., the result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.
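The loop translates directly to code. A naive sketch reusing gp_mutual_information from the block above (re-scoring every candidate from scratch each step is slow; the authors' lazy-evaluation tricks are not shown here):

```python
def greedy_mi_placement(K, k):
    """Greedy observation selection: repeatedly add the location with the
    largest MI gain.

    Under the paper's assumptions (monotone submodularity of MI on a suitable
    discretization), the result satisfies MI(A) >= (1 - 1/e) * MI(A*).
    """
    n = K.shape[0]
    A = []
    for _ in range(k):
        best_gain, s_star = max(
            (gp_mutual_information(K, A + [s]), s)
            for s in range(n) if s not in A
        )
        A.append(s_star)
    return A
```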

A priori vs. sequential
The greedy algorithm finds a near-optimal a priori set: sensors are placed before making observations.
In many practical cases, we want to select observations sequentially: choose the next observation depending on the previous observations made.
This is the focus of the talk!

Sequential design
Observed variables depend on the previous measurements and on the observation policy π. MI(π) = expected MI score over the outcomes of the observations.
[Figure: policy tree; e.g., observe X_5 first, then branch on its value (X_5 = 17 vs. X_5 = 21) to decide whether to observe X_3 or X_7 next; each completed branch has an MI score, e.g. MI(X_5 = 17, X_3 = 16, X_7 = 19) = 3.4, and MI(π) = 3.1 is their expectation]

Is sequential better?
Sets are very simple policies. Hence: max_A MI(A) ≤ max_π MI(π) subject to |A| = |π| = k.
Key question addressed in this work: How much better is sequential vs. a priori design?
Main motivation: performance guarantees about sequential design. A priori design is logistically much simpler!

GPs slightly more formally
Set of locations V; joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian.
GP defined by:
Prior mean μ(s) [often constant, e.g., 0]
Kernel K(s, t)
Example: squared exponential kernel K(s, t) = θ₁² exp(−|s − t|² / θ₂²), with θ₁ the variance (amplitude) and θ₂ the bandwidth.
[Figure: pH value vs. position along transect (m), samples of X_V]
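A sketch of this kernel and of building a covariance matrix over a 1-D transect (the exact parameterization, e.g. whether a factor of 2 appears in the bandwidth denominator, varies by convention; this follows the form above):

```python
import numpy as np

def squared_exponential(s, t, theta1=1.0, theta2=1.0):
    """K(s, t) = theta1^2 * exp(-|s - t|^2 / theta2^2).

    theta1: variance (amplitude); theta2: bandwidth.
    """
    d = np.linalg.norm(np.atleast_1d(s) - np.atleast_1d(t))
    return theta1 ** 2 * np.exp(-(d ** 2) / theta2 ** 2)

# Covariance matrix over a discretized 1-D transect of candidate locations:
locations = np.linspace(0.0, 10.0, 25)
K = np.array([[squared_exponential(s, t) for t in locations] for s in locations])
```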

Known parameters
Known parameters θ (bandwidth, variance, etc.): no benefit in sequential design!
max_A MI(A) = max_π MI(π)
Mutual information does not depend on the observed values: with known parameters, the GP posterior covariance (and hence every entropy in MI) depends only on which locations are observed, not on the measurements themselves.

Unknown parameters
Unknown parameters θ: Bayesian approach with a prior P(Θ = θ) (assume Θ discretized in this talk). Sequential design can be better!
max_A MI(A) ≤ max_π MI(π)
Mutual information now does depend on the observed values, since the parameter posterior depends on the observations!

Key intuition of our main result
If θ is known: MI(A*) = MI(π*), i.e., no gap between the best set and the best policy.
If θ is "almost" known, i.e., H(Θ) is small: MI(A*) ≈ MI(π*).
How large is the gap MI(π*) − MI(A*)? The gap depends on H(Θ).

How big is the gap?
Theorem: As H(Θ) → 0, the gap max_π MI(π) − max_A MI(A) → 0.
If H(Θ) is small, there is no point in active learning: we can concentrate on finding the best set A*!

Near-optimal policy if parameters approximately known
Use the greedy algorithm to optimize the expected score MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ).
Corollary [using our result from ICML 05]: the result of the greedy algorithm is within a constant factor (~63%) of the optimal sequential plan, minus the gap, and the gap ≈ 0 for (almost) known parameters.
If the parameters are almost known, we can find a near-optimal sequential policy. What if the parameters are unknown?

Exploration–Exploitation for GPs: analogy with reinforcement learning
Parameters: in RL, the transition and reward models P(S_{t+1} | S_t, A_t), Rew(S_t); in active learning of GPs, the kernel parameters θ.
(Almost) known parameters (exploitation): in RL, find a near-optimal policy by solving the MDP; in GPs, find a near-optimal policy by finding the best set.
Unknown parameters (exploration): in RL, try to quickly learn the parameters, needing to "waste" only polynomially many robots; in GPs, try to quickly learn the parameters. How many samples do we need?

Info-gain exploration (IGE)
The gap depends on H(Θ). Intuitive heuristic: greedily select s* = argmax_s H(Θ) − H(Θ | X_s).
Drawbacks: no sample complexity bounds, and it does not directly try to improve the spatial prediction.
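A Monte Carlo sketch of this heuristic over a discretized parameter set. Here sample_obs (draws a simulated measurement at s under a given θ) and posterior_given (runs the Bayes update after seeing X_s = x) are assumed helpers; none of these names come from the paper:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def info_gain_exploration_step(candidates, thetas, prior,
                               sample_obs, posterior_given, m=50):
    """Pick s* = argmax_s H(Theta) - E[H(Theta | X_s)] (the IGE heuristic).

    The expectation over X_s is approximated by Monte Carlo: draw theta from
    the current posterior, simulate a measurement at s, and re-run Bayes.
    """
    h_prior = entropy(prior)

    def expected_posterior_entropy(s):
        total = 0.0
        for _ in range(m):
            theta = thetas[np.random.choice(len(thetas), p=prior)]
            x = sample_obs(s, theta)  # simulated measurement at location s
            total += entropy(posterior_given(s, x, prior))
        return total / m

    return max(candidates, key=lambda s: h_prior - expected_posterior_entropy(s))
```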

Implicit exploration (IE)
Sequential greedy algorithm: given previous observations X_A = x_A, greedily select s* = argmax_s MI({s} | X_A = x_A, Θ).
Contrary to the a priori greedy algorithm, this algorithm takes the observations into account (updates the parameter posterior).
Proposition: H(Θ | X_π) ≤ H(Θ), i.e., "information never hurts" also holds for policies.
Still no sample complexity bounds. Neither of the two strategies has sample complexity bounds. Is there any way to get them?
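A sketch of one IE step, together with the Bayes update of the discretized parameter posterior. The helpers mi_gain_fn (MI gain of location s given the already-selected set, under a fixed θ) and loglik_fn (GP log-likelihood of the observed values under θ) are model-specific assumptions:

```python
import numpy as np

def update_parameter_posterior(prior, thetas, A, x_A, loglik_fn):
    """Bayes rule: P(theta | X_A = x_A) is proportional to P(theta) P(x_A | theta)."""
    logpost = np.log(prior) + np.array([loglik_fn(A, x_A, th) for th in thetas])
    logpost -= logpost.max()  # for numerical stability
    post = np.exp(logpost)
    return post / post.sum()

def implicit_exploration_step(V, A, x_A, thetas, prior, mi_gain_fn, loglik_fn):
    """Greedily pick the next location, averaging the MI gain over the
    current parameter posterior (which already reflects the observed values)."""
    posterior = update_parameter_posterior(prior, thetas, A, x_A, loglik_fn)
    candidates = [s for s in V if s not in A]

    def expected_gain(s):
        return sum(p * mi_gain_fn(s, A, theta)
                   for theta, p in zip(thetas, posterior))

    return max(candidates, key=expected_gain), posterior
```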

Learning the bandwidth
Sensors within the kernel bandwidth are correlated; sensors outside the bandwidth are ≈ independent.
So we can narrow down the kernel bandwidth by sensing both within and outside the bandwidth distance!

Hypothesis testing: Distinguishing two bandwidths
Squared exponential kernel: choose pairs of samples at the distance where the correlation gap between the two hypotheses is largest, and test the correlation!
[Figure: correlation vs. distance under BW = 1 and BW = 3; the curves differ most at an intermediate distance, which is where the test pairs should be placed]
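Picking the test distance is a one-liner once we can evaluate the predicted correlation ρ(d) = K(0, d)/K(0, 0) under each hypothesis. A sketch (the bandwidth values 1 and 3 mirror the figure's example):

```python
import numpy as np

def best_test_distance(corr_a, corr_b, distances):
    """Choose the sample-pair distance where the two candidate bandwidths
    predict the most different correlations (largest gap => easiest test)."""
    gaps = np.array([abs(corr_a(d) - corr_b(d)) for d in distances])
    i = int(np.argmax(gaps))
    return float(distances[i]), float(gaps[i])

# Squared exponential correlations rho(d) = exp(-d^2 / bw^2):
rho_bw1 = lambda d: np.exp(-d ** 2 / 1.0 ** 2)
rho_bw3 = lambda d: np.exp(-d ** 2 / 3.0 ** 2)
d_star, gap = best_test_distance(rho_bw1, rho_bw3, np.linspace(0.01, 6.0, 600))
```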

Hypothesis testing: Sample complexity Theorem: To distinguish bandwidths with minimum gap  in correlation and error <  we needindependent samples. In GPs, samples are dependent, but “almost” independent samples suffice! (details in paper) Other tests can be used for variance/noise etc. What if we want to distinguish more than two bandwidths?

Hypothesis testing: Searching for bandwidth
To distinguish more than two bandwidths, binary-search: find the "most informative split" at the posterior median (e.g., test BW > 2?, then BW > 3?, ...).
The resulting testing policy π_ITE needs only logarithmically many tests!

Hypothesis testing: Exploration
Theorem: If we have tests with error probability < δ_T, then the hypothesis-testing exploration policy π_ITE needs only logarithmically many exploration samples.

Exploration–Exploitation Algorithm
Exploration phase:
Sample according to the exploration policy.
Compute a bound on the gap between the best set and the best policy.
If the bound < a specified threshold, go to the exploitation phase; otherwise continue exploring.
Exploitation phase:
Use the a priori greedy algorithm to select the remaining samples.
For hypothesis testing, the algorithm is guaranteed to proceed to exploitation after logarithmically many samples! (A code sketch of this loop follows.)
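A skeleton of the two-phase loop, reusing greedy_mi_placement from earlier. All names here (explore_step, gap_bound, build_K, threshold) are placeholders; the real algorithm uses the paper's bound on the set-vs-policy gap:

```python
import numpy as np

def explore_then_exploit(V, budget, thetas, prior,
                         explore_step, gap_bound, build_K, threshold):
    """Explore until the bound on max_pi MI(pi) - max_A MI(A) falls below
    `threshold`, then spend the remaining budget on the a priori greedy set."""
    posterior = prior
    exploration_samples = []
    while budget > 0 and gap_bound(posterior) >= threshold:
        # e.g., the hypothesis-testing or implicit-exploration step from above
        s, posterior = explore_step(V, exploration_samples, posterior)
        exploration_samples.append(s)
        budget -= 1
    # Exploitation: a priori greedy under the (near-)resolved parameters.
    theta_map = thetas[int(np.argmax(posterior))]
    A = greedy_mi_placement(build_K(theta_map), budget)
    return exploration_samples, A
```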

Results
None of the strategies dominates the others; their usefulness depends on the application.
[Figure: temperature data; RMS error and parameter uncertainty vs. number of observations for IGE (parameter info-gain), ITE (hypothesis testing), and IE (implicit exploration)]

River data
An isotropic process is a bad fit; we need a nonstationary approach.
[Figure: pH data from the Merced river]

Nonstationarity by spatial partitioning
The exploration–exploitation approach applies to nonstationary models as well!
Partition the space into regions.
Fit an isotropic GP for each region, weighted by region membership.
The final GP is a spatially varying linear combination (see the sketch below).
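One common way to realize such a spatially varying combination is sketched below, reusing squared_exponential from earlier. The exact weighting scheme in the paper may differ; the product form w_i(s) w_i(t) is chosen here because it keeps the combined kernel positive semidefinite:

```python
import numpy as np

def nonstationary_kernel(s, t, region_kernels, membership):
    """K(s, t) = sum_i w_i(s) * w_i(t) * K_i(s, t).

    region_kernels: one isotropic kernel K_i per region.
    membership(s): normalized region-membership weights w(s) at location s.
    """
    ws, wt = membership(s), membership(t)
    return sum(w_s * w_t * K_i(s, t)
               for w_s, w_t, K_i in zip(ws, wt, region_kernels))

# Example: two regions on a 1-D transect with smooth membership weights.
def membership(s, center=5.0, sharpness=2.0):
    w1 = 1.0 / (1.0 + np.exp(sharpness * (s - center)))
    return np.array([w1, 1.0 - w1])

kernels = [lambda s, t: squared_exponential(s, t, theta2=0.5),
           lambda s, t: squared_exponential(s, t, theta2=3.0)]
```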

Nonstationary GPs
The nonstationary model fits much better.
Problem: the parameter space blows up exponentially in the number of regions.
Solution: a variational approximation (BK-style) allows efficient approximate inference. (Details in the paper.)
[Figure: stationary fit vs. nonstationary fit]

Results on river data
The nonstationary model + active learning lead to lower RMS error.
[Figure: RMS error vs. number of observations, pH data from the Merced river (Kaiser et al.)]

Conclusions
Nonmyopic approach towards active learning in GPs:
If the parameters are known, the greedy algorithm achieves near-optimal exploitation.
If the parameters are unknown, perform exploration: implicit exploration; explicit exploration using information gain; or explicit exploration using hypothesis tests, with logarithmic sample complexity bounds!
Each exploration strategy has its own advantages. The gap bound tells us when to stop exploring.
Presented an extensive evaluation on real-world data; see the poster for more details.