Nonmyopic Active Learning of Gaussian Processes: An Exploration-Exploitation Approach
Andreas Krause, Carlos Guestrin
Carnegie Mellon University

River monitoring
- Want to monitor ecological condition of river
- Need to decide where to make observations!
[Photos: mixing zone of the San Joaquin and Merced rivers; NIMS (UCLA)]

Observation selection for spatial prediction
- Gaussian processes
  - Distribution over functions (e.g., how pH varies in space)
  - Allows estimating uncertainty in prediction
[Figure: pH value vs. horizontal position, showing observations, the unobserved process, the GP prediction, and confidence bands]

Mutual information [Caselton & Zidek 1984]
- Finite set of possible locations V
- For any subset A ⊆ V, can compute
  MI(A) = H(X_{V minus A}) − H(X_{V minus A} | X_A),
  the entropy of the uninstrumented locations before sensing minus their entropy after sensing
- Want: A* = argmax MI(A) subject to |A| ≤ k
- Finding A* is an NP-hard optimization problem
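To make the criterion concrete, here is a minimal numpy sketch (the names `gaussian_entropy` and `mutual_information` are mine, not from the talk) of MI(A) computed from the joint GP covariance via Gaussian entropies:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian: 0.5 * log((2*pi*e)^d * det(cov))."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def mutual_information(Sigma, A):
    """MI(A): entropy of the uninstrumented locations (V minus A) minus their
    entropy after conditioning on X_A, for a zero-mean GP with joint covariance
    Sigma over all candidate locations; A is a list of location indices."""
    if len(A) == 0:
        return 0.0
    V = np.arange(Sigma.shape[0])
    B = np.setdiff1d(V, A)                      # uninstrumented locations
    S_BB = Sigma[np.ix_(B, B)]
    S_AA = Sigma[np.ix_(A, A)]
    S_BA = Sigma[np.ix_(B, A)]
    # Gaussian conditioning: posterior covariance of X_B given X_A
    # (note that it does not depend on the observed values x_A)
    S_B_given_A = S_BB - S_BA @ np.linalg.solve(S_AA, S_BA.T)
    return gaussian_entropy(S_BB) - gaussian_entropy(S_B_given_A)
```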

The greedy algorithm for finding optimal a priori sets
- Want to find: A* = argmax_{|A|=k} MI(A)
- Greedy algorithm:
  - Start with A = ∅
  - For i = 1 to k:
    - s* := argmax_s MI(A ∪ {s})
    - A := A ∪ {s*}
- Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:
  MI(A_greedy) ≥ (1 − 1/e) MI(A_opt),
  i.e., the result of the greedy algorithm is within a constant factor (~63%) of the optimal solution
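A correspondingly small sketch of the greedy selection, reusing `mutual_information` from above (the function name `greedy_mi_selection` is mine):

```python
def greedy_mi_selection(Sigma, k):
    """Greedy algorithm: repeatedly add the location s that maximizes MI(A u {s}),
    using the mutual_information sketch above."""
    A, candidates = [], list(range(Sigma.shape[0]))
    for _ in range(k):
        s_best = max(candidates, key=lambda s: mutual_information(Sigma, A + [s]))
        A.append(s_best)
        candidates.remove(s_best)
    return A
```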

Sequential design
- Observed variables depend on previous measurements and the observation policy π
- MI(π) = expected MI score over the outcomes of the observations
[Figure: an observation policy π drawn as a decision tree: first observe X_5, branch on whether the reading is < 20°C or ≥ 20°C to decide which variable to observe next (X_3, X_2, X_7, X_12, X_23, ...); each path, e.g. X_5 = 17, X_3 = 16, X_7 = 19 with MI = 3.4, contributes to the expected score MI(π) = 3.1]

A priori vs. sequential
- Sets are very simple policies. Hence: max_A MI(A) ≤ max_π MI(π), subject to |A| = |π| = k
- Key question addressed in this work: how much better is sequential vs. a priori design?
- Main motivation:
  - Performance guarantees about sequential design?
  - A priori design is logistically much simpler!

GPs slightly more formally
- Set of locations V
- Joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian
- GP defined by
  - Prior mean μ(s) [often constant, e.g., 0]
  - Kernel K(s,t)
- Example: squared exponential kernel
  K(s,t) = θ₁ exp(−|s − t|² / θ₂²)
  - θ₁: variance (amplitude)
  - θ₂: bandwidth
[Figure: correlation vs. distance for the squared exponential kernel]
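A small numpy sketch of this kernel; it assumes the convention K(s,s) = θ₁ (θ₁ multiplies the exponential), and the name `sq_exp_kernel` is mine:

```python
def sq_exp_kernel(X, theta1, theta2):
    """Squared exponential kernel K(s,t) = theta1 * exp(-||s-t||^2 / theta2^2).
    X: (n, d) array of locations; theta1: variance/amplitude; theta2: bandwidth."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return theta1 * np.exp(-d2 / theta2 ** 2)

# Example: covariance over a 1-D grid of locations, with a small jitter for stability
X = np.linspace(0.0, 10.0, 50)[:, None]
Sigma = sq_exp_kernel(X, theta1=1.0, theta2=2.0) + 1e-9 * np.eye(len(X))
```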

Known parameters
- Known parameters θ (bandwidth, variance, etc.): no benefit in sequential design!
  max_A MI(A) = max_π MI(π)
- Mutual information does not depend on observed values: with known parameters, the posterior covariance (and hence the conditional entropy) of the unobserved locations depends only on which locations were sensed, not on the values measured

Unknown parameters
- Unknown (discretized) parameters: prior P(Θ = θ)
- Now mutual information does depend on observed values, since the observations change the posterior over θ
- Sequential design can be better!
  max_A MI(A) ≤ max_π MI(π)

Key result: how big is the gap?
Theorem:
- If Θ = θ is known: MI(A*) = MI(π*)
- If Θ is "almost" known: MI(A*) ≈ MI(π*)
The gap between MI(π*) (MI of the best policy) and MI(A*) (MI of the best set) depends on the parameter entropy H(Θ); as H(Θ) → 0, the MI of the best policy approaches the MI of the best parameter-specific set.

Near-optimal policy if parameters approximately known
- Use the greedy algorithm to optimize
  MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)
- Note: |MI(A | Θ) − MI(A)| ≤ H(Θ)
- Can compute MI(A | Θ) analytically, but not MI(A)
- Corollary [using our result from ICML 05]: the result of the greedy algorithm is within a constant factor (~63%) of the optimal sequential plan, up to a gap term that is ≈ 0 when the parameters are known
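A sketch of the parameter-averaged objective for a discretized prior, reusing `sq_exp_kernel` and `mutual_information` from above; the dict-of-parameters representation and the jitter term are my choices:

```python
def expected_mi_over_parameters(X, A, prior):
    """MI(A | Theta) = sum over theta of P(theta) * MI(A | theta), for a discretized
    parameter prior given as a dict {(theta1, theta2): probability}."""
    total = 0.0
    for (theta1, theta2), p in prior.items():
        Sigma_theta = sq_exp_kernel(X, theta1, theta2) + 1e-9 * np.eye(len(X))
        total += p * mutual_information(Sigma_theta, A)
    return total
```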

Exploration-Exploitation for GPs: analogy with reinforcement learning
- Parameters: in RL, the transition model P(S_{t+1} | S_t, A_t) and reward Rew(S_t); in active learning of GPs, the kernel parameters θ
- Known parameters (exploitation): in RL, find a near-optimal policy by solving the MDP; in GPs, find a near-optimal policy by finding the best set
- Unknown parameters (exploration): in RL, try to quickly learn the parameters (need to waste only polynomially many robots!); in GPs, try to quickly learn the parameters. How many samples do we need?

Parameter info-gain exploration (IGE)
- Gap depends on H(Θ)
- Intuitive heuristic: greedily select
  s* = argmax_s I(Θ; X_s) = argmax_s H(Θ) − H(Θ | X_s)
  (parameter entropy before observing s minus parameter entropy after observing s)
- Does not directly try to improve spatial prediction
- No sample complexity bounds

Implicit exploration (IE)
- Intuition: any observation will help us reduce H(Θ)
- Sequential greedy algorithm: given previous observations X_A = x_A, greedily select
  s* = argmax_s MI({X_s} | X_A = x_A, Θ)
- Contrary to the a priori greedy algorithm, this algorithm takes observations into account (updates the parameter posterior)
- Proposition: H(Θ | X_π) ≤ H(Θ) ("information never hurts" holds for policies)
- No sample complexity bounds
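A hedged sketch of one IE step for a discretized parameter prior: update the posterior over θ using the GP marginal likelihood of the values observed so far, then pick the location with the largest expected MI gain. It reuses the helpers above; the zero-mean assumption and the function name are mine.

```python
from scipy.stats import multivariate_normal

def implicit_exploration_step(X, A, x_A, prior):
    """One implicit-exploration step: compute P(theta | x_A) for the discretized
    parameters, then greedily pick the location with the largest expected MI gain."""
    posterior = {}
    for theta, p in prior.items():
        if len(A) == 0:
            lik = 1.0
        else:
            Sigma = sq_exp_kernel(X, *theta) + 1e-9 * np.eye(len(X))
            S_AA = Sigma[np.ix_(A, A)]
            lik = multivariate_normal(mean=np.zeros(len(A)), cov=S_AA).pdf(x_A)
        posterior[theta] = p * lik
    Z = sum(posterior.values())
    posterior = {t: w / Z for t, w in posterior.items()}

    # Greedy step: expected MI improvement under the updated parameter posterior
    candidates = [s for s in range(len(X)) if s not in A]
    def gain(s):
        return (expected_mi_over_parameters(X, A + [s], posterior)
                - expected_mi_over_parameters(X, A, posterior))
    return max(candidates, key=gain), posterior
```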

Learning the bandwidth
- Sensors within the bandwidth are correlated; sensors outside the bandwidth are ≈ independent
- Can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance!
[Figure: sensor locations A, B, C illustrating pairs within and beyond the kernel bandwidth]

Hypothesis testing: distinguishing two bandwidths
- Squared exponential kernel: choose pairs of samples at a test distance to compare correlations!
- Pick the distance at which the correlation gap between the two candidate bandwidths is largest
[Figure: correlation vs. distance under BW = 1 and BW = 3, with the test distance marked where the gap is largest]
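A small sketch of how such a test distance could be chosen for the squared exponential correlation exp(−d²/θ₂²); the grid of candidate distances and the function name are my choices:

```python
def best_test_distance(bw_small, bw_large, distances=None):
    """For two candidate bandwidths, find the pairwise sensing distance where the
    correlation gap |exp(-d^2/bw_small^2) - exp(-d^2/bw_large^2)| is largest."""
    if distances is None:
        distances = np.linspace(0.0, 3.0 * bw_large, 301)
    gap = np.abs(np.exp(-distances ** 2 / bw_small ** 2)
                 - np.exp(-distances ** 2 / bw_large ** 2))
    i = int(np.argmax(gap))
    return distances[i], gap[i]

# Example: distinguishing BW = 1 from BW = 3 (as in the slide's figure)
d_star, corr_gap = best_test_distance(1.0, 3.0)
```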

Hypothesis testing: sample complexity
- Theorem: to distinguish bandwidths with minimum gap ε in correlation and error < δ, we need only polynomially many independent samples (the precise bound in terms of ε and δ is given in the paper)
- In GPs, samples are dependent, but "almost" independent samples suffice! (details in paper)
- Other tests can be used for variance, noise, etc.
- What if we want to distinguish more than two bandwidths?

Hypothesis testing: binary searching for the bandwidth
- Maintain a (discretized) posterior P(θ) over bandwidths and find the "most informative split" at the posterior median
- The resulting testing policy (ITE) needs only logarithmically many tests!
- Theorem: if the individual tests have error < δ_T, the error of the overall testing policy can be bounded as well (exact statement in the paper)

Exploration-Exploitation Algorithm (sketched in code below)
- Exploration phase
  - Sample according to the exploration policy
  - Compute a bound on the gap between the best set and the best policy
  - If the bound < specified threshold, go to the exploitation phase; otherwise continue exploring
- Exploitation phase
  - Use the a priori greedy algorithm to select the remaining samples
- For hypothesis testing, guaranteed to proceed to exploitation after logarithmically many samples!
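A condensed sketch of this loop (my structuring, not code from the talk): `exploration_step` can be the implicit-exploration step above, `observe` is a caller-supplied measurement function, and the entropy-based stopping test stands in for the gap bound.

```python
def explore_then_exploit(X, k, prior, gap_threshold, exploration_step, observe):
    """Explore until the gap bound (driven by the parameter entropy H(Theta)) falls
    below the threshold, then run the a priori greedy algorithm on the
    posterior-averaged objective for the remaining budget of k samples."""
    A, x_A, posterior = [], [], dict(prior)
    while len(A) < k:
        # The set-vs-policy gap shrinks with the parameter entropy H(Theta)
        entropy = -sum(p * np.log(p) for p in posterior.values() if p > 0)
        if entropy < gap_threshold:
            break
        s, posterior = exploration_step(X, A, np.array(x_A), posterior)
        A.append(s)
        x_A.append(observe(s))   # `observe` is a caller-supplied measurement function
    while len(A) < k:            # exploitation: a priori greedy on remaining budget
        candidates = [s for s in range(len(X)) if s not in A]
        s = max(candidates,
                key=lambda s: expected_mi_over_parameters(X, A + [s], posterior))
        A.append(s)
    return A
```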

Results (temperature data)
- IGE: parameter info-gain; ITE: hypothesis testing; IE: implicit exploration
- None of the strategies dominates the others; usefulness depends on the application
[Plots: RMS error vs. number of observations, and parameter uncertainty vs. number of observations, comparing IE, ITE, and IGE on temperature data]

Nonstationarity by spatial partitioning
- Isotropic GP for each region, weighted by region membership: a spatially varying linear combination (sketched in code below)
- Problem: the parameter space grows exponentially in the number of regions!
- Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper)
[Figure: stationary fit vs. nonstationary fit]
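One way to read "spatially varying linear combination" in code, assuming independent isotropic GPs per region so that Cov(s,t) = Σ_i w_i(s) w_i(t) K_i(s,t); this is a sketch under that assumption, not the paper's exact parameterization:

```python
def nonstationary_kernel(X, memberships, region_params):
    """Covariance of Z(s) = sum_i w_i(s) * Z_i(s) for independent isotropic GPs Z_i:
    Cov(s,t) = sum_i w_i(s) * w_i(t) * K_i(s,t).
    memberships: (n, m) soft region memberships; region_params: list of
    (theta1, theta2) per region, used with sq_exp_kernel from above."""
    n, m = memberships.shape
    K = np.zeros((n, n))
    for i in range(m):
        K_i = sq_exp_kernel(X, *region_params[i])
        w_i = memberships[:, i]
        K += np.outer(w_i, w_i) * K_i
    return K
```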

Results on river data
- Nonstationary model + active learning lead to lower RMS error
[Plot: RMS error vs. number of observations for IE (nonstationary), IE (isotropic), and a priori (nonstationary); larger bars = later sample]

Results on temperature data
- IE reduces error most quickly
- IGE reduces parameter entropy most quickly
[Plots: RMS error and parameter uncertainty vs. number of observations for IE (isotropic), IE (nonstationary), IGE (nonstationary), and random (nonstationary)]

Conclusions
- Nonmyopic approach towards active learning in GPs
- If parameters are known, the greedy algorithm achieves near-optimal exploitation
- If parameters are unknown, perform exploration:
  - Implicit exploration
  - Explicit, using information gain
  - Explicit, using hypothesis tests, with logarithmic sample complexity bounds!
- Each exploration strategy has its own advantages
- Can use the gap bound to compute a stopping criterion
- Presented extensive evaluation on real-world data