Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach
Andreas Krause, Carlos Guestrin
Carnegie Mellon University
River monitoring
- Want to monitor the ecological condition of a river
- Need to decide where to make observations!
[Figure: mixing zone of the San Joaquin and Merced rivers; NIMS sensor platform (Kaiser et al., UCLA)]
Observation selection for spatial prediction
- Gaussian processes allow prediction at unobserved locations (regression)
- They also allow estimating the uncertainty in the prediction
[Figure: pH value vs. horizontal position; observations, unobserved process, prediction, and confidence bands]
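As a concrete illustration of the prediction step, here is a minimal GP regression sketch in Python: it computes the posterior mean and variance at unobserved locations, from which confidence bands can be drawn. All names (`sq_exp`, `gp_posterior`, the example positions and pH readings) are illustrative, not from the talk's code.

```python
import numpy as np

def sq_exp(a, b, amp=1.0, bw=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a, b."""
    return amp * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and variance of the GP at x_new, given (x_obs, y_obs)."""
    K = sq_exp(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_star = sq_exp(x_new, x_obs)
    mean = K_star @ np.linalg.solve(K, y_obs)
    # Note: the posterior variance depends only on *where* we observed,
    # not on the observed values y_obs (this matters later in the talk).
    cov_term = K_star @ np.linalg.solve(K, K_star.T)
    var = np.maximum(np.diag(sq_exp(x_new, x_new)) - np.diag(cov_term), 0.0)
    return mean, var

x_obs = np.array([1.0, 3.0, 7.0])    # hypothetical sensor positions (m)
y_obs = np.array([7.1, 7.4, 6.9])    # hypothetical pH readings
x_new = np.linspace(0.0, 10.0, 50)   # prediction locations
mean, var = gp_posterior(x_obs, y_obs, x_new)
upper, lower = mean + 2 * np.sqrt(var), mean - 2 * np.sqrt(var)  # confidence bands
```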
Mutual Information [Caselton & Zidek 1984]
- Finite set of possible locations V
- For any subset A ⊆ V, can compute
  MI(A) = H(X_{V\A}) − H(X_{V\A} | X_A)
  (entropy of uninstrumented locations before sensing, minus entropy of uninstrumented locations after sensing)
- Want: A* = argmax_A MI(A) subject to |A| ≤ k
- Finding A* is an NP-hard optimization problem
The greedy algorithm
Want to find: A* = argmax_{|A|=k} MI(A)
Greedy algorithm:
- Start with A = ∅
- For i = 1 to k:
  - s* := argmax_s MI(A ∪ {s})
  - A := A ∪ {s*}
Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:
MI(A_greedy) ≥ (1 − 1/e) MI(A*); the result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.
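A minimal sketch of this greedy selection, assuming a squared exponential kernel over a discretized 1-D transect (the kernel is introduced more formally later in the talk). The helper names and the discretization are illustrative; the gain term uses the Gaussian identity H(X_s | X_A) = ½ log(2πe σ²_{s|A}).

```python
import numpy as np

def sq_exp_kernel(X, Y, amplitude=1.0, bandwidth=1.0):
    """Squared-exponential kernel: K(s,t) = amp * exp(-|s-t|^2 / (2 bw^2))."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return amplitude * np.exp(-d2 / (2.0 * bandwidth ** 2))

def cond_var(K, s, A, noise=1e-6):
    """Posterior variance of X_s given observations at indices A."""
    if len(A) == 0:
        return K[s, s]
    K_AA = K[np.ix_(A, A)] + noise * np.eye(len(A))
    K_sA = K[s, A]
    return K[s, s] - K_sA @ np.linalg.solve(K_AA, K_sA)

def greedy_mi(K, k):
    """Pick k indices; each step maximizes the MI gain
    H(X_s | X_A) - H(X_s | X_{V \ (A u {s})})."""
    V = list(range(K.shape[0]))
    A = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for s in set(V) - set(A):
            Abar = [t for t in V if t != s and t not in A]
            # For Gaussians the entropy difference reduces to a
            # ratio of conditional variances.
            gain = 0.5 * np.log(cond_var(K, s, A) / cond_var(K, s, Abar))
            if gain > best_gain:
                best, best_gain = s, gain
        A.append(best)
    return A

positions = np.linspace(0.0, 10.0, 25)   # candidate locations V along a transect
K = sq_exp_kernel(positions, positions, bandwidth=2.0)
print(greedy_mi(K, k=4))
```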
A priori vs. sequential
- The greedy algorithm finds a near-optimal a priori set: sensors are placed before making observations
- In many practical cases, we want to select observations sequentially: choose the next observation depending on the previous observations made
- Focus of this talk!
Sequential design
- Observed variables depend on previous measurements and the observation policy π
- MI(π) = expected MI score over the outcomes of the observations
[Figure: observation policy as a tree, e.g., observe X5, then branch on its value (X5=17 or X5=21) to observe X3 or X7; one branch yields MI(X5=17, X3=16, X7=19) = 3.4, others 2.1 and 2.4, and the policy achieves MI(π) = 3.1 in expectation]
Is sequential better?
- Sets are very simple policies; hence: max_A MI(A) ≤ max_π MI(π) subject to |A| = |π| = k
- Key question addressed in this work: how much better is sequential vs. a priori design?
- Main motivation: performance guarantees for sequential design? A priori design is logistically much simpler!
GPs slightly more formally
- Set of locations V, joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian
- GP defined by a prior mean μ(s) [often constant, e.g., 0] and a kernel K(s,t)
- Example: squared exponential kernel K(s,t) = θ₁ exp(−|s−t|² / θ₂²), where θ₁ is the variance (amplitude) and θ₂ the bandwidth
[Figure: pH value vs. position along transect (m), samples of X_V]
Known parameters
- Known parameters (bandwidth, variance, etc.): no benefit in sequential design!
  max_A MI(A) = max_π MI(π)
- Mutual information does not depend on observed values: for Gaussians, the posterior covariance (and hence the entropy) depends only on where we observe, not on the values observed
Unknown parameters
- Bayesian approach: prior P(Θ = θ) [assume θ discretized in this talk]
- Sequential design can be better! max_A MI(A) ≤ max_π MI(π)
- Mutual information now does depend on observed values: the policy π depends on the observations!
Key intuition of our main result
- If θ known: MI(A*) = MI(π*), i.e., no gap between the best set and the best policy!
- If θ "almost" known (H(θ) small): MI(A*) ≈ MI(π*)
- How large is this gap? The gap depends on H(θ)
[Figure: MI axis from 0, with MI(A*) for the best set and MI(π*) for the best policy separated by the gap]
How big is the gap?
- Theorem: as H(θ) → 0, the gap MI(π*) − MI(A*) → 0
- If H(θ) is small, there is no point in active learning: we can concentrate on finding the best set A*!
Near-optimal policy if parameters approximately known
- Use the greedy algorithm to optimize the expected score MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)
- Corollary [using our result from ICML 05]: the greedy result is within a constant factor (~63%) of the optimal sequential plan, up to the gap, and the gap ≈ 0 for (almost) known parameters
- If parameters are almost known, we can find a near-optimal sequential policy. What if parameters are unknown?
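A sketch of this exploitation step under a discretized parameter prior: run the same greedy selection, but score each candidate by the prior-weighted expected MI gain. It reuses `sq_exp_kernel`, `cond_var`, and `positions` from the earlier sketch; the particular bandwidths and prior weights are invented for illustration.

```python
import numpy as np

thetas = [1.0, 2.0, 4.0]                 # candidate bandwidths (discretized theta)
prior = np.array([0.2, 0.5, 0.3])        # assumed almost-known prior P(theta)
Ks = [sq_exp_kernel(positions, positions, bandwidth=t) for t in thetas]

def expected_mi_gain(s, A, V, weights):
    """sum_theta P(theta) * greedy MI gain of adding s to A, under each kernel."""
    Abar = [t for t in V if t != s and t not in A]
    gains = [0.5 * np.log(cond_var(K, s, A) / cond_var(K, s, Abar)) for K in Ks]
    return float(weights @ np.array(gains))

def greedy_expected_mi(k, weights=None):
    """A priori greedy selection on the expected MI score."""
    weights = prior if weights is None else weights
    V = list(range(len(positions)))
    A = []
    for _ in range(k):
        s = max(set(V) - set(A), key=lambda s: expected_mi_gain(s, A, V, weights))
        A.append(s)
    return A
```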
Exploration–Exploitation for GPs
Analogy between reinforcement learning and active learning in GPs:
- Parameters: in RL, the model P(S_{t+1} | S_t, A_t) and Rew(S_t); in GPs, the kernel parameters
- (Almost) known parameters (exploitation): in RL, find a near-optimal policy by solving the MDP; in GPs, find a near-optimal policy by finding the best set
- Unknown parameters (exploration): in RL, try to quickly learn the parameters, wasting only polynomially many robots! In GPs, try to quickly learn the parameters. How many samples do we need?
Info-gain exploration (IGE)
- The gap depends on H(θ); intuitive heuristic: greedily select s* = argmax_s H(θ) − H(θ | X_s)
- No sample complexity bounds
- Does not directly try to improve the spatial prediction
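A Monte Carlo sketch of the IGE heuristic under the discretized-parameter setup above: for each candidate location, estimate how much observing it would reduce the entropy of the parameter posterior. `entropy`, `cond_moments`, and `info_gain` are illustrative names; `thetas` and `Ks` come from the earlier sketch.

```python
import numpy as np

def entropy(p):
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

def cond_moments(K, s, A, x_A, noise=1e-6):
    """Posterior mean/variance of X_s | X_A = x_A under kernel matrix K."""
    if len(A) == 0:
        return 0.0, K[s, s]
    K_AA = K[np.ix_(A, A)] + noise * np.eye(len(A))
    w = np.linalg.solve(K_AA, K[s, A])
    return w @ x_A, K[s, s] - w @ K[s, A]

def info_gain(s, A, x_A, post, n_samples=200, rng=np.random):
    """Monte Carlo estimate of H(theta) - E[H(theta | X_s)]."""
    moments = [cond_moments(K, s, A, x_A) for K in Ks]
    gains = []
    for _ in range(n_samples):
        i = rng.choice(len(thetas), p=post)       # theta ~ current posterior
        m, v = moments[i]
        x = rng.normal(m, np.sqrt(v))             # simulate x_s ~ P(X_s | theta)
        # Bayes update of the theta posterior given the simulated observation
        lik = np.array([np.exp(-(x - mi) ** 2 / (2 * vi)) / np.sqrt(vi)
                        for mi, vi in moments])
        new_post = post * lik
        gains.append(entropy(post) - entropy(new_post / new_post.sum()))
    return np.mean(gains)
```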
Implicit exploration (IE)
- Sequential greedy algorithm: given previous observations X_A = x_A, greedily select s* = argmax_s MI({s} | X_A = x_A, Θ)
- Contrary to the a priori greedy algorithm, this algorithm takes observations into account (updates the parameter distribution)
- Proposition: H(Θ | X_π) ≤ H(Θ), i.e., "information never hurts" holds for policies
- Still no sample complexity bounds; neither of the two strategies has them. Is there any way to get them?
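A sketch of IE under the same setup: the sequential greedy algorithm, re-weighting the MI gain by the current parameter posterior and Bayes-updating that posterior after each observation. The `sense(s)` callback stands in for taking a real measurement; `cond_var` and `cond_moments` come from the sketches above.

```python
import numpy as np

def implicit_exploration(k, sense):
    """Sequentially pick k locations; update P(theta | data) after each one."""
    V = list(range(len(positions)))
    A, x_A = [], []
    post = prior.copy()
    for _ in range(k):
        def gain(s):
            Abar = [t for t in V if t != s and t not in A]
            return sum(p * 0.5 * np.log(cond_var(K, s, A) / cond_var(K, s, Abar))
                       for p, K in zip(post, Ks))
        s = max(set(V) - set(A), key=gain)
        x = sense(s)                       # take the real measurement
        # Bayes update: likelihood of x under each theta's predictive P(X_s | x_A)
        moments = [cond_moments(K, s, A, np.array(x_A)) for K in Ks]
        lik = np.array([np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(v)
                        for m, v in moments])
        post = post * lik
        post /= post.sum()                 # entropy of post drops in expectation
        A.append(s); x_A.append(x)
    return A, x_A, post
```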
Learning the bandwidth
- Sensors within one kernel bandwidth are correlated; sensors outside the bandwidth are ≈ independent
- Hence we can narrow down the kernel bandwidth by sensing within and outside the bandwidth distance!
Hypothesis testing: distinguishing two bandwidths
- Squared exponential kernel: choose pairs of samples at the distance where the correlation gap between the two hypotheses is largest, and test the correlation!
[Figure: correlation vs. distance under BW = 1 and BW = 3; the test distance is where the correlation gap is largest]
Hypothesis testing: sample complexity
- Theorem: to distinguish bandwidths with minimum gap ε in correlation and error probability < δ, we need O((1/ε²) log(1/δ)) independent samples
- In GPs, samples are dependent, but "almost" independent samples suffice! (details in paper)
- Other tests can be used for variance, noise, etc.
- What if we want to distinguish more than two bandwidths?
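A sketch of the two-bandwidth test, using the standard Hoeffding-style sample count n = ⌈log(1/δ)/(2ε²)⌉ as a stand-in for the paper's exact constant. `sample_pair(d)` is an assumed callback that senses two points at distance d and returns their values.

```python
import numpy as np

def se_corr(d, bw):
    """Correlation of two GP values at distance d under a SE kernel."""
    return np.exp(-d ** 2 / (2 * bw ** 2))

def test_bandwidths(bw_lo, bw_hi, sample_pair, delta=0.05):
    """Decide between two bandwidth hypotheses from paired samples."""
    # Choose the distance where the two hypotheses disagree the most
    ds = np.linspace(0.01, 4 * bw_hi, 400)
    gaps = np.abs(se_corr(ds, bw_lo) - se_corr(ds, bw_hi))
    d = ds[np.argmax(gaps)]
    eps = gaps.max()
    n = int(np.ceil(np.log(1 / delta) / (2 * eps ** 2)))  # Hoeffding-style count
    pairs = np.array([sample_pair(d) for _ in range(n)])
    rho_hat = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]
    # Larger bandwidth implies higher correlation at distance d
    midpoint = 0.5 * (se_corr(d, bw_lo) + se_corr(d, bw_hi))
    return bw_hi if rho_hat > midpoint else bw_lo
```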
Hypothesis testing: searching for the bandwidth
- Find the "most informative split" at the posterior median (e.g., test: BW > 2?, then test: BW > 3?)
- The testing policy π_ITE needs only logarithmically many tests!
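A sketch of this binary-search testing policy over a discretized set of bandwidths, splitting at the posterior median so that only logarithmically many pairwise tests are run. It reuses `test_bandwidths` from the previous sketch; the split rule is a plausible reading of "most informative split", not necessarily the paper's exact procedure.

```python
import numpy as np

def search_bandwidth(bws, post, sample_pair):
    """bws: sorted candidate bandwidths; post: posterior weights over them."""
    lo, hi = 0, len(bws) - 1
    while lo < hi:
        # Most informative split: the posterior median index
        cdf = np.cumsum(post[lo:hi + 1]) / post[lo:hi + 1].sum()
        mid = min(lo + int(np.searchsorted(cdf, 0.5)), hi - 1)
        if test_bandwidths(bws[mid], bws[mid + 1], sample_pair) == bws[mid + 1]:
            lo = mid + 1   # evidence says BW > bws[mid]
        else:
            hi = mid
    return bws[lo]
```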
Hypothesis testing: exploration theorem
- Theorem: if we have tests with error probability < δ_T, then the hypothesis-testing exploration policy π_ITE succeeds with high probability using only logarithmically many samples
- Notation: δ_T is the error probability of the hypothesis tests; π_ITE is the hypothesis-testing exploration policy
Exploration–Exploitation Algorithm
- Exploration phase:
  - Sample according to the exploration policy
  - Compute a bound on the gap between the best set and the best policy
  - If the bound < a specified threshold, go to the exploitation phase; otherwise continue exploring
- Exploitation phase:
  - Use the a priori greedy algorithm to select the remaining samples
- For hypothesis-testing exploration, guaranteed to proceed to exploitation after logarithmically many samples!
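Putting the pieces together, a sketch of the overall loop. The paper's actual stopping rule uses its bound on the set-vs-policy gap; here the posterior entropy H(θ) serves as an illustrative proxy for that bound, and the earlier illustrative sketches (`entropy`, `search_bandwidth`, `greedy_expected_mi`, `thetas`, `prior`) are reused.

```python
import numpy as np

def explore_exploit(budget, sense, sample_pair, threshold=0.1):
    """Explore until the (proxy) gap bound is small, then exploit greedily."""
    post = np.asarray(prior, dtype=float).copy()
    used = 0
    # Exploration phase: hypothesis-testing exploration (ITE)
    while entropy(post) > threshold and used < budget:
        bw = search_bandwidth(np.array(thetas), post, sample_pair)
        # Crude posterior update: concentrate mass on the accepted hypothesis
        post = np.where(np.isclose(thetas, bw), 1.0, 1e-3)
        post /= post.sum()
        used += 1   # counts tests; a full accounting would count samples
    # Exploitation phase: a priori greedy on the (almost) known parameters
    return greedy_expected_mi(budget - used, post)
```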
Results
- None of the strategies dominates the others; usefulness depends on the application
[Figure: temperature data; RMS error vs. number of observations and vs. parameter uncertainty, comparing IGE (parameter info-gain), ITE (hypothesis testing), and IE (implicit exploration)]
River data
- An isotropic process is a bad fit; need a nonstationary approach
[Figure: pH data from the Merced river]
Nonstationarity by spatial partitioning
- Partition space into regions; fit an isotropic GP for each region, weighted by region membership
- The final GP is a spatially varying linear combination of the regional GPs
- The exploration–exploitation approach applies to nonstationary models as well!
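A sketch of one way to realize such a model: per-region squared exponential kernels blended by smooth membership weights, giving a valid (positive semidefinite) spatially varying combination. The soft membership function and region parameters are assumed for illustration, not the paper's exact construction.

```python
import numpy as np

centers = np.array([2.0, 7.0])            # region centers (illustrative)
region_bw = np.array([0.5, 2.5])          # per-region kernel bandwidths

def membership(x, tau=1.0):
    """Soft assignment of each location in x to each region."""
    w = np.exp(-(x[:, None] - centers[None, :]) ** 2 / tau)
    return w / w.sum(axis=1, keepdims=True)

def nonstationary_kernel(a, b):
    """K(s,t) = sum_r w_r(s) w_r(t) K_r(s,t): a PSD combination, since each
    term is a product of PSD kernels (Schur product) and sums stay PSD."""
    wa, wb = membership(a), membership(b)
    K = np.zeros((len(a), len(b)))
    for r, bw in enumerate(region_bw):
        K_r = np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))
        K += np.outer(wa[:, r], wb[:, r]) * K_r
    return K
```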
Nonstationary GPs
- The nonstationary model fits much better
- Problem: the parameter space blows up exponentially in the number of regions
- Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper)
[Figure: stationary fit vs. nonstationary fit]
Results on river data
- The nonstationary model + active learning lead to lower RMS error
[Figure: RMS error vs. number of observations on pH data from the Merced river (Kaiser et al.)]
Conclusions
- Nonmyopic approach to active learning in GPs
- If parameters are known, the greedy algorithm achieves near-optimal exploitation
- If parameters are unknown, perform exploration:
  - Implicit exploration
  - Explicit exploration using information gain
  - Explicit exploration using hypothesis tests, with logarithmic sample complexity bounds!
- Each exploration strategy has its own advantages
- Can use the bound to decide when to stop exploring
- Presented an extensive evaluation on real-world data
- See yesterday's poster for more details