Sequential Off-line Learning with Knowledge Gradients
Peter Frazier, Warren Powell, Savas Dayanik
Department of Operations Research and Financial Engineering, Princeton University
2 Overview
Problem Formulation
Knowledge Gradient Policy
Theoretical Results
Numerical Results
3 Measurement Phase
[Diagram: alternatives 1 through M; N opportunities to run experiments, each yielding an experimental outcome.]
4 Implementation Phase
[Diagram: one of alternatives 1 through M is chosen and yields a reward.]
5 Taxonomy of Sampling Problems
Reward structure   On-line   Off-line
Sequential                   X
Batch
This talk addresses the sequential, off-line cell (marked X).
6 Example 1
One common experimental design is to spread measurements equally across the alternatives.
[Plot: estimated quality vs. alternative.]
7 Example 1: Round-Robin Exploration
8 Example 2
How might we improve round-robin exploration for use with this prior?
9 Largest Variance Exploration
Measure the alternative about which we are most uncertain, $x^n = \arg\max_x \sigma^n_x$.
10 Example 3
Exploitation: measure the alternative whose current estimated mean is largest, $x^n = \arg\max_x \mu^n_x$ (all three baseline rules are sketched below).
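A minimal Python sketch of these three baseline allocation rules; the array names and function signatures are illustrative, not from the talk:

```python
import numpy as np

def round_robin(n, M):
    # Measure alternatives in fixed cyclic order.
    return n % M

def largest_variance(sigma2):
    # Measure the alternative we are most uncertain about.
    return int(np.argmax(sigma2))

def exploitation(mu):
    # Measure the alternative currently believed to be best.
    return int(np.argmax(mu))
```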
11 Model
$x^n$ is the alternative tested at time $n$.
At time $n$, measure the value of alternative $x^n$:
$\hat y^{n+1} = Y_{x^n} + \varepsilon^{n+1}$, where the $Y_x$ are independent and the error $\varepsilon^{n+1}$ is independent $N(0, (\sigma^\varepsilon)^2)$.
At time $N$, choose an alternative. Maximize $\mathbb{E}[\max_x \mu^N_x]$.
12 State Transition
At time $n$ we measure alternative $x^n$ and update our estimate of $Y_{x^n}$ based on the measurement:
$\mu^{n+1}_{x^n} = \frac{(\sigma^n_{x^n})^{-2}\,\mu^n_{x^n} + (\sigma^\varepsilon)^{-2}\,\hat y^{n+1}}{(\sigma^n_{x^n})^{-2} + (\sigma^\varepsilon)^{-2}}, \qquad (\sigma^{n+1}_{x^n})^{-2} = (\sigma^n_{x^n})^{-2} + (\sigma^\varepsilon)^{-2}.$
Estimates of the other $Y_x$ do not change.
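A sketch of this conjugate-normal update in Python; argument names are illustrative:

```python
import numpy as np

def bayes_update(mu, sigma2, x, y, noise_var):
    """Update beliefs after observing measurement y of alternative x.

    mu, sigma2: arrays of posterior means and variances, one per alternative.
    noise_var: measurement noise variance, (sigma_eps)^2.
    Only alternative x changes; all other estimates are left untouched.
    """
    prec = 1.0 / sigma2[x]           # prior precision of alternative x
    noise_prec = 1.0 / noise_var     # measurement precision
    new_prec = prec + noise_prec     # precisions add under the conjugate update
    mu[x] = (prec * mu[x] + noise_prec * y) / new_prec  # precision-weighted mean
    sigma2[x] = 1.0 / new_prec
    return mu, sigma2
```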
13 Conditioned on the first $n$ measurements, $\mu^{n+1}_x$ is a normal random variable with mean $\mu^n_x$ and variance $\tilde\sigma^2_x$ satisfying
$\tilde\sigma^2_x = (\sigma^n_x)^2 - (\sigma^{n+1}_x)^2,$
the uncertainty about $Y_x$ before the measurement minus the uncertainty about $Y_x$ after the measurement: $\tilde\sigma^2_x$ is the variance of the change in our best estimate of $Y_x$ due to measurement $n$.
The value of the optimal policy satisfies Bellman's equation,
$V^n(S^n) = \max_x \mathbb{E}\big[\, V^{n+1}(S^{n+1}) \,\big|\, S^n,\ x^n = x \,\big].$
14 Utility of Information
Consider our "utility of information" at time $n$,
$U^n = \max_x \mu^n_x,$
and consider the random change in utility due to a measurement at time $n$, $U^{n+1} - U^n$.
15 Knowledge Gradient Definition
The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,
$x^{KG,n} = \arg\max_x\; \mathbb{E}\big[\, U^{n+1} - U^n \,\big|\, S^n,\ x^n = x \,\big].$
16 Knowledge Gradient
We may compute the knowledge gradient policy via
$x^{KG,n} = \arg\max_x\; \mathbb{E}\Big[\max\big(\mu^n_x + \tilde\sigma_x Z,\ \max_{x' \neq x} \mu^n_{x'}\big)\Big] - \max_{x'} \mu^n_{x'}, \qquad Z \sim N(0,1),$
which is the expectation of the maximum of a normal and a constant.
17 Knowledge Gradient
The computation becomes
$x^{KG,n} = \arg\max_x\; \tilde\sigma_x\, f\big(-|\zeta_x|\big),$
where $\Phi$ is the normal cdf, $\varphi$ is the normal pdf, and
$f(z) = z\,\Phi(z) + \varphi(z), \qquad \zeta_x = \frac{\mu^n_x - \max_{x' \neq x} \mu^n_{x'}}{\tilde\sigma_x}.$
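A compact sketch of this computation, assuming a common measurement noise variance as in the model above; names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma2, noise_var):
    """Return the KG policy's choice and the KG factors for all alternatives.

    mu, sigma2: current posterior means and variances, one entry per alternative.
    noise_var: measurement noise variance, assumed common across alternatives.
    """
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Variance of the change in the posterior mean if we measure x:
    # sigma_tilde^2 = (sigma^n)^2 - (sigma^{n+1})^2.
    sigma_next2 = 1.0 / (1.0 / sigma2 + 1.0 / noise_var)
    sigma_tilde = np.sqrt(sigma2 - sigma_next2)

    kg = np.empty(len(mu))
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))   # best competing mean
        zeta = -abs(mu[x] - best_other) / sigma_tilde[x]
        # f(z) = z * Phi(z) + phi(z)
        kg[x] = sigma_tilde[x] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return int(np.argmax(kg)), kg
```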
18 Optimality Results
If our measurement budget allows only one measurement (N = 1), the knowledge gradient policy is optimal.
19 Optimality Results
The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity. This is really a convergence result: in the limit, the policy measures every alternative infinitely often and so identifies the best alternative.
20 Optimality Results
The knowledge gradient policy's suboptimality, $V^n - V^{KG,n}$, is bounded above by an explicit expression, where $V^{KG,n}$ gives the value of the knowledge gradient policy and $V^n$ the value of the optimal policy.
21 Optimality Results
If there are exactly two alternatives (M = 2), the knowledge gradient policy is optimal. In this case, the optimal policy reduces to measuring the alternative with the larger variance, $x^n = \arg\max_x \sigma^n_x$.
22 Optimality Results
If there is no measurement noise ($\sigma^\varepsilon = 0$) and the alternatives may be reordered so that their prior means and variances satisfy a monotone ordering, then the knowledge gradient policy is optimal.
23 Numerical Experiments
100 randomly generated problems:
$M \sim$ Uniform$\{1, \ldots, 100\}$
$N \sim$ Uniform$\{M, 3M, 10M\}$
$\mu^0_x \sim$ Uniform$[-1, 1]$
$(\sigma^0_x)^2 = 1$ with probability 0.9, $= \ldots$ with probability 0.1
$(\sigma^\varepsilon)^2 = 1$
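A sketch of this problem generator. The variance used with probability 0.1 is not shown on the slide, so the constant ALT_VARIANCE below is a placeholder, purely assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed placeholder: the slide does not show this value.
ALT_VARIANCE = 10.0

def random_problem():
    """Sample one test problem with the distributions listed above."""
    M = int(rng.integers(1, 101))             # M ~ Uniform{1,...,100}
    N = int(rng.choice([M, 3 * M, 10 * M]))   # N ~ Uniform{M, 3M, 10M}
    mu0 = rng.uniform(-1.0, 1.0, size=M)      # prior means
    sigma0_2 = np.where(rng.random(M) < 0.9, 1.0, ALT_VARIANCE)  # prior variances
    noise_var = 1.0                           # (sigma_eps)^2 = 1
    return M, N, mu0, sigma0_2, noise_var
```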
24 Numerical Experiments
25 Interval Estimation
Compare alternatives via a linear combination of mean and standard deviation,
$x^n = \arg\max_x\; \mu^n_x + z_{\alpha/2}\, \sigma^n_x.$
The parameter $z_{\alpha/2}$ controls the tradeoff between exploration and exploitation.
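The decision rule is a one-liner; a sketch with illustrative names:

```python
import numpy as np

def interval_estimation(mu, sigma2, z=2.0):
    # IE index: posterior mean plus z posterior standard deviations.
    return int(np.argmax(np.asarray(mu) + z * np.sqrt(np.asarray(sigma2))))
```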
26 KG / IE Comparison
27 KG / IE Comparison
[Chart: Value of KG minus Value of IE across test problems.]
28 IE and "Sticking"
Example in which alternative 1 is known perfectly. Measuring it yields no new information, yet interval estimation can keep selecting it, "sticking" on a single alternative.
29 IE and “Sticking”
30 Thank You
Any Questions?
31 Numerical Example 1
32 Numerical Example 2
33 Numerical Example 3
34 Knowledge Gradient Example
35 Interval Estimation Example
36 Boltzmann Exploration
Parameterized by a declining sequence of temperatures $(T^0, \ldots, T^{N-1})$: alternative $x$ is measured at time $n$ with probability proportional to $\exp(\mu^n_x / T^n)$.
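A sketch of one Boltzmann (softmax) draw; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann(mu, T):
    """Sample a measurement with probability proportional to exp(mu / T)."""
    logits = np.asarray(mu, dtype=float) / T
    logits -= logits.max()        # stabilize the exponentials
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```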