
Slide 1: Optimal Learning
INFORMS TutORials, October 2008
Warren Powell and Peter Frazier, with research by Ilya Ryzhov
Princeton University

Slide 2: Outline — Introduction

Slide 3: Applications — Sports
» Who should be in the batting lineup for a baseball team?
» What is the best group of five basketball players out of a team of 12 to be your starting lineup?
» Who are the best four people to man the four-person boat for crew racing?
» Who will perform the best in competition for your gymnastics team?

Slide 4: Applications — Getting around Manhattan
What is the best way to travel?
» Walking
» Subway/walking
» Taxi
» Street bus
» Driving

Slide 5: Applications — Biomedical research
» How do we find the best drug to cure cancer?
» There are millions of combinations, with laboratory budgets that cannot test everything.
» We need a method for sequencing experiments.

Slide 6: Applications — Biosurveillance
» What is the prevalence of drug-resistant TB, MRSA, HIV/AIDS, malaria, and other diseases in the population?
» How do we efficiently collect information about the state of disease around the world?
» What are the best strategies for minimizing transmission?
[Figure: deaths from vector-borne diseases]

Slide 7: Applications — High technology
» What is the best sensor to use to evaluate the status of optics for the National Ignition Facility?
» When should lenses be inspected?
» How often should an experiment be run to test a new hypothesis on the physics of fusion?
[Figure: the National Ignition Facility]

Slide 8: Applications — Stochastic optimization
» Stochastic search over surfaces that can only be measured with uncertainty.
» Simulation-optimization – What is the best set of parameters to produce the best manufacturing configuration?
» Active learning – How do we choose which samples to collect for machine learning applications?
» Exploration vs. exploitation in approximate dynamic programming – How do we decide which states to visit to balance our need to estimate the value of being in a state against the reward from visiting a state?

Slide 9: Introduction — Deterministic optimization
» Find the choice with the highest reward (assumed known). [Figure: five choices; the one with the highest reward is marked "The winner!"]

Slide 10: Introduction — Stochastic optimization
» Now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made. [Figure: five reward distributions; the one with the highest mean is marked "The winner!"]

Slide 11: Introduction — Optimal learning
» Now you have a budget of 10 measurements to determine which of the 5 choices is best. You have an initial probability distribution for the reward that each will return, but you are willing to change your belief as you make choices. How should you sequence your measurements to produce the best answer in the end?
» We might keep trying the option we think is best... but what if the third or fourth choice is actually the best?

Slide 12: Introduction
Now assume we have five choices, with uncertainty in our belief about how well each one will perform. Imagine you can make a single measurement, after which you have to make a choice about which one is best. What would you do? [Figure: belief distributions for choices 1–5]

Slide 13: Introduction
[Same figure, after a measurement that leaves the ranking unchanged: no improvement.]

Slide 14: Introduction
[Same figure, after a measurement that reveals a new best choice: new solution.] The value of learning is that it may change your decision.

Slide 15: Outline — Types of learning problems

Slide 16: Elements of a learning problem
Things we have to think about:
» How do we make measurements? What is the nature of the measurement decision?
» What is the effect of a measurement? How does it change our state of knowledge?
» What do we do with the results of what we learn from a measurement? What is the nature of the implementation decision?
» How do we evaluate how well we have done with the results of our measurements?
» Do we learn as we go, or are we able to make a series of measurements before solving a problem?

Slide 17: Elements of a learning problem — Types of measurement decisions
» Stopping problems – Observe until you have to make a decision, such as selling an asset.
» Finite (and not too big) sets of choices.
» Subset selection:
– What is the best group of people for a sports team?
– What is the best subset of energy-saving technologies for a building?
» Continuous parameters – What is the best price, density, temperature, or speed?
» Linear, nonlinear and integer programming.

Slide 18: Elements of a learning problem — Optimal learning
» Now assume that you do not know the distribution of the reward, although you have an estimate (a "prior").
» After you make your choice, you observe the actual reward, which changes your belief about the distribution of rewards. [Figure: a prior distribution updated by an observation]

Slide 19: Elements of a learning problem — Updating the distribution
» Frequentist view
Assume we start with n observations W^1, ..., W^n.
Statistics: the sample mean and sample variance,
\bar{\theta}^n = \frac{1}{n} \sum_{m=1}^n W^m, \qquad \hat{\sigma}^{2,n} = \frac{1}{n-1} \sum_{m=1}^n (W^m - \bar{\theta}^n)^2.
Frequentist interpretation:
– \bar{\theta}^n and \hat{\sigma}^{2,n} are random variables reflecting the randomness in the observations W^1, ..., W^n.

Slide 20: Elements of a learning problem — Updating the distribution
» Bayesian view
We assume we start with a distribution of belief about the true mean: \theta \sim N(\mu^n, \sigma^{2,n}).
Next we observe W^{n+1}, which we assume comes from a distribution with variance \sigma_W^2 (we assume the variance is known).
Using Bayes' theorem, we can show that our new distribution of belief about the true mean is normally distributed with mean \mu^{n+1} and variance \sigma^{2,n+1}.
We first define the precision of a distribution as the inverse variance: \beta^n = 1/\sigma^{2,n}, \beta_W = 1/\sigma_W^2.
– The updating formulas are
\beta^{n+1} = \beta^n + \beta_W, \qquad \mu^{n+1} = \frac{\beta^n \mu^n + \beta_W W^{n+1}}{\beta^n + \beta_W}.
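As a concrete illustration, here is a minimal sketch of this precision-based update in Python. The function and variable names are ours, not from the talk; the formulas are the standard ones above.

```python
def bayes_update(mu, beta, w, beta_w):
    """Update a normal belief N(mu, 1/beta) about an unknown mean
    after observing w with known measurement precision beta_w
    (precision = 1 / variance)."""
    beta_new = beta + beta_w
    mu_new = (beta * mu + beta_w * w) / beta_new
    return mu_new, beta_new

# Example: prior N(20, 5^2), observe w = 26 with noise std 10.
mu, beta = bayes_update(mu=20.0, beta=1 / 5**2, w=26.0, beta_w=1 / 10**2)
print(mu, 1 / beta**0.5)  # mean shifts toward 26; std shrinks below 5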

Slide 21: Elements of a learning problem — Frequentist vs. Bayesian
» For optimal learning applications, we are generally in the situation where we have some knowledge about our choices, and we have to decide which one to measure to improve our final decision.
» The state of knowledge:
Frequentist view: the sample statistics, K^n = (\bar{\theta}^n, \hat{\sigma}^{2,n}, n).
Bayesian view: the parameters of the posterior, K^n = (\mu^n, \beta^n).
» For the remainder of the talk, we adopt a Bayesian view, since it allows us to introduce prior knowledge, a common property of learning problems.

Slide 22: Elements of a learning problem — Relationships between beliefs and measurements
» Beliefs
Uncorrelated – What we know about one choice tells us nothing about another choice.
Correlated – If our belief about one choice is high, our belief about another choice might be higher.
» Measurement noise
Uncorrelated – If we were to make two measurements at the same time, the measurements are independent.
Correlated:
– At a point in time – Simultaneous measurements are correlated.
– Over time – Measurements of different choices may or may not be correlated, but measurements of the same choice at different points in time are correlated.

Slide 23: Elements of a learning problem — Types of learning problems
» On-line learning: learn as you earn. Example problems:
– Finding the best path to work.
– What is the best set of energy-saving technologies to use for your building?
– What is the best medication to control your diabetes?
» Off-line learning: there is a phase of information collection with a finite (sometimes small) budget. You are allowed to make a series of measurements, after which you make an implementation decision. Examples:
– Finding the best drug compound through laboratory experiments.
– Finding the best design of a manufacturing configuration or engineering design, evaluated using an expensive simulation.
– What is the best combination of designs for hydrogen production, storage and conversion?

Slide 24: Elements of a learning problem — Measuring the benefits of knowledge
» Minimizing/maximizing a cost or reward:
Minimizing expected cost / maximizing expected reward or utility.
Minimizing expected opportunity cost (minimizing the gap from the best possible).
Collecting information to produce a better solution to an optimization problem.
» Making the right choice:
Maximizing the probability of making the correct selection.
Indifference zone selection – maximizing the probability of selecting a choice whose performance is within δ of the optimal.
» Statistical measures:
Minimizing a measure (square, absolute value) of the distance between observations and a predictive function (classical estimation).
Minimizing a metric (e.g. Kullback–Leibler divergence) measuring the distance between actual and predicted probability distributions.
Minimizing entropy (or entropic loss).

Slide 25: Outline — Measurement policies

Slide 26: Measurement policies — What do we know?
» The real average path times vs. what we think (errors are +/- 10 minutes):

Path | True mean time | Our estimate
1 | 20 minutes | 25 minutes
2 | 22 minutes | 24 minutes
3 | 24 minutes | 22 minutes
4 | 26 minutes | 20 minutes

» We act by choosing the path that we "think" is the best. The only way we learn anything new is by choosing a path. (A small simulation of this setup appears below.)
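The following is a minimal sketch (our construction, not code from the talk) of pure exploitation on this path example, assuming uniform ±10 minute noise and running-average updates. It illustrates the point of the slide: we only learn about the path we actually take.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([20.0, 22.0, 24.0, 26.0])   # real average times
belief = np.array([25.0, 24.0, 22.0, 20.0])      # what we think
count = np.ones(4)                               # treat the prior as one observation

for day in range(100):
    x = np.argmin(belief)                        # exploit: take the path we think is fastest
    obs = true_mean[x] + rng.uniform(-10, 10)    # travel time with +/-10 minute error
    count[x] += 1
    belief[x] += (obs - belief[x]) / count[x]    # running average

print(np.argmin(belief), belief.round(1))
# Only the traveled path is ever updated, so path 1 (truly fastest,
# but believed slowest) may never get a chance to prove itself.
```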

Slides 27–31: Measurement policies — Illustration of calculations
[Animated figures: the sequence of belief updates as paths are sampled; no recoverable text.]
Slide 32: Measurement policies
For problems with a finite number of alternatives:
» On-line learning (learn as you earn): this is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff. It is necessary to trade off what you think you will earn with each decision against the value of the information you will gain, which might improve decisions in the future.
» Off-line learning: you have a budget for taking measurements. After your budget is exhausted, you have to make a final choice. This is known as the ranking and selection problem.

Slide 33: Measurement policies — Elements of a measurement policy
» Deterministic or sequential:
Deterministic policy – You decide what you are going to measure in advance.
Sequential policy – Future measurements depend on past observations.
» Designing a measurement policy:
We have to strike a balance between the value of a good measurement policy and the cost of computing it.
If we are drilling oil exploration holes, we might be willing to spend a day on the computer deciding what to do next.
We may need a trivial calculation if we are guiding an algorithm that will perform thousands of iterations.
» Evaluating a policy:
The goal is to find a policy that gets us close enough to the truth that we make the optimal (or near-optimal) decisions.
To do this, we have to assume a truth, and then use a policy to try to guess at the truth.

Slide 34: Measurement policies — Finding an optimal policy
» Dynamic programming formulation:
Let K^n be the "state of knowledge."
– E.g. if we have 10 choices, each with a mean and variance, our state would be K^n = (\mu_x^n, \sigma_x^{2,n}), x = 1, ..., 10.
An optimal learning policy is characterized by Bellman's equation:
V(K^n) = \max_x \mathbb{E}\left[ C(K^n, x) + V(K^{n+1}(x)) \mid K^n \right].
» Computational challenges:
The state variable has 20 dimensions, each continuous. Solving this is impossible (and this is a simple problem!)

Slide 35: Measurement policies — Special case: on-line learning with independent beliefs
» Multi-armed bandit problem – Which slot machine should I try next to maximize total expected rewards?
» Breakthrough (Gittins and Jones, 1974):
We do not need to solve the high-dimensional dynamic program.
Compute a single index (the "Gittins index") for each slot machine.
Try the slot machine with the largest index.
For normally distributed rewards, the index takes the form
\nu_x^{Gittins} = \mu_x^n + \sigma_W \, \Gamma\!\left( \sigma_x^{2,n}/\sigma_W^2, \gamma \right),
where \mu_x^n is the current estimate of the reward from machine x, \sigma_W is the standard deviation of a measurement, and \Gamma(\cdot, \gamma) is the Gittins index for a problem with mean zero and variance 1.
» Notes:
Yao (2006) and Brezzi and Lai (2002) provide analytical approximations for \Gamma.
Despite the extensive literature on index policies, the range of applications is fairly limited.

Slide 36: Measurement policies — Heuristic measurement policies
» Pure exploitation – Always make the choice that appears to be the best.
» Pure exploration – Make choices at random so that you are always learning more, but without regard to the cost of the decision.
» Hybrid – Explore with probability \epsilon and exploit with probability 1 - \epsilon.
Epsilon-greedy exploration – explore with probability \epsilon_n, which goes to zero as n \to \infty, but not too quickly (e.g. \epsilon_n = c/n).
» Boltzmann exploration – Explore choice x with probability
P^n(x) = \frac{\exp(\theta \mu_x^n)}{\sum_{x'} \exp(\theta \mu_{x'}^n)}.
» Interval estimation (upper confidence bounding) – Choose the x which maximizes
\mu_x^n + z_\alpha \sigma_x^n.
(A code sketch of these policies follows.)
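Here is a minimal sketch of these heuristic policies in Python. The function name, the default parameter values, and the policy labels are ours; the selection rules are the ones on the slide.

```python
import numpy as np

def choose(mu, sigma, rng, policy="eps_greedy", eps=0.1, theta=1.0, z_alpha=2.0):
    """Pick which alternative to measure next under a heuristic policy.
    mu, sigma: current belief means and standard deviations (arrays)."""
    if policy == "exploit":        # pure exploitation
        return int(np.argmax(mu))
    if policy == "explore":        # pure exploration
        return int(rng.integers(len(mu)))
    if policy == "eps_greedy":     # explore with probability eps, else exploit
        if rng.random() < eps:
            return int(rng.integers(len(mu)))
        return int(np.argmax(mu))
    if policy == "boltzmann":      # P(x) proportional to exp(theta * mu_x)
        p = np.exp(theta * (mu - mu.max()))  # subtract max for numerical stability
        return int(rng.choice(len(mu), p=p / p.sum()))
    if policy == "interval":       # interval estimation / upper confidence bound
        return int(np.argmax(mu + z_alpha * sigma))
    raise ValueError(policy)
```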

Slide 37: Measurement policies — Approximate policies for off-line learning
» Optimal computing budget allocation (Chen et al.) – Formulates the problem of allocating a set of observations as an optimization problem subject to a budget constraint.
» LL(S) – Batch linear loss (Chick et al.).
» Maximizing the expected value of a single measurement (R1, R1, ..., R1):
Gupta and Miescke (1996).
EVI (Chick, Branke and Schmidt, under review).
"Knowledge gradient" (Frazier and Powell, 2008).

Slide 38: Measurement policies — Evaluating measurement policies
» How do we compare one measurement policy to another?
» One possibility is to score each policy by its own estimate of the value of the apparently best choice... but we would be wrong!

Slide 39: Measurement policies — Illustration
» Setup:
Option 1 is worth 15.
The remaining 999 options are worth 10.
The standard deviation of a measurement is 5.
» Policy 1: measure each option 10 times.
» Policy 2: measure each of the remaining 999 options once; measure option 1 9,001 times.
» Which measurement policy produces the best result?

Slide 40: Measurement policies — Measuring each alternative 10 times
[Figure: the resulting estimates, with the apparent "best choice" marked.]

Slide 41: Measurement policies — Measuring option 1 9,001 times, and everything else once
[Figure: the resulting estimates, with a "lucky choice" measured only once appearing best.]

Slide 42: Measurement policies — What did we find?
» Although option 1 is best, we will almost always identify some other option as being better, just through randomness. This method rewards collecting too little information.
» A better way (see the sketch below):
Assume a truth \theta_x for each x. We do this by choosing a sample realization of a truth from a prior probability distribution for the mean.
Given this truth, apply policy \pi to produce statistical estimates \bar{\mu}_x^N. Let x^\pi = \arg\max_x \bar{\mu}_x^N be the best solution based on these estimates.
Repeat this n times and evaluate the policy using the average true value of the implemented decisions,
F^\pi = \frac{1}{n} \sum_{i=1}^{n} \theta^i_{x^\pi(i)}.
» Note: this must be done with realistic (but not real) data.
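The following is a minimal sketch of this truth-from-the-prior evaluation procedure, assuming independent normal beliefs and the Bayesian update from slide 20. The function name and signature are ours.

```python
import numpy as np

def evaluate_policy(policy, prior_mu, prior_sigma, sigma_w, budget,
                    n_truths=1000, seed=0):
    """Estimate F^pi: sample a truth theta from the prior, run the
    measurement policy for `budget` observations, implement the
    apparently best option, and score it against the sampled truth."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_truths):
        theta = rng.normal(prior_mu, prior_sigma)   # sampled truth
        mu, beta = prior_mu.copy(), 1 / prior_sigma**2
        for _ in range(budget):
            x = policy(mu, 1 / np.sqrt(beta), rng)  # policy picks what to measure
            w = rng.normal(theta[x], sigma_w)       # noisy observation
            beta[x] += 1 / sigma_w**2               # Bayesian update (slide 20)
            mu[x] += (w - mu[x]) * (1 / sigma_w**2) / beta[x]
        total += theta[np.argmax(mu)]               # implement the apparent best
    return total / n_truths

# e.g. score interval estimation with z_alpha = 2:
# evaluate_policy(lambda mu, s, rng: int(np.argmax(mu + 2 * s)),
#                 prior_mu=np.full(5, 20.0), prior_sigma=np.full(5, 5.0),
#                 sigma_w=10.0, budget=10)
```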

Slide 43: Outline — The knowledge gradient policy

Slide 44: The knowledge gradient — Basic principle
» Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
» What choice would you make now to maximize the expected value of the implementation decision?
[Figure: belief distributions for choices 1–5; a measurement changes the estimate of the value of option 5, and a large enough change produces a change in the decision.]

Slide 45: The knowledge gradient — General model
» Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.
» Notation: [notation table shown as a figure].

Slide 46: The knowledge gradient
» The knowledge gradient is the marginal value of a single measurement x: the expected improvement in the optimization problem we solve with what we know,
\nu_x^{KG,n} = \mathbb{E}\left[ \max_{x'} \mu_{x'}^{n+1}(x) \mid K^n \right] - \max_{x'} \mu_{x'}^{n},
where K^n is the knowledge state, K^{n+1}(x) is the updated knowledge state given measurement x, the expectation is over the different measurement outcomes, \max_{x'} \mu_{x'}^n is the optimization problem given what we know, and \max_{x'} \mu_{x'}^{n+1}(x) is the new optimization problem.
» The challenge is a computational one: how do we compute the expectation?

Slide 47: The knowledge gradient — Derivation
» Notation: \mu_x^n and \beta_x^n = 1/\sigma_x^{2,n} are the mean and precision of our belief about x after n measurements; \beta_W = 1/\sigma_W^2 is the measurement precision.
» We update the precision using
\beta_x^{n+1} = \beta_x^n + \beta_W if we measure x (and \beta_x^{n+1} = \beta_x^n otherwise).
» In terms of the variance, this is the same as
\sigma_x^{2,n+1} = \left( \frac{1}{\sigma_x^{2,n}} + \frac{1}{\sigma_W^2} \right)^{-1}.

Slide 48: The knowledge gradient — Derivation
» The change in variance due to measuring x can be found to be
\tilde{\sigma}_x^{2,n} = \sigma_x^{2,n} - \sigma_x^{2,n+1}.
» Next compute the normalized influence:
\zeta_x^n = - \left| \frac{ \mu_x^n - \max_{x' \neq x} \mu_{x'}^n }{ \tilde{\sigma}_x^n } \right|.
» Let
f(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta),
where \phi and \Phi are the standard normal pdf and cdf.
» The knowledge gradient is computed using
\nu_x^{KG,n} = \tilde{\sigma}_x^n \, f(\zeta_x^n).
(A code sketch follows.)
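Here is a minimal Python sketch of this computation for independent normal beliefs, following the sigma-tilde / zeta / f construction above. The function name is ours; ties in the means are handled loosely.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma, sigma_w):
    """KG value of measuring each alternative, for independent normal
    beliefs N(mu_x, sigma_x^2) and measurement noise std sigma_w."""
    var = sigma**2
    # sigma_tilde^2 = var^n - var^{n+1}
    sigma_tilde = np.sqrt(var - 1 / (1 / var + 1 / sigma_w**2))
    # distance from each mean to the best *other* mean
    best = np.max(mu)
    second = np.partition(mu, -2)[-2]
    best_other = np.where(mu == best, second, best)
    zeta = -np.abs(mu - best_other) / sigma_tilde
    return sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))

mu = np.array([1.0, 1.5, 0.5])
nu = knowledge_gradient(mu, sigma=np.array([1.0, 0.2, 2.0]), sigma_w=1.0)
print(np.argmax(nu))  # the KG policy measures the alternative with the largest nu
```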

Slide 49: The knowledge gradient
[Figure: the knowledge gradient computed for choices 1–5.]

Slide 50: The knowledge gradient — Properties of the knowledge gradient policy
» Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
» The best single measurement you can make (by construction).
» Asymptotically optimal (a more difficult proof): as the measurement budget grows, we get the optimal solution.
» The knowledge gradient policy is the only stationary policy with both behaviors: many policies are asymptotically optimal (e.g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but they are not myopically optimal.

Slides 51–53: The knowledge gradient
[Figures: for each decision, the current estimate of its value, the current estimate of its standard deviation, and the resulting knowledge gradient values.]

Slide 54: The knowledge gradient — Experimental comparisons
» KG vs.: Boltzmann exploration, interval estimation, equal allocation, OCBA, pure exploitation, and linear loss LL(S).
[Figures: head-to-head performance plots, KG against each competing policy.]

Slide 55: The knowledge gradient — Notes
» KG slightly outperforms interval estimation (IE), OCBA, and LL(S), and is easier to compute than OCBA and LL(S).
» KG is fairly easy to compute for independent, normally distributed rewards.
» But KG is a general concept which generalizes to other important problem classes:
On-line applications.
Correlated beliefs.
Correlated measurements (e.g. common random numbers).
... more general optimization problems.

Slide 56: Outline — The knowledge gradient for on-line applications (research by Ilya Ryzhov, Ph.D. candidate, Princeton University)

Slide 57: KG for on-line learning problems — Setting
» Assume we have a budget of N measurements and that we have made n of them.
» As before, let \nu_x^{KG,n} be the value of measuring x after n measurements.
» Proposal: assume that each of the N - n - 1 remaining decisions is improved by \nu_x^{KG,n}. So an estimate of the value of measuring x — what we think we will earn now, plus the value of the information now on future decisions — is given by
\nu_x^{OLKG,n} = \mu_x^n + (N - n - 1)\, \nu_x^{KG,n}.
» For infinite-horizon problems with discount factor \gamma:
\nu_x^{OLKG,n} = \mu_x^n + \frac{\gamma}{1-\gamma}\, \nu_x^{KG,n}.
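A minimal sketch of the on-line KG decision rules, reusing the knowledge_gradient() function from the earlier off-line sketch (the function names are ours):

```python
import numpy as np  # assumes knowledge_gradient() from the slide 48 sketch

def online_kg_choice(mu, sigma, sigma_w, n, N):
    """On-line KG: trade off the immediate reward mu_x against the
    value of the information, applied to the remaining N-n-1 decisions."""
    nu = knowledge_gradient(mu, sigma, sigma_w)
    return int(np.argmax(mu + (N - n - 1) * nu))

def online_kg_choice_discounted(mu, sigma, sigma_w, gamma):
    """Infinite-horizon variant with discount factor gamma."""
    nu = knowledge_gradient(mu, sigma, sigma_w)
    return int(np.argmax(mu + gamma / (1 - gamma) * nu))
```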

Slide 58: KG for on-line learning problems — On-line KG vs. Gittins
[Figures: in one experiment on-line KG slightly outperforms Gittins; in another, on-line KG matches Gittins.]

Slide 59: KG for on-line learning problems — Experimental comparisons
» On-line KG vs.: Gittins indices, interval estimation, and pure exploitation, under both a uniform prior and a heterogeneous prior.
[Figures: comparison plots.]

Slide 60: Outline — The knowledge gradient with correlated beliefs (presented by Peter Frazier, Ph.D. candidate, Princeton University)

Slide 61: The correlated knowledge gradient (CKG)
With the independent knowledge gradient, measuring one alternative taught us nothing about the others. Sometimes this is not true. In such cases we should use the correlated knowledge gradient. [Figures: beliefs about choices 1–5 before and after a measurement, with neighboring beliefs shifting together.]

Slide 62: The correlated knowledge gradient (CKG)
Measuring one alternative can teach us about other alternatives. Examples:
» Finding the best price at which to sell a product: demand at a price of $8 is close to demand at a price of $9.
» Marketing budget allocation: two budgets that differ only in the amount allocated to radio advertising will produce similar effects.
» Choosing the temperature and pressure at which to run a chemical process.
» Choosing a combination of products to sell in a product line: two product lines with all but one product in common will produce similar revenue streams.
» Choosing a combination of drugs to treat a disease.
» Finding a chemical for a particular medical or industrial purpose: two chemicals sharing similar molecular structures behave similarly.

Slide 63: The correlated knowledge gradient (CKG)
To compute the correlated KG policy, we use the same formula, but now the updated value F(y, K(x)) differs from F(y, K) for every y, not just for y = x. This changes the expectation. [Figure: measure at one alternative, and the beliefs about nearby alternatives change too.]

Slide 64: CKG for Wind Farm Location
The posterior knowledge state K(x) is written here as K(x, W) to indicate its dependence on the observation W.

Slides 65–82: CKG for Wind Farm Location
[Animated figures: beliefs about wind speed across candidate locations, and how measuring one location updates the beliefs about nearby locations; no recoverable text.]

Slides 83–86: CKG for Wind Farm Location
[Figures: the updated value of each candidate location as a linear function of the observation, F(y, K) + s_y W.] Depending on W, the best location is 1, 34, 48 or 49.

Slide 87: CKG for Wind Farm Location
[Figure: the four lines F(y, K) + s_y W plotted together], where 1, 34, 48 and 49 are the locations that may be best; a_1, ..., a_4 are the values of F(y, K) at these locations, and b_1, ..., b_4 are the values of s_y at these locations.

Slide 88: CKG for Wind Farm Location
[Figure: the same four lines written as a_1 + b_1 W, ..., a_4 + b_4 W.]

Slide 89: CKG for Wind Farm Location
[Figure: the upper envelope of the lines a_i + b_i W over W.] Then the knowledge gradient is given by summing over the breakpoints of this envelope,
\nu^{KG} = \sum_{i} (b_{i+1} - b_i)\, f(-|c_i|),
where c_i is the value of W at which lines i and i+1 intersect, f(z) = \phi(z) + z\Phi(z), \phi is the normal pdf, and \Phi is the normal cdf. (A code sketch follows.)
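Here is a minimal sketch of this computation in Python. It makes a simplifying assumption we state in the docstring — the slopes b are strictly increasing and every line is maximal for some W — whereas the full algorithm (Frazier et al.) first sorts the lines and prunes dominated ones. A Monte Carlo estimate of the same expectation is included as a sanity check; all names are ours.

```python
import numpy as np
from scipy.stats import norm

def f(z):
    """f(z) = phi(z) + z * Phi(z)."""
    return norm.pdf(z) + z * norm.cdf(z)

def correlated_kg(a, b):
    """E[max_i (a_i + b_i W)] - max_i a_i for W ~ N(0,1).

    Sketch only: assumes b is strictly increasing and that every line
    a_i + b_i W is maximal on some interval of W (no dominated lines)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    c = (a[:-1] - a[1:]) / (b[1:] - b[:-1])  # breakpoints c_i of the envelope
    return float(np.sum((b[1:] - b[:-1]) * f(-np.abs(c))))

def correlated_kg_mc(a, b, n=200_000, seed=0):
    """Monte Carlo check of the same expectation."""
    W = np.random.default_rng(seed).standard_normal(n)
    vals = (np.asarray(a) + np.outer(W, np.asarray(b))).max(axis=1)
    return vals.mean() - max(a)

# Two lines a=(0,0), b=(0,1): exact value is E[max(0,W)] = 1/sqrt(2*pi)
print(correlated_kg([0, 0], [0, 1]), correlated_kg_mc([0, 0], [0, 1]))
```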

Slides 90–93: CKG for Wind Farm Location
[Figures: wind speed vs. location, showing the function's true value and our belief (mean and standard deviation band); a measurement is taken, and the belief tightens around it over successive slides.]

Slide 94: CKG in 1 Dimension
Opportunity cost (OC) = (value of best alternative) − (value of implemented alternative).
Comparison of:
» CKG = correlated knowledge gradient
» SKO = sequential kriging optimization (Huang et al. 2006)
» IKG = independent knowledge gradient
[Figure: opportunity cost vs. number of measurements.]

Slide 95: CKG in 2 Dimensions
[Figures: the Branin test function, with plots of typical behavior and average behavior.]

Slide 96: CKG for Diabetes Treatments
» Healthy people have fasting plasma glucose (FPG) levels of 4 to 6 mmol/L. Diabetes patients have FPG levels of 10 to 15 mmol/L.
» There are 4 diabetes drugs: A, B, C, D. By themselves, each drug reduces FPG by between 0.5 and 2 mmol/L.
» Goal: find the combination of 3 drugs that most reduces FPG. FPG reduction is approximately additive.
[Figure: FPG reductions for the individual drugs and for combinations such as ABC, ABD, BCD.]

Slide 97: CKG for Diabetes Treatments
» Individual drug and combination effects are unknown. We have a prior belief on their values.
» The FPG reduction achieved by a combination can be measured with noise.
» What sequence of combinations should we test to find the best one?
[Figure: prior beliefs for drugs A–D and combinations ABC, ABD, ACD, BCD.]

Slides 98–99: CKG for Diabetes Treatments
When we measure one drug, we learn about all combinations containing that drug. [Figures: beliefs before and after a measurement of B; the beliefs for ABC, ABD and BCD all shift.]

Slides 100–101: CKG for Diabetes Treatments
When we measure a combination of drugs, we learn about all other combinations with drugs in common. [Figures: beliefs before and after a measurement of ABC; the beliefs for ABD, ACD and BCD shift as well.]

Slide 102: CKG for Diabetes Treatments
The knowledge gradient for this problem is given by
\nu^{KG} = \sum_{i} (b_{i+1} - b_i)\, f(-|c_i|),
where f(z) = \phi(z) + z\Phi(z), \phi is the normal pdf, and \Phi is the normal cdf. The formula is identical to that from the wind farm example:
» a_i corresponds to F(y, K), as before.
» b_i is the "sensitivity" of F(y, K(x)) to the observation resulting from x. This is computed to reflect the correlation structure of the diabetes treatments.

Slide 103: Outline — Optimal learning on a graph (research by Ilya Ryzhov, Ph.D. candidate, Princeton University)

Slide 104: Information collection on a graph — Congestion detection
» We have a vehicle traveling segments of the network to evaluate congestion. The information will then be used to route emergency vehicles (the implementation decision).
» Traveling an arc from i to j means it is now possible to look at arcs out of j.
» How should we traverse the network?

Slide 105: Information collection on a graph — Biosurveillance
» You are part of a medical team traveling around Africa to measure the presence of malaria in different parts of the continent.
» Visiting one location changes the cost of visiting other locations.
» How do you sequence your measurements to have the greatest impact on prevention strategies (the implementation decision)?
[Figure: map of the presence of malaria.]

Slide 106: Information collection on a graph — Optimal routing over a graph
[Figure: a network, annotated with the current node (e.g. node 2), the decision to go to a node (e.g. node 5), and the downstream node (e.g. node 5).]

Slide 107: Information collection on a graph — Ranking and selection/bandit problems
» Making a measurement changes our belief about the value of each choice.
[Figure: the current state of knowledge about choices 1–5, a decision to make a measurement, and the new state of knowledge.]

Slide 108: Information collection on a graph — Notation
» Bellman's equation can be used to describe transitions to different nodes of a network, or changes in our distributional knowledge about the value of each decision.
» To distinguish between a physical state (the node we are at) and the knowledge state (the probability distribution describing our understanding of the value of a decision), let
S^n = (R^n, K^n),
where R^n is the physical state (the "resource state") and K^n is the knowledge state.

Slides 109–112: Information collection on a graph — Optimal routing over a graph
[Figures: the shortest path under our current beliefs; the path we explore; after updating, a new shortest path emerges.] How do we decide which links to measure?

Slide 113: Information collection on a graph — The knowledge gradient on a graph
» We can apply the knowledge gradient concept directly.
» How do we compute the expected value of a stochastic shortest path problem?
[Figure: what we believe about link costs defines the current shortest path problem; updated distributions of arc costs define a new shortest path problem, with an expected value after the update.]

Slide 114: Information collection on a graph — The knowledge gradient on a graph
» The knowledge gradient is almost the same as it was for the ranking and selection problem.
» Just think of each path through the network as a possible choice.
» Recall that the knowledge gradient for measuring link (i,j) is given by
\nu_{ij}^{KG,n} = \tilde{\sigma}_{ij}^n \, f(\zeta_{ij}^n),
» where
f(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta),
» and \tilde{\sigma}_{ij}^n is the change in the standard deviation of our belief about the cost of link (i,j) due to measuring it; the normalized influence \zeta_{ij}^n is defined on the next slide.

Slide 115: Information collection on a graph — Optimal routing over a graph
» For the graph problem, the normalized influence for measuring link (i,j) is given by
\zeta_{ij}^n = - \left| \frac{ V^n_{ij} - W^n_{ij} }{ \tilde{\sigma}_{ij}^n } \right|,
writing V^n_{ij} for the value of the best path that includes link (i,j) and W^n_{ij} for the value of the best path that does not include link (i,j). (A code sketch follows.)
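Pulling slides 113–115 together, here is a rough sketch (our construction, not code from the talk) of how the KG for each link could be computed with networkx. The edge attribute name "mu", the brute-force "remove the edge and re-solve" step for the best path avoiding (i,j), and the cost (minimization) orientation are all illustrative assumptions.

```python
import numpy as np
import networkx as nx
from scipy.stats import norm

def kg_on_graph(G, source, target, sigma, sigma_w):
    """Sketch: pick the link with the largest KG. G is a DiGraph whose
    edge attribute 'mu' is the believed mean cost; sigma[(i, j)] is the
    belief std for each link; sigma_w is the measurement noise std."""
    d_from = nx.single_source_dijkstra_path_length(G, source, weight="mu")
    d_to = nx.single_source_dijkstra_path_length(G.reverse(), target, weight="mu")
    kg = {}
    for i, j, data in G.edges(data=True):
        if i not in d_from or j not in d_to:
            continue  # link unreachable from source or target
        v_with = d_from[i] + data["mu"] + d_to[j]        # best path using (i, j)
        G2 = G.copy()
        G2.remove_edge(i, j)                             # best path avoiding (i, j)
        try:
            v_without = nx.dijkstra_path_length(G2, source, target, weight="mu")
        except nx.NetworkXNoPath:
            continue
        var = sigma[(i, j)] ** 2                         # sigma-tilde as on slide 48
        s_tilde = np.sqrt(var - 1 / (1 / var + 1 / sigma_w**2))
        zeta = -abs(v_with - v_without) / s_tilde
        kg[(i, j)] = s_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return max(kg, key=kg.get)  # measure the link with the largest KG
```

Re-solving the shortest path once per link is only a sketch-level shortcut; a practical implementation would reuse the two Dijkstra trees rather than copy the graph for every edge.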
