1
A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN
INRA-MIA Toulouse
E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
2
Plan
1. Problem statement
2. General approach
3. Formulation using a dynamic model
4. Reinforcement learning solution
5. Experiments
6. Conclusions
3
PROBLEM STATEMENT: Adaptive sampling
Adaptive selection of variables to observe for the reconstruction of the random vector X = (X(1), …, X(n)).
c(A) -> cost of observing the variables X(A)
B -> initial budget
Observations are reliable.
[Figure: grid of the nine variables X(1), …, X(9) with neighbourhood edges]
4
PROBLEM STATEMENT: Adaptive sampling
Adaptive selection of variables to observe for the reconstruction of the random vector X = (X(1), …, X(n)).
c(A, x(A)) -> cost of observing the variables X(A) in state x(A)
B -> initial budget
Observations are reliable.
Problem: find a strategy / sampling policy that adaptively selects the variables to observe in order to: optimize the quality of the reconstruction of X / respect the initial budget.
6
DEFINITION: Adaptive sampling policy
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
7
DEFINITION: Adaptive sampling policy
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
-- Example --
A_1 = δ_1 = {8}, with observation x(A_1) = 0
A_2 = δ_2((8, 0)) = {6, 7}, with observations x(A_2) = (1, 3)
[Figure: grid of the nine variables X(1), …, X(9) with the observed variables highlighted]
8
DEFINITIONS
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
Example: A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (1, 3).
Vocabulary: a history {(A_1, x(A_1)), …, (A_H, x(A_H))} is a trajectory followed when applying δ; over the set of all reachable histories of δ, c(·) ≤ B, i.e. the cost of every history respects the initial budget.
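To make this definition concrete, here is a minimal Python sketch (not the authors' code) of how such a policy is applied: policy, observe and cost are hypothetical callables standing for δ, for the sampled field and for c(A, x(A)), and the budget handling is simplified compared with policies that respect B by construction.

```python
from typing import Callable, Tuple

History = Tuple[Tuple[frozenset, Tuple[int, ...]], ...]   # ((A_1, x(A_1)), ..., (A_t, x(A_t)))

def run_policy(policy: Callable[[History], frozenset],
               observe: Callable[[frozenset], Tuple[int, ...]],
               cost: Callable[[frozenset, Tuple[int, ...]], float],
               budget: float) -> History:
    """Apply an adaptive sampling policy until it stops or the budget B would be exceeded."""
    history: History = ()
    spent = 0.0
    while True:
        A_next = policy(history)            # A_{t+1} = delta((A_1, x(A_1)), ..., (A_t, x(A_t)))
        if not A_next:
            break                           # the policy decides to stop sampling
        x_next = observe(A_next)            # observations are reliable
        step_cost = cost(A_next, x_next)    # c(A, x(A)): cost may depend on the observed states
        if spent + step_cost > budget:
            break                           # respect the initial budget B
        spent += step_cost
        history += ((A_next, x_next),)
    return history
```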
9
GENERAL APPROACH
10
1. Find a distribution ℙ that describes well the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
11
STATE OF THE ART
1. Find a distribution ℙ that describes well the phenomenon under study: X a continuous random vector, ℙ a multivariate Gaussian joint distribution.
2. Define the value of an adaptive sampling policy: entropy-based criterion, kriging variance.
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.
12
OUR CONTRIBUTION
1. Find a distribution ℙ that describes well the phenomenon under study: X a continuous random vector -> X a discrete random vector; ℙ a multivariate Gaussian joint distribution -> ℙ a Markov random field distribution.
2. Define the value of an adaptive sampling policy: entropy-based criteria / kriging variance -> Maximum Posterior Marginals (MPM).
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm -> reinforcement learning.
14
Formulation using a dynamic model: an adapted framework for reinforcement learning
15
Summarize the knowledge on X in a random vector S of length n. Observing variables updates our knowledge on X, i.e. makes S evolve.
Example: s = (-1, …, k, …, -1) with value k at position i means that variable X(i) was observed in state k; entries equal to -1 are not yet observed.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3 -A_3-> … -> s_{H+1}, with final utility U(s_{H+1})]
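A minimal sketch of this state encoding, in Python (0-indexed; the function names are illustrative, not from the presentation):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """Nothing observed yet: every entry is -1."""
    return np.full(n, -1, dtype=int)

def update_state(s: np.ndarray, A, x_A) -> np.ndarray:
    """Evolution of S: after observing X(A) in states x(A), store those values."""
    s_next = s.copy()
    for i, k in zip(A, x_A):
        s_next[i] = k        # observations are reliable, so entry i now holds the true state k
    return s_next

# Example (0-indexed): n = 9 variables, observe X(8) in state 0, then X(6), X(7) in states 1, 3
s1 = initial_state(9)
s2 = update_state(s1, [7], [0])
s3 = update_state(s2, [5, 6], [1, 3])
```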
17
Reinforcement learning solution
18
Find the optimal policy: the Q-function
Compute Q*, then compute δ*.
Q*(s_t, A_t) = « the expected value of the history when starting in s_t, observing the variables X(A_t) and then following the policy δ* ».
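Written out, this corresponds to the usual finite-horizon Bellman recursion. The LaTeX sketch below uses the notation of the slides (s_t, A_t, U, H) and assumes, as the slides do, that the reward is the final utility U(s_{H+1}) only; the authors' exact formulation may differ in how costs and admissible actions are handled.

```latex
% Finite-horizon Bellman recursion consistent with the slides' description of Q* (a sketch).
\begin{align*}
Q^*(s_H, A_H)   &= \mathbb{E}\!\left[\, U(s_{H+1}) \mid s_H, A_H \,\right],\\
Q^*(s_t, A_t)   &= \mathbb{E}\!\left[\, \max_{A_{t+1}} Q^*(s_{t+1}, A_{t+1}) \;\middle|\; s_t, A_t \,\right],
\qquad t < H,\\
\delta^*_t(s_t) &\in \operatorname*{arg\,max}_{A_t} Q^*(s_t, A_t).
\end{align*}
```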
19
Find the optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, …)
1. Initialize Q.
2. Simulate histories: at each step, take the action maximizing Q with probability 1 - ε and a random action with probability ε.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3, with final utility U(s_3)]
20
Find the optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, …)
1. Initialize Q.
2. Simulate histories: at each step, take the action maximizing Q with probability 1 - ε and a random action with probability ε; after each step, update Q(s_1, A_1), Q(s_2, A_2), …
Repeat many times: Q converges to Q*.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3 with final utility U(s_3), and updates of Q(s_1, A_1) and Q(s_2, A_2)]
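As a concrete reference point, a generic epsilon-greedy Q-learning loop of this kind could look like the sketch below; it is illustrative only, and simulate_step, actions and utility are hypothetical placeholders for the MRF-based simulator, the admissible sample sets and the final utility U.

```python
import random
from collections import defaultdict

def q_learning(simulate_step, actions, utility, horizon,
               n_episodes=10000, epsilon=0.1, alpha=0.1):
    """simulate_step(s, A) -> next state; actions(s) -> list of admissible A; utility(s) -> float."""
    Q = defaultdict(float)                           # Q[(state, action)]; states/actions must be hashable
    for _ in range(n_episodes):
        s = ()                                       # illustrative initial state (nothing observed)
        for t in range(horizon):
            if random.random() < epsilon:            # random action with probability epsilon
                A = random.choice(actions(s))
            else:                                    # greedy action with probability 1 - epsilon
                A = max(actions(s), key=lambda a: Q[(s, a)])
            s_next = simulate_step(s, A)
            if t == horizon - 1:
                target = utility(s_next)             # the reward U(s_{H+1}) arrives only at the end
            else:
                target = max(Q[(s_next, a)] for a in actions(s_next))
            Q[(s, A)] += alpha * (target - Q[(s, A)])    # update Q(s_t, A_t)
            s = s_next
    return Q                                         # with enough simulated histories, Q approaches Q*
```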
21
Alternative approach
Linear approximation of the Q-function: Q(s_t, A_t) ≈ Σ_i w_i Φ_i(s_t, A_t).
Choice of the functions Φ_i.
22
LSDP Algorithm
Linear approximation of the Q-function: define one weight vector per decision step t = 1, …, H, and compute the weights using "backward induction".
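Acting greedily with such a per-step linear approximation could look like the following sketch (phi and the candidate action set are placeholders; the actual feature functions are discussed on the following slides).

```python
import numpy as np

def q_approx(w_t: np.ndarray, phi, s, A) -> float:
    """Q_t(s, A) is approximated by sum_i w_t[i] * phi_i(s, A); one weight vector per decision step t."""
    return float(np.dot(w_t, phi(s, A)))

def greedy_action(w_t: np.ndarray, phi, s, candidate_actions):
    """Act greedily with respect to the approximated Q at step t."""
    return max(candidate_actions, key=lambda A: q_approx(w_t, phi, s, A))
```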
23
LSDP Algorithm: application to sampling
1. Computation of Φ_i(s_t, A_t)
2. Computation of …
3. Computation of …
24
LSDP Algorithm: application to sampling
1. Computation of Φ_i(s_t, A_t)
2. Computation of …
3. Computation of …
We fix |A_t| = 1 and use the approximation: …
25
Experiments
26
Regular grid with first-order neighbourhood. The X(i) are binary variables. ℙ is a Potts model with β = 0.5.
Simple cost: observing each variable costs 1.
[Figure: grid of the nine variables X(1), …, X(9) with first-order neighbourhood edges]
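For readers who want to reproduce this setting, a simple Gibbs sampler for a binary Potts field of this kind could look like the sketch below. It assumes the parameterization ℙ(x) ∝ exp(β Σ_{i~j} 1{x_i = x_j}) and is not the authors' simulation code.

```python
import numpy as np

def gibbs_potts(height, width, beta=0.5, n_sweeps=200, rng=None):
    """Gibbs sampling of a binary Potts field on a regular grid, first-order neighbourhood."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, 2, size=(height, width))          # random initial configuration
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # first-order neighbours of site (i, j)
                neigh = []
                if i > 0: neigh.append(x[i - 1, j])
                if i < height - 1: neigh.append(x[i + 1, j])
                if j > 0: neigh.append(x[i, j - 1])
                if j < width - 1: neigh.append(x[i, j + 1])
                # Potts conditional: P(x_ij = k) proportional to exp(beta * #{neighbours in state k})
                counts = np.array([sum(n == k for n in neigh) for k in (0, 1)], dtype=float)
                p = np.exp(beta * counts)
                p /= p.sum()
                x[i, j] = rng.choice(2, p=p)
    return x

# Example: simulate a 10 x 10 field as in the 100-variable experiment
field = gibbs_potts(10, 10, beta=0.5)
```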
27
Experiments
Comparison between:
- Random policy
- BP-max heuristic: at each time step, the variable to observe is chosen from the max marginals (estimated by belief propagation)
- LSPI policy (a common reinforcement learning algorithm)
- LSDP policy, using the score: …
28
Experiment: 100 variables (n=100)
29
LSDP and BP-max: no observation
[Figure: action values for the LSDP policy and max marginals over the grid]
30
LSDP and BP-max: one observation
[Figure: action values for the LSDP policy and max marginals over the grid]
31
LSDP and BP-max: two observations
[Figure: action values for the LSDP policy and max marginals over the grid]
32
Experiment: 100 variables - constrained moves
Only the second-order neighbourhood of the current location may be visited!
34
Experiment: 100 variables - constrained moves
35
Experiment: 200 variables - different costs (initial budget = 38)

                                    Random   BP-Max   LSDP    Min Cost
Policy value                        60.27%   61.77%   64.8%   64.58%
Mean number of observed variables   26.65    19       27.3    38
36
Experiment: 200 variables - different costs (initial budget = 38)
Cost repartition:
[Figure: repartition of the observation costs for the Random, BP-max and LSDP policies]
37
Experiment: 100 variables - different costs (initial budget = 30)
Cost of 1 when the observed variable is in state 0, cost of 3 when it is in state 1.

                 Random   BP-Max   LSDP
Policy value     65.3%    66.2%    67%
WR 1             65.9%    65.4%    67%
WR 0             61%      63.2%    64%
Mean number of observed variables: 15.4
38
Conclusions
- An adapted framework for adaptive sampling of discrete random variables.
- LSDP: a reinforcement learning approach for finding a near-optimal policy.
- Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem.
- Computation of the near-optimal policy « off-line ».
- Design of new policies that outperform simple heuristics and a usual RL method.
- Possible application? Weed sampling in crop fields.
40
Reconstruction of X(R) and trajectory value
A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2).
Reconstruction of X(R): Maximum Posterior Marginal (MPM) criterion for the reconstruction: …
41
Reconstruction of X(R) and trajectory value
A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2).
Maximum Posterior Marginal (MPM) for the reconstruction: …
Quality of the trajectory: …
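A minimal sketch of MPM reconstruction given the conditional marginals ℙ(X(i) = k | x(A)); the marginals themselves would be computed from the Markov random field (e.g. by belief propagation), and the array names here are illustrative.

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray, state: np.ndarray) -> np.ndarray:
    """marginals: (n, K) conditional marginals; state: length-n vector with -1 = unobserved."""
    x_hat = np.argmax(marginals, axis=1)      # MPM estimate for every variable
    observed = state >= 0
    x_hat[observed] = state[observed]         # observed variables are reliable: keep their values
    return x_hat
```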
42
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: simulated trajectories s_1 -a_1-> s_2 -a_2-> … -a_{H-1}-> s_H -a_H-> s_{H+1}, each ending with the utility U(s_{H+1})]
43
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: the last steps (s_H, a_H) of the simulated trajectories, together with the final utilities U(s_{H+1}), form a LINEAR SYSTEM whose solution gives w_H]
44
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: the steps (s_{H-1}, a_{H-1}) of the simulated trajectories, together with the already-computed w_H, form a LINEAR SYSTEM whose solution gives w_{H-1}]
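Putting these appendix slides together, the backward computation of w_H, …, w_1 could be sketched as follows (illustrative Python: phi, actions and the trajectory format are assumptions, and the real LSDP may differ, e.g. in regularisation or in how costs enter).

```python
import numpy as np

def lsdp_backward(trajectories, utilities, phi, actions, horizon):
    """
    trajectories: list of simulated histories [(s_1, a_1), ..., (s_H, a_H)]
    utilities:    U(s_{H+1}) for each simulated history
    phi(s, a):    feature vector of length d; actions(s): candidate actions in state s
    """
    weights = [None] * (horizon + 1)                 # weights[t] parameterizes Q_t, t = 1..H
    targets = np.asarray(utilities, dtype=float)     # at step H the regression target is U(s_{H+1})
    for t in range(horizon, 0, -1):
        Phi = np.array([phi(traj[t - 1][0], traj[t - 1][1]) for traj in trajectories], dtype=float)
        w_t, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # least-squares solution of the linear system
        weights[t] = w_t
        if t > 1:
            # regression target for step t-1: max over actions of the fitted Q_t
            targets = np.array([
                max(float(np.dot(np.asarray(phi(traj[t - 1][0], a), dtype=float), w_t))
                    for a in actions(traj[t - 1][0]))
                for traj in trajectories
            ])
    return weights[1:]                               # [w_1, ..., w_H]
```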