
1 A reinforcement learning algorithm for sampling design in Markov random fields. Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN. INRA-MIA Toulouse. E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr. MSTGA, Bordeaux, 7 2012

2 Plan: 1. Problem statement. 2. General approach. 3. Formulation using a dynamic model. 4. Reinforcement learning solution. 5. Experiments. 6. Conclusions.

3 PROBLEM STATEMENT: Adaptive sampling. Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1),…,X(n)). [Figure: regular grid of variables X(1),…,X(9).] c(A) → cost of observing variables X(A). B → initial budget. Observations are reliable.

4 PROBLEM STATEMENT: Adaptive sampling. Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1),…,X(n)). c(A, x(A)) → cost of observing variables X(A) in state x(A). B → initial budget. Observations are reliable. Problem: find a strategy (sampling policy) to adaptively select the variables to observe so as to optimize the quality of the reconstruction of X while respecting the initial budget.


6 DEFINITION: Adaptive sampling policy. For any sampling plans A_1,…,A_t and observations x(A_1),…,x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe: δ((A_1, x(A_1)),…,(A_t, x(A_t))) = A_{t+1}.

7 DEFINITION: Adaptive sampling policy. For any sampling plans A_1,…,A_t and observations x(A_1),…,x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe: δ((A_1, x(A_1)),…,(A_t, x(A_t))) = A_{t+1}. Example (on the grid of variables X(1),…,X(9)): A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (1, 3).

8 DEFINITIONS. Example (continued): A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (1, 3). Vocabulary: a history {(A_1, x(A_1)),…,(A_H, x(A_H))} is a trajectory followed when applying δ. The set of all reachable histories of δ satisfies c(·) ≤ B: the cost of any history respects the initial budget.
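A minimal sketch, in Python, of how a policy, a history and the budget constraint could be represented (hypothetical names and types, not from the slides):

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases (illustrative only):
Plan = Tuple[int, ...]                      # indices of the variables X(A) to observe
Observation = Tuple[int, ...]               # observed states x(A)
History = List[Tuple[Plan, Observation]]    # ((A_1, x(A_1)), ..., (A_t, x(A_t)))
Policy = Callable[[History], Plan]          # gives the next plan A_{t+1}

def run_policy(policy: Policy, observe, cost, budget: float) -> History:
    """Apply an adaptive sampling policy while the budget B is respected."""
    history: History = []
    spent = 0.0
    while True:
        plan = policy(history)
        if not plan:                        # empty plan: the policy stops sampling
            return history
        obs = observe(plan)                 # reliable observation of X(plan)
        spent += cost(plan, obs)            # cost c(A, x(A)) depends on the observed state
        if spent > budget:                  # stop once the budget is exceeded
            return history
        history.append((plan, obs))
```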

9 GENERAL APPROACH

10 1. Find a distribution ℙ that describes well the phenomenon under study. 2. Define the value of an adaptive sampling policy. 3. Define an approximate resolution method for finding a near-optimal policy.

11 STATE OF THE ART. 1. Find a distribution ℙ that describes well the phenomenon under study: X continuous random vector; ℙ multivariate Gaussian joint distribution. 2. Define the value of an adaptive sampling policy: entropy-based criterion; kriging variance. 3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.

12 OUR CONTRIBUTION. 1. Find a distribution ℙ that describes well the phenomenon under study: X continuous random vector → X discrete random vector; ℙ multivariate Gaussian joint distribution → ℙ Markov random field distribution. 2. Define the value of an adaptive sampling policy: entropy-based criteria, kriging variance → Maximum Posterior Marginals (MPM). 3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm → reinforcement learning.


14 Formulation using a dynamic model: an adapted framework for reinforcement learning.

15 Summarize the knowledge on X in a random vector S of length n. Observing variables updates our knowledge on X and makes S evolve. Example: s = (−1, …, k, …, −1) means variable X(i) was observed in state k (entry i); all other entries are −1 (unobserved). [Diagram: trajectory s_1 →(A_1) s_2 →(A_2) s_3 →(A_3) … s_{H+1}, with final utility U(s_{H+1}).]
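A minimal sketch of this state encoding in Python (illustrative helper names, not the slides' implementation):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """No variable observed yet: every entry of s is -1."""
    return -np.ones(n, dtype=int)

def update_state(s: np.ndarray, plan, observation) -> np.ndarray:
    """Record the observed states x(A) at the sampled indices A."""
    s = s.copy()
    for i, x_i in zip(plan, observation):
        s[i] = x_i
    return s

# Example on a 9-variable field: observe one variable (index 7) in state 0,
# then two more variables (indices 5 and 6) in states 1 and 3.
s = initial_state(9)
s = update_state(s, [7], [0])
s = update_state(s, [5, 6], [1, 3])
```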


17 Reinforcement learning solution

18 Find the optimal policy: the Q-function. Q*(s_t, A_t) = "the expected value of the history when starting in s_t, observing variables X(A_t) and then following the optimal policy δ*". Compute Q* → compute δ*.

19 Find the optimal policy: the Q-function. How to compute Q*: classical solution (Q-learning, …). 1. Initialize Q. 2. Simulate histories: at each step choose the action maximizing Q with probability 1 − ε and a random action with probability ε. [Diagram: s_1 →(A_1) s_2 →(A_2) s_3, final utility U(s_3).]

20 Find the optimal policy: the Q-function. How to compute Q*: classical solution (Q-learning, …). 1. Initialize Q. 2. Simulate histories, choosing the action maximizing Q with probability 1 − ε and a random action with probability ε, and update Q(s_1, A_1), Q(s_2, A_2), … along the way. Repeat many times: Q converges to Q*.
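As an illustration, a compact sketch of this classical scheme (tabular Q-learning with ε-greedy exploration; generic Python with hypothetical helper names, not the implementation behind the slides):

```python
import random
from collections import defaultdict

def q_learning(simulate_step, actions, initial_state, n_episodes, horizon,
               alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (states assumed hashable).

    `simulate_step(s, a)` returns (next_state, reward); in the sampling problem the
    reward would be zero until the final utility U(s_{H+1}) is collected.
    """
    Q = defaultdict(float)                                   # 1. initialize Q
    for _ in range(n_episodes):                              # 2. simulate many histories
        s = initial_state()
        for t in range(horizon):
            if random.random() < epsilon:                    # random action with proba epsilon
                a = random.choice(actions(s))
            else:                                            # greedy action with proba 1 - epsilon
                a = max(actions(s), key=lambda b: Q[(s, b)])
            s_next, r = simulate_step(s, a)
            best_next = max((Q[(s_next, b)] for b in actions(s_next)), default=0.0)
            Q[(s, a)] += alpha * (r + best_next - Q[(s, a)])  # temporal-difference update
            s = s_next
    return Q                                                 # approaches Q* after many updates
```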

21 Alternative approach. Linear approximation of the Q-function: Q(s, A) ≈ Σ_i w_i Φ_i(s, A). Choice of the basis functions Φ_i.
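A minimal sketch of evaluating such a linear approximation (Python; the weight vector and the list of feature functions are placeholders, not the Φ_i actually chosen in the work):

```python
import numpy as np

def q_approx(weights: np.ndarray, features, s, A) -> float:
    """Linear approximation: Q(s, A) is approximated by the dot product
    w . Phi(s, A), where Phi stacks the basis functions Phi_i(s, A)."""
    phi = np.array([phi_i(s, A) for phi_i in features])
    return float(weights @ phi)
```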

22 LSDP Algorithm. Linear approximation of the Q-function: define one weight vector per decision step (w_1,…,w_H for an H-step horizon) and compute the weights using "backward induction".

23 LSDP Algorithm: application to sampling. 1. Computation of Φ_i(s_t, A_t). 2. Computation of [formula on slide]. 3. Computation of [formula on slide].

24 LSDP Algorithm: application to sampling. 1. Computation of Φ_i(s_t, A_t). 2. Computation of [formula on slide]. 3. Computation of [formula on slide]. We fix |A_t| = 1 and use the approximation given on the slide.
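Under the restriction |A_t| = 1, acting greedily reduces to scoring every still-unobserved variable with the approximate Q-function and observing the best one. A hedged sketch of that selection step (Python; `q_approx` is the helper sketched above, and the weight vector for step t is assumed given):

```python
def next_variable(weights_t, features, s):
    """With |A_t| = 1: score every unobserved variable (entry -1 in s) with the
    approximate Q-function of the current decision step and observe the best one."""
    candidates = [i for i, v in enumerate(s) if v == -1]
    return max(candidates, key=lambda i: q_approx(weights_t, features, s, (i,)))
```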

25 Experiments

26 Regular grid with first-order neighbourhood [figure: grid of variables X(1),…,X(9)]. The X(i) are binary variables. ℙ is a Potts model with β = 0.5. Simple cost: observing each variable costs 1.
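For intuition, a small sketch of how such a field could be simulated: Gibbs sampling of a Potts model ℙ(x) ∝ exp(β Σ_{i∼j} 1[x(i) = x(j)]) on a regular grid with first-order (4-neighbour) neighbourhood. Python, illustrative only; details of the slides' experimental setup may differ.

```python
import numpy as np

def gibbs_potts(height, width, n_states=2, beta=0.5, n_sweeps=200, rng=None):
    """Simulate a Potts field on a grid by Gibbs sampling.

    P(x) is proportional to exp(beta * sum over neighbouring pairs of 1[x_i == x_j]).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(n_states, size=(height, width))
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # count neighbours in each state (first-order neighbourhood)
                counts = np.zeros(n_states)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < height and 0 <= nj < width:
                        counts[x[ni, nj]] += 1
                p = np.exp(beta * counts)
                p /= p.sum()
                x[i, j] = rng.choice(n_states, p=p)
    return x

# Example: 10 x 10 grid of binary variables with beta = 0.5
field = gibbs_potts(10, 10, n_states=2, beta=0.5)
```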

27 Experiments. Comparison between: Random policy; BP-max heuristic, where at each time step the variable to observe is chosen from the current max marginals (computed by belief propagation); LSPI policy (a common reinforcement learning algorithm); LSDP policy using the score shown on the slide.

28 Experiment: 100 variables (n=100)

29 LSDP and BP-max: no observation. [Figures: action values for the LSDP policy; max marginals.]

30 LSDP and BP-max: one observation. [Figures: action values for the LSDP policy; max marginals.]

31 LSDP and BP-max: two observations. [Figures: action values for the LSDP policy; max marginals.]

32 Experiment: 100 variables, constrained moves. Only the second-order neighbourhood of the current position may be visited!


34 Experiment: 100 variables, constrained moves.

35 Experiment: 200 variables, different costs. Initial budget = 38. Policy value: Random 60.27%, BP-Max 61.77%, LSDP 64.8%, Min Cost 64.58%. Mean number of observed variables: Random 26.65, BP-Max 19, LSDP 27.3, Min Cost 38.

36 Experiment: 200 variables, different costs (initial budget = 38; same table as above). Cost repartition: [figure comparing the Random, BP-max and LSDP policies].

37 Experiment: 100 variables, different costs. Initial budget = 30. Cost of 1 when the observed variable is in state 0; cost of 3 when it is in state 1. Policy value: Random 65.3%, BP-Max 66.2%, LSDP 67%. Mean number of observed variables: 15.4. WR 1: Random 65.9%, BP-Max 65.4%, LSDP 67%. WR 0: Random 61%, BP-Max 63.2%, LSDP 64%.

38 Conclusions. An adapted framework for adaptive sampling of discrete random variables. LSDP: a reinforcement learning approach for finding a near-optimal policy. Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem. Computation of the near-optimal policy off-line. Design of new policies that outperform simple heuristics and a usual RL method. Possible application? Weed sampling in crop fields.


40 Reconstruction of X(R) and trajectory value. Example: A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2). Maximum Posterior Marginal (MPM) for reconstruction: each unobserved variable of X(R) is reconstructed as the state maximizing its posterior marginal given the observations.

41 Reconstruction of X(R) and trajectory value. Example: A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2). Maximum Posterior Marginal for reconstruction, as above. Quality of the trajectory: [formula on slide, based on the posterior marginals of the reconstruction].
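A minimal sketch of the MPM reconstruction step (Python; the quality score below is only an illustrative stand-in, since the slide's exact criterion is not shown in the transcript):

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Maximum Posterior Marginal reconstruction.

    `marginals[i, k]` is P(X(i) = k | observations); `s[i]` is -1 if X(i)
    is unobserved, otherwise its observed state.
    """
    x_hat = s.copy()
    for i in range(len(s)):
        if s[i] == -1:
            x_hat[i] = int(np.argmax(marginals[i]))
    return x_hat

def trajectory_quality(marginals: np.ndarray, s: np.ndarray) -> float:
    """Illustrative quality score: mean of the max posterior marginals over the
    unobserved variables (a stand-in for the slide's exact formula)."""
    unobserved = [i for i in range(len(s)) if s[i] == -1]
    if not unobserved:
        return 1.0
    return float(np.mean([marginals[i].max() for i in unobserved]))
```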

42 LSDP Algorithm. Linear approximation of the Q-function: how to compute w_1,…,w_H. [Diagram: simulated trajectories s_1, a_1, s_2, …, s_{H−1}, a_{H−1}, s_H, a_H, s_{H+1}, with final utility U(s_{H+1}).]

43 LSDP Algorithm. Linear approximation of the Q-function: how to compute w_1,…,w_H. At the last step, the transitions (s_H, a_H) → U(s_{H+1}) of the simulated trajectories define a linear system whose solution gives w_H.

44 LSDP Algorithm. Linear approximation of the Q-function: how to compute w_1,…,w_H. Going backwards, the transitions (s_{H−1}, a_{H−1}) of the simulated trajectories, combined with the already fitted step-H approximation, define a linear system whose solution gives w_{H−1}, and so on down to w_1.
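A rough sketch of this backward pass (Python): for each decision step, from the last to the first, the weights are fitted by least squares so that w_t·Φ(s_t, a_t) matches the final utility (at step H) or a value bootstrapped from step t+1, over a batch of simulated trajectories. Names and the exact bootstrapping rule are assumptions for illustration, not the authors' LSDP code.

```python
import numpy as np

def fit_weights_backward(trajectories, phi, horizon):
    """Backward induction on simulated trajectories.

    `trajectories` is a list of (states, actions, final_utility) where `states`
    and `actions` have length `horizon`; `phi(s, a)` returns the feature vector
    Phi(s, a). Returns one weight vector per decision step (w_1, ..., w_H).
    """
    weights = [None] * (horizon + 1)             # 1-based indexing on steps
    for t in range(horizon, 0, -1):              # t = H, H-1, ..., 1
        X, y = [], []
        for states, actions, final_utility in trajectories:
            s_t, a_t = states[t - 1], actions[t - 1]
            if t == horizon:
                target = final_utility           # U(s_{H+1}) at the last step
            else:
                # bootstrapped target: approximate Q at the next step, evaluated at
                # the action actually taken (the real algorithm may maximize over actions)
                s_next, a_next = states[t], actions[t]
                target = weights[t + 1] @ phi(s_next, a_next)
            X.append(phi(s_t, a_t))
            y.append(target)
        # linear system solved in the least-squares sense
        weights[t], *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return weights[1:]
```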

