1
A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN
INRA-MIA Toulouse
E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
2
Plan
1. Problem statement
2. General approach
3. Formulation using a dynamic model
4. Reinforcement learning solution
5. Experiments
6. Conclusions
3
PROBLEM STATEMENT: Adaptive sampling
Adaptive selection of variables to observe for the reconstruction of the random vector X = (X(1), …, X(n)).
c(A) -> cost of observing the variables X(A)
B -> initial budget
Observations are reliable.
[Figure: grid of the nine variables X(1), …, X(9) with neighbourhood edges]
4
PROBLEM STATEMENT: Adaptive sampling
Adaptive selection of variables to observe for the reconstruction of the random vector X = (X(1), …, X(n)).
c(A, x(A)) -> cost of observing the variables X(A) in state x(A)
B -> initial budget
Observations are reliable.
Problem: find a strategy / sampling policy that adaptively selects the variables to observe in order to: optimize the quality of the reconstruction of X / respect the initial budget.
6
DEFINITION: Adaptive sampling policy
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
7
DEFINITION: Adaptive sampling policy
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
-- Example --
A_1 = δ_1 = {8}, with observation x(A_1) = 0
A_2 = δ_2((8, 0)) = {6, 7}, with observations x(A_2) = (1, 3)
[Figure: grid of the nine variables X(1), …, X(9) with the observed variables highlighted]
8
DEFINITIONS
For any sampling plans A_1, …, A_t and observations x(A_1), …, x(A_t), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A_1, x(A_1)), …, (A_t, x(A_t))) = A_{t+1}.
Example: A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (1, 3).
Vocabulary: a history {(A_1, x(A_1)), …, (A_H, x(A_H))} is a trajectory followed when applying δ; over the set of all reachable histories of δ, c(·) ≤ B, i.e. the cost of every history respects the initial budget.
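To make this definition concrete, here is a minimal Python sketch (not the authors' code) of how such a policy is applied: policy, observe and cost are hypothetical callables standing for δ, for the sampled field and for c(A, x(A)), and the budget handling is simplified compared with policies that respect B by construction.

```python
from typing import Callable, Tuple

History = Tuple[Tuple[frozenset, Tuple[int, ...]], ...]   # ((A_1, x(A_1)), ..., (A_t, x(A_t)))

def run_policy(policy: Callable[[History], frozenset],
               observe: Callable[[frozenset], Tuple[int, ...]],
               cost: Callable[[frozenset, Tuple[int, ...]], float],
               budget: float) -> History:
    """Apply an adaptive sampling policy until it stops or the budget B would be exceeded."""
    history: History = ()
    spent = 0.0
    while True:
        A_next = policy(history)            # A_{t+1} = delta((A_1, x(A_1)), ..., (A_t, x(A_t)))
        if not A_next:
            break                           # the policy decides to stop sampling
        x_next = observe(A_next)            # observations are reliable
        step_cost = cost(A_next, x_next)    # c(A, x(A)): cost may depend on the observed states
        if spent + step_cost > budget:
            break                           # respect the initial budget B
        spent += step_cost
        history += ((A_next, x_next),)
    return history
```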
9
GENERAL APPROACH
10
1. Find a distribution ℙ that describes well the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
11
STATE OF THE ART
1. Find a distribution ℙ that describes well the phenomenon under study: X a continuous random vector, ℙ a multivariate Gaussian joint distribution.
2. Define the value of an adaptive sampling policy: entropy-based criterion, kriging variance.
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.
12
OUR CONTRIBUTION
1. Find a distribution ℙ that describes well the phenomenon under study: X a continuous random vector -> X a discrete random vector; ℙ a multivariate Gaussian joint distribution -> ℙ a Markov random field distribution.
2. Define the value of an adaptive sampling policy: entropy-based criteria / kriging variance -> Maximum Posterior Marginals (MPM).
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm -> reinforcement learning.
14
Formulation using a dynamic model: an adapted framework for reinforcement learning
15
Summarize the knowledge on X in a random vector S of length n. Observing variables updates our knowledge on X, i.e. makes S evolve.
Example: s = (-1, …, k, …, -1) with value k at position i means that variable X(i) was observed in state k; entries equal to -1 are not yet observed.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3 -A_3-> … -> s_{H+1}, with final utility U(s_{H+1})]
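A minimal sketch of this state encoding, in Python (0-indexed; the function names are illustrative, not from the presentation):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """Nothing observed yet: every entry is -1."""
    return np.full(n, -1, dtype=int)

def update_state(s: np.ndarray, A, x_A) -> np.ndarray:
    """Evolution of S: after observing X(A) in states x(A), store those values."""
    s_next = s.copy()
    for i, k in zip(A, x_A):
        s_next[i] = k        # observations are reliable, so entry i now holds the true state k
    return s_next

# Example (0-indexed): n = 9 variables, observe X(8) in state 0, then X(6), X(7) in states 1, 3
s1 = initial_state(9)
s2 = update_state(s1, [7], [0])
s3 = update_state(s2, [5, 6], [1, 3])
```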
17
Reinforcement learning solution
18
Find the optimal policy: the Q-function
Compute Q*, then compute δ*.
Q*(s_t, A_t) = « the expected value of the history when starting in s_t, observing the variables X(A_t) and then following the policy δ* ».
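Written out, this corresponds to the usual finite-horizon Bellman recursion. The LaTeX sketch below uses the notation of the slides (s_t, A_t, U, H) and assumes, as the slides do, that the reward is the final utility U(s_{H+1}) only; the authors' exact formulation may differ in how costs and admissible actions are handled.

```latex
% Finite-horizon Bellman recursion consistent with the slides' description of Q* (a sketch).
\begin{align*}
Q^*(s_H, A_H)   &= \mathbb{E}\!\left[\, U(s_{H+1}) \mid s_H, A_H \,\right],\\
Q^*(s_t, A_t)   &= \mathbb{E}\!\left[\, \max_{A_{t+1}} Q^*(s_{t+1}, A_{t+1}) \;\middle|\; s_t, A_t \,\right],
\qquad t < H,\\
\delta^*_t(s_t) &\in \operatorname*{arg\,max}_{A_t} Q^*(s_t, A_t).
\end{align*}
```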
19
Find the optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, …)
1. Initialize Q.
2. Simulate histories: at each step, take the action maximizing Q with probability 1 - ε and a random action with probability ε.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3, with final utility U(s_3)]
20
Find the optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, …)
1. Initialize Q.
2. Simulate histories: at each step, take the action maximizing Q with probability 1 - ε and a random action with probability ε; after each step, update Q(s_1, A_1), Q(s_2, A_2), …
Repeat many times: Q converges to Q*.
[Diagram: s_1 -A_1-> s_2 -A_2-> s_3 with final utility U(s_3), and updates of Q(s_1, A_1) and Q(s_2, A_2)]
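As a concrete reference point, a generic epsilon-greedy Q-learning loop of this kind could look like the sketch below; it is illustrative only, and simulate_step, actions and utility are hypothetical placeholders for the MRF-based simulator, the admissible sample sets and the final utility U.

```python
import random
from collections import defaultdict

def q_learning(simulate_step, actions, utility, horizon,
               n_episodes=10000, epsilon=0.1, alpha=0.1):
    """simulate_step(s, A) -> next state; actions(s) -> list of admissible A; utility(s) -> float."""
    Q = defaultdict(float)                           # Q[(state, action)]; states/actions must be hashable
    for _ in range(n_episodes):
        s = ()                                       # illustrative initial state (nothing observed)
        for t in range(horizon):
            if random.random() < epsilon:            # random action with probability epsilon
                A = random.choice(actions(s))
            else:                                    # greedy action with probability 1 - epsilon
                A = max(actions(s), key=lambda a: Q[(s, a)])
            s_next = simulate_step(s, A)
            if t == horizon - 1:
                target = utility(s_next)             # the reward U(s_{H+1}) arrives only at the end
            else:
                target = max(Q[(s_next, a)] for a in actions(s_next))
            Q[(s, A)] += alpha * (target - Q[(s, A)])    # update Q(s_t, A_t)
            s = s_next
    return Q                                         # with enough simulated histories, Q approaches Q*
```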
21
Alternative approach
Linear approximation of the Q-function: Q(s_t, A_t) ≈ Σ_i w_i Φ_i(s_t, A_t).
Choice of the functions Φ_i.
22
LSDP Algorithm
Linear approximation of the Q-function: define one weight vector per decision step t = 1, …, H, and compute the weights using "backward induction".
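Acting greedily with such a per-step linear approximation could look like the following sketch (phi and the candidate action set are placeholders; the actual feature functions are discussed on the following slides).

```python
import numpy as np

def q_approx(w_t: np.ndarray, phi, s, A) -> float:
    """Q_t(s, A) is approximated by sum_i w_t[i] * phi_i(s, A); one weight vector per decision step t."""
    return float(np.dot(w_t, phi(s, A)))

def greedy_action(w_t: np.ndarray, phi, s, candidate_actions):
    """Act greedily with respect to the approximated Q at step t."""
    return max(candidate_actions, key=lambda A: q_approx(w_t, phi, s, A))
```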
23
LSDP Algorithm: application to sampling
1. Computation of Φ_i(s_t, A_t)
2. Computation of …
3. Computation of …
24
LSDP Algorithm: application to sampling
1. Computation of Φ_i(s_t, A_t)
2. Computation of …
3. Computation of …
We fix |A_t| = 1 and use the approximation: …
25
Experiments
26
Regular grid with first-order neighbourhood. The X(i) are binary variables. ℙ is a Potts model with β = 0.5.
Simple cost: observing each variable costs 1.
[Figure: grid of the nine variables X(1), …, X(9) with first-order neighbourhood edges]
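For readers who want to reproduce this setting, a simple Gibbs sampler for a binary Potts field of this kind could look like the sketch below. It assumes the parameterization ℙ(x) ∝ exp(β Σ_{i~j} 1{x_i = x_j}) and is not the authors' simulation code.

```python
import numpy as np

def gibbs_potts(height, width, beta=0.5, n_sweeps=200, rng=None):
    """Gibbs sampling of a binary Potts field on a regular grid, first-order neighbourhood."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, 2, size=(height, width))          # random initial configuration
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # first-order neighbours of site (i, j)
                neigh = []
                if i > 0: neigh.append(x[i - 1, j])
                if i < height - 1: neigh.append(x[i + 1, j])
                if j > 0: neigh.append(x[i, j - 1])
                if j < width - 1: neigh.append(x[i, j + 1])
                # Potts conditional: P(x_ij = k) proportional to exp(beta * #{neighbours in state k})
                counts = np.array([sum(n == k for n in neigh) for k in (0, 1)], dtype=float)
                p = np.exp(beta * counts)
                p /= p.sum()
                x[i, j] = rng.choice(2, p=p)
    return x

# Example: simulate a 10 x 10 field as in the 100-variable experiment
field = gibbs_potts(10, 10, beta=0.5)
```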
27
Experiments
Comparison between:
- Random policy
- BP-max heuristic: at each time step, the variable to observe is chosen from the max marginals (estimated by belief propagation)
- LSPI policy (a common reinforcement learning algorithm)
- LSDP policy, using the score: …
28
Experiment: 100 variables (n=100)
29
LSDP and BP-max: no observation
[Figure: action values for the LSDP policy and max marginals over the grid]
30
LSDP and BP-max: one observation
[Figure: action values for the LSDP policy and max marginals over the grid]
31
LSDP and BP-max: two observations
[Figure: action values for the LSDP policy and max marginals over the grid]
32
Experiment: 100 variables - constrained moves
Only the second-order neighbourhood of the current location may be visited!
34
Experiment: 100 variables - constrained moves
35
Experiment: 200 variables - different costs (initial budget = 38)

                                    Random   BP-Max   LSDP    Min Cost
Policy value                        60.27%   61.77%   64.8%   64.58%
Mean number of observed variables   26.65    19       27.3    38
36
Experiment: 200 variables - different costs (initial budget = 38)
Cost repartition:
[Figure: repartition of the observation costs for the Random, BP-max and LSDP policies]
37
Experiment: 100 variables - different costs (initial budget = 30)
Cost of 1 when the observed variable is in state 0, cost of 3 when it is in state 1.

                 Random   BP-Max   LSDP
Policy value     65.3%    66.2%    67%
WR 1             65.9%    65.4%    67%
WR 0             61%      63.2%    64%
Mean number of observed variables: 15.4
38
Conclusions
- An adapted framework for adaptive sampling of discrete random variables.
- LSDP: a reinforcement learning approach for finding a near-optimal policy.
- Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem.
- Computation of the near-optimal policy « off-line ».
- Design of new policies that outperform simple heuristics and a usual RL method.
- Possible application? Weed sampling in crop fields.
40
Reconstruction of X(R) and trajectory value
A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2).
Reconstruction of X(R): Maximum Posterior Marginal (MPM) criterion for the reconstruction: …
41
Reconstruction of X(R) and trajectory value
A_1 = δ_1 = {8}, x(A_1) = 0; A_2 = δ_2((8, 0)) = {6, 7}, x(A_2) = (4, 2).
Maximum Posterior Marginal (MPM) for the reconstruction: …
Quality of the trajectory: …
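A minimal sketch of MPM reconstruction given the conditional marginals ℙ(X(i) = k | x(A)); the marginals themselves would be computed from the Markov random field (e.g. by belief propagation), and the array names here are illustrative.

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray, state: np.ndarray) -> np.ndarray:
    """marginals: (n, K) conditional marginals; state: length-n vector with -1 = unobserved."""
    x_hat = np.argmax(marginals, axis=1)      # MPM estimate for every variable
    observed = state >= 0
    x_hat[observed] = state[observed]         # observed variables are reliable: keep their values
    return x_hat
```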
42
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: simulated trajectories s_1 -a_1-> s_2 -a_2-> … -a_{H-1}-> s_H -a_H-> s_{H+1}, each ending with the utility U(s_{H+1})]
43
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: the last steps (s_H, a_H) of the simulated trajectories, together with the final utilities U(s_{H+1}), form a LINEAR SYSTEM whose solution gives w_H]
44
LSDP Algorithm
Linear approximation of the Q-function: how to compute w_1, …, w_H?
[Diagram: the steps (s_{H-1}, a_{H-1}) of the simulated trajectories, together with the already-computed w_H, form a LINEAR SYSTEM whose solution gives w_{H-1}]
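Putting these appendix slides together, the backward computation of w_H, …, w_1 could be sketched as follows (illustrative Python: phi, actions and the trajectory format are assumptions, and the real LSDP may differ, e.g. in regularisation or in how costs enter).

```python
import numpy as np

def lsdp_backward(trajectories, utilities, phi, actions, horizon):
    """
    trajectories: list of simulated histories [(s_1, a_1), ..., (s_H, a_H)]
    utilities:    U(s_{H+1}) for each simulated history
    phi(s, a):    feature vector of length d; actions(s): candidate actions in state s
    """
    weights = [None] * (horizon + 1)                 # weights[t] parameterizes Q_t, t = 1..H
    targets = np.asarray(utilities, dtype=float)     # at step H the regression target is U(s_{H+1})
    for t in range(horizon, 0, -1):
        Phi = np.array([phi(traj[t - 1][0], traj[t - 1][1]) for traj in trajectories], dtype=float)
        w_t, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # least-squares solution of the linear system
        weights[t] = w_t
        if t > 1:
            # regression target for step t-1: max over actions of the fitted Q_t
            targets = np.array([
                max(float(np.dot(np.asarray(phi(traj[t - 1][0], a), dtype=float), w_t))
                    for a in actions(traj[t - 1][0]))
                for traj in trajectories
            ])
    return weights[1:]                               # [w_1, ..., w_H]
```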