A reinforcement learning algorithm for sampling design in Markov random fields Mathieu BONNEAU Nathalie PEYRARD Régis SABBADIN INRA-MIA Toulouse MSTGA, Bordeaux,

Plan: 1. Problem statement. 2. General approach. 3. Formulation using a dynamic model. 4. Reinforcement learning solution. 5. Experiments. 6. Conclusions.

PROBLEM STATEMENT: Adaptive sampling
[Figure: grid of nine variables X(1), ..., X(9).]
Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), ..., X(n)).
c(A) -> cost of observing variables X(A)
B -> initial budget
Observations are reliable.

PROBLEM STATEMENT: Adaptive sampling
Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), ..., X(n)).
c(A, x(A)) -> cost of observing variables X(A) in state x(A)
B -> initial budget
Observations are reliable.
Problem: find a sampling policy that adaptively selects the variables to observe so as to optimize the quality of the reconstruction of X while respecting the initial budget.

DEFINITION: Adaptive sampling policy
For any sampling plans A_1, ..., A_t and observations x(A_1), ..., x(A_t), an adaptive sampling policy is a function mapping the history ((A_1, x(A_1)), ..., (A_t, x(A_t))) to the next variable(s) to observe, A_{t+1}.

DEFINITION: Adaptive sampling policy -- Example --
[Figure: grid of nine variables X(1), ..., X(9).]
A_1 = {8}, observed x(A_1) = 0.
A_2 = {6, 7}, obtained by applying the policy to (8, 0); observed x(A_2) = (1, 3).
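As a concrete illustration, here is a hedged sketch of what such a policy could look like in code, reproducing the toy choices of this slide; the function name, typing, and the alternative branch are illustrative assumptions, not the authors' implementation:

```python
from typing import List, Set, Tuple

# A history is the list of (sampled indices, observed values) pairs so far.
History = List[Tuple[Set[int], Tuple[int, ...]]]

def toy_policy(history: History) -> Set[int]:
    """Toy adaptive sampling policy on the grid of the slides: sample X(8)
    first, then pick the next pair of variables depending on what was seen."""
    if not history:
        return {8}                        # A_1 = {8}
    _, x_A1 = history[0]
    # A_2 = {6, 7} after observing x(A_1) = 0, as in the slide's example;
    # the alternative branch is an arbitrary illustration.
    return {6, 7} if x_A1 == (0,) else {4, 5}

# Usage matching the slide: A_1 = {8}, x(A_1) = 0, then A_2 = {6, 7}.
print(toy_policy([]))             # {8}
print(toy_policy([({8}, (0,))]))  # {6, 7}
```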

DEFINITIONS
Vocabulary: a history {(A_1, x(A_1)), ..., (A_H, x(A_H))} is a trajectory followed when applying the policy.
The set of all reachable histories of the policy is restricted so that the cost of any history respects the initial budget (c(.) <= B).

GENERAL APPROACH

1. Find a distribution ℙ that describes the phenomenon under study well.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.

STATE OF THE ART
1. Find a distribution ℙ that describes the phenomenon under study well: X continuous random vector; ℙ multivariate Gaussian joint distribution.
2. Define the value of an adaptive sampling policy: entropy-based criterion; kriging variance.
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.

OUR CONTRIBUTION (state of the art -> this work)
1. Find a distribution ℙ that describes the phenomenon under study well: X continuous random vector -> X discrete random vector; ℙ multivariate Gaussian joint distribution -> ℙ Markov random field distribution.
2. Define the value of an adaptive sampling policy: entropy-based criteria, kriging variance -> Maximum Posterior Marginals (MPM).
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm -> reinforcement learning.

Formulation using a dynamic model: an adapted framework for reinforcement learning

Summarize the knowledge on X in a random vector S of length n. Observing variables updates our knowledge on X, i.e. makes S evolve.
Example: s = (-1, ..., k, ..., -1): the i-th entry equals k when variable X(i) was observed in state k (the value -1 marks variables not yet observed).
[Diagram: trajectory s_1 -A_1-> s_2 -A_2-> s_3 -A_3-> ... -> s_{H+1}, ending with the final utility U(s_{H+1}).]
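A minimal sketch of this state encoding, assuming the -1 convention above; the names and the array representation are illustrative, not taken from the paper:

```python
import numpy as np

def init_state(n: int) -> np.ndarray:
    """Knowledge vector S before any observation: every entry is -1 (unobserved)."""
    return np.full(n, -1, dtype=int)

def update_state(s: np.ndarray, A, x_A) -> np.ndarray:
    """Fold the new observations x(A) into S: entry i becomes the observed state."""
    s = s.copy()
    for i, value in zip(A, x_A):
        s[i] = value
    return s

# Usage on a 9-variable grid (0-based indices here, unlike the slide labels):
s1 = init_state(9)
s2 = update_state(s1, [7], [0])        # observe one variable in state 0
s3 = update_state(s2, [5, 6], [1, 3])  # then two more variables
```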

Reinforcement learning solution

Find the optimal policy: the Q-function.
Q*(s_t, A_t) = "the expected value of the history when starting in s_t, observing variables X(A_t) and then following the optimal policy".
Approach: compute Q*, then derive the optimal policy from Q*.
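In symbols, a standard finite-horizon form consistent with this description (the slide's own formula was lost in extraction, so the notation below, including the symbol δ* for the optimal policy, is an assumption):

```latex
Q^*(s_t, A_t) \;=\; \mathbb{E}\!\left[\, U(s_{H+1}) \;\middle|\; s_t,\; A_t,\;
    \text{subsequent samples chosen by the optimal policy} \right],
\qquad
\delta^*(s_t) \in \operatorname*{arg\,max}_{A_t} Q^*(s_t, A_t).
```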

Find the optimal policy: the Q-function.
How to compute Q*: classical solution (Q-learning, ...).
1. Initialize Q.
2. Simulate histories: at each step take the greedy action (maximising Q) with probability 1 - ε, a random action with probability ε.
[Diagram: simulated trajectory s_1 -A_1-> s_2 -A_2-> s_3, ending with U(s_3).]

Find the optimal policy: the Q-function.
How to compute Q*: classical solution (Q-learning, ...).
1. Initialize Q.
2. Simulate histories (greedy action with probability 1 - ε, random action with probability ε), updating Q(s_1, A_1), Q(s_2, A_2), ... along the way.
Repeat many times: Q converges to Q*.
[Diagram: simulated trajectory s_1 -A_1-> s_2 -A_2-> s_3, with U(s_3); each transition triggers an update of Q.]
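A compact, generic sketch of this loop (finite-horizon tabular Q-learning with epsilon-greedy exploration). The environment hooks, the toy usage, and all names are illustrative assumptions, not the authors' code; in the real problem, `step` would sample x(A_t) from the MRF conditional rather than read a fixed hidden field:

```python
import random
from collections import defaultdict

def q_learning(init_state, candidate_actions, step, utility, H,
               episodes=5000, alpha=0.1, epsilon=0.1):
    """Simulate histories, act greedily with probability 1 - epsilon
    (randomly with probability epsilon), and update Q(s_t, A_t)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = init_state()
        for t in range(H):
            actions = candidate_actions(s)
            if not actions:
                break
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next = step(s, a)
            next_actions = candidate_actions(s_next)
            # At the last step the target is the final utility U(s_{H+1});
            # otherwise it is the current estimate of max_a' Q(s_{t+1}, a').
            if t == H - 1 or not next_actions:
                target = utility(s_next)
            else:
                target = max(Q[(s_next, a2)] for a2 in next_actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

# Toy usage: 3 binary variables, a fixed "hidden" field, a budget of 2
# observations, and a utility that counts observed 1s (arbitrary choices).
hidden = (1, 0, 1)
init = lambda: (-1, -1, -1)
acts = lambda s: tuple(i for i, v in enumerate(s) if v == -1)
step = lambda s, i: tuple(hidden[j] if j == i else v for j, v in enumerate(s))
util = lambda s: sum(v == 1 for v in s)
Q = q_learning(init, acts, step, util, H=2, episodes=2000)
```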

Alternative approach. Linear approximation of the Q-function as a weighted sum of basis functions Φ_i. Choice of the functions Φ_i:

LSDP algorithm. Linear approximation of the Q-function, with a separate weight vector w_t for each decision step t = 1, ..., H. Compute the weights using backward induction.
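A hedged sketch of what such a backward-induction fit can look like: simulate trajectories once, then regress, from step H down to step 1, the targets onto the features by least squares. The trajectory data layout, the feature map `phi`, and the use of `np.linalg.lstsq` are assumptions for illustration, not the authors' exact LSDP implementation:

```python
import numpy as np

def fit_weights_backward(trajectories, phi, utility, H):
    """Backward-induction fit of a per-step linear Q approximation
    Q_t(s, A) ~ w_t . phi(s, A).
    Each trajectory is a dict with keys:
      'states'  : [s_1, ..., s_{H+1}]
      'actions' : [A_1, ..., A_H]
      'choices' : candidate action sets at s_2, ..., s_{H+1}
    """
    w = [None] * (H + 1)           # w[t] holds the weights of step t (1-based)
    for t in range(H, 0, -1):
        X, y = [], []
        for traj in trajectories:
            s_t, A_t = traj["states"][t - 1], traj["actions"][t - 1]
            X.append(phi(s_t, A_t))
            if t == H:
                # Last step: the regression target is the final utility U(s_{H+1}).
                y.append(utility(traj["states"][H]))
            else:
                # Earlier steps: target is the best value predicted at step t+1.
                s_next = traj["states"][t]
                y.append(max(float(np.dot(w[t + 1], phi(s_next, A)))
                             for A in traj["choices"][t - 1]))
        w[t], *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return w
```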

LSDP algorithm: application to sampling. 1. Computation of Φ_i(s_t, A_t). 2. Computation of ... 3. Computation of ...

LSDP algorithm: application to sampling. 1. Computation of Φ_i(s_t, A_t). 2. Computation of ... 3. Computation of ...
We fix |A_t| = 1 and use the approximation:
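Once per-step weights are available, the |A_t| = 1 restriction makes action selection an argmax over single unobserved variables; the short sketch below illustrates this, with all names and the feature map assumed rather than taken from the paper:

```python
import numpy as np

def lsdp_action(s_t, t, w, phi, unobserved):
    """Pick the single variable to observe next by maximising the linear
    Q approximation w_t . phi(s_t, {i}) over the unobserved variables i."""
    return max(unobserved,
               key=lambda i: float(np.dot(w[t], phi(s_t, {i}))))
```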

Experiments

Regular grid with first-order neighbourhood. The X(i) are binary variables. ℙ is a Potts model with β = 0.5. Simple cost: observing each variable costs 1.
[Figure: grid of nine variables X(1), ..., X(9).]
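To make the setup concrete, here is a small sketch that simulates such a binary Potts field on a grid by Gibbs sampling; the sampler, grid size, number of sweeps, and random seed are illustrative assumptions (the slides do not say how X was simulated):

```python
import numpy as np

def simulate_potts(n_side=10, beta=0.5, sweeps=200, rng=None):
    """Gibbs sampler for a binary Potts field on an n_side x n_side grid with
    first-order (4-neighbour) interactions:
    P(x) proportional to exp(beta * number of agreeing neighbour pairs)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=(n_side, n_side))
    for _ in range(sweeps):
        for i in range(n_side):
            for j in range(n_side):
                # Count neighbours currently in state 0 and in state 1.
                agree = [0.0, 0.0]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n_side and 0 <= nj < n_side:
                        agree[x[ni, nj]] += 1.0
                p1 = np.exp(beta * agree[1])
                p0 = np.exp(beta * agree[0])
                x[i, j] = int(rng.random() < p1 / (p0 + p1))
    return x

# A 10x10 field gives the n = 100 binary variables of this experiment.
field = simulate_potts(n_side=10, beta=0.5)
```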

Experiments. Comparison between:
- Random policy.
- BP-max heuristic: at each time step, observe the variable ...
- LSPI policy (a common reinforcement learning algorithm).
- LSDP policy, using the score:

Experiment: 100 variables (n=100)

LSDP and BP-max: no observations. [Figure: action values for the LSDP policy and the max marginals.]

LSDP and BP-max: one observation. [Figure: action values for the LSDP policy and the max marginals.]

LSDP and BP-max: two observations. [Figure: action values for the LSDP policy and the max marginals.]

Experiment: 100 variables, constrained moves. Allowed to visit the second-order neighbourhood only!

Experiment: 200 variables, different costs. Initial budget = 38.
Policy value: Random 60.27%, BP-max 61.77%, LSDP 64.8%, Min Cost 64.58%.
Mean number of observed variables: ...

Experiment: 200 variables, different costs. Initial budget = 38.
Cost repartition: [Figure comparing the cost repartition of the Random, BP-max, and LSDP policies.]

Experiment: 100 variables, different costs. Initial budget = 30. Cost of 1 when the observed variable is in state 0; cost of 3 when it is in state 1.
Policy value: Random 65.3%, BP-Max 66.2%, LSDP 67%.
Mean number of observed variables: 15.4
WR 1: 65.9%, 65.4%, 67%.
WR 0: 61%, 63.2%, 64%.

Conclusions
- An adapted framework for adaptive sampling of discrete random variables.
- LSDP: a reinforcement learning approach for finding a near-optimal policy.
- Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem.
- Computation of a near-optimal policy off-line.
- Design of new policies that outperform simple heuristics and a usual RL method.
Possible application? Weed sampling in crop fields.

Reconstruction of X(R) and trajectory value.
Example: A_1 = {8}, observed x(A_1) = 0; A_2 = {6, 7} (policy applied to (8, 0)), observed x(A_2) = (4, 2).
Maximum Posterior Marginal (MPM) for the reconstruction of X(R):
Quality of a trajectory:
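As a hedged illustration of an MPM-style reconstruction (the exact criterion on this slide was lost in extraction): given posterior marginals of the variables to reconstruct, pick for each one its most probable state; the marginals would typically come from belief propagation on the MRF conditioned on the observations. The names, array layout, and the score formula below are assumptions:

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray) -> np.ndarray:
    """marginals[i, k] ~ P(X(i) = k | x(A_1), ..., x(A_H)) for the variables
    in R; the MPM estimate picks the most probable state of each one."""
    return np.argmax(marginals, axis=1)

def mpm_score(marginals: np.ndarray) -> float:
    """One plausible trajectory-quality score: the average posterior
    probability of the MPM reconstruction (an assumption, not the slide's
    exact formula)."""
    return float(np.mean(np.max(marginals, axis=1)))
```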

LSDP algorithm. Linear approximation of the Q-function. How to compute w_1, ..., w_H:
[Diagram: simulated trajectories s_1 -a_1-> s_2 -> ... -> s_H -a_H-> s_{H+1}, each ending with the utility U(s_{H+1}).]

LSDP algorithm. Linear approximation of the Q-function. How to compute w_1, ..., w_H:
[Diagram: the last transitions (s_H, a_H) of the simulated trajectories, together with U(s_{H+1}), feed a linear system whose solution gives w_H.]

LSDP algorithm. Linear approximation of the Q-function. How to compute w_1, ..., w_H:
[Diagram: with w_H fixed, the transitions (s_{H-1}, a_{H-1}) of the simulated trajectories feed a linear system whose solution gives w_{H-1}.]