Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking
Authors: Vikram Krishnamurthy & Robin Evans
Presented by Shihao Ji, Duke University Machine Learning Group, June 10, 2005

Outline
Motivation
Overview
Multiarmed Bandits
HMM Multiarmed Bandits
Experimental Results

Motivation
An electronically scanned array (ESA) has only one steerable beam. The coordinates of each target evolve according to a finite-state Markov chain. Question: which single target should the tracker choose to observe at each time instant in order to optimize some specified cost function?

Overview - How It Works

Multiarmed Bandits: The Model
One has N parallel projects, indexed i = 1, 2, …, N, and at each instant of discrete time one can work on only a single project. Let the state of project i at time k be denoted $x^i_k$. If one works on project i at time k, then one pays an immediate expected cost of $c(x^i_k, i)$. The state changes to $x^i_{k+1}$ by a Markov transition rule (which may depend upon i, but not upon k), while the states of the projects one has not touched remain unchanged: $x^j_{k+1} = x^j_k$ for $j \neq i$. The problem is how to allocate one's effort over projects sequentially in time so as to minimize the expected total discounted cost.
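As a concrete illustration, here is a minimal simulation sketch of this model (the project count, state spaces, transition matrices, and costs below are all made up for illustration, not taken from the paper): each project is a small Markov chain, only the engaged project's state moves, and the controller accumulates discounted cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N projects, each a 3-state Markov chain with its own
# transition matrix and per-state cost vector (numbers are made up).
N, S, beta = 4, 3, 0.9
A = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]   # A[i][s, s'] = P(s -> s') for project i
c = [rng.uniform(0.0, 1.0, size=S) for _ in range(N)]       # c[i][s] = cost of working on i in state s
x = rng.integers(0, S, size=N)                              # current state of every project

def step(choose, x):
    """Engage one project for a single time step; all other projects stay frozen."""
    i = choose(x)                                           # which project to work on
    cost = c[i][x[i]]                                       # immediate cost paid for engaging project i
    x[i] = rng.choice(S, p=A[i][x[i]])                      # only project i's state evolves
    return cost

# Example policy (myopic, not optimal): work on the project with the smallest immediate cost.
myopic = lambda x: int(np.argmin([c[i][x[i]] for i in range(N)]))

total = sum(beta**k * step(myopic, x) for k in range(200))
print("discounted cost under the myopic policy:", round(total, 3))
```

The myopic policy above is only a baseline; the point of the Gittins index theory on the next slide is that the truly optimal policy is also an index rule, just with a more carefully constructed index.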

Gittins Index
This is the simplest non-trivial problem of its kind, and a classic one. No essentially tractable solution existed until Gittins and his co-workers. They proved that to each project i one can attach an index $\gamma^i(x^i_k)$ such that the optimal action at time k is to work on the project for which the current index is smallest. The index is calculated by solving the problem of allocating one's effort optimally between project i and a standard project which yields a constant cost. Gittins' result thus reduces the case of general N to the case N = 2.
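For orientation, the classical definition of the Gittins index in this cost-minimization setting can be written as an infimum over stopping times (this is the standard textbook form, shown here for reference rather than quoted from the slides):

$$\gamma^i(x) \;=\; \inf_{\tau \ge 1}\; \frac{\mathbb{E}\!\left[\left.\sum_{k=0}^{\tau-1} \beta^{k}\, c\!\left(x^i_k, i\right) \,\right|\, x^i_0 = x\right]}{\mathbb{E}\!\left[\left.\sum_{k=0}^{\tau-1} \beta^{k} \,\right|\, x^i_0 = x\right]},$$

where $\tau$ ranges over stopping times of the chain $\{x^i_k\}$. It can be read as the constant cost rate of the "standard project" at which one is indifferent between it and project i, which is why the project with the smallest index is the one worth engaging.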

HMM Multiarmed Bandits
The "standard" multiarmed bandit problem involves a fully observed finite-state Markov chain and is simply an MDP with a rich structure. In multitarget tracking, due to measurement noise at the sensor, the states are only partially observable. Thus, the multitarget tracking problem needs to be formulated as a multiarmed bandit involving HMMs (with an HMM filter to estimate the information state). It can be solved by brute force as a POMDP, but that involves a much higher (enormous) dimensional Markov chain. The bandit assumption decouples the problem.

Bandit Assumption
The information state of the currently observed target p is updated by the HMM filter:
$$\pi^{(p)}_{k+1} \;=\; T^{(p)}\!\left(\pi^{(p)}_{k}, y_{k+1}\right) \;=\; \frac{B^{(p)}(y_{k+1})\, A^{(p)\prime}\, \pi^{(p)}_{k}}{\mathbf{1}'\, B^{(p)}(y_{k+1})\, A^{(p)\prime}\, \pi^{(p)}_{k}},$$
where $A^{(p)}$ is the transition probability matrix of target p and $B^{(p)}(y)$ is the diagonal matrix of observation likelihoods for measurement y. For the other P-1 unobserved targets, their information states are kept frozen:
$$\pi^{(q)}_{k+1} \;=\; \pi^{(q)}_{k} \quad \text{if target } q \text{ is not observed at time } k.$$
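A minimal sketch of this update in code (the function names, toy transition matrix, and likelihood vector are illustrative, not from the paper): the observed target's belief is propagated through the HMM filter, while every other target's belief is simply copied forward.

```python
import numpy as np

def hmm_filter(pi, A, B_y):
    """One HMM filter step: predict with transition matrix A, correct with the
    observation likelihoods B_y = [P(y | state s)]_s, then normalize."""
    unnorm = B_y * (A.T @ pi)                 # elementwise correction of the predicted belief
    return unnorm / unnorm.sum()

def update_beliefs(beliefs, observed, y_likelihoods, A):
    """Bandit assumption: only the observed target's information state moves."""
    new_beliefs = [pi.copy() for pi in beliefs]               # frozen by default
    new_beliefs[observed] = hmm_filter(beliefs[observed], A[observed], y_likelihoods)
    return new_beliefs

# Toy example with 2 targets and 3 quantized-distance states.
A = [np.array([[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]])] * 2
beliefs = [np.ones(3) / 3, np.ones(3) / 3]
beliefs = update_beliefs(beliefs, observed=0, y_likelihoods=np.array([0.7, 0.2, 0.1]), A=A)
print(beliefs[0], beliefs[1])   # target 0 updated, target 1 unchanged
```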

Why Is the Bandit Assumption Valid?
Slow Dynamics: slowly moving targets have an approximate bandit structure, where the transition matrix is a small perturbation of the identity, $A^{(p)} = I + \varepsilon^{(p)} Q^{(p)}$ with $\varepsilon^{(p)}$ small.
Decoupling Approximation: without the bandit assumption, the optimal solution is intractable. The bandit model is perhaps the only reasonable approximation that leads to a computationally tractable solution.
Reinitialization: a compromise. Reinitialize the HMM multiarmed bandit at regular intervals with updated estimates of all targets.

Some Details
Finite-State Markov Assumption: $x^{(p)}_k$ denotes the quantized distance of the p-th target from the base station, and this target distance evolves according to a finite-state Markov chain.
Cost structure: the cost typically depends on the distance of the p-th target to the base station, i.e., targets that get close to the base station pose a greater threat and are given higher priority by the tracking algorithm.
Objective function: minimize the expected total discounted cost over scheduling policies,
$$J \;=\; \mathbb{E}\!\left\{ \sum_{k=0}^{\infty} \beta^{k}\, c\!\left(x^{(u_k)}_{k}, u_k\right) \right\},$$
where $u_k \in \{1, \dots, P\}$ is the target the beam is steered to at time k and $\beta \in (0,1)$ is the discount factor.

Optimal Solution
Under the bandit assumption, the optimal solution has an indexable (decoupling) rule; that is, the optimization can be decoupled into P independent optimization problems. For each target p, there is a function $\gamma^{(p)}(\pi)$ (the Gittins index), which can be computed by POMDP algorithms; see the next slides. The optimal scheduling policy at time k is to steer the beam toward the target with the smallest Gittins index:
$$u_k \;=\; \arg\min_{p \in \{1,\dots,P\}} \gamma^{(p)}\!\left(\pi^{(p)}_{k}\right).$$
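Putting the pieces together, a scheduler built on this rule might look like the sketch below (the function names schedule_beam, gittins_index, and get_measurement are hypothetical; the Gittins index is assumed precomputed offline, e.g. by one of the POMDP methods discussed later, so this is an illustration rather than the paper's implementation):

```python
import numpy as np

def schedule_beam(beliefs, gittins_index, hmm_filter, get_measurement, A, steps=100):
    """Greedy index policy: at each step, steer the beam toward the target whose
    current information state has the smallest Gittins index, then update only
    that target's belief with the new measurement (all others stay frozen)."""
    for k in range(steps):
        # Evaluate the (precomputed) Gittins index of every target at its current belief.
        indices = [gittins_index(p, beliefs[p]) for p in range(len(beliefs))]
        p_star = int(np.argmin(indices))            # target with the smallest index gets the beam

        # Only the observed target's information state is propagated by the HMM filter.
        y_likelihoods = get_measurement(p_star)     # vector of P(y | state s) for the chosen target
        beliefs[p_star] = hmm_filter(beliefs[p_star], A[p_star], y_likelihoods)
    return beliefs
```

Here hmm_filter is the update from the earlier sketch, and gittins_index(p, pi) would in practice interpolate a table of index values precomputed over the belief simplex of target p.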

Gittins Index
For an arbitrary multiarmed bandit problem, the Gittins index can be calculated by solving an associated infinite-horizon discounted control problem called the "return-to-state" (or restart) problem. For target p, given its information state $\pi^{(p)}_k$ at time k, there are two actions:
1) Continue, which incurs the cost $c_p' \pi^{(p)}_k$ and evolves the information state according to the HMM filter;
2) Restart, which moves the information state to a fixed information state $\pi$, incurs the cost $c_p' \pi$, and then evolves it according to the HMM filter.

The Gittins index of the information state $\pi$ of target p is given by $\gamma^{(p)}(\pi) = V^{(p)}(\pi, \pi)$, where the value function $V^{(p)}(\cdot, \pi)$ of the return-to-state problem satisfies the Bellman equation
$$V^{(p)}(\hat\pi, \pi) \;=\; \min\Big\{\, c_p'\hat\pi + \beta \sum_{y} V^{(p)}\!\big(T^{(p)}(\hat\pi, y), \pi\big)\, \sigma^{(p)}(\hat\pi, y)\,,\;\; c_p'\pi + \beta \sum_{y} V^{(p)}\!\big(T^{(p)}(\pi, y), \pi\big)\, \sigma^{(p)}(\pi, y) \,\Big\},$$
where the first term corresponds to the continue action, the second to the restart action, $T^{(p)}(\cdot, y)$ is the HMM filter update, and $\sigma^{(p)}(\pi, y) = \mathbf{1}' B^{(p)}(y) A^{(p)\prime} \pi$ is the probability of observing y.
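A simple, approximate way to get a feel for this computation is value iteration on a discretized belief simplex. This is not the paper's algorithm (the paper uses exact POMDP solvers, next slide); it is a hedged illustration with made-up parameters for a single two-state target.

```python
import numpy as np

# Hypothetical single-target HMM: 2 states, 2 observations, made-up numbers.
A = np.array([[0.9, 0.1], [0.2, 0.8]])        # transition matrix, rows sum to 1
B = np.array([[0.8, 0.2], [0.3, 0.7]])        # B[s, y] = P(observe y | state s)
c = np.array([1.0, 5.0])                      # per-state cost (state 1 = close / threatening)
beta = 0.9

def filter_step(pi, y):
    """HMM filter update and the probability sigma of seeing observation y."""
    unnorm = B[:, y] * (A.T @ pi)
    sigma = unnorm.sum()
    return unnorm / sigma, sigma

def gittins_index(pi_bar, grid_size=101, iters=200):
    """Approximate the Gittins index of belief [pi_bar, 1 - pi_bar] by value
    iteration for the return-to-state ('restart') problem on a belief grid."""
    grid = np.linspace(0.0, 1.0, grid_size)              # belief parameterized by P(state 0)
    V = np.zeros(grid_size)
    lookup = lambda pi: np.interp(pi[0], grid, V)        # linear interpolation of V on the grid

    def q_continue(pi):
        expected_future = 0.0
        for y in range(B.shape[1]):
            next_pi, sigma = filter_step(pi, y)
            expected_future += sigma * lookup(next_pi)
        return c @ pi + beta * expected_future

    for _ in range(iters):
        restart_value = q_continue(np.array([pi_bar, 1.0 - pi_bar]))
        V = np.array([min(q_continue(np.array([p, 1.0 - p])), restart_value)
                      for p in grid])
    return np.interp(pi_bar, grid, V)                    # index value, up to a constant scaling

print(gittins_index(pi_bar=0.5))
```

Only the ordering of the indices across targets matters for the scheduling rule, so the absolute scale of the value function (for example, whether one multiplies by 1 - beta) does not affect which target gets the beam.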

POMDP Solver
Defining new parameters (see eq. 15 of the paper), the return-to-state problem can be solved by any standard POMDP solver, such as Sondik's algorithm, the Witness algorithm, Incremental Pruning, or suboptimal (approximate) algorithms.

Experimental Results