Inverse Reinforcement Learning (IRL) Approach to Modeling Pastoralist Movements
Presented by: Nikhil Kejriwal
Advised by: Theo Damoulas (ICS), Carla Gomes (ICS)
In collaboration with: Bistra Dilkina (ICS), Russell Toth (Dept. of Applied Economics)

Outline
Background
Reinforcement Learning
Inverse Reinforcement Learning (IRL)
Pastoral Problem
Model
Results

Pastoralists of Africa
The survey was collected by the USAID Global Livestock Collaborative Research Support Program (GL CRSP) under the PARIMA project led by Prof. C. B. Barrett.
The project focuses on six locations in Northern Kenya and Southern Ethiopia.
We wish to explain the movement of animal herders over time and space.
The movement of herders is driven by highly variable rainfall: between winters and summers there are dry seasons with virtually no precipitation, and herds migrate to remote water points.
Pastoralists can suffer greatly from droughts, losing large portions of their herds.
We are interested in the herders' spatio-temporal movement problem in order to understand the incentives on which they base their decisions. This can help inform policies (control of grazing, drilling of water points).

Reinforcement Learning
A common form of learning among animals:
The agent interacts with an environment (takes an action),
transitions into a new state,
and gets a positive or negative reward.

Reinforcement Learning
Goal: pick actions over time so as to maximize the expected score: E[R(s_0) + R(s_1) + ... + R(s_T)]
Solution: a policy π which specifies an action for each possible state.
Diagram: reward function R(s) + environment model (MDP) → Reinforcement Learning → optimal policy π

Inverse Reinforcement Learning (IRL)
Diagram: expert trajectories s_0, a_0, s_1, a_1, s_2, a_2, ... + environment model (MDP) → Inverse Reinforcement Learning → reward function R(s) that explains the expert trajectories

Reinforcement Learning
An MDP is represented as a tuple (S, A, {P_sa}, γ, R); R is bounded by R_max.
Value function for policy π:  V^π(s) = E[R(s_0) + γ R(s_1) + γ² R(s_2) + ... | π, s_0 = s]
Q-function:  Q^π(s, a) = R(s) + γ E_{s' ~ P_sa}[V^π(s')]

Bellman equation:  V^π(s) = R(s) + γ Σ_{s'} P_{s,π(s)}(s') V^π(s')
Bellman optimality:  V*(s) = R(s) + γ max_a Σ_{s'} P_{s,a}(s') V*(s')
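For illustration, a minimal value-iteration sketch of the Bellman optimality backup, assuming a small tabular MDP stored as a transition array P[s, a, s'] and a reward vector R (illustrative names, not from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for an MDP given as arrays.

    P: (S, A, S) array of transition probabilities P[s, a, s'].
    R: (S,) array of state rewards.
    Returns the optimal value function V* and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) <- R(s) + gamma * max_a sum_s' P[s,a,s'] V(s')
        Q = R[:, None] + gamma * (P @ V)          # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)
    return V, policy
```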

Inverse Reinforcement Learning
Linear approximation of the reward function using basis functions φ_1, ..., φ_d:
R(s) = α_1 φ_1(s) + α_2 φ_2(s) + ... + α_d φ_d(s)
Let V_i^π be the value function of policy π when the reward is R = φ_i. By linearity, V^π = α_1 V_1^π + ... + α_d V_d^π.
For computing R that makes π* optimal, we require that at every state the action prescribed by π* achieves at least as high an expected value under V^π as any other action.

Inverse Reinforcement Learning
The expert policy π* is only accessible through a set of sampled trajectories.
For a trajectory state sequence (s_0, s_1, s_2, ...), considering just the i-th basis function:
V̂_i(s_0) = φ_i(s_0) + γ φ_i(s_1) + γ² φ_i(s_2) + ...
Note that this is the sum of discounted features along a trajectory.
The estimated value will be: V̂^{π*}(s_0) = α_1 V̂_1(s_0) + ... + α_d V̂_d(s_0), averaged over the available trajectories.
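A small sketch of this estimate, assuming each trajectory is a list of states and a hypothetical `features(s)` returns the basis-function vector φ(s):

```python
import numpy as np

def discounted_feature_sum(trajectory, features, gamma=0.95):
    """Sum of discounted basis features phi(s_t) along one trajectory."""
    total = None
    for t, s in enumerate(trajectory):
        phi = np.asarray(features(s), dtype=float)
        total = (gamma ** t) * phi if total is None else total + (gamma ** t) * phi
    return total

def estimated_values(trajectories, features, gamma=0.95):
    """Average the discounted feature sums over trajectories.

    Returns a vector with one entry per basis function; the estimated value of
    the expert is then the dot product of this vector with the weights alpha.
    """
    sums = [discounted_feature_sum(traj, features, gamma) for traj in trajectories]
    return np.mean(sums, axis=0)
```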

Inverse Reinforcement Learning
Assume we have some set of candidate policies {π_1, ..., π_k}.
Linear programming formulation (trajectory-based IRL):
maximize over α:  Σ_k p( V̂^{π*}(s_0) − V̂^{π_k}(s_0) )  subject to |α_i| ≤ 1,
where p(x) = x if x ≥ 0 and p(x) = 2x otherwise, so that shortfalls against the expert are penalized more heavily.
The above optimization gives a new reward R; we then compute the optimal policy based on R, add it to the set of policies, and reiterate.
(Andrew Ng & Stuart Russell, 2000)
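One possible sketch of this optimization, writing p(x) as min(x, 2x) and using cvxpy as a convenient solver; the inputs mu_expert and mu_policies are the averaged discounted feature sums from the previous sketch, and all names are illustrative:

```python
import numpy as np
import cvxpy as cp

def irl_lp_step(mu_expert, mu_policies):
    """One trajectory-based IRL optimization in the spirit of Ng & Russell (2000).

    mu_expert: (d,) averaged discounted feature sums of the expert trajectories.
    mu_policies: list of (d,) vectors, one per previously found policy.
    Returns reward weights alpha with |alpha_i| <= 1.
    """
    d = len(mu_expert)
    alpha = cp.Variable(d)
    diffs = [(np.asarray(mu_expert) - np.asarray(mu_k)) @ alpha for mu_k in mu_policies]
    # p(x) = min(x, 2x): gains count once, shortfalls are penalized twice as much
    objective = cp.Maximize(sum(cp.minimum(dk, 2 * dk) for dk in diffs))
    constraints = [cp.abs(alpha) <= 1]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```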

Apprenticeship learning to recover R (Pieter Abbeel & Andrew Ng, 2004)
Find R such that R is consistent with the teacher's policy π* being optimal.
Find R s.t.:  E[Σ_t γ^t R(s_t) | π*] ≥ E[Σ_t γ^t R(s_t) | π] for every policy π.
Find t, w:  maximize the margin t subject to w^T μ(π*) ≥ w^T μ(π) + t for every policy π (made precise on the algorithm slides below).

Pastoral Problem
We have data describing:
Household information from 5 villages
Lat/Long information of all water points (311) and villages
All the water points visited over the last quarter by a sub-herd
Time spent at each water point
Estimated capacity of each water point
Vegetation information around the water points
Herd sizes and types
We have been able to generate around 1750 expert trajectories, each described over a period of 3 months (~90 days).

All water points

All trajectories

Sample trajectories

State Space
Model option 1: the state is uniquely identified by geographical location (long, lat) and the herd size.
S = (Long, Lat, Herd)
A = (Stay, Move to an adjacent cell on the grid)
Model option 2: the state tracks an edge between two water points plus the herd size.
S = (wp1, wp2, Herd)
A = (Stay on the same edge, Move to another edge)
This second option gives a larger state space.
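A minimal sketch of the first option, assuming a regular lat/long grid and a discretized herd size; all names here are illustrative, not from the data:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    STAY = 0
    NORTH = 1
    SOUTH = 2
    EAST = 3
    WEST = 4

@dataclass(frozen=True)
class State:
    col: int    # discretized longitude (grid column)
    row: int    # discretized latitude (grid row)
    herd: int   # discretized herd-size bucket

def step(state: State, action: Action) -> State:
    """Deterministic grid movement; herd size is kept fixed in this sketch."""
    moves = {Action.STAY: (0, 0), Action.NORTH: (0, 1), Action.SOUTH: (0, -1),
             Action.EAST: (1, 0), Action.WEST: (-1, 0)}
    dc, dr = moves[action]
    return State(state.col + dc, state.row + dr, state.herd)
```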

Modeling the Reward
Linear model:
R(s) = θ^T [veg(long, lat), cap(long, lat), herd_size, is_village(long, lat), ..., interaction_terms]
with normalized values of veg, cap, and herd_size.
RBF model:
30 basis functions f_i(s), with R(s) = Σ_{i=1}^{30} θ_i f_i(s),
where s = (veg(long, lat), cap(long, lat), herd_size, is_village(long, lat)).
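A minimal sketch of the two parameterizations, assuming pre-extracted, normalized per-state features and Gaussian basis functions for the RBF model; the centers and width are illustrative choices, not values from the slides:

```python
import numpy as np

def linear_reward(features, theta):
    """Linear model: R(s) = theta^T [veg, cap, herd_size, is_village, ...]."""
    return float(np.dot(theta, features))

def rbf_reward(features, centers, theta, width=1.0):
    """RBF model: R(s) = sum_i theta_i * exp(-||s - c_i||^2 / (2 * width^2)).

    centers: (30, d) array of basis-function centers c_i.
    theta:   (30,) array of weights.
    """
    diffs = centers - np.asarray(features)                            # (30, d)
    basis = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * width ** 2))    # f_i(s)
    return float(np.dot(theta, basis))
```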

Toy Problem
Used exactly the same model.
Pre-defined the weights θ to obtain a known reward function R(s).
Used a synthetic generator to generate the expert policy and trajectories.
Ran IRL to recover a reward function.
Compared the recovered reward with the known reward.

Toy Problem

Linear Reward Model - recovered from pastoral trajectories

RBF Reward Model - recovered from pastoral trajectories

Currently working on ...
Including time as another dimension in the state space
Specifying a performance metric for the recovered reward function
Cross-validation
Specifying a better/novel reward function

Thank You Questions / Comments

Weights computed by running IRL on the actual problem

Algorithm
For t = 1, 2, ...
Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward w.
Courtesy of Pieter Abbeel

Algorithm: IRL step
Maximize the margin δ over w with ||w||_2 ≤ 1,
subject to: V_w(π_E) ≥ V_w(π_i) + δ, for i = 1, ..., t−1.
δ is the margin of the expert's performance over the performance of the previously found policies.
V_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] = w^T μ(π)
μ(π) = E[Σ_t γ^t φ(s_t) | π] are the "feature expectations".
Courtesy of Pieter Abbeel
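A minimal sketch of this max-margin step, assuming the feature expectations μ(π_E) and μ(π_i) have already been estimated; cvxpy is used here as one convenient solver, and the variable names are illustrative:

```python
import numpy as np
import cvxpy as cp

def max_margin_irl_step(mu_expert, mu_policies):
    """Find reward weights w (||w||_2 <= 1) and the largest margin delta such that
    w^T mu(pi_E) >= w^T mu(pi_i) + delta for all previously found policies."""
    d = len(mu_expert)
    w = cp.Variable(d)
    delta = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    for mu_i in mu_policies:
        constraints.append(np.asarray(mu_expert) @ w >= np.asarray(mu_i) @ w + delta)
    cp.Problem(cp.Maximize(delta), constraints).solve()
    return w.value, delta.value
```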

Feature Expectation Closeness and Performance
If we can find a policy π̃ such that ||μ(π_E) − μ(π̃)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have
|V_{w*}(π_E) − V_{w*}(π̃)| = |w*^T μ(π_E) − w*^T μ(π̃)| ≤ ||w*||_2 ||μ(π_E) − μ(π̃)||_2 ≤ ε.
Courtesy of Pieter Abbeel

Algorithm
For i = 1, 2, ...
Inverse RL step: solve the max-margin optimization above for the weights w and margin δ.
RL step (= constraint generation): compute the optimal policy π_i for the estimated reward R_w.
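A sketch of the overall loop under stated assumptions: the caller supplies a `feature_expectations(policy)` callable that estimates μ(π) from rollouts and a `solve_mdp(w)` callable that computes an optimal policy for the reward w^T φ(s); both stand in for RL machinery not shown on the slides. It reuses `max_margin_irl_step` from the previous sketch:

```python
def apprenticeship_learning(mu_expert, initial_policy, feature_expectations,
                            solve_mdp, n_iters=20, tol=1e-3):
    """Alternate the IRL step (max-margin weights) and the RL step (constraint generation).

    mu_expert: estimated feature expectations of the expert, mu(pi_E).
    feature_expectations: callable mapping a policy to its estimated mu(pi).
    solve_mdp: callable mapping reward weights w to an optimal policy for R(s) = w^T phi(s).
    """
    policies = [initial_policy]
    mus = [feature_expectations(initial_policy)]
    w = None
    for _ in range(n_iters):
        w, margin = max_margin_irl_step(mu_expert, mus)   # Inverse RL step
        if margin is not None and margin <= tol:          # expert no longer separable: stop
            break
        pi_new = solve_mdp(w)                             # RL step (constraint generation)
        policies.append(pi_new)
        mus.append(feature_expectations(pi_new))
    return w, policies
```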