1
Inverse reinforcement learning (IRL) approach to modeling pastoralist movements
Presented by Nikhil Kejriwal. Advised by Theo Damoulas (ICS) and Carla Gomes (ICS), in collaboration with Bistra Dilkina (ICS) and Russell Toth (Dept. of Applied Economics).
2
Outline
- Background
- Reinforcement Learning
- Inverse Reinforcement Learning (IRL)
- Pastoral Problem
- Model
- Results
3
Pastoralists of Africa
The survey was collected by the USAID Global Livestock Collaborative Research Support Program (GL CRSP) under the PARIMA project led by Prof. C. B. Barrett. The project focuses on six locations in Northern Kenya and Southern Ethiopia. We wish to explain the movement of animal herders over time and space. Herders move because rainfall is highly variable: between winters and summers there are dry seasons with virtually no precipitation, and herds migrate to remote water points. Pastoralists can suffer greatly from droughts, losing large portions of their herds. We study the herders' spatiotemporal movement problem to understand the incentives on which they base their decisions; this can help inform policies (control of grazing, drilling of water points).
4
Reinforcement Learning
A common form of learning among animals. The agent interacts with an environment (takes an action), transitions into a new state, and receives a positive or negative reward.
5
Reinforcement Learning
Goal: pick actions over time so as to maximize the expected score E[R(s_0) + R(s_1) + … + R(s_T)].
Solution: a policy, which specifies an action for each possible state.
[Diagram: the reward function R(s) and the environment model (MDP) are fed into reinforcement learning, which outputs an optimal policy.]
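To make the objective concrete, here is a minimal sketch (not from the slides) of evaluating a fixed policy by Monte Carlo rollouts on a toy MDP; the states, transitions, and rewards below are invented purely for illustration.

```python
import random

# Toy MDP (hypothetical, for illustration only): 3 states, 2 actions.
# transitions[state][action] -> list of (probability, next_state)
transitions = {
    0: {"stay": [(1.0, 0)], "move": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "move": [(0.7, 2), (0.3, 1)]},
    2: {"stay": [(1.0, 2)], "move": [(1.0, 2)]},
}
reward = {0: 0.0, 1: 1.0, 2: 5.0}          # R(s)
policy = {0: "move", 1: "move", 2: "stay"}  # one action per state

def rollout(T=10):
    """Simulate one episode of length T and return R(s_0) + ... + R(s_T)."""
    s, total = 0, 0.0
    for _ in range(T + 1):
        total += reward[s]
        probs, states = zip(*transitions[s][policy[s]])
        s = random.choices(states, weights=probs)[0]
    return total

# Estimate E[R(s_0) + ... + R(s_T)] under the policy by averaging rollouts.
estimate = sum(rollout() for _ in range(5000)) / 5000
print("estimated expected score:", round(estimate, 2))
```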
6
Inverse Reinforcement Learning (IRL)
[Diagram: IRL inverts the RL pipeline. Given the environment model (MDP) and an optimal policy observed only through expert trajectories s_0, a_0, s_1, a_1, s_2, a_2, …, IRL recovers a reward function R(s) that explains the expert trajectories.]
7
Reinforcement Learning
An MDP is represented as a tuple (S, A, {P_sa}, γ, R), where R is bounded by R_max.
Value function for policy π: V^π(s_0) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + … | π].
Q-function: Q^π(s, a) = R(s) + γ E_{s' ~ P_sa}[V^π(s')].
8
Bellman equation: V^π(s) = R(s) + γ Σ_{s'} P_{s,π(s)}(s') V^π(s').
Bellman optimality: V*(s) = R(s) + γ max_a Σ_{s'} P_{s,a}(s') V*(s').
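As a concrete illustration of the Bellman optimality backup, here is a minimal value-iteration sketch on tabular arrays; the transition tensor P and reward vector R are placeholders, not the pastoral model.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Solve V*(s) = R(s) + gamma * max_a sum_s' P[a, s, s'] V*(s').

    P: (n_actions, n_states, n_states) transition probabilities
    R: (n_states,) state rewards
    Returns the optimal value function and a greedy policy.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = R(s) + gamma * sum_s' P[a, s, s'] * V(s')
        Q = R[None, :] + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Tiny random MDP, purely for illustration.
rng = np.random.default_rng(0)
P = rng.random((2, 4, 4)); P /= P.sum(axis=2, keepdims=True)
R = np.array([0.0, 0.0, 1.0, 5.0])
V_star, policy = value_iteration(P, R)
print(V_star, policy)
```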
9
Inverse Reinforcement Learning
Linear approximation of the reward function using basis functions φ_i: R(s) = α_1 φ_1(s) + α_2 φ_2(s) + … + α_d φ_d(s).
Let V_i^π be the value function of policy π when the reward is R = φ_i. By linearity, V^π = α_1 V_1^π + … + α_d V_d^π.
We then compute weights α (and hence R) that make the expert policy π* optimal.
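A minimal sketch (with an invented tabular MDP, not the pastoral model) of the linearity used above: evaluate V_i^π once per basis reward φ_i by solving (I − γ P_π) V_i = φ_i, then combine the results with any weight vector α.

```python
import numpy as np

def policy_evaluation(P_pi, r, gamma=0.95):
    """Solve V = r + gamma * P_pi V exactly, i.e. V = (I - gamma P_pi)^{-1} r."""
    n = r.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

rng = np.random.default_rng(1)
n_states, d = 5, 3
P_pi = rng.random((n_states, n_states)); P_pi /= P_pi.sum(axis=1, keepdims=True)
Phi = rng.random((n_states, d))          # Phi[:, i] = basis reward phi_i(s)
alpha = np.array([0.5, -1.0, 2.0])       # candidate reward weights

# V_i^pi for each basis reward, computed once.
V_basis = np.stack([policy_evaluation(P_pi, Phi[:, i]) for i in range(d)], axis=1)

# Linearity: the value under R = Phi @ alpha equals the weighted sum of the V_i^pi.
V_direct = policy_evaluation(P_pi, Phi @ alpha)
V_combined = V_basis @ alpha
print(np.allclose(V_direct, V_combined))  # True
```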
10
Inverse Reinforcement Learning
The expert policy π* is only accessible through a set of sampled trajectories.
For a trajectory with state sequence (s_0, s_1, s_2, …), considering just the i-th basis function, compute V̂_i(s_0) = φ_i(s_0) + γ φ_i(s_1) + γ^2 φ_i(s_2) + … . Note that this is the sum of discounted features along the trajectory.
The estimated value is then V̂(s_0) = α_1 V̂_1(s_0) + … + α_d V̂_d(s_0), averaged over the sampled trajectories.
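A minimal sketch of this empirical estimate, assuming each trajectory is stored as an array of per-state feature vectors φ(s_t); the feature values below are placeholders, not the survey variables.

```python
import numpy as np

def discounted_feature_sum(trajectory_features, gamma=0.95):
    """Return sum_t gamma^t * phi(s_t) for one trajectory.

    trajectory_features: (T, d) array, row t = phi(s_t).
    """
    T = trajectory_features.shape[0]
    discounts = gamma ** np.arange(T)
    return discounts @ trajectory_features          # shape (d,)

def estimate_values(trajectories, alpha, gamma=0.95):
    """Average discounted feature sums over trajectories, then weight by alpha."""
    mu_hat = np.mean(
        [discounted_feature_sum(traj, gamma) for traj in trajectories], axis=0
    )
    return mu_hat, float(mu_hat @ alpha)            # (feature estimate, value estimate)

# Two made-up trajectories with d = 3 features per state.
trajs = [np.random.rand(90, 3), np.random.rand(85, 3)]
mu_hat, v_hat = estimate_values(trajs, alpha=np.array([1.0, -0.5, 0.2]))
print(mu_hat, v_hat)
```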
11
Inverse Reinforcement Learning
Assume we have some set of candidate policies {π_1, …, π_k}.
Linear programming formulation: maximize Σ_k p(V̂^{π*}(s_0) − V̂^{π_k}(s_0)) subject to |α_i| ≤ 1 for all i, where p(x) = x if x ≥ 0 and p(x) = 2x otherwise (violations are penalized more heavily).
The above optimization gives a new reward R; we then compute the optimal policy under R, add it to the set of policies, and reiterate. (Andrew Ng & Stuart Russell, 2000)
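A minimal sketch of one LP iteration under the formulation above, linearizing p(x) = min(x, 2x) with auxiliary variables and solving with scipy; the inputs are per-policy discounted feature estimates as on the previous slide, and this is an illustrative reconstruction rather than the project's actual code.

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp_step(mu_expert, mu_policies):
    """One IRL LP step: find reward weights alpha maximizing
    sum_k p(V_expert - V_k), with p(x) = min(x, 2x) and |alpha_i| <= 1.

    mu_expert: (d,) discounted feature estimate of the expert.
    mu_policies: (K, d) feature estimates of previously found policies.
    """
    K, d = mu_policies.shape
    diffs = mu_expert[None, :] - mu_policies        # V_expert - V_k = diffs[k] @ alpha

    # Variables x = [alpha (d), z (K)]; maximize sum(z) -> minimize -sum(z).
    c = np.concatenate([np.zeros(d), -np.ones(K)])

    # z_k <= diffs[k] @ alpha  and  z_k <= 2 * diffs[k] @ alpha
    A_ub = np.vstack([
        np.hstack([-diffs, np.eye(K)]),
        np.hstack([-2 * diffs, np.eye(K)]),
    ])
    b_ub = np.zeros(2 * K)

    bounds = [(-1, 1)] * d + [(None, None)] * K
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]                                 # recovered alpha

# Made-up numbers purely to exercise the routine.
alpha = irl_lp_step(np.array([2.0, 1.0, 0.5]),
                    np.array([[1.5, 1.2, 0.4], [1.8, 0.9, 0.6]]))
print(alpha)
```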
12
Apprenticeship learning to recover R (Pieter Abbeel & Andrew Ng, 2004)
Find R such that it is consistent with the teacher's policy π* being optimal, i.e. find R such that E[Σ_t γ^t R(s_t) | π*] ≥ E[Σ_t γ^t R(s_t) | π] for all policies π.
With R(s) = w^T φ(s), this amounts to finding t, w such that w^T μ(π*) ≥ w^T μ(π) + t for the candidate policies π, with ||w||_2 ≤ 1.
13
Pastoral Problem
We have data describing:
- Household information from 5 villages
- Latitude/longitude of all water points (311) and villages
- All the water points visited over the last quarter by a sub-herd
- Time spent at each water point
- Estimated capacity of each water point
- Vegetation information around the water points
- Herd sizes and types
We have been able to generate around 1,750 expert trajectories, each described over a period of 3 months (~90 days).
14
All water points
15
All trajectories
16
Sample trajectories
17
State Space
Model: a state is uniquely identified by geographical location (longitude, latitude) and the herd size.
S = (long, lat, herd); A = (stay, move to an adjacent cell on the grid).
2nd option for the model: S = (wp1, wp2, herd); A = (stay on the same edge, move to another edge). This gives a larger state space.
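A minimal sketch of the first state-space option, assuming a discretized longitude/latitude grid and a few herd-size buckets; the grid resolution and bucket counts are invented for illustration.

```python
from itertools import product

# Hypothetical discretization: 20 x 20 spatial grid, 3 herd-size buckets.
N_LON, N_LAT, N_HERD = 20, 20, 3

# State = (lon_index, lat_index, herd_bucket); actions = stay or move to a neighbor.
ACTIONS = ["stay", "north", "south", "east", "west"]
MOVES = {"stay": (0, 0), "north": (0, 1), "south": (0, -1),
         "east": (1, 0), "west": (-1, 0)}

states = list(product(range(N_LON), range(N_LAT), range(N_HERD)))

def next_state(state, action):
    """Deterministic grid transition; moves off the grid leave the state unchanged."""
    lon, lat, herd = state
    dlon, dlat = MOVES[action]
    new_lon, new_lat = lon + dlon, lat + dlat
    if 0 <= new_lon < N_LON and 0 <= new_lat < N_LAT:
        return (new_lon, new_lat, herd)
    return state

print(len(states), "states,", len(ACTIONS), "actions")
print(next_state((0, 0, 2), "north"))   # (0, 1, 2)
```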
18
Modeling Reward
Linear model: R(s) = θ^T [veg(long, lat), cap(long, lat), herd_size, is_village(long, lat), …, interaction_terms], using normalized values of veg, cap, and herd_size.
RBF model: 30 basis functions f_i(s), with R(s) = Σ_{i=1}^{30} θ_i f_i(s), where s = (veg(long, lat), cap(long, lat), herd_size, is_village(long, lat)).
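A minimal sketch of the two reward parameterizations, assuming the state features are already extracted into a vector; the RBF centers, bandwidth, and weights below are placeholders, not the fitted values.

```python
import numpy as np

def linear_reward(features, theta):
    """R(s) = theta^T phi(s) on a feature vector [veg, cap, herd_size, is_village, ...]."""
    return float(theta @ features)

def rbf_reward(features, centers, weights, bandwidth=1.0):
    """R(s) = sum_i theta_i * exp(-||phi(s) - c_i||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((centers - features) ** 2, axis=1)
    basis = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return float(weights @ basis)

# Illustrative numbers only: normalized veg, cap, herd_size, is_village.
phi = np.array([0.6, 0.3, 0.8, 0.0])
theta_lin = np.array([1.2, 0.7, -0.4, 2.0])
centers = np.random.rand(30, 4)      # 30 RBF centers over the feature space
theta_rbf = np.random.randn(30)

print(linear_reward(phi, theta_lin))
print(rbf_reward(phi, centers, theta_rbf))
```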
19
Toy Problem
We used exactly the same model: pre-defined the weights θ to obtain a known reward function R(s), used a synthetic generator to produce an expert policy and trajectories, ran IRL to recover a reward function, and compared the recovered reward with the known reward.
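A minimal sketch of the comparison step, assuming the true and recovered rewards have been evaluated on the same set of states; the correlation metric here is an illustrative choice, not necessarily the one used in the project.

```python
import numpy as np

def compare_rewards(r_true, r_recovered):
    """Compare two reward functions evaluated on the same states.

    Rewards recovered by IRL are only identified up to scale (and often shift),
    so compare after standardizing each to zero mean and unit variance.
    """
    a = (r_true - r_true.mean()) / r_true.std()
    b = (r_recovered - r_recovered.mean()) / r_recovered.std()
    return float(np.mean(a * b))

# Made-up reward values over 100 states, purely for illustration.
r_true = np.random.rand(100)
r_recovered = 3.0 * r_true + 0.1 * np.random.randn(100)   # noisy rescaled copy
print("correlation:", round(compare_rewards(r_true, r_recovered), 3))
```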
20
Toy Problem
21
Linear Reward Model - recovered from pastoral trajectories
23
RBF Reward Model - recovered from pastoral trajectories
24
Currently working on …
- Including time as another dimension in the state space
- Specifying a performance metric for the recovered reward function
- Cross validation
- Specifying a better/novel reward function
25
Thank You Questions / Comments
27
Weights computed by running IRL on the actual problem
29
Algorithm
For t = 1, 2, …
Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward w.
Courtesy of Pieter Abbeel
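A minimal sketch of this alternation, assuming an `irl_step` routine (the max-margin estimation on the next slide) and an `rl_step` routine (any MDP solver, e.g. the value iteration sketched earlier) are available, and that each policy is summarized by its feature expectations; the function names are placeholders.

```python
import numpy as np

def apprenticeship_loop(mu_expert, irl_step, rl_step, n_iters=20, tol=1e-3):
    """Alternate IRL (estimate w) and RL (compute optimal policy for w).

    mu_expert: (d,) expert feature expectations.
    irl_step(mu_expert, mu_policies) -> (w, margin)
    rl_step(w) -> (policy, mu_policy)   # optimal policy for reward R(s) = w^T phi(s)
    """
    # Start from an arbitrary policy, e.g. the one optimal for zero reward.
    policies, mus = [], []
    policy0, mu0 = rl_step(np.zeros_like(mu_expert))
    policies.append(policy0); mus.append(mu0)

    for t in range(1, n_iters + 1):
        w, margin = irl_step(mu_expert, np.stack(mus))   # inverse RL step
        if margin < tol:                                  # expert nearly matched
            break
        policy_t, mu_t = rl_step(w)                       # RL step
        policies.append(policy_t); mus.append(mu_t)
    return w, policies
```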
30
Algorithm: IRL step
Maximize_{γ, w : ||w||_2 ≤ 1} γ subject to V_w(π_E) ≥ V_w(π_i) + γ, for i = 1, …, t-1.
γ is the margin of the expert's performance over the performance of previously found policies.
V_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] = w^T μ(π).
μ(π) = E[Σ_t γ^t φ(s_t) | π] are the "feature expectations".
Courtesy of Pieter Abbeel
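A minimal sketch of this max-margin step using cvxpy (a convenience assumption; any QP/SOCP solver would do), taking the expert and policy feature expectations as inputs.

```python
import cvxpy as cp
import numpy as np

def irl_step(mu_expert, mu_policies):
    """Max-margin IRL step: maximize the margin m subject to
    w^T mu_expert >= w^T mu_i + m for all previous policies, with ||w||_2 <= 1.

    mu_expert: (d,) expert feature expectations.
    mu_policies: (K, d) feature expectations of previously found policies.
    """
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    margin = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    for mu_i in mu_policies:
        constraints.append(w @ mu_expert >= w @ mu_i + margin)
    problem = cp.Problem(cp.Maximize(margin), constraints)
    problem.solve()
    return w.value, float(margin.value)

# Made-up feature expectations, purely to exercise the routine.
w, m = irl_step(np.array([2.0, 1.0, 0.5]),
                np.array([[1.5, 1.2, 0.4], [1.8, 0.9, 0.6]]))
print(w, m)
```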
31
Feature Expectation Closeness and Performance
If we can find a policy π̃ such that ||μ(π_E) − μ(π̃)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s), we have
|V_w*(π_E) − V_w*(π̃)| = |w*^T (μ(π_E) − μ(π̃))| ≤ ||w*||_2 · ||μ(π_E) − μ(π̃)||_2 ≤ ε (since ||w*||_2 ≤ 1).
Courtesy of Pieter Abbeel
32
Algorithm
For i = 1, 2, …
Inverse RL step: estimate w as in the max-margin step above.
RL step (= constraint generation): compute the optimal policy π_i for the estimated reward R_w.