1
Kshitij Judah, Alan Fern, Tom Dietterich
School of EECS, Oregon State University
2
A Markov Decision Process (MDP) is a tuple $(S, A, T, R, s_0)$, where $S$ is the set of states, $A$ is the set of actions, $T(s' \mid s, a)$ is the transition function denoting the probability of transitioning to state $s'$ after taking action $a$ in state $s$, $R(s)$ is the reward function giving the reward received in state $s$, and $s_0$ is the initial state. A stationary policy $\pi$ is a mapping from states to actions. The H-horizon value of a policy $\pi$ is the expected total reward of trajectories that start at $s_0$ and follow $\pi$ for H steps.
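The definitions above can be captured in a small sketch; the MDP is treated as a sampling-based simulator and the H-horizon value is estimated by Monte Carlo rollouts. The field and function names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                             # S: set of states
    actions: List[Action]                           # A: set of actions
    sample_next: Callable[[State, Action], State]   # T: samples s' given (s, a)
    reward: Callable[[State], float]                # R: reward received in a state
    s0: State                                       # initial state

def h_horizon_value(mdp: MDP, policy: Callable[[State], Action],
                    H: int, n_rollouts: int = 100) -> float:
    """Monte Carlo estimate of the expected total reward of H-step
    trajectories that start at s0 and follow the given stationary policy."""
    total = 0.0
    for _ in range(n_rollouts):
        s = mdp.s0
        for _ in range(H):
            total += mdp.reward(s)
            s = mdp.sample_next(s, policy(s))
    return total / n_rollouts
```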
3
[Diagram: passive imitation learning. The teacher generates trajectory data, a supervised learning algorithm turns it into a classifier, and the classifier becomes the learner's policy.]
GOAL: learn a policy whose H-horizon value is not much worse than that of the teacher's policy.
4
[Diagram: the same teacher-to-classifier pipeline.]
DRAWBACK: Generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents.
5
[Diagram: active imitation learning. Using the simulator and the current training data of (s, a) pairs, the learner selects the best state s to query; the teacher responds with the correct action to take in s.]
6
[Diagram: the same query loop, but now the teacher may respond: "This is a bad state which I would never visit; I choose not to suggest any action." The query then returns Bad State(s) instead of an action.]
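The query loop in these diagrams can be sketched in a few lines. Here, select_best_state, train_classifier, and teacher are hypothetical stand-ins for the components on the slides, and a None response models the teacher declining to label a bad state.

```python
# Sketch of the state-query loop with bad-state responses (illustrative names only).
def active_imitation_loop(select_best_state, teacher, train_classifier,
                          simulator, n_queries):
    data = []            # current training data: (s, a) pairs
    bad_states = set()   # states the teacher refused to label
    policy = train_classifier(data)
    for _ in range(n_queries):
        s = select_best_state(policy, data, bad_states, simulator)
        a = teacher(s)                 # correct action in s, or None for "bad state"
        if a is None:
            bad_states.add(s)          # no label gained: the query was wasted
        else:
            data.append((s, a))
            policy = train_classifier(data)
    return policy
```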
7
[Diagram: Wargus example. The Wargus agent selects a query state using the simulator and the current (s, a) training data; the Wargus expert judges it a bad state query.]
8
[Diagram: helicopter example. The helicopter flying agent selects a query state using the simulator and the current (s, a) training data; the expert pilot judges it a bad state query.]
9
It is important to minimize bad state queries. [Diagram: the learner selects the best state to query using the simulator and the current (s, a) training data; the teacher returns the correct action.] Challenge: how to combine action uncertainty and bad-state likelihood. We provide a principled approach based on noiseless Bayesian active learning.
10
It is possible to simulate passive imitation learning via state queries. [Diagram: N teacher trajectories fed to a supervised learning algorithm can be reproduced using state queries alone.]
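One way to read this slide: an H-step teacher trajectory can be reconstructed with H state queries by always querying the state that the teacher's previous answers lead to. A minimal sketch, assuming a sample_next simulator interface and a teacher that returns an action:

```python
# Sketch: reconstruct one H-step teacher trajectory using only state queries.
def simulate_passive_trajectory(s0, sample_next, teacher, H):
    data = []
    s = s0
    for _ in range(H):
        a = teacher(s)            # query the teacher's action in the current state
        data.append((s, a))
        s = sample_next(s, a)     # follow the teacher's action in the simulator
    return data                   # same (s, a) pairs a passive demonstration would give
```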
11
[Diagram: the query loop viewed as i.i.d. active learning with a single known target distribution over states; the learner selects the best state to query and the teacher returns the correct action.]
12
[Same diagram.] Applying i.i.d. active learning uniformly over the entire state space leads to poor performance: queries land in uncertain states that are also bad.
13
Goal: identify the true hypothesis with as few tests as possible. We employ a form of generalized binary search (GBS) in this work. [Diagram: hypotheses, tests, and test outcomes.]
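A toy sketch of noiseless generalized binary search over an explicit finite hypothesis set. Hypotheses are represented as functions from tests to outcomes and perform_test queries the true outcome; these representations are illustrative, not the paper's.

```python
# Sketch of generalized binary search: repeatedly pick the test whose
# worst-case outcome eliminates the most remaining hypotheses.
def generalized_binary_search(hypotheses, tests, perform_test):
    candidates = set(hypotheses)
    while len(candidates) > 1:
        def worst_case_size(t):
            counts = {}
            for h in candidates:
                counts[h(t)] = counts.get(h(t), 0) + 1
            return max(counts.values())        # size after the least informative outcome
        t = min(tests, key=worst_case_size)    # most even split of the version space
        outcome = perform_test(t)              # observe the true (noiseless) outcome
        candidates = {h for h in candidates if h(t) == outcome}
    return candidates.pop()                    # the identified true hypothesis
```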
17
GOAL: Determine the path corresponding to the target policy by performing tests (state queries) whose outcomes are the teacher's responses.
27
[Diagram: the labeled data of (s, a) pairs is bootstrapped into K samples (Bootstrap Sample 1 ... Bootstrap Sample K); a supervised learner is trained on each sample; each resulting policy is run through the simulator to produce a path (Path 1 ... Path K); generalized binary search is then applied over these paths.]
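The committee construction on this slide can be sketched as follows; train_classifier and rollout are assumed stand-ins for the supervised learner and the simulator rollout, and K and H are the committee size and horizon.

```python
import random

# Sketch: build the committee of sampled policies/paths for generalized binary search.
def build_committee(data, train_classifier, rollout, K, H, s0):
    paths = []
    for _ in range(K):
        boot = random.choices(data, k=len(data))   # bootstrap sample of (s, a) pairs
        policy = train_classifier(boot)            # one supervised learner per sample
        paths.append(rollout(policy, s0, H))       # H-step path through the simulator
    return paths                                   # committee of candidate target paths
```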
28
The GBS selection criterion can be rewritten in a form that combines: the posterior probability mass of hypotheses that go through s, i.e. the posterior probability of the target policy visiting s; the entropy of the multinomial distribution over actions at s, i.e. the uncertainty over action choices at s; and a small bonus term.
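Under the committee approximation, both factors can be estimated from the K sampled paths and policies. The sketch below takes the product of the two factors and omits the small bonus term; this product form, and the names paths and policies, are our reading of the slide, not the paper's exact formula.

```python
from collections import Counter
from math import log

# Sketch: score a candidate query state s using the committee.
def query_score(s, paths, policies):
    # Posterior probability of the target policy visiting s, estimated by
    # the fraction of committee paths that pass through s.
    visit_prob = sum(1 for path in paths if s in path) / len(paths)
    # Uncertainty over action choices at s: entropy of the committee's votes.
    votes = Counter(policy(s) for policy in policies)
    total = sum(votes.values())
    entropy = -sum((c / total) * log(c / total) for c in votes.values())
    return visit_prob * entropy    # small bonus term from the slide omitted
```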
29
We use a Pegasus-style determinization approach to handle stochastic MDPs (Ng & Jordan, UAI 2000). Details are in the paper.
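As a rough illustration of the idea (not the paper's implementation), the simulator's random numbers can be drawn once up front so that every rollout reuses the same randomness, making the stochastic MDP behave deterministically for the learner. The interface sample_next_with_noise, which accepts an explicit noise value, is an assumption.

```python
import random

# Sketch of a Pegasus-style determinization: fix the simulator's randomness in advance.
def make_deterministic_simulator(sample_next_with_noise, H, seed=0):
    rng = random.Random(seed)
    fixed_noise = [rng.random() for _ in range(H)]          # one fixed draw per step
    def sample_next(s, a, t):
        return sample_next_with_noise(s, a, fixed_noise[t])  # same noise on every rollout
    return sample_next
```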
30
We performed experiments in two domains: a grid world with pits, and cart pole. We compared IQBC against the following baselines (a sketch of the CBA query rule follows this list):
Random: selects states to query uniformly at random.
Standard QBC (SQBC): treats all states as i.i.d. and applies standard uncertainty-based QBC.
Passive imitation learning (Passive): simulates standard passive imitation learning.
Confidence-based autonomy (CBA) (Chernova & Veloso, JAIR 2009): executes the policy until the confidence falls below an automatically adjusted threshold, at which point the learner queries the teacher for an action, updates its policy and threshold, and resumes execution. Performance can be quite sensitive to the threshold adjustment.
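A minimal sketch of the CBA query rule described above; confidence, update_threshold, and train_classifier are hypothetical stand-ins for CBA's components, including the automatic threshold adjustment the slide notes can be hard to tune.

```python
# Sketch: one CBA step, executed while the learner follows its own policy.
def cba_step(s, policy, confidence, threshold, teacher, data,
             train_classifier, update_threshold):
    if confidence(policy, s) < threshold:
        a = teacher(s)                          # query the teacher for the action
        data.append((s, a))
        policy = train_classifier(data)         # update the policy ...
        threshold = update_threshold(threshold, confidence(policy, s))  # ... and threshold
    else:
        a = policy(s)                           # act autonomously and resume execution
    return a, policy, threshold
```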
31
[Figure: the grid world domain, showing pit cells and the goal cell.]
32
Generous: always responds with an action.
Strict: declares states far away from the states visited by the teacher to be bad states.
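The two simulated teachers can be sketched as below; teacher_policy, teacher_states, dist, and radius are illustrative stand-ins for the teacher's policy, the states it visits, and the "far away" test used in the experiments.

```python
# Sketch of the two simulated teachers.
def generous_teacher(s, teacher_policy):
    return teacher_policy(s)                   # always responds with an action

def strict_teacher(s, teacher_policy, teacher_states, dist, radius):
    if min(dist(s, t) for t in teacher_states) > radius:
        return None                            # declared a bad state: no action given
    return teacher_policy(s)
```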
33
“Generous” teacher
34
“Strict” teacher
36
Cart pole: the state consists of the cart position, cart velocity, pole angle, and pole angular velocity; the actions are left or right. Bounds on the cart position and the pole angle are [-2.4, 2.4] and [-90°, 90°] respectively.
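The bounds above imply a simple failure check for an episode; a minimal sketch:

```python
# Sketch: cart-pole failure test using the bounds from the slide.
def cart_pole_failed(cart_position, pole_angle_deg):
    return abs(cart_position) > 2.4 or abs(pole_angle_deg) > 90.0
```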
37
“Generous” teacher
38
“Strict” teacher
39
Develop policy optimization algorithms that exploit bad-state responses and other forms of teacher input.
Query short sequences of states rather than single states.
Consider more application areas, such as structured prediction and other RL domains.
Conduct studies with human teachers.