1
Using MDP Characteristics to Guide Exploration in Reinforcement Learning
Paper: Bohdana Ratitch & Doina Precup
Presenter: Michael Simon
Some pictures/formulas gratefully borrowed from slides by Ratitch
2
MDP Terminology
– Transition probabilities P^a_{s,s'}
– Expected rewards R^a_{s,s'}
– Return
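For reference, the return is the usual discounted sum of future rewards (standard definition; the discount factor γ is not spelled out on the slide):

```latex
% Return from time t with discount factor \gamma
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```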
3
Reinforcement Learning
– Learning only from environmental rewards: achieve the best payoff possible
– Must balance exploitation with exploration: exploration can take large amounts of time
– The structure of the problem/model can, in theory, assist exploration: but with what, in our MDP case?
4
Goals/Approach
– Find MDP characteristics...
... that affect performance...
... and test on them.
– Use MDP characteristics...
... to tune parameters.
... to select algorithms.
... to create a strategy.
5
Back to RL
– Undirected
Sufficient exploration
Simple, but can be exponential
– Directed
Extra computation/storage, but possibly polynomial
Often uses aspects of the model to its advantage
6
RL Methods - Undirected
– ε-greedy exploration
With probability 1-ε, exploit: take the current greedy (best-guess) action
With probability ε, explore: select an action uniformly at random
– Boltzmann distribution (softmax over action values)
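As a reference, a minimal sketch of the two undirected rules (standard ε-greedy and Boltzmann/softmax selection; the function names and the temperature parameter are illustrative, not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action for the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q/temperature);
    high temperature -> near-uniform exploration, low -> near-greedy."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```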
7
RL Methods - Directed
– Maximize the value estimate plus an exploration bonus
– Different options for the bonus:
Counter-based (favor least frequently tried actions)
Recency-based (favor least recently tried actions)
Error-based (favor actions whose value estimates change the most)
Interval Estimation (favor actions with the highest variance in samples)
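A minimal sketch of the directed scheme with a counter-based bonus (the bonus shape and the weight `kappa` are illustrative assumptions; the other variants substitute a recency-, error-, or interval-based term for the bonus):

```python
import numpy as np

def directed_action(q_values, visit_counts, kappa=1.0):
    """Counter-based directed exploration: add a bonus that is larger for
    less frequently tried actions, then act greedily on Q + bonus."""
    counts = np.asarray(visit_counts, dtype=float)
    bonus = kappa / np.sqrt(counts + 1.0)   # illustrative bonus shape
    return int(np.argmax(np.asarray(q_values) + bonus))
```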
8
Properties of MDPs State Transition Entropy Controllability Variance of Immediate Rewards Risk Factor Transition Distance Transition Variability
9
State Transition Entropy (STE)
– Measures the stochasticity of state transitions: high STE = good for exploration
– Also indicates the potential variance of samples: high STE = more samples needed
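STE for a state-action pair is the Shannon entropy of its next-state distribution; a small sketch (the normalization by log|S| is an assumption, included only to keep values in [0, 1]):

```python
import numpy as np

def state_transition_entropy(p_next, normalize=True):
    """Entropy of the next-state distribution P(. | s, a).
    p_next: 1-D array of transition probabilities summing to 1."""
    p = np.asarray(p_next, dtype=float)
    p = p[p > 0.0]                             # treat 0 * log 0 as 0
    h = -np.sum(p * np.log(p))
    if normalize and len(p_next) > 1:          # assumption: scale to [0, 1]
        h /= np.log(len(p_next))
    return h
```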
10
Controllability - Calculation
– Measures how much the environment's response (next-state distribution) differs across actions
– Can also be thought of as normalized information gain
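One way to compute such a quantity, sketched under the assumption that controllability of a state is the information gain about the next state from knowing the action, normalized to [0, 1]; the paper's exact definition may differ:

```python
import numpy as np

def controllability(P_s):
    """P_s: array of shape (num_actions, num_states); row a is P(. | s, a).
    Information gain = entropy of the action-averaged next-state
    distribution minus the average per-action entropy."""
    P_s = np.asarray(P_s, dtype=float)

    def entropy(p):
        p = p[p > 0.0]
        return -np.sum(p * np.log(p))

    mixed = P_s.mean(axis=0)                   # next-state dist., averaged over actions
    gain = entropy(mixed) - np.mean([entropy(row) for row in P_s])
    norm = np.log(P_s.shape[0])                # assumption: normalize by log|A|
    return gain / norm if norm > 0 else gain
```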
11
Controllability - Usage
– High controllability = control over where actions lead
Different actions lead to different parts of the state space
More variance = more sampling needed
– Prefer actions leading to controllable states: actions with high Forward Controllability (FC)
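Forward controllability of a state-action pair can then be read as the expected controllability of the successor state; a sketch assuming exactly that, i.e. FC(s,a) = Σ_{s'} P^a_{s,s'} C(s'), which may differ in detail from the paper:

```python
import numpy as np

def forward_controllability(p_next, C):
    """Expected controllability of the next state.
    p_next: P(. | s, a) over states; C: controllability value per state."""
    return float(np.dot(np.asarray(p_next, dtype=float),
                        np.asarray(C, dtype=float)))
```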
12
Proposed Method - Undirected
– Explore with a state-dependent probability that combines STE and FC (formula on the original slide)
– For experiments: K1, K2 ∈ {0, 1}; the remaining scale parameter set to 1; ε ∈ {0.1, 0.4, 0.9}
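The exact exploration-probability formula is an image on the original slide; as a rough illustration only, one plausible way to modulate a base rate ε by STE and FC with weights K1 and K2 (not necessarily the authors' formula):

```python
def exploration_probability(eps, ste, fc, K1=1.0, K2=1.0):
    """Illustrative only: scale the base exploration rate eps
    by (normalized) STE and forward controllability."""
    scale = 1.0 + K1 * ste + K2 * fc
    return min(1.0, eps * scale)
```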
13
Proposed Method - Directed
– Pick the action maximizing the value estimate plus a weighted exploration term (formula on the original slide)
– For experiments: K0 ∈ {1, 10, 50}; K1, K2 ∈ {0, 1}; K3 = 1; the bonus term is recency-based
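Again only an illustration of the general shape (greedy over Q plus a weighted combination of a recency bonus, STE, and FC; the actual weighting in the paper is on the slide image):

```python
import numpy as np

def directed_choice(q, recency_bonus, ste, fc, K0=1.0, K1=1.0, K2=1.0, K3=1.0):
    """Illustrative only: pick the action maximizing the value estimate
    plus a weighted exploration term built from a recency-based bonus,
    state transition entropy, and forward controllability."""
    score = np.asarray(q, dtype=float) + K0 * (K3 * np.asarray(recency_bonus)
                                               + K1 * np.asarray(ste)
                                               + K2 * np.asarray(fc))
    return int(np.argmax(score))
```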
14
Experiments
– Random MDPs
225 states, 3 actions
Branching factor 1-20
Transition probabilities / rewards uniform on [0, 1]
0.01 chance of termination
– Divided into 4 groups
Low STE vs. high STE
High variation (test) vs. low variation (control)
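A sketch of a generator matching the stated setup (225 states, 3 actions, branching factor drawn from 1-20, uniform [0, 1] transition weights and rewards, with the 0.01 termination chance modeled as leftover probability mass); structural details beyond the slide are assumptions:

```python
import numpy as np

def random_mdp(num_states=225, num_actions=3, max_branch=20,
               p_terminate=0.01, rng=np.random.default_rng(0)):
    """Generate transition probabilities P[a, s, s'] and rewards R[a, s, s']
    for a random MDP of the kind described in the experiments."""
    P = np.zeros((num_actions, num_states, num_states))
    R = np.zeros((num_actions, num_states, num_states))
    for a in range(num_actions):
        for s in range(num_states):
            branch = rng.integers(1, max_branch + 1)           # 1-20 successors
            succ = rng.choice(num_states, size=branch, replace=False)
            w = rng.uniform(0.0, 1.0, size=branch)
            P[a, s, succ] = w / w.sum() * (1.0 - p_terminate)  # reserve mass for termination
            R[a, s, succ] = rng.uniform(0.0, 1.0, size=branch)
    return P, R
```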
15
Experiments Continued
Performance measures
– Return estimates
Run the greedy policy from 50 different states, 30 trials per state, average the returns, normalize
– Penalty measure
R_max = upper limit on the return of the optimal policy
R_t = normalized greedy return after trial t
T = number of trials
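The penalty formula itself is an image on the slide; a plausible reading consistent with the listed symbols (an assumption, not a quote from the paper) is the average shortfall of the normalized greedy return from its upper limit:

```latex
% Assumed form of the penalty measure (the exact formula is on the slide)
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \bigl( R_{\max} - R_t \bigr)
```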
16
Graphs, Glorious Graphs
17
More Graphs, Glorious Graphs
18
Discussion
– Significant results obtained when using STE and FC: results correspond with the presence of STE
– Values can be calculated prior to learning, but this requires model knowledge
– Rug sweeping and more judgements: SARSA
19
It’s over!