
1 Using MDP Characteristics to Guide Exploration in Reinforcement Learning Paper: Bohdana Ratitch & Doina Precup Presenter: Michael Simon Some pictures/formulas gratefully borrowed from slides by Ratitch

2 MDP Terminology Transition probabilities - $P^{a}_{s,s'}$ Expected reward - $R^{a}_{s,s'}$ Return
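The return formula itself is not reproduced in this transcript; a standard discounted-return definition consistent with the notation above (an assumption about the slide's exact formula) is:

```latex
% Discounted return from time t, with discount factor gamma
% (standard definition; the slide's own formula is not reproduced here)
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1
```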

3 Reinforcement Learning Learning only from environmental rewards –Achieve the best payoff possible Must balance exploitation with exploration –Exploration can take large amounts of time The structure of the problem/model can assist exploration, in theory –But which characteristics of an MDP can we use?

4 Goals/Approach Find MDP Characteristics... –... that affect performance... –... and test on them. Use MDP Characteristics... –... to tune parameters. –... to select algorithms. –... to create strategy.

5 Back to RL Undirected –Sufficient Exploration –Simple, but can be exponential Directed –Extra Computation/Storage, but possibly polynomial –Often uses aspects of the model to its advantage

6 RL Methods - Undirected ε-greedy exploration –Probability 1-ε of exploiting your current best greedy estimate –Explore with probability ε, selecting an action uniformly at random Boltzmann distribution
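A minimal sketch of the two undirected strategies, assuming a tabular Q array indexed by [state, action] (names and defaults are illustrative, not from the paper):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action,
    otherwise exploit the current greedy estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[state]))            # exploit

def boltzmann(Q, state, tau, rng=np.random.default_rng()):
    """Sample actions with probability proportional to exp(Q/tau):
    low tau -> nearly greedy, high tau -> nearly uniform."""
    prefs = Q[state] / tau
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
```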

7 RL Methods - Directed Maximize value plus an exploration bonus δ –Different options for δ Counter-based (prefer the least frequently taken action) Recency-based (prefer the least recently taken action) Error-based (prefer actions with the most variable value estimates) Interval Estimation (prefer actions with the highest variance in samples)
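As an illustration of a directed rule, a counter-based bonus could look like the sketch below; the 1/sqrt(n+1) bonus shape and the weight are assumptions, not the paper's formula:

```python
import numpy as np

def counter_based_action(Q, counts, state, bonus_weight=1.0):
    """Directed exploration: add a bonus that is largest for the least
    frequently taken actions in this state, then act greedily on the sum.
    The bonus form is an illustrative assumption."""
    bonus = bonus_weight / np.sqrt(counts[state] + 1.0)
    return int(np.argmax(Q[state] + bonus))
```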

8 Properties of MDPs State Transition Entropy Controllability Variance of Immediate Rewards Risk Factor Transition Distance Transition Variability

9 State Transition Entropy Stochasticity of state transitions –High STE = the environment's own randomness aids exploration Indicates the potential variance of samples –High STE = more samples needed for reliable estimates
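The STE formula on the slide is not reproduced here; the entropy of the next-state distribution, consistent with the description above (any normalization is omitted), is:

```latex
% State transition entropy of a state-action pair: entropy of the
% next-state distribution P^a_{s,.}
STE(s,a) = -\sum_{s'} P^{a}_{s,s'} \, \log P^{a}_{s,s'}
```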

10 Controllability - Calculation How much the environment's response differs across actions –Can also be thought of as normalized information gain
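The controllability formula is likewise not reproduced; one way to write a "normalized information gain" over actions (an assumption about the exact form used in the paper) is:

```latex
% Assumed sketch: information gained about the next state by knowing the
% action, normalized by the maximum possible entropy.  \bar{P}_s is the
% next-state distribution at s averaged over actions, H(.) is entropy.
C(s) \approx \frac{H\!\big(\bar{P}_s\big)
  - \frac{1}{|A|}\sum_{a} H\!\big(P^{a}_{s,\cdot}\big)}{\log |S|}
```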

11 Controllability - Usage High controllability –Control over actions: different actions lead to different parts of the state space –More variance = more sampling needed Take actions leading to controllable states –Actions with high Forward Controllability (FC)

12 Proposed Method Undirected –Explore with probability ε (see the sketch below) –For experiments: K1, K2 ∈ {0,1}, τ = 1, ε ∈ {0.1, 0.4, 0.9}
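The exploratory-action distribution itself is not shown in the transcript; a plausible form consistent with the parameters above (K1 weighting STE, K2 weighting forward controllability, Boltzmann temperature τ), offered as an assumption rather than the paper's exact rule, is:

```python
import numpy as np

def undirected_explore(Q, ste, fc, state, epsilon, k1, k2, tau,
                       rng=np.random.default_rng()):
    """Assumed form: with probability 1-epsilon act greedily; otherwise
    sample an exploratory action from a Boltzmann distribution over a
    weighted sum of per-action MDP characteristics (STE, FC).
    A sketch, not the paper's exact formula."""
    if rng.random() >= epsilon:
        return int(np.argmax(Q[state]))               # exploit
    prefs = (k1 * ste[state] + k2 * fc[state]) / tau  # per-action scores
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))       # explore
```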

13 Proposed Method Directed –Pick the action maximizing a weighted evaluation (see the sketch below) –For experiments: K0 ∈ {1, 10, 50}, K1, K2 ∈ {0,1}, K3 = 1; δ is recency-based
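The maximized expression is also not shown; a plausible linear combination consistent with the weights above (K0 on the value estimate, K1 on STE, K2 on FC, K3 on the recency bonus δ), again an assumption rather than the paper's exact formula, is:

```python
import numpy as np

def directed_action(Q, ste, fc, last_taken, state, t, k0, k1, k2, k3):
    """Assumed evaluation: weighted sum of the value estimate, the MDP
    characteristics, and a recency-based bonus (time since each action
    was last taken in this state).  A sketch, not the paper's rule."""
    recency = t - last_taken[state]   # per-action elapsed time
    score = (k0 * Q[state] + k1 * ste[state]
             + k2 * fc[state] + k3 * recency)
    return int(np.argmax(score))
```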

14 Experiments Random MDPs –225 states, 3 actions, branching factor 1-20, transition probabilities and rewards uniform on [0,1], 0.01 chance of termination –Divided into 4 groups: low STE vs. high STE, high variation (test) vs. low variation (control)
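A sketch of how such random MDPs could be generated (the normalization of transition probabilities and the handling of termination here are assumptions about details the slide leaves out):

```python
import numpy as np

def random_mdp(n_states=225, n_actions=3, max_branch=20, p_term=0.01,
               rng=np.random.default_rng(0)):
    """Each (s, a) pair transitions to 1-20 successor states with
    probabilities and rewards drawn uniformly from [0, 1], plus a 0.01
    chance of termination (folded in by rescaling the transition row)."""
    P = np.zeros((n_states, n_actions, n_states))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            branch = rng.integers(1, max_branch + 1)
            succ = rng.choice(n_states, size=branch, replace=False)
            w = rng.uniform(0.0, 1.0, size=branch)
            P[s, a, succ] = (1.0 - p_term) * w / w.sum()
    return P, R, p_term
```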

15 Experiments Continued Performance Measures –Return estimates: run the greedy policy from 50 different states, 30 trials per state, average the returns, normalize –Penalty measure: R_max = upper limit on the return of the optimal policy, R_t = normalized greedy return after trial t, T = number of trials
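The penalty formula itself is not reproduced in the transcript; a form consistent with the quantities defined above (an assumed reconstruction, not necessarily the paper's exact expression) is:

```latex
% Average shortfall of the normalized greedy return from the optimal
% upper limit over T trials (assumed form)
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \big( R_{\max} - R_t \big)
```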

16 Graphs, Glorious Graphs

17 More Graphs, Glorious Graphs

18 Discussion Significant results obtained when using STE and FC –Results correspond with the presence of STE Characteristic values can be calculated prior to learning –Requires model knowledge Rug sweeping and further judgement calls –SARSA

19 It’s over!

