1
Using MDP Characteristics to Guide Exploration in Reinforcement Learning
Paper: Bohdana Ratitch & Doina Precup
Presenter: Michael Simon
Some pictures/formulas gratefully borrowed from slides by Ratitch
2
MDP Terminology
– Transition probabilities P^a_{s,s'}
– Expected rewards R^a_{s,s'}
– Return
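For reference, the return is the usual discounted sum of future rewards (standard definition; the discount factor γ is not spelled out on the slide):

```latex
% Return from time t with discount factor \gamma
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```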
3
Reinforcement Learning
– Learning only from environmental rewards: achieve the best payoff possible
– Must balance exploitation with exploration: exploration can take large amounts of time
– The structure of the problem/model can, in theory, assist exploration: but with what, in our MDP case?
4
Goals/Approach
– Find MDP characteristics...
... that affect performance...
... and test on them.
– Use MDP characteristics...
... to tune parameters.
... to select algorithms.
... to create a strategy.
5
Back to RL
– Undirected
Sufficient exploration
Simple, but can be exponential
– Directed
Extra computation/storage, but possibly polynomial
Often uses aspects of the model to its advantage
6
RL Methods - Undirected
– ε-greedy exploration
With probability 1-ε, exploit: take the current greedy (best-guess) action
With probability ε, explore: select an action uniformly at random
– Boltzmann distribution (softmax over action values)
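As a reference, a minimal sketch of the two undirected rules (standard ε-greedy and Boltzmann/softmax selection; the function names and the temperature parameter are illustrative, not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action for the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q/temperature);
    high temperature -> near-uniform exploration, low -> near-greedy."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```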
7
RL Methods - Directed
– Maximize the value estimate plus an exploration bonus
– Different options for the bonus:
Counter-based (favor least frequently tried actions)
Recency-based (favor least recently tried actions)
Error-based (favor actions whose value estimates change the most)
Interval Estimation (favor actions with the highest variance in samples)
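A minimal sketch of the directed scheme with a counter-based bonus (the bonus shape and the weight `kappa` are illustrative assumptions; the other variants substitute a recency-, error-, or interval-based term for the bonus):

```python
import numpy as np

def directed_action(q_values, visit_counts, kappa=1.0):
    """Counter-based directed exploration: add a bonus that is larger for
    less frequently tried actions, then act greedily on Q + bonus."""
    counts = np.asarray(visit_counts, dtype=float)
    bonus = kappa / np.sqrt(counts + 1.0)   # illustrative bonus shape
    return int(np.argmax(np.asarray(q_values) + bonus))
```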
8
Properties of MDPs State Transition Entropy Controllability Variance of Immediate Rewards Risk Factor Transition Distance Transition Variability
9
State Transition Entropy (STE)
– Measures the stochasticity of state transitions: high STE = good for exploration
– Also indicates the potential variance of samples: high STE = more samples needed
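STE for a state-action pair is the Shannon entropy of its next-state distribution; a small sketch (the normalization by log|S| is an assumption, included only to keep values in [0, 1]):

```python
import numpy as np

def state_transition_entropy(p_next, normalize=True):
    """Entropy of the next-state distribution P(. | s, a).
    p_next: 1-D array of transition probabilities summing to 1."""
    p = np.asarray(p_next, dtype=float)
    p = p[p > 0.0]                             # treat 0 * log 0 as 0
    h = -np.sum(p * np.log(p))
    if normalize and len(p_next) > 1:          # assumption: scale to [0, 1]
        h /= np.log(len(p_next))
    return h
```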
10
Controllability - Calculation
– Measures how much the environment's response (next-state distribution) differs across actions
– Can also be thought of as normalized information gain
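One way to compute such a quantity, sketched under the assumption that controllability of a state is the information gain about the next state from knowing the action, normalized to [0, 1]; the paper's exact definition may differ:

```python
import numpy as np

def controllability(P_s):
    """P_s: array of shape (num_actions, num_states); row a is P(. | s, a).
    Information gain = entropy of the action-averaged next-state
    distribution minus the average per-action entropy."""
    P_s = np.asarray(P_s, dtype=float)

    def entropy(p):
        p = p[p > 0.0]
        return -np.sum(p * np.log(p))

    mixed = P_s.mean(axis=0)                   # next-state dist., averaged over actions
    gain = entropy(mixed) - np.mean([entropy(row) for row in P_s])
    norm = np.log(P_s.shape[0])                # assumption: normalize by log|A|
    return gain / norm if norm > 0 else gain
```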
11
Controllability - Usage
– High controllability = control over where actions lead
Different actions lead to different parts of the state space
More variance = more sampling needed
– Prefer actions leading to controllable states: actions with high Forward Controllability (FC)
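Forward controllability of a state-action pair can then be read as the expected controllability of the successor state; a sketch assuming exactly that, i.e. FC(s,a) = Σ_{s'} P^a_{s,s'} C(s'), which may differ in detail from the paper:

```python
import numpy as np

def forward_controllability(p_next, C):
    """Expected controllability of the next state.
    p_next: P(. | s, a) over states; C: controllability value per state."""
    return float(np.dot(np.asarray(p_next, dtype=float),
                        np.asarray(C, dtype=float)))
```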
12
Proposed Method - Undirected
– Explore with a state-dependent probability that combines STE and FC (formula on the original slide)
– For experiments: K1, K2 ∈ {0, 1}; the remaining scale parameter set to 1; ε ∈ {0.1, 0.4, 0.9}
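The exact exploration-probability formula is an image on the original slide; as a rough illustration only, one plausible way to modulate a base rate ε by STE and FC with weights K1 and K2 (not necessarily the authors' formula):

```python
def exploration_probability(eps, ste, fc, K1=1.0, K2=1.0):
    """Illustrative only: scale the base exploration rate eps
    by (normalized) STE and forward controllability."""
    scale = 1.0 + K1 * ste + K2 * fc
    return min(1.0, eps * scale)
```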
13
Proposed Method - Directed
– Pick the action maximizing the value estimate plus a weighted exploration term (formula on the original slide)
– For experiments: K0 ∈ {1, 10, 50}; K1, K2 ∈ {0, 1}; K3 = 1; the bonus term is recency-based
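Again only an illustration of the general shape (greedy over Q plus a weighted combination of a recency bonus, STE, and FC; the actual weighting in the paper is on the slide image):

```python
import numpy as np

def directed_choice(q, recency_bonus, ste, fc, K0=1.0, K1=1.0, K2=1.0, K3=1.0):
    """Illustrative only: pick the action maximizing the value estimate
    plus a weighted exploration term built from a recency-based bonus,
    state transition entropy, and forward controllability."""
    score = np.asarray(q, dtype=float) + K0 * (K3 * np.asarray(recency_bonus)
                                               + K1 * np.asarray(ste)
                                               + K2 * np.asarray(fc))
    return int(np.argmax(score))
```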
14
Experiments
– Random MDPs
225 states, 3 actions
Branching factor 1-20
Transition probabilities / rewards uniform on [0, 1]
0.01 chance of termination
– Divided into 4 groups
Low STE vs. high STE
High variation (test) vs. low variation (control)
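A sketch of a generator matching the stated setup (225 states, 3 actions, branching factor drawn from 1-20, uniform [0, 1] transition weights and rewards, with the 0.01 termination chance modeled as leftover probability mass); structural details beyond the slide are assumptions:

```python
import numpy as np

def random_mdp(num_states=225, num_actions=3, max_branch=20,
               p_terminate=0.01, rng=np.random.default_rng(0)):
    """Generate transition probabilities P[a, s, s'] and rewards R[a, s, s']
    for a random MDP of the kind described in the experiments."""
    P = np.zeros((num_actions, num_states, num_states))
    R = np.zeros((num_actions, num_states, num_states))
    for a in range(num_actions):
        for s in range(num_states):
            branch = rng.integers(1, max_branch + 1)           # 1-20 successors
            succ = rng.choice(num_states, size=branch, replace=False)
            w = rng.uniform(0.0, 1.0, size=branch)
            P[a, s, succ] = w / w.sum() * (1.0 - p_terminate)  # reserve mass for termination
            R[a, s, succ] = rng.uniform(0.0, 1.0, size=branch)
    return P, R
```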
15
Experiments Continued
Performance measures
– Return estimates
Run the greedy policy from 50 different states, 30 trials per state, average the returns, normalize
– Penalty measure
R_max = upper limit on the return of the optimal policy
R_t = normalized greedy return after trial t
T = number of trials
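The penalty formula itself is an image on the slide; a plausible reading consistent with the listed symbols (an assumption, not a quote from the paper) is the average shortfall of the normalized greedy return from its upper limit:

```latex
% Assumed form of the penalty measure (the exact formula is on the slide)
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \bigl( R_{\max} - R_t \bigr)
```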
16
Graphs, Glorious Graphs
17
More Graphs, Glorious Graphs
18
Discussion
– Significant results obtained when using STE and FC: results correspond with the presence of STE
– Values can be calculated prior to learning, but this requires model knowledge
– Rug sweeping and more judgements: SARSA
19
It’s over!