1
Presentation by Sanjog Bhatta
Student ID: 20091143
July 28, 2009
2
Background: Challenges in Reinforcement Learning
- An issue of primary importance and much researched
- Crucial in dynamic environments
- RL agents tend to learn slowly
- Tradeoff between exploration and exploitation
- Analogous to the tradeoff between system control and system identification in optimal control
3
Question
- Do we try new actions to find out whether they yield a good reward? That is, do we do what looks best, or check whether something else is really best?
- Or do we stick to the actions we have already learned to give good rewards?
- Which action(s) are responsible for a reward? Answering this requires solving the credit assignment problem.
4
To maximize expected total reward, the agent must prefer actions that it has tried in the past and found to be effective. To discover such actions, it has to take actions that it has not taken before. A good balance between exhaustive exploration of the environment and exploitation of the learned policy is fundamental to reaching nearly optimal solutions in few learning episodes, thus enhancing learning performance (a simple policy illustrating this tradeoff is sketched below).
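The slides do not name a specific policy at this point, so the following is only an illustrative sketch of one common way to balance the two: an epsilon-greedy rule, where epsilon and the list of action-value estimates are assumed for the example.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon try a random action (exploration);
    otherwise take the action that currently looks best (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit
```

A small epsilon keeps the agent mostly exploiting what it has learned while still occasionally sampling alternatives whose value may have changed.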
5
In a Dynamic Environment
Two key issues:
- Dealing with moving obstacles
- Dealing with terrain that changes over time
The problem becomes more challenging because:
- The currently exploited solution may no longer be valid
- Solutions previously explored may have changed in value
6
Approaches
- Incorporation of a forgetting mechanism into Q-Learning
- Feature-based reinforcement learning
- Hierarchical reinforcement learning
7
Forgetting Mechanism
- Decaying forgetting term
- Removes over-dependence on a specific set of solutions (CGA)
- Exploration emphasized more than exploitation (RL)
- Avoids making use of outdated knowledge
Three concepts integrated:
- Penalty-based value function
- Action selection policy
- Forgetting mechanism
8
Penalty-Based Value Function
- The value function is maintained over the set of states rather than the set of state-action pairs.
- This is an adaptation of Q-Learning to an environment where the resultant state of a state-action pair is deterministic rather than probabilistic.
- It is necessary to store the values of individual states; this is maintained as a penalty function, which tracks the expected total cost associated with being in a given state.
9
As the agent explores, it learns the penalty associated with each state, as approximated by the value function for that state. The value function for the visited state is updated accordingly (see the sketch below).
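The slide's exact update rule is not reproduced in this transcript, so the following is only a minimal sketch of a plausible penalty update, assuming a standard TD-style step over per-state penalties; the table P, the step cost, and the parameters alpha and gamma are assumptions for illustration.

```python
def update_penalty(P, state, next_state, cost, alpha=0.1, gamma=0.9):
    """Move the stored penalty of the visited state toward the observed
    step cost plus the discounted penalty of the state it led to."""
    old = P.get(state, 0.0)                          # current penalty estimate for this state
    target = cost + gamma * P.get(next_state, 0.0)   # cost just incurred plus discounted successor penalty
    P[state] = old + alpha * (target - old)          # TD-style step toward the target
```

Here P is simply a dictionary mapping states to their current penalty estimates, which matches the idea of storing values per state rather than per state-action pair.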
10
Action Selection Policy
- Select an action a that minimizes the penalty.
- Greedy policy: k is chosen such that the value of P(S, k) is minimized.
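Since transitions are treated as deterministic, a greedy choice can be read as picking the action whose successor state has the lowest stored penalty. The sketch below assumes a deterministic transition model `transition(state, action)` and the same penalty table P as above; both are illustrative assumptions, not taken from the slides.

```python
def greedy_action(P, state, actions, transition):
    """Return the action whose deterministic successor state has the
    lowest stored penalty; 'transition' is an assumed model (s, a) -> s'."""
    return min(actions, key=lambda a: P.get(transition(state, a), 0.0))
```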
11
Forgetting Mechanism
- Slow decay of the state value function
- Enhances exploration
- Maintains a diversity of possible solutions
- Forgetting the penalty associated with a state previously determined to be suboptimal allows the agent to explore states that would otherwise be ignored
- Applied to the value function after each episode (a sketch follows this list)
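As a hedged sketch of what "slow decay applied after each episode" might look like in code, the following shrinks every stored penalty by an assumed decay factor; the factor 0.99 and the table P are illustrative choices, not values from the slides.

```python
def apply_forgetting(P, decay=0.99):
    """Shrink every stored penalty a little after each episode, so that
    states once judged suboptimal can become worth exploring again."""
    for s in P:
        P[s] *= decay
```

Because penalties drift back toward zero over time, knowledge about states that have not been revisited recently is gradually discounted, which is what lets the agent re-explore parts of a changing environment.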