1
Presentation by Sanjog Bhatta
Student ID: 20091143
July 28, 2009
2
Background: Challenges in Reinforcement Learning
- An issue of primary importance and much researched
- Crucial in dynamic environments
- RL agents tend to learn slowly
- Tradeoff between exploration and exploitation
- Analogous to the tradeoff between system control and system identification in optimal control
3
Question
- Do we try new actions to find out whether they yield a good reward? That is, do we do what looks best, or check whether something else is really best?
- Or do we stick to the actions we have already learned to give good rewards?
- Which action(s) are responsible for a reward? Answering this requires solving the credit assignment problem.
4
To maximize expected total reward, the agent must prefer actions that it has tried in the past and found to be effective. To discover such actions, it has to take actions that it has not taken before. A good balance between exhaustive exploration of the environment and exploitation of the learned policy is fundamental to reaching nearly optimal solutions in few learning episodes, thus enhancing learning performance (a simple policy illustrating this tradeoff is sketched below).
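The slides do not name a specific policy at this point, so the following is only an illustrative sketch of one common way to balance the two: an epsilon-greedy rule, where epsilon and the list of action-value estimates are assumed for the example.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon try a random action (exploration);
    otherwise take the action that currently looks best (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit
```

A small epsilon keeps the agent mostly exploiting what it has learned while still occasionally sampling alternatives whose value may have changed.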
5
In a Dynamic Environment
Two key issues:
- Dealing with moving obstacles
- Dealing with terrain that changes over time
The problem becomes more challenging because:
- The currently exploited solution may no longer be valid
- Solutions previously explored may have changed in value
6
Approaches
- Incorporation of a forgetting mechanism into Q-Learning
- Feature-based reinforcement learning
- Hierarchical reinforcement learning
7
Forgetting Mechanism
- Decaying forgetting term
- Removes over-dependence on a specific set of solutions (CGA)
- Exploration emphasized more than exploitation (RL)
- Avoids making use of outdated knowledge
Three concepts integrated:
- Penalty-based value function
- Action selection policy
- Forgetting mechanism
8
Penalty-Based Value Function
- The value function is maintained over the set of states rather than the set of state-action pairs.
- This is an adaptation of Q-Learning to an environment where the resultant state of a state-action pair is deterministic rather than probabilistic.
- It is necessary to store the values of individual states; this is maintained as a penalty function, which tracks the expected total cost associated with being in a given state.
9
As the agent explores, it learns the penalty associated with each state, as approximated by the value function for that state. The value function for the visited state is updated accordingly (see the sketch below).
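The slide's exact update rule is not reproduced in this transcript, so the following is only a minimal sketch of a plausible penalty update, assuming a standard TD-style step over per-state penalties; the table P, the step cost, and the parameters alpha and gamma are assumptions for illustration.

```python
def update_penalty(P, state, next_state, cost, alpha=0.1, gamma=0.9):
    """Move the stored penalty of the visited state toward the observed
    step cost plus the discounted penalty of the state it led to."""
    old = P.get(state, 0.0)                          # current penalty estimate for this state
    target = cost + gamma * P.get(next_state, 0.0)   # cost just incurred plus discounted successor penalty
    P[state] = old + alpha * (target - old)          # TD-style step toward the target
```

Here P is simply a dictionary mapping states to their current penalty estimates, which matches the idea of storing values per state rather than per state-action pair.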
10
Action Selection Policy
- Select an action a that minimizes the penalty.
- Greedy policy: k is chosen such that the value of P(S, k) is minimized.
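Since transitions are treated as deterministic, a greedy choice can be read as picking the action whose successor state has the lowest stored penalty. The sketch below assumes a deterministic transition model `transition(state, action)` and the same penalty table P as above; both are illustrative assumptions, not taken from the slides.

```python
def greedy_action(P, state, actions, transition):
    """Return the action whose deterministic successor state has the
    lowest stored penalty; 'transition' is an assumed model (s, a) -> s'."""
    return min(actions, key=lambda a: P.get(transition(state, a), 0.0))
```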
11
Forgetting Mechanism
- Slow decay of the state value function
- Enhances exploration
- Maintains a diversity of possible solutions
- Forgetting the penalty associated with a state previously determined to be suboptimal allows the agent to explore states that would otherwise be ignored
- Applied to the value function after each episode (a sketch follows this list)
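As a hedged sketch of what "slow decay applied after each episode" might look like in code, the following shrinks every stored penalty by an assumed decay factor; the factor 0.99 and the table P are illustrative choices, not values from the slides.

```python
def apply_forgetting(P, decay=0.99):
    """Shrink every stored penalty a little after each episode, so that
    states once judged suboptimal can become worth exploring again."""
    for s in P:
        P[s] *= decay
```

Because penalties drift back toward zero over time, knowledge about states that have not been revisited recently is gradually discounted, which is what lets the agent re-explore parts of a changing environment.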