EE368 Soft Computing
Dr. Unnikrishnan P.C., Professor, EEE
Module II Reinforcement Learning
Reinforcement Learning
Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time the ball hits a brick, the brick disappears and your score increases – you get a reward.
Reinforcement Learning…..
Suppose you want to teach a neural network to play this game. The input to the network would be screen images, and the output would be one of three actions: left, right or fire (to launch the ball). Posed this way it looks like a typical classification problem – for each game screen you have to decide whether to move left, move right or press fire. Of course, we could record game sessions with expert players and learn from those, but that is not really how we learn.
Reinforcement Learning…..
This is the task reinforcement learning tries to solve. RL lies somewhere between supervised and unsupervised learning. When you hit a brick and score a reward in the game, the reward often has little to do with the actions (paddle movements) you performed just before receiving it. All the hard work was already done when you positioned the paddle correctly and bounced the ball back.
Reinforcement Learning…..
This is called the credit assignment problem – i.e., which of the preceding actions were responsible for getting the reward and to what extent. Once you have figured out a strategy to collect a certain number of rewards, should you stick with it or experiment with something that could result in even bigger rewards? Will you be satisfied with this or do you want more?
Reinforcement Learning…..
This is called the explore-exploit dilemma – should you exploit the known working strategy or explore other, possibly better strategies? Reinforcement learning is an important model of how we (and all animals) learn. Praise from our parents, grades in school, salary at work – are all examples of rewards. Credit assignment problems and exploration-exploitation dilemmas come up every day both in business and in relationships.
Formalize an RL Problem
The most common method is to represent it as a Markov decision process. Suppose you are an agent, situated in an environment (e.g. the Breakout game). The environment is in a certain state (e.g. the location of the paddle, the location and direction of the ball, the existence of every brick, and so on). The agent can perform certain actions in the environment (e.g. move the paddle to the left or to the right). These actions sometimes result in a reward (e.g. an increase in score).
Formalize an RL Problem
Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called the policy. The environment in general is stochastic, which means that the next state may be somewhat random (e.g. when you lose a ball and launch a new one, it goes in a random direction).
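To make this loop concrete, here is a minimal Python sketch of an agent interacting with an environment for one episode; the ToyEnv class, its reset/step methods and the random policy are hypothetical stand-ins, not something from the slides.

    import random

    class ToyEnv:
        """Hypothetical environment: five positions in a row, reward at the right end."""
        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):               # action is -1 (left) or +1 (right)
            if random.random() < 0.1:         # stochastic environment: the move sometimes fails
                action = -action
            self.state = max(0, min(4, self.state + action))
            reward = 1.0 if self.state == 4 else 0.0
            done = self.state == 4
            return self.state, reward, done

    def random_policy(state):
        """The policy maps states to actions; this one simply picks at random."""
        return random.choice([-1, +1])

    env = ToyEnv()
    state, done, episode = env.reset(), False, []
    while not done:
        action = random_policy(state)                 # agent acts...
        next_state, reward, done = env.step(action)   # ...environment returns new state and reward
        episode.append((state, action, reward))       # one episode = a sequence of (s, a, r)
        state = next_state
    print(episode)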
Markov Decision Process
The set of states and actions, together with the rules for transitioning from one state to another and for getting rewards, makes up an MDP. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:
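In the usual notation, such an episode is the sequence

s_0, a_0, r_1, s_1, a_1, r_2, s_2, ..., s_{n-1}, a_{n-1}, r_n, s_n

where s_i is the state, a_i the action taken in that state, and r_{i+1} the reward received after performing that action; the episode ends with the terminal state s_n.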
Rewards and Total Rewards
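In the usual notation, the total reward of one episode, and the total future reward from time step t onward, are

R = r_1 + r_2 + r_3 + ... + r_n
R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n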
Discounted Rewards
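Because the environment is stochastic, rewards far in the future are usually discounted by a factor γ with 0 ≤ γ ≤ 1, giving the discounted future reward

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(n−t) r_n
    = r_t + γ R_{t+1}

With γ = 0 the agent considers only immediate rewards; as γ approaches 1, rewards far in the future count almost as much as immediate ones.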
Q-learning
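In Q-learning, Q(s, a) denotes the best discounted future reward obtainable after performing action a in state s and acting optimally afterwards. In the standard formulation it obeys the Bellman equation

Q(s, a) = r + γ max_{a'} Q(s', a')

and can be learned iteratively with the Q-learning update rule, where α is the learning rate:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]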
Reinforcement Learning (RL)
RL is about how to map situations to actions so as to maximize a numerical reward signal (value function). The learner is not told which actions to take but instead must discover which actions yield the most reward by trying them. RL is essentially simulation-based dynamic programming and is primarily used to solve MDPs.
Distinguishing features of RL
Trial-and-error search and delayed reward. Model-free methods avoid the curse of modeling. RL stores the value function in the form of Q-factors. An MDP may have millions of states; RL therefore uses function approximation methods, such as neural networks, regression and interpolation, which need only a small number of scalars to approximate the Q-factors of these states, and thereby avoids the curse of dimensionality.
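A minimal sketch of this idea in Python: a linear approximation over a handful of features stands in for a huge table of Q-factors. The feature function and the update shown here are hypothetical illustrations, not the specific method on the slides.

    import numpy as np

    def features(state, action):
        """Hypothetical feature vector phi(s, a); four scalars stand in for millions of states."""
        return np.array([1.0, state, action, state * action])

    def q_value(weights, state, action):
        """Approximate Q-factor as a dot product: Q(s, a) ~ w . phi(s, a)."""
        return float(weights @ features(state, action))

    def q_update(weights, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
        """One Q-learning style update applied to the weights instead of to a table."""
        best_next = max(q_value(weights, next_state, a) for a in actions)
        td_error = reward + gamma * best_next - q_value(weights, state, action)
        return weights + alpha * td_error * features(state, action)

    weights = np.zeros(4)                    # only four scalars are stored
    weights = q_update(weights, state=2.0, action=1.0, reward=1.0,
                       next_state=3.0, actions=[-1.0, 1.0])
    print(weights)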
Comparison of DP, RL & Heuristic Algorithms
Method        Level of Modeling Effort    Solution Quality
DP            High                        –
RL            Medium                      –
Heuristics    Low                         –
Reinforcement Learning …..
In RL, the agent receives at each time step k a representation of the state s_k ∈ S, where S is the set of possible states of the environment, and performs an action a_k ∈ A(s_k), where A(s_k) is the set of actions available in state s_k. This action changes the state of the environment to a new state s_{k+1}. The environment responds by giving the agent a reward r_{k+1}, according to a reward function. The reward is based on the quality of the action a_k taken in state s_k. The selection of an action is done according to a stochastic policy π. An optimal policy, denoted π*, is one that corresponds to the greatest return received by the agent. The environment is defined as everything outside of the agent.
Reinforcement Learning …..
It receives an action from the agent and outputs a new state and reward to the agent. The agent is the learner and decision maker in the RL framework. It can influence the state of the environment by performing an action, and it receives a reward based on the state transition. The goal of the agent is to maximize the return, which is a function of the rewards, over a trajectory generated by a policy. The reward the agent receives from the environment reflects how good or how bad a particular action in a particular state was.
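A stochastic policy assigns a probability to each available action in a state. The sketch below is a purely illustrative softmax policy; the action names and preference values are hypothetical.

    import math
    import random

    def softmax(preferences):
        """Turn arbitrary action preferences into a probability distribution."""
        exps = [math.exp(p) for p in preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def sample_action(actions, preferences):
        """Sample one action according to the stochastic policy pi(a | s)."""
        return random.choices(actions, weights=softmax(preferences), k=1)[0]

    actions = ["left", "right", "fire"]
    prefs_in_state_s = [0.2, 1.5, -0.3]      # hypothetical preferences for one state
    print(sample_action(actions, prefs_in_state_s))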
Maze Example
Maze Example: Policy
Maze Example: Value Function
Maze Example: Model
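The maze slides are figure-based. As a stand-in, here is a small hypothetical gridworld in Python showing the three ingredients side by side: a model (transitions and rewards), a fixed policy, and the value function obtained by evaluating that policy.

    # A corridor of states 0..3; state 3 is the goal, every step costs -1.
    states = [0, 1, 2, 3]
    goal = 3

    def model(state, action):
        """The model: deterministic next state and reward for a (state, action) pair."""
        next_state = min(goal, max(0, state + action))
        reward = 0.0 if state == goal else -1.0
        return next_state, reward

    policy = {0: +1, 1: +1, 2: +1, 3: 0}     # fixed policy: always move right

    # Iterative policy evaluation: V(s) <- r + gamma * V(s') under the policy.
    gamma = 1.0
    V = {s: 0.0 for s in states}
    for _ in range(50):
        for s in states:
            if s == goal:
                continue
            s_next, r = model(s, policy[s])
            V[s] = r + gamma * V[s_next]

    print(V)    # V[0] = -3.0: three steps to the goal, each costing -1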
Temporal Difference Learning
In conventional learning methods, the error, which is the difference between the actual and the desired output, is used as the means of learning. In 1988 Richard S. Sutton proposed the novel idea of learning to predict by the method of temporal differences. He proposed an incremental algorithm that uses past experience with an incompletely known system to predict its future behavior.
TD Learning ….
It uses the difference between temporally successive predictions as a means of learning. Temporal Difference (TD) Learning methods can be used to estimate value functions.
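A minimal TD(0) sketch in Python for estimating state values from observed transitions; the states, rewards and step sizes below are hypothetical, chosen only to show the update.

    # TD(0): nudge V(s) toward the temporally successive prediction r + gamma * V(s').
    def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
        td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
        V[state] = V.get(state, 0.0) + alpha * td_error
        return V

    V = {}                                               # value estimates, start at zero
    transitions = [("A", 0.0, "B"), ("B", 1.0, "C"), ("A", 0.0, "B")]   # hypothetical data
    for s, r, s_next in transitions:
        V = td0_update(V, s, r, s_next)
    print(V)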