Staffan Järn
An intelligent learning algorithm that does not require the presence of a teacher. The algorithm is given a reward (a reinforcement) for good actions. It tries to figure out the best action to take in a given state, without knowing the final optimal solution. The actions are based on rewards and penalties.
Robot control; elevator scheduling (search for patterns); telecommunications (finding networks); games (chess, backgammon); financial trading.
Gridworld (4 x 12). The walker (agent) is supposed to find the shortest or safest way to the finish without falling into the cliff (blue area). Falling into the cliff gives 100 penalty points, and the walker has to start over again.
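The gridworld described above can be sketched in a few lines of Python. This is only an illustration: the slide does not give coordinates, so the start in the bottom-left corner, the finish in the bottom-right corner, and the cliff along the bottom row between them are assumptions borrowed from the cliff-walking example in Sutton and Barto. The names `valid_actions` and `step` are hypothetical.

```python
ROWS, COLS = 4, 12
START = (3, 0)                               # assumed: bottom-left corner
GOAL = (3, 11)                               # assumed: bottom-right corner
CLIFF = {(3, c) for c in range(1, 11)}       # assumed: bottom row between them
STEP_COST = 1
CLIFF_COST = 100

# Actions: 0 = up, 1 = right, 2 = down, 3 = left
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def valid_actions(state):
    """Actions that keep the walker inside the grid (no walking into the wall)."""
    r, c = state
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= r + dr < ROWS and 0 <= c + dc < COLS]

def step(state, action):
    """Apply a valid action and return (next_state, cost, finished)."""
    r, c = state
    dr, dc = MOVES[action]
    nxt = (r + dr, c + dc)
    if nxt in CLIFF:
        # Falling into the cliff costs 100 and sends the walker back to the start.
        return START, CLIFF_COST, False
    return nxt, STEP_COST, nxt == GOAL
```

For example, `step(START, 1)` walks right from the start straight into the cliff, so it returns the start state again with a cost of 100.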
Q-learning algorithm. The walker keeps a matrix, called the Q-matrix, of size 48 x 4: 48 states (the 12 x 4 gridworld) by 4 actions (the four directions). Each entry holds a "price" for taking a certain action in a certain state. The matrix is initialized randomly in the beginning. At each step the walker has two options: take the optimal action, i.e. the one with the smallest Q-value, or explore the gridworld by taking a random step (it cannot walk into the wall). The Q-value is updated according to the update equation every time the walker takes an action.
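A minimal sketch of the Q-matrix and the choice between the two options might look as follows. The slide does not say how the walker decides between exploiting and exploring, so the exploration probability `epsilon` is an assumption, as are the function names and the idea of numbering the states as row * 12 + column.

```python
import random

N_STATES, N_ACTIONS = 48, 4    # 12 x 4 gridworld, four directions

def init_q(rng):
    """Random 48 x 4 Q-matrix; each entry is the 'price' of an action in a state."""
    return [[rng.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def choose_action(q, state, allowed, epsilon, rng):
    """Pick an action among `allowed` (the moves that do not hit the wall).

    With probability epsilon (an assumed parameter) take a random step,
    otherwise take the action with the smallest Q-value.
    """
    if rng.random() < epsilon:
        return rng.choice(allowed)                    # explore
    return min(allowed, key=lambda a: q[state][a])    # exploit: lowest price

rng = random.Random(0)
q = init_q(rng)
action = choose_action(q, state=36, allowed=[0, 1], epsilon=0.1, rng=rng)
```

Because the Q-values here are prices rather than rewards, the greedy choice is a minimum, not the maximum used in most textbook formulations.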
Q-learning update. The new value in the Q-matrix for the previous state and the previously taken action is the old value multiplied by (1-α), plus α multiplied by the sum of the cost of taking the step (usually 1, cliff 100) and γ multiplied by the Q-value of the best (optimal) action the walker can take from the new state: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,c + \gamma \min_{a'} Q(s',a')\,]$. Here s and a are the previous state and action, s' is the new state, c is the cost of the step, α (alpha) is the learning factor, and γ (gamma) is the reward factor.
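The update described above can be written as a short function. This is a sketch under assumptions: the function name, the argument names, and the choice to take the minimum only over the actions allowed in the new state are illustrative and are not taken from the Cliffwalker program.

```python
def q_learning_update(q, s, a, cost, s_next, next_allowed, alpha, gamma):
    """Update q[s][a] in place and return the new value.

    alpha is the learning factor, gamma the reward factor, and cost the
    price of the step just taken (usually 1, 100 for the cliff).
    """
    # Price of the best (cheapest) action available from the new state.
    best_next = min(q[s_next][b] for b in next_allowed)
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (cost + gamma * best_next)
    return q[s][a]

# Worked example with made-up numbers: two states, four actions each.
q = [[10.0, 10.0, 10.0, 10.0],
     [4.0, 2.0, 6.0, 8.0]]
new_value = q_learning_update(q, s=0, a=1, cost=1, s_next=1,
                              next_allowed=[0, 1, 2, 3], alpha=0.5, gamma=0.9)
# 0.5 * 10 + 0.5 * (1 + 0.9 * 2) is approximately 6.4
print(round(new_value, 6))
```

With α = 0.5 the old price of 10 and the new estimate of 2.8 are weighted equally, which is why the entry moves halfway towards the estimate.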
SARSA algorithm. Another way of updating the Q-matrix: the update is based not on the next optimal move but on the next move actually taken. This means it takes into account the risk of falling into the cliff, and it will eventually arrive at a longer but safer path.
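As a rough sketch, the SARSA update differs from the Q-learning update only in which Q-value of the new state it uses. The function and argument names below are hypothetical, and the numbers in the example are made up for illustration.

```python
def sarsa_update(q, s, a, cost, s_next, a_next, alpha, gamma):
    """Update q[s][a] in place using the action a_next actually taken next."""
    # Uses the price of the actual next move, not the cheapest one.
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (cost + gamma * q[s_next][a_next])
    return q[s][a]

# Worked example: the walker's next actual move is the expensive action 3.
q = [[10.0, 10.0, 10.0, 10.0],
     [4.0, 2.0, 6.0, 8.0]]
new_value = sarsa_update(q, s=0, a=1, cost=1, s_next=1, a_next=3,
                         alpha=0.5, gamma=0.9)
# 0.5 * 10 + 0.5 * (1 + 0.9 * 8) is approximately 9.1
print(round(new_value, 6))
```

An update based on the next optimal move would have used the smallest value in the second row (2.0) and arrived at roughly 6.4. The higher SARSA value shows how an expensive exploratory step, such as a random step next to the cliff, makes the states leading up to it look more costly.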
Fig 1) Q-learning, the 100th walk. Fig 2) Q-learning, optimal solution. Fig 3) SARSA, the 100th walk. Fig 4) SARSA, optimal solution.
Random steps over the cliff
Jonas Waller, Reinforcement Learning (pdf), 2005. Jonas Waller, Cliffwalker program, 2005. Sutton and Barto, Reinforcement Learning: An Introduction.