Questions?

Setting a reward function, with and without subgoals
Difference between agent and environment
AI for games, Roomba
Markov Property – Broken Vision System
Exercise 3.5: agent doesn't improve
Constructing state representation

Unfortunately, interval estimation methods are problematic in practice because of the complexity of the statistical methods used to estimate the confidence intervals. There is also a well-known algorithm for computing the Bayes optimal way to balance exploration and exploitation. This method is computationally intractable when done exactly, but there may be efficient ways to approximate it.

Bandit Algorithms
Goal: minimize regret.
Regret is defined in terms of average reward: the average reward of the best action is $\mu^*$, and that of any other action $j$ is $\mu_j$. There are $K$ total actions, and $T_j(n)$ is the number of times action $j$ has been tried during our first $n$ pulls.
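
Written out with the symbols just defined, the expected regret after $n$ pulls is usually stated as
$$\text{Regret}(n) \;=\; \mu^*\, n \;-\; \sum_{j=1}^{K} \mu_j\, \mathbb{E}\big[T_j(n)\big].$$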

UCB1
Calculate confidence intervals (leveraging the Chernoff-Hoeffding bound).
For each action $j$, record its average reward $\bar{x}_j$ and the number of times we've tried it, $n_j$; $n$ is the total number of actions we've tried so far.
Try the action that maximizes $\bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}$.
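
A minimal sketch of this selection rule in Python; the function and variable names (ucb1_select, counts, sums) are illustrative rather than from the lecture:

import math
import random

def ucb1_select(counts, sums):
    """Pick an arm index using the UCB1 rule.
    counts[j]: number of times arm j has been pulled (n_j)
    sums[j]:   total reward collected from arm j (so sums[j]/counts[j] = x_bar_j)
    """
    n = sum(counts)
    # Play each arm once before applying the formula (avoids division by zero).
    for j, c in enumerate(counts):
        if c == 0:
            return j
    # Otherwise pick the arm with the largest upper confidence bound.
    def index(j):
        mean = sums[j] / counts[j]
        bonus = math.sqrt(2.0 * math.log(n) / counts[j])
        return mean + bonus
    return max(range(len(counts)), key=index)

# Toy usage: two Bernoulli arms with success probabilities 0.3 and 0.7.
probs = [0.3, 0.7]
counts, sums = [0, 0], [0.0, 0.0]
for _ in range(1000):
    j = ucb1_select(counts, sums)
    reward = 1.0 if random.random() < probs[j] else 0.0
    counts[j] += 1
    sums[j] += reward
print(counts)  # the better arm (index 1) should be pulled far more often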

UCB1 regret
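
For reference, the finite-time bound from Auer, Cesa-Bianchi, and Fischer (2002), writing $\Delta_j = \mu^* - \mu_j$, is usually quoted as
$$\mathbb{E}\big[\text{regret after } n \text{ plays}\big] \;\le\; 8 \sum_{j:\,\mu_j < \mu^*} \frac{\ln n}{\Delta_j} \;+\; \Big(1 + \frac{\pi^2}{3}\Big)\sum_{j=1}^{K} \Delta_j.$$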

UCB1-Tuned
Can compute a sample variance for each action, $\sigma_j$.
Easy hack for non-stationary environments?
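
For reference, the UCB1-Tuned index from the same Auer et al. (2002) paper replaces the exploration bonus with a variance-aware term; it is usually written as
$$\bar{x}_j + \sqrt{\frac{\ln n}{n_j}\,\min\!\left\{\frac{1}{4},\; V_j(n_j)\right\}}, \qquad V_j(n_j) = \Big(\frac{1}{n_j}\sum_{\tau=1}^{n_j} x_{j,\tau}^2\Big) - \bar{x}_j^{\,2} + \sqrt{\frac{2 \ln n}{n_j}},$$
where $1/4$ is an upper bound on the variance of any $[0,1]$-valued reward.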

Adversarial Bandits
Optimism can be naïve.
Reward vectors must be fixed in advance of the algorithm running, but payoffs can depend adversarially on the algorithm the player decides to use. Ex: if the player chooses the strategy of always picking the first action, the adversary can simply make that the worst possible action to choose. Rewards cannot, however, depend on the random choices the player makes during the game.

Why can’t the adversary just make all the payoffs zero? (or negative!)

In this event the player won’t get any reward, but he can emotionally and psychologically accept this fate. If he never stood a chance to get any reward in the first place, why should he feel bad about the inevitable result? What a truly cruel adversary wants is, at the end of the game, to show the player what he could have won, and have it far exceed what he actually won. In this way the player feels regret for not using a more sensible strategy, and likely returns to the casino to lose more money. The trick that the player has up his sleeve is precisely the randomness in his choice of actions, and he can use its objectivity to partially overcome even the nastiest of adversaries.

Exp3: Exponential-weight algorithm for Exploration and Exploitation
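
A rough sketch of Exp3 in Python, assuming rewards in [0, 1]; the parameter values and the toy reward function below are illustrative, not from the lecture:

import math
import random

def exp3(K, rewards, gamma=0.1, T=1000):
    """Exp3 with uniform-exploration mixing; rewards(t, arm) must lie in [0, 1].
    Returns the total reward collected over T rounds."""
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        wsum = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = rewards(t, arm)                 # only the chosen arm's reward is observed
        total += x
        xhat = x / probs[arm]               # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * xhat / K)
        m = max(weights)                    # rescale to avoid overflow;
        weights = [w / m for w in weights]  # probabilities are unaffected by scaling
    return total

# Toy adversary: arm 1 is usually good, arm 0 is always mediocre.
print(exp3(K=2, rewards=lambda t, a: 1.0 if a == 1 and t % 3 else 0.2))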

k-Meteorologists Problem (Diuk, Li, and Leffler, ICML-09)
Imagine that you just moved to a new town that has multiple (k) radio and TV stations. Each morning, you tune in to one of the stations to find out what the weather will be like. Which of the k meteorologists making predictions every morning is the most trustworthy? To decide on the best meteorologist, suppose that each morning for the first M days you tune in to all k stations and write down the probability that each meteorologist assigns to the chance of rain; then, every evening, you write down a 1 if it rained and a 0 if it didn't. Can this data be used to determine who is the best meteorologist?
Related to expert algorithm selection.
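
One simple way to use such data, sketched below with squared (Brier) loss, is to score each forecaster's stated rain probabilities against the observed 0/1 outcomes and keep the forecaster with the lowest average loss. This scoring rule is an illustrative choice, not necessarily the criterion used in the paper:

def brier_scores(predictions, outcomes):
    """predictions[j][d]: probability of rain that meteorologist j gave on day d.
    outcomes[d]: 1 if it rained on day d, else 0.
    Returns the mean squared error of each meteorologist (lower is better)."""
    return [sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(outcomes)
            for preds in predictions]

preds = [[0.9, 0.8, 0.2],   # meteorologist 0 tracks the weather
         [0.5, 0.5, 0.5]]   # meteorologist 1 always hedges
rain = [1, 1, 0]
scores = brier_scores(preds, rain)
print(scores.index(min(scores)))  # 0: the first forecaster scores better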

PAC Subset Selection in Stochastic Multi-armed Bandits (ICML-12)
Select the best subset of m arms out of n possible arms.

Sutton & Barto: Chapter 3
Defines the RL problem; solution methods come next.
What does it mean to solve an RL problem?

Discounting
Discount factor: $\gamma$
Discounted return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
The sum can run to infinity, or $\gamma$ can be 1, but not both.
What do values of $\gamma$ at 0 and at 1 mean? Is $\gamma$ pre-set or tuned?
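
As a quick check on the formula: with $\gamma = 0.9$ and a constant reward of 1 at every step, the return is the geometric series
$$G_t = \sum_{k=0}^{\infty} 0.9^k \cdot 1 = \frac{1}{1 - 0.9} = 10.$$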

Episodic vs. Continuing
Unified notation: $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$, where either $T = \infty$ or $\gamma = 1$, but not both.

Markov Property
One-step dynamics: $p(s', r \mid s, a) = \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\}$; the next state and reward depend only on the current state and action, not on the earlier history.
Why is this useful? Where would it be true?

Value Functions
Maximize the return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
State value function: $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s\,\big]$

Value Functions
State value function: $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s\,\big]$
Bellman equation for $V^{\pi}$: $V^{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma V^{\pi}(s') \,\big]$
If the policy is deterministic, the outer sum collapses to the single action $a = \pi(s)$.

Value Functions
Action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s,\, A_t = a\,\big]$

Value Functions
Action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s,\, A_t = a\,\big]$
Bellman equation for $Q^{\pi}$: $Q^{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \,\big]$
For a deterministic policy, the inner sum reduces to $Q^{\pi}(s', \pi(s'))$.

Optimal Value Functions
Bellman optimality equation for $V^*$: $V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma V^*(s') \,\big]$

Optimal Value Functions
Bellman optimality equation for $V^*$: $V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma V^*(s') \,\big]$
Bellman optimality equation for $Q^*$: $Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big]$

Next Up
How do we find V* and/or Q*?
Dynamic Programming
Monte Carlo Methods
Temporal Difference Learning

Policy Iteration
Policy Evaluation – for all states, improve the estimate of V(s) under the current policy
Policy Improvement – for all states, improve the policy π(s) by looking at next-state values
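
A minimal tabular policy-iteration sketch in Python; the tiny two-state MDP used here is made up purely for illustration:

# Tabular policy iteration on a tiny, made-up MDP.
# P[s][a] is a list of (probability, next_state, reward) transitions.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
states, actions = list(P), [0, 1]

def evaluate(policy, theta=1e-8):
    """Policy evaluation: iteratively estimate V(s) for a fixed deterministic policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def improve(V):
    """Policy improvement: greedy one-step lookahead with respect to V."""
    return {s: max(actions,
                   key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
            for s in states}

policy = {s: 0 for s in states}
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:
        break
    policy = new_policy
print(policy, V)  # expect action 1 in both states, with V roughly {0: 19, 1: 20}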

RL
Discount factor: $\gamma$
Discounted return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s,\, A_t = a\,\big]$