OBJECT FOCUSED Q-LEARNING FOR AUTONOMOUS AGENTS M. ONUR CANCI.

WHAT IS OF-Q LEARNING Object Focused Q-learning (OF-Q) is a novel reinforcement learning algorithm that can offer exponential speed-ups over classic Q-learning on domains composed of independent objects. An OF-Q agent treats the state space as a collection of objects organized into different object classes. The key component is a control policy that uses non-optimal Q-functions to estimate the risk of ignoring parts of the state space.

REINFORCEMENT LEARNING Why isn't reinforcement learning enough on its own? Because of the curse of dimensionality, the time required for convergence in high-dimensional state spaces can make the use of RL impractical. Solution? There is no easy solution to this problem, but observing human behaviour offers a useful hint.

REINFORCEMENT LEARNING Adult humans are consciously aware of only about one out of every three hundred thousand stimuli they receive, and they can hold at most 3 to 5 meaningful items or chunks in their short-term memory. Humans deal with high dimensionality simply by paying attention to a very small number of features at once, a capability known as selective attention.

OF-Q LEARNING OF-Q embraces this same approach of paying attention to a small number of objects at any moment, just as humans do. Instead of a single high-dimensional policy, OF-Q simultaneously learns a collection of low-dimensional policies along with when to apply each one, i.e., where to focus at each moment.

OF-Q LEARNING OF-Q learning shares concepts with: 1. OO-MDP: OO-MDP solvers see the state space as a combination of objects of specific classes, but they also need a designer to define a set of domain-specific relations that describe how different objects interact with each other. 2. Modular reinforcement learning (RMDP): requires a full dynamic Bayesian network, which in turn needs either an initial policy that performs well or a good cost heuristic of the domain. OF-Q differs in that it learns the Q-values of non-optimal policies to measure the consequences of ignoring parts of the state space, and it relies only on online exploration of the domain. For each object class, we learn the Q-values for the optimal and non-optimal policies and use that information in the global control policy.

NOTATION A Markov decision process (MDP) is a tuple ⟨S, A, P, R, γ⟩: S is the state space, A is the action space, P(s′ | s, a) is the probability of transitioning to state s′ when taking action a in state s, R(s, a) is the immediate reward when taking action a in state s, and γ is the discount factor.

NOTATION - 2 A policy π defines which action an agent takes in a particular state s. V^π(s) is the state-value of s when following policy π, i.e., the expected sum of discounted rewards that the agent obtains when following policy π from state s. Q^π(s, a) is the discounted reward received when choosing action a in state s and then following policy π.

NOTATION - 3 π* is the optimal policy, maximizing the value of each state. V*(s) and Q*(s, a) are then the optimal state values and Q-values. Q-values can also be defined with respect to a given policy that is not necessarily optimal.
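For reference, the standard definitions behind this notation can be written out as follows (a textbook formulation, not taken from the original slides):

V^{\pi}(s) = \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\Big|\; s_0 = s \Big]
Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \ \text{for all } s, \qquad
Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a')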

OF-Q LEARNING'S TARGET OF-Q is designed to solve episodic MDPs with the following properties: The state space S is defined by a variable number of independent objects. These objects are organized into classes of objects that behave alike. The agent is itself an object of a specific class, constrained to be instantiated exactly once in each state; other classes can have anywhere from zero to many instances in any particular state. Each object provides its own reward signal and the global reward is the sum of all object rewards. Rewards from objects can be positive or negative.
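A minimal sketch of this state and reward structure (class and field names such as GameObject are illustrative assumptions, not from the slides):

# Sketch: a state composed of independent objects, each with its own class,
# low-dimensional state and reward; the global reward is the sum over objects.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GameObject:
    cls: str                 # object class, e.g. "agent", "enemy", "bomb"
    state: Tuple[int, ...]   # low-dimensional object state, e.g. relative position
    reward: float = 0.0      # reward signal emitted by this object

def global_reward(objects: List[GameObject]) -> float:
    # The global reward is the sum of all per-object rewards (positive or negative).
    return sum(o.reward for o in objects)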

C is the set of object classes in the domain. For every object o in the current state, OF-Q observes a sample (s_o, a, r_o, s'_o) and updates the Q-value estimate of class o.class with the standard Q-learning rule, backing up the maximum Q-value of the next object state. Using the same sample, the Q-value estimate for the random policy of class o.class is updated with the mean over next actions instead of the maximum.
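A minimal tabular sketch of these two updates (following the description above; the table layout, constants and action names are illustrative assumptions, not the authors' code):

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
ACTIONS = ("left", "right", "stay", "fire")   # illustrative action set

q_opt = defaultdict(float)    # q_opt[(cls, s_o, a)]  : per-class optimal-policy estimates
q_rand = defaultdict(float)   # q_rand[(cls, s_o, a)] : per-class random-policy estimates

def update_class(cls, s_o, a, r_o, s_next):
    # Standard Q-learning backup for the class of object o (max over next actions).
    best_next = max(q_opt[(cls, s_next, a2)] for a2 in ACTIONS)
    q_opt[(cls, s_o, a)] += ALPHA * (r_o + GAMMA * best_next - q_opt[(cls, s_o, a)])
    # Same sample, but the random-policy estimate backs up the MEAN over next actions.
    mean_next = sum(q_rand[(cls, s_next, a2)] for a2 in ACTIONS) / len(ACTIONS)
    q_rand[(cls, s_o, a)] += ALPHA * (r_o + GAMMA * mean_next - q_rand[(cls, s_o, a)])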

OVERVIEW OF OF-Q

Control Policy The control policy is simple. First decide A, the set of actions that are safe to take; then, among the safe actions, take the one with the highest Q-value over all objects.
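Continuing the sketch above, a control policy along these lines could look as follows (the exact safety test, comparing each object's random-policy Q-value against its class threshold, is an assumption based on the OF-Q paper rather than the slide text):

def choose_action(objects, thresholds):
    """objects: iterable of (cls, s_o) pairs in the current state;
    thresholds: dict mapping each class to its current safety threshold."""
    # 1. A, the set of safe actions: those whose risk estimate stays above the
    #    threshold for every object in the state.
    safe = [a for a in ACTIONS
            if all(q_rand[(cls, s_o, a)] >= thresholds[cls] for cls, s_o in objects)]
    if not safe:                 # if nothing is deemed safe, fall back to all actions
        safe = list(ACTIONS)
    # 2. Among the safe actions, take the one with the highest Q-value over all objects.
    return max(safe, key=lambda a: max(q_opt[(cls, s_o, a)] for cls, s_o in objects))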

OVERVIEW OF OF-Q (THRESHOLD OPERATIONS) Threshold initialization: the threshold for each class is initialized with Qmin, the worst Q-value of the domain; thresholds are then adjusted to filter out poor actions that result in low rewards. ARBITRATION Modular RL approaches use two simple arbitration policies: 1. Winner takes all: the module with the highest Q-value decides the next action. 2. Greatest mass: the action with the highest sum of Q-values across all modules is taken.
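For comparison, the two arbitration rules can be sketched as follows (continuing the same toy tables; purely illustrative):

def winner_takes_all(objects):
    # The single module (object) with the highest Q-value dictates the next action.
    cls, s_o = max(objects, key=lambda obj: max(q_opt[(obj[0], obj[1], a)] for a in ACTIONS))
    return max(ACTIONS, key=lambda a: q_opt[(cls, s_o, a)])

def greatest_mass(objects):
    # The action with the highest sum of Q-values across all modules is taken.
    return max(ACTIONS, key=lambda a: sum(q_opt[(cls, s_o, a)] for cls, s_o in objects))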

TEST DOMAIN The experimental domain for testing OF-Q is the Space Invaders game.

ILLUSION OF CONTROL Winner takes all: the problem with the winner-takes-all approach is that it may take an action that is very positive for one object but fatal for the overall reward. In the Space Invaders domain, this control policy would be completely blind to the bombs that the enemies drop, because there will always be an enemy to kill that offers a positive Q-value, while bomb Q-values are always negative.

ILLUSION OF CONTROL Greatest mass: it does not make sense to sum Q-values from different policies, because Q-values from different modules are defined with respect to different policies, and in subsequent steps we will not be able to follow several policies at once. In the two-reward example on this slide, greatest mass would choose the lower state, expecting a reward of 10, while the optimal action is to go to the upper state.

OF-Q ARBITRATION For the pessimal Q-values, both bombs are just as dangerous, because each of them can possibly hit the ship. Random-policy Q-values, in contrast, identify the closest bomb as the bigger threat.
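The toy calculation below (not from the slides; a deliberately simplified one-dimensional bomb model) illustrates the point: under a worst-case (pessimal) backup every bomb that can still reach the ship looks equally fatal, while under a random-policy backup the nearer bomb receives a clearly lower value.

def bomb_value(dx, h, backup, gamma=1.0):
    """Value of a single falling bomb for the ship.
    dx: bomb column minus ship column; h: steps until the bomb reaches the ship's row;
    backup: aggregation over the ship's next move (min = pessimal, mean = random policy)."""
    if h == 0:
        return -1.0 if dx == 0 else 0.0   # hit only if the bomb lands on the ship
    moves = (-1, 0, +1)                   # ship moves left, stays, or moves right
    return gamma * backup([bomb_value(dx - m, h - 1, backup, gamma) for m in moves])

mean = lambda xs: sum(xs) / len(xs)

# A bomb two steps away vs. one six steps away, both directly overhead:
print(bomb_value(0, 2, min), bomb_value(0, 6, min))    # pessimal: both -1.0
print(bomb_value(0, 2, mean), bomb_value(0, 6, mean))  # random: roughly -0.33 vs -0.19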

OF-Q ARBITRATION

OF-Q LEARNING (SPACE INVADERS)

OF-Q LEARNING (NORMANDY) Normandy: the agent starts in a random cell in the bottom row and must collect the two rewards randomly placed at the top, avoiding the cannon fire.

OF-Q LEARNING (NORMANDY)

END Thanks