
Top-level learning: pass selection using TPOT-RL

DT receiver choice function:
- The decision tree (DT) is trained off-line in an artificial situation.
- The DT is used inside a heuristic, hand-coded function that:
  - limits the potential receivers to those that are at least as close to the opponent's goal as the passer, and
  - always passes to the potential receiver with the highest confidence of success (the maximum DT confidence over (passer, receiver)).

Requirements in "reality":
- The best pass may be to a receiver farther away from the goal than the passer.
- The receiver most likely to successfully receive the pass may not be the one that will subsequently act most favorably for the team.

(Figure: a backward-pass situation)

Pass selection is a team behavior:
- Learn how to act strategically as part of a team.
- Requires understanding the long-term effects of local decisions, given the behaviors and abilities of teammates and opponents.
- Measured by the team's long-term success in a real game -> must be trained on-line against an opponent.

ML algorithm characteristics needed for pass selection:
- On-line
- Capable of dealing with a large state space despite limited training
- Capable of learning based on long-term, delayed reward
- Capable of dealing with shifting concepts
- Works in a team-partitioned scenario
- Capable of dealing with opaque transitions

TPOT-RL succeeds by:
- Partitioning the value function among multiple agents
- Training agents simultaneously with a gradually decreasing exploration rate
- Using action-dependent features to aggressively generalize the state space
- Gathering long-term, discounted reward directly from the environment

TPOT-RL learns a policy mapping S -> A via three components:
- State generalization
- Value function learning
- Action selection

State generalization I:
- Map the state space to a feature vector: f : S -> V
- Use an action-dependent feature function: e : S x A -> U
- Partition the state space among the agents: P : S -> M

State generalization II:
- |M| >= m, where m is the number of agents in the team
- A = {a_0, ..., a_{n-1}}
- f(s) ∈ V = U^|A| x M
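
A minimal Python sketch of how this state generalization composes f(s): one action-dependent feature value per action, plus the agent's partition, matching V = U^|A| x M above. The function and parameter names are hypothetical; `e` and `P` stand for the feature and partition functions defined on the previous slide.

```python
# Sketch (not from the source) of f(s) = (e(s, a_0), ..., e(s, a_{n-1}), P(s)).
from typing import Callable, Hashable, Sequence, Tuple

def generalize_state(
    s: Hashable,
    actions: Sequence[Hashable],
    e: Callable[[Hashable, Hashable], Hashable],  # e : S x A -> U
    P: Callable[[Hashable], Hashable],            # P : S -> M
) -> Tuple:
    """Map a raw state s to its coarse feature vector v = f(s) in U^|A| x M."""
    return tuple(e(s, a) for a in actions) + (P(s),)
```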

Value function learning I:
- Value function Q(f(s), a_i), with Q : V x A -> ℝ
- Q depends on e(s, a_i) and is independent of e(s, a_j) for all j ≠ i
- The Q-table therefore has |U| * |M| * |A| entries
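
A sketch of the compact table this implies: because Q(f(s), a_i) depends only on e(s, a_i) and the partition, each partition needs just |U| * |A| entries. The nested-dict layout and the `make_q_table` name are illustrative, not from the source.

```python
# Sketch of the TPOT-RL value table: |U| * |A| entries per partition,
# |U| * |M| * |A| entries across the team (rather than |V| * |A|).

def make_q_table(U, M, A):
    """Return Q[m][(u, a)] = 0.0 for every partition m in M, feature value u in U, action a in A."""
    return {m: {(u, a): 0.0 for u in U for a in A} for m in M}
```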

Value function learning II:
- Let v = f(s). After acting, update Q(v, a) <- Q(v, a) + α * (r - Q(v, a))
- r is derived from observable environmental characteristics
- Reward function R : S^t_lim -> ℝ, with range [-Q_max, Q_max]
- Each agent keeps track of the action a_i it took and the feature vector v at that time
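
The update rule as a small helper, assuming the symbol missing from the slide is the learning rate α (its value, 0.02, appears on a later slide). It operates on one partition's table as built in the sketch above; names are hypothetical.

```python
# Sketch of Q(v, a) <- Q(v, a) + alpha * (r - Q(v, a)), keyed by the remembered
# feature value u = e(s, a) and the action a taken at that time.

def update_q(q_partition, u, a, r, alpha):
    """Move the stored estimate for (u, a) toward the observed long-term reward r."""
    q_partition[(u, a)] += alpha * (r - q_partition[(u, a)])
```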

Action selection:
- Exploration vs. exploitation
- Reduce the number of free variables with an action filter W ⊆ U: if e(s, a) ∉ W, then a should not be a potential action in s
- B(s) = {a ∈ A | e(s, a) ∈ W}
- What happens when B(s) = {} (possible whenever W ⊂ U)?
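
A sketch of the filter B(s) described above; the function name is hypothetical. As the slide notes, B(s) can be empty, in which case a hand-coded fallback behavior is needed.

```python
# Sketch of the action filter: only actions whose action-dependent feature value
# falls in W remain candidates.

def filter_actions(s, actions, e, W):
    """B(s) = {a in A | e(s, a) in W}; may be empty."""
    return [a for a in actions if e(s, a) in W]
```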

TPOT-RL applied to simulated robotic soccer:
- 8 possible actions in A (see the action-space figure below)
- Extends a definition from Section 6
- The input to layer L3 is the DT learned in layer L2, which is used to define e

(Figure: the action space)

State generalization using a learned feature I:
- M = the team's set of positions (|M| = 11)
- P(s) = the player's current position
- Define e using the DT (C = 0.734)
- W = {Success}

State generalization using a learned feature II:
- |U| = 2
- V = U^8 x {PlayerPositions}
- |V| = |U|^|A| * |M| = 2^8 * 11
- Total number of Q-values: |U| * |M| * |A| = 2 * 11 * 8
- With action filtering (W), each agent learns only |W| * |A| = 8 Q-values
- About 10 training examples per 10-minute game
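
The counts on this slide follow directly from the definitions; a short snippet verifying the arithmetic.

```python
# Worked check of the soccer instantiation's counts.
U_size, A_size, M_size, W_size = 2, 8, 11, 1   # |U|, |A|, |M|, |W| with W = {Success}

assert U_size ** A_size * M_size == 2816       # |V| = |U|^|A| * |M| = 2^8 * 11
assert U_size * M_size * A_size == 176         # total Q-values = |U| * |M| * |A|
assert W_size * A_size == 8                    # Q-values each agent learns after filtering
```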

Value function learning via intermediate reinforcement I:
- R_g: if a goal is scored at time t (t <= t_lim), r = Q_max / t
- R_i: note the time t and the ball's x-coordinate x_t when the action is taken
- Three conditions fix the reward:
  1. The ball goes out of bounds at time t + t_0 (t_0 < t_lim)
  2. The ball returns to the agent at time t + t_r (t_r < t_lim)
  3. The ball is still in bounds at time t + t_lim

R_i, case 1:
- The reward r is based on a value r_0
- t_lim = 30 seconds (300 simulator cycles)
- Q_max = 100, with an additional scaling constant of 10

(Figure: the reward function)

R_i, cases 2 and 3:
- r is based on the average x-position of the ball
- x_og = x-coordinate of the opponent's goal
- x_lg = x-coordinate of the learner's goal
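
A structural sketch of the intermediate reinforcement R_i. The three-way case split, Q_max = 100, t_lim = 300 cycles, and the [-Q_max, Q_max] range follow the slides; everything else is an assumption: the scaling inside each branch is illustrative only (the source defines it via the reward-function figure, which is not reproduced in this transcript), and the default goal x-coordinates (±52.5) are assumed from the standard soccer-server field.

```python
# Structural sketch of R_i; branch scalings are illustrative assumptions, not the
# exact formulas from the source.

Q_MAX = 100.0   # from the slides
T_LIM = 300     # 30 seconds = 300 simulator cycles

def intermediate_reward(outcome, r0=0.0, x_avg=0.0, x_t=0.0, x_og=52.5, x_lg=-52.5):
    """outcome: 'out_of_bounds', 'returned_to_agent', or 'time_limit'.

    x_t is the ball's x-position when the action was taken, x_avg its average
    x-position afterwards; x_og / x_lg are the goal x-coordinates (assumed values).
    """
    if outcome == "out_of_bounds":
        raw = r0  # case 1: reward derived from a figure-based value r_0
    else:
        # cases 2 and 3: reward grows as the ball's average position moves toward
        # the opponent goal (illustrative scaling only)
        raw = Q_MAX * (x_avg - x_t) / abs(x_og - x_lg)
    return max(-Q_MAX, min(Q_MAX, raw))
```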

Value function learning via intermediate reinforcement II:
- After taking a_i and receiving r, update Q:
  Q(e(s, a_i), a_i) <- (1 - α) * Q(e(s, a_i), a_i) + α * r
- α = 0.02

Action selection for multiagent training:
- Multiple agents are learning concurrently -> the domain is non-stationary
- To deal with this:
  - Each agent stays in the same state partition throughout training
  - The exploration rate is very high at first, then gradually decreases

State partitioning:
- Distribute training into |M| partitions, each with a lookup table of size |A| * |U|
- After training, each agent can be given the trained policy for all partitions

Exploration rate:
- Early exploitation runs the risk of ignoring the best possible actions
- When in state s, choose:
  - with probability p, an action with the highest Q-value (an a_i such that for all j, Q(f(s), a_i) >= Q(f(s), a_j))
  - with probability (1 - p), a random action
- p increases gradually from 0 to 0.99
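
A sketch of this selection rule in Python: exploit (argmax over Q) with probability p, explore (uniform random) otherwise. The names are hypothetical; it assumes the filtered candidate set B(s) is non-empty, since the empty case needs a hand-coded fallback as noted earlier.

```python
# Sketch of probabilistic exploit/explore action selection over the filtered candidates.
import random

def select_action(q_partition, feats, candidates, p, rng=random):
    """feats[a] = e(s, a) for each candidate action a in B(s)."""
    if rng.random() < p:
        return max(candidates, key=lambda a: q_partition[(feats[a], a)])
    return rng.choice(candidates)
```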

Results I:
- Agents start out acting randomly with empty Q-tables: for all v ∈ V and a ∈ A, Q(v, a) = 0
- The probability of acting randomly decreases linearly over periods of 40 games: to 0.5 in game 40, to 0.1 in game 80, and to 0.01 in game 120
- The learning agents use R_i
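
A sketch of that schedule as piecewise-linear interpolation between the listed breakpoints. The 1.0 starting value is an assumption (the slide only says the agents begin by acting randomly with empty tables).

```python
# Sketch of the exploration schedule: probability of a random action per game number.

BREAKPOINTS = [(0, 1.0), (40, 0.5), (80, 0.1), (120, 0.01)]  # (game, P(random action))

def random_action_probability(game: int) -> float:
    for (g0, p0), (g1, p1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if game <= g1:
            return p0 + (p1 - p0) * (game - g0) / (g1 - g0)
    return BREAKPOINTS[-1][1]
```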

Results II

Results II statistics:
- 10-minute games, with |U| = 1
- Each agent receives 1490 action-reinforcement pairs -> reinforcement about 9.3 times per game
- Each action is tried only about once per game

Results III

Results IV

Results IV statistics:
- Action predicted to succeed vs. action selected: the 3 of 8 "attack" actions (37.5% of the options) were selected 6437 / 9967 = 64.6% of the time
- Action filtering: 39.6% of the action options were filtered out, leaving action opportunities with B(s) ≠ {}

Results V

Domain characteristics suited to TPOT-RL:
- There are multiple agents organized in a team.
- There are opaque state transitions.
- There are too many states and/or not enough training examples for traditional RL.
- The target concept is non-stationary.
- There is long-range reward available.
- There are action-dependent features available.

Examples of such domains:
- Simulated robotic soccer
- Network packet routing
- Information networks
- Distributed logistics