Reinforcement Learning and Markov Decision Processes: A Quick Introduction
Hector Munoz-Avila, Stephen Lee-Urban
www.cse.lehigh.edu/~munoz/InSyTe.

Outline
- Introduction
  - Adaptive Game AI
  - Domination games in Unreal Tournament©
  - Reinforcement Learning
- Adaptive Game AI with Reinforcement Learning
  - RETALIATE – architecture and algorithm
- Empirical Evaluation
- Final Remarks – Main Lessons

Introduction Adaptive Game AI, Unreal Tournament, Reinforcement Learning

Adaptive AI in Games

                             | Without (shipped) Learning      | With Learning
                             | Non-Stochastic | Stochastic     | Offline    | Online
Symbolic (FOL, etc.)         | Scripts        | HTN Planning   | Trained VS | Decision Tree
Sub-Symbolic (weights, etc.) | Stored NNs     | Genetic Alg.   | RL offline | RL online

In this class: using Reinforcement Learning to accomplish online learning of Game AI for team-based First-Person Shooters.
HTNbots: we presented this before (Lee-Urban et al., ICAPS).

Adaptive Game AI and Learning
Learning – Motivation
- Combinatorial explosion of possible situations
  - Tactics (e.g., the competing team's tactics)
  - Game worlds (e.g., the map where the game is played)
  - Game modes (e.g., domination, capture the flag)
- Little time for development
Learning – the "Cons"
- Difficult to control and predict the Game AI
- Difficult to test

Unreal Tournament© (UT)
- Online FPS developed by Epic Games Inc.
- Six gameplay modes, including team deathmatch and domination games
- Gamebots: a client-server architecture for controlling bots, started by the U.S.C. Information Sciences Institute (ISI)

UT Domination Games
- A number of fixed domination locations
- Ownership: a location belongs to the team of the last player to step into it
- Scoring: a team point is awarded for every five seconds a location remains controlled
- Winning: the first team to reach a pre-determined score (50) wins
(figure: top-down view of a domination map)
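
As a rough illustration of the scoring rule above, here is a minimal Python sketch. The five-second tick and the 50-point limit come from the slide; the data structures and the `advance_time` callback are assumptions made for the example.

```python
SCORE_LIMIT = 50     # first team to reach this score wins (from the slide)
TICK_SECONDS = 5     # a point is awarded for every 5 seconds of control

def run_scoring(owners, scores, advance_time):
    """owners: dict location -> owning team (or None); scores: dict team -> int.
    advance_time() simulates 5 seconds of play and updates `owners`."""
    while max(scores.values()) < SCORE_LIMIT:
        advance_time()
        for loc, team in owners.items():
            if team is not None:           # every controlled location scores a point
                scores[team] += 1
    return max(scores, key=scores.get)     # the winning team
```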

Reinforcement Learning

Some Introductory RL Videos   

Reinforcement Learning
- Agents learn policies through rewards and punishments
- Policy – determines what action to take from a given state (or situation)
- The agent's goal is to maximize returns (example)
- Tabular techniques: we maintain a "Q-table"
  - Q-table: State × Action → value
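
As a concrete illustration of the tabular idea, a Q-table can be kept as a dictionary keyed by (state, action), with a greedy policy reading the best entry. This is a hypothetical sketch, not the RETALIATE code; the default value of 0.5 anticipates the initialization slide later on.

```python
from collections import defaultdict

# Q-table: (state, action) -> value, as described on the slide.
Q = defaultdict(lambda: 0.5)

def best_action(state, actions):
    """Greedy policy: pick the action with the highest Q-value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])
```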

The DOM Game
(figure: map showing domination points, walls, and spawn points)
Let's write on the blackboard: a policy for this map and a potential Q-table

Example of a Q-Table
(figure: Q-table with states as rows and actions as columns; annotations mark a "good" action, a "bad" action, and the best action identified so far for state "EFE", in which the enemy controls 2 DOM points)

Reinforcement Learning Problem
(figure: Q-table of states and actions)
How can we identify, for every state, which is the BEST action to take over the long run?

Let Us Model the Problem of Finding the Best Build Order for a Zerg Rush as a Reinforcement Learning Problem

Adaptive Game AI with RL RETALIATE (Reinforced Tactic Learning in Agent-Team Environments)

The RETALIATE Team
- Controls two or more UT bots
- Commands bots to execute actions through the GameBots API
- The UT server provides sensory (state and event) information about the UT world and controls all gameplay
- Gamebots acts as middleware between the UT server and the Game AI

The RETALIATE Algorithm

Initialization
Game model:
- n is the number of domination points
- A state is (Owner_1, Owner_2, …, Owner_n), where each Owner_i is one of Team 1, Team 2, …, None
Actions:
- m is the number of bots in the team
- An action is (goto_1, goto_2, …, goto_m), where each goto_j is one of loc_1, loc_2, …, loc_n
For all states s and for all actions a: Q[s, a] ← 0.5
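
A minimal sketch of this initialization in Python, assuming n = 3 domination points and m = 3 bots. The owner values and location names follow the slide; the variable names and data structures are illustrative only.

```python
from itertools import product

OWNERS = ["Team1", "Team2", "None"]     # possible owners of a domination point
LOCATIONS = ["loc1", "loc2", "loc3"]    # n = 3 domination points
N_BOTS = 3                              # m = 3 bots in the team

# A state assigns an owner to each domination point; an action sends each bot to a location.
states = list(product(OWNERS, repeat=len(LOCATIONS)))
actions = list(product(LOCATIONS, repeat=N_BOTS))

# Q[s, a] <- 0.5 for all states and actions, as on the slide.
Q = {(s, a): 0.5 for s in states for a in actions}
```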

Rewards and Utilities
- U(s) = F(s) − E(s), where F(s) is the number of friendly locations and E(s) is the number of enemy-controlled locations
- R = U(s′) − U(s)
- Standard Q-learning ([Sutton & Barto, 1998]):
  Q(s, a) ← Q(s, a) + α (R + γ max_a′ Q(s′, a′) − Q(s, a))
- The "step-size" parameter α was set to 0.2
- The discount-rate parameter γ was set close to 0.9
- Thus, the most recent state–reward pairs are considered more important than earlier state–reward pairs
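
Putting the utility, reward, and update rule together, a hedged Python sketch follows (the constants α = 0.2 and γ = 0.9 are from the slide; the function and variable names are assumptions, and `Q` is the (state, action) dictionary from the initialization sketch).

```python
ALPHA, GAMMA = 0.2, 0.9

def utility(state, my_team):
    """U(s) = F(s) - E(s): friendly minus enemy-controlled locations."""
    friendly = sum(1 for owner in state if owner == my_team)
    enemy = sum(1 for owner in state if owner not in (my_team, "None"))
    return friendly - enemy

def q_update(Q, s, a, s_next, actions, my_team):
    """One standard Q-learning step with R = U(s') - U(s)."""
    R = utility(s_next, my_team) - utility(s, my_team)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (R + GAMMA * best_next - Q[(s, a)])
```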

State Information and Actions
State information (from GameBots): x, y, z position; player scores; team scores; domination location ownership; map; TimeLimit; Score Limit; Max # Teams; Max Team Size; navigation (path nodes…); reachability; items (id, type, location…); events (hear, incoming…)
Bot actions: SetWalk, RunTo, Stop, Jump, Strafe, TurnTo, Rotate, Shoot, ChangeWeapon, StopShoot
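
To connect a team-level action to the per-bot commands listed above, a hypothetical dispatch might look like the sketch below. The RunTo command name is taken from the slide, but the exact GameBots message format, coordinate handling, and `send_command` interface are assumptions.

```python
def send_team_action(bots, action, location_coords, send_command):
    """Send each bot to its assigned location.
    action is (loc for bot 1, ..., loc for bot m), as defined on the Initialization slide."""
    for bot, loc in zip(bots, action):
        x, y, z = location_coords[loc]   # coordinates of the domination point (assumed lookup)
        send_command(bot, "RunTo", {"Location": f"{x},{y},{z}"})  # assumed message format
```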

Managing (State × Action) Growth
Our table:
- States: ({E,F,N}, {E,F,N}, {E,F,N}) = 3^3 = 27
- Actions: ({L1, L2, L3}, …) = 3^3 = 27
- 27 × 27 = 729 entries
- Generally, 3^#loc × #loc^#bot
Adding health, discretized (high, med, low):
- States: (…, {h,m,l}) = 27 × 3 = 81
- Actions: ({L1, L2, L3, Health}, …) = 4^3 = 64
- 81 × 64 = 5184 entries
- Generally, 3^(#loc+1) × (#loc+1)^#bot
The number of locations and the size of the team frequently vary.
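
The growth formulas above can be checked with a few lines of arithmetic; this small sketch assumes #loc and #bot as named on the slide.

```python
def table_size(n_loc, n_bot, with_health=False):
    """Number of Q-table entries (states x actions), per the formulas on the slide."""
    if not with_health:
        return (3 ** n_loc) * (n_loc ** n_bot)           # 3^#loc x #loc^#bot
    return (3 ** (n_loc + 1)) * ((n_loc + 1) ** n_bot)   # 3^(#loc+1) x (#loc+1)^#bot

print(table_size(3, 3))                    # 27 * 27 = 729
print(table_size(3, 3, with_health=True))  # 81 * 64 = 5184
```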

Empirical Evaluation Opponents, Performance Curves, Videos

The Competitors
- HTNBot: HTN planning. We discussed this previously.
- OpportunisticBot: Bots go from one domination location to the next; if the location is under the control of the opponent's team, the bot captures it.
- PossessiveBot: Each bot is assigned a single domination location that it attempts to capture and hold during the whole game.
- GreedyBot: Attempts to recapture any location that is taken by the opponent.
- RETALIATE: Reinforcement learning.

Summary of Results
- Against the opportunistic, possessive, and greedy control strategies, RETALIATE won all 3 games in the tournament.
- Within the first half of the first game, RETALIATE developed a competitive strategy.
(figure: performance curves over 5 runs of 10 games against the opportunistic, possessive, and greedy bots)

Summary of Results: HTNBots vs RETALIATE (Round 1)

Summary of Results: HTNBots vs RETALIATE (Round 2)
(figure: score over time for RETALIATE, HTNbots, and the difference between them)

Video: Initial Policy
(top-down view; RETALIATE vs. opponent)

Video: Learned Policy
(RETALIATE vs. opponent)

Final Remarks Lessons Learned, Future Work

Final Remarks (1)
From our work with RETALIATE we learned the following lessons, beneficial to any real-world application of RL for these kinds of games:
- Separate individual bot behavior from team strategies.
- Model the problem of learning team tactics through a simple state formulation.

Final Remarks (2)
- It is very hard to predict all strategies beforehand.
  - As a result, RETALIATE was able to find a weakness and exploit it to produce a winning strategy that HTNBots could not counter.
  - On the other hand, HTNBots produced winning strategies against the other opponents from the beginning, while it took RETALIATE half a game in some situations.
- Tactics emerging from RETALIATE might be difficult to predict, so a game developer will have a hard time maintaining the Game AI.

Thank you! Questions?