On-Line Markov Decision Processes for Learning Movement in Video Games
Aaron Arvey (aarvey@cs.hmc.edu)
Goals
- Induce the human player's (HP) movement strategy for a non-player character (NPC)
- Learn in real time so that the HP's strategy can be determined and mimicked
- Use a reinforcement learning approach
- Compare results with a (very) primitive FSM
HP Movement
- Every HP has an individual style and movement patterns
- Best strategy for the NPC depends on the opponent:
  - "dumb" HP: use an FSM
  - "smart" HP: learn from the HP
- If you can't beat 'em, join 'em!
Mimicking HP Movement I
- How does the HP transition between states?
- How does the HP react to the NPC?
- Did the HP make the right move?
- How long should we observe before mimicking actions already seen?
Mimicking HP Movement II
- Use an FSM at the start
- Observe the HP and record reactions
- Once enough observations have accumulated, determine the optimal policy
- Assumptions:
  - Game length is sufficient for learning
  - All actions are reactions
Methods
- Reinforcement learning: rewards, states, actions
- Probabilistic reinforcement learning: add in a probabilistic transition model
- Markov Decision Processes
Rewards
- Experimentally (and subjectively) determined
- Represented as a function that considers:
  - Seeking the closest "dead" balls
  - Dodging the closest "live" balls
  - Maintaining distance from the HP
States
- Discretize the world into a grid
- State space includes:
  - HP location
  - NPC location
  - Closest live and dead balls
Actions
- Very simplistic approach
- Actions are "path" (strategy) oriented
- NPC can plan to move in the four cardinal directions
- Actions are chosen from a policy determined by a Markov Decision Process
Markov Decision Processes (MDPs)
- Actions, states, rewards, a discount factor, and a probability model T
- The discount factor weights immediate versus future rewards
- T describes the probability of moving from state s to state s' when action a is performed
- Solving the MDP produces a policy from which to choose actions
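These components combine in the Bellman optimality equation, written here in standard notation (R for the reward, gamma for the discount factor, T for the transition model from the slide):

```latex
V^*(s) \;=\; \max_{a}\Big[\, R(s,a) \;+\; \gamma \sum_{s'} T(s,a,s')\, V^*(s') \,\Big]
```

The optimal policy simply picks, in each state, the action achieving the max.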
Policy
- A policy is a mapping from states to actions
- An optimal policy is one that maximizes the value of every state
- The value of a state is determined by the potential rewards that could be received from being in that state
Value Iteration
- Determine the approximate expected value of every state; an optimal policy can then be derived
- The algorithm is formulated as a dynamic programming problem with an infinite time horizon
- Update the expected values of states iteratively; halt when "close enough"
Value Iteration Algorithm
Method for Mimicking HP I
- Use the MDP to determine an optimal policy
- Possible actions, states, discount, and rewards are hard-coded (discount = 0.8)
- The transition model is the only element we must determine during play time
- Utilize online methods for solving MDPs once we have a transition model
Method for Mimicking HP II
- Determining T:
  - Use the FSM to start the game
  - Observe how the HP "reacts" to the NPC
  - Assume all actions follow a reactive paradigm
- Once we have a frequency matrix, adjust for observation bias
  - Use a Laplacian prior
Platform: Dodgeball
- ~16 KLOC of C++, of which ~2 KLOC is AI code
- Graphics using OpenGL
- AI is modular: swap out the FSM for MDP-based AI
Example: Seeking a ball via FSM
Specific Experiments
- MDP steering
- MDP reasoning, FSM steering
- MDP/FSM hybrid steering
MDP Steering
- Instead of high-level reasoning, the MDP does the grunt work: every time step, the MDP returns an action
- Pros:
  - More agent autonomy by comparison
  - Learned how to dodge balls
- Cons:
  - Gets stuck between states
  - Rigid movement due to the restricted action set
MDP with FSM Steering
- The MDP makes a plan (chooses a goal state), similar to the steering experiment
- The FSM carries out the plan (go to the goal state)
  - Doesn't head directly to the goal state; can deviate from the plan
- Pros: smoother than MDP steering
- Cons: less autonomy; the FSM does most of the work
MDP/FSM Hybrid Steering
- Use both FSM (5-10%) and MDP (90-95%) steering
- Pros:
  - Smoother than MDP steering
  - More autonomy than MDP with FSM steering
  - Learned how to dodge balls
- Cons:
  - Still uses the FSM
  - Still gets stuck between states
Extensions I
- Learn more for more autonomy:
  - States: waypoint learning, neural gas
  - Rewards: apprenticeship and inverse RL
  - Actions: hierarchical action learning
- Take full advantage of the updateable model: reevaluate the policy
Extensions II
- Apply to more standardized platforms:
  - Quake II via a Matlab/Java connection through QASE
  - TIELT game/simulation environment
- Alternative value iteration algorithms:
  - "Real-time value iteration" (RTDP)
  - Offline value iteration
Questions? Comments?