On-Line Markov Decision Processes for Learning Movement in Video Games
Aaron Arvey (aarvey@cs.hmc.edu)
Goals
- Induce the human player's (HP) movement strategy for a non-player character (NPC)
- Learn in real time so that the HP's strategy can be determined and mimicked
- Use a reinforcement learning approach
- Compare results with a (very) primitive FSM
HP Movement
- Every HP has an individual style and movement patterns
- Best strategy for the NPC depends on the opponent:
  - "dumb" HP: use an FSM
  - "smart" HP: learn from the HP
- If you can't beat 'em, join 'em!
Mimicking HP Movement I
- How does the HP transition between states?
- How does the HP react to the NPC?
- Did the HP make the right move?
- How long should we observe before mimicking actions already seen?
Mimicking HP Movement II
- Use an FSM at the start
- Observe the HP and record reactions
- Once enough observations have accumulated, determine the optimal policy
- Assumptions:
  - Game length is sufficient for learning
  - All actions are reactions
Methods
- Reinforcement learning: rewards, states, actions
- Probabilistic reinforcement learning: add in a probabilistic transition model
- Markov Decision Processes
Rewards
- Experimentally (and subjectively) determined
- Represented as a function that considers:
  - Seeking the closest "dead" balls
  - Dodging the closest "live" balls
  - Maintaining distance from the HP
States
- Discretize the world into a grid
- State space includes:
  - HP location
  - NPC location
  - Closest live and dead balls
Actions
- Very simplistic approach
- Actions are "path" (strategy) oriented
- NPC can plan to move in the four cardinal directions
- Actions are chosen from a policy determined by a Markov Decision Process
Markov Decision Processes (MDPs)
- Actions, states, rewards, a discount factor, and a probability model T
- The discount factor weights immediate versus future rewards
- T describes the probability of moving from state s to state s' when action a is performed
- Solving the MDP produces a policy from which to choose actions
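These components combine in the Bellman optimality equation, written here in standard notation (R for the reward, gamma for the discount factor, T for the transition model from the slide):

```latex
V^*(s) \;=\; \max_{a}\Big[\, R(s,a) \;+\; \gamma \sum_{s'} T(s,a,s')\, V^*(s') \,\Big]
```

The optimal policy simply picks, in each state, the action achieving the max.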
Policy
- A policy is a mapping from states to actions
- An optimal policy is one that maximizes the value of every state
- The value of a state is determined by the potential rewards that could be received from being in that state
Value Iteration
- Determine the approximate expected value of every state; an optimal policy can then be derived
- The algorithm is formulated as a dynamic programming problem with an infinite time horizon
- Update the expected values of states iteratively; halt when "close enough"
Value Iteration Algorithm
Method for Mimicking HP I
- Use the MDP to determine an optimal policy
- Possible actions, states, discount, and rewards are hard-coded (discount = 0.8)
- The transition model is the only element we must determine during play time
- Utilize online methods for solving MDPs once we have a transition model
Method for Mimicking HP II
- Determining T:
  - Use the FSM to start the game
  - Observe how the HP "reacts" to the NPC
  - Assume all actions follow a reactive paradigm
- Once we have a frequency matrix, adjust for observation bias
  - Use a Laplacian prior
Platform: Dodgeball
- ~16 KLOC of C++, of which ~2 KLOC is AI code
- Graphics using OpenGL
- AI is modular: swap out the FSM for MDP-based AI
Example: Seeking a ball via FSM
Specific Experiments
- MDP steering
- MDP reasoning, FSM steering
- MDP/FSM hybrid steering
MDP Steering
- Instead of high-level reasoning, the MDP does the grunt work: every time step, the MDP returns an action
- Pros:
  - More agent autonomy by comparison
  - Learned how to dodge balls
- Cons:
  - Gets stuck between states
  - Rigid movement due to the restricted action set
MDP with FSM Steering
- The MDP makes a plan (chooses a goal state), similar to the steering experiment
- The FSM carries out the plan (go to the goal state)
  - Doesn't head directly to the goal state; can deviate from the plan
- Pros: smoother than MDP steering
- Cons: less autonomy; the FSM does most of the work
MDP/FSM Hybrid Steering
- Use both FSM (5-10%) and MDP (90-95%) steering
- Pros:
  - Smoother than MDP steering
  - More autonomy than MDP with FSM steering
  - Learned how to dodge balls
- Cons:
  - Still uses the FSM
  - Still gets stuck between states
Extensions I
- Learn more for more autonomy:
  - States: waypoint learning, neural gas
  - Rewards: apprenticeship and inverse RL
  - Actions: hierarchical action learning
- Take full advantage of the updateable model: reevaluate the policy
Extensions II
- Apply to more standardized platforms:
  - Quake II via a Matlab/Java connection through QASE
  - TIELT game/simulation environment
- Alternative value iteration algorithms:
  - "Real-time value iteration" (RTDP)
  - Offline value iteration
Questions? Comments?