Published by Nickolas Fletcher. Modified over 6 years ago.
1
Solving Interleaved and Blended Sequential Decision-Making Problems through Modular Neuroevolution
Jacob Schrum and Risto Miikkulainen
2
Introduction
Challenge: Discover behavior automatically
  Simulations
  Robotics
  Video games
Why challenging?
  Complex domains
  Multiple agents
  Multiple objectives
  Multimodal behavior required
3
When is Multimodal Behavior Needed?
Agent must exhibit multiple behaviors
Domain consists of multiple tasks
How to identify/distinguish tasks?
Interleaved
  Each task is separate
  Sharp borders between tasks
Blended
  Tasks partially overlap
  Blurred border between tasks
4
What types of policies work best?
Previous results show great success in Ms. Pac-Man (Schrum and Miikkulainen, GECCO 2014)
  Ms. Pac-Man has blended tasks: mixture of threat/edible ghosts
  Neuroevolution discovers its own task divisions
What about interleaved tasks?
  Introduce a version of Ms. Pac-Man with interleaved tasks
  Ghosts are all threats or all edible
What about human-specified task divisions?
  Evolve multitask networks
  One policy for each task
Understand problem better with formalism (in paper)
5
Ms. Pac-Man
Multimodal behavior needed
Predator/prey variant
  Pac-Man takes on both roles
Goals: Maximize score by
  Eating all pills in each level
  Avoiding threatening ghosts
  Eating ghosts (after power pill)
Non-deterministic
  Very noisy evaluations
Four mazes
  Behavior must generalize
Human Play
6
Two Versions of Ms. Pac-Man
Blended
  Original game
  Edible and threat ghosts present simultaneously
  Distinct threat/edible regions
  Four ghosts can be mix of threat/edible/imprisoned
Interleaved
  Modified version
  Ghosts all edible or all threat
  Eaten ghosts stay in prison after being eaten
  Easier to eat ghosts and stay alive in this version
7
Modular Policy
One policy consisting of several policies/modules
  Number preset, or learned
Means of arbitration also needed
  Human-specified, or learned via preference neurons
[Figure: network diagrams with Inputs and Outputs for Multitask (Caruana 1997) and Preference Neurons]
8
Constructive Neuroevolution
Genetic Algorithms + Neural Networks
Three basic mutations + Crossover
  Perturb Weight
  Add Connection
  Add Node
Other structural mutations possible (NEAT by Stanley and Miikkulainen 2004)
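The three basic mutations can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Link` record, the flat list-of-links genome, and the `scale` parameter are all assumptions made for the example. The add-node step follows the usual NEAT convention of splitting an existing link.

```python
import random
from dataclasses import dataclass

@dataclass
class Link:
    # Hypothetical genome record: one weighted connection between two node ids.
    src: int
    dst: int
    weight: float
    enabled: bool = True

def perturb_weight(links, rng, scale=0.1):
    """Perturb Weight: add Gaussian noise to one randomly chosen link."""
    link = rng.choice(links)
    link.weight += rng.gauss(0.0, scale)

def add_connection(links, nodes, rng):
    """Add Connection: link two nodes that are not yet directly connected."""
    src, dst = rng.choice(nodes), rng.choice(nodes)
    if not any(l.src == src and l.dst == dst for l in links):
        links.append(Link(src, dst, rng.uniform(-1.0, 1.0)))

def add_node(links, nodes, rng):
    """Add Node: split an enabled link with a new node (NEAT-style).

    The incoming link gets weight 1.0 and the outgoing link keeps the old
    weight, so behavior changes as little as possible at first.
    """
    old = rng.choice([l for l in links if l.enabled])
    old.enabled = False
    new_node = max(nodes) + 1
    nodes.append(new_node)
    links.append(Link(old.src, new_node, 1.0))
    links.append(Link(new_node, old.dst, old.weight))

rng = random.Random(0)
nodes = [0, 1]
links = [Link(0, 1, 0.5)]
add_node(links, nodes, rng)
```

After the call, the original link is disabled and two new links route through node 2.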
9
Module Mutation
A mutation that adds a module
MM(Duplicate) copies previous module (cf. Calabretta et al. 2000)
  New module can diverge as weights change
  Can happen more than once
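A rough sketch of the duplication idea, assuming a network's output layer is stored as a list of module records. The dictionary layout and the `source` parameter are hypothetical, chosen only to show that the new module starts as an exact copy and is free to diverge afterwards.

```python
import copy

def mm_duplicate(modules, source=-1):
    """Return a new module list with a deep copy of modules[source] appended.

    Assumption: each module is a dict of incoming weights for its policy and
    preference neurons. Which module gets copied is a parameter here.
    """
    return modules + [copy.deepcopy(modules[source])]

one_module = [{"policy": [0.4, -0.2], "preference": [0.1, 0.3]}]
two_modules = mm_duplicate(one_module)

# Later weight mutations only touch the copy, so the modules can diverge.
two_modules[1]["policy"][0] += 0.05
```

Because the copy is deep, the original module's weights are unaffected by the later change.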
10
Experiments
Evolved using Modular Multiobjective NEAT
Each network type evaluated in 30 runs
  Standard networks with one module (1M)
  Nets with two (2M) or three (3M) modules + preference neurons
  Nets starting with one module, subject to MM(D)
  Multitask networks with human-specified task divisions
    Interleaved: One module for edible, one for threats (MT)
    Blended: One module if any ghost is edible, one otherwise (MT2)
    Blended: One module for edible, one for threats, one for mixed (MT3)
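The human-specified divisions listed above amount to a fixed rule that maps the current ghost states to a module index. A minimal sketch follows; the state strings, the function names, and the handling of the case with no active ghosts are assumptions for illustration.

```python
EDIBLE, THREAT = "edible", "threat"

def mt2_module(ghost_states):
    """MT2 rule: module 0 if any active ghost is edible, module 1 otherwise.

    In the interleaved game all active ghosts share one state, so the same
    rule also covers the two-module MT division (edible vs. threat).
    """
    return 0 if EDIBLE in ghost_states else 1

def mt3_module(ghost_states):
    """MT3 rule: module 0 for all-edible, 1 for all-threat, 2 for mixed.

    Assumption: with no active ghosts (all imprisoned) the threat module is used.
    """
    states = set(ghost_states)
    if states == {EDIBLE}:
        return 0
    if EDIBLE not in states:
        return 1
    return 2

mixed = [THREAT, EDIBLE, THREAT]
chosen = (mt2_module(mixed), mt3_module(mixed))
```

With a mixed set of ghosts, MT2 falls back to its "any edible" module while MT3 selects its dedicated mixed module.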
11
← Interleaved Results
  All modular approaches superior
  Final differences significant (p < 0.05)
  Higher scores overall (easier domain)
  Different behaviors have similar scores
Blended Results →
  Preference neurons superior
  Multitask as bad as one module
  Harder domain with lower scores
  Certain behaviors distinctly better
  Final differences significant (p < 0.05)
12
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score
13
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score
Separate modules for escaping toward power pill and everything else
Separate modules for threat and edible ghosts
14
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score
LITTLE OVERLAP
SMALL GAP
Separate modules for escaping toward power pill and everything else
Separate modules for threat and edible ghosts
15
Blended Post-Evolution Evaluations: Primary Module Usage vs. Score
16
Blended Post-Evolution Evaluations: Primary Module Usage vs. Score
LARGER GAP
MORE OVERLAP
Separate modules for escaping toward power pill and everything else
Separate modules for threat and edible ghosts (Imperfect division)
17
Threat/Edible Split With Multitask Networks
Interleaved Two Module Multitask Champion
Blended Two Module Multitask Champion
Blended Three Module Multitask Champion
Similar behavior is exhibited in both domains. It is competent, but not outstanding.
18
Threat/Edible Split With Preference Neurons
Interleaved Two Module Champion
Blended Three Module Champion (only uses two modules)
MM(D) champions are similar. Significant usage of 3+ modules does not occur.
19
Escape/Luring Behavior (Only Possible With Preference Neurons)
Interleaved Two Module Champion
Blended Two Module Champion
A special “escape” module is rarely used, but its use has a big impact.
Escaping ghosts when nearly surrounded helps Ms. Pac-Man survive longer.
Escaping to a power pill after luring ghosts close helps Ms. Pac-Man eat ghosts and maximize points.
20
Discussion
Obvious division is between edible and threat
  Works fine in interleaved domain
  Causes problems in blended domain
  Strict Multitask divisions do not perform well
Better division: one module when surrounded
  Very asymmetrical: surprising
  Useful in interleaved domain, vital in blended domain
  Module activates when Pac-Man almost surrounded
    Often leads to eating power pill: luring
    Helps Pac-Man escape in other risky situations
21
Conclusion
Interleaved tasks
  Many approaches work well
  Multitask network with one module per task sufficient
  Preference neurons can still learn better task division
Blended tasks
  Harder scenario, encountered in more real-world domains
  Human-specified task divisions (Multitask) break down
  Evolution needs freedom to discover its own task division
  Results in novel, successful task division (escaping/luring)
22
Questions?
Contact me: schrum2@southwestern.edu
Movies:
Code:
23
Auxiliary Slides
24
Multimodal Behavior
Animals can perform many different tasks
Imagine learning a monolithic policy as complex as a cardinal’s behavior: HOW?
Problem more tractable if broken into component behaviors
  Flying
  Nesting
  Foraging
25
Formalism (1/2)
26
Formalism (2/2)
27
Previous Work in Pac-Man
Custom Simulators
  Genetic Programming: Koza 1992
  Neuroevolution: Gallagher & Ledwich 2007, Burrow & Lucas 2009, Tan et al. 2011
  Reinforcement Learning: Burrow & Lucas 2009, Subramanian et al. 2011, Bom 2013
  Alpha-Beta Tree Search: Robles & Lucas 2009
Screen Capture Competition: Requires Image Processing
  Evolution & Fuzzy Logic: Handa & Isozaki 2008
  Influence Map: Wirth & Gallagher 2008
  Ant Colony Optimization: Emilio et al. 2010
  Monte-Carlo Tree Search: Ikehata & Ito 2011
  Decision Trees: Foderaro et al. 2012
Pac-Man vs. Ghosts Competition: Pac-Man
  Genetic Programming: Alhejali & Lucas 2010, 2011, 2013, Brandstetter & Ahmadi 2012
  Monte-Carlo Tree Search: Samothrakis et al. 2010, Alhejali & Lucas 2013
  Influence Map: Svensson & Johansson 2012
  Ant Colony Optimization: Recio et al. 2012
Pac-Man vs. Ghosts Competition: Ghosts
  Neuroevolution: Wittkamp et al. 2008
  Evolved Rule Set: Gagne & Congdon 2012
  Monte-Carlo Tree Search: Nguyen & Thawonmas 2013
28
Evolved Direction Evaluator
Inspired by Brandstetter and Ahmadi (CIG 2012)
Net with single output and direction-relative sensors
Each time step, run net for each available direction
Pick direction with highest net output
[Figure: the net produces a Left Preference and a Right Preference; argmax selects Left]
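The per-direction evaluation loop can be sketched as below. The function name, the `toy_net` stand-in, and the two sensor values are hypothetical; a real evolved network would replace `toy_net`, and its sensors would be computed relative to each candidate direction.

```python
def choose_direction(net, sensors_by_direction):
    """Run the single-output net once per available direction, then take the argmax."""
    preferences = {d: net(s) for d, s in sensors_by_direction.items()}
    best = max(preferences, key=preferences.get)
    return best, preferences

def toy_net(sensors):
    # Toy stand-in for an evolved network: a fixed weighted sum of two
    # direction-relative sensors (both are made-up inputs for this sketch).
    pill_proximity, threat_proximity = sensors
    return 1.0 * pill_proximity - 2.0 * threat_proximity

sensors = {"left": (0.8, 0.1), "right": (0.9, 0.6)}
direction, prefs = choose_direction(toy_net, sensors)
```

Here the left direction wins because the nearby threat on the right outweighs the slightly closer pill.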
29
Preference Neuron Arbitration
How can network decide which module to use?
  Find preference neuron (grey) with maximum output
  Corresponding policy neurons (white) define behavior
Example outputs: [0.6, 0.1, 0.5, 0.7]
  0.7 > 0.1, so use Module 2
  Policy neuron for Module 2 has output of 0.5
  Output value 0.5 defines agent’s behavior
[Figure: network diagram with Inputs and Outputs]
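The arbitration rule can be written in a few lines. This sketch assumes the output layout implied by the slide's example, with one policy neuron followed by one preference neuron per module; the function name is hypothetical.

```python
def arbitrate(outputs):
    """Pick the module whose preference neuron has the largest output.

    Assumed layout: [policy_1, preference_1, policy_2, preference_2, ...].
    Returns (zero-based module index, policy output of that module).
    """
    policies, preferences = outputs[0::2], outputs[1::2]
    module = max(range(len(preferences)), key=preferences.__getitem__)
    return module, policies[module]

module, behavior = arbitrate([0.6, 0.1, 0.5, 0.7])
```

Index 1 corresponds to Module 2 on the slide, and its policy output of 0.5 is the value that drives the agent.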
30
Pareto-based Multiobjective Optimization (Pareto 1890)
Tradeoff between objectives
  High health but did not deal much damage
  Dealt lot of damage, but lost lots of health
(Deb et al. 2000)
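The tradeoff can be made concrete with the standard Pareto dominance test, sketched here for maximization. The score tuples are made-up (damage dealt, health remaining) pairs, not results from the presentation.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (damage dealt, health remaining) scores, both maximized.
scores = [(10, 90), (80, 20), (50, 50), (40, 40), (10, 20)]
front = pareto_front(scores)
```

The high-health and high-damage extremes both survive along with the balanced (50, 50) point, while (40, 40) and (10, 20) are dominated.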
31
Non-dominated Sorting Genetic Algorithm II (Deb et al. 2000)
Population P with size N; Evaluate P
Use mutation (& crossover) to get P′ size N; Evaluate P′
Calculate non-dominated fronts of P ∪ P′ size 2N
New population size N from highest fronts of P ∪ P′
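One generation of this loop might look like the sketch below. It is a simplification under stated assumptions: parents are picked uniformly at random rather than by NSGA-II's tournament selection, crossover is omitted, and `evaluate` and `mutate` are caller-supplied placeholders. The last front that does not fit is trimmed by crowding distance, as in NSGA-II.

```python
import random

def dominates(a, b):
    """Pareto dominance for maximization."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated_fronts(scores):
    """Peel off successive Pareto fronts; each front is a sorted list of indices."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

def crowding_distance(front, scores):
    """Larger distance means a less crowded region of the front."""
    dist = {i: 0.0 for i in front}
    for k in range(len(scores[front[0]])):
        ordered = sorted(front, key=lambda i: scores[i][k])
        lo, hi = scores[ordered[0]][k], scores[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (scores[nxt][k] - scores[prev][k]) / (hi - lo)
    return dist

def select(scores, n):
    """Indices of n survivors, filled from the best fronts downward."""
    chosen = []
    for front in nondominated_fronts(scores):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)
        else:
            dist = crowding_distance(front, scores)
            front.sort(key=lambda i: dist[i], reverse=True)
            chosen.extend(front[:n - len(chosen)])
            break
    return chosen

def nsga2_generation(parents, evaluate, mutate, rng):
    """P -> P' by mutation, then keep the best N of the combined pool of size 2N."""
    children = [mutate(rng.choice(parents), rng) for _ in parents]
    pool = parents + children
    scores = [evaluate(ind) for ind in pool]
    return [pool[i] for i in select(scores, len(parents))]

demo_scores = [(1, 1), (2, 2), (3, 1), (1, 3), (0, 0)]
demo_fronts = nondominated_fronts(demo_scores)
```

In the demo scores, the first front holds the three mutually non-dominated points, followed by (1, 1) and then (0, 0).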
32
Direction Evaluator + Modules
Network is evaluated in each direction
For each direction, a module is chosen
  Human-specified (Multitask) or Preference Neurons
  Chosen module policy neuron sets direction preference
[Left Inputs] → [0.5, 0.1, 0.7, 0.6] → 0.6 > 0.1: Left Preference is 0.7
[Right Inputs] → [0.3, 0.8, 0.9, 0.1] → 0.8 > 0.1: Right Preference is 0.3
0.7 > 0.3: Ms. Pac-Man chooses to go left, based on Module 2
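The slide's worked example can be reproduced with a short sketch that combines per-direction evaluation with preference-neuron arbitration. It assumes the output layout [policy 1, preference 1, policy 2, preference 2], which matches the numbers above; the function names are hypothetical.

```python
def pick_module(outputs):
    """Return (zero-based module index, policy output) for the most preferred module.

    Assumed layout: [policy_1, preference_1, policy_2, preference_2, ...].
    """
    policies, preferences = outputs[0::2], outputs[1::2]
    module = max(range(len(preferences)), key=preferences.__getitem__)
    return module, policies[module]

def choose_direction(outputs_by_direction):
    """Arbitrate within each direction, then take the direction with the highest preference."""
    choices = {d: pick_module(o) for d, o in outputs_by_direction.items()}
    direction = max(choices, key=lambda d: choices[d][1])
    module, preference = choices[direction]
    return direction, module, preference

direction, module, preference = choose_direction({
    "left": [0.5, 0.1, 0.7, 0.6],   # module 2 preferred (0.6 > 0.1), policy 0.7
    "right": [0.3, 0.8, 0.9, 0.1],  # module 1 preferred (0.8 > 0.1), policy 0.3
})
```

The result is a move to the left driven by the second module (index 1), since 0.7 > 0.3.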