Distributed Reinforcement Learning for Multi-Robot Decentralized Collective Construction 2019. 03. 18 Gyu-Young Hwang LINK@KoreaTech http://link.koreatech.ac.kr.


Distributed Reinforcement Learning for Multi-Robot Decentralized Collective Construction 2019. 03. 18 Gyu-Young Hwang LINK@KoreaTech http://link.koreatech.ac.kr

ABSTRACT. This paper extends the single-agent A3C algorithm to enable multiple agents to learn a homogeneous, distributed policy, where agents work together toward a common goal without explicitly interacting. The combined experience of all agents can be leveraged to quickly train a collaborative policy that naturally scales to smaller and larger swarms. We present simulation results where swarms of various sizes successfully construct different test structures without the need for additional training.

1. INTRODUCTION. In this paper, we propose a fully-distributed, scalable learning framework, where multiple agents learn a common, collaborative policy in a shared environment. We focus on a multi-robot construction problem, where simple robots must gather, carry, and place simple block elements to build a user-specified 3D structure. This is a very difficult problem, as robots need to build scaffolding to reach the higher levels of the structure and complete construction.

1. INTRODUCTION. We use a centralized learning architecture but fully decentralized execution, where each agent runs an identical copy of the policy without explicitly communicating with other agents, and yet a common goal is achieved.

2. Background 2.1 Single-Agent Asynchronous Distributed Learning. The learning task is distributed to several agents that asynchronously update a global, shared network based on their individual experiences in independent learning environments. These methods rely on multiple agents to learn a single-agent policy in a distributed manner. The increase in performance is mainly due to the fact that multiple agents produce highly decorrelated gradients. Additionally, agents in independent environments can cover the state-action space concurrently, and faster than a single agent could.
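As a rough illustration of this asynchronous, shared-network update scheme, the toy Python sketch below has several worker threads apply their locally computed gradients to one global parameter vector. The objective (a dummy quadratic), learning rate, and thread count are placeholders, not the paper's actual actor-critic loss.

import threading
import numpy as np

# Toy stand-in for the shared network: a single weight vector trained on a
# dummy quadratic objective ||w - target||^2 so the example is runnable.
global_w = np.zeros(4)                      # global, shared parameters
target = np.array([1.0, -2.0, 0.5, 3.0])    # pretend "optimal" parameters
lock = threading.Lock()

def worker(seed, steps=1000, lr=0.01):
    global global_w
    rng = np.random.default_rng(seed)
    local_w = global_w.copy()
    for _ in range(steps):
        # Each worker computes a gradient from its own (noisy) experience...
        grad = 2.0 * (local_w - target) + rng.normal(0.0, 0.1, size=4)
        with lock:
            # ...and asynchronously applies it to the global, shared network,
            # then re-synchronizes its local copy. (A lock keeps this sketch
            # simple; A3C applies its updates lock-free.)
            global_w -= lr * grad
            local_w = global_w.copy()

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("learned parameters:", np.round(global_w, 2))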

2. Background 2.2 Multi-Agent Learning. The most important problem encountered when transitioning from the single-agent to the multi-agent case is the curse of dimensionality: the dimensions of the state-action spaces explode combinatorially, requiring an absurd amount of training data to converge. Many recent works have therefore focused on decentralized policy learning, where agents each learn their own policy, which should encompass a measure of agent cooperation, at least during training.

3. Multi-Robot Construction 3.1 Problem Statement and Background. Robots need to gather blocks from the sources to collaboratively build a user-specified structure. Valid actions for each robot include turning 90 degrees left or right, moving one location forward, and picking up or putting down a block in front of itself. Robots can move to an adjacent location if and only if doing so moves them vertically by at most one block up or down. Robots can pick up or put down a block only under the condition that it is a top block, on the same level as the robot itself.
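The movement rule above can be captured in a few lines; the Python sketch below assumes the world is stored as a 2D array of column heights (the representation and function names are illustrative, not taken from the paper's code).

import numpy as np

def can_move(heights, robot_pos, new_pos):
    """A robot may step to an adjacent cell only if the height difference
    between its current cell and the target cell is at most one block."""
    r, c = robot_pos
    nr, nc = new_pos
    if abs(nr - r) + abs(nc - c) != 1:                     # must be an adjacent cell
        return False
    if not (0 <= nr < heights.shape[0] and 0 <= nc < heights.shape[1]):
        return False
    return abs(int(heights[nr, nc]) - int(heights[r, c])) <= 1

heights = np.array([[0, 1, 2],                             # column heights (in blocks)
                    [0, 0, 1],
                    [0, 0, 0]])
print(can_move(heights, (1, 1), (0, 1)))                   # True: one block up
print(can_move(heights, (1, 1), (0, 2)))                   # False: not adjacent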

3. Multi-Robot Construction 3.2 Multi-Robot Construction as an RL Problem: State Space. The state s_i of agent i is a 5-tuple of binary tensors:
Current position: a one-hot tensor (single 1 at the current position of agent i)
Agents' positions: a binary tensor showing the positions currently occupied by agents in the world (agent i included)
Blocks' positions: a binary tensor expressing the positions of the blocks
Sources' positions: a binary tensor showing the positions of all block sources
Structure plan: a binary tensor expressing the user-specified structure's plan
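A minimal construction of such a 5-tuple state for a small world might look like the following Python sketch (the world size, the example positions, and the helper one_hot are made up for illustration).

import numpy as np

X, Y, Z = 5, 5, 3                                   # illustrative world size

def one_hot(positions, shape=(X, Y, Z)):
    t = np.zeros(shape, dtype=np.int8)
    for p in positions:
        t[p] = 1
    return t

agent_i_pos      = (0, 0, 0)
agent_positions  = [(0, 0, 0), (2, 3, 0)]           # agent i included
block_positions  = [(1, 1, 0), (1, 1, 1)]
source_positions = [(4, 4, 0)]
plan_positions   = [(2, 2, 0), (2, 2, 1)]           # user-specified structure

state_i = (one_hot([agent_i_pos]),     # current position of agent i (single 1)
           one_hot(agent_positions),   # all agents' positions
           one_hot(block_positions),   # blocks currently in the world
           one_hot(source_positions),  # block sources
           one_hot(plan_positions))    # structure plan
print([t.shape for t in state_i], int(state_i[0].sum()))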

3. Multi-Robot Construction 3.2 Multi-Robot Construction as an RL Problem: Action Space. The 13 possible actions are:
four "move" actions (move North/East/South/West)
four "pick up" actions (N/E/S/W)
four "put down" actions (N/E/S/W)
one "idle" action
In practice, an agent only has access to at most 9 actions at any time, since an agent carrying a block can only put down a block and cannot pick up an additional one, and vice versa. An agent may sometimes pick an action that has become impossible to perform due to another agent's action; in such cases, nothing happens and the agent remains idle for one time step.
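For concreteness, the sketch below enumerates the 13 actions and the carrying-dependent availability rule (the action names and their ordering are assumptions).

MOVES     = ["move_N", "move_E", "move_S", "move_W"]
PICK_UPS  = ["pick_N", "pick_E", "pick_S", "pick_W"]
PUT_DOWNS = ["put_N", "put_E", "put_S", "put_W"]
ACTIONS   = MOVES + PICK_UPS + PUT_DOWNS + ["idle"]     # 13 actions in total

def available_actions(carrying_block):
    """An agent carrying a block cannot pick up another one, and an agent
    with empty hands cannot put a block down, so at most 9 actions apply."""
    if carrying_block:
        return MOVES + PUT_DOWNS + ["idle"]
    return MOVES + PICK_UPS + ["idle"]

print(len(ACTIONS), len(available_actions(True)))        # 13 9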

3. Multi-Robot Construction 3.2 Multi-Robot Construction as an RL Problem: Reward Structure. r_0 = -0.02, r_1 = 1. To place a block at the n-th level of a simple tower of blocks, agents need to build a ramp (or rather, a staircase) of height n-1 to make the n-th level reachable.
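A reward function consistent with these constants could look like the sketch below, assuming r_0 is a small per-time-step penalty and r_1 rewards placing a block at a position that belongs to the structure plan; the slide does not spell out these semantics, so treat them as an interpretation rather than the paper's exact reward.

R_STEP  = -0.02     # r_0: assumed to be paid at every time step
R_BUILD =  1.0      # r_1: assumed reward for a useful block placement

def reward(placed_block, placement_in_plan):
    r = R_STEP
    if placed_block and placement_in_plan:
        r += R_BUILD
    return r

print(reward(False, False), reward(True, True))   # -0.02 0.98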

3. Multi-Robot Construction 3.2 Multi-Robot Construction as an RL Problem: Actor-Critic Network. Five 3D convolution layers compress the input from 3D down to 1D, connecting to the next two fully-connected (fc) layers. Residual shortcuts connect the input layer to the third convolutional layer, the third convolutional layer to the first fc layer, and the first fc layer to the LSTM.
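A hedged PyTorch sketch of such a network is given below: five 3D convolutions, two fc layers, an LSTM cell, and the three residual shortcuts named above. Layer widths, kernel sizes, strides, and the 1x1 projections used to match channel counts are assumptions, not the paper's exact hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticNet(nn.Module):
    def __init__(self, in_ch=5, n_actions=13, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, 32, 3, padding=1)
        self.conv2 = nn.Conv3d(32, 32, 3, padding=1)
        self.conv3 = nn.Conv3d(32, 32, 3, padding=1)
        self.skip_in = nn.Conv3d(in_ch, 32, 1)           # projects input for the shortcut
        self.conv4 = nn.Conv3d(32, 64, 3, stride=2, padding=1)
        self.conv5 = nn.Conv3d(64, 64, 3, stride=2, padding=1)
        self.fc1 = nn.LazyLinear(hidden)                 # flattened conv features -> fc
        self.skip_c3 = nn.Linear(32, hidden)             # projects conv3 for the shortcut
        self.fc2 = nn.Linear(hidden, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.policy = nn.Linear(hidden, n_actions)       # actor head: pi(a | s; theta)
        self.value = nn.Linear(hidden, 1)                # critic head: V(s; theta_v)

    def forward(self, x, hc):
        c1 = F.relu(self.conv1(x))
        c2 = F.relu(self.conv2(c1))
        c3 = F.relu(self.conv3(c2) + self.skip_in(x))            # shortcut: input -> conv3
        c4 = F.relu(self.conv4(c3))
        c5 = F.relu(self.conv5(c4))
        f1 = F.relu(self.fc1(c5.flatten(1)) +
                    self.skip_c3(c3.mean(dim=(2, 3, 4))))        # shortcut: conv3 -> fc1
        f2 = F.relu(self.fc2(f1))
        h, c = self.lstm(f2 + f1, hc)                            # shortcut: fc1 -> LSTM
        return F.softmax(self.policy(h), dim=-1), self.value(h), (h, c)

net = ActorCriticNet()
x = torch.zeros(1, 5, 10, 10, 4)                  # one stacked 5-channel 10x10x4 state
hc = (torch.zeros(1, 128), torch.zeros(1, 128))   # LSTM hidden and cell states
pi, v, hc = net(x, hc)
print(pi.shape, v.shape)                          # torch.Size([1, 13]) torch.Size([1, 1])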

3. Multi-Robot Construction 3.2 Multi-Robot Construction as an RL Problem: Actor-Critic Network. π(a_t | s_t; θ) is the policy estimate for action a_t (actor), and V(s_t; θ_v) is the value estimate (critic). We update the policy in the direction of the gradient ∇_θ log π(a_t | s_t; θ) · A(s_t, a_t; θ, θ_v), where A(s_t, a_t; θ, θ_v) refers to the advantage estimate A(s_t, a_t; θ, θ_v) = Σ_{i=0}^{k-1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v). An additional loss term minimizes the log-likelihood of the model selecting invalid actions, where y_a = 0 if action a is invalid.
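The sketch below computes these quantities numerically for a single k-step rollout: the n-step advantage estimate and the resulting actor, critic, and entropy terms, plus an invalid-action penalty written in an assumed cross-entropy-like form (the paper's exact expression may differ). All numbers are made up.

import numpy as np

gamma = 0.99
rewards = np.array([-0.02, -0.02, 0.98])        # r_t ... r_{t+k-1} (made-up rollout)
V_t, V_tk = 0.30, 0.45                          # V(s_t; theta_v), V(s_{t+k}; theta_v)
pi_t = np.array([0.40] + [0.05] * 12)           # pi(. | s_t; theta) over the 13 actions
a_t = 0                                         # action actually taken
y = np.array([1] * 9 + [0] * 4)                 # y_a = 0 marks currently invalid actions

k = len(rewards)
n_step_return = sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**k * V_tk
advantage = n_step_return - V_t                 # A(s_t, a_t; theta, theta_v)

policy_loss  = -np.log(pi_t[a_t]) * advantage   # actor term
value_loss   = advantage**2                     # critic term
entropy      = -np.sum(pi_t * np.log(pi_t))     # exploration bonus
invalid_loss = -np.sum((1 - y) * np.log(1.0 - pi_t + 1e-8))  # assumed form of the
                                                             # invalid-action penalty
print(round(float(advantage), 3), round(float(policy_loss), 3),
      round(float(invalid_loss), 3))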

4. Online Learning. We train multiple agents in a shared environment. Initial blocks are placed at random in the environment, and the initial agents' positions are also randomized. We also randomize the goal structure at the beginning of each episode, to train agents to build any structure.
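An episode reset with this kind of randomization might be sketched as follows (the world size, block densities, and the 2D-footprint representation are assumptions made purely for illustration).

import numpy as np

def reset_episode(rng, X=10, Y=10, n_agents=4):
    blocks = (rng.random((X, Y)) < 0.10).astype(int)   # random initial blocks (footprint)
    plan   = (rng.random((X, Y)) < 0.15).astype(int)   # random goal structure (footprint)
    free = [(x, y) for x in range(X) for y in range(Y) if blocks[x, y] == 0]
    idx = rng.choice(len(free), size=n_agents, replace=False)
    agents = [free[i] for i in idx]                    # random agent positions
    return blocks, plan, agents

rng = np.random.default_rng(0)
blocks, plan, agents = reset_episode(rng)
print(int(blocks.sum()), int(plan.sum()), agents)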

4. Online Learning. We push agents to explore the state-action space more intensively by picking a valid action uniformly at random every time the "has block" state predictor is incorrect. This mechanism is mostly active during early training, since this output is quickly learned and reaches near-100% prediction accuracy. We also train agents using experience replay, to help them further decorrelate their learning gradients: each agent draws an experience sample from a randomly selected epoch (100 time steps) of its replay pool and learns from it.
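The epoch-based replay described above can be sketched as a small per-agent pool of 100-time-step chunks, one of which is drawn uniformly at random when the agent replays; the pool size and the demo data below are illustrative.

import random
from collections import deque

EPOCH_LEN = 100

class ReplayPool:
    def __init__(self, max_epochs=50):
        self.epochs = deque(maxlen=max_epochs)   # each entry: one 100-step epoch
        self.current = []

    def add(self, transition):
        self.current.append(transition)
        if len(self.current) == EPOCH_LEN:       # close the 100-time-step epoch
            self.epochs.append(self.current)
            self.current = []

    def sample_epoch(self):
        return random.choice(self.epochs) if self.epochs else None

pool = ReplayPool()
for t in range(350):                              # fake transitions for the demo
    pool.add({"state": t, "action": 0, "reward": -0.02})
epoch = pool.sample_epoch()
print(len(pool.epochs), len(epoch))               # 3 100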

5. Simulation Results 5.1 Results

5. Simulation Results 5.1 Results We believe that a more specialized training process could be investigated to improve the agents’ cleanup performance.

5. Simulation Results 5.2 Comparison with Conventional Planning Our previous approach, which relied on a (conventional) tree-based planner, focused on finding the smallest number of block placements/removals needed by a single agent. Our learning-based approach results in far more blocks being placed/removed during the construction of the structure. This makes sense, since the learned policy is not trained to minimize the number of block placements/removals.

5. Simulation Results 5.3 Discussion. Our approach scales to different numbers of agents, as the agents are completely independent and do not require explicit interactions or a centralized planner. Even though training was only ever performed with four agents, the learned policy was able to generalize to other swarm sizes, despite never having experienced any situation with a different swarm size. However, our approach also has its limitations: more actions are needed for plan completion and more scaffolding is built than with an optimized centralized planner. In addition, fully cleaning up the world remains a challenge for the agents.

Thank you