Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton
Stanford University
Reinforcement Learning
[Figure: the reinforcement-learning setting, with labels Environment (state, dynamics), Agent (actions, goal), and rewards]
Motivation
- Reinforcement learning promises great things
  - Automatic task optimization
  - Without any prior information about the world
- Reinforcement learning is hard
  - The optimization goal could be arbitrary
  - Every new situation might be different
- Modify the problem slightly
  - Keep the basic, general, flexible framework
  - Allow the specification of domain knowledge
  - Don't require full specification of the problem (planning)
Our Approach
- Partial world modeling
  - Keep the partial observability
  - Allow conditional dynamics
- Known dynamics: sensor models, motion models, etc.
- Unknown dynamics: enemy movements, maps, etc.
- Flexible barrier between the two
Partially Known Markov Decision Process (PKMDP)
[DBN figure, built up over several slides:
 - unknown-dynamics state chain y0, y1, y2, ... (hidden; drawn first as s0, s1, s2, ...)
 - interface variables z0, z1, z2, ... connecting the unknown and known parts
 - known-dynamics state chain x0, x1, x2, ... (known model, unobserved)
 - observations o0, o1, o2, ... and actions a0, a1, a2, ... (observed)]
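One way to write the split that the figure suggests (the slides do not give the exact parent sets of each variable, so the conditioning below is an assumption): the trajectory distribution factors into an unknown-dynamics part over the y's and z's and a known-dynamics part over the x's and o's, which interact only through the interface z,

  p(y_{0:T}, z_{0:T}, x_{0:T}, o_{0:T} \mid a_{0:T})
    = \underbrace{p(y_{0:T}, z_{0:T} \mid a_{0:T})}_{\text{unknown, estimated from samples}}
      \cdot \underbrace{p(x_{0:T}, o_{0:T} \mid z_{0:T}, a_{0:T})}_{\text{known, handled exactly}}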
Algorithm Outline
Input:
- Set of trajectories (o, a, y, z)
- Set of policies
Output:
- Policy that maximizes expected return
Method:
- Construct a non-parametric model of the return
- Maximize with respect to the policy
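For reference, the generic building block behind "construct a non-parametric model of the return" is the standard normalized importance-sampling estimate of a policy's return from trajectories gathered under other policies; this is a textbook form, not necessarily the talk's exact estimator (the later "Total Estimate" slide replaces the raw weight and return with DBN-based quantities K and V):

  \hat{V}(\pi) = \frac{\sum_{i=1}^{N} w_i(\pi)\, R_i}{\sum_{i=1}^{N} w_i(\pi)},
  \qquad
  w_i(\pi) = \prod_{t} \frac{\pi(a_t^i \mid h_t^i)}{\pi_i(a_t^i \mid h_t^i)},

where R_i is the observed return of trajectory i, h_t^i is its observable history at time t, and \pi_i is the policy that generated it (the environment terms cancel in the ratio, which is why only policy probabilities appear).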
Algorithm Details
- Unknown dynamics
  - Use experience
  - Importance sampling
- Known dynamics
  - Use the model
  - DBN inference
  - Exact calculation: lower variance
- Maximize using conjugate gradient
  - A policy search method, but not policy gradient
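A minimal sketch of the sampled half of this recipe, assuming a tabular softmax policy over discrete observations and trajectories stored as (observation, action, reward) triples; all names here are hypothetical, and the known-dynamics DBN terms are omitted. It estimates the return of a candidate policy by normalized importance sampling and hands the negated estimate to SciPy's conjugate-gradient optimizer, matching the "policy search, not policy gradient" point above.

import numpy as np
from scipy.optimize import minimize

def action_logprob(theta, obs, act, n_obs, n_act):
    """Log pi_theta(act | obs) for a tabular softmax policy."""
    logits = theta.reshape(n_obs, n_act)[obs]
    return logits[act] - np.logaddexp.reduce(logits)

def neg_estimated_return(theta, trajectories, behavior_thetas, n_obs, n_act):
    """Negative normalized importance-sampling estimate of expected return.

    trajectories[i] is a list of (obs, act, reward) triples generated by the
    policy with parameters behavior_thetas[i].
    """
    weights, returns = [], []
    for traj, beta in zip(trajectories, behavior_thetas):
        log_w, ret = 0.0, 0.0
        for obs, act, rew in traj:
            # Importance weight: product of policy-probability ratios.
            log_w += (action_logprob(theta, obs, act, n_obs, n_act)
                      - action_logprob(beta, obs, act, n_obs, n_act))
            ret += rew
        weights.append(np.exp(log_w))
        returns.append(ret)
    weights = np.asarray(weights)
    return -np.dot(weights, returns) / weights.sum()

def improve_policy(theta0, trajectories, behavior_thetas, n_obs, n_act):
    """Policy search: conjugate-gradient ascent on the estimated return."""
    res = minimize(neg_estimated_return, theta0, method="CG",
                   args=(trajectories, behavior_thetas, n_obs, n_act))
    return res.x

In a full implementation the per-trajectory weight and value would come from inference in the known-dynamics DBN (the "exact calculation: lower variance" point) rather than from raw returns as above.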
Total Estimate
For each sample, K and V involve reasoning in the DBN:
[DBN over the variables z_t, x_t, y_t, o_t, a_t for t = 0, 1, 2]
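The slides do not spell out how K and V are combined; one natural reading (an assumption, not stated on the slide) is a normalized-weight form in which K_i plays the role of the importance weight and V_i the role of the per-sample value, both obtained by inference in the DBN above rather than read directly off the data:

  \hat{V}(\pi) = \frac{\sum_i K_i(\pi)\, V_i(\pi)}{\sum_i K_i(\pi)}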
Load-Unload Example
- 26 states, 14 observations, 4 actions
- Three versions:
  1. No world knowledge
  2. Memory dynamics known
  3. End-point & memory dynamics known
Clogged Pipe Example
- 144 states, 12 observations, 8 actions
- Three versions:
  1. Memory only
  2. Known cart control
  3. Incoming flow unknown
Conclusion
Advantages:
- Uses samples to estimate the unknown dynamics
- Uses the exact dynamics when they are known
- Allows natural specification of domain knowledge
Current work:
- Improving the gradient-ascent planner
- Exploiting structure within the known dynamics
- Removing the requirement that the interface be observable