Collaborative Reinforcement Learning Presented by Dr. Ying Lu.

Presentation transcript:

Collaborative Reinforcement Learning Presented by Dr. Ying Lu

Credits
- "Reinforcement Learning: A User's Guide", Bill Smart, ICAC 2005.
- Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems [Winner Best Paper Award].

What is RL? “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96]

Basic RL Model
1. Observe state, s_t
2. Decide on an action, a_t
3. Perform action
4. Observe new state, s_{t+1}
5. Observe reward, r_{t+1}
6. Learn from experience
7. Repeat
Goal: find a control policy that will maximize the observed rewards over the lifetime of the agent.
[Figure: the agent (A) acts on the world and observes states (S) and rewards (R) in return]
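To make the loop concrete, here is a minimal sketch of the observe-act-learn cycle in Python. It is illustrative only: the environment interface (reset/step) and the non-learning random agent are assumptions, not anything defined in these slides.

```python
# Minimal sketch of the basic RL loop; the environment interface and the
# random agent are placeholders for illustration.
import random

class RandomAgent:
    """Picks actions uniformly at random and merely records experience."""
    def __init__(self, actions):
        self.actions = actions
        self.experience = []

    def act(self, state):
        return random.choice(self.actions)            # 2. decide on an action a_t

    def learn(self, state, action, reward, next_state):
        self.experience.append((state, action, reward, next_state))   # 6. learn

def run_episode(env, agent, max_steps=100):
    state = env.reset()                               # 1. observe state s_t
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)   # 3-5. act, observe s_{t+1}, r_{t+1}
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state                            # 7. repeat
        if done:
            break
    return total_reward
```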

An Example: Gridworld
- Canonical RL domain
- States are grid cells
- 4 actions: N, S, E, W
- Reward of +1 for entering the top-right cell, a small negative reward for every other move
- Maximizing the sum of rewards → shortest path
[Figure: grid with +1 marked in the top-right cell]
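A sketch of such a gridworld as a small environment class follows. The grid size and the per-step penalty are assumed values for illustration; the slides only state that entering the top-right cell pays +1. It matches the env interface assumed by run_episode above.

```python
# Illustrative gridworld: +1 for entering the top-right cell, a small negative
# reward for every other move (grid size and step penalty are assumed values).
GOAL_REWARD = 1.0
STEP_REWARD = -0.01
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

class Gridworld:
    def __init__(self, width=5, height=5):
        self.width, self.height = width, height
        self.goal = (width - 1, height - 1)    # the top-right cell
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dx, dy = ACTIONS[action]
        x = min(max(self.state[0] + dx, 0), self.width - 1)    # stay inside the grid
        y = min(max(self.state[1] + dy, 0), self.height - 1)
        self.state = (x, y)
        if self.state == self.goal:
            return self.state, GOAL_REWARD, True    # episode ends at the goal
        return self.state, STEP_REWARD, False
```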

The Promise of RL
- Specify what to do, but not how to do it: through the reward function
- Learning "fills in the details"
- Better final solutions, based on actual experiences rather than programmer assumptions
- Less (human) time needed for a good solution

Mathematics of RL
Before we talk about RL, we need to cover some background material:
- Some simple decision theory
- Markov Decision Processes
- Value functions

Making Single Decisions
- A single decision to be made, with multiple discrete actions
- Each action has a reward associated with it
- Goal is to maximize reward
- Not hard: just pick the action with the largest reward
- State 0 has a value of 2: the sum of rewards from taking the best action from that state
[Figure: state 0 with two actions, A (reward 2) and B (reward 1)]

Markov Decision Processes
- We can generalize the previous example to multiple sequential decisions
- Each decision affects subsequent decisions
- This is formally modeled by a Markov Decision Process (MDP)
[Figure: a small MDP with states 0-5, transitions labeled with actions A and B and their rewards]

Markov Decision Processes
Formally, an MDP is:
- A set of states, S = {s_1, s_2, ..., s_n}
- A set of actions, A = {a_1, a_2, ..., a_m}
- A reward function, R: S × A × S → ℝ
- A transition function, T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
We want to learn a policy, π: S → A, that maximizes the sum of rewards we see over our lifetime.
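One way to make this concrete is to write a small deterministic MDP down as a pair of lookup tables. The encoding below is a hypothetical one (not from the slides), and the six-state example with its edge rewards is reconstructed from the Q-values quoted a few slides later, so treat the exact numbers as an assumption.

```python
# A small deterministic MDP as two dictionaries (illustrative encoding).
# States 0-5; edge rewards reconstructed from the Q-values shown later.

# (state, action) -> next state
T = {
    (0, "A"): 1, (0, "B"): 2,
    (1, "A"): 3, (1, "B"): 4,
    (2, "A"): 4,
    (3, "A"): 5,
    (4, "A"): 5,
}

# (state, action) -> immediate reward
R = {
    (0, "A"): 1, (0, "B"): 2,
    (1, "A"): 1, (1, "B"): 1,
    (2, "A"): -1000,
    (3, "A"): 1,
    (4, "A"): 10,
}

TERMINAL = {5}   # state 5 ends the episode
```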

Policies
There are 3 policies for this MDP:
1. 0 → 1 → 3 → 5
2. 0 → 1 → 4 → 5
3. 0 → 2 → 4 → 5
Which is the best one?
[Figure: the same six-state MDP as before]

Comparing Policies
Order policies by how much reward they see:
1. 0 → 1 → 3 → 5 = 1 + 1 + 1 = 3
2. 0 → 1 → 4 → 5 = 1 + 1 + 10 = 12
3. 0 → 2 → 4 → 5 = 2 - 1000 + 10 = -988
[Figure: the same six-state MDP as before]
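These sums are easy to check programmatically. Continuing the illustrative T and R tables sketched above (with their reconstructed reward values), each policy is just a path whose edge rewards are added up:

```python
def policy_return(path, T, R):
    """Sum the rewards along a state path such as [0, 1, 3, 5]."""
    total = 0
    for s, s_next in zip(path, path[1:]):
        # find the action that moves s -> s_next in this deterministic MDP
        action = next(a for ((st, a), nxt) in T.items() if st == s and nxt == s_next)
        total += R[(s, action)]
    return total

print(policy_return([0, 1, 3, 5], T, R))   # 3
print(policy_return([0, 1, 4, 5], T, R))   # 12
print(policy_return([0, 2, 4, 5], T, R))   # -988
```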

Value Functions
- We can define value without specifying the policy: specify the value of taking action a from state s and then performing optimally
- This is the state-action value function, Q
Q(0, A) = 12    Q(0, B) = -988
Q(1, A) = 2     Q(1, B) = 11
Q(2, A) = -990
Q(3, A) = 1
Q(4, A) = 10
How do you tell which action to take from each state?
[Figure: the Q-values annotated on the six-state MDP]

Value Functions
So, we have the value function Q(s, a) = R(s, a, s') + max_{a'} Q(s', a'), where s' is the next state.
In words: the next reward plus the best I can do from the next state.
These extend to probabilistic actions (take the expectation over next states).
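For a small deterministic problem this recursion can simply be iterated until it stops changing. Below is a minimal sketch that reuses the illustrative T, R and TERMINAL tables from above; it is one straightforward way to compute Q, not an algorithm named in the slides.

```python
def compute_q(T, R, terminal, sweeps=10):
    """Repeatedly apply Q(s, a) = R(s, a) + max_a' Q(s', a') until it settles."""
    Q = {sa: 0.0 for sa in T}
    for _ in range(sweeps):
        for (s, a), s_next in T.items():
            future = 0.0
            if s_next not in terminal:
                future = max(Q[(s_next, a2)] for (s2, a2) in T if s2 == s_next)
            Q[(s, a)] = R[(s, a)] + future
    return Q

Q = compute_q(T, R, TERMINAL)
# Reproduces the values on the slide, e.g. Q[(0, "A")] == 12, Q[(0, "B")] == -988.
```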

Getting the Policy
- If we have the value function, then finding the best policy is easy: π(s) = arg max_a Q(s, a)
- We're looking for the optimal policy, π*(s): no policy generates more reward than π*
- The optimal policy defines the optimal value functions
- The easiest way to learn the optimal policy is to learn the optimal value function first
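Given a Q table like the one computed in the sketch above, extracting the greedy policy is a per-state arg max (illustrative code, same assumed encoding):

```python
def greedy_policy(Q):
    """Extract pi(s) = argmax_a Q(s, a) from a state-action value table."""
    policy = {}
    for (s, a), value in Q.items():
        if s not in policy or value > Q[(s, policy[s])]:
            policy[s] = a
    return policy

print(greedy_policy(Q))   # {0: 'A', 1: 'B', 2: 'A', 3: 'A', 4: 'A'}
```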

Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill

Overview
- Building autonomic distributed systems with self-* properties: self-organizing, self-healing, self-optimizing
- Add a collaborative learning mechanism to the self-adaptive component model
- Improved ad-hoc routing protocol

Introduction
- Autonomous distributed systems will consist of interacting components free from human interference
- Existing top-down management and programming solutions require too much global state
- Instead: a bottom-up, decentralized collection of components that make their own decisions based on local information
- System-wide self-* behavior emerges from their interactions

Self-* Behavior
- Self-adaptive components change structure and/or behavior at run-time to adapt to discovered faults and reduced performance
- This requires active monitoring of component states and external dependencies

Self-* Distributed Systems using Distributed (Collaborative) Reinforcement Learning
- For complex systems, programmers cannot be expected to describe all conditions
- Self-adaptive behavior is learnt by the components
- Decentralized coordination of components supports system-wide properties
- Distributed Reinforcement Learning (DRL) is an extension of RL that uses neighbor interactions only

Model-Based Reinforcement Learning
1. Action reward
2. State transition model
3. Next-state reward
Markov Decision Process = ({States}, {Actions}, R(States, Actions), P(States, Actions, States))
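The three ingredients above combine in the standard model-based backup, where an estimated reward model and transition model are used to update Q. The sketch below shows only that generic backup; the CRL-specific terms (connection costs, cached neighbour values) appear on later slides and are not modeled here.

```python
# Generic model-based backup (illustrative):
#   Q(s, a) = R(s, a) + sum_{s'} P(s' | s, a) * max_{a'} Q(s', a')

def model_based_backup(Q, R, P, s, a, actions_in):
    """One backup of Q(s, a) from an estimated reward model R and transition model P.

    R[(s, a)]      : estimated immediate reward
    P[(s, a)]      : dict mapping next state s' -> estimated probability
    actions_in(s') : the actions available in state s'
    """
    expected_future = 0.0
    for s_next, prob in P[(s, a)].items():
        best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_in(s_next)),
                        default=0.0)
        expected_future += prob * best_next
    return R[(s, a)] + expected_future
```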

Decentralised System Optimisation
- Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
- Components have a partial system view
- Coordination actions: Actions = {delegation} ∪ {DOP actions} ∪ {discovery}
- Connection costs

Collaborative Reinforcement Learning
- Advertisement: update the partial views of neighbours
- Decay: negative feedback on state values in the absence of advertisements
- Q-values combine the action reward, the state transition model, the cached neighbour's V-value, and the connection cost
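The decay mechanism is the easiest of these to sketch: if a neighbour stops advertising, its cached V-value should gradually be trusted less. The rule below (exponential decay toward a pessimistic floor, with assumed constants) illustrates the idea only and is not the formula used in CRL.

```python
import math
import time

# Illustrative decay of cached neighbour V-values in the absence of
# advertisements; the decay rate and floor value are assumptions.
DECAY_RATE = 0.1            # per second
PESSIMISTIC_FLOOR = -100.0  # value an unheard-from neighbour drifts toward

class NeighbourCache:
    def __init__(self):
        self.cache = {}     # neighbour id -> (advertised V-value, timestamp)

    def advertise(self, neighbour, v_value):
        """Record a fresh advertisement from a neighbour."""
        self.cache[neighbour] = (v_value, time.time())

    def decayed_value(self, neighbour):
        """The cached V-value decays toward the floor as the advertisement ages."""
        v, stamp = self.cache[neighbour]
        age = time.time() - stamp
        weight = math.exp(-DECAY_RATE * age)
        return weight * v + (1.0 - weight) * PESSIMISTIC_FLOOR
```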

Adaptation in a CRL System
A feedback process responding to:
- Changes in the optimal policy of any RL agent
- Changes in the system environment
- The passing of time

SAMPLE: Ad-hoc Routing using DRL
- Probabilistic ad-hoc routing protocol based on DRL
- Adapts network traffic around areas of congestion
- Exploits stable routes
- Routing decisions are based on local information and information obtained from neighbors
- Outperforms Ad-hoc On-Demand Distance Vector routing (AODV) and Dynamic Source Routing (DSR)

SAMPLE: A CRL System (I)

SAMPLE: A CRL System (II)
Instead of always choosing the neighbor with the best Q-value, i.e., taking the delegation action a = arg max_a Q_i(B, a), a neighbor is chosen probabilistically.
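The slides do not spell out the exploration rule, so the sketch below uses a Boltzmann (softmax) choice over the neighbours' Q-values as one plausible way to "choose probabilistically": better neighbours are picked more often, but others still carry some traffic.

```python
import math
import random

def choose_neighbour(q_values, temperature=1.0):
    """Pick a neighbour probabilistically from delegation Q-values.

    q_values: dict mapping neighbour id -> Q_i(B, a) for delegating to it.
    A softmax rule is assumed here; the actual SAMPLE rule may differ.
    """
    neighbours = list(q_values)
    weights = [math.exp(q_values[n] / temperature) for n in neighbours]
    return random.choices(neighbours, weights=weights, k=1)[0]

# Example: 'n2' is chosen most of the time, but not always.
print(choose_neighbour({"n1": 1.0, "n2": 3.0, "n3": 0.5}))
```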

SAMPLE: A CRL System (III)
The transition probabilities are estimated as P_i(s' | s, a_j) = E(C_S / C_A).

SAMPLE: A CRL System (IV)

Performance
Metrics:
- Maximize throughput: the ratio of delivered packets to undelivered packets
- Minimize the number of transmissions required per packet sent
Figures 5-10

Questions/Discussions