UAV Route Planning in Delay Tolerant Networks


UAV Route Planning in Delay Tolerant Networks
Daniel Henkel, Timothy X Brown
University of Colorado, Boulder
Infotech @ Aerospace '07, May 8, 2007
Actual title: Route Design for UA-based Data Ferries in Delay Tolerant Wireless Networks

Familiar: Dial-A-Ride. Dial-A-Ride is a curb-to-curb, shared-ride transportation service: the bus receives calls, picks up and drops off passengers, and tries to minimize overall transit time. Even in this familiar setting the optimal route is not trivial. Talk outline: path planning problem motivation, problems with TSP solutions, a queueing-theoretic approach, and formulation as an MDP with solution methods.

In context: Dial-A-UAV. Sparsely distributed sensors with limited radios send delay-tolerant traffic toward a monitoring station, and the UAV ferries the data (figure: six sensors, Sensor-1 through Sensor-6, around a monitoring station). Complications: effectively infinite data at the sensors and potentially two-way traffic. The TSP solution is not optimal here; our approach uses queueing theory and MDP theory. (Related talk tomorrow at 8am: Sensor Data Collection.)

TSP's Problem. The traveling salesman solution uses one cycle that visits every node (figure: UAV serving nodes A and B from a hub, with distances d_A, d_B, flows f_A, f_B, and visit probabilities p_A, p_B). Problem: far-away nodes with little data to send should be visited less often, but TSP has no option other than alternately visiting A and B. New idea: a cycle defined by visit frequencies p_i.

Queueing Approach. Goal: minimize average delay. Idea: express the delay in terms of the visit probabilities p_i, treated as a probability distribution over nodes, then minimize over the set {p_i} (figure: nodes A, B, C, D around a hub, each with flow f_i, distance d_i, and visit probability p_i). The inter-service time of node i is exponentially distributed with mean T_i / p_i; since all waiting packets are picked up when a node is visited, the average delay for a node is the mean of that exponential distribution. Weighting each node by its fraction of the total traffic, f_i / F, gives the weighted delay D = \sum_i (f_i / F)(T_i / p_i). The distribution {p_i} that minimizes this expression is sketched below.
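One way to carry out the minimization, shown here as a reconstruction from the stated objective and the normalization constraint rather than a formula quoted from the slides:

```latex
% Average delay as a function of the visit distribution p, with \sum_i p_i = 1:
\[
  D(p) \;=\; \sum_i \frac{f_i}{F}\,\frac{T_i}{p_i} .
\]
% Adding a Lagrange multiplier for the normalization constraint and
% setting the derivative with respect to p_i to zero:
\[
  -\frac{f_i T_i}{F\,p_i^{2}} + \lambda = 0
  \quad\Longrightarrow\quad
  p_i \propto \sqrt{f_i T_i}
  \quad\Longrightarrow\quad
  p_i = \frac{\sqrt{f_i T_i}}{\sum_j \sqrt{f_j T_j}} .
\]
```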

Solution and Algorithm. Probability of choosing node i for the next visit: p_i (from the delay minimization above). Implementation as a deterministic algorithm:
1. Set c_i = 0 for every node i.
2. While max{c_i} < 1, update c_i = c_i + p_i for every i.
3. k = argmax_i {c_i}.
4. Visit node k; set c_k = c_k - 1.
5. Go to step 2.
This yields a performance improvement over TSP. A sketch of the schedule in code follows below.
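A minimal sketch of this deterministic schedule in Python; the function name, the example probabilities, and the fixed visit count are illustrative assumptions, not taken from the talk:

```python
def visit_schedule(p, num_visits):
    """Deterministic schedule realizing visit probabilities p:
    credit every node by p[i] per round, then visit the node whose
    accumulated credit is largest and pay its credit back down by 1."""
    c = [0.0] * len(p)                 # per-node credit counters (step 1)
    schedule = []
    for _ in range(num_visits):
        while max(c) < 1.0:            # step 2: accumulate credit
            c = [ci + pi for ci, pi in zip(c, p)]
        k = max(range(len(c)), key=c.__getitem__)   # step 3: most overdue node
        schedule.append(k)             # step 4: visit node k
        c[k] -= 1.0
    return schedule

# Example: node 0 should be visited about twice as often as node 1.
print(visit_schedule([2/3, 1/3], 9))
```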

Unknown Environment: What is RL? Reinforcement learning is learning what to do without prior training: the agent is given a high-level goal, but not how to reach it, and it improves its actions on the go. Distinguishing features: interaction with the environment, trial-and-error search, and the concept of rewards and punishments. Example: training a dog. Along the way the agent can also learn a model of its environment.

The Framework. The agent performs actions on the environment; the environment gives rise to rewards and puts the agent in situations called states.

Elements of RL. Policy: what to do, depending on the state. Reward: what is good. Value: what is good because it predicts reward. Model of the environment: what follows what. Markov property: the effect of an action does not depend on previous states or actions, only on the current state. Source: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

UA Path Planning - Simple. The UAV services traffic from nodes A and B to hub H (figure: nodes A and B at distances d_A, d_B from the hub, with flows f_A, f_B and visit probabilities p_A, p_B). Goal: minimize average packet delay, i.e., find p_A and p_B. State: the traffic waiting at the nodes, (t_A, t_B). Actions: fly to A; fly to B. Reward: the number of packets delivered. The optimal policy determines the number of visits to A and B, which depends on the flow rates and distances. A toy version of this formulation is sketched below.
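A small simulation-style model of this formulation in Python; the arrival rates, travel times, queue cap, and interface are illustrative assumptions rather than parameters from the talk:

```python
import random

# State = (packets waiting at A, packets waiting at B), capped so the
# state space stays finite. Actions = fly to A or fly to B from the hub.
# Reward = packets delivered on that trip.

ARRIVAL_PROB = {"A": 0.6, "B": 0.2}   # chance of a new packet per time step (assumed)
TRAVEL_TIME  = {"A": 2,   "B": 5}     # round-trip time hub -> node -> hub (assumed)
MAX_QUEUE    = 10                     # cap on waiting packets per node (assumed)

def step(state, action):
    """Fly to `action` ('A' or 'B'): traffic accumulates during the flight,
    then everything waiting at the visited node is picked up and delivered."""
    waiting = dict(zip("AB", state))
    for _ in range(TRAVEL_TIME[action]):
        for node in "AB":
            if random.random() < ARRIVAL_PROB[node]:
                waiting[node] = min(MAX_QUEUE, waiting[node] + 1)
    reward = waiting[action]           # packets delivered to the hub
    waiting[action] = 0
    return (waiting["A"], waiting["B"]), reward

# Example: start with empty queues and fly to A.
print(step((0, 0), "A"))
```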

MDP. If a reinforcement learning task has the Markov property, it is basically a Markov decision process (MDP); if the state and action sets are finite, it is a finite MDP. To define a finite MDP you need to give: the state and action sets, and the one-step dynamics, i.e., the transition probabilities P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a } and the reward expectations R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }.

RL approach to solving MDPs. Policy: a mapping from the set of states to the set of actions, π : S → A. Return: the discounted sum of rewards from this time onwards, R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... . Value function of a state: the expected return when starting in s and following policy π; for an MDP, V^π(s) = E_π{ Σ_{k≥0} γ^k r_{t+k+1} | s_t = s }.

Bellman Equation for Policy π. Evaluating the expectation, and assuming a deterministic policy π, the value function satisfies V^π(s) = Σ_{s'} P^{π(s)}_{ss'} [ R^{π(s)}_{ss'} + γ V^π(s') ]. Action-value function: the value of taking action a in state s and then following π; for an MDP, Q^π(s,a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]. A sketch of iterative policy evaluation built on this equation follows below.
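As one concrete use of the Bellman equation, here is a sketch of iterative policy evaluation on a generic finite MDP; the table-based representation and all names are my own choices, not code from the talk:

```python
def evaluate_policy(states, P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation: repeatedly apply the Bellman backup
    V(s) <- sum_{s'} P(s'|s, pi(s)) * [ R(s, pi(s), s') + gamma * V(s') ]
    until the value function stops changing.
    P[s][a] is a dict {s': probability}; R[s][a][s'] is the expected reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(prob * (R[s][a][s2] + gamma * V[s2])
                        for s2, prob in P[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```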

Optimality. Because V and Q are real-valued, they induce a partial ordering: a policy π is at least as good as π' when V^π(s) ≥ V^{π'}(s) for all states s. The optimal value functions are V*(s) = max_π V^π(s) and Q*(s,a) = max_π Q^π(s,a). The optimal policy π* is the policy that maximizes Q^π(s,a) for all states s.

Reinforcement Learning - Methods. To find π*, all methods try to evaluate the V/Q value functions. Different approaches: dynamic programming (policy evaluation, improvement, and iteration), Monte Carlo methods (decisions are taken based on averaging sample returns), and, in particular, temporal-difference methods; a minimal temporal-difference example follows below.
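For the temporal-difference family singled out above, a minimal tabular Q-learning loop; the env.reset()/env.step() interface, parameter values, and epsilon-greedy exploration are illustrative assumptions, not material from the talk:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions only,
    without a model of the transition probabilities or rewards.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)                       # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            target = r + gamma * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # one-step TD update
            s = s2
    return Q
```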