1
UAV Route Planning in Delay Tolerant Networks
Daniel Henkel, Timothy X Brown
University of Colorado, Boulder
Aerospace ‘07, May 8, 2007
Actual title: Route Design for UA-based Data Ferries in Delay Tolerant Wireless Networks
2
Familiar: Dial-A-Ride
Dial-A-Ride: curb-to-curb, shared-ride transportation service
The bus: receives calls, picks up and drops off passengers, minimizes overall transit time
The optimal route is not trivial!
Talk outline:
Path planning problem motivation
Problems with TSP solutions
Queueing theoretical approach
Formulation as MDP and solution methods
3
In context: Dial-A-UAV
Sparsely distributed sensors with limited radios; a UAV ferries data between the sensors and a monitoring station (figure: Sensor-1 through Sensor-6 and the monitoring station)
Complication: infinite data at sensors; potentially two-way traffic
Delay tolerant traffic! (Talk tomorrow, 8am: Sensor Data Collection)
TSP solution is not optimal
Our approach: queueing and MDP theory
4
TSP’s Problem
Traveling salesman solution: one cycle visits every node
(figure: UAV serving nodes A and B from the hub, with distances d_A, d_B, flows f_A, f_B, and visit probabilities p_A, p_B)
Problem: far-away nodes with little data to send should be visited less often
TSP has no option but to alternate visits between A and B
New: a cycle defined by visit frequencies p_i
5
Queueing Approach
Goal: minimize average delay
Idea: express the delay in terms of the p_i, then minimize over the set {p_i}, treating the p_i as a probability distribution
Expected service time of any packet: the inter-service time is exponentially distributed with mean T_i/p_i
Weighted delay: see the sketch below
(figure: UAV serving nodes A, B, C, D from the hub, with flows f_A..f_D, distances d_A..d_D, and visit probabilities p_A..p_D)
Notes: Since the inter-service time is exponentially distributed and we pick up ALL waiting packets when visiting a node, the average delay for a node is the mean of the exponential distribution. f_i/F is the fractional visit probability for node i.
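The delay expression and its minimization were shown as images on the original slide. The following is only a reconstruction from the notes above, reading T_i/p_i as node i's mean inter-service time and f_i/F as its traffic share; treat it as a sketch under those assumptions rather than the paper's exact formula.

```latex
% Average packet delay, weighting node i's mean inter-service time T_i/p_i
% by its traffic share f_i/F (assumed reading of the slide's symbols):
\bar{D}(\{p_i\}) = \sum_i \frac{f_i}{F}\,\frac{T_i}{p_i},
\qquad \text{minimized over } \{p_i\} \text{ subject to } \sum_i p_i = 1 .

% A Lagrange-multiplier argument on this objective gives visit probabilities
% that grow with the square root of (flow rate x time constant):
\frac{\partial}{\partial p_i}\Bigl[\bar{D} + \lambda \sum_j p_j\Bigr] = 0
\;\Rightarrow\; p_i \propto \sqrt{f_i\,T_i}.
```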
6
Solution and Algorithm
Probability of choosing node i for the next visit: p_i, from the delay minimization above
Implementation: deterministic algorithm (Python sketch below)
1. Set c_i = 0 for all i
2. While max{c_i} < 1: c_i = c_i + p_i for all i
3. k = argmax{c_i}
4. Visit node k; c_k = c_k - 1
5. Go to step 2
Performance improvement over TSP!
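A minimal Python sketch of the deterministic credit-accumulation algorithm above; the function name and the example probabilities are illustrative, not taken from the paper.

```python
# Sketch of the slide's deterministic visit-scheduling algorithm.
# The visit probabilities p are assumed given (e.g. from the delay
# minimization on the previous slide).

def visit_schedule(p, num_visits):
    """Return a sequence of node indices whose visit rates approximate p."""
    credits = [0.0] * len(p)                              # step 1: c_i = 0
    schedule = []
    for _ in range(num_visits):
        while max(credits) < 1.0:                         # step 2: accumulate credit
            credits = [c + pi for c, pi in zip(credits, p)]
        k = max(range(len(p)), key=lambda i: credits[i])  # step 3: argmax
        credits[k] -= 1.0                                 # step 4: visit k, pay one unit
        schedule.append(k)
    return schedule                                       # step 5: repeat

# Example: node 0 should get half of all visits, nodes 1 and 2 a quarter each.
print(visit_schedule([0.5, 0.25, 0.25], 8))               # -> [0, 0, 1, 2, 0, 0, 1, 2]
```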
7
Unknown Environment: What is RL?
Learning what to do without prior training
Given: a high-level goal; NOT how to reach it
Improving actions on the go
Distinguishing features:
Interaction with the environment
Trial-and-error search
Concept of rewards and punishments (example: training a dog)
Learns a model of the environment
8
The Framework
The Agent performs actions; the Environment gives rise to rewards and puts the Agent in situations called states. (A minimal interaction loop is sketched below.)
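A toy Python loop illustrating the agent-environment cycle described above; the five-state environment and the random policy are made up purely for illustration and are not part of the talk.

```python
# Agent <-> Environment interaction loop: the agent performs actions,
# the environment returns rewards and puts the agent in new states.
import random

ACTIONS = [1, 2]                              # placeholder action set (assumed)

def policy(state):
    """Agent: pick an action (here, uniformly at random)."""
    return random.choice(ACTIONS)

def environment(state, action):
    """Environment: produce the next state and a reward (toy dynamics)."""
    next_state = (state + action) % 5         # puts the agent in a new state
    reward = 1.0 if next_state == 0 else 0.0  # gives rise to a reward
    return next_state, reward

state, total_reward = 0, 0.0
for t in range(10):
    action = policy(state)
    state, reward = environment(state, action)
    total_reward += reward
print("reward collected over 10 steps:", total_reward)
```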
9
Elements of RL
Policy: what to do (depending on the state)
Reward: what is good
Value: what is good because it predicts reward
Model of the environment: what follows what
MDP: the next state and action depend only on the current state, not on the history of previous states and actions.
Source: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press, 1998
10
UA Path Planning - Simple
Goal: minimize average packet delay -> find p_A and p_B
(figure: UAV serving nodes A and B from the hub, with flows f_A, f_B, distances d_A, d_B, and visit probabilities p_A, p_B)
Service traffic from A and B to hub H
State: traffic waiting at the nodes, (t_A, t_B)
Actions: fly to A; fly to B
Reward: number of packets delivered
Optimal policy: how many visits to A and B; depends on flow rates and distances
(a rough Python sketch of this MDP follows)
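A rough Python sketch of this simple ferry MDP. The flow rates, leg times, and queue cap are placeholder numbers chosen for illustration, not values from the talk.

```python
# State: packets waiting at (A, B).  Actions: fly to A or fly to B.
# Reward: number of packets delivered on the visit.
FLOW = {"A": 2, "B": 1}      # packets arriving per time unit (assumed)
LEG_TIME = {"A": 1, "B": 3}  # hub -> node -> hub travel time (assumed)
CAP = 20                     # truncate queues so the state space stays finite

def step(state, action):
    tA, tB = state
    dt = LEG_TIME[action]
    # Traffic keeps accumulating at both nodes while the UAV flies this leg.
    tA = min(CAP, tA + FLOW["A"] * dt)
    tB = min(CAP, tB + FLOW["B"] * dt)
    # Visiting a node delivers everything waiting there.
    if action == "A":
        reward, tA = tA, 0
    else:
        reward, tB = tB, 0
    return (tA, tB), reward

state = (0, 0)
for action in ["A", "A", "B", "A"]:
    state, reward = step(state, action)
    print("fly to", action, "-> delivered", reward, ", now waiting", state)
```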
11
MDP
If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
the state and action sets
the one-step “dynamics”, defined by the transition probabilities and the reward expectation (standard forms below)
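The two defining quantities appeared as formulas on the original slide; in the standard notation of the Sutton and Barto text cited earlier, they are:

```latex
% One-step dynamics: transition probabilities
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}

% Expected reward for that transition
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
```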
12
RL approach to solving MDPs
Policy: a mapping from the set of states to the set of actions, π : S → A
Return: the sum of rewards from this time onwards
Value function (of a state): the expected return when starting in s and following policy π. For an MDP, these take the standard forms shown below.
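The return and value-function formulas were images on the original slide; their standard discounted Sutton–Barto forms, consistent with the definitions above, are:

```latex
% Return: (discounted) sum of rewards from time t onwards
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}

% State-value function: expected return starting from s and following \pi
V^{\pi}(s) = E_{\pi}\{\, R_t \mid s_t = s \,\}
```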
13
Bellman Equation for Policy π
Evaluating the expectation E{·}, assuming a deterministic policy π, gives the recursive Bellman equation for V^π (sketched below).
Action-value function: the value of taking action a in state s and thereafter following π; for an MDP it has the form shown below.
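The equations on this slide were images; in the Sutton–Barto notation used above, the standard forms (a reconstruction, consistent with the definitions on the previous slides) are:

```latex
% Bellman equation for the state-value function under policy \pi:
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
             \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]

% Action-value function: value of taking action a in state s, then following \pi:
Q^{\pi}(s,a) = E_{\pi}\{\, R_t \mid s_t = s,\ a_t = a \,\}
             = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```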
14
Optimality
V and Q are real valued, so value functions are partially ordered; policies π are ordered as well: π ≥ π' iff V^π(s) ≥ V^π'(s) for all states s.
Concept of V* and Q*: the optimal value functions, V*(s) = max_π V^π(s) and Q*(s,a) = max_π Q^π(s,a).
Concept of π*: the policy π which maximizes Q^π(s,a) for all states s.
15
Reinforcement Learning - Methods
To find π*, all methods try to evaluate the V/Q value functions.
Different approaches:
Dynamic programming: policy evaluation, improvement, iteration
Monte Carlo methods: decisions are taken based on averaging sample returns
Temporal-difference methods (!!) (a one-step Q-learning sketch follows)
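As a concrete example of the temporal-difference family singled out above, here is a minimal tabular one-step Q-learning sketch applied to a toy version of the two-node ferry MDP from slide 10. The environment parameters and the learning hyperparameters are assumptions for illustration only, not results from the talk.

```python
# Tabular one-step Q-learning (a temporal-difference method) on a toy
# two-node ferry model.  All numbers below are illustrative assumptions.
import random
from collections import defaultdict

FLOW, LEG_TIME, CAP = {"A": 2, "B": 1}, {"A": 1, "B": 3}, 10
ACTIONS = ["A", "B"]

def step(state, action):
    tA, tB = state
    dt = LEG_TIME[action]
    tA = min(CAP, tA + FLOW["A"] * dt)        # traffic accumulates during the leg
    tB = min(CAP, tB + FLOW["B"] * dt)
    if action == "A":
        reward, tA = tA, 0                    # deliver everything waiting at A
    else:
        reward, tB = tB, 0
    return (tA, tB), reward

Q = defaultdict(float)                        # Q[(state, action)], starts at 0
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = (0, 0)
for t in range(20000):
    # epsilon-greedy exploration
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # TD update: move Q(s, a) toward the bootstrapped one-step target
    target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    state = next_state

print("greedy action from (0, 0):", max(ACTIONS, key=lambda a: Q[((0, 0), a)]))
```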