
Q-Learning for Policy Improvement in Network Routing


1 Q-Learning for Policy Improvement in Network Routing
W. Chen & S. Meyn, Dept. of ECE & CSL, University of Illinois

2 Algorithms for dynamic routing: Visualization and Optimization
Q-learning Techniques for Net Opt (W. Chen & S. Meyn)
ACHIEVEMENT DESCRIPTION
Implementation: consensus algorithms and information distribution. Adaptation: kernel-based TD-learning and Q-learning. Integration with Network Coding projects: code around network hot-spots.
STATUS QUO (what is the state of the art and what are its limitations?): Control by learning: obtain a routing/scheduling policy for the network that is approximately optimal with respect to delay and throughput. Can these techniques be extended to wireless models?
MAIN RESULT: Q-learning for network routing: by observing the network behavior, we can approximate the optimal policy over a class of policies with restricted complexity and restricted information.
IMPACT: Near-optimal performance with a simple solution: Q-learning with steepest descent or the Newton-Raphson method via stochastic approximation. Theoretical analysis of convergence; simulation experiments are ongoing.
KEY NEW INSIGHTS: Extend to wireless? Yes: the complexity is similar to MaxWeight, and the policies are distributed and throughput optimal. Learning an approximately optimal solution by Q-learning is feasible, even for complex networks. New application: Q-learning and TD-learning for power control.
HOW IT WORKS: Step 1: Generate a state trajectory. Step 2: Learn the best value function approximation by stochastic approximation. Step 3: Routing policy: the h-MW policy derived from the value function approximation.
NEXT-PHASE GOALS: "Un-consummated union" challenge: integrate coding and resource allocation. Generally, solutions to complex decision problems should offer insight.

3 Problem
Decentralized resource allocation for a dynamic network is complex. How can we achieve near-optimal performance (in delay or power) while maintaining throughput optimality and a distributed implementation?
Motivation
Optimal control theory is intrinsically based on the associated value functions. We can approximate the value function using either a relaxation technique or other methods.
Solution
Learn an approximation of the value function over a parametric family (which can be chosen from prior knowledge) and obtain an approximately optimal routing policy for the network.

4 How to approximate the optimal function h?
h-MaxWeight method (Meyn 2008): h-MaxWeight uses a perturbation technique to generate a rich class of universally stabilizing policies. Any monotone, convex function can be used as the function h.
Relaxation-based h-MaxWeight: find the function h through a relaxation of the fluid value function.
Q-learning-based h-MaxWeight: approximate the optimal function h over a parametric family by local learning.
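A minimal sketch of the h-MaxWeight decision rule in Python, assuming a linear mean-drift model for the queue-length vector; the function name, the matrix B, the arrival vector alpha, and the candidate action set are hypothetical placeholders, not part of the original slides.

# Sketch only: h-MaxWeight picks the action whose expected drift most steeply
# decreases a user-supplied monotone, convex function h. The drift model
# drift(u) = B @ u + alpha is an illustrative assumption.
import numpy as np

def h_maxweight_action(x, actions, B, alpha, grad_h):
    """Choose the allocation u minimizing <grad h(x), B u + alpha> at state x."""
    g = np.asarray(grad_h(x))
    return min(actions, key=lambda u: float(g @ (B @ np.asarray(u) + alpha)))

# Ordinary MaxWeight is the special case h(x) = 0.5 * |x|^2, whose gradient is x:
def maxweight_action(x, actions, B, alpha):
    return h_maxweight_action(x, actions, B, alpha, grad_h=lambda y: np.asarray(y))

Any monotone, convex h can be dropped into the same rule, which is exactly the freedom the perturbation technique exploits.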

5 Find Function h by Local Learning
Approximate the value function by a general quadratic, required on a sub-region of the state space.
Architecture for choosing the approximation:
Fluid value function approximation
Diffusion value function approximation
Shortest-path information
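As an illustration of such a parametric architecture, the sketch below forms h as a weighted combination of basis functions; the helper make_h and the particular basis terms (per-queue quadratics and a cross term) are stand-ins for the fluid value function, diffusion value function, and shortest-path components listed above, not the authors' exact basis.

# Sketch only: a linear-in-parameters quadratic family h_theta(x) = sum_i theta_i * psi_i(x).
import numpy as np

def make_h(theta, basis):
    def h(x):
        return float(sum(t * psi(x) for t, psi in zip(theta, basis)))
    return h

# Illustrative two-queue basis: pure quadratics plus a cross term, the kind of
# terms a fluid or diffusion value function would contribute.
basis = [lambda x: 0.5 * x[0] ** 2,
         lambda x: 0.5 * x[1] ** 2,
         lambda x: x[0] * x[1]]
h = make_h(np.array([1.0, 1.0, 0.2]), basis)   # h([2, 3]) evaluates the approximation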

6 Q-learning based h-MaxWeight Method: Main Idea and Procedure
Goal: minimize the Bellman error over a parametric family of value function approximations.
Procedure:
Step 1: Generate a state trajectory.
Step 2: Learn the best value function approximation by stochastic approximation.
Step 3: Routing policy: the h-MW policy derived from the value function approximation.
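A minimal sketch of Steps 1-3, assuming a linear-in-parameters family h_theta(x) = theta . psi(x) and an average-cost temporal-difference form of the Bellman error; the function name fit_theta, the cost function, the step sizes, and the sampling scheme are illustrative choices, not the setup used in the experiments.

# Sketch only: fit theta by stochastic approximation (Step 2) from a state
# trajectory produced in Step 1; Step 3 then plugs h_theta into the h-MW policy.
import numpy as np

def fit_theta(trajectory, cost, basis, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    psi = lambda x: np.array([b(x) for b in basis])
    theta = np.zeros(len(basis))
    eta = 0.0                                   # running average-cost estimate
    for n in range(1, n_iter + 1):
        i = int(rng.integers(len(trajectory) - 1))
        x, x_next = trajectory[i], trajectory[i + 1]
        a_n = 1.0 / n                           # vanishing step size
        eta += a_n * (cost(x) - eta)
        # Temporal-difference (Bellman error) at this observed transition
        d = cost(x) - eta + theta @ psi(x_next) - theta @ psi(x)
        # Stochastic gradient step on 0.5 * d**2 with respect to theta
        theta -= a_n * d * (psi(x_next) - psi(x))
    return theta

The fitted h_theta(x) = theta @ psi(x) is then used exactly as in the h-MaxWeight rule sketched earlier.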

7 Define value function in a subspace:
Steepest descent method, implemented by stochastic approximation: the descent direction is approximated from the observed trajectory.
Optimal parameter; comparison with MaxWeight.
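A hedged reconstruction of the steepest-descent recursion implied here, written as a LaTeX display; the notation (mean-square Bellman error E(theta), step-size sequence a_n) is illustrative and may differ from the slide's original symbols.

% Sketch, not the original slide content: steepest descent on the mean-square
% Bellman error E(theta), with the gradient replaced by a sample-based estimate.
\[
  \theta_{n+1} \;=\; \theta_n \;-\; a_n\,\widehat{\nabla E}(\theta_n),
  \qquad \sum_n a_n = \infty, \quad \sum_n a_n^2 < \infty,
\]
where $\widehat{\nabla E}(\theta_n)$ is a stochastic-approximation estimate of
$\nabla E(\theta_n)$ computed from the observed state trajectory, so the iterates
$\{\theta_n\}$ converge to the best parameter within the chosen subspace.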

8 Example: local approximation for a simple network

9 Summary and challenges
KEY CONCLUSION: Approximate the optimal scheduling/routing policies for the network by local learning.
Challenges:
CAN WE IMPLEMENT? How do we construct implementable approximations of the optimal routing/scheduling policies?
CAN WE CODE? How do we combine the routing/scheduling policies with network coding in ITMANET?
References
S. Meyn. Stability and asymptotic optimality of generalized MaxWeight policies. SIAM J. Control Optim., 47(6):3259-3294, 2009.
W. Chen et al. Approximate Dynamic Programming using Fluid and Diffusion Approximations with Applications to Power Management. To appear in the 48th IEEE Conference on Decision and Control, 2009.
W. Chen et al. Coding and Control for Communication Networks. Invited paper, to appear in the special issue of Queueing Systems.
S. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
P. Mehta and S. Meyn. Q-learning and Pontryagin's Minimum Principle. To appear in the 48th IEEE Conference on Decision and Control, 2009.

