Anticipatory Synchromodal Transportation Planning

Presentation transcript:

Anticipatory Synchromodal Transportation Planning
Martijn R.K. Mes & Arturo E. Pérez Rivera, University of Twente
INFORMS | October 24, 2017

SYNCHROMODAL TRANSPORT
In execution, synchromodal transport is similar to multimodal (or intermodal/co-modal) transport, but it is essentially different in the planning, which is made by the LSP (logistics service provider):
- Dynamic mode choice for each incoming order (mode-free booking)
- Decisions can be made at all times, even during execution, based on real-time information, e.g., water levels and traffic information
- Emphasis on the logistics network as a whole instead of separate chains, focusing on network-wide performance over time
(Figure source: European Gateway Services)

CASE STUDY: CTT NETWORK FROM THE PORT OF ROTTERDAM TO THE HINTERLAND

SYNCHROMODAL SCHEDULING: ANTICIPATORY ROUTING AND POSTPONEMENT DECISIONS

THE OPTIMIZATION PROBLEM
Input:
- Transport network: terminals, services, schedules, durations, capacities, costs, revenues, time horizon
- Current freights and probability distributions for the arrival of freights and their characteristics, for each period of the horizon
Output:
- Expected profit for each state
- Scheduling policy: given the current state, which service to use for each freight, for each period of the horizon
State at time t: $S_t = [F_{i,d,r,k,t}]\ \forall i,d,r,k$, where $F_{i,d,r,k,t}$ is the number of freights at (or in transit to) terminal i, having destination d, release day r (relative to t), and time window k (relative to r).
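To make the state definition concrete, the sketch below shows one possible sparse encoding of $S_t$ as a map keyed by (terminal, destination, release day, time window); all names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical sparse encoding of the state S_t = [F_{i,d,r,k,t}]:
# a map from (terminal i, destination d, release day r, time window k)
# to the number of freights with those attributes.
def empty_state():
    return defaultdict(int)

def add_freight(state, terminal, destination, release_day, time_window, count=1):
    """Register `count` freights at (or in transit to) `terminal`."""
    state[(terminal, destination, release_day, time_window)] += count

def freights_at(state, terminal):
    """Total number of freights currently at (or in transit to) a terminal."""
    return sum(n for (i, _, _, _), n in state.items() if i == terminal)

# Example
s = empty_state()
add_freight(s, "Rotterdam", "Hengelo", release_day=0, time_window=3)
add_freight(s, "Rotterdam", "Duisburg", release_day=2, time_window=5, count=2)
print(freights_at(s, "Rotterdam"))  # 3
```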

MARKOV DECISION PROCESS (MDP) MODEL
The three curses of dimensionality:
- Many states
- Many possible demand realizations
- Many decisions
Solving the MDP exactly is therefore intractable, which motivates the use of Approximate Dynamic Programming (ADP).
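For reference, a generic statement of the finite-horizon optimality equations behind such an MDP model (using the symbols $S_t$, $x_t$, and $R_t$ that appear on the later slides; the terminal value is taken to be zero purely for illustration):

$$ V_t(S_t) = \max_{x_t \in \mathcal{X}(S_t)} \Big( R_t(S_t, x_t) + \mathbb{E}\big[\, V_{t+1}(S_{t+1}) \mid S_t, x_t \,\big] \Big), \qquad V_T(S_T) = 0. $$

Each curse appears here: the state space over which $V_t$ must be computed (many states), the expectation over new freight arrivals (many demand realizations), and the maximization over joint scheduling decisions (many decisions).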

APPROXIMATE DYNAMIC PROGRAMMING (basic structure, not what we use)
The generic ADP iteration combines four ingredients (the labels on the slide's flow diagram): pure exploitation when selecting decisions, deterministic optimization of the decision problem around the post-decision state, statistics to update the value function estimates, and simulation to sample the exogenous information.
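A minimal sketch of this basic ADP loop (not the authors' algorithm): a forward pass over the horizon with greedy, pure-exploitation decisions against the current value function approximation, and a smoothing update of the estimates along the way. The lookup-table VFA and the function names are illustrative assumptions.

```python
def adp_iteration(T, initial_state, decisions, transition, reward,
                  sample_exogenous, V, alpha=0.1):
    """One forward pass of a basic ADP algorithm.
    V: dict mapping (t, post_decision_state) -> value estimate."""
    state = initial_state
    prev_post = None  # post-decision state visited at t-1
    for t in range(T):
        # Deterministic optimization + pure exploitation: choose the decision
        # maximizing immediate reward plus the estimated downstream value.
        best_val, best_post = float("-inf"), None
        for x in decisions(state, t):
            post = transition(state, x, t)          # deterministic post-decision state
            val = reward(state, x, t) + V.get((t, post), 0.0)
            if val > best_val:
                best_val, best_post = val, post
        # Statistics: use the sampled value to update the estimate of the
        # previously visited post-decision state.
        if prev_post is not None:
            key = (t - 1, prev_post)
            V[key] = (1 - alpha) * V.get(key, 0.0) + alpha * best_val
        prev_post = best_post
        # Simulation: sample the exogenous information (e.g., new freight
        # arrivals) to obtain the next pre-decision state.
        state = sample_exogenous(best_post, t)
    return V
```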

EXPLORATION VS EXPLOITATION
Result of pure exploitation: bring each freight to the nearest terminal and keep it there until it has to be taken by truck to its destination.
It is necessary to explore... but how? (when, what, for how long?)
Techniques from Optimal Learning might help here: efficient collection of information, where the value of information is the expected improvement in future decision quality.
- Dearden et al. (1999). Model-based Bayesian exploration.
- Gupta, S. and Miescke, K. (1996). Bayesian look-ahead one-stage sampling allocations for selection of the best population.
- Frazier et al. (2008). A Knowledge-Gradient Policy for Sequential Information Collection.
(Source of artwork: Dan Klein and Pieter Abbeel, Reinforcement Learning (2013), University of California)

PRINCIPLE: VALUE OF PERFECT INFORMATION (VPI)
Assume you can make only one measurement, after which you have to make a final choice (the implementation decision). What measurement would you make now to maximize the expected value of the implementation decision? Only a change in the estimates that produces a change in the decision has value.
[Figure: five options with estimated values; an observation of option 5 gives an updated estimate of its value, and the change in the estimated value of option 5 due to that measurement may change which option is ultimately chosen.]
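As a concrete illustration of this principle (a standard knowledge-gradient computation from the optimal-learning literature, e.g., Frazier et al. (2008), not code from this talk), the expected value of measuring one alternative under independent normal beliefs can be computed as follows; the variable names are ours.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def knowledge_gradient(means, variances, noise_var, x):
    """Expected improvement in the final decision from one noisy measurement
    of alternative x (independent normal beliefs, known measurement noise)."""
    # Std. dev. of the change in the posterior mean of x after one measurement.
    sigma_tilde = variances[x] / math.sqrt(variances[x] + noise_var)
    # Best competing estimate if we do NOT measure x.
    best_other = max(m for i, m in enumerate(means) if i != x)
    zeta = -abs(means[x] - best_other) / sigma_tilde
    return sigma_tilde * (zeta * normal_cdf(zeta) + normal_pdf(zeta))

# Example: five options; measuring option 5 (index 4) has value only insofar
# as the observation could change which option we finally pick.
means = [3.0, 4.0, 4.5, 2.0, 4.4]
variances = [1.0, 1.0, 1.0, 1.0, 4.0]
print(knowledge_gradient(means, variances, noise_var=1.0, x=4))
```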

CHALLENGES
- The optimal-learning literature is difficult to apply directly due to the presence of a physical state (state-dependent decisions).
- We need to learn the value of features/functions instead of individual states: Ryzhov, I.O., et al. (2017). Bayesian exploration for approximate dynamic programming.
- Challenges for the (time-dependent) finite-horizon setting:
  - Decisions have an impact on the value of states on the downstream path (we learn what we measure).
  - Decisions have an impact on the value of states on the upstream path (with on-policy control).
A sketch of what "learning the value of features" can look like follows below.
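One way to read "learn the value of features/functions instead of states" is a linear value function approximation over basis functions with a normal belief on the weights; the sketch below is our illustrative rendering under that assumption, not the authors' exact model. The `value_variance` method returns the kind of state-value uncertainty that reappears later in the modified noise terms.

```python
import numpy as np

class BayesianLinearVFA:
    """Linear VFA V(S) ~ phi(S)·theta with a multivariate normal belief on theta.
    Observing a sampled value v_hat for a state with features phi updates the
    belief in closed form (recursive Bayesian linear regression)."""

    def __init__(self, num_features, prior_var=100.0, noise_var=1.0):
        self.theta = np.zeros(num_features)           # posterior mean of weights
        self.cov = prior_var * np.eye(num_features)   # posterior covariance
        self.noise_var = noise_var                    # observation noise variance

    def value(self, phi):
        """Point estimate of the value of a state with feature vector phi."""
        return float(phi @ self.theta)

    def value_variance(self, phi):
        """Uncertainty about V(S): prior variance of the value estimate."""
        return float(phi @ self.cov @ phi)

    def update(self, phi, v_hat):
        """Update the belief after observing the sampled value v_hat."""
        denom = self.noise_var + phi @ self.cov @ phi
        gain = (self.cov @ phi) / denom
        self.theta = self.theta + gain * (v_hat - phi @ self.theta)
        self.cov = self.cov - np.outer(gain, phi @ self.cov)
```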

CHALLENGES (illustrated)
[Figure series: a grid of locations A, B, C, D (vertical axis) against time t-1, t, t+1, t+2 (horizontal axis), drawn for iterations n and n+1. The animation makes the following points:]
- At time t we can decide to move to state A, B, C, or D, e.g., the decision to "visit" C_t.
- Such a decision results in an update of V(B_{t-1}) and, eventually, of V(C_t).
- We want to incorporate the value of information into these decisions.
- The value of information might depend on the direct costs of going there.
- An exploration decision might result in a deterioration of the VFA.
- This process continues until the end of the horizon.

VPI MODIFICATIONS
Exploration decisions (four variants):
- $x_t^{n,E1} = \arg\max_{x_t^n} \; \upsilon_t^{E,n}\big(K_t^n, S_t^{x,n}, x_t^n\big)$ → offline learning
- $x_t^{n,E2} = \arg\max_{x_t^n} \; \bar{V}_t^{x,n}\big(S_t^{x,n}\big) + \upsilon_t^{E,n}(\cdot)$
- $x_t^{n,E3} = \arg\max_{x_t^n} \; R_t\big(S_t^{x,n}, x_t^n\big) + \bar{V}_t^{x,n}(\cdot) + \upsilon_t^{E,n}(\cdot)$ → online learning
- $x_t^{n,E4} = \arg\max_{x_t^n} \; (1-\alpha^n)\big[R_t\big(S_t^{x,n}, x_t^n\big) + \bar{V}_t^{x,n}(\cdot)\big] + \alpha^n \, \upsilon_t^{E,n}(\cdot)$
Updating the VFA/belief with a modified noise term (four variants):
- $\sigma_t^{2,E1} = \eta^E$ → constant noise
- $\sigma_t^{2,E2} = \frac{T^{\max}-t}{T^{\max}} \, \eta^E$ → noise decreasing linearly in t
- $\sigma_t^{2,E3} = \sigma_t^{2,n}\big(S_t^{x,n}\big)$ → uncertainty of $S_t^{x,n}$ (prior variance of $\bar{V}_t^n(S_t^{x,n})$)
- $\sigma_t^{2,E4} = \frac{T^{\max}-t}{T^{\max}} \, \eta^E + \sigma_t^{2,n}\big(S_t^{x,n}\big)$
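A schematic rendering of these modifications in code (our own sketch under the notation above; the VPI bonus `vpi`, the VFA estimate `V`, and the parameter names are stand-ins for the quantities on the slide):

```python
def exploration_score(R, V, vpi, variant, alpha=0.5):
    """Score used to select the exploration decision x_t^{n,E*}.
    R: direct reward R_t(S^x, x); V: VFA estimate of the post-decision state;
    vpi: value-of-information bonus; variant in {"E1", "E2", "E3", "E4"}."""
    if variant == "E1":   # offline learning: VPI bonus only
        return vpi
    if variant == "E2":   # downstream value + VPI bonus
        return V + vpi
    if variant == "E3":   # online learning: reward + value + VPI bonus
        return R + V + vpi
    if variant == "E4":   # weighted trade-off between exploiting and exploring
        return (1 - alpha) * (R + V) + alpha * vpi
    raise ValueError(variant)

def update_noise_variance(t, T_max, eta, prior_var_of_value, variant):
    """Noise variance sigma_t^2 used when updating the belief about the VFA."""
    if variant == "E1":   # constant noise
        return eta
    if variant == "E2":   # linearly decreasing in t
        return (T_max - t) / T_max * eta
    if variant == "E3":   # uncertainty about the post-decision state's value
        return prior_var_of_value
    if variant == "E4":   # combination of E2 and E3
        return (T_max - t) / T_max * eta + prior_var_of_value
    raise ValueError(variant)
```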

NUMERICAL EXPERIMENTS
- Various network instances.
- Restricted policies RP1 and RP2, with sizes of 0.01% and 0.02% of the original decision space (2 freights at each terminal already results in 2.6×10^8 possible decisions).
- Benchmark heuristic: use an intermodal service for a freight if the cost difference between the cheapest and second-cheapest intermodal path covers the setup costs of the first (see the sketch below).
- Two experimental phases: tuning and benchmark experiments.
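A literal reading of the benchmark rule in code (our paraphrase; the cost inputs are assumed, and the action taken when the condition fails is not specified on the slide):

```python
def use_intermodal(intermodal_path_costs, setup_cost_of_cheapest):
    """Benchmark heuristic (paraphrased): commit a freight to the cheapest
    intermodal path only if its cost advantage over the second-cheapest
    intermodal path covers the setup costs of that cheapest path."""
    if len(intermodal_path_costs) < 2:
        return False
    ranked = sorted(intermodal_path_costs)
    return ranked[1] - ranked[0] >= setup_cost_of_cheapest
```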

TUNING EXPERIMENTS [1/2]
- The best ratio of the two tunable parameters (measurement noise over initial covariance) is $10^4$, in line with the literature.
- Our VPI modifications pay off:
  - Exploration decision: include the downstream rewards.
  - Belief update: use a noise term equal to the variance of $\bar{V}_t^n(S_t^{x,n})$.

TUNING EXPERIMENTS [2/2]
- Learned rewards: the estimated value of the initial states (the estimated performance of the resulting policy).
- Realized rewards: the actual rewards resulting from a simulation of the resulting policy.

BENCHMARK EXPERIMENTS
- Benchmark without a restricted decision space
- Benchmark with a restricted decision space

TO REMEMBER...
- We designed an ADP algorithm and a VFA to derive a policy that supports the scheduling of freight in synchromodal transport.
- VPI significantly improves the performance of ADP, both in terms of the learned values and the resulting policy.
- To apply VPI in a finite-horizon ADP with basis functions, exploring and updating should be done slightly more conservatively than in conventional infinite-horizon VPI.
- For larger networks, further research on reducing the decision space is necessary for ADP to achieve the largest gains over competing policies in synchromodal transport.

QUESTIONS?
Contact: Martijn Mes
Associate professor, University of Twente
School of Management and Governance
Dept. of Industrial Engineering and Business Information Systems
Phone: +31-534894062
Email: m.r.k.mes@utwente.nl
Web: https://www.utwente.nl/bms/iebis/staff/mes/