Distributed Reinforcement Learning for a Traffic Engineering Application
Mark D. Pendrith, DaimlerChrysler Research & Technology Center
Presented by: Christina Schweikert

Distributed Reinforcement Learning for a Traffic Engineering Problem
• Intelligent cruise control system
• Lane change advisory system based on traffic patterns
• Optimize a group policy by maximizing freeway utilization as a shared resource
• Introduce two new algorithms (Monte Carlo-based Piecewise Policy Iteration and Multi-Agent Distributed Q-learning) and compare their performance in this domain

Distronic Adaptive Cruise Control

• Signals come from a radar sensor that scans the full width of a three-lane motorway over a distance of approximately 100 m and recognizes any moving vehicles ahead
• The reflection of the radar impulses and the change in their frequency enable the system to calculate the distance to the vehicle ahead and the relative speed between the vehicles
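
As an aside on the underlying physics (an illustrative sketch only, not the actual Distronic implementation): the distance follows from the round-trip time of the radar pulse, and the relative speed from the Doppler shift of the reflected signal. The function names and the 77 GHz carrier frequency below are assumptions for illustration.

# Illustrative sketch only -- not the Distronic implementation.
C = 299_792_458.0  # speed of light, m/s

def radar_range_m(round_trip_time_s: float) -> float:
    """Distance to the reflecting vehicle, from the pulse's round-trip time."""
    return C * round_trip_time_s / 2.0

def relative_speed_mps(doppler_shift_hz: float, carrier_hz: float = 77e9) -> float:
    """Closing speed from the Doppler shift (positive = gap shrinking),
    assuming a ~77 GHz automotive radar carrier."""
    return doppler_shift_hz * C / (2.0 * carrier_hz)

# A ~667 ns round trip corresponds to roughly 100 m; a ~513 Hz shift to ~1 m/s.
print(radar_range_m(667e-9), relative_speed_mps(513.0))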

Distronic Adaptive Cruise Control  Distance to vehicle in front reduces - cruise control system immediately reduces acceleration or, if necessary, applies the brake  Distance increases – acts as conventional cruise control system and, at speeds of between 30 and 180 km/h, will maintain the desired speed as programmed  Driver is alerted of emergencies

Distronic Adaptive Cruise Control  Automatically maintains a constant distance to the vehicle in front of it, prevent rear-end collisions  Reaction time of drivers using Distronic is up to 40 per cent faster than that of those without this assistance system

Distributed Reinforcement Learning  State – agents within sensing range  Agents share a partially observable environment  Goal - Integrate agents’ experiences to learn an observation-based policy that maximizes group performance  Agents share a common policy, giving a homogeneous population of agents

Traffic Engineering Problem  Population of cars, each with a desired traveling speed, sharing a freeway network  Subpopulation with radar capability to detect relative speeds and distances of cars immediately ahead, behind, and around them

Problem Formulation  Optimize average per time-step reward, by minimizing the per-car average loss at each time step v d (i) desired speed of car i v a (i) actual speed of car i n number of cars in simulation at time-step
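
A minimal sketch of this metric (function and variable names are my own, not from the slides): the per-step reward is simply the negative of the average speed deficit.

def per_step_loss(desired_speeds, actual_speeds):
    """Average per-car speed deficit at one time step:
    loss_t = (1/n) * sum_i (v_d(i) - v_a(i))."""
    n = len(desired_speeds)
    return sum(vd - va for vd, va in zip(desired_speeds, actual_speeds)) / n

def per_step_reward(desired_speeds, actual_speeds):
    """Maximizing this reward minimizes the average loss."""
    return -per_step_loss(desired_speeds, actual_speeds)

# Example: three cars that want 60 mph but travel 50, 52 and 48 mph -> loss of 10.0
print(per_step_loss([60, 60, 60], [50, 52, 48]))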

State Representation  View of the world for each car represented by 8-d feature vector – relative distances and speeds of surrounding cars ALACAR CLCarCR BLBCBR

Pattern of Cars in Front of Agent (AL, AC, AR)
• 0 – lane is clear (no car in radar range, or the nearest car is faster than the agent's desired speed)
• 1 – fastest car is slower than the desired speed
• 2 – slower
• 3 – still slower

Pattern of Cars Behind Agent (BL, BC, BR)
• 0 – lane is clear (no car in radar range, or the nearest car is slower than the agent's current speed)
• 1 – slowest car is faster than the desired speed
• 2 – faster
• 3 – still faster

Lane Change (CL, CR)
• 0 – lane change not valid
• 1 – lane change valid
If there is not a safe gap both in front and behind, a lane change is illegal.
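
Putting the three preceding slides together, a sketch of how the 8-dimensional observation could be assembled (the bin edges, class names and helper structure are assumptions for illustration; the slides do not give the exact discretization thresholds):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LaneView:
    """Radar summary for one lane relative to the agent (None means the lane is clear)."""
    fastest_ahead: Optional[float] = None   # speed of the fastest car ahead in this lane
    slowest_behind: Optional[float] = None  # speed of the slowest car behind in this lane

def ahead_code(view: LaneView, desired: float, bins=(5.0, 15.0)) -> int:
    """0 = clear or faster than the desired speed; 1-3 = increasingly slower (assumed bins)."""
    v = view.fastest_ahead
    if v is None or v >= desired:
        return 0
    deficit = desired - v
    return 1 if deficit < bins[0] else 2 if deficit < bins[1] else 3

def behind_code(view: LaneView, current: float, desired: float, bins=(5.0, 15.0)) -> int:
    """0 = clear or slower than the current speed; 1-3 = increasingly faster (assumed bins)."""
    v = view.slowest_behind
    if v is None or v <= current:
        return 0
    surplus = v - desired
    return 1 if surplus < bins[0] else 2 if surplus < bins[1] else 3

def observation(lanes: List[LaneView], lane_change_ok: List[bool],
                current: float, desired: float) -> List[int]:
    """8-d feature vector: AL, AC, AR, BL, BC, BR, CL, CR."""
    return ([ahead_code(v, desired) for v in lanes] +
            [behind_code(v, current, desired) for v in lanes] +
            [int(ok) for ok in lane_change_ok])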

Monte Carlo-based Piecewise Policy Iteration
• Performs approximate piecewise policy iteration, where possible policy changes for each state are evaluated by Monte Carlo estimation
• Piecewise – the policy is changed for one state at a time, rather than for all states in parallel
• Searches the space of deterministic policies directly, without representing the value function

Policy Iteration  Start with arbitrary deterministic policy for given MDP  Generate better policy by calculating best single improvement in policy possible for each state (MC)  Combine all changes to generate successor policy  Continue until no improvement is possible – optimal policy
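
A high-level sketch of the piecewise variant described on the previous slide (evaluate_policy is an assumed helper returning a Monte Carlo estimate of average per-step reward from simulation rollouts; this is my reading of the loop, not the paper's code):

def mc_piecewise_policy_iteration(states, valid_actions, evaluate_policy, policy):
    """Improve a deterministic policy one state at a time, using Monte Carlo
    rollout estimates instead of an explicit value function."""
    improved = True
    while improved:                        # stop when no single-state change helps
        improved = False
        for s in states:
            best_a = policy[s]
            best_score = evaluate_policy(policy)
            for a in valid_actions(s):
                if a == policy[s]:
                    continue
                candidate = dict(policy)
                candidate[s] = a                     # change the policy at s only
                score = evaluate_policy(candidate)   # Monte Carlo estimate
                if score > best_score:
                    best_a, best_score = a, score
            if best_a != policy[s]:
                policy[s] = best_a                   # commit the single best change
                improved = True
    return policy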

Multi-Agent Distributed Q-Learning
Standard Q-learning:
• Q-value estimates are updated after each time step, based on the state transition observed after the action is selected
• At each time step, only one state transition and one action are used to update the Q-value estimates
In DQL, there can be as many state transitions per time step as there are agents.

Multi-Agent Distributed Q-Learning  Takes the average backup value for a state/action pair over all agents that selected action a from state s at the last time step  Q max component of backup value is calculated over actions valid for a particular agent to select at the next time-step

Simulation for Offline Learning
Advantages:
• Since the true state of the environment is known, the loss metric can be measured directly
• Can be run faster than real time, allowing many long learning trials
• Safety
Policies are learnt offline and then integrated into an intelligent cruise control system with lane advisory, route planning, etc.

Traffic Simulation Specifications
• Circular 3-lane freeway, 13.3 miles long, with 200 cars
• Half follow a "selfish drone" policy
• The rest follow the current learnt policy plus active exploration decisions
• Gaussian distribution of desired speeds, with a mean of 60 mph
• All cars have low-level collision avoidance; they differ only in lane-change strategy
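
For concreteness, the same specifications as a configuration sketch (key names and structure are my own, not from the paper):

SIMULATION_CONFIG = {
    "track": {"shape": "circular", "lanes": 3, "length_miles": 13.3},
    "num_cars": 200,
    "selfish_drone_fraction": 0.5,              # half the cars never learn
    "desired_speed": {"distribution": "gaussian", "mean_mph": 60.0},
    "collision_avoidance": "low_level",         # shared by all cars
    "lane_change_strategy": "learnt_or_drone",  # the only behaviour that differs
}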

Experimental Results
• Selfish drone policy – a consistent per-step reward corresponding to each agent traveling, on average, 11.9 mph below its desired speed
• APPIA (the Monte Carlo-based piecewise policy iteration algorithm) and DQL both found policies 3-5% better
• The best policies were found with the "look ahead" model only
• The "look behind" model provided more stable learning
• "Look behind" outperforms "look ahead" at times when a good policy is lost