A Reinforcement Learning Algorithm for the Cognitive Network Optimiser (CNO) in the 5G-MEDIA Project
Morteza Kheirkhah, University College London
September 2019, 5G-MEDIA Coding Days

Topics
- Machine learning techniques
- Reinforcement learning (RL) algorithm
- RL vs other ML techniques (e.g. supervised and semi-supervised)
- RL with offline vs online learning
- RL variants (e.g. A3C and DQN)
- Training A3C with parallel agents
- Key metrics (learning rate, loss function, and entropy)
- CNO design for UC2 (state, action, reward)
- Hands-on exercises

Machine learning techniques
- Supervised: needs labeled datasets to train a model
- Unsupervised: does not need labeled datasets, but still needs datasets
- Semi-supervised: works with partially labeled datasets
- Reinforcement learning: does not necessarily need datasets for training; the agent can learn and behave appropriately towards a set of objectives in an online fashion

RL key terminology
- Action (A): all the possible decisions that the agent can take.
- State (S): the current situation returned by the environment, defined by a set of parameters.
- Reward (R): an immediate return sent back from the environment to evaluate the last action.
- Policy (π): the strategy the agent employs to determine the next action based on the current state.
- Value (V): the expected long-term return with discount, as opposed to the short-term reward R. Vπ(s) is defined as the expected long-term return of the current state s under policy π (written out below).
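For reference, the standard discounted-return definition behind Vπ(s), not spelled out on the slide, is:

```latex
% Value of state s under policy \pi, with discount factor \gamma \in [0, 1)
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right]
```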

Reinforcement learning algorithm
An RL agent takes an action (in a particular condition) towards a particular goal and then receives a reward for it. Over time, the agent adjusts its actions (behaviour) to achieve a higher reward. To make choices, the agent relies both on what it has learned from past feedback and on exploration of new tactics (actions) that may yield a larger payoff. This implies that the agent's overall objectives must be reflected in the reward function.
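A minimal sketch of this action-reward loop is shown below; the `env` and `agent` objects and their Gym-like `reset`/`step` interface are illustrative assumptions, not the CNO implementation.

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
# `env` and `agent` are hypothetical objects following a Gym-like interface.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)              # policy mixes exploitation and exploration
        next_state, reward, done = env.step(action)      # environment returns feedback
        agent.update(state, action, reward, next_state)  # learn from the immediate reward
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```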

RL vs other ML techniques
- Unlike supervised/unsupervised learning algorithms, which require diverse datasets, an RL algorithm requires diverse environments to be trained in.
- Exposing an RL algorithm to a large number of conditions in the real environment is a time-consuming process.
- It is therefore typical to train an RL algorithm in a simulated or emulated environment.

RL with offline vs online learning
There are several ways to deploy an RL algorithm in a real system:
- Deploy the algorithm immediately in the real environment and let it learn in an online fashion.
- Train the algorithm offline by simulating the real environment, then deploy the trained model in a real system with no further learning.
- Train the model offline, then deploy it in a real system and continue learning afterwards.

RL variants
- Asynchronous Advantage Actor-Critic (A3C)*
- Tabular Q-Learning*
- Deep Q-Network (DQN)
- Deep Deterministic Policy Gradient (DDPG)
- REINFORCE
- State-Action-Reward-State-Action (SARSA)
- …

RL variants: Tabular Q-Learning
- All possible states and actions are stored in a table (sketched below).
- A possible drawback is that the algorithm does not scale well when the state-action space is very large.
- One solution is to shrink the state space by making simplifications that often translate to unrealistic conditions, for example assuming that the next action can be decided from the current state alone. This often results in poor prediction of the network condition, and in turn the wrong action may be selected.
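A minimal tabular Q-learning sketch follows; the learning rate, discount factor, and epsilon values are illustrative, and the state/action types are assumed to be small discrete sets.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: every (state, action) pair gets its own table entry,
# which is why the approach stops scaling once the state-action space is large.
Q = defaultdict(float)                    # Q[(state, action)] -> estimated long-term value
alpha, gamma, epsilon = 0.1, 0.99, 0.1    # illustrative hyper-parameters

def select_action(state, actions):
    # Epsilon-greedy: explore occasionally, otherwise exploit the table.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    # Move the table entry towards the reward plus the best value of the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```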

RL variants: Asynchronous Advantage Actor-Critic (A3C)
- A reinforcement learning algorithm that uses neural networks rather than explicit state-action tables.
- A3C runs two separate neural networks simultaneously: an actor network and a critic network.
- The critic gives feedback to the actor about the action that has been taken, allowing the actor to take a better action in a given state.
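A rough sketch of the two networks and the advantage signal the critic feeds back to the actor is given below. This is a simplified single-step version using PyTorch-style modules, not the CNO's exact architecture; the real A3C loss also includes an entropy term and multi-step returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network: maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)   # pi(a | s)

class Critic(nn.Module):
    """Value network: estimates how good a state is under the current policy."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state)                      # V(s)

def advantage(reward, state, next_state, critic, gamma=0.99):
    # The critic's feedback: how much better the taken action was than expected.
    return reward + gamma * critic(next_state).detach() - critic(state)
```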

Training A3C with parallel agents
- A3C can be trained with parallel agents asynchronously, and thus learns more efficiently and quickly (by several orders of magnitude).
- The agents continually send their local information (e.g. state, action, and reward) to a central agent, which aggregates all the information and updates a single learning model. The model is then propagated back to all agents asynchronously.
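The sketch below illustrates the pattern of several workers pushing gradients into one shared model. The toy loss is a dummy stand-in for the real A3C actor-critic loss, and the lock is a simplification (A3C proper applies updates lock-free); network sizes and hyper-parameters are illustrative.

```python
import threading
import torch
import torch.nn as nn

# One shared model updated asynchronously by several worker threads.
shared_model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(shared_model.parameters(), lr=1e-3)
lock = threading.Lock()

def worker(worker_id, steps=100):
    local_model = nn.Linear(4, 2)
    for _ in range(steps):
        local_model.load_state_dict(shared_model.state_dict())  # pull latest shared weights
        local_model.zero_grad()
        state = torch.randn(1, 4)                  # stand-in for a real observation
        loss = local_model(state).pow(2).mean()    # stand-in for the A3C loss
        loss.backward()
        with lock:
            # Push this worker's gradients into the shared model and update it.
            for lp, sp in zip(local_model.parameters(), shared_model.parameters()):
                sp.grad = lp.grad.clone()
            optimizer.step()
            optimizer.zero_grad()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```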

Key metrics: Learning rate
- The amount by which the weights are updated during training is referred to as the "learning rate". It controls how quickly or slowly a neural network model learns a problem.
- During training, backpropagation estimates how much of the error each weight in the network is responsible for. Instead of updating a weight by the full amount, the update is scaled by the learning rate.
- For example, a learning rate of 0.1 means each weight is updated by 0.1 × (estimated weight error), i.e. 10% of the estimated error, on every update.
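In code, the scaled update described above looks roughly like this (plain gradient descent with illustrative numbers):

```python
# Plain gradient-descent weight update, scaled by the learning rate.
learning_rate = 0.1

def update_weight(weight, estimated_weight_error):
    # Only 10% of the estimated error is applied per update when learning_rate = 0.1.
    return weight - learning_rate * estimated_weight_error

# Example: an error estimate of 0.5 moves the weight by 0.05, not by the full 0.5.
print(update_weight(1.0, 0.5))   # -> 0.95
```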

Key metrics: Loss function
- A loss function measures the discrepancy between the target (expected) output and the computed output after a training example has propagated through the network.
- The loss is small when the computed and expected outputs are close.
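For instance, a mean-squared-error loss, shown here as one common choice rather than the CNO's actual loss:

```python
def mse_loss(expected, computed):
    # Small when the computed outputs are close to the expected (target) outputs.
    return sum((e - c) ** 2 for e, c in zip(expected, computed)) / len(expected)

print(mse_loss([1.0, 0.0], [0.9, 0.1]))   # small discrepancy -> small loss (0.01)
```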

Key metrics: Entropy
- Entropy can be thought of as a measure of surprise, i.e. of how puzzled the agent is: for example, when in a particular state a large set of actions all have very similar probabilities, the agent should explore other possibilities to get a better picture of its state-action relationship.
- When something surprising happens, entropy should increase; since entropy enters the loss function with a negative sign, this ensures that the agent explores other possibilities rather than converging quickly on a particular action for a given state.
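A small sketch of the policy entropy term follows (the probability values are illustrative): entropy is high when many actions share similar probabilities and low when the policy has already collapsed onto one action.

```python
import math

def policy_entropy(action_probs):
    # H(pi) = -sum_a p(a) * log p(a); higher when the agent is "puzzled".
    # In A3C this term is subtracted from the loss to encourage exploration.
    return -sum(p * math.log(p) for p in action_probs if p > 0)

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform policy -> maximal entropy (~1.386)
print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # confident policy -> low entropy (~0.168)
```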

CNO goals for UC2 (Smart + Remote Media Production)
- The goal is to maximize the Quality of Experience (QoE) perceived by the user.
- To achieve this goal we follow two key objectives:
  - Reduce packet drops by selecting the video quality level appropriately for the current network condition.
  - Change the video quality smoothly; in other words, minimize jumps from a very high video quality (bitrate) level to a very low one.

CNO design for UC2
- State:
  - Bitrate (video quality) of the last chunk
  - History of available network capacity
  - Loss rate incurred for the past K chunks
- Action: video quality for the next chunk
- Reward: bitrate − α · (loss rate) − β · (smoothness)
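A hedged sketch of how this per-chunk reward might be computed is shown below; the weights ALPHA and BETA and the choice of absolute bitrate difference as the smoothness term are illustrative assumptions, and the actual values live in the CNO source code linked on the next slide.

```python
# Illustrative UC2 reward: reward video quality, penalise loss and abrupt quality jumps.
ALPHA = 1.0   # weight on loss rate (illustrative)
BETA = 1.0    # weight on quality smoothness (illustrative)

def cno_reward(bitrate, lossrate, previous_bitrate):
    smoothness_penalty = abs(bitrate - previous_bitrate)   # one possible smoothness measure
    return bitrate - ALPHA * lossrate - BETA * smoothness_penalty
```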

Time to get your hands dirty with code
No more theory! The CNO source code is available at the following link:
https://github.com/mkheirkhah/5gmedia/tree/deployed/cno

5G-MEDIA Project Meeting in Munich, March 2019