Announcements: No Friday AI seminar this week. Today, 3pm, EA170 – Peter Bartlett on Deep Learning.

Feedback summary: lectures are a little too fast, and key concepts need to be emphasized; a clearer connection is needed between assignment problems and lecture concepts; clearer instructions/explanations in assignments.

Changes for the second half: adding more examples to lectures; trimming some material from the second half of the class; adding specific learning goals for each class session; making clearer in assignments what you are expected to do and how the program should behave.

Today’s learning goals: explain the difference between offline solving and online learning; explain the difference between Monte Carlo and Temporal Difference learning; distinguish between on-policy and off-policy learning approaches; explain the four phases of a genetic algorithm.

Reinforcement Learning examples

Problem: Driving home. Legal actions: driving along the road; intentionally getting in an accident (OH-315 only). Transition probabilities: at any time, someone can cut you off and you have to stop, P(s | s, a) = 0.1; on OH-315, you can get in an accident at any time, P(crash | 315, a) = 0.1; otherwise, you move as intended. Rewards: get home: +10 (terminal); pay toll: −5; crash: −100 (terminal).

Driving Home as Gridworld Calculating optimal route from knowledge of problem No actions taken!

Value iteration (offline solving). Transition probabilities: ∀s, P(s | s, a) = 0.1; P(crash | 315, a) = 0.1; otherwise, you move as intended. Rewards: get home: +10 (terminal); pay toll: −5; crash: −100 (terminal). Transitions and rewards are known to the agent, so we can calculate the optimal policy without having to take any action.
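As a concrete illustration of offline solving, here is a minimal value-iteration sketch in Python. The MDP is passed in as plain dictionaries; the `states`, `actions`, `transitions`, `rewards`, and `terminal` structures are hypothetical stand-ins, not the actual driving-home grid from the slides. The optimal policy is then the action achieving the max in each state.

```python
def value_iteration(states, actions, transitions, rewards, terminal,
                    gamma=0.9, tol=1e-6):
    """Compute state values for a known MDP, without taking any actions.

    transitions[s][a] is a list of (probability, next_state) pairs,
    rewards[s2] is the reward for entering state s2, and terminal is the
    set of absorbing states (their value stays 0).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                continue
            # Bellman update: best expected discounted return over actions
            best = max(
                sum(p * (rewards[s2] + gamma * V[s2])
                    for p, s2 in transitions[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```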

Driving Home as Gridworld 𝛾=0.9 Calculating optimal route from knowledge of problem No actions taken!

Value iteration (Offline solving) 𝛾=0.9 Calculating optimal route from knowledge of problem No actions taken!

Value iteration (Offline solving) 𝛾=0.9 Done calculating route Now we follow it (as a reflex agent)!

Online reinforcement learning. Transition probabilities: ∀s, P(s | s, a) = 0.1; P(crash | 315, a) = 0.1; otherwise, you move as intended. Rewards: get home: +10 (terminal); pay toll: −5; crash: −100 (terminal). Transitions and rewards are unknown to the agent, so we must act to learn the optimal policy. Approaches: Monte Carlo (full sequences) and Temporal Difference learning (single observations).

Monte Carlo. Learning process: (1) start in a random state; (2) choose actions with the current ε-greedy policy; (3) continue until a terminal state is reached; (4) sum up the discounted rewards and apply them to the states visited. A minimal sketch follows.
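The sketch below assumes a hypothetical environment object (not from the slides) with `reset()` returning a start state, `step(state, action)` returning `(next_state, reward, done)`, and `actions(state)` listing the legal actions; `reset()` stands in for choosing the (possibly random) start state.

```python
import random
from collections import defaultdict

def run_episode(env, Q, epsilon=0.1):
    """Follow the current epsilon-greedy policy until a terminal state."""
    s, done, trajectory = env.reset(), False, []
    while not done:
        acts = env.actions(s)
        if random.random() < epsilon:
            a = random.choice(acts)                 # explore
        else:
            a = max(acts, key=lambda x: Q[(s, x)])  # exploit current Q values
        s2, r, done = env.step(s, a)
        trajectory.append((s, a, r))
        s = s2
    return trajectory

def monte_carlo_update(Q, trajectory, alpha=0.5, gamma=0.9):
    """Blend each (state, action)'s old Q value with the discounted return
    observed after it; this is the single-episode case of the slide's update."""
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                           # discounted return from here on
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * G

Q = defaultdict(float)   # Q values default to 0 for unseen (state, action) pairs
```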

Monte Carlo – Run 1 (table of Action / Outcome / Reward per step, shown on the grid: the toll step gives reward −5, and the final EXIT action gives reward +10).

Monte Carlo – Runs 1, 2, 3 (Action / Outcome / Reward tables for three episodes; the visible rewards include two −5 tolls, one EXIT with +10, and one EXIT with −100 after a crash).

Monte Carlo – Runs 1 and 2 (γ = 0.9, α = 0.5). Discounting each episode's rewards by powers of γ (0.9, 0.81, 0.729, 0.6561, …), one state/action pair sees a discounted return of 2.29 in Run 1 (−5 + 0.729·10) and 1.561 in Run 2 (−5 + 0.6561·10). Update rule: Q_{t+1}(s,a) = (1−α)·Q_t(s,a) + α·(1/M)·Σ_{e ∈ E(s,a)} DiscountedRewards_e(s,a). Worked example: (0.5·0) + 0.5·((2.29 + 1.561)/2) = 0.96275.
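A quick check of that worked example (the returns 2.29 and 1.561 and α = 0.5 are taken directly from the slide):

```python
# Blend the old Q value (0) with the average of the two observed discounted
# returns for this state/action pair.
alpha = 0.5
returns = [2.29, 1.561]
q_new = (1 - alpha) * 0.0 + alpha * (sum(returns) / len(returns))
print(q_new)   # 0.96275
```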

Monte Carlo (resulting Q values shown on the grid; γ = 0.9, α = 0.5).

Temporal Difference learning – now learning from every action! Learning process: (1) start in the start state; (2) choose the next action with the current ε-greedy policy; (3) take the action, observe the new state and reward; (4) use the reward and the estimated utility of the new state to update the Q value; (5) GOTO 2 (choose another action). A minimal sketch follows.
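A minimal Q-learning (temporal-difference) loop in the same style, again assuming the hypothetical `env.reset()` / `env.step()` / `env.actions()` interface used in the Monte Carlo sketch; the numbered comments match the steps above.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Run one episode, updating Q after every single action."""
    s, done = env.reset(), False
    while not done:
        acts = env.actions(s)
        # Step 2: choose the next action with the current epsilon-greedy policy
        if random.random() < epsilon:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda x: Q[(s, x)])
        # Step 3: take the action, observe the new state and reward
        s2, r, done = env.step(s, a)
        # Step 4: update Q with the reward and the estimated utility of s2
        next_best = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * next_best)
        s = s2   # Step 5: go back and choose another action

Q = defaultdict(float)
```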

TD learning – Episode 1 (γ = 0.9, α = 0.5). Each step's action, outcome, and reward are shown on the grid. Q-learning update rule: Q(s,a) ← (1−α)·Q(s,a) + α·[R(s′) + γ·max_{a′∈A(s′)} Q(s′,a′)].

TD learning – Episode 1. Update: Q([2,0], N) ← 0.5·Q([2,0], N) + 0.5·(0 + 0.9·max{0, 0, 0}) = 0.

TD learning – Episode 1. Update: Q([2,1], E) ← 0.5·Q([2,1], E) + 0.5·(0 + 0.9·max{0, 0}) = 0.

TD learning – Episode 1. Reward −5. Update: Q([3,1], N) ← 0.5·Q([3,1], N) + 0.5·(−5 + 0.9·max{0, 0}) = −2.5.

TD learning – Episode 1. Update: Q([3,2], N) ← 0.5·Q([3,2], N) + 0.5·(0 + 0.9·max{0, 0}) = 0.

TD learning – Episode 1. Remember, there is no reward for entering the goal square; we only get the reward when we take the EXIT action! Update: Q([3,3], W) ← 0.5·Q([3,3], W) + 0.5·(0 + 0.9·max{0}) = 0.

TD learning – Episode 1. Action EXIT, outcome EXIT, reward 10. Update: Q([2,3], EXIT) ← 0.5·Q([2,3], EXIT) + 0.5·(10 + 0.9·max{}) = 5 (the max over the empty action set of the terminal state is taken as 0).

TD learning – Episode 2 (γ = 0.9, α = 0.5). Q-learning update rule: Q(s,a) ← (1−α)·Q(s,a) + α·[R(s′) + γ·max_{a′∈A(s′)} Q(s′,a′)].

TD learning – Episode 2. Update: Q([2,0], N) ← 0.5·Q([2,0], N) + 0.5·(0 + 0.9·max{0, 0, 0}) = 0.

TD learning – Episode 2. Update: Q([2,1], E) ← 0.5·Q([2,1], E) + 0.5·(0 + 0.9·max{0, −2.5}) = 0.

TD learning – Episode 2. Reward −5. Update: Q([3,1], N) ← 0.5·Q([3,1], N) + 0.5·(−5 + 0.9·max{0, 0}) = −3.75.

TD learning – Episode 2. Update: Q([3,2], N) ← 0.5·Q([3,2], N) + 0.5·(0 + 0.9·max{0, 0}) = 0.

TD learning – Episode 2. Remember, there is no reward for entering the goal square; we only get the reward when we take the EXIT action! Update: Q([3,3], W) ← 0.5·Q([3,3], W) + 0.5·(0 + 0.9·max{5}) = 2.25.

TD learning – Episode 2. Action EXIT, outcome EXIT, reward 10. Update: Q([2,3], EXIT) ← 0.5·Q([2,3], EXIT) + 0.5·(10 + 0.9·max{}) = 7.5.

TD learning – Episode 2: Monte Carlo vs. TD learning (the two Q grids are compared side by side; γ = 0.9, α = 0.5). Note the differences between Monte Carlo and TD learning for these cells. Why does Monte Carlo have higher values? Will these values ever be the same?

TD learning – Episode 3 (γ = 0.9, α = 0.5). One more episode of TD learning, now with some actions that don't work right. Q-learning update rule: Q(s,a) ← (1−α)·Q(s,a) + α·[R(s′) + γ·max_{a′∈A(s′)} Q(s′,a′)].

TD learning – Episode 3. Update: Q([2,0], N) ← 0.5·Q([2,0], N) + 0.5·(0 + 0.9·max{0, 0, 0}) = 0.

TD learning – Episode 3. Update: Q([2,1], E) ← 0.5·Q([2,1], E) + 0.5·(0 + 0.9·max{0, −3.75}) = 0.

TD learning – Episode 3. Reward −5. Update: Q([3,1], N) ← 0.5·Q([3,1], N) + 0.5·(−5 + 0.9·max{0, 0}) = −4.375.

TD learning – Episode 3. Reward −5. By failing to go north and having to go through the toll booth again, we (1) get the negative reward and (2) now have it applied to leaving this square. Update: Q([3,2], N) ← 0.5·Q([3,2], N) + 0.5·(−5 + 0.9·max{0, 0}) = −2.5.

TD learning – Episode 3. But now the potential positive outcome from [3,3] propagates back to the North action from [3,2], bringing it back towards 0. Update: Q([3,2], N) ← 0.5·Q([3,2], N) + 0.5·(0 + 0.9·max{0, 2.25}) = −0.2375.

TD learning – Episode 3. Here, failing to go West reduces the value of ([3,3], West). Update: Q([3,3], W) ← 0.5·Q([3,3], W) + 0.5·(0 + 0.9·max{0, 2.25}) = 2.1375.

TD learning – Episode 3. But now we know the exit square is good, so we increase ([3,3], West) again. Update: Q([3,3], W) ← 0.5·Q([3,3], W) + 0.5·(0 + 0.9·max{7.5}) = 4.44375.

TD learning – Episode 3. Action EXIT, outcome EXIT, reward 10. Remember, there is no reward for entering the goal square; we only get the reward when we take the EXIT action! Update: Q([2,3], EXIT) ← 0.5·Q([2,3], EXIT) + 0.5·(10 + 0.9·max{}) = 8.75.

TD learning – On-policy vs. Off-policy: two different ways of getting the estimated utility of the next state. Off-policy (Q-learning): look at all the actions you can take in the resulting state s′ and use the highest of their Q values for learning: Q(s,a) ← (1−α)·Q(s,a) + α·[R(s′) + γ·max_{a′∈A(s′)} Q(s′,a′)]. On-policy (Sarsa): choose the next action to take using the current ε-greedy policy and use the Q value of that action for learning: Q(s,a; a′) ← (1−α)·Q(s,a) + α·[R(s′) + γ·Q(s′, a′)]. What we've been doing in the example is off-policy learning!

Off-policy vs. On-policy learning process. On-policy: take an action; observe a reward; choose the next action; learn (using the chosen action); take the next action. Off-policy: take an action; observe a reward; learn (using the best action); choose the next action; take the next action.
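To make the ordering difference concrete, here is a Sarsa (on-policy) episode sketched next to the Q-learning target, using the same hypothetical environment interface as the earlier sketches:

```python
import random

def epsilon_greedy(Q, env, s, epsilon=0.1):
    acts = env.actions(s)
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, alpha=0.5, gamma=0.9, epsilon=0.1):
    """On-policy: choose the next action first, then learn from ITS Q value."""
    s = env.reset()
    a = epsilon_greedy(Q, env, s, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(s, a)
        if done:
            target = r                                  # no next action at a terminal state
            a2 = None
        else:
            a2 = epsilon_greedy(Q, env, s2, epsilon)    # the chosen action, not the best one
            target = r + gamma * Q[(s2, a2)]            # Q-learning would use max instead
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s, a = s2, a2
    return Q
```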

TD learning – Off-policy (Q-learning) example (γ = 0.9, α = 0.5): Q([2,1], E) ← 0.5·Q([2,1], E) + 0.5·(0 + 0.9·max{0, −2.5}) = 0.

TD learning – On-policy (Sarsa) example (γ = 0.9, α = 0.5). We chose N as the next action using the current ε-greedy policy. Sarsa update rule: Q(s,a; a′) ← (1−α)·Q(s,a) + α·[R(s′) + γ·Q(s′, a′)]. Update: Q([2,1], E; N) ← 0.5·Q([2,1], E) + 0.5·(0 + 0.9·(−2.5)) = −1.125.

RL Concepts Offline solving vs online learning Offline – problem is known (transitions and rewards), can determine optimal policy without taking action Online – problem is unknown, need to act to learn info about transitions and rewards

RL Concepts Monte Carlo vs Temporal Difference Monte Carlo – learn from multiple complete training episodes Run from random start to finish For each state/action pair, average the sum of discounted rewards after observing it to get new Q values Temporal Difference – learn from every individual experience Take action a to leave state s, end up in s’ with reward R(s’) Use observed reward and current estimated values of each next action to update value of previous state/action pair Can be used with infinite-length problems!

RL Concepts On-policy (Sarsa) vs off-policy (Q learning) On-policy – choose a specific next action according to the current 𝜖-greedy policy, use that as the next action for TD learning Off-policy – choose the best next action, ignoring the current policy, and use that as the next action for TD learning In both cases, still actually take the next action according to current 𝜖-greedy policy!

Genetic algorithms http://rednuht.org/genetic_walkers/

Genetic algorithms. Setup: there's a problem that you want to solve (e.g., making an antenna), but you're not sure how best to search the model space. Idea: start with a whole bunch of different models; find which ones work and cross-breed them; include mutation for random outcomes.

Main steps: (1) evaluate the fitness of the current population; (2) select the fittest, remove the rest; (3) crossover; (4) mutation.
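A generic sketch of that loop. The `fitness`, `crossover`, and `mutate` functions are problem-specific placeholders (e.g. for the antenna example below); note that this sketch selects a top fraction by fitness, whereas the antenna example instead keeps models that pass the threshold test g(x).

```python
import random

def evolve(population, fitness, crossover, mutate, generations=100, keep=0.5):
    """Run the four genetic-algorithm steps for a number of generations."""
    size = len(population)
    for _ in range(generations):
        # 1. Evaluate fitness and 2. keep only the fittest fraction
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(2, int(keep * size))]
        # 3. Crossover: breed random pairs of survivors to refill the population
        children = []
        while len(survivors) + len(children) < size:
            p1, p2 = random.sample(survivors, 2)
            children.extend(crossover(p1, p2))   # e.g. the two swapped-half children
        # 4. Mutation: randomly perturb the newly created children
        population = (survivors + [mutate(c) for c in children])[:size]
    return population
```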

Example: Antenna. A bent-wire antenna corresponds to a vector such as (0, 2, −1, 1, −2, 1). Model setup: x_i = direction/length of bend i; x represents a single configuration. Fitness test for selection: keep the model if its bends are roughly symmetric, using f(x) = (1, 1, 1, 1, 1, 1) · x and g(x) = 1 if −1 ≤ f(x) ≤ 1, else 0.
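The fitness test from this slide written out in Python; the printed vector is the slide's example.

```python
def f(x):
    """Dot product with the all-ones vector, i.e. the sum of the bend parameters."""
    return sum(x)

def g(x):
    """Selection test: keep the model only if its bends roughly cancel out."""
    return 1 if -1 <= f(x) <= 1 else 0

print(f([0, 2, -1, 1, -2, 1]), g([0, 2, -1, 1, -2, 1]))   # 1 1
```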

Antenna: Evaluation. Current population and fitness values: (0, 1, 2, 0, 0, 0), f(x) = 3; (1, 1, 1, 1, 1, 1), f(x) = 6; (0, 1, 2, 0, −1, −1), f(x) = 1; (−1, 1, −1, −1, 2, −1), f(x) = −1; (−1, 1, 2, 0, −1, −1), f(x) = 0.

Antenna: Selection. Applying g(x) = 1 if −1 ≤ f(x) ≤ 1, else 0, the models with f(x) = 3 and f(x) = 6 are removed; the models with f(x) = 1, f(x) = −1, and f(x) = 0 survive.

Antenna: Crossover. Current population (the survivors): (0, 1, 2, 0, −1, −1); (−1, 1, −1, −1, 2, −1); (−1, 1, 2, 0, −1, −1).

Antenna: Crossover. Split each model in half; take the first half of model 1 and the second half of model 2 and add it to the population; take the first half of model 2 and the second half of model 1 and add it to the population. Crossing (0, 1, 2, 0, −1, −1) with (−1, 1, −1, −1, 2, −1) adds (0, 1, 2, −1, 2, −1) and (−1, 1, −1, 0, −1, −1) to the population.
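A sketch of this midpoint crossover; the example parents and the two printed children match the vectors shown on the slide.

```python
def crossover_pair(m1, m2):
    """Split two parents at the midpoint and swap halves to make two children."""
    half = len(m1) // 2
    return m1[:half] + m2[half:], m2[:half] + m1[half:]

c1, c2 = crossover_pair([0, 1, 2, 0, -1, -1], [-1, 1, -1, -1, 2, -1])
print(c1)   # [0, 1, 2, -1, 2, -1]
print(c2)   # [-1, 1, -1, 0, -1, -1]
```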

Antenna: Crossover. Current population after crossover. Newly added: (0, 1, 2, −1, 2, −1); (−1, 1, −1, 0, −1, −1); (0, 1, 2, 0, −1, −1); (−1, 1, 2, −1, 2, −1); (−1, 1, 2, 0, −1, −1). Original: (0, 1, 2, 0, −1, −1); (−1, 1, 2, 0, −1, −1); (−1, 1, −1, −1, 2, −1).

Antenna: Mutation. In each model generated by crossover, mutate model parameters at random. Example mutation function: M(x_i) = x_i + 1 with probability α; −2·x_i with probability β; x_i otherwise (α < 1, β < 1 − α; here α = 0.25, β = 0.25). Example mutation: x = (0, 2, −1, 1, −2, 1) with per-element mutations (x_i, x_i, −2·x_i, x_i + 1, x_i, x_i + 1) gives M(x) = (0, 2, 2, 2, −2, 2).
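A sketch of that mutation function, applied independently to each parameter; the printed result varies from run to run (the slide's example outcome (0, 2, 2, 2, −2, 2) is one possibility).

```python
import random

def mutate(x, alpha=0.25, beta=0.25):
    """Each parameter becomes x_i + 1 with probability alpha, -2 * x_i with
    probability beta, and is otherwise left unchanged."""
    out = []
    for xi in x:
        u = random.random()
        if u < alpha:
            out.append(xi + 1)
        elif u < alpha + beta:
            out.append(-2 * xi)
        else:
            out.append(xi)
    return out

print(mutate([0, 2, -1, 1, -2, 1]))
```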

Antenna: Mutation. Current population after mutation (boldface on the slide marks the mutated parameters): (0, 1, −4, −1, 2, −1); (−1, 1, −1, 0, −1, −1); (−1, 1, 2, 0, −1, −1); (0, −2, 2, 0, 0, −1); (−1, 1, 2, −1, 3, −1); (−1, 1, 2, 0, −1, −1); (0, 1, 2, 0, −1, −1); (−1, 1, 2, 0, −1, −1); (−1, 1, −1, −1, 2, −1).

Antenna: Back to evaluation. Current population and fitness values: (0, 1, −4, −1, 2, −1), f(x) = −3; (−1, 1, −1, 0, −1, −1), f(x) = −3; (−1, 1, 2, 0, −1, −1), f(x) = 0; (0, −2, 2, 0, 0, −1), f(x) = −1; (−1, 1, 2, −1, 3, −1), f(x) = 3; (−1, 1, 2, 0, −1, −1), f(x) = 0; (0, 1, 2, 0, −1, −1), f(x) = 1; (−1, 1, 2, 0, −1, −1), f(x) = 0; (−1, 1, −1, −1, 2, −1), f(x) = −1.

Recap: main steps – evaluate the fitness of the current population; select the fittest, remove the rest; crossover; mutation. Two main choices to make: a good fitness test and a good mutation strategy.

Choosing a good fitness test Example was “Is it symmetrically bent?” But this probably isn’t a very good test for how well the antenna works! Better examples (using simulation): Send a signal to the antenna, check accuracy of receipt Send a signal from the antenna, see how far it goes In general, design a fitness test that evaluates performance on the target problem

Choosing a good mutation strategy Same tradeoff as in Reinforcement Learning: Exploration vs Exploitation High mutation rate More likely to make big changes to escape current problems More likely to lose current good models Low mutation rate Better for fine-tuning current good models But much harder to escape current problems

Mutation rate – high (plot of f(x; x_i) against x_i, worse to better): Good! Escaped the local solution for a better solution.

Mutation rate – high (plot of f(x; x_i) against x_i, worse to better): Bad! Overshot the local good solution.

Mutation rate – low (plot of f(x; x_i) against x_i, worse to better): Good! Found the best local solution.

Mutation rate – low (plot of f(x; x_i) against x_i, worse to better): Bad! Hard to get out of the local not-as-good solution.

Genetic algorithms recap Each iteration Evaluate fitness of population Select the fittest, remove the rest Crossover Mutation Use a fitness test that evaluates performance on the target problem High mutation rate – big changes Escape current problems But lose current good models Low mutation rate – small changes Fine-tune current good models Hard to escape current problems

5 minute worksheet

Next time Probability fundamentals Count and divide Conditional probability and Bayes Rule