Deep Learning
About the Course CS6501: Vision and Language
Instructor: Vicente Ordonez
Website:
Location: Thornton Hall E316
Times: Tuesday - Thursday 12:30PM - 1:45PM
Faculty office hours: Tuesdays 3 - 4pm (Rice 310)
Discuss in Piazza:
Today: Quick review of machine learning. Linear regression.
Neural networks. Backpropagation.
Linear Regression
Prediction, inference, testing: $a_j = \sum_i w_{ji} x_i + b_j$, i.e. $a = W^T x + b$
Training, learning, parameter estimation: objective minimization over $D = \{(x^{(d)}, y^{(d)})\}$:
$L(W, b) = \sum_{d=1}^{|D|} l(a^{(d)}, y^{(d)})$,  $W^*, b^* = \arg\min L(W, b)$
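A minimal NumPy sketch of the prediction step (the sizes and values below are made-up placeholders, not data from the lecture):

    import numpy as np

    # made-up sizes: 5 input variables, 2 output variables
    W = np.random.randn(5, 2)    # weight matrix, one column per output a_j
    b = np.zeros(2)              # bias vector
    x = np.random.randn(5)       # one input example

    a = W.T @ x + b              # a = W^T x + b
    print(a)                     # the two predicted outputs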
Linear Regression Example: Hollywood movie data
Variables recorded per movie: production costs, promotional costs, genre of the movie, box office first week, total book sales, total revenue USA, total revenue international.
Data matrix: one row per movie $d = 1, \dots, 5$; in this first view all seven variables appear as inputs $x_1^{(d)}, \dots, x_7^{(d)}$.
Linear Regression Example: Hollywood movie data
Input variables x: production costs, promotional costs, genre of the movie, box office first week, total book sales.
Output variables y: total revenue USA, total revenue international.
Data matrix: one row per movie $d = 1, \dots, 5$, with inputs $x_1^{(d)}, \dots, x_5^{(d)}$ and outputs $y_1^{(d)}, y_2^{(d)}$.
Linear Regression Example: Hollywood movie data
Input variables x: production costs, promotional costs, genre of the movie, box office first week, total book sales.
Output variables y: total revenue USA, total revenue international.
Data matrix: one row per movie $d = 1, \dots, 5$. Rows $d = 1, \dots, 4$ are training data; row $d = 5$ is test data.
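A small NumPy sketch of arranging such data into matrices and splitting it into training and test sets (the values are invented placeholders, not the movie data above):

    import numpy as np

    # 5 movies, 5 input variables, 2 output variables (invented values)
    X = np.random.rand(5, 5)    # rows: movies; columns: x_1 .. x_5
    Y = np.random.rand(5, 2)    # columns: total revenue USA, total revenue international

    X_train, Y_train = X[:4], Y[:4]    # movies 1-4: training data
    X_test,  Y_test  = X[4:], Y[4:]    # movie 5: test data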
Linear Regression – Least Squares
Prediction: $a_j = \sum_i w_{ji} x_i + b_j$, i.e. $a = W^T x + b$
Training, learning, parameter estimation: objective minimization over $D = \{(x^{(d)}, y^{(d)})\}$:
$L(W, b) = \sum_{d=1}^{|D|} l(a^{(d)}, y^{(d)})$,  $W^*, b^* = \arg\min L(W, b)$
Linear Regression – Least Squares
Prediction: $a_j = \sum_i w_{ji} x_i + b_j$, i.e. $a = W^T x + b$
Training with the least-squares objective over $D = \{(x^{(d)}, y^{(d)})\}$:
$L(W, b) = \sum_{d=1}^{|D|} \left( a^{(d)} - y^{(d)} \right)^2$,  $W^*, b^* = \arg\min L(W, b)$
Linear Regression – Least Squares
$L(W, b) = \sum_{d=1}^{|D|} \left( a^{(d)} - y^{(d)} \right)^2$, where $a_j^{(d)} = \sum_i w_{ji} x_i^{(d)} + b_j$
Per output dimension $j$: $L_j(W, b) = \sum_{d=1}^{|D|} \left( \sum_i w_{ji} x_i^{(d)} + b_j - y_j^{(d)} \right)^2$
Linear Regression – Least Squares
$L_j(W, b) = \sum_{d=1}^{|D|} \left( \sum_i w_{ji} x_i^{(d)} + b_j - y_j^{(d)} \right)^2$,  $W^*, b^* = \arg\min L(W, b)$
$\frac{d L_j}{d w_{uv}}(W, b) = \frac{d}{d w_{uv}} \left( \sum_{d=1}^{|D|} \left( \sum_i w_{ji} x_i^{(d)} + b_j - y_j^{(d)} \right)^2 \right)$
Linear Regression – Least Squares
$\frac{d L_j}{d w_{uv}}(W, b) = \frac{d}{d w_{uv}} \left( \sum_{d=1}^{|D|} \left( \sum_i w_{ji} x_i^{(d)} + b_j - y_j^{(d)} \right)^2 \right) = \sum_{d=1}^{|D|} \frac{d}{d w_{uv}} \left( \sum_i w_{ji} x_i^{(d)} + b_j - y_j^{(d)} \right)^2 = 0$
Setting these derivatives to zero and solving gives the closed-form solution $W = (X^T X)^{-1} X^T Y$.
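A minimal NumPy sketch of this closed-form (normal equations) solution; the data are random placeholders, and the bias b is folded into W by appending a constant feature of 1 to every input:

    import numpy as np

    # made-up data: 100 examples, 5 input variables, 2 output variables
    X = np.random.randn(100, 5)
    Y = np.random.randn(100, 2)

    # append a column of ones so the last row of W plays the role of b
    Xb = np.hstack([X, np.ones((100, 1))])

    # normal equations: W = (X^T X)^{-1} X^T Y
    W = np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)   # solve() is more stable than an explicit inverse

    Y_pred = Xb @ W    # predictions on the training data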
Neural Network with One Layer
$W = [w_{ji}]$,  $a_j = \mathrm{sigmoid}\left( \sum_i w_{ji} x_i + b_j \right)$
[Diagram: inputs $x_1, \dots, x_5$ fully connected to output units $a_1, a_2$.]
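A minimal NumPy sketch of this one-layer forward computation (weights, biases, and input are made-up placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = np.random.randn(2, 5)    # w_ji: 2 output units, 5 inputs
    b = np.zeros(2)
    x = np.random.randn(5)

    a = sigmoid(W @ x + b)       # a_j = sigmoid(sum_i w_ji x_i + b_j)
    print(a)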
Neural Network with One Layer
$L(W, b) = \sum_{d=1}^{|D|} \left( a^{(d)} - y^{(d)} \right)^2$, where $a_j^{(d)} = \mathrm{sigmoid}\left( \sum_i w_{ji} x_i^{(d)} + b_j \right)$
Per output dimension $j$: $L_j(W, b) = \sum_{d=1}^{|D|} \left( \mathrm{sigmoid}\left( \sum_i w_{ji} x_i^{(d)} + b_j \right) - y_j^{(d)} \right)^2$
Neural Network with One Layer
$L_j(W, b) = \sum_{d=1}^{|D|} \left( \mathrm{sigmoid}\left( \sum_i w_{ji} x_i^{(d)} + b_j \right) - y_j^{(d)} \right)^2$
$\frac{d L_j}{d w_{uv}} = \frac{d}{d w_{uv}} \sum_{d=1}^{|D|} \left( \mathrm{sigmoid}\left( \sum_i w_{ji} x_i^{(d)} + b_j \right) - y_j^{(d)} \right)^2 = 0$
We can compute this derivative, but there is no closed-form solution for W when dL/dw = 0.
Gradient Descent
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot: the loss L(w) as a function of w, with the current point at w = 12.]
Gradient Descent
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda \, dL(w,b)/dw$
    Update b: $b = b - \lambda \, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
Here $L(w,b) = \sum_{i=1}^{n} l(w,b)$ sums over the entire training set of n examples, so every update is expensive.
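A minimal NumPy sketch of this loop for least-squares linear regression (the data are random placeholders; the learning rate is smaller than the slide's 0.01 so the summed loss stays stable on this made-up data):

    import numpy as np

    # made-up data: 100 examples, 5 inputs, 1 output
    X = np.random.randn(100, 5)
    y = np.random.randn(100)

    w = np.random.randn(5)    # initialize w and b randomly
    b = 0.0
    lam = 0.001               # learning rate (lambda)

    for e in range(200):                      # num_epochs
        a = X @ w + b                         # predictions on the whole training set
        dL_dw = 2 * X.T @ (a - y)             # dL/dw for L = sum_d (a - y)^2
        dL_db = 2 * np.sum(a - y)             # dL/db
        w = w - lam * dL_dw                   # update w
        b = b - lam * dL_db                   # update b
        print(np.sum((a - y) ** 2))           # L(w, b): should shrink over epochs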
Stochastic Gradient Descent
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $d L_B(w,b)/dw$ and $d L_B(w,b)/db$ on a mini-batch B of the training data
    Update w: $w = w - \lambda \, d L_B(w,b)/dw$
    Update b: $b = b - \lambda \, d L_B(w,b)/db$
    Print: $L_B(w,b)$   // Useful to see if this is becoming smaller or not.
end
Here $L_B(w,b) = \sum_{i=1}^{|B|} l(w,b)$ sums only over the mini-batch, so each update is cheap.
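A matching mini-batch SGD sketch, with the same kind of made-up data as above:

    import numpy as np

    X = np.random.randn(100, 5)   # made-up data, as before
    y = np.random.randn(100)
    w = np.random.randn(5)
    b = 0.0
    lam = 0.01                    # learning rate (lambda)

    for e in range(200):                                      # num_epochs
        idx = np.random.choice(100, size=10, replace=False)   # sample a mini-batch B
        Xb, yb = X[idx], y[idx]
        a = Xb @ w + b
        dLB_dw = 2 * Xb.T @ (a - yb)      # dL_B/dw
        dLB_db = 2 * np.sum(a - yb)       # dL_B/db
        w = w - lam * dLB_dw              # update w
        b = b - lam * dLB_db              # update b
        print(np.sum((a - yb) ** 2))      # L_B(w, b) on the current mini-batch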
Deep Learning Lab
$a_j = \mathrm{sigmoid}\left( \sum_i w_{ji} x_i + b_j \right)$
Two Layer Neural Network
[Diagram: inputs $x_1, \dots, x_4$, hidden units $a_1, \dots, a_4$, and predicted output $\hat{y}_1$ compared against the target $y_1$.]
Forward Pass
$z_j = \sum_{i=0}^{n} w_{1ij} x_i + b_1$
$a_j = \mathrm{Sigmoid}(z_j)$
$p_1 = \sum_{i=0}^{n} w_{2i} a_i + b_2$
$\hat{y}_1 = \mathrm{Sigmoid}(p_1)$
$\mathrm{Loss} = L(\hat{y}_1, y_1)$
[Diagram: inputs $x_1, \dots, x_4$, hidden units $a_1, \dots, a_4$, predicted output $\hat{y}_1$, and target $y_1$.]
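A minimal NumPy sketch of this forward pass (all sizes and values are made-up placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # made-up network: 4 inputs, 4 hidden units, 1 output
    x  = np.random.randn(4)
    y1 = 1.0                                        # target
    W1 = np.random.randn(4, 4); b1 = np.zeros(4)    # first layer (w_1ij, b_1)
    W2 = np.random.randn(4);    b2 = 0.0            # second layer (w_2i, b_2)

    z      = W1 @ x + b1          # z_j = sum_i w_1ij x_i + b_1
    a      = sigmoid(z)           # a_j = Sigmoid(z_j)
    p1     = W2 @ a + b2          # p_1 = sum_i w_2i a_i + b_2
    y1_hat = sigmoid(p1)          # yhat_1 = Sigmoid(p_1)
    loss   = (y1_hat - y1) ** 2   # Loss = L(yhat_1, y_1), here squared error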
Backward Pass - Backpropagation
Working backwards from the loss to the parameters and inputs:
$\frac{\partial L}{\partial \hat{y}_1} = \frac{\partial}{\partial \hat{y}_1} L(\hat{y}_1, y_1)$
$\frac{\partial L}{\partial p_1} = \frac{\partial}{\partial p_1} \mathrm{Sigmoid}(p_1) \cdot \frac{\partial L}{\partial \hat{y}_1}$
$\frac{\partial L}{\partial a_k} = \left( \frac{\partial}{\partial a_k} \sum_{i=0}^{n} w_{2i} a_i + b_2 \right) \frac{\partial L}{\partial p_1}$ (GradInputs),  $\frac{\partial L}{\partial w_{2i}} = \frac{\partial p_1}{\partial w_{2i}} \frac{\partial L}{\partial p_1}$ (GradParams)
$\frac{\partial L}{\partial z_j} = \frac{\partial}{\partial z_j} \mathrm{Sigmoid}(z_j) \cdot \frac{\partial L}{\partial a_j}$
$\frac{\partial L}{\partial x_k} = \left( \frac{\partial}{\partial x_k} \sum_{i=0}^{n} w_{1ij} x_i + b_1 \right) \frac{\partial L}{\partial z_j}$ (GradInputs),  $\frac{\partial L}{\partial w_{1ij}} = \frac{\partial z_j}{\partial w_{1ij}} \frac{\partial L}{\partial z_j}$ (GradParams)
[Diagram: the same two-layer network, with gradients flowing backwards from the loss.]
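A minimal NumPy sketch of this backward pass, continuing the forward-pass sketch above (same made-up network; a squared-error loss is assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # same made-up two-layer network as in the forward-pass sketch
    x  = np.random.randn(4)
    y1 = 1.0
    W1 = np.random.randn(4, 4); b1 = np.zeros(4)
    W2 = np.random.randn(4);    b2 = 0.0

    # forward pass
    z      = W1 @ x + b1
    a      = sigmoid(z)
    p1     = W2 @ a + b2
    y1_hat = sigmoid(p1)

    # backward pass: chain rule from the loss back to inputs and parameters
    dL_dy1hat = 2 * (y1_hat - y1)                   # dL/d yhat_1 for squared error
    dL_dp1    = y1_hat * (1 - y1_hat) * dL_dy1hat   # Sigmoid'(p_1) * dL/d yhat_1
    dL_dW2    = a * dL_dp1                          # GradParams, layer 2
    dL_db2    = dL_dp1
    dL_da     = W2 * dL_dp1                         # GradInputs, layer 2
    dL_dz     = a * (1 - a) * dL_da                 # Sigmoid'(z_j) * dL/da_j
    dL_dW1    = np.outer(dL_dz, x)                  # GradParams, layer 1
    dL_db1    = dL_dz
    dL_dx     = W1.T @ dL_dz                        # GradInputs, layer 1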
Layer-wise implementation
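A minimal sketch of a layer-wise implementation, assuming each layer is an object with forward and backward methods (all class and variable names here are invented, not taken from the lecture code):

    import numpy as np

    class Linear:
        # Fully connected layer: out = W x + b
        def __init__(self, n_in, n_out):
            self.W = np.random.randn(n_out, n_in) * 0.1
            self.b = np.zeros(n_out)

        def forward(self, x):
            self.x = x                             # cache the input for the backward pass
            return self.W @ x + self.b

        def backward(self, grad_out):
            self.dW = np.outer(grad_out, self.x)   # GradParams
            self.db = grad_out
            return self.W.T @ grad_out             # GradInputs, passed to the previous layer

    class Sigmoid:
        def forward(self, z):
            self.a = 1.0 / (1.0 + np.exp(-z))
            return self.a

        def backward(self, grad_out):
            return self.a * (1 - self.a) * grad_out

    # chaining layers: forward in order, backward in reverse order
    layers = [Linear(4, 4), Sigmoid(), Linear(4, 1), Sigmoid()]
    out = np.random.randn(4)
    for layer in layers:
        out = layer.forward(out)
    grad = 2 * (out - 1.0)            # gradient of a squared-error loss w.r.t. the output
    for layer in reversed(layers):
        grad = layer.backward(grad)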
Automatic Differentiation
You only need to write code for the forward pass; the backward pass is computed automatically.
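For example, with PyTorch's autograd (an illustrative sketch, not code from the lecture; sizes and values are made up):

    import torch

    x  = torch.randn(4)
    y1 = torch.tensor(1.0)
    W1 = torch.randn(4, 4, requires_grad=True)
    b1 = torch.zeros(4, requires_grad=True)
    W2 = torch.randn(4, requires_grad=True)
    b2 = torch.tensor(0.0, requires_grad=True)

    # forward pass only
    a      = torch.sigmoid(W1 @ x + b1)
    y1_hat = torch.sigmoid(W2 @ a + b2)
    loss   = (y1_hat - y1) ** 2

    loss.backward()         # backward pass computed automatically
    print(W1.grad.shape)    # dL/dW1, filled in by autograd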
Questions?