Slide 1: Machine Learning Week 2 Lecture 1
Slide 2: Quiz and Hand-in Data
Test what you know so I can adapt! We need data for the hand-in.
Slide 3: Quiz
Any problems? Any questions?
Slide 4: Recap
Supervised learning: an unknown target f, a data set, a hypothesis set, and a learning algorithm that produces a hypothesis h with h(x) ≈ f(x).
Example problems: classification (e.g. 10 classes) and regression.
Regression example: target: house price; input: size, rooms, age, garage, ...; data: historical data of house sales.
Slide 5: Linear Models
Example: target: house price; input: size, rooms, age, garage, ...; data: historical house sales.
Weight each input dimension so it affects the target function in a good way, e.g.
House Price = θ0·x0 + θ1·Size + θ2·Rooms + θ3·Age + θ4·Garage (e.g. θ1 = 1234, θ2 = 42),
i.e. h(x) = θᵀx with x0 = 1 (a matrix product). The model is linear in θ; applying a nonlinear transform to the inputs still leaves it linear in θ.
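A minimal numeric sketch of this hypothesis (the feature values and coefficients below are made up for illustration):

    import numpy as np

    def h(theta, x):
        """Linear hypothesis: theta^T x, with x[0] = 1 as the bias input."""
        return theta @ x

    # x = [1 (bias), size, rooms, age, garage]
    x = np.array([1.0, 120.0, 4.0, 30.0, 1.0])
    theta = np.array([5000.0, 1234.0, 42.0, -100.0, 2000.0])
    print(h(theta, x))  # predicted house price

    # Nonlinear transform: the model stays linear in theta even if we
    # first map x through nonlinear features, e.g. appending size^2.
    z = np.append(x, x[1] ** 2)
    theta_z = np.append(theta, 0.5)
    print(h(theta_z, z))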
Slide 6: Three Models
Classification: the perceptron, h(x) = sign(wᵀx).
Logistic regression: estimating probabilities, h(x) = σ(wᵀx) = 1 / (1 + e^(−wᵀx)).
Classify y = 1 if h(x) ≥ 1/2, which is equivalent to wᵀx ≥ 0.
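A small sketch of that decision rule (the weights are hypothetical):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def classify(w, x):
        # sigmoid(w @ x) >= 0.5 exactly when w @ x >= 0
        return 1 if w @ x >= 0 else -1

    w = np.array([-1.0, 0.5])   # hypothetical weights (bias, feature)
    x = np.array([1.0, 3.0])    # x[0] = 1 is the bias input
    print(sigmoid(w @ x), classify(w, x))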
Slide 7: Maximum Likelihood
Assumption: independent data, so the likelihood is a product over the data points. Use the logarithm to turn the product into a sum, then optimize. For logistic regression we get the cross-entropy error.
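In one standard form (assuming labels y_n ∈ {−1, +1}, as in the course book), maximizing the log-likelihood is equivalent to minimizing the cross-entropy error:

\[
L(w) = \prod_{n=1}^{N} P(y_n \mid x_n),
\qquad
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n\, w^{\top} x_n}\right)
\]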
Slide 8: Convex Optimization
Convex vs. non-convex problems. In the standard form, the objective f and the constraints g are convex and h is affine. For convex problems, local minima are global minima.
[Figure: a convex function, with the chord between (x, f(x)) and (y, f(y)) above the graph, and the tangent-line lower bound f(x) + f'(x)(y − x) below it.]
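The two standard characterizations behind the figure: chords lie above the graph, and (for differentiable f) tangent lines lie below it:

\[
f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]
\]
\[
f(y) \ge f(x) + f'(x)(y - x) \quad \text{for all } x, y
\]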
Slide 9: Descent Methods
Assume f is twice continuously differentiable. Iteratively move toward a better solution:
Pick a start point x.
Repeat until a stopping criterion is satisfied:
  Compute a descent direction v.
  Line search: compute a step size t.
  Update: x = x + t·v.
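A minimal Python sketch of this recipe (the function names and the example objective are mine, for illustration):

    import numpy as np

    def descent(f, grad, x0, direction, line_search, tol=1e-8, max_iter=1000):
        """Generic descent method with pluggable direction and line search."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:    # stopping criterion
                break
            v = direction(g)               # compute descent direction
            t = line_search(f, x, v)       # compute step size
            x = x + t * v                  # update
        return x

    # Steepest descent on f(x) = x1^2 + 10*x2^2 with a fixed step size.
    f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
    grad = lambda x: np.array([2 * x[0], 20 * x[1]])
    x_min = descent(f, grad, [10.0, 1.0],
                    direction=lambda g: -g,
                    line_search=lambda f, x, v: 0.01)
    print(x_min)  # close to (0, 0)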
Slide 10: Simple Gradient Descent
Pick a start point x and a learning rate LR = 0.1. Repeat for 50 rounds: set v = −∇f(x); update x = x + LR·v. The descent direction is the negative gradient; the step size is the fixed learning rate.
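The same loop with the slide's fixed choices, as runnable code (the example function is hypothetical, since the slide's f is not shown):

    import numpy as np

    def simple_gradient_descent(grad, x0, lr=0.1, rounds=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(rounds):
            v = -grad(x)      # descent direction: negative gradient
            x = x + lr * v    # fixed step size (learning rate)
        return x

    grad = lambda x: 2 * x                         # gradient of f(x) = x^2
    print(simple_gradient_descent(grad, [3.0]))    # approaches 0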
Slide 11: Learning Rate
[Figure: three gradient descent runs on the same function with different learning rates.]
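A sketch of code that regenerates figures like these, using f(x) = x² as a stand-in for the lecture's function (learning-rate values are my own picks for "too small, good, too large"):

    import numpy as np
    import matplotlib.pyplot as plt

    f = lambda x: x ** 2
    grad = lambda x: 2 * x

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, lr in zip(axes, [0.05, 0.5, 1.05]):
        x, path = 3.0, [3.0]
        for _ in range(15):
            x = x - lr * grad(x)   # gradient descent step
            path.append(x)
        xs = np.linspace(-4, 4, 200)
        ax.plot(xs, f(xs))                          # the objective
        ax.plot(path, [f(p) for p in path], "o-")   # the iterates
        ax.set_title(f"learning rate = {lr}")
    plt.show()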
Slide 12: Gradient Descent Jump Around
[Figure: gradient descent with exact line search, starting from (10, 1), zigzags toward the minimum.]
Slide 13: Gradient Checking
If you use gradient descent, make sure you compute the gradient correctly. Choose a small h and use the two-sided formula (f(x + h) − f(x − h)) / (2h); it reduces the estimation error significantly compared to the one-sided difference. For an n-dimensional gradient, apply the formula to each variable in turn (move ε along each coordinate). Usually works well.
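A compact gradient checker along these lines (the example function is mine):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        """Two-sided difference applied to each coordinate in turn."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad

    f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]               # example function
    analytic = lambda x: np.array([2 * x[0] + 3 * x[1], 3 * x[0]])
    x = np.array([1.0, 2.0])
    print(numerical_gradient(f, x), analytic(x))            # should agree closely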
Slide 14: Hand-in 1: Supervised Learning
It comes online after class today. It includes Matlab examples but not a long intro; Google is your friend. Questions are always welcome. Get busy!
Slide 15: Today
Learning feasibility. Probabilistic approach. Learning formalized.
Slide 16: Learning Diagram
Unknown target f; data set (x1, y1), ..., (xn, yn); hypothesis set; learning algorithm; final hypothesis h with h(x) ≈ f(x).
Slide 17: Impossibility of Learning!
[Table: truth table over binary inputs x1, x2, x3, with f(x) given on the observed rows and marked ? on the rest.]
What is f? There are 2^8 = 256 potential boolean functions, and 8 of them have in-sample error 0, one for each way of filling in the three unseen rows. The data alone cannot distinguish between them: assumptions are needed.
Slide 18: No Free Lunch
"All models are wrong, but some models are useful." (George Box)
Machine learning has many different models and algorithms. There is no single model that works best for all problems (the No Free Lunch theorem). Assumptions that work well in one domain may fail in another.
Slide 19: Probabilistic Games
Slide 20: Probabilistic Approach
Flip a coin with unknown bias μ, repeating N times independently. Sample: h, h, h, t, t, h, t, t, h. Sample mean: ν = #heads / N. What does the sample mean say about μ? With certainty? Nothing, really. Probabilistically? Yes: the sample mean is likely close to the bias.
Slide 21: Hoeffding's Inequality (Binary Variables)
The sample mean ν is probably close to the coin bias μ, and the probability increases with the number of samples N:
P(|ν − μ| > ε) ≤ 2e^(−2ε²N).
The bound is independent of the sample mean and of the actual probabilities, e.g. the probability distribution P(x). The sample mean is probably approximately correct: PAC.
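A quick simulation (my own, under the slide's coin setup) showing the empirical deviation probability sits below the bound:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, N, eps, trials = 0.6, 100, 0.1, 100_000
    nu = rng.binomial(N, mu, size=trials) / N        # sample means
    empirical = np.mean(np.abs(nu - mu) > eps)       # P(|nu - mu| > eps), estimated
    bound = 2 * np.exp(-2 * eps**2 * N)              # Hoeffding bound
    print(empirical, bound)                          # empirical stays below the bound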
Slide 22: Classification Connection: Testing a Hypothesis
Fix a hypothesis h, an unknown target f, and a probability distribution over the inputs x. Then μ is the probability of picking x such that f(x) ≠ h(x), and 1 − μ is the probability of picking x such that f(x) = h(x); μ is just the sum of the probabilities of all points x where the hypothesis is wrong. The sample mean becomes the in-sample error, and μ the out-of-sample error.
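In symbols, using the book's E_in/E_out notation (1[·] is the indicator function):

\[
E_{\text{out}}(h) = \mathbb{P}_{x \sim P(x)}\!\left[ h(x) \neq f(x) \right] = \mu,
\qquad
E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\!\left[ h(x_n) \neq f(x_n) \right] = \nu
\]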
Slide 23: Learning Diagram with Input Distribution
Unknown target f; unknown input probability distribution P(x); data set (x1, y1), ..., (xn, yn); hypothesis set; learning algorithm; final hypothesis h with h(x) ≈ f(x).
Slide 24: Coins to Hypotheses
A sample of size N, e.g. h, h, h, t, t, h, t, t, h, has a sample mean ν that estimates the unknown μ. For a hypothesis, the data set plays the role of the sample: ν is the in-sample error and the unknown μ is the out-of-sample error.
Slide 25: Not Learning Yet
The hypothesis must be fixed before seeing the data. Every hypothesis has its own error (a different coin for each hypothesis). In learning, the training algorithm picks the "best" hypothesis from the set after seeing the data, so we are only verifying a fixed hypothesis, not learning. Hoeffding has left the building again.
Slide 26: Coin Analogy (Exercise 1.10 in the Book)
Flip a fair coin 10 times. What is the probability of 10 heads? (1/2)^10 = 1/1024 ≈ 0.1%. Now repeat 1000 times (1000 coins). What is the probability that some coin has 10 heads? 1 − (1 − 1/1024)^1000 ≈ 63%.
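Verifying the numbers:

    # One coin: 10 heads in 10 fair flips; then at least one of 1000 coins.
    p_ten_heads = 0.5 ** 10
    p_some_coin = 1 - (1 - p_ten_heads) ** 1000
    print(p_ten_heads, p_some_coin)   # ~0.000977, ~0.624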
Slide 27: Crude Approach: Apply the Union Bound
The learning algorithm may pick any hypothesis, so bound the bad event by "it is true for some hypothesis": apply the union bound over the hypothesis set, and then Hoeffding to each term.
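Spelled out, for a finite hypothesis set {h_1, ..., h_M} and final hypothesis g:

\[
\mathbb{P}\!\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right]
\le \mathbb{P}\!\left[\,\exists\, m:\ |E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\right]
\le \sum_{m=1}^{M} \mathbb{P}\!\left[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\right]
\le 2M e^{-2\epsilon^2 N}
\]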
Slide 28: Result
Classification problem; the error is f(x) ≠ h(x). Finite hypothesis set with M hypotheses, data set with N points. The bound explains the idea of what we are looking for (model complexity seems to be a factor). But our "simple" linear models have infinite hypothesis sets...
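One standard way to state the resulting bound (set δ = 2Me^(−2ε²N) and solve for ε): with probability at least 1 − δ,

\[
E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}
\]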
Slide 29: New Learning Diagram
Unknown target f; input probability distribution P(x); data set (x1, y1), ..., (xn, yn); finite hypothesis set; learning algorithm; final hypothesis h with h(x) ≈ f(x).
Slide 30: Learning Feasibility
Deterministically, with no assumptions: NOT SO MUCH. Probabilistically: YES.
Generalization: get the out-of-sample error close to the in-sample error, and make the in-sample error small.
If the target function is complex, learning should be harder? The bound does not seem to care. But complex targets need complex hypothesis sets, which increases their complexity, i.e. M. So M is a very simple and crude measure.
Slide 31: Error Functions
User specified, heavily problem dependent.
Identity system using fingerprints: is the person who he says he is?

h(x) \ f(x)      | Lying          | True
Estimate Lying   | true negative  | false negative
Estimate True    | false positive | true positive

Two applications weight these errors very differently (the slide's cost matrices use weights 1000 and 1):
Walmart, giving a discount to a given person: a false negative (rejecting a genuine customer) is the expensive error.
CIA access (Friday bar stock): a false positive (accepting an impostor) is the expensive error.
Slide 32: Error Functions If Not Given
If no one tells you: base the error function on making the problem "solvable"; making it smooth and convex seems like a good idea (least squares linear regression was very nice indeed). Or base it on assumptions about the target and the noise: logistic regression's likelihood gives cross-entropy; assuming a linear target with Gaussian noise gives least squares.
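A sketch of the second route, assuming y = wᵀx + ε with Gaussian noise ε ~ N(0, σ²). The log-likelihood of independent data is

\[
\ln \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_n - w^{\top} x_n)^2}{2\sigma^2}\right)
= \text{const} - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^{\top} x_n)^2,
\]

so maximizing the likelihood is exactly minimizing the squared error.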
Slide 33: Formalize Everything
Unknown target; unknown probability distribution P(x); data set (x1, y1), ..., (xn, yn); hypothesis set; learning algorithm; final hypothesis h with h(x) ≈ f(x).
Slide 34: Final Diagram
Unknown target distribution P(y | x); unknown input probability distribution P(x); data set; hypothesis set; learning algorithm; error measure e; final hypothesis. The input distribution sets the importance of each point: if x has very low probability, it is not really going to count.
Slide 35: Words on the Out-of-Sample Error
Imagine X and Y are finite sets.
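Then the out-of-sample error is a finite weighted sum; in the diagram's notation (assuming a deterministic target for simplicity):

\[
E_{\text{out}}(h) = \mathbb{E}_{x \sim P(x)}\!\left[ e\big(h(x), f(x)\big) \right]
= \sum_{x \in X} P(x)\, e\big(h(x), f(x)\big)
\]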
Slide 36: Quick Summary
Learning without assumptions is impossible. Probabilistically, learning is possible: the Hoeffding bound. Work is needed for infinite hypothesis spaces! The error function depends on the problem. Formalized learning approach: ensure the out-of-sample error is close to the in-sample error, and minimize the in-sample error. The complexity of the hypothesis set (its size M, currently) matters. More data helps.