Machine Learning Week 2 Lecture 1
Quiz and Hand-in Data. Test what you know so I can adapt! We need data for the hand-in.
Quiz: Any Problems? Any Questions?
Recap. Supervised Learning: an Unknown Target f, a Data Set, a Learning Algorithm, and a Hypothesis Set produce a Hypothesis h with h(x) ≈ f(x).
Classification example (10 classes): handwritten digit recognition. [Figure: sample digits with labels]
Regression example: Target: House Price. Input: Size, Rooms, Age, Garage, … Data: Historical Data of House Sales.
Linear Models. Example: Target: House Price. Input: Size, Rooms, Age, Garage, … Data: Historical House Sales. Weigh each input dimension to affect the target function in a good way:
House Price = 1234 · 1 + 88 · Size + 42 · Rooms − 666 · Age + 0.01 · Garage
h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄ = θᵀx (matrix product), with x₀ = 1. The model is linear in θ; applying a nonlinear transform to the inputs still gives a model that is linear in θ.
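A minimal Matlab sketch of this prediction as a matrix product; the weights are the slide's illustrative numbers, and the feature values are made up for the example:

```matlab
% House-price prediction as a dot product (sketch).
% Weights from the slide; x0 = 1 is the bias coordinate.
theta = [1234; 88; 42; -666; 0.01];   % [bias; size; rooms; age; garage]
x = [1; 120; 4; 30; 1];               % hypothetical house: 120 m^2, 4 rooms, 30 years, garage
price = theta' * x                    % h(x) = theta' * x (matrix product)
```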
Three Models. Classification (Perceptron): h(x) = sign(wᵀx). Linear Regression: h(x) = wᵀx. Logistic Regression (Estimating Probabilities): h(x) = σ(wᵀx) = 1 / (1 + e^(−wᵀx)). Classify y = 1 if h(x) ≥ 1/2, which is equivalent to wᵀx ≥ 0.
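A small sketch of the logistic hypothesis and the classification rule; the weight vector and input here are hypothetical:

```matlab
% Logistic regression hypothesis and classification rule (sketch).
sigmoid = @(z) 1 ./ (1 + exp(-z));
w = [0.5; -1.2; 2.0];        % hypothetical weights
x = [1; 0.3; 0.8];           % input with bias coordinate x0 = 1
p = sigmoid(w' * x);         % estimated P(y = 1 | x)
y = double(p >= 0.5)         % classify y = 1 iff w'*x >= 0
```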
Maximum Likelihood. Assumption: independent data. Likelihood: L(w) = ∏ᵢ P(yᵢ | xᵢ; w). Use the logarithm to turn the product into a sum, then optimize. For logistic regression we get the cross-entropy error:
E_in(w) = (1/N) Σᵢ ln(1 + e^(−yᵢ wᵀxᵢ)), with labels yᵢ ∈ {−1, +1}.
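A sketch of this error as a one-liner, assuming the {−1, +1} label convention above; the toy data is made up:

```matlab
% Cross-entropy in-sample error for logistic regression (sketch),
% labels y in {-1,+1}; X is N x d with a leading bias column.
Ein = @(w, X, y) mean(log(1 + exp(-y .* (X * w))));

X = [1 0.5 1.2; 1 -0.3 0.4; 1 2.0 -1.1; 1 -0.9 -0.7];  % hypothetical data
y = [1; -1; 1; -1];
w = zeros(3, 1);
Ein(w, X, y)                 % equals log(2) at w = 0
```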
Convex Optimization. Standard form: minimize f(x) subject to g(x) ≤ 0 and h(x) = 0, where f and g are convex and h is affine. Convexity: the chord between (x, f(x)) and (y, f(y)) lies above the graph. First-order condition: f(y) ≥ f(x) + f′(x)(y − x), so the tangent is a global underestimator. Key property: local minima are global minima. Non-convex problems lack these guarantees.
Descent Methods (assume f is twice continuously differentiable). Iteratively move toward a better solution:
Pick start point x
Repeat until stopping criterion satisfied:
  Compute descent direction v
  Line search: compute step size t
  Update: x = x + t·v
Simple Gradient Descent. Descent direction is the negative gradient, v = −∇f(x); the step size is a fixed learning rate:
Pick start point x
LR = 0.1
Repeat 50 rounds:
  Set v = −∇f(x)
  Update: x = x + LR·v
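A runnable sketch of this loop; the objective f(x) = ½·xᵀAx and the start point are assumptions chosen for illustration:

```matlab
% Simple gradient descent sketch on f(x) = 0.5 * x' * A * x (assumed example).
A = [1 0; 0 10];             % ill-conditioned quadratic
gradf = @(x) A * x;          % gradient of f
x = [10; 1];                 % pick a start point
LR = 0.1;                    % fixed learning rate
for round = 1:50
    v = -gradf(x);           % descent direction: negative gradient
    x = x + LR * v;          % fixed-step update
end
disp(x)                      % close to the minimizer [0; 0]
```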
Learning Rate. [Figure: gradient descent runs with different learning rates]
Gradient Descent Can Jump Around. Example: exact line search starting from (10, 1).
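A sketch of the zig-zag behaviour, again assuming the quadratic f(x) = ½·xᵀAx (for a quadratic, the exact line-search step has the closed form used below):

```matlab
% Gradient descent with exact line search on a quadratic (sketch).
A = [1 0; 0 10];             % assumed ill-conditioned objective
x = [10; 1];                 % starting point from the slide
for i = 1:20
    g = A * x;               % gradient of 0.5 * x' * A * x
    v = -g;                  % steepest descent direction
    t = (g' * g) / (g' * A * g);  % exact minimizer of f(x + t*v) for a quadratic
    x = x + t * v;           % iterates zig-zag toward [0; 0]
end
disp(x)
```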
Gradient Checking. If you use gradient descent, make sure you compute the gradient correctly. Choose a small h and compute
(f(x + h) − f(x − h)) / (2h).
Use this two-sided formula: it reduces the estimation error significantly compared to the one-sided (f(x + h) − f(x)) / h. For an n-dimensional gradient, apply the formula to each variable, perturbing one coordinate at a time. Usually works well.
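A self-contained sketch of the coordinate-wise check; the example function and its analytic gradient are assumptions for illustration:

```matlab
% Two-sided (central difference) gradient check (sketch).
f = @(x) x(1)^2 + 3*x(1)*x(2);           % hypothetical example function
gradf = @(x) [2*x(1) + 3*x(2); 3*x(1)];  % analytic gradient to verify
x = [1.0; -2.0];
h = 1e-6;                                % small step
n = numel(x);
num_grad = zeros(n, 1);
for i = 1:n
    e = zeros(n, 1); e(i) = h;           % perturb one coordinate at a time
    num_grad(i) = (f(x + e) - f(x - e)) / (2 * h);  % two-sided formula
end
norm(num_grad - gradf(x))                % should be tiny if the gradient is right
```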
Hand-in 1: Supervised Learning. It comes online after class today. It includes Matlab examples but not a long intro; Google is your friend. Questions are always welcome. Get busy! [Figure: handwritten digit examples]
Today: Learning Feasibility, Probabilistic Approach, Learning Formalized.
Learning Diagram: Unknown Target f → Data Set (x1,y1,...,xn,yn) → Learning Algorithm (with Hypothesis Set) → Hypothesis h, h(x) ≈ f(x).
Impossibility of Learning! Consider a Boolean target on three binary inputs x1, x2, x3: the observed table fixes f on the data points but leaves it undetermined elsewhere. What is f? There are 2⁸ = 256 potential functions on the 8 possible inputs; with 5 points observed, the 3 unseen points are unconstrained, so 2³ = 8 of them have in-sample error 0 and the data cannot distinguish between them. Assumptions are needed.
No Free Lunch. “All models are wrong, but some models are useful.” — George Box. Machine learning has many different models and algorithms. There is no single model that works best for all problems (No Free Lunch Theorem). Assumptions that work well in one domain may fail in another.
Probabilistic Games
Probabilistic Approach. Flip a coin with unknown bias μ = P(heads) N times independently. Sample mean: ν = #heads / N. Sample: h,h,h,t,t,h,t,t,h. What does the sample mean say about μ? With certainty? Nothing really. Probabilistically? Yes: the sample mean is likely close to the bias.
Hoeffding's Inequality (binary variables). The sample mean ν is probably close to the coin bias μ:
P(|ν − μ| > ε) ≤ 2e^(−2ε²N).
The bound tightens as the number of samples N grows, and it is independent of the sample mean, the actual bias, and the underlying probability distribution P(x). The sample mean is probably approximately correct (PAC).
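A quick simulation sketch comparing the empirical deviation probability to the bound; the bias, sample size, tolerance, and trial count are arbitrary choices:

```matlab
% Empirical check of Hoeffding's inequality (sketch; parameters are assumptions).
mu = 0.3; N = 100; eps = 0.1; trials = 20000;
flips = rand(trials, N) < mu;            % trials x N Bernoulli(mu) samples
nu = mean(flips, 2);                     % sample mean of each trial
empirical = mean(abs(nu - mu) > eps);    % empirical P(|nu - mu| > eps)
bound = 2 * exp(-2 * eps^2 * N);         % Hoeffding bound
fprintf('empirical %.4f <= bound %.4f\n', empirical, bound);
```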
Classification Connection: Testing a Hypothesis. Fix a hypothesis h before seeing data, let f be the unknown target, and let P(x) be a probability distribution over the inputs. Then μ is the probability of picking x such that f(x) ≠ h(x), and 1 − μ is the probability of picking x such that f(x) = h(x); μ is just the sum of the probabilities of all points x where the hypothesis is wrong. The sample mean ν corresponds to the in-sample error, and μ to the out-of-sample error.
Learning Diagram (updated): Unknown Target f, Unknown Input Probability Distribution P(x) → Data Set (x1,y1,...,xn,yn) → Learning Algorithm (with Hypothesis Set) → Hypothesis h, h(x) ≈ f(x).
Coins to Hypotheses. A sample of size N (e.g. h,h,h,t,t,h,t,t,h) gives the sample mean ν; the bias μ stays unknown. For a hypothesis, ν plays the role of the in-sample error and μ the out-of-sample error.
Not Learning Yet. Hoeffding applies to a hypothesis fixed before seeing the data, and every hypothesis has its own error (a different coin for each hypothesis). With a fixed hypothesis we are only doing verification, not learning. In learning, the training algorithm picks the “best” hypothesis from the set, and Hoeffding has left the building.
Coin Analogy (Exercise 1.10 in the book). Flip a fair coin 10 times. What is the probability of 10 heads? (1/2)¹⁰ ≈ 0.001. Repeat with 1000 coins. What is the probability that some coin shows 10 heads? 1 − (1 − (1/2)¹⁰)¹⁰⁰⁰ ≈ 63%.
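The arithmetic as a two-line sketch:

```matlab
% Probability that at least one of 1000 fair coins shows 10 heads in 10 flips.
p_single = 0.5^10;                     % one coin: 10 heads in a row
p_some = 1 - (1 - p_single)^1000;      % complement: no coin gets 10 heads
fprintf('%.4f\n', p_some)              % approx 0.6236
```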
Crude Approach. The learned hypothesis g is one of the hypotheses in the set, so
P(|E_in(g) − E_out(g)| > ε) ≤ P(|E_in(h) − E_out(h)| > ε for some hypothesis h).
Apply the union bound, and then Hoeffding to each term:
P(some h fails) ≤ Σₘ P(|E_in(hₘ) − E_out(hₘ)| > ε).
Result. Classification problem with error f(x) ≠ h(x), a finite hypothesis set with M hypotheses, and a data set with N points:
P(|E_in(g) − E_out(g)| > ε) ≤ 2M e^(−2ε²N).
It captures the idea of what we are looking for (model complexity is a factor, it seems). But our “simple” linear models have infinite hypothesis sets…
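Setting the right-hand side to a failure probability δ and solving for ε gives the generalization tolerance; a sketch with arbitrary example values of M, N, and δ:

```matlab
% Generalization tolerance from the 2M*exp(-2*eps^2*N) bound (sketch).
% Solving 2*M*exp(-2*eps^2*N) = delta for eps:
M = 100; N = 1000; delta = 0.05;       % hypothetical values
eps = sqrt(log(2*M/delta) / (2*N));
fprintf('With prob. >= %.2f: |Ein - Eout| <= %.3f\n', 1-delta, eps);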
New Learning Diagram: Unknown Target f, Input Probability Distribution P(x) → Data Set (x1,y1,...,xn,yn) → Learning Algorithm (with finite Hypothesis Set) → Hypothesis h, h(x) ≈ f(x).
Learning Feasibility. Deterministically / without assumptions? NOT SO MUCH. Probabilistically? YES. Generalization: make the out-of-sample error close to the in-sample error, and make the in-sample error small. If the target function is complex, learning should be harder? The bound does not seem to care. But complex targets need complex hypothesis sets, which should increase their complexity; M is a very simple and crude measure of it.
Error Functions. User specified, heavily problem dependent. Example: identity system based on fingerprints — is the person who he says he is?

  h(x) \ f(x)  | Lying          | True
  Est. Lying   | True Negative  | False Negative
  Est. True    | False Positive | True Positive

The costs differ by application. Walmart (discount for a given person): rejecting the right customer (false negative) is what hurts. CIA access (Friday bar stock): admitting the wrong person (false positive) is what hurts. Example cost matrices (0 on the diagonal):

  Walmart error function: false negative 1000, false positive 1
  CIA error function:     false positive 1000, false negative 1
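A sketch of a weighted in-sample error with such a cost matrix; the cost values and the toy labels are illustrative assumptions (here the CIA-style matrix, false positives expensive):

```matlab
% Weighted in-sample error for binary classification (sketch).
% Rows: true class (1 = lying, 2 = true person); columns: predicted class.
C = [0 1000;   % false positive: intruder accepted (costly)
     1 0];     % false negative: right person rejected (cheap)
y    = [1 0 1 1 0];                        % hypothetical true labels (1 = true person)
yhat = [1 1 1 0 0];                        % hypothetical predictions
costs = C(sub2ind(size(C), y+1, yhat+1));  % look up the cost of each example
Ein = mean(costs)                          % weighted in-sample error
```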
Error Functions If Not Given. If no one tells you: base the error function on making the problem “solvable” — making it smooth and convex seems like a good idea (least-squares linear regression was very nice indeed). Or base it on assumptions about the target and the noise: logistic regression gives cross entropy; assuming a linear target with Gaussian noise gives least squares.
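The least-squares case in two lines, on assumed toy data:

```matlab
% Least-squares linear regression via the normal equations (sketch; toy data assumed).
X = [1 0.0; 1 1.0; 1 2.0; 1 3.0];    % bias column plus one feature
y = [0.1; 0.9; 2.1; 2.9];            % noisy linear target
theta = (X' * X) \ (X' * y)          % minimizes ||X*theta - y||^2
```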
Formalize Everything: Unknown Target, Unknown Probability Distribution P(x) → Data Set (x1,y1,...,xn,yn) → Learning Algorithm (with Hypothesis Set) → Hypothesis h, h(x) ≈ f(x).
Final Diagram: Unknown Target Distribution P(y | x), Unknown Input Probability Distribution P(x) → Data Set → Error Measure e → Learning Algorithm (with Hypothesis Set) → Final Hypothesis. The input distribution encodes importance: if x has very low probability, then it is not really going to count.
Words on the out-of-sample error. Imagine X and Y are finite sets; then
E_out(h) = Σ_x Σ_y P(x) P(y | x) e(h(x), y),
the expected error under the joint distribution P(x) P(y | x).
Quick Summary
Learning without assumptions is impossible; probabilistically, learning is possible (Hoeffding bound)
Work is needed for infinite hypothesis spaces!
The error function depends on the problem
Formalized learning approach: ensure the out-of-sample error is close to the in-sample error, and minimize the in-sample error
The complexity of the hypothesis set (its size M, currently) matters
More data helps