Slide 1: Crash Course on Machine Learning
Several slides from Luke Zettlemoyer, Carlos Guestrin, and Ben Taskar
Slide 2: Typical Paradigms of Recognition
[Diagram: feature computation, then model]
Slide 3: Visual Recognition: Identification (classification). Is this your car?
Slide 4: Visual Recognition: Verification (classification). Is this a car?
Slide 5: Visual Recognition: Classification. Is there a car in this picture?
Slide 6: Visual Recognition: Detection (classification + structure learning). Where is the car in this picture?
Slide 7: Visual Recognition: Activity Recognition (classification). What is he doing?
Slide 8: Visual Recognition: Pose Estimation (regression + structure learning)
Slide 9: Visual Recognition: Object Categorization (structure learning). Labeling image regions such as sky, person, tree, horse, car, bicycle, road.
Slide 10: Visual Recognition: Segmentation (classification + structure learning). Labeling regions such as sky, tree, car, person.
Slide 11: What kind of problems?
Classification: generative vs. discriminative; supervised, unsupervised, semi-supervised, weakly supervised; linear vs. nonlinear; ensemble methods; probabilistic
Regression: linear regression; structured output regression
Structure Learning: graphical models; margin-based approaches
Slide 15: Let's play with probability for a bit: remembering simple stuff
Slide 16: Thumbtack & Probabilities
P(Heads) = θ, P(Tails) = 1 - θ
Flips are i.i.d.: independent events, identically distributed according to a binomial distribution.
Sequence D of H heads and T tails, D = {x_i | i = 1…n}:
P(D | θ) = Π_i P(x_i | θ) = θ^H (1 - θ)^T
Slide 17: Maximum Likelihood Estimation
Data: observed set D of H heads and T tails
Hypothesis: binomial distribution
Learning: finding θ is an optimization problem. What's the objective function?
MLE: choose θ to maximize the probability of D: θ̂ = argmax_θ P(D | θ)
Slide 18: Parameter learning
Set the derivative of the log-likelihood to zero and solve:
d/dθ [H ln θ + T ln(1 - θ)] = 0, which gives θ̂_MLE = H / (H + T)
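A minimal sketch of this closed-form estimate (not from the slides; flips are assumed to be encoded as 1 = heads, 0 = tails):

```python
def mle_theta(flips):
    """MLE for the thumbtack bias: theta_hat = H / (H + T)."""
    heads = sum(flips)
    return heads / len(flips)

print(mle_theta([1, 1, 1, 0, 0]))  # 3 heads, 2 tails -> 0.6
```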
Slide 19: But how many flips do I need?
3 heads and 2 tails: θ̂ = 3/5, I can prove it!
What if I flipped 30 heads and 20 tails? Same answer, I can prove it!
What's better? Umm… the more the merrier???
Slide 20: A bound (from Hoeffding's inequality)
For N = H + T, let θ* be the true parameter; then for any ε > 0:
P(|θ̂ - θ*| ≥ ε) ≤ 2 exp(-2Nε²)
[Plot: probability of mistake vs. N, showing exponential decay]
Hoeffding's inequality provides a bound on the probability that a sum of random variables deviates from its expectation.
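A minimal worked example (not from the slides): inverting the bound 2·exp(-2Nε²) ≤ δ gives the number of flips needed for accuracy ε with confidence 1 - δ.

```python
import math

# Minimal sketch (not from the slides): smallest N with 2*exp(-2*N*eps^2) <= delta,
# i.e. N >= ln(2/delta) / (2*eps^2).
def flips_needed(eps, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(flips_needed(0.1, 0.05))  # 185 flips for error 0.1 at 95% confidence
```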
Slide 21: What if I have prior beliefs?
"Wait, I know that the thumbtack is 'close' to 50-50. What can you do for me now?"
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[Figure: the distribution over θ in the beginning (prior) and after observations (posterior); observe flips, e.g. {tails, tails}]
22
How to use Prior Use Bayes rule: Or equivalently:
Data Likelihood Posterior Normalization Or equivalently: Also, for uniform priors: Only made it to Gaussians. Probability review was pretty painful, look 15+ minutes. reduces to MLE objective
Slide 23: Beta prior distribution: P(θ)
Likelihood function: P(D | θ) = θ^H (1 - θ)^T
Posterior: P(θ | D) ∝ P(D | θ) P(θ)
Slide 24: MAP for the Beta distribution
MAP: use the most likely parameter: θ̂ = argmax_θ P(θ | D)
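A minimal sketch of the Beta-Bernoulli update behind slides 23 and 24, assuming a Beta(a, b) prior on θ (the names a and b and their default values are illustrative, not from the slides): the posterior after H heads and T tails is Beta(a + H, b + T), and its mode gives the MAP estimate (H + a - 1) / (H + T + a + b - 2).

```python
# Minimal sketch (assumed Beta(a, b) prior; a, b and their defaults are illustrative).
# Posterior after H heads and T tails is Beta(a + H, b + T); its mode is the MAP estimate.
def map_theta(H, T, a=2.0, b=2.0):
    return (H + a - 1.0) / (H + T + a + b - 2.0)

def mle_theta(H, T):
    return H / (H + T)

print(mle_theta(3, 2))  # 0.6
print(map_theta(3, 2))  # ~0.571, pulled toward 0.5 by the prior
```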
Slide 25: What about continuous variables?
Slide 26: We like Gaussians because…
Affine transformations (multiplying by a scalar and adding a constant) stay Gaussian: if X ~ N(μ, σ²) and Y = aX + b, then Y ~ N(aμ + b, a²σ²).
The sum of independent Gaussians is Gaussian: if X ~ N(μ_X, σ_X²), Y ~ N(μ_Y, σ_Y²), and Z = X + Y, then Z ~ N(μ_X + μ_Y, σ_X² + σ_Y²).
They are easy to differentiate.
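A minimal sketch (not from the slides) that empirically checks the affine-transformation property with sampled data:

```python
import numpy as np

# Minimal sketch (not from the slides): check that Y = aX + b has
# mean a*mu + b and variance a^2 * sigma^2 when X ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 3.0, 0.5, 1.0
y = a * rng.normal(mu, sigma, size=1_000_000) + b
print(y.mean(), a * mu + b)        # ~2.0 vs 2.0
print(y.var(), (a * sigma) ** 2)   # ~2.25 vs 2.25
```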
Slide 27: Learning a Gaussian
Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores (x_i = 85, 95, 100, 12, …, 89).
Learn the parameters: mean μ and variance σ².
Slide 28: MLE for Gaussian
Probability of i.i.d. samples D = {x_1, …, x_N}:
P(D | μ, σ) = (1 / (σ√(2π)))^N Π_i exp(-(x_i - μ)² / (2σ²))
Log-likelihood of the data:
ln P(D | μ, σ) = -N ln(σ√(2π)) - Σ_i (x_i - μ)² / (2σ²)
Slide 29: MLE for the mean of a Gaussian
What's the MLE for the mean? Setting the derivative of the log-likelihood with respect to μ to zero gives μ̂_MLE = (1/N) Σ_i x_i.
Slide 30: MLE for the variance
Again, set the derivative to zero: σ̂²_MLE = (1/N) Σ_i (x_i - μ̂)².
Slide 31: Learning Gaussian parameters
MLE: μ̂ = (1/N) Σ_i x_i and σ̂² = (1/N) Σ_i (x_i - μ̂)². Note that the MLE for the variance is biased; the unbiased estimator divides by N - 1 instead of N.
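A minimal sketch of these two estimates (not from the slides; the sample values are illustrative):

```python
import numpy as np

# Minimal sketch (not from the slides): MLE for a Gaussian's parameters.
def gaussian_mle(samples):
    x = np.asarray(samples, dtype=float)
    mu = x.mean()                    # mu_hat = (1/N) * sum_i x_i
    var = ((x - mu) ** 2).mean()     # sigma^2_hat, the biased MLE (divides by N)
    return mu, var

print(gaussian_mle([85, 95, 100, 12, 89, 99]))  # illustrative exam scores
```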
Slide 32: MAP and conjugate priors
Prior for the mean: Gaussian prior
Prior for the variance: Wishart distribution
Slide 33: Supervised Learning: find f
Given: a training set {(x_i, y_i) | i = 1 … n}
Find: a good approximation to f : X → Y
What is x? What is y?
Slide 34: Simple Example: Digit Recognition
Input: images / pixel grids. Output: a digit 0-9.
Setup: get a large collection of example images, each labeled with a digit (note: someone has to hand-label all this data!). We want to learn to predict the labels of new, future digit images.
Features: ? ("Screw you, I want to use pixels :D")
Slide 35: Let's take a probabilistic approach!!!
Can we directly estimate the data distribution P(X, Y)? How do we represent these distributions? How many parameters?
Prior, P(Y): suppose Y is composed of k classes.
Likelihood, P(X | Y): suppose X is composed of n binary features.
Slide 36: Conditional Independence
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
(∀ x, y, z) P(X = x | Y = y, Z = z) = P(X = x | Z = z)
Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
Slide 37: Naïve Bayes
Naïve Bayes assumption: features are independent given the class:
P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
More generally: P(X_1, …, X_n | Y) = Π_i P(X_i | Y)
Slide 38: The Naïve Bayes Classifier
Given: a prior P(Y); n conditionally independent features X given the class Y; for each X_i, a likelihood P(X_i | Y)
Decision rule: y* = argmax_y P(y) Π_i P(x_i | y)
[Graphical model: class node Y with arrows to feature nodes X_1, X_2, …, X_n]
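A minimal sketch of this decision rule (not from the slides), computed in log space for numerical stability; the data structures `prior` and `likelihood` are illustrative:

```python
import math

# Minimal sketch (not from the slides) of y* = argmax_y P(y) * prod_i P(x_i | y).
# prior[y] and likelihood[y][i][value] are illustrative data structures.
def nb_predict(x, prior, likelihood):
    best_y, best_score = None, float("-inf")
    for y in prior:
        score = math.log(prior[y]) + sum(
            math.log(likelihood[y][i][xi]) for i, xi in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

prior = {0: 0.5, 1: 0.5}
likelihood = {0: [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}],
              1: [{0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}]}
print(nb_predict([1, 1], prior, likelihood))  # -> 1
```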
Slide 39: A Digit Recognizer
Input: pixel grids. Output: a digit 0-9.
Slide 40: Naïve Bayes for Digits (Binary Inputs)
Simple version: one feature F_ij for each grid position <i,j>. Possible feature values are on/off, based on whether the intensity is more or less than 0.5 in the underlying image. Each input maps to a feature vector, e.g. the example image shown maps to a long binary vector.
Here: lots of features, each binary valued.
Naïve Bayes model: P(Y | F) ∝ P(Y) Π_{i,j} P(F_ij | Y)
Are the features independent given the class? What do we need to learn?
Slide 41: Example Distributions
[Tables of example parameter values: the prior P(Y) over digits and P(F_ij = on | Y) for two example pixel features.]
Slide 42: MLE for the parameters of NB
Given a dataset, let Count(A = a, B = b) be the number of examples where A = a and B = b.
MLE for discrete NB is simply:
Prior: P(Y = y) = Count(Y = y) / total number of examples
Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Count(Y = y)
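A minimal sketch of these counting-based estimates (not from the slides; the function and variable names are illustrative):

```python
from collections import Counter, defaultdict

# Minimal sketch (not from the slides): counting-based MLE for discrete NB.
# `data` is a list of (feature_vector, label) pairs.
def nb_mle(data):
    class_count = Counter()
    feat_count = defaultdict(Counter)          # (i, y) -> Counter over feature values
    for x, y in data:
        class_count[y] += 1
        for i, xi in enumerate(x):
            feat_count[(i, y)][xi] += 1
    n = sum(class_count.values())
    prior = {y: c / n for y, c in class_count.items()}
    likelihood = {(i, y): {v: c / class_count[y] for v, c in cnt.items()}
                  for (i, y), cnt in feat_count.items()}
    return prior, likelihood

prior, likelihood = nb_mle([((1, 0), "a"), ((1, 1), "a"), ((0, 1), "b")])
print(prior)                 # {'a': 0.667, 'b': 0.333}
print(likelihood[(0, "a")])  # {1: 1.0}
```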
Slide 43: Violating the NB assumption
Usually, features are not conditionally independent: P(X_1, …, X_n | Y) ≠ Π_i P(X_i | Y)
NB often performs well even when the assumption is violated; [Domingos & Pazzani '96] discuss some conditions for good performance.
Slide 44: Smoothing
2 wins!! Does this happen in vision?
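The slide itself only shows a figure, but the standard technique the title refers to is additive (Laplace) smoothing of the count-based estimates above; a minimal sketch, assuming add-k smoothing (k and num_values are illustrative names, not from the slide):

```python
# Minimal sketch (assumed add-k / Laplace smoothing; k and num_values are illustrative):
# P(X_i = x | Y = y) = (Count(X_i = x, Y = y) + k) / (Count(Y = y) + k * num_values)
def smoothed_likelihood(count_xy, count_y, num_values, k=1.0):
    return (count_xy + k) / (count_y + k * num_values)

print(smoothed_likelihood(0, 100, 2))  # an unseen feature value no longer gets probability 0
```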
Slide 45: NB & the Bag of Words model
Slide 46: What about real-valued features? What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel.
Gaussian Naïve Bayes (GNB): P(X_i = x | Y = y_k) = N(x; μ_ik, σ_ik²)
Sometimes we assume the variance is independent of Y (i.e., σ_i), independent of X_i (i.e., σ_k), or both (i.e., σ).
Slide 47: Estimating Parameters
Maximum likelihood estimates (the superscript j indexes the j-th training example; δ(x) = 1 if x is true, else 0):
Mean: μ̂_ik = Σ_j X_i^j δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
Variance: σ̂_ik² = Σ_j (X_i^j - μ̂_ik)² δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
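A minimal sketch of these per-class, per-feature estimates (not from the slides; the arrays and function name are illustrative):

```python
import numpy as np

# Minimal sketch (not from the slides): per-class, per-feature Gaussian MLE for GNB.
# X is an (n_examples, n_features) array, y a length-n_examples array of class labels.
def gnb_fit(X, y):
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0))  # (mu_ik, sigma^2_ik)
    return params

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5]])
y = np.array([0, 0, 1])
print(gnb_fit(X, y)[0])  # class-0 means [1.1, 1.9], variances [0.01, 0.01]
```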
Slide 48: What you need to know about Naïve Bayes
Naïve Bayes classifier: what the assumption is, why we use it, how we learn it
Why Bayesian estimation is important
Bag of words model
Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class
Optimal decision using the Bayes classifier
Slide 49: Another probabilistic approach!!!
Naïve Bayes: directly estimate the data distribution P(X, Y)! Challenging due to the size of the distribution! Make the Naïve Bayes assumption: only need P(X_i | Y)!
But wait, we classify according to max_Y P(Y | X). Why not learn P(Y | X) directly?