
1 Outline
Overview of course plans, grading, topics
Overview of Machine Learning
Review of Probability and uncertainty
  Motivation
  Axiomatic treatment of probability
  Definitions and illustrations of some key concepts
Classification and K-NN
Decision trees and rules

2 Key Concepts in Probability: Bayes rule

3 Some practical problems
I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ?

4 P(A|B) = ?
P(A and B) = P(A|B) * P(B), and also P(A and B) = P(B|A) * P(A), so
P(A|B) * P(B) = P(B|A) * P(A), i.e.
P(A|B) = P(B|A) * P(A) / P(B)
[Diagram: the sample space split into A (fair die) and ~A (loaded), showing the regions "A and B" and "~A and B", with P(A), P(~A), P(B|A), P(B|~A) labeled]
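A quick numeric check of the rule above, as a MATLAB sketch: the slides don't state how often the loaded d20 rolls 19 or 20, so the 0.5 below is an assumption borrowed from the cheater's-die rate used on slide 7.

pA    = 3/4;                      % prior: 3 of the 4 dice are fair
pBgA  = 2/20;                     % P(19 or 20 | fair die)
pBgNA = 0.5;                      % P(19 or 20 | loaded die) -- assumed
pB    = pBgA*pA + pBgNA*(1-pA);   % total probability of B
pAgB  = pBgA*pA / pB              % Bayes' rule: = 0.075/0.2 = 0.375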

5 Bayes' rule
P(A|B) = P(B|A) * P(A) / P(B), where P(A|B) is the posterior and P(A) is the prior. Equivalently, P(A|B) * P(B) = P(B|A) * P(A).
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
"…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…"

6 Probability - what you need to really, really know
Probabilities are cool
Random variables and events
The Axioms of Probability
Independence, binomials, multinomials, …
Conditional probabilities
Bayes Rule

7 Some practical problems
Joe throws 4 critical hits in a row; is Joe cheating?
A = Joe is using a cheater's die
C = roll 19 or 20; P(C|A) = 0.5, P(C|~A) = 0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.5^4 = 0.0625; P(B|~A) = 0.1^4 = 0.0001
perl try.pl 0.01 → q = …
perl try.pl … → q = …
perl try.pl 0.001 → q = …
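The try.pl outputs didn't survive extraction, but Bayes' rule pins down what a script like it would print if it simply computes the posterior from a given prior. A minimal MATLAB sketch of that computation (the function name, and the assumption about what try.pl does, are mine):

function q = cheatPosterior(prior)
  % posterior P(A|B) that Joe is cheating after 4 critical hits,
  % using P(C|A)=0.5 and P(C|~A)=0.1 from the slide
  pBgA  = 0.5^4;   % = 0.0625
  pBgNA = 0.1^4;   % = 0.0001
  q = pBgA*prior / (pBgA*prior + pBgNA*(1-prior));
end

With these numbers, cheatPosterior(0.01) ≈ 0.86 and cheatPosterior(0.001) ≈ 0.38: even a tiny prior gets overwhelmed once the evidence piles up.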

8 What's the experiment and outcome here?
Outcome A: Joe is cheating. Candidate experiments:
Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, n of whom cheat with probability p > 0.
I have no idea, but I don't like his looks. Call it P(A) = 0.1.

9 Remember: Don't Mess with The Axioms
A subjective belief can be treated, mathematically, like a probability. Use those axioms!
There have been many, many other approaches to understanding "uncertainty": fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … 25 years ago people in AI argued about these; now they mostly don't.
Any scheme for combining uncertain information, uncertain "beliefs", etc., really should obey these axioms. If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] ⇔ [your uncertainty formalism violates the axioms] - de Finetti 1931 (the "Dutch book argument")

10 Some practical problems
Joe throws 4 critical hits in a row; is Joe cheating?
A = Joe is using a cheater's die
C = roll 19 or 20; P(C|A) = 0.5, P(C|~A) = 0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.5^4 = 0.0625; P(B|~A) = 0.1^4 = 0.0001
perl try.pl 0.01 → q = …
perl try.pl … → q = …
perl try.pl 0.001 → q = …
Moral: with enough evidence the prior P(A) doesn't really matter.

11 Key Concepts in Probability: SMOOTHING, MLE, and MAP

12 Some practical problems
I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)

13 One solution
I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
P(1)=0, P(2)=0, P(3)=0, P(4)=0.1, …, P(19)=0.25, P(20)=0.2
MLE = maximum likelihood estimate
But: do I really think it's impossible to roll a 1, 2 or 3?

14 A better solution
I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)

15 A better solution
I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
P(1)=1/40, P(2)=1/40, P(3)=1/40, P(4)=(2+1)/40, …, P(19)=(5+1)/40=0.15, P(20)=(4+1)/40=1/8
0.25 vs 0.15 - really different! Maybe I should "imagine" less data?

16 A better solution?
Q: What if I used m imagined rolls, with probability q = 1/20 of rolling any i? That gives the estimate Pr(i) = (C(i) + m*q) / (C(ANY) + m). I can use this formula with m > 20, or even with m < 20 … say with m = 1.

17 A better solution
Q: What if I used m imagined rolls, with probability q = 1/20 of rolling any i?
Pr(i) = (C(i) + m*q) / (C(ANY) + m)
If m >> C(ANY) then your imagination q rules
If m << C(ANY) then your data rules
BUT you never ever ever end up with Pr(i) = 0
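A minimal MATLAB sketch of this smoothed estimator, assuming the observed rolls sit in a vector rolls with values 1..20 (the variable names are mine):

rolls = randi(20, [20,1]);            % stand-in data; use the real 20 rolls here
m = 20;  q = 1/20;                    % m imagined rolls, uniform over the faces
C = histc(rolls, 1:20);               % C(i) = count of rolls showing face i
P = (C + m*q) / (numel(rolls) + m);   % smoothed estimate; no face gets Pr = 0

With m = 20 this reproduces the (C(i)+1)/40 numbers on slide 15; shrinking m moves P back toward the raw MLE.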

18 Terminology - more later
This is called a symmetric Dirichlet prior
C(i), C(ANY) are sufficient statistics
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate

19 Why we call this a MAP
Simpler case: replace the die with a coin. Now there's one parameter: x = P(H).
I start with a prior over x, P(x) --- a continuous pdf
I get some data: D = {D1=H, D2=T, …}
I compute the posterior of x
The math works if the pdf is f(x) ∝ x^(α-1) * (1-x)^(β-1), where α, β are numbers of imagined pos/neg examples

20 Why we call this a MAP
The math works if the pdf is f(x) = x^(α-1) * (1-x)^(β-1) / B(α,β), where B(α,β) is a normalizing constant and α, β are imagined pos/neg examples.
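As a sketch of why this is a "maximum a posteriori" estimate: with a Beta(α,β) prior and nH heads / nT tails observed, the posterior is Beta(α+nH, β+nT), and its mode has a closed form. The variable names below are mine:

flips = [1 0 1 1 0 1];      % stand-in data: 1 = heads, 0 = tails
a = 2;  b = 2;              % Beta prior parameters
nH = sum(flips);  nT = numel(flips) - nH;
xMAP = (nH + a - 1) / (nH + nT + a + b - 2)   % mode of the Beta posterior
xMLE = nH / (nH + nT)                         % compare: no imagined data

With a = b = 1 (a uniform prior) the MAP estimate collapses to the MLE.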

21 Why we call this a MAP
This is called a beta distribution. The generalization to multinomials is called a Dirichlet distribution. Parameters are α1, …, αK, and the density is f(x1,…,xK) ∝ x1^(α1-1) * … * xK^(αK-1).

22 Some MORE TERMS

23 Cumulative Distribution Functions
Probability Density Function (PDF): f(x) ≥ 0
Total probability: ∫ f(x) dx = 1
Cumulative Distribution Function (CDF): F(x) = P(X ≤ x) = ∫ from -∞ to x of f(t) dt
Properties: F is non-decreasing, with F(-∞) = 0 and F(∞) = 1

24 Expectations
Mean/Expected Value: E[X] = ∫ x f(x) dx
Variance: Var(X) = E[(X - E[X])^2]
Expected value of any function g(x) of x: E[g(X)] = ∫ g(x) f(x) dx
Most common distributions (binomial, multinomial, Gaussian, …) have closed-form formulas for mean and variance.

25 (Univariate) Gaussian
[Slide shows the density f(x) = exp(-(x-μ)^2 / (2σ^2)) / (σ√(2π)), with mean μ and variance σ^2]

26 Multivariate Gaussians
[Slide shows the density f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(x-μ)ᵀ Σ⁻¹ (x-μ) / 2), with mean vector μ and covariance matrix Σ]

27 Key Concepts in Probability: The JOINT DISTRIBUTION

28 Probability - what you need to really, really know
Probabilities are cool
Random variables and events
The Axioms of Probability
Independence, binomials, multinomials
Conditional probabilities
Bayes Rule
MLE's, smoothing, and MAPs
The joint distribution

29 Some practical problems
I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6) = 0.50. Loaded low: P(X=1) = 0.50. Experiment: pick two of the dice uniformly at random (A) and roll them. What is more likely - rolling a seven or rolling doubles?

30 A brute-force solution
[Table: one row per combination of A (which pair of dice was picked, e.g. FL, HL, HF), Roll 1, and Roll 2, with its probability, e.g. P = 1/3 * 1/6 * 1/2 and P = 1/3 * 1/6 * 1/10; rows marked as "seven" or "doubles"]
A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
P(doubles)
P(seven or eleven)
P(total is higher than 5)
How big is the table?
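Since the table itself didn't survive extraction, here is a MATLAB sketch that enumerates the same joint, assuming P(6)=0.5 for the loaded-high die and P(1)=0.5 for the loaded-low die with the other five faces at 0.1 each (consistent with the 1/10 entries above), and assuming the two dice are picked independently:

pFace = [ones(1,6)/6;                % die 1: fair
         0.1 0.1 0.1 0.1 0.1 0.5;    % die 2: loaded high
         0.5 0.1 0.1 0.1 0.1 0.1];   % die 3: loaded low
pSeven = 0;  pDoubles = 0;
for d1 = 1:3
  for d2 = 1:3
    for r1 = 1:6
      for r2 = 1:6
        p = (1/9) * pFace(d1,r1) * pFace(d2,r2);   % each die picked w.p. 1/3
        if r1 + r2 == 7, pSeven = pSeven + p; end
        if r1 == r2,     pDoubles = pDoubles + p; end
      end
    end
  end
end
[pSeven, pDoubles]   % compare: which is more likely?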

31 Estimating The Joint Distribution
Example: Boolean variables A, B, C
[Table: 8 rows, one per truth assignment of A, B, C, each with its probability, e.g. 0.30, 0.05, 0.10, 0.25, …]
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, estimate how probable it is from data.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
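A short MATLAB sketch of the recipe for M = 3 Boolean variables, assuming the data sit in an n-by-3 matrix X of 0/1 values (the stand-in data and names below are mine):

X = randi(2, [1000,3]) - 1;        % stand-in data: n=1000 rows of (A,B,C)
idx = X * [4;2;1] + 1;             % encode each row as an integer 1..8
counts = accumarray(idx, 1, [8,1]);
P = counts / sum(counts)           % estimated joint; entries sum to 1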

32 Density Estimation
Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attribute values to a probability.
[Diagram: Input Attributes → Density Estimator → Probability]
Copyright © Andrew W. Moore

33 Density Estimation - looking ahead
Compare it against the two other major kinds of models:
Classifier: Input Attributes → one of a few discrete values (prediction of categorical output or class)
Density Estimator: Input Attributes → Probability
Regressor: Input Attributes → prediction of real-valued output

34 Density Estimation → Classification
Classifier: Input Attributes x → prediction of categorical output, one of y1, …, yk
Density Estimator: Input Attributes → estimate P̂(x,y)
To classify x:
1. Use your estimator to compute P̂(x,y1), …, P̂(x,yk)
2. Return the class y* with the highest predicted probability P̂(x,y*)
Binary case: predict POS if P̂(pos|x) > 0.5
Ideally, y* is correct with probability P̂(y*|x) = P̂(x,y*) / (P̂(x,y1) + … + P̂(x,yk))
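A minimal MATLAB sketch of this reduction, assuming the density estimate is stored as a table Phat with Phat(x,y) ≈ P(x,y) for discrete instances x = 1..numX and labels y = 1..k (the names are mine):

function [ystar, conf] = classifyWithDensity(Phat, x)
  % pick the label with the largest joint probability for this x,
  % and report the normalized posterior as a confidence
  [~, ystar] = max(Phat(x,:));
  conf = Phat(x,ystar) / sum(Phat(x,:));
end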

35 Computing WITH a joint probability estimate, WITH MATLAB

36 Get some data
n = 1000;   % number of simulated trials (used throughout the later slides)
% which die was used first and second
dice1 = randi(3,[n,1]);
dice2 = randi(3,[n,1]);
% did 'loading' happen for rolls of die 1 and die 2 (0/1, to match roll below)
load1 = randi(2,[n,1]) - 1;
load2 = randi(2,[n,1]) - 1;
% simulate rolling the dice
r1 = roll(dice1,load1,randi(5,[n,1]),randi(6,[n,1]));
r2 = roll(dice2,load2,randi(5,[n,1]),randi(6,[n,1]));
% append the column vectors
D = [dice1,dice2,r1,r2];

37 Get some data
function [ face ] = roll(d,ld,upTo5,upTo6)
% equivalent, un-vectorized logic:
% if d==1,              face = randi(6)     % fair die
% elseif d==2 & ld==1,  face = 6            % loaded high: rolls 6 half the time
% elseif d==2 & ld==0,  face = randi(5)     % otherwise uniform on 1..5
% elseif d==3 & ld==1,  face = 1            % loaded low: rolls 1 half the time
% else,                 face = randi(5)+1   % otherwise uniform on 2..6
% end
face = (d==1).*upTo6 + ...
       (d==2).*(ld==1)*6 + (d==2).*(ld==0).*upTo5 + ...
       (d==3).*(ld==1)*1 + (d==3).*(ld==0).*(upTo5+1);
end

38 Get some data
>> D(1:10,:)
[output: the first 10 rows of D, elided]
>> imagesc(D)
[Figure: imagesc view of the raw data matrix D]

39 Get some data
>> [X,I] = sort(4*D(:,1) + D(:,2));   % sort rows by which pair of dice was used
>> S = D(I,:);
>> imagesc(S);
[Figure: imagesc view of the sorted data matrix S]

40 Get some data
>> D34 = D(:,3:4);          % keep just the two observed rolls
>> hist3(D34,[6,6])
[Figure: 3-D histogram over the 36 (roll 1, roll 2) combinations]

41 Estimate a joint density
>> [H,C] = hist3(D34,[6,6]);
>> H
[output: the 6x6 count matrix, elided]
>> P = H/1000
[output: the 6x6 empirical joint, elided]

42 Estimate a joint density
>> P = H/1000
[output elided]
>> SP = (H + (1/36))/1001
[output elided]
SP is the smoothed estimate from slide 17 with m = 1 imagined roll and q = 1/36: Pr(i,j) = (C(i,j) + m*q) / (C(ANY) + m).

43 Estimate a joint density
>> [I,J] = meshgrid(1:6,1:6);
>> surf(I,J,SP);
[Figure: surface plot of the smoothed joint SP]

44 Visualize a joint density
>> E = D34 + randn(1000,2)*0.1;   % jitter the points so they don't overlap
>> plot(E(:,1),E(:,2),'r*');
[Figure: jittered scatter plot of the (roll 1, roll 2) pairs]

45 Visualize a joint density
>> surf(I,J,SP);
>> hold on
>> plot3(E(:,1),E(:,2),zeros(1000,1)+0.1,'r*');
[Figure: jittered data points overlaid on the density surface]

46 Compute with the joint density
>> [I,J] = find(SP);            % row/column indices of all 36 cells
>> IJ = [I,J];
>> Sevens = IJ(I+J==7,:);       % (roll 1, roll 2) pairs that sum to seven
>> Doubles = IJ(I==J,:);        % pairs with both rolls equal
>> sum(SP(sub2ind(size(SP),Sevens(:,1),Sevens(:,2))))
ans = 0.1630
>> sum(SP(sub2ind(size(SP),Doubles(:,1),Doubles(:,2))))
ans = 0.1860

47 Or by counting datapoints
>> sum(SP(sub2ind(size(SP),Sevens(:,1),Sevens(:,2))))
ans = 0.1630
>> sum(SP(sub2ind(size(SP),Doubles(:,1),Doubles(:,2))))
ans = 0.1860
>> sum(D(:,3)+D(:,4)==7)/1000
[output elided]
>> sum(D(:,3)==D(:,4))/1000
[output elided]

48 Inference with the joint
What is P(both dice fair | roll >= 10)?
= P(both dice fair & roll >= 10) / P(roll >= 10)
>> sum((D(:,1)==1) & (D(:,2)==1) & (D(:,3)+D(:,4) >= 10))
ans = 9
>> sum(D(:,3)+D(:,4) >= 10)
ans = 112
>> 9/112
ans = 0.0804
>> sum( ((D(:,1)~=1) | (D(:,2)~=1)) & (D(:,3)+D(:,4) >= 10) )
ans = 103
>> sum(D(:,3)+D(:,4) >= 10)
ans = 112
>> 103/112
ans = 0.9196

49 CLASSIFICATION

50 One definition of machine learning
The study of programs that improve their performance (P) at some task (T) based on experience (E). Different types of machine learning arise when we vary P, T, and E. One of many examples: classification learning.

51 Density Estimation - looking ahead
Compare it against the two other major kinds of models:
Classifier: Input Attributes → one of a few discrete values (prediction of categorical output or class)
Density Estimator: Input Attributes → Probability
Regressor: Input Attributes → prediction of real-valued output

52 Many domains and applications
ML is the preferred method to solve a growing list of problems:
speech recognition
natural language processing
robot control
vision
Driven by:
better algorithms / faster machines
increased amount of data for many tasks
cheaper sensors and suitability for complex, hard-to-program tasks
need for user-modeling, customization to an environment, …

53 Classification vs Density Estimation

54 Classification vs density estimation

55 Classification learning
Task T:
  input:
  output:
Performance metric P:
Experience E:

56 Classification learning
Task T:
  input: a set of instances x1,…,xn
    an instance has a set of features, each of which has a value, usually numeric
    features ~= dimensions of a vector, so we can represent an instance as a vector x = <v1,…,vd>
  output: a set of predictions
    one of a fixed set of constant values: {+1,-1} or {cancer, healthy}, or {dog, cat, walrus, …}, or …
Performance metric P:
Experience E:

57 Classification Learning
Task: medical diagnosis
  Instance: patient record: blood pressure diastolic, blood pressure systolic, age, sex (0 or 1), BMI, cholesterol
  Labels: {-1,+1} = low, high risk of heart disease
Task: finding company names in text
  Instance: a word in context: capitalized (0,1), word-after-this-equals-Inc, bigram-before-this-equals-acquired-by, …
  Labels: {first, later, outside} = first word in name, second or later word in name, not in a name
Task: brain-human-interface
  Instance: brain state: neural activity over the last 100ms of 96 neurons
  Labels: {n,s,e,w,ne,se,nw,sw} = direction you intend to move the cursor
Task: image recognition
  Instance: image: 1920*1080 pixels, each with a code for color
  Labels: {0,1} = no walrus, walrus

58 Classification learning
Task T:
  input: a set of instances x1,…,xn
  output: a set of predictions
Performance metric P: Prob(wrong prediction) on an example drawn from D - we care about performance on the distribution, not the training data
Experience E: a set of labeled examples (x,y) where y is the true label for x; ideally, examples should be sampled from some fixed distribution D

59 Classification Learning
Task: medical diagnosis
  Instance: patient record: lab readings
  Labels: risk of heart disease
  Getting data: wait and look for heart disease
Task: finding company names in text
  Instance: a word in context: capitalized, nearby words, …
  Labels: {first, later, outside}
  Getting data: text with company names highlighted (say, by hand)
Task: brain-human-interface
  Instance: brain state: neural activity over the last 100ms of 96 neurons
  Labels: {n,s,e,w,ne,se,nw,sw}
  Getting data: recordings of someone doing known tasks (so direction can be inferred)
Task: image recognition
  Instance: image: pixels
  Labels: no walrus, walrus
  Getting data: hand-labeled images - can you get examples some other way?

60 Google image search: “walrus”

61 Image search: “flying bumblebee”
Could I use these as negative examples?

62 Nearest Neighbor Learning: Overview

63 Classification learning
Task T:
  input: a set of instances x1,…,xn drawn from D
  output: a set of predictions
Performance metric P: Prob(wrong prediction) on an example drawn from D
Experience E: a set of labeled examples (x,y) where y is the true label for x
How do we make predictions?

64

65 k-nearest neighbor learning
Given a test example x:
1. Find the k training-set examples (x1,y1),…,(xk,yk) that are closest to x.
2. Predict the most frequent label in that set.
[Figure: labeled points with test examples marked "?"]
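A minimal MATLAB sketch of that prediction rule, assuming rows of Xtrain are instance vectors and y holds their numeric labels; the names are mine, and the Xtrain - x broadcast needs MATLAB R2016b+ (use bsxfun on older versions):

function yhat = knnPredict(Xtrain, y, x, k)
  d = sum((Xtrain - x).^2, 2);   % squared Euclidean distance to every example
  [~, idx] = sort(d);            % nearest first
  yhat = mode(y(idx(1:k)));      % most frequent label among the k nearest
end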

66 Breaking it down
To train: save the data. Very fast! (…though you might build some indices…)
To test, for each test example x:
1. Find the k training-set examples (x1,y1),…,(xk,yk) that are closest to x.
2. Predict the most frequent label in that set.
Prediction is relatively slow (compared to most other classifiers).

67 k-nearest neighbor learning: no obvious "decision boundary"
If examples are vectors, instances live in some (high-dimensional) space, and learning splits the space up into areas where we predict setosa or non-setosa.
[Figure: labeled points with test examples marked "?"]

68 What is the decision boundary for k-NN? Let's look at k=1
Voronoi Diagram: each cell Ci is the set of all points that are closest to a particular example xi.
[Figure: Voronoi diagram over + and - labeled examples, with a test point *]

69 The decision boundary
[Figure: the same Voronoi diagram, with the boundary between + cells and - cells highlighted]

70 Effect of k on decision boundary

71 Overfitting and k-NN
Large k → a smooth shape for the decision boundary
Small k → a complicated shape
What's the best value of k?
[Plot: Error/Loss vs. k, from large k to small k, comparing error on the training set D with error on an unseen test set Dtest]
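One standard way to answer "what's the best k?" is to score candidates on held-out data; a sketch, assuming knnPredict from the earlier slide and hypothetical training/validation splits Xtr, ytr, Xval, yval:

ks = [1 3 5 9 15 25];                 % candidate values of k
err = zeros(size(ks));
for i = 1:numel(ks)
  yhat = arrayfun(@(j) knnPredict(Xtr, ytr, Xval(j,:), ks(i)), (1:size(Xval,1))');
  err(i) = mean(yhat ~= yval);        % held-out error rate for this k
end
[~, best] = min(err);
kBest = ks(best)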

72 Some common variants
Distance metrics:
  Euclidean distance: ||x1 - x2||
  Cosine distance: 1 - <x1,x2> / (||x1|| * ||x2||) (this is in [0,1])
Weighted nearest neighbor (def 1): instead of the most frequent y among the k nearest neighbors, predict the label with the largest total neighbor weight, e.g. with weights that decrease with distance.
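A sketch of def 1 with one common weighting choice, inverse distance (the slide's exact formula didn't survive extraction, so the 1/d weights here are an assumption):

function yhat = weightedKnnPredict(Xtrain, y, x, k)
  d = sqrt(sum((Xtrain - x).^2, 2));   % Euclidean distance to every example
  [ds, idx] = sort(d);
  w = 1 ./ max(ds(1:k), eps);          % inverse-distance weights, k nearest
  yk = y(idx(1:k));
  labels = unique(yk);
  scores = arrayfun(@(c) sum(w(yk==c)), labels);
  [~, best] = max(scores);
  yhat = labels(best);                 % label with the largest total weight
end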

73 Some common variants
Weighted nearest neighbor (def 2): instead of Euclidean distance, use a weighted version with a per-feature weight wj, e.g. d(x,x') = sqrt(sum over j of wj * (xj - x'j)^2). Weights might be based on information gain, …
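A one-line sketch of that weighted distance, with w a hypothetical per-feature weight vector (e.g. information gains):

function d = weightedDist(x1, x2, w)
  d = sqrt(sum(w .* (x1 - x2).^2));   % per-feature weighted Euclidean distance
end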

