Outline Overview of course plans, grading, topics

Outline Overview of course plans, grading, topics Overview of Machine Learning Review of Probability and uncertainty Motivation Axiomatic treatment of probability Definitions and illustrations of some key concepts Classification and K-NN Decision trees and rules

Key Concepts in Probability: Bayes rule

Some practical problems I have 3 standard d20 dice and 1 loaded d20. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = a roll of 19 or 20 with that die. Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair, i.e. what is P(A|B)?

P(A|B) = ? We know P(A and B) = P(A|B) * P(B) and also P(A and B) = P(B|A) * P(A), so P(A|B) * P(B) = P(B|A) * P(A), and therefore P(A|B) = P(B|A) * P(A) / P(B). (The slide's picture splits the outcomes into A = fair die vs. ~A = loaded die, marks the events A and B and ~A and B, and labels the probabilities P(A), P(~A), P(B|A), P(B|~A).)

Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), where P(A) is the prior and P(A|B) is the posterior. Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418. "…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…"
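
A worked instance for the d20 problem above, assuming (as on the later cheater's-die slides) that the loaded d20 shows 19 or 20 half the time: P(A) = 3/4, P(B|A) = 2/20 = 0.1, P(B|~A) = 0.5, so P(B) = 0.1*0.75 + 0.5*0.25 = 0.2 and P(A|B) = 0.1*0.75 / 0.2 = 0.375. Seeing a 19 or 20 drops the probability that the die is fair from 0.75 to 0.375.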

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials, … Conditional probabilities Bayes Rule

Some practical problems Joe throws 4 critical hits in a row; is Joe cheating? A = Joe is using a cheater's die; C = roll 19 or 20, with P(C|A)=0.5 and P(C|~A)=0.1; B = C1 and C2 and C3 and C4, so P(B|A) = 0.5^4 = 0.0625 and P(B|~A) = 0.1^4 = 0.0001.
perl try.pl 0.01    q = 0.863259668508287
perl try.pl 0.0001  q = 0.0588290662650602
perl try.pl 0.001   q = 0.384852216748769
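
The contents of try.pl aren't shown; a minimal MATLAB sketch of what it presumably computes (the posterior q = P(A|B) as a function of the prior p = P(A) passed as the argument):
p = 0.01;                               % prior that Joe is cheating
pB_A    = 0.5^4;                        % P(B|A)  = 0.0625
pB_notA = 0.1^4;                        % P(B|~A) = 0.0001
q = pB_A*p / (pB_A*p + pB_notA*(1-p))   % 0.8633 for p = 0.01, matching the output above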

What's the experiment and outcome here? Outcome A: Joe is cheating. Candidate experiments: (a) Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one. (b) Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, n of whom cheat with probability p>0. (c) I have no idea, but I don't like his looks; call it P(A)=0.1.

Remember: Don't Mess with The Axioms A subjective belief can be treated, mathematically, like a probability. Use those axioms! There have been many, many other approaches to understanding “uncertainty”: fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … 25 years ago people in AI argued about these; now they mostly don't. Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms. If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] ⇔ [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)

Some practical problems (revisited)
perl try.pl 0.01    q = 0.863259668508287
perl try.pl 0.0001  q = 0.0588290662650602
perl try.pl 0.001   q = 0.384852216748769
Joe throws 4 critical hits in a row; is Joe cheating? A = Joe is using a cheater's die; C = roll 19 or 20, with P(C|A)=0.5 and P(C|~A)=0.1; B = C1 and C2 and C3 and C4, so P(B|A) = 0.0625 and P(B|~A) = 0.0001. Moral: with enough evidence the prior P(A) doesn't really matter.

Key Concepts in Probability: SMOOTHING, MLE, and MAP

Some practical problems I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves? 1. Collect some data (20 rolls). 2. Estimate Pr(i) = C(rolls of i) / C(any roll).

One solution I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves? P(1)=0 P(2)=0 P(3)=0 P(4)=0.1 … P(19)=0.25 P(20)=0.2. This is the MLE = maximum likelihood estimate. But: do I really think it's impossible to roll a 1, 2, or 3?

A better solution I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves? 0. Imagine some data (20 rolls, each i shows up 1x). 1. Collect some data (20 rolls). 2. Estimate Pr(i) = C(rolls of i) / C(any roll).

A better solution I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves? P(1)=1/40 P(2)=1/40 P(3)=1/40 P(4)=(2+1)/40 … P(19)=(5+1)/40 P(20)=(4+1)/40=1/8. 0.25 vs. 0.125 – really different! Maybe I should “imagine” less data?

A better solution? Q: What if I used m imagined rolls, each giving probability q = 1/20 to every face i, i.e. estimate Pr(i) = (C(i) + m*q) / (C(ANY) + m)? I can use this formula with m>20, or even with m<20 … say with m=1.

A better solution Q: What if I used m imagined rolls with probability q = 1/20 of rolling any i, i.e. Pr(i) = (C(i) + m*q) / (C(ANY) + m)? If m >> C(ANY) then your imagination q rules. If m << C(ANY) then your data rules. BUT you never ever ever end up with Pr(i)=0.
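
A minimal MATLAB sketch of the smoothed estimate (the actual 20 rolls aren't listed on the slides, so a stand-in sample is simulated here):
rolls = randi(20, 20, 1);               % stand-in for the 20 recorded d20 rolls
C = accumarray(rolls, 1, [20 1]);       % C(i) = number of times face i came up
q = 1/20;  m = 20;                      % m = 20 imagined rolls reproduces the (C(i)+1)/40 estimates above
Pmle      = C / sum(C);                 % MLE: exactly 0 for any face never rolled
Psmoothed = (C + m*q) / (sum(C) + m);   % never 0, no matter what the data look like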

Terminology – more later This is called a symmetric Dirichlet prior. C(i), C(ANY) are sufficient statistics. MLE = maximum likelihood estimate; MAP = maximum a posteriori estimate.

Why we call this a MAP Simpler case: replace the die with a coin, so there's one parameter: x = P(H). I start with a prior over x, P(x) --- a continuous pdf. I get some data: D = {D1=H, D2=T, …}. I compute the posterior of x. The math works if the prior pdf is f(x) proportional to x^α * (1-x)^β, where α, β are numbers of imagined pos/neg examples.

Why we call this a MAP The math works if the prior pdf is f(x) proportional to x^α * (1-x)^β, with α, β the imagined pos/neg examples.

Why we call this a MAP This prior is called a beta distribution. The generalization to multinomials is called a Dirichlet distribution: its parameters are α1, …, αK and its density is f(x1,…,xK) proportional to x1^α1 * … * xK^αK on the simplex where the xk sum to 1 (the usual textbook parameterization writes the exponents as αk - 1).
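
The algebra behind the name, in the slide's parameterization (standard beta-Bernoulli bookkeeping, not copied from the slides): with prior f(x) proportional to x^α * (1-x)^β and data containing H heads and T tails, the posterior is proportional to x^(H+α) * (1-x)^(T+β). Setting its derivative to zero gives the MAP estimate x* = (H+α) / (H+T+α+β), which is exactly the MLE computed after adding α imagined heads and β imagined tails; this is the same trick as the m*q smoothing above.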

Some MORE TERMS

Cumulative Distribution Functions For a random variable X with probability density function (PDF) f, the CDF is F(x) = P(X <= x) = the integral of f(t) dt from -infinity to x. Total probability: the density integrates to 1 over the whole range. Properties: F is non-decreasing, F(-infinity) = 0, F(+infinity) = 1.

Expectations Mean/Expected Value: E[X] = the integral of x f(x) dx (a sum over values in the discrete case). Variance: Var(X) = E[(X - E[X])^2]. Expected value of any function g(x) of x: E[g(X)] = the integral of g(x) f(x) dx. Most common distributions (binomial, multinomial, Gaussian, …) have closed-form formulas for mean and variance.

(Univariate) Gaussian

Multivariate Gaussians
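
The density formulas on these two slides are images; the standard forms they presumably show are: univariate, p(x) = (1/sqrt(2*pi*sigma^2)) * exp(-(x-mu)^2 / (2*sigma^2)); multivariate in d dimensions with mean vector mu and covariance matrix Sigma, p(x) = (2*pi)^(-d/2) * det(Sigma)^(-1/2) * exp(-(1/2)*(x-mu)' * inv(Sigma) * (x-mu)).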

Key Concepts in Probability: The JOINT DISTRIBUTION

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution

Some practical problems I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50. Experiment: pick two of the d6 uniformly at random (call the pair A) and roll them. What is more likely – rolling a seven or rolling doubles?

A brute-force solution A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. (The slide's table has columns A, Roll 1, Roll 2, P; e.g. the row A=FL, Roll 1=1, Roll 2=1 has P = 1/3 * 1/6 * 1/2, the row A=FL, Roll 1=1, Roll 2=2 has P = 1/3 * 1/6 * 1/10, and so on through A=HL and A=HF.) With this you can compute any P(E) where E is any boolean combination of the primitive events (Xi=xi), e.g. P(doubles), P(seven or eleven), P(total is higher than 5), …. How big is the table?
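
The table is small enough to build exactly; a MATLAB sketch (assuming, as the 1/10 entries imply, that each loaded die puts probability 0.5 on its favored face and 0.1 on each of the other five):
pf = ones(1,6)/6;                       % fair die
ph = [0.1 0.1 0.1 0.1 0.1 0.5];         % loaded high
pl = [0.5 0.1 0.1 0.1 0.1 0.1];         % loaded low
pmf = {pf; ph; pl};
pairs = [1 3; 2 3; 2 1];                % the three equally likely pairs: FL, HL, HF
J = zeros(6,6);                         % joint distribution over (roll 1, roll 2)
for a = 1:3
  J = J + (1/3) * (pmf{pairs(a,1)}' * pmf{pairs(a,2)});
end
Pseven   = sum(J(sub2ind([6 6], 6:-1:1, 1:6)))   % rolls summing to 7
Pdoubles = sum(diag(J))                          % equal rolls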

Estimating The Joint Distribution Example: Boolean variables A, B, C. The table lists every combination of values of A, B, C together with an estimated probability for each row (0.30, 0.05, 0.10, 0.25, …). Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, estimate how probable it is from data. 3. If you subscribe to the axioms of probability, those numbers must sum to 1.
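
A sketch of the recipe in MATLAB for three Boolean variables (the data matrix X is hypothetical; each row is one observation of A, B, C):
X = randi([0 1], 1000, 3);                     % stand-in data, one observation per row
idx = X(:,1)*4 + X(:,2)*2 + X(:,3) + 1;        % which truth-table row each observation falls in
counts = accumarray(idx, 1, [8 1]);            % 2^3 = 8 rows for three Boolean variables
P = counts / sum(counts);                      % estimated joint; sums to 1 by construction
truthtable = [floor((0:7)'/4), mod(floor((0:7)'/2),2), mod((0:7)',2)];
disp([truthtable, P])                          % columns: A, B, C, Prob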

Density Estimation Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attribute values to a Probability: Input Attributes → Density Estimator → Probability. Copyright © Andrew W. Moore

Density Estimation – looking ahead Compare it against the two other major kinds of models: Classifier: Input Attributes → one of a few discrete values (prediction of categorical output or class). Density Estimator: Input Attributes → Probability. Regressor: Input Attributes → prediction of real-valued output. Copyright © Andrew W. Moore

Density Estimation → Classification Classifier: Input Attributes x → one of y1, …, yk (prediction of categorical output). Density Estimator: estimated joint P^(x,y) → class. To classify x: use your estimator to compute P^(x,y1), …, P^(x,yk) and return the class y* with the highest predicted probability. Binary case: predict POS if P^(x) > 0.5. Ideally the prediction is correct with probability P^(x,y*) / (P^(x,y1) + … + P^(x,yk)). Copyright © Andrew W. Moore
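
A tiny sketch of that rule, with the estimated joint stored as a table (the numbers are hypothetical):
Phat = [0.30 0.05; 0.10 0.25; 0.20 0.10];   % rows: discretized x values; columns: classes y1, y2
x = 2;                                      % the row the test instance falls in
[~, ystar] = max(Phat(x,:))                 % class with the highest estimated P(x,y)
posterior = Phat(x,:) / sum(Phat(x,:))      % the P^(x,y*)/(P^(x,y1)+...+P^(x,yk)) quantity above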

Computing with a joint probability estimate, with MATLAB

Get some data
n = 1000;   % number of simulated throws (the counts below get divided by 1000)
% which die was used first and second
dice1 = randi(3,[n,1]);
dice2 = randi(3,[n,1]);
% did 'loading' happen for rolls for die 1 and die 2
load1 = randi(2,[n,1]);
load2 = randi(2,[n,1]);
% simulate rolling the dice…
r1 = roll(dice1,load1,randi(5,[n,1]),randi(6,[n,1]));
r2 = roll(dice2,load2,randi(5,[n,1]),randi(6,[n,1]));
% append the column vectors
D = [dice1,dice2,r1,r2];

Get some data
function [ face ] = roll(d,ld,upTo5,upTo6)
  % if d==1,             face = randi(6)
  % elseif d==2 & ld==1, face = 6
  % elseif d==2 & ld==0, face = randi(5)
  % elseif d==3 & ld==1, face = 1
  % else                 face = randi(5)+1
  % end
  face = (d==1).*upTo6 + ...
         (d==2).*(ld==1)*6 + (d==2).*(ld==0).*upTo5 + ...
         (d==3).*(ld==1)*1 + (d==3).*(ld==0).*(upTo5 + 1);
end

Get some data
>> D(1:10,:)
ans =
     1     1     7     4
     2     1     1     3
     2     3     1     1
     3     1     1     2
     3     2     2     1
     1     2     4     7
     2     3     7     2
     3     3     1     2
     1     1     3     6
     …
>> imagesc(D)

Get some data
>> D(1:10,:)
ans =
     1     1     7     4
     2     1     1     3
     2     3     1     1
     3     1     1     2
     …
>> [X,I] = sort(4*D(:,1) + D(:,2));
>> S = D(I,:);
>> imagesc(S);

Get some data
>> D(1:10,:)
ans =
     1     1     7     4
     2     1     1     3
     2     3     1     1
     3     1     1     2
     …
>> [X,I] = sort(4*D(:,1) + D(:,2));
>> S = D(I,:);
>> imagesc(S);
>> D34 = D(:,3:4);
>> hist3(D34,[6,6])

Estimate a joint density
>> [H,C] = hist3(D34,[6,6]);
>> H
H =
    60    35    24    29    30    60
    42    16    14    19    14    22
    27    19    15    10    17    45
    44    19    17    18    20    29
    31    11    12     9    22    40
    51    26    44    37    17    55
>> P = H/1000
P =
    0.0600    0.0350    0.0240    0.0290    0.0300    0.0600
    0.0420    0.0160    0.0140    0.0190    0.0140    0.0220
    0.0270    0.0190    0.0150    0.0100    0.0170    0.0450
    0.0440    0.0190    0.0170    0.0180    0.0200    0.0290
    0.0310    0.0110    0.0120    0.0090    0.0220    0.0400
    0.0510    0.0260    0.0440    0.0370    0.0170    0.0550

Estimate a joint density
>> P = H/1000
P =
    0.1220    0.0680    0.0170    0.0130    0.0150    0.0940
    0.0820    0.0480    0.0110    0.0130    0.0210    0.0580
    0.0170    0.0100         0    0.0050    0.0030    0.0150
    0.0230    0.0120    0.0050    0.0050    0.0020    0.0130
    0.0160    0.0120    0.0020    0.0020    0.0030    0.0160
    0.0970    0.0590    0.0110    0.0160    0.0120    0.0820
>> SP = (H + (1/36))/1001
SP =
    0.0600    0.0350    0.0240    0.0290    0.0300    0.0600
    0.0420    0.0160    0.0140    0.0190    0.0140    0.0220
    0.0270    0.0190    0.0150    0.0100    0.0170    0.0450
    0.0440    0.0190    0.0170    0.0180    0.0200    0.0290
    0.0310    0.0110    0.0120    0.0090    0.0220    0.0400
    0.0510    0.0260    0.0440    0.0370    0.0170    0.0550

Estimate a joint density >> [I,J]=meshgrid(1:6,1:6); >> surf(I,J,SP);

Visualize a joint density >> E = D34 + randn(1000,2)*0.1; >> plot(E(:,1),E(:,2),'r*');

Visualize a joint density >> surf(I,J,SP); >> hold on >> plot3(E(:,1),E(:,2),zeros(1000,1)+0.1,'r*');

Compute with the joint density
>> sum(SP(sub2ind(size(SP),Sevens(:,1),Sevens(:,2))))
ans =
    0.1630
>> sum(SP(sub2ind(size(SP),Doubles(:,1),Doubles(:,2))))
ans =
    0.1860
(the index sets Sevens and Doubles were built first:)
>> [I,J] = find(SP);
>> IJ = [I,J]
IJ =
     1     1
     2     1
     3     1
     4     1
     5     1
     6     1
     1     2
     2     2
     3     2
     …
>> Sevens = IJ(I+J==7,:)
Sevens =
     6     1
     5     2
     4     3
     3     4
     2     5
     1     6
>> Doubles = IJ(I==J,:)
Doubles =
     1     1
     2     2
     3     3
     4     4
     5     5
     6     6

Or by counting datapoints
>> sum(SP(sub2ind(size(SP),Sevens(:,1),Sevens(:,2))))
ans =
    0.1630
>> sum(SP(sub2ind(size(SP),Doubles(:,1),Doubles(:,2))))
ans =
    0.1860
>> sum(D(:,3)==D(:,4))/1000
>> sum(D(:,3)+D(:,4)==7)/1000

Inference with the joint What is P(both dice fair | roll > 10)? It is P(both dice fair & roll > 10) / P(roll > 10):
>> sum((D(:,1)==1) & (D(:,2)==1) & (D(:,3)+D(:,4) > 10))
ans =
     9
>> sum(D(:,3)+D(:,4) > 10)
ans =
   112
>> 9/112
ans =
    0.0804
>> sum( ((D(:,1)~=1) | (D(:,2)~=1)) & (D(:,3)+D(:,4) > 10) )
ans =
   103
>> sum(D(:,3)+D(:,4) > 10)
ans =
   112
>> 103/112
ans =
    0.9196

CLASSIFICATION

One definition of machine learning The study of programs that improve their performance (P) at some task (T) based on experience (E). Different types of machine learning arise when we vary P, T, and E. One of many examples: classification learning.

Density Estimation – looking ahead Compare it against the two other major kinds of models: Classifier: Input Attributes → one of a few discrete values (prediction of categorical output or class). Density Estimator: Input Attributes → Probability. Regressor: Input Attributes → prediction of real-valued output. Copyright © Andrew W. Moore

Many domains and applications ML is the preferred method for solving a growing list of problems: speech recognition, natural language processing, robot control, vision, … Driven by: better algorithms / faster machines; increased amounts of data for many tasks; cheaper sensors; suitability for complex, hard-to-program tasks; need for user-modeling, customization to an environment, …

Classification vs Density Estimation

Classification learning Task T: input: output: Performance metric P: Experience E:

Classification learning Task T: input: a set of instances x1,…,xn an instance has a set of features, each of which has a value, usually numeric. features ~= dimensions of a vector, so we can represent an instance as a vector x=<v1,…,vd> output: a set of predictions one of a fixed set of constant values: {+1,-1} or {cancer, healthy}, or {dog,cat,walrus,…}, or … Performance metric P: Experience E:

Classification Learning (examples) – Task / Instance / Labels:
- Medical diagnosis. Instance: patient record (blood pressure diastolic, blood pressure systolic, age, sex (0 or 1), BMI, cholesterol). Labels: {-1,+1} = low vs. high risk of heart disease.
- Finding company names in text. Instance: a word in context (capitalized (0,1), word-after-this-equals-Inc, bigram-before-this-equals-acquired-by, …). Labels: {first, later, outside} = first word in name, second or later word in name, not in a name.
- Brain-human-interface. Instance: brain state (neural activity over the last 100ms of 96 neurons). Labels: {n,s,e,w,ne,se,nw,sw} = direction you intend to move the cursor.
- Image recognition. Instance: image (1920*1080 pixels, each with a code for color). Labels: {0,1} = no walrus, walrus.

Classification learning Task T: input: a set of instances x1,…,xn; output: a set of predictions. Performance metric P: Prob(wrong prediction) on an example from D (we care about performance on the distribution, not the training data). Experience E: a set of labeled examples (x,y) where y is the true label for x; ideally, examples should be sampled from some fixed distribution D.

Classification Learning (examples) – Task / Instance / Labels / Getting data:
- Medical diagnosis. Instance: patient record (lab readings). Labels: risk of heart disease. Getting data: wait and look for heart disease.
- Finding company names in text. Instance: a word in context (capitalized, nearby words, ...). Labels: {first, later, outside}. Getting data: text with company names highlighted (say by hand).
- Brain-human-interface. Instance: brain state (neural activity over the last 100ms of 96 neurons). Labels: {n,s,e,w,ne,se,nw,sw}. Getting data: recordings of someone doing known tasks (so direction can be inferred).
- Image recognition. Instance: image (pixels). Labels: no walrus, walrus. Getting data: hand-labeled images. Can you get examples some other way?

Google image search: “walrus”

Image search: “flying bumblebee” Could I use these as negative examples?

Nearest Neighbor Learning: Overview

Classification learning Task T: input: a set of instances x1,…,xn drawn from D; output: a set of predictions. Performance metric P: Prob(wrong prediction) on an example from D. Experience E: a set of labeled examples (x,y) where y is the true label for x. How do we make predictions?

k-nearest neighbor learning Given a test example x: find the k training-set examples (x1,y1),…,(xk,yk) that are closest to x, and predict the most frequent label in that set.

Breaking it down: To train: save the data. (Very fast! …though you might build some indices….) To test: for each test example x, find the k training-set examples (x1,y1),…,(xk,yk) that are closest to x, and predict the most frequent label in that set. (Prediction is relatively slow compared to most other classifiers.)
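
A minimal MATLAB sketch of this procedure (variable names are assumptions; Xtrain is n x d, ytrain is n x 1):
function yhat = knn_predict(Xtrain, ytrain, xtest, k)
  % "training" was just saving Xtrain and ytrain
  d = sqrt(sum(bsxfun(@minus, Xtrain, xtest).^2, 2));  % Euclidean distance to every training point
  [~, idx] = sort(d);                                  % this linear scan is why prediction is slow
  yhat = mode(ytrain(idx(1:k)));                       % most frequent label among the k closest
end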

k-nearest neighbor learning: no obvious “decision boundary”. If examples are vectors, instances live in some (high-dimensional) space, and learning splits the space up into areas where we predict setosa or non-setosa.

What is the decision boundary for k-NN? Let's look at k=1: the Voronoi Diagram. Each cell Ci is the set of all points that are closest to a particular example xi.

The decision boundary (Voronoi Diagram)

Effect of k on decision boundary

Overfitting and k-NN Large k ⇒ a smooth shape for the decision boundary. Small k ⇒ a complicated shape. What's the best value of k? (Plot: error/loss on the training set D and error/loss on an unseen test set Dtest, as k ranges from large to small.)
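
One way to answer that question: score each candidate k on a held-out set. A sketch (it assumes training data Xtrain/ytrain, a validation split Xval/yval, and the knn_predict sketch above):
ks = 1:2:25;
err = zeros(size(ks));
for i = 1:numel(ks)
  wrong = 0;
  for j = 1:size(Xval,1)
    wrong = wrong + (knn_predict(Xtrain, ytrain, Xval(j,:), ks(i)) ~= yval(j));
  end
  err(i) = wrong / size(Xval,1);     % estimated error/loss on unseen data
end
[~, best] = min(err);
kbest = ks(best)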

Some common variants Distance metrics: Euclidean distance: ||x1 - x2||. Cosine distance: 1 - <x1,x2>/(||x1||*||x2||), which is in [0,1] for non-negative feature vectors. Weighted nearest neighbor (def 1): instead of predicting the most frequent y among the k nearest neighbors, predict the label with the largest total weight, where each neighbor's vote is weighted (e.g. by the inverse of its distance to x).
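
A sketch of such a weighted vote (the 1/d^2 weighting is one common choice, not necessarily the one the slides intend):
function yhat = knn_weighted(Xtrain, ytrain, xtest, k)
  d = sqrt(sum(bsxfun(@minus, Xtrain, xtest).^2, 2));
  [ds, idx] = sort(d);
  w = 1 ./ (ds(1:k).^2 + eps);          % closer neighbors get larger votes
  labels = ytrain(idx(1:k));
  classes = unique(labels);
  score = zeros(numel(classes), 1);
  for c = 1:numel(classes)
    score(c) = sum(w(labels == classes(c)));
  end
  [~, cbest] = max(score);
  yhat = classes(cbest);
end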

Some common variants Weighted nearest neighbor (def 2): instead of plain Euclidean distance, use a weighted version, e.g. d(x,x') = sqrt(sum over features j of wj * (xj - x'j)^2). Weights might be based on info gain, ….