1 Instance-Based & Bayesian Learning
Chapter 20.1-20.4
Some material adapted from lecture notes by Lise Getoor and Ron Parr

2 Overview
– k-nearest neighbor
– Naïve Bayes
– Learning Bayes networks

3 Instance-Based Learning
Decision trees are a kind of model-based learning:
– We take the training instances and use them to build a model of the mapping from inputs to outputs
– This model (e.g., a decision tree) can be used to make predictions on new (test) instances
Another option is to do instance-based learning:
– Save all (or some subset) of the instances
– Given a test instance, use some of the stored instances in some way to make a prediction
Instance-based methods:
– Nearest neighbor and its variants
– Support vector machines

4 Nearest Neighbor
Vanilla "Nearest Neighbor":
– Save all training instances Xi = (Ci, Fi) in T
– Given a new test instance Y, find the instance Xj that is closest to Y
– Predict class Cj
What does "closest" mean?
– Usually: Euclidean distance in feature space
– Alternatively: Manhattan distance, or any other distance metric
What if the data is noisy?
– Generalize to k-nearest neighbor
– Find the k closest training instances to Y
– Use majority voting to predict the class label of Y
– Better yet: use weighted (by distance) voting to predict the class label
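The procedure above translates almost directly into code. Below is a minimal Python sketch of k-nearest neighbor with optional distance-weighted voting; the Euclidean metric, the tiny temperature/humidity data set, and all names are illustrative assumptions, not part of the slides.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, query, k=3, weighted=True):
    """Predict the class of `query` from `training`, a list of (features, label) pairs.

    With weighted=True, each of the k nearest neighbors votes with weight
    1/distance (distance-weighted voting); otherwise simple majority voting."""
    # Sort stored instances by distance to the query and keep the k closest.
    neighbors = sorted(training, key=lambda inst: euclidean(inst[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Tiny illustration: (temperature, humidity) -> run outside (+) or inside (-)
data = [((80, 20), '+'), ((75, 30), '+'), ((60, 90), '-'), ((65, 85), '-')]
print(knn_predict(data, (70, 40), k=3))   # likely '+'
```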

5 Nearest Neighbor Example: Run Outside (+) or Inside (-)
[Scatter plot of training instances over Humidity and Temperature: the data are noisy and not linearly separable, a decision tree boundary fits poorly, and several unlabeled test points (?) are shown.]

6 Naïve Bayes
Use Bayesian modeling
Make the simplest possible independence assumption:
– Each attribute is independent of the values of the other attributes, given the class variable
– In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not)

7 Bayesian Formulation
p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
Assume that each feature Fi is conditionally independent of the other features given the class C. Then:
p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
We can estimate each of these conditional probabilities from the observed counts in the training data:
p(Fi | C) = N(Fi ∧ C) / N(C)
– One subtlety of using the algorithm in practice: when your estimated probabilities are zero, ugly things happen
– The fix: add one to every count (aka "Laplacian smoothing"; they have a different name for everything!)
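As a concrete illustration of the formulation above, here is a minimal Python sketch of a naïve Bayes classifier over discrete features with add-one (Laplacian) smoothing; the restaurant-flavored toy data and all function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, class_label) pairs.
    Collects the counts N(C) and N(Fi ∧ C) needed for the estimates above."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (feature, class) -> Counter over that feature's values
    feature_values = defaultdict(set)     # feature -> set of values seen in the data
    for features, label in examples:
        for f, v in features.items():
            value_counts[(f, label)][v] += 1
            feature_values[f].add(v)
    return class_counts, value_counts, feature_values

def nb_predict(features, class_counts, value_counts, feature_values):
    """Return argmax_C p(C) * prod_i p(Fi | C), computed in log space.
    Add-one smoothing keeps unseen feature values from zeroing out a class."""
    total = sum(class_counts.values())
    best_class, best_score = None, float('-inf')
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)                             # log p(C)
        for f, v in features.items():
            counts = value_counts[(f, c)]
            p = (counts[v] + 1) / (n_c + len(feature_values[f]))  # smoothed p(Fi | C)
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy restaurant-style data (invented): decide whether to Wait.
data = [({'Patrons': 'Full', 'Rainy': 'yes'}, 'wait'),
        ({'Patrons': 'Some', 'Rainy': 'no'},  'wait'),
        ({'Patrons': 'None', 'Rainy': 'no'},  'leave')]
model = train_naive_bayes(data)
print(nb_predict({'Patrons': 'Some', 'Rainy': 'yes'}, *model))
```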

8 Naïve Bayes: Example
p(Wait | Cuisine, Patrons, Rainy?) = α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)

9 Naïve Bayes: Analysis
Naïve Bayes is amazingly easy to implement (once you understand the bit of math behind it)
Remarkably, naïve Bayes can outperform many much more complex algorithms; it's a baseline that should pretty much always be used for comparison
Naïve Bayes can't capture interdependencies between variables (obviously); for that, we need Bayes nets!

10 Learning Bayesian Networks

11 Bayesian learning: Bayes' rule
Given some model space (set of hypotheses hi) and evidence (data D):
– P(hi | D) = α P(D | hi) P(hi)
We assume that observations are independent of each other, given a model (hypothesis), so:
– P(hi | D) = α Πj P(dj | hi) P(hi)
To predict the value of some unknown quantity X (e.g., the class label for a future observation):
– P(X | D) = Σi P(X | D, hi) P(hi | D) = Σi P(X | hi) P(hi | D)
These are equal by our independence assumption

12 Bayesian learning
Apply Bayesian learning in three basic ways:
– BMA (Bayesian Model Averaging): Don't just choose one hypothesis; make predictions based on a weighted average of all hypotheses (or some set of the best ones)
– MAP (Maximum A Posteriori) hypothesis: Choose the hypothesis with the highest a posteriori probability, given the data
– MLE (Maximum Likelihood Estimate): Assume that all hypotheses are equally likely a priori; then the best hypothesis is just the one that maximizes the likelihood (i.e., the probability of the data given the hypothesis)
MDL (Minimum Description Length) principle: Use some encoding to model the complexity of the hypothesis and the fit of the data to the hypothesis, then minimize the overall description length of hi + D
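To make the three approaches concrete, here is a small numeric sketch contrasting BMA, MAP, and MLE on a toy hypothesis space; the priors, likelihoods, and observation count are invented purely for illustration.

```python
# Toy model space: three hypotheses with priors P(h_i); under each hypothesis
# every observation independently comes up "positive" with the given probability.
priors     = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}
p_positive = {'h1': 0.2, 'h2': 0.5, 'h3': 0.8}   # P(d_j = positive | h_i)
n_positive = 5                                   # data D: five positive observations in a row

# Posterior over hypotheses: P(h | D) ∝ P(h) * Π_j P(d_j | h)
unnormalized = {h: priors[h] * p_positive[h] ** n_positive for h in priors}
z = sum(unnormalized.values())
posterior = {h: v / z for h, v in unnormalized.items()}

# Predict whether the next observation X is positive:
bma = sum(posterior[h] * p_positive[h] for h in priors)   # BMA: average over all hypotheses
map_h = max(posterior, key=posterior.get)                 # MAP: single most probable hypothesis
mle_h = max(p_positive, key=lambda h: p_positive[h] ** n_positive)  # MLE: ignores the prior
print(posterior)
print('BMA prediction:', bma)
print('MAP prediction:', p_positive[map_h], 'using', map_h)
print('MLE prediction:', p_positive[mle_h], 'using', mle_h)
```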

13 Learning Bayesian networks
Given a training set D, find the network B that best matches D:
– model selection
– parameter estimation
[Diagram: an Inducer takes the training data D and outputs a Bayes net over the variables A, B, C, E]

14 Parameter estimation
Assume known structure
Goal: estimate BN parameters θ
– entries in local probability models, P(X | Parents(X))
A parameterization θ is good if it is likely to generate the observed data:
L(θ : D) = P(D | θ) = Πm P(x[m] | θ)   (i.i.d. samples)
Maximum Likelihood Estimation (MLE) Principle: Choose θ* so as to maximize L

15 Parameter estimation II
The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter
The MLE (maximum likelihood estimate) solution:
– for each value x of a node X and each instantiation u of Parents(X):
θ*_{x|u} = N(x, u) / N(u)   (the counts N(x, u) and N(u) are the sufficient statistics)
– Just need to collect the counts for every combination of parents and children observed in the data
– MLE is equivalent to an assumption of a uniform prior over parameter values

16 Sufficient statistics: Example
Why are the counts sufficient?
[Network diagram over Earthquake, Burglary, Alarm, Moon-phase, and Light-level; Earthquake and Burglary are parents of Alarm]
θ*_{A|E,B} = N(A, E, B) / N(E, B)
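A minimal Python sketch of this counting procedure, assuming complete discrete data and a known structure; the record format and the tiny alarm-style data set are illustrative assumptions. Each CPT entry is just N(x, u)/N(u), exactly as in θ*_{A|E,B} = N(A, E, B)/N(E, B).

```python
from collections import Counter, defaultdict

def estimate_cpts(records, structure):
    """records: list of dicts mapping variable -> observed value (complete data).
    structure: dict mapping variable -> tuple of its parents.
    Returns cpts[var][parent_values][value] = N(var value, parent_values) / N(parent_values)."""
    joint = defaultdict(Counter)    # (var, parent instantiation) -> Counter over var's values
    for r in records:
        for var, parents in structure.items():
            parent_values = tuple(r[p] for p in parents)
            joint[(var, parent_values)][r[var]] += 1
    cpts = defaultdict(dict)
    for (var, parent_values), counts in joint.items():
        n = sum(counts.values())
        cpts[var][parent_values] = {v: c / n for v, c in counts.items()}
    return cpts

# Alarm-network fragment from the slides: Alarm depends on Earthquake and Burglary.
structure = {'E': (), 'B': (), 'A': ('E', 'B')}
records = [{'E': 0, 'B': 1, 'A': 1}, {'E': 0, 'B': 0, 'A': 0},
           {'E': 1, 'B': 0, 'A': 1}, {'E': 0, 'B': 0, 'A': 0}]
cpts = estimate_cpts(records, structure)
print(cpts['A'][(0, 1)])   # θ*_{A | E=0, B=1} estimated from counts
```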

17 Model selection
Goal: Select the best network structure, given the data
Input:
– Training data
– Scoring function
Output:
– A network that maximizes the score

18 Structure selection: Scoring
Bayesian: prior over parameters and structure
– get balance between model complexity and fit to data as a byproduct
Score(G:D) = log P(G|D) ∝ log [P(D|G) P(G)]   (P(D|G) is the marginal likelihood, P(G) the structure prior)
Marginal likelihood just comes from our parameter estimates
Prior on structure can be any measure we want; typically a function of the network complexity
Same key property: Decomposability
Score(structure) = Σi Score(family of Xi)
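Below is a hedged sketch of a decomposable score. Instead of the exact Bayesian marginal likelihood, it uses a BIC-style approximation (each family's maximized log-likelihood minus a complexity penalty), which keeps the key property that the structure score is a sum of per-family scores; the function names and the penalty form are choices of this sketch, not of the slides.

```python
import math
from collections import Counter, defaultdict

def family_score(var, parents, records):
    """BIC-style score for one family: maximized log-likelihood of `var` given
    `parents`, minus (log N / 2) times the number of free parameters."""
    n = len(records)
    counts = defaultdict(Counter)              # parent instantiation -> counts of var's values
    for r in records:
        counts[tuple(r[p] for p in parents)][r[var]] += 1
    loglik = 0.0
    for c in counts.values():
        total = sum(c.values())
        for k in c.values():
            loglik += k * math.log(k / total)  # MLE plug-in log-likelihood
    var_card = len({r[var] for r in records})
    parent_card = 1
    for p in parents:
        parent_card *= len({r[p] for r in records})
    free_params = (var_card - 1) * parent_card
    return loglik - 0.5 * math.log(n) * free_params

def structure_score(structure, records):
    """Decomposability: Score(structure) = Σ_i Score(family of X_i)."""
    return sum(family_score(v, ps, records) for v, ps in structure.items())
```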

19 Heuristic search
[Diagrams of four candidate networks over B, E, A, C, connected by local moves:]
– Add E → C: Δscore(C)
– Delete E → A: Δscore(A)
– Reverse E → A: Δscore(A, E)

20 Exploiting decomposability
[Same candidate networks and local moves as the previous slide: Add E → C (Δscore(C)), Delete E → A (Δscore(A)), Reverse E → A (Δscore(A, E))]
To recompute scores, only need to re-score families that changed in the last move
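A sketch of greedy hill-climbing that exploits decomposability: each candidate move (here only single-edge additions and deletions, omitting reversals for brevity) changes one child's parent set, so only that family is re-scored. It expects a per-family scoring function such as the BIC-style `family_score` sketched above; the acyclicity check and all names are this sketch's own assumptions.

```python
def greedy_structure_search(variables, records, family_score):
    """Greedy hill-climbing over single-edge additions and deletions.
    Only the changed child's family is re-scored for each candidate move."""
    parents = {v: set() for v in variables}
    scores = {v: family_score(v, tuple(parents[v]), records) for v in variables}

    def creates_cycle(child, new_parent):
        # Adding new_parent -> child creates a cycle iff child is already an
        # ancestor of new_parent; walk up the current parent sets to check.
        stack, seen = [new_parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    improved = True
    while improved:
        improved = False
        best = None    # (delta, child, new parent set, new family score)
        for child in variables:
            for other in variables:
                if other == child:
                    continue
                if other in parents[child]:
                    candidate = parents[child] - {other}       # Delete other -> child
                elif not creates_cycle(child, other):
                    candidate = parents[child] | {other}       # Add other -> child
                else:
                    continue
                new_score = family_score(child, tuple(candidate), records)
                delta = new_score - scores[child]              # Δscore(child) only
                if delta > 1e-9 and (best is None or delta > best[0]):
                    best = (delta, child, candidate, new_score)
        if best is not None:
            _, child, candidate, new_score = best
            parents[child], scores[child] = candidate, new_score
            improved = True
    return parents
```

Called as, e.g., greedy_structure_search(['B', 'E', 'A', 'C'], records, family_score), it returns a parent set for each variable.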

21 Variations on a theme
– Known structure, fully observable: only need to do parameter estimation
– Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
– Known structure, missing values: use expectation maximization (EM) to estimate parameters
– Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
– Unknown structure, hidden variables: too hard to solve!

22 Handling missing data
Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
Should we throw that data away??
Idea: Guess the missing values based on the other data
[Network diagram over Earthquake, Burglary, Alarm, Moon-phase, and Light-level; Earthquake and Burglary are parents of Alarm]

23 EM (expectation maximization)
– Guess probabilities for nodes with missing values (e.g., based on other observations)
– Compute the probability distribution over the missing values, given our guess
– Update the probabilities based on the guessed values
– Repeat until convergence
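Here is a minimal sketch of that loop for the simplest possible case: a two-node network B → A where A is always observed and B is sometimes missing. The parameters, starting guesses, and data are invented for illustration; the point is the alternation between filling in expected values for the missing entries (E-step) and re-estimating the CPTs from those expected, fractional counts (M-step).

```python
def em_two_node(records, iterations=50):
    """EM for a tiny net B -> A. Each record is (b, a) with b possibly None (missing).
    Parameters: p_b = P(B=1) and p_a_given_b[b] = P(A=1 | B=b)."""
    p_b = 0.5                              # arbitrary starting guesses
    p_a_given_b = {0: 0.3, 1: 0.7}

    for _ in range(iterations):
        # E-step: expected value of B for every record, given current parameters.
        expected_b = []
        for b, a in records:
            if b is not None:
                expected_b.append(float(b))
            else:
                like = lambda bv: p_a_given_b[bv] if a == 1 else 1.0 - p_a_given_b[bv]
                num = p_b * like(1)        # P(B=1) P(a | B=1)
                den = num + (1.0 - p_b) * like(0)
                expected_b.append(num / den)

        # M-step: re-estimate the parameters from the expected (fractional) counts.
        p_b = sum(expected_b) / len(records)
        n_b1 = sum(expected_b)
        n_b0 = len(records) - n_b1
        n_a1_b1 = sum(eb for (b, a), eb in zip(records, expected_b) if a == 1)
        n_a1_b0 = sum(1.0 - eb for (b, a), eb in zip(records, expected_b) if a == 1)
        p_a_given_b = {1: n_a1_b1 / n_b1, 0: n_a1_b0 / n_b0}

    return p_b, p_a_given_b

# (Burglary, Alarm) observations; None marks a missing Burglary value.
data = [(1, 1), (0, 0), (None, 1), (0, 0), (None, 0), (1, 1)]
print(em_two_node(data))
```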

24 EM example
Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
We estimate the CPTs based on the rest of the data
We then estimate P(Burglary) for November 27 from those CPTs
Now we recompute the CPTs as if that estimated value had been observed
Repeat until convergence!
[Network diagram: Earthquake and Burglary are parents of Alarm]