Class #25 – Monday, November 29

Class #25 – Monday, November 29 CMSC 671 Fall 2010

Instance-Based & Bayesian Learning Chapter 18.8, 20 (parts) Some material adapted from lecture notes by Lise Getoor and Ron Parr

Today’s class: k-nearest neighbor; naïve Bayes; learning Bayes networks; clustering

Instance-Based Learning K-nearest neighbor

IBL
Decision trees are a kind of model-based learning: we take the training instances and use them to build a model of the mapping from inputs to outputs. This model (e.g., a decision tree) can then be used to make predictions on new (test) instances.
Another option is to do instance-based learning:
- Save all (or some subset) of the instances
- Given a test instance, use some of the stored instances in some way to make a prediction
Instance-based methods:
- Nearest neighbor and its variants (today)
- Support vector machines (next time)

Nearest Neighbor
Vanilla “nearest neighbor”:
- Save all training instances Xi = (Ci, Fi) in T
- Given a new test instance Y, find the instance Xj that is closest to Y
- Predict class Cj
What does “closest” mean?
- Usually: Euclidean distance in feature space
- Alternatively: Manhattan distance, or any other distance metric
What if the data is noisy?
- Generalize to k-nearest neighbor
- Find the k closest training instances to Y
- Use majority voting to predict the class label of Y
- Better yet: use weighted (by distance) voting to predict the class label
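Below is a minimal sketch of (weighted) k-nearest neighbor prediction, assuming instances are NumPy feature vectors with discrete class labels; the function name, signature, and the inverse-distance weighting are illustrative choices, not from the slides.

```python
# Minimal k-NN sketch; names, signature, and the 1/distance weighting are illustrative choices.
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, query, k=3, weighted=False):
    """Predict a class label for one query instance using k-nearest neighbor."""
    # Euclidean distance from the query to every stored training instance
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest training instances
    votes = defaultdict(float)
    for i in nearest:
        # Unweighted: each neighbor casts one vote.
        # Weighted: closer neighbors count more (inverse distance, guarding against zero).
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)       # majority (or weighted) vote

# Example use: X_train is an (n, d) array, y_train a length-n array of labels.
# label = knn_predict(X_train, y_train, np.array([58.0, 35.0]), k=5, weighted=True)
```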

Nearest Neighbor Example: Run Outside (+) or Inside (−)
[Figure: scatter plot of labeled (+/−) and unlabeled (?) instances over Temperature (x-axis, up to 100) and Humidity (y-axis, up to 100); the data is noisy and not linearly separable, and a decision tree boundary fits it poorly]

Naïve Bayes

Naïve Bayes
Use Bayesian modeling, and make the simplest possible independence assumption: each attribute is independent of the values of the other attributes, given the class variable.
In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not).

Bayesian Formulation
p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
Assume that each feature Fi is conditionally independent of the other features given the class C. Then:
p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
We can estimate each of these conditional probabilities from the observed counts in the training data:
p(Fi | C) = N(Fi ∧ C) / N(C)
One subtlety of using the algorithm in practice: when your estimated probabilities are zero, ugly things happen. The fix: add one to every count (aka “Laplacian smoothing”—they have a different name for everything!)
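A minimal count-based sketch of this formulation with add-one smoothing, assuming discrete features given as tuples and hashable class labels; the class name and methods are illustrative, not from the slides.

```python
# Count-based naïve Bayes with add-one ("Laplacian") smoothing.
# Illustrative sketch; the feature/label encoding is an assumption.
from collections import Counter, defaultdict
import math

class NaiveBayes:
    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)                  # N(C)
        self.feat_counts = defaultdict(Counter)         # N(Fi = v ∧ C)
        self.feat_values = defaultdict(set)             # observed values of each feature
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.feat_counts[(i, c)][v] += 1
                self.feat_values[i].add(v)
        return self

    def predict(self, xs):
        best, best_logp = None, -math.inf
        for c in self.class_counts:
            # log p(C) + sum_i log p(Fi | C), with one added to every count
            logp = math.log(self.class_counts[c] / self.n)
            for i, v in enumerate(xs):
                num = self.feat_counts[(i, c)][v] + 1
                den = self.class_counts[c] + len(self.feat_values[i])
                logp += math.log(num / den)
            if logp > best_logp:
                best, best_logp = c, logp
        return best
```

For the restaurant example on the next slide, each xs would be a (Cuisine, Patrons, Rainy?) tuple and the class label the Wait decision.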

Naive Bayes: Example p(Wait | Cuisine, Patrons, Rainy?) = α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)

Naive Bayes: Analysis
Naive Bayes is amazingly easy to implement (once you understand the bit of math behind it).
Remarkably, naive Bayes can outperform many much more complex algorithms—it’s a baseline that should pretty much always be used for comparison.
Naive Bayes can’t capture interdependencies between variables (obviously)—for that, we need Bayes nets!

Learning Bayesian Networks

Bayesian learning: Bayes’ rule
Given some model space (set of hypotheses hi) and evidence (data D):
P(hi|D) = α P(D|hi) P(hi)
We assume that observations are independent of each other, given a model (hypothesis), so:
P(hi|D) = α Πj P(dj|hi) P(hi)
To predict the value of some unknown quantity X (e.g., the class label for a future observation):
P(X|D) = Σi P(X|D, hi) P(hi|D) = Σi P(X|hi) P(hi|D)
These are equal by our independence assumption.

Bayesian learning
We can apply Bayesian learning in several basic ways:
- BMA (Bayesian Model Averaging): Don’t just choose one hypothesis; instead, make predictions based on the weighted average of all hypotheses (or some set of best hypotheses)
- MAP (Maximum A Posteriori) hypothesis: Choose the hypothesis with the highest a posteriori probability, given the data
- MLE (Maximum Likelihood Estimate): Assume that all hypotheses are equally likely a priori; then the best hypothesis is just the one that maximizes the likelihood (i.e., the probability of the data given the hypothesis)
- MDL (Minimum Description Length) principle: Use some encoding to model the complexity of the hypothesis and the fit of the data to the hypothesis, then minimize the overall description length of hi + D
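As a quick, hypothetical illustration of how MAP and MLE can disagree (not from the slides), here is a coin-flip sketch with two hypotheses for the heads probability and a prior that favors the fair coin:

```python
# Hypothetical MAP vs. MLE comparison on a tiny coin-flip model space (illustrative).
hypotheses = {"fair": 0.5, "biased": 0.9}      # candidate values of P(heads)
prior      = {"fair": 0.8, "biased": 0.2}      # prior favoring the fair coin
data = ["H", "H", "H", "T", "H"]               # 4 heads, 1 tail

def likelihood(theta, data):
    p = 1.0
    for d in data:
        p *= theta if d == "H" else (1 - theta)
    return p

# MLE: ignore the prior, pick the hypothesis maximizing P(D | h)
mle = max(hypotheses, key=lambda h: likelihood(hypotheses[h], data))

# MAP: weight the likelihood by the prior, pick the hypothesis maximizing P(D | h) P(h)
map_h = max(hypotheses, key=lambda h: likelihood(hypotheses[h], data) * prior[h])

print(mle, map_h)   # MLE tends to pick "biased" here; MAP depends on how strong the prior is
```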

Learning Bayesian networks
Given a training set D, find the network B that best matches D:
- model selection
- parameter estimation
[Figure: training data D is fed to an Inducer, which outputs a network over variables A, B, C, E]

Parameter estimation
Assume known structure. Goal: estimate BN parameters θ, i.e., the entries in the local probability models P(X | Parents(X)).
A parameterization θ is good if it is likely to generate the observed data:
L(θ : D) = P(D | θ) = Πm P(x[m] | θ)   (i.i.d. samples)
Maximum Likelihood Estimation (MLE) principle: choose θ* so as to maximize L.

Parameter estimation II
The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter.
The MLE (maximum likelihood estimate) solution: for each value x of a node X and each instantiation u of Parents(X),
θ*x|u = N(x, u) / N(u)
where the counts N(x, u) and N(u) are the sufficient statistics. Just need to collect the counts for every combination of parents and children observed in the data.
MLE is equivalent to an assumption of a uniform prior over parameter values.

Sufficient statistics: Example
Why are the counts sufficient?
[Figure: the network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm, with Earthquake and Burglary as the parents of Alarm]
θ*A|E,B = N(A, E, B) / N(E, B)
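A minimal sketch of collecting these sufficient statistics and turning them into MLE estimates for P(Alarm | Earthquake, Burglary); the toy data layout (a list of dicts of boolean observations) is an assumption for illustration only.

```python
# MLE estimation of the CPT P(Alarm | Earthquake, Burglary) from fully observed data.
# Illustrative sketch; the toy data and its layout (list of dicts) are assumptions.
from collections import Counter

data = [
    {"Earthquake": False, "Burglary": True,  "Alarm": True},
    {"Earthquake": False, "Burglary": False, "Alarm": False},
    {"Earthquake": True,  "Burglary": False, "Alarm": True},
    {"Earthquake": False, "Burglary": True,  "Alarm": True},
]

n_aeb = Counter()   # N(A, E, B): counts of (Alarm, Earthquake, Burglary)
n_eb  = Counter()   # N(E, B):    counts of the parent instantiation (Earthquake, Burglary)
for d in data:
    n_aeb[(d["Alarm"], d["Earthquake"], d["Burglary"])] += 1
    n_eb[(d["Earthquake"], d["Burglary"])] += 1

# theta*_{A=a | E=e, B=b} = N(a, e, b) / N(e, b)
def theta(a, e, b):
    return n_aeb[(a, e, b)] / n_eb[(e, b)] if n_eb[(e, b)] else None

print(theta(True, False, True))   # estimated P(Alarm=True | Earthquake=False, Burglary=True)
```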

Model selection
Goal: Select the best network structure, given the data
Input: Training data; a scoring function
Output: A network that maximizes the score

Structure selection: Scoring
Bayesian: prior over parameters and structure; we get a balance between model complexity and fit to data as a byproduct.
Score(G : D) = log P(G|D) ∝ log [P(D|G) P(G)]
where P(D|G) is the marginal likelihood and P(G) is the prior on structure. The marginal likelihood just comes from our parameter estimates; the prior on structure can be any measure we want, typically a function of the network complexity.
Same key property: decomposability
Score(structure) = Σi Score(family of Xi)
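A minimal sketch of a decomposable score, here a per-family log-likelihood with a simple BIC-style complexity penalty; the score choice, the structure representation (a dict mapping each variable to its parent list), and the function names are illustrative assumptions, not the slides' scoring function.

```python
# Decomposable structure score: total score = sum of per-family scores.
# Illustrative sketch; the BIC-style penalty and the data layout are assumptions.
import math
from collections import Counter

def family_score(X, parents, data):
    """Score one family (node X with its parent set) on fully observed data."""
    n_xu, n_u, vals_x = Counter(), Counter(), set()
    for d in data:
        u = tuple(d[p] for p in parents)
        n_xu[(d[X], u)] += 1
        n_u[u] += 1
        vals_x.add(d[X])
    # Log-likelihood of the data under the MLE parameters for this family
    loglik = sum(c * math.log(c / n_u[u]) for (x, u), c in n_xu.items())
    # BIC-style penalty: number of free parameters times log(N)/2
    n_params = (len(vals_x) - 1) * max(len(n_u), 1)
    return loglik - 0.5 * n_params * math.log(len(data))

def structure_score(structure, data):
    """Decomposability: Score(structure) = sum over families of Score(family)."""
    return sum(family_score(X, parents, data) for X, parents in structure.items())
```

A structure here might look like {"Alarm": ["Earthquake", "Burglary"], "Earthquake": [], "Burglary": []}; because the total is a sum of family scores, a search move that changes one edge only requires re-scoring the affected families, which is the decomposability exploited two slides below.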

Heuristic search
[Figure: local search over structures on variables A, B, C, E. Candidate moves from the current network: Add E→C (Δscore(C)), Reverse E→A (Δscore(A, E)), Delete E→A (Δscore(A))]

Exploiting decomposability
To recompute scores, only need to re-score families that changed in the last move.
[Figure: the same candidate moves (Add E→C, Reverse E→A, Delete E→A), each annotated with the Δscore of only the families affected by that move]

Variations on a theme
- Known structure, fully observable: only need to do parameter estimation
- Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
- Known structure, missing values: use expectation maximization (EM) to estimate parameters
- Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
- Unknown structure, hidden variables: too hard to solve!

Handling missing data
Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary. Should we throw that data away??
Idea: Guess the missing values based on the other data.
[Figure: the network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm]

EM (expectation maximization)
- Guess probabilities for nodes with missing values (e.g., based on other observations)
- Compute the probability distribution over the missing values, given our guess
- Update the probabilities based on the guessed values
- Repeat until convergence

EM example
Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27.
- We estimate the CPTs based on the rest of the data
- We then estimate P(Burglary) for November 27 from those CPTs
- Now we recompute the CPTs as if that estimated value had been observed
- Repeat until convergence!
[Figure: network fragment with Earthquake and Burglary as the parents of Alarm]
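A minimal EM sketch in the spirit of this example, simplified to a two-node Burglary → Alarm model so it stays short; the toy data, the initialization, and the two-node model itself are illustrative assumptions, not the slides' network.

```python
# EM sketch for a simplified Burglary -> Alarm model with some Burglary values missing.
# Toy data, initialization, and model are illustrative assumptions.
data = [  # (alarm, burglary); None means burglary was not observed
    (True, True), (True, None), (False, False), (False, None), (True, False), (False, False),
]

p_b = 0.5                                 # P(Burglary = True), initial guess
p_a = {True: 0.5, False: 0.5}             # P(Alarm = True | Burglary = b), initial guess

for _ in range(50):                       # repeat until (approximate) convergence
    # E-step: expected "soft count" of Burglary=True for each record
    w = []
    for a, b in data:
        if b is not None:
            w.append(1.0 if b else 0.0)
        else:
            lt = p_b * (p_a[True] if a else 1 - p_a[True])          # P(B=T, A=a)
            lf = (1 - p_b) * (p_a[False] if a else 1 - p_a[False])  # P(B=F, A=a)
            w.append(lt / (lt + lf))                                 # P(B=T | A=a)
    # M-step: re-estimate the parameters from the expected counts
    n = len(data)
    p_b = sum(w) / n
    num_t = sum(wi for (a, _), wi in zip(data, w) if a)              # E[N(A=T, B=T)]
    num_f = sum((1 - wi) for (a, _), wi in zip(data, w) if a)        # E[N(A=T, B=F)]
    p_a = {True: num_t / sum(w), False: num_f / (n - sum(w))}

print(p_b, p_a)
```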

Unsupervised Learning: Clustering

Unsupervised learning
Learn without a “supervisor” who labels instances:
- Clustering
- Scientific discovery
- Pattern discovery
- Associative learning
Clustering: Given a set of instances without labels, partition them such that each instance is:
- similar to other instances in its partition (intra-cluster similarity)
- dissimilar from instances in other partitions (inter-cluster dissimilarity)

Clustering techniques
- Agglomerative clustering: single-link, complete-link, average-link
- Partitional clustering: k-means, spectral clustering

Agglomerative clustering
- Start with each instance in a cluster by itself
- Repeatedly combine pairs of clusters until some stopping criterion is reached (or until one “super-cluster” with substructure is found)
- Often used for non-fully-connected data sets (e.g., clustering in a social network)
- Single-link: At each step, combine the two clusters with the smallest minimum distance between any pair of instances in the two clusters (i.e., find the shortest “edge” between the two clusters)
- Average-link: Combine the two clusters with the smallest average distance between all pairs of instances
- Complete-link: Combine the two clusters with the smallest maximum distance between any pair of instances
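A naive O(n³) sketch of single-link agglomerative clustering that stops when a target number of clusters remains; the function name and stopping criterion are illustrative assumptions.

```python
# Naive single-link agglomerative clustering (illustrative O(n^3) sketch).
import numpy as np

def single_link(X, n_clusters):
    clusters = [[i] for i in range(len(X))]            # start: each instance is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: smallest distance between any pair across the two clusters
                d = min(np.linalg.norm(X[i] - X[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])                 # merge the closest pair of clusters
        del clusters[b]
    return clusters                                     # lists of instance indices
```

Swapping min for max gives complete-link; swapping it for a mean over all cross-cluster pairs gives average-link.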

k-Means
Partitional: Start with all instances in a set, and find the “best” partition.
k-Means basic idea: use expectation maximization to find the best clusters.
Objective function: Minimize the within-cluster sum of squared distances.
- Initialize k clusters by choosing k random instances as cluster “centroids” (where k is an input parameter)
- E-step: Assign each instance to its nearest cluster (using Euclidean distance to the centroid)
- M-step: Recompute the centroid as the center of mass of the instances in the cluster
- Repeat until convergence is achieved
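A minimal k-means sketch following the E-step/M-step description above; the random initialization, the empty-cluster fallback, and the names are illustrative choices.

```python
# Minimal k-means sketch following the slide's E-step / M-step description.
# Illustrative only; initialization and convergence handling are kept simple.
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k random instances as centroids
    for _ in range(n_iters):
        # E-step: assign each instance to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of the instances assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids
```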