Machine Learning  陈昱 (Chen Yu)  Institute of Computer Science and Technology, Peking University  Information Security Engineering Research Center

Course Information  Instructor: 陈昱 (Chen Yu), Tel: 82529680  TA: 程再兴 (Cheng Zaixing), Tel: 62763742  Course web page: qxx2011.mht 2

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 3

Introduction  Bayesian learning is based on assumptions: Quantities of interest are governed by probability distribution Optimal decisions can be made by reasoning these distribution together with observed data 4

Why Study Bayesian Learning?  Certain Bayesian learning algorithms, such as the Naive Bayes classifier, are among the most practical approaches to certain learning problems  Bayesian methods provide a useful perspective for understanding many algorithms that don’t explicitly manipulate probabilities 5

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 6

Bayes’ Theorem  Discrete distribution case: Given events A and B such that B has non-vanishing probability, P(A|B) = P(B|A) P(A) / P(B)  P(A) is called the prior probability, in the sense that it is obtained without information about B  P(A|B) is called the conditional probability, or posterior probability, of A given B  P(B) is a normalization constant 7
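To make the discrete form concrete, here is a minimal Python sketch (not part of the original slides; the hypothesis names and numbers are invented) that turns priors and likelihoods over a small hypothesis space into posteriors via Bayes' theorem:

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a discrete hypothesis space.

    priors:      dict h -> P(h)
    likelihoods: dict h -> P(B|h), probability of the observed event B under h
    returns:     dict h -> P(h|B)
    """
    # Unnormalized posteriors P(B|h) * P(h)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # P(B) is the normalization constant (law of total probability)
    evidence = sum(joint.values())
    return {h: j / evidence for h, j in joint.items()}

# Toy numbers, purely illustrative
print(posterior(priors={"h1": 0.7, "h2": 0.3},
                likelihoods={"h1": 0.1, "h2": 0.8}))
```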

Bayes’ Theorem (2)  For continuous distributions, replacing probabilities by probability density functions (p.d.f.), we have p(y|x) = p(x|y) p(y) / p(x) 8

Bayes’ Theorem (3)  As a mathematical formula, Bayes' theorem is valid in all common interpretations of probability; however, the frequentist and Bayesian interpretations disagree on how (and to what) probabilities are assigned. 9

Bayes’ Theorem (4)  Frequentist: probabilities are the frequencies of occurrence of random events, as proportions of a whole. (Objective) Statistics can be applied only to totally reproducible situations, and it uses only empirical data  In the Bayesian interpretation, probabilities are rationally coherent degrees of belief, or a degree of belief in a proposition given a body of well-specified information. Bayes' theorem can then be understood as specifying how an ideally rational person responds to evidence. (Subjective) Real-life situations are always buried in context and cannot be repeated  Bayes belongs to the “Objective” camp! 10

Bayes’ Theorem (5)  In ML, we want to determine the best hypothesis h from some space H, given observed training data D.  One way of specifying the “best” hypothesis is to interpret it as the most probable hypothesis, given the observed data and the prior probabilities of the various hypotheses in H. 11

Cox-Jaynes Axioms  Assume it is possible to compute a meaningful degree of belief in hypotheses h1, h2, and h3, given data D, by mathematical functions (not necessarily probabilities) of the form P(h1|D), P(h2|D), and P(h3|D). What are the minimum characteristics that we should demand of such functions?  Cox-Jaynes Axioms 1. If P(h1|D)>P(h2|D) and P(h2|D)>P(h3|D), then P(h1|D)>P(h3|D). 2. P(¬h1|D)=f(P(h1|D)) for some function f of degrees of belief 3. P(h1,h2|D)=g[P(h1|D), P(h2|D,h1)] for some function g of degrees of belief 12

Cox’s Theorem  If belief functions P, f, and g satisfy the Cox-Jaynes axioms, then we can choose a scaling such that the smallest value of any proposition is 0 and the largest is 1; furthermore, f(x)=1-x and g(x,y)=xy.  Corollary: From f & g, the laws of probability follow. Negation and conjunction yield disjunction, but only finite additivity, not countable additivity; that is, however, enough in practice. 13

Extended Reading  David Mumford, “The Dawning of the Age of Stochasticity”, Mathematics: Frontiers and Perspectives, AMS  “Bayes rules”, The Economist, Jan 5th: on whether the brain copes with everyday judgments in the real world in a Bayesian manner, and on Bayesian vs. frequentist. 14

Extended Reading (2)  A book by Sharon Bertsch McGrayne: The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy 15

MAP & ML hypotheses  Maximum a posteriori (MAP) hypothesis:  Maximum likelihood (ML) hypothesis:  Remark: In case that every hypo is equally likely, MAP hypo becomes ML hypo. 16

A Simple Example  Consider an example from medical diagnosis: Does the patient has a cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of cases in which the disease are actually present, and a correct negative result in 97% of cases in which the disease are not present. Furthermore, of the entire population have this cancer. 17

A Simple Example (2)  In other words, we are given the following information: P(cancer)=0.008, P(+|cancer)=0.98, and P(-|¬cancer)=0.97.  Q: Suppose we observe a new patient for whom the test result is positive; should we diagnose the patient as having cancer or not? 18

A Simple Example (3)  Consider the MAP hypo: P(+|cancer)P(cancer)=0.98×0.008=0.00784 P(+|¬cancer)P(¬cancer)=0.03×0.992=0.02976  Therefore h_MAP = ¬cancer  Remark: diagnostic knowledge is often more fragile than causal knowledge. Bayes rule provides a way to update diagnostic knowledge. 19
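The same calculation in a few lines of Python, using the numbers given on the slide; normalizing the two scores also gives the posterior P(cancer|+) ≈ 0.21:

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98           # sensitivity
p_neg_given_no_cancer = 0.97        # specificity
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer

# Unnormalized posteriors P(+|h) * P(h)
score_cancer = p_pos_given_cancer * p_cancer              # 0.00784
score_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # 0.02976

h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"
p_cancer_given_pos = score_cancer / (score_cancer + score_no_cancer)
print(h_map, round(p_cancer_given_pos, 3))   # no cancer 0.209
```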

ML & MAP Hypo for Binomial  Problem: Estimate θ in b(n,θ), given that the event has occurred r times during the n number of experiments.  ML hypo: 20

ML Hypo for Binomial (contd)  Take the derivative of the log-likelihood ln P(D|θ) = ln C(n,r) + r ln θ + (n−r) ln(1−θ) and find its root: r/θ − (n−r)/(1−θ) = 0, giving θ_ML = r/n 21

MAP Hypo for Binomial  Consider beta distribution  Notice that beta function is the posterior p.d.f. of the parameter p in the binomial distribution b(α+β-2,p), assuming that the event has occurred α-1 times, and the prior p.d.f. of p is uniform. 22

More on the Beta Distribution  It is the conjugate prior of the binomial distribution (a special case of the Dirichlet distribution, with only two parameters) 23

MAP Hypo for Binomial (2)  Therefore it is reasonable to assume the prior p.d.f. of θ is a beta distribution, say with parameters α & β.  It follows that: The posterior p.d.f. of θ is a beta distribution with parameters α+r & β+n−r, and θ_MAP = (α+r−1)/(α+β+n−2), the mode of that posterior  Rmk: when n is large enough, the prior doesn’t matter, but it does matter when n is small. 24
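A minimal numeric sketch of the two estimates under the Beta(α, β) prior discussed above (the counts and prior parameters are invented for illustration):

```python
def theta_ml(r, n):
    """Maximum likelihood estimate for the binomial parameter: r / n."""
    return r / n

def theta_map(r, n, alpha, beta):
    """MAP estimate with a Beta(alpha, beta) prior: the mode of the posterior
    Beta(alpha + r, beta + n - r), assuming alpha + r > 1 and beta + n - r > 1."""
    return (alpha + r - 1) / (alpha + beta + n - 2)

# Small sample: the prior pulls the estimate toward 0.5
print(theta_ml(r=3, n=4))                        # 0.75
print(theta_map(r=3, n=4, alpha=5, beta=5))      # 7/12 ≈ 0.583
# Large sample: the prior barely matters
print(theta_ml(r=300, n=400))                    # 0.75
print(theta_map(r=300, n=400, alpha=5, beta=5))  # 304/408 ≈ 0.745
```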

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 25

Learn a Real-Valued Function  Consider any real-valued target function f. Training examples (x_i, d_i) are assumed to have normally distributed noise e_i with zero mean and variance σ², added to the true target value f(x_i); in other words, d_i follows N(f(x_i), σ²). Furthermore, assume that e_i is drawn independently for each x_i. 26

Compute ML Hypo  h_ML = argmax_h p(D|h) = argmax_h ∏_i (1/√(2πσ²)) exp(−(d_i − h(x_i))²/(2σ²)) = argmax_h ∑_i −(d_i − h(x_i))² = argmin_h ∑_i (d_i − h(x_i))², i.e. the ML hypothesis is the one minimizing the sum of squared errors. 27

In Case of Learning a Linear Function  (Figure: the solid line denotes the target linear function, the dotted line the learned ML hypothesis, and the dots the noisy training points.) 28
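A short sketch of the equivalence: fitting a linear hypothesis by minimizing squared error is exactly the ML fit under the Gaussian-noise assumption. The target function f(x) = 2x + 1 and the noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True target f(x) = 2x + 1, observed with Gaussian noise of variance sigma^2
x = np.linspace(0, 1, 50)
sigma = 0.1
d = 2 * x + 1 + rng.normal(0.0, sigma, size=x.shape)

# Minimizing sum_i (d_i - h(x_i))^2 over linear h is exactly the ML fit
slope, intercept = np.polyfit(x, d, deg=1)
print(round(slope, 2), round(intercept, 2))   # close to 2 and 1
```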

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 29

Statement of the Problem  Assume that we want to learn a nondeterministic function f: X → {0,1}. For example, X might present patients in terms of their medical symptoms, and f(x)=1 if the patient survives the disease, and 0 otherwise. f is nondeterministic in the sense that, among a collection of patients exhibiting the same set of observable symptoms, we find that only certain percentage of patients have survived.  Consider the above example, we can model the function to be learned as a probability function P(f(x)=1) (i.e. the probability that a patient with symptom x will survive), given a set of training examples. Furthermore, we model the training set as {(x i,d i )|i=1, …,n}, where both x i and d i are random variables, and d i takes value 1 or 0. 30

ML Hypo for the Probability  h_ML = argmax_h ∑_i [ d_i ln h(x_i) + (1−d_i) ln(1−h(x_i)) ]  The expression inside the ∑ is called the cross entropy (up to sign) 31
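A sketch of the quantity being maximized (the outcomes and predicted probabilities are invented); maximizing it is equivalent to minimizing the cross-entropy loss commonly used to train probabilistic classifiers:

```python
import math

def log_likelihood(h_values, d_values):
    """sum_i [ d_i * ln h(x_i) + (1 - d_i) * ln(1 - h(x_i)) ]"""
    return sum(d * math.log(h) + (1 - d) * math.log(1 - h)
               for h, d in zip(h_values, d_values))

d = [1, 0, 1, 1]                 # observed outcomes d_i
h_good = [0.9, 0.2, 0.8, 0.7]    # predicted probabilities P(f(x_i)=1)
h_bad = [0.5, 0.5, 0.5, 0.5]
print(log_likelihood(h_good, d) > log_likelihood(h_bad, d))   # True
```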

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 32

Ockham’s razor  Q: How do we choose from among multiple consistent hypotheses?  Ockham’s razor: Prefer the simplest hypothesis consistent with the data. 33 William of Ockham

Pro and Con of Ockham’s razor  Pro: There are fewer short hypos than long hypos → it is less likely that a short hypo coincidentally fits the data  Con: There are many ways to define a small set of hypos E.g. consider the following (peculiar) set of decision trees: 17 leaf nodes and 11 non-leaf nodes, with attribute A_1 as the root, testing A_2 through A_11 in numerical order. → What is so special about small sets based on the size of hypos? The size of a hypo is determined by the representation used internally by the learner 34

MAP from Viewpoint of Information Theory  Shannon’s optimal coding theorem: Given a class of signal I, the coding scheme for such signals, for which a random signal has the smallest expected length, satisfies: 35

MAP from Viewpoint of Information Theory (2)  −log₂ P(h) is the description length of h under the optimal coding for hypothesis space H  −log₂ P(D|h) is the description length of the training data D given h, under its optimal coding.  The MAP hypothesis is therefore the one that minimizes length(h) + length(D given h), i.e. length(h) + length(misclassifications) 36

MDL Principle  It recommends choosing the hypothesis that minimizes the sum of two description lengths. Assuming we choose codings C_1 & C_2 for the hypothesis and for the training data given the hypothesis, respectively, we can state the MDL principle as h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ] 37
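A minimal sketch of the MDL choice under the coding view of the previous slide, taking description lengths to be −log₂ of the corresponding probabilities (all numbers invented): a complex hypothesis may fit the data better yet lose because it costs more bits to describe.

```python
import math

def mdl_hypothesis(priors, likelihoods):
    """Choose h minimizing -log2 P(h) - log2 P(D|h), i.e.
    length_C1(h) + length_C2(D|h) under the optimal codings."""
    def total_length(h):
        return -math.log2(priors[h]) - math.log2(likelihoods[h])
    return min(priors, key=total_length)

# The complex hypothesis fits the data better but costs more bits to describe
priors = {"simple": 0.5, "complex": 0.01}
likelihoods = {"simple": 0.05, "complex": 0.9}
print(mdl_hypothesis(priors, likelihoods))   # simple
```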

Decision Tree as an Example  How to select the “best” tree? Measure performance over both training data and separate validation data, or Apply the MDL principle: minimize size(tree) + size(misclassifications(tree))  MDL-based methods have produced learned trees whose accuracy was comparable to that of standard tree-pruning methods (Quinlan & Rivest, 1989; Mehta et al., 1995) 38

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 39

Most Probable Classification for a New Instance  Given a new instance x, what is its most probable classification?  h_MAP(x) might not be the answer! Consider:  Three possible hypos: P(h1|D)=.4, P(h2|D)=.3, and P(h3|D)=.3  Given a new instance x s.t. h1(x)=“+”, h2(x)=“-”, and h3(x)=“-”, what is the most probable classification for x? 40

Bayes Optimal Classifier  Assume the possible classification of a new instance can take any value v_j from a set V; it follows that P(v_j|D) = ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)  Bayes optimal classification: argmax_{v_j∈V} ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)  In the above example P(+|D)=.4 and P(−|D)=.6, so the Bayes optimal classification is “-”, even though h_MAP(x)=“+” 41
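The three-hypothesis example transcribed into code, assuming P(v|h) is 1 when h predicts v and 0 otherwise:

```python
def bayes_optimal(posteriors, predictions, values):
    """argmax_v sum_h P(v|h) P(h|D), with P(v|h) = 1 if h predicts v, else 0."""
    def score(v):
        return sum(p for h, p in posteriors.items() if predictions[h] == v)
    return max(values, key=score)

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h(x)
print(bayes_optimal(posteriors, predictions, values=["+", "-"]))   # '-'
```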

Why “Optimal”?  Optimal in the sense that no other classifier using the same H and prior knowledge can outperform it on average 42

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 43

Gibbs Algorithm  The Bayes optimal classifier is quite computationally expensive if H contains a large number of hypotheses.  An alternative, less optimal classifier is the Gibbs algorithm, defined as follows: 1. Choose a hypothesis randomly according to the posterior probability distribution P(h|D) over H. 2. Use it to classify the new instance 44
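A sketch of the two steps, reusing the posteriors and predictions from the Bayes-optimal example above (illustrative only):

```python
import random

def gibbs_classify(posteriors, predictions, rng=random):
    """1. Draw a hypothesis h with probability P(h|D).
       2. Use that single h to classify the new instance."""
    hypos = list(posteriors)
    weights = [posteriors[h] for h in hypos]
    h = rng.choices(hypos, weights=weights, k=1)[0]
    return predictions[h]

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
print(gibbs_classify(posteriors, predictions))   # '+' with prob 0.4, '-' with prob 0.6
```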

Error for Gibbs Algorithm  Surprising fact: Assume the expected value is taken over target concepts drawn at random according to the prior probability distribution assumed by the learner; then E[error_Gibbs] ≤ 2 E[error_BayesOptimal] (Haussler et al. 1994) 45

Implication for Concept Learning  The version space VS_{H,D} is defined to be the set of all hypotheses h ∈ H that correctly classify all training examples in D.  Corollary: If the learner assumes a uniform prior over H, and if target concepts are drawn according to that prior when presented to the learner, then classifying a new instance according to a hypothesis drawn uniformly at random from the version space has expected error at most twice that of the Bayes optimal classifier. 46

Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 47

Problem Setting  Consider a learning task where each instance is described by a conjunction of attribute values, and the target function f takes values from a finite set V. A training set is provided and a new instance x is presented. The learner is asked to predict f(x).  Example: 48

Bayesian Approach  Estimating the full likelihood P(a_1,…,a_n|v_j) is not feasible unless we have a very large set of training data  However, it is rather easy to estimate a single P(a_i|v_j) simply by counting frequencies  Here comes the Naïve Bayes classifier! 49

Naïve Bayes Classifier  Naïve Bayes classifier is based on assumption that attribute values are conditionally independent given the target value  In Naïve Bayes learning there is no explicit searching through the hypo space. All probabilities are estimated by counting frequencies. 50

An Illustrative Example  Consider “PlayTennis” example in Ch2 51

Example (2)  Given a new instance, what target value does the Naïve Bayes classifier assign to it? 52

Subtleties in Naïve Bayes 1. Conditional independence is often violated  However, Naïve Bayes works surprisingly well anyway. Note: what we really need is NOT that the estimated posteriors equal the true posteriors, but just that they yield the same argmax, i.e. the same predicted target value  See [Domingos & Pazzani, 1996] for an analysis of conditions 53

Subtleties (2) 2. Naïve Bayes posteriors are often unrealistically close to 1 or 0. For example, what if none of the training instances with target value v_j have attribute value a_i? Then the whole product becomes 0! (a “cold start” phenomenon)  One typical solution is the m-estimate for P(a_i|v_j): P(a_i|v_j) = (n_c + m·p)/(n + m), where n is the number of training examples for which v=v_j, n_c is the number of examples for which v=v_j and a=a_i, p is a prior estimate for P(a_i|v_j), and m is the weight given to the prior (the number of “virtual” examples) 54
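A small sketch of the m-estimate (the counts and prior are invented); with m = 0 it reduces to the raw frequency, and larger m pulls the estimate toward the prior p:

```python
def m_estimate(n_c, n, p, m):
    """P(a_i|v_j) ≈ (n_c + m*p) / (n + m).

    n_c: examples with v = v_j and a = a_i
    n:   examples with v = v_j
    p:   prior estimate of P(a_i|v_j)
    m:   equivalent number of 'virtual' examples
    """
    return (n_c + m * p) / (n + m)

# Attribute value never seen with this class: the raw frequency would be 0
print(m_estimate(n_c=0, n=14, p=1/3, m=3))   # ≈ 0.059 instead of 0.0
```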

Something Good about Naïve Bayes  Naïve Bayes, together with decision trees, neural networks, and SVMs, is among the most popular classifiers in machine learning  A boosted version of Naïve Bayes is one of the most effective general-purpose learning algorithms 55

A Comparison  Naïve Bayes vs. decision tree on a problem similar to PlayTennis, from “Artificial Intelligence” by S. Russell. 56

An Example: Text Classification  Sample classification problems: Which news articles are of interest Classifying Web pages by topic  Classical approaches (non-Bayesian, statistical text learning algorithms) from information retrieval: TF-IDF (Term Frequency-Inverse Document Frequency) PRTFIDF (Probabilistic TFIDF)  Naïve Bayes Classifier  SVM 57

Statement of the Problem  Target function: doc → V E.g. the target concept is interesting, then V={0,1}; the target concept is NewsGroup title, then V={comp.graphics, sci-space, alt.atheism,…};  A natural way to represent a document is to denote it by a vector of words “Naïve” representation: one attribute per word position in Doc  Learning: Use training examples to estimate P(v j ) and P(doc|v j ) for every v j ∈ V. 58

Naïve Bayes Approach  The conditional independence assumption leads to P(doc|v_j) = ∏_{i=1..length(doc)} P(a_i = w_k | v_j), where a_i is the word at position i  To simplify the above formula, assume that P(a_i = w_k | v_j) is independent of the position i, writing it as P(w_k|v_j) (the so-called “bag of words” model) However, such a simplification is questionable in cases where, e.g., the doc has a certain structure and we want to utilize that during learning. 59

The Algorithm  Learn_Naive_Bayes_Text(Examples, V)
1. Collect all words and other tokens that occur in Examples: Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(v_j) and P(w_k|v_j) terms. For each target value v_j in V do:
   docs_j ← the subset of Examples for which the target value is v_j
   P(v_j) ← #docs_j / #Examples
   Text_j ← a single document created by concatenating all members of docs_j
   n ← total number of word positions in Text_j
   for each word w_k in Vocabulary:
     n_k ← number of times word w_k occurs in Text_j
     P(w_k|v_j) ← (n_k + 1) / (n + #Vocabulary)
60

The Algorithm (2)  Remark: In the formula for calculating P(w_k|v_j), we assume that every distinct word is equally likely to appear in Examples; however, a more reasonable assumption is to replace 1/#Vocabulary by the frequency with which word w_k appears in English text*.
Classify_Naive_Bayes_Text(Doc) returns the estimated target value for the document Doc:
   positions ← all word positions in Doc that contain tokens found in Vocabulary
   return v_NB = argmax_{v_j∈V} P(v_j) ∏_{i∈positions} P(a_i|v_j)
61
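For concreteness, a compact Python rendering of the two procedures above. This is a sketch under simplifying assumptions (documents are plain strings split on whitespace; the tiny training set is invented), not the original course code:

```python
import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    log_prior, log_cond = {}, defaultdict(dict)
    for v in {label for _, label in examples}:
        docs_v = [doc for doc, label in examples if label == v]
        log_prior[v] = math.log(len(docs_v) / len(examples))
        counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(counts.values())              # word positions in Text_v
        for w in vocabulary:                  # (n_k + 1) / (n + #Vocabulary), as on the slide
            log_cond[v][w] = math.log((counts[w] + 1) / (n + len(vocabulary)))
    return vocabulary, log_prior, log_cond

def classify_naive_bayes_text(doc, vocabulary, log_prior, log_cond):
    words = [w for w in doc.split() if w in vocabulary]
    return max(log_prior,
               key=lambda v: log_prior[v] + sum(log_cond[v][w] for w in words))

examples = [("great fun to play", "like"), ("boring waste of time", "dislike"),
            ("fun and exciting", "like"), ("dull and boring", "dislike")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("fun and exciting game", *model))   # like
```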

Further Improvements to the Algorithm  Normalize input documents first (removing trivial words, normalizing non-trivial words, etc.)  Consider positions of words inside documents (such as the layout of Web pages), temporal patterns of words, etc.  Consider correlations between words 62

Summary  Bayes Theorem  ML and MAP hypotheses  Minimum description length (MDL) principle  Bayes Optimal Classifier  Bayesian learning framework  Naïve Bayes Classifier 63

Further Reading  Domingos and Pazzani (1996): conditions under which Naïve Bayes outputs the optimal classification  Cestnik (1990): the m-estimate  Michie et al. (1994), Chauvin et al. (1995): Bayesian approaches to other learning algorithms such as decision trees, neural networks, etc.  On the relation between Naïve Bayes and “bag of words”: Lewis, David (1998). "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval". Proceedings of ECML-98, 10th European Conference on Machine Learning. Chemnitz, DE: Springer Verlag, Heidelberg, DE. pp. 4–15. 64

HW  6.5 (10 pt, due Monday, Oct 31) 65