Machine Learning. Chen Yu, Institute of Computer Science and Technology, Peking University; Information Security Engineering Research Center
Course Information. Instructor: Chen Yu, chen_yu@pku.edu.cn, Tel: 82529680. Teaching assistant: Cheng Zaixing, Tel: 62763742, wataloo@hotmail.com. Course page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Introduction. Bayesian learning rests on two assumptions: quantities of interest are governed by probability distributions, and optimal decisions can be made by reasoning about these distributions together with the observed data.
Why Study Bayesian Learning? Certain Bayesian learning algorithms, such as the Naive Bayes classifier, are among the most practical approaches to certain learning problems. Bayesian methods also provide a useful perspective for understanding many algorithms that don't explicitly manipulate probabilities.
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Bayes' Theorem. Discrete distribution case: given events A and B such that B has non-vanishing probability, P(A) is called the prior probability, in the sense that it is obtained without any information about B; P(A|B) is called the conditional probability, or posterior probability, of A given B; and P(B) acts as a normalization constant.
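The statement of the theorem itself appeared as an image on the original slide; reconstructed in standard notation:

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$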
Bayes' Theorem (2). For continuous distributions, replacing probabilities by probability density functions (p.d.f.s), we have the analogous relation shown below.
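A reconstruction of the density form (the formula was an image on the slide), with f denoting the relevant p.d.f.s:

$$ f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{\int f(x \mid \theta')\, f(\theta')\, d\theta'} $$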
Bayes' Theorem (3). As a mathematical formula, Bayes' theorem is valid in all common interpretations of probability; however, the frequentist and Bayesian interpretations disagree on how (and to what) probabilities are assigned.
Bayes' Theorem (4). Frequentist (objective) view: probabilities are the frequencies of occurrence of random events, as proportions of a whole. On this view, statistics cannot be used in any but totally reproducible situations and relies only on empirical data. Bayesian (subjective) view: probabilities are rationally coherent degrees of belief, or degrees of belief in a proposition given a body of well-specified information; Bayes' theorem can then be understood as specifying how an ideally rational person responds to evidence. Real-life situations are always buried in context and cannot be exactly repeated. Bayes himself belongs to the "objective" camp!
Bayes' Theorem (5). In ML, we want to determine the best hypothesis h from some space H, given observed training data D. One way of specifying the "best" hypothesis is to interpret it as the most probable hypothesis, given the observed data and the prior probabilities of the various hypotheses in H.
Cox-Jaynes Axioms. Assume it is possible to compute a meaningful degree of belief in hypotheses h1, h2, and h3, given data D, by mathematical functions (not necessarily probabilities) of the form P(h1|D), P(h2|D), and P(h3|D). What are the minimum characteristics that we should demand of such functions? The Cox-Jaynes axioms:
1. If P(h1|D) > P(h2|D) and P(h2|D) > P(h3|D), then P(h1|D) > P(h3|D).
2. P(¬h1|D) = f(P(h1|D)) for some function f of degrees of belief.
3. P(h1, h2|D) = g[P(h1|D), P(h2|D, h1)] for some function g of degrees of belief.
Cox's Theorem. If the belief functions P, f, and g satisfy the Cox-Jaynes axioms, then we can choose a scaling such that the smallest value of any proposition is 0 and the largest is 1; furthermore, f(x) = 1 - x and g(x, y) = xy. Corollary: from f and g, the laws of probability follow. Negation and conjunction yield disjunction, but only finite additivity, not countable additivity; in practice, however, this is enough.
Extended Reading. David Mumford, "The Dawning of the Age of Stochasticity", Mathematics: Frontiers and Perspectives, pp. 197-218, AMS, 2000. "Bayes rules", The Economist, Jan 5, 2006: on whether the brain copes with everyday judgments in the real world in a Bayesian manner, and on Bayesian vs. frequentist.
Extended Reading (2). A book by Sharon Bertsch McGrayne: The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy.
MAP & ML Hypotheses. The maximum a posteriori (MAP) hypothesis and the maximum likelihood (ML) hypothesis are defined by the formulas below. Remark: in case every hypothesis is equally likely a priori, the MAP hypothesis reduces to the ML hypothesis.
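The two definitions appeared as images on the original slide; reconstructed in the usual textbook notation:

$$ h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h), \qquad h_{ML} = \arg\max_{h \in H} P(D \mid h) $$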
A Simple Example. Consider an example from medical diagnosis: does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.
A Simple Example (2). In other words, we are given the following information: P(cancer)=0.008, P(+|cancer)=0.98, and P(-|¬cancer)=0.97. Q: Suppose we observe a new patient for whom the test result is positive; should we diagnose the patient as having cancer or not?
A Simple Example (3). Consider the MAP hypothesis: P(+|cancer)P(cancer) = 0.98×0.008 ≈ 0.0078, while P(+|¬cancer)P(¬cancer) = 0.03×0.992 ≈ 0.0298. Therefore h_MAP = ¬cancer. Remark: diagnostic knowledge is often more fragile than causal knowledge; Bayes' rule provides a way to update diagnostic knowledge.
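A quick numerical check of the above, normalizing the two joint probabilities to obtain the posterior (a minimal sketch, not part of the original slides):

```python
# Posterior for the cancer example: normalize P(+|h)P(h) over both hypotheses.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 1 - 0.97  # false-positive rate

joint_cancer = p_pos_given_cancer * p_cancer                 # ~0.0078
joint_not_cancer = p_pos_given_not_cancer * (1 - p_cancer)   # ~0.0298

p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not_cancer)
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")  # ~0.21, so h_MAP = ¬cancer
```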
ML & MAP Hypotheses for the Binomial. Problem: estimate θ in b(n, θ), given that the event has occurred r times in n trials. ML hypothesis: maximize the binomial likelihood in θ (reconstructed below).
ML Hypothesis for the Binomial (contd). Take the derivative of the log-likelihood and find its root:
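The likelihood and the resulting estimate appeared as formulas on the slides; a reconstruction of the standard derivation:

$$ L(\theta) = \binom{n}{r}\,\theta^{r}(1-\theta)^{n-r}, \qquad \frac{d}{d\theta}\ln L(\theta) = \frac{r}{\theta} - \frac{n-r}{1-\theta} = 0 \;\Rightarrow\; \theta_{ML} = \frac{r}{n} $$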
MAP Hypothesis for the Binomial. Consider the beta distribution. Notice that the Beta(α, β) density is the posterior p.d.f. of the parameter p of the binomial distribution b(α+β-2, p), assuming that the event has occurred α-1 times and that the prior p.d.f. of p is uniform.
More on the Beta Distribution. It is the conjugate prior of the binomial distribution (and a special case of the Dirichlet distribution, with only two parameters).
MAP Hypothesis for the Binomial (2). Therefore it is reasonable to assume that the prior p.d.f. of θ is a beta density, say with parameters α and β. It follows that the posterior p.d.f. of θ is a beta density with parameters α+r and β+n-r, and the MAP estimate follows (see below). Remark: when n is large enough the prior hardly matters, but it does matter when n is small.
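A reconstruction of the formulas the slide refers to (standard conjugate-prior algebra; the original images are not reproduced here):

$$ p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad p(\theta \mid D) \propto \theta^{\alpha+r-1}(1-\theta)^{\beta+n-r-1}, \qquad \theta_{MAP} = \frac{\alpha+r-1}{\alpha+\beta+n-2} $$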
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Learn a Real-Valued Function. Consider any real-valued target function f. Training examples (x_i, d_i) are assumed to have normally distributed noise e_i, with zero mean and variance σ², added to the true target value f(x_i); in other words, d_i is distributed as N(f(x_i), σ²). Furthermore, assume that e_i is drawn independently for each x_i.
Compute the ML Hypothesis
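The derivation on this slide was an image; the standard argument it presents, under the Gaussian-noise assumption of the previous slide, is:

$$ h_{ML} = \arg\max_{h} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(\!-\frac{(d_i - h(x_i))^2}{2\sigma^2}\Big) = \arg\min_{h} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2 $$

That is, under these assumptions the maximum likelihood hypothesis is exactly the least-squared-error hypothesis.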
In the Case of Learning a Linear Function. (Figure omitted.) In the figure, the solid line denotes the target function, the dotted line denotes the learned ML hypothesis, and the dots denote the (noisy) training points.
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Statement of the Problem. Assume that we want to learn a nondeterministic function f: X → {0,1}. For example, X might represent patients in terms of their medical symptoms, and f(x)=1 if the patient survives the disease, and 0 otherwise. f is nondeterministic in the sense that, among a collection of patients exhibiting the same set of observable symptoms, we find that only a certain percentage have survived. In this example, we can therefore model the function to be learned as a probability function P(f(x)=1) (i.e., the probability that a patient with symptoms x will survive), given a set of training examples. Furthermore, we model the training set as {(x_i, d_i) | i=1, …, n}, where both x_i and d_i are random variables, and d_i takes the value 1 or 0.
ML Hypothesis for the Probability. (The derivation appeared as a formula on the slide; see the reconstruction below.) The expression inside the ∑ is called the cross entropy.
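A reconstruction of the standard result the slide states, for Bernoulli-valued targets d_i:

$$ h_{ML} = \arg\max_{h} \sum_{i=1}^{m} \Big[ d_i \ln h(x_i) + (1-d_i)\ln\big(1-h(x_i)\big) \Big] $$

The negation of the bracketed term is the cross entropy between d_i and h(x_i).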
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Ockham's Razor. Q: How do we choose from among multiple consistent hypotheses? Ockham's razor: prefer the simplest hypothesis consistent with the data. (Portrait on slide: William of Ockham.)
Pros and Cons of Ockham's Razor. Pro: there are fewer short hypotheses than long ones, so it is less likely that a short hypothesis coincidentally fits the data. Con: there are many ways to define a small set of hypotheses. E.g., consider the following (peculiar) set of decision trees: those with 17 leaves and 11 non-leaf nodes, with attribute A1 as the root and testing A2 through A11 in numerical order. What is so special about small sets defined by hypothesis size? Moreover, the size of a hypothesis is determined by the representation used internally by the learner.
MAP from the Viewpoint of Information Theory. Shannon's optimal coding theorem: given a class of signals I, the coding scheme for such signals for which a random signal has the smallest expected length assigns a signal of probability p a codeword of length -log2 p bits.
MAP from the Viewpoint of Information Theory (2). -log2 P(h) is the description length of h under the optimal coding for the hypothesis space H, and -log2 P(D|h) is the description length of the training data D given h, under its optimal coding. The MAP hypothesis therefore prefers the hypothesis that minimizes length(h) + length(misclassifications given h).
MDL Principle. It recommends choosing the hypothesis that minimizes the sum of these two description lengths. Assuming we choose codings C1 and C2 for the hypothesis and for the training data given the hypothesis, respectively, we can state the MDL principle as shown below.
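A reconstruction of the two formulas referenced on the last two slides (standard statements; the slide's original images are not reproduced):

$$ h_{MAP} = \arg\min_{h \in H} \big[ -\log_2 P(D \mid h) - \log_2 P(h) \big], \qquad h_{MDL} = \arg\min_{h \in H} \big[ L_{C_2}(D \mid h) + L_{C_1}(h) \big] $$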
Decision Tree as an Example. How do we select the "best" tree? Either measure performance over both the training data and separate validation data, or apply the MDL principle: minimize size(tree) + size(misclassifications(tree)). MDL-based methods have produced learned trees whose accuracy was comparable to that of standard tree-pruning methods (Quinlan & Rivest, 1989; Mehta et al., 1995).
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Most Probable Classification for a New Instance. Given a new instance x, what is its most probable classification? h_MAP(x) might not be the answer! Consider three possible hypotheses: P(h1|D)=.4, P(h2|D)=.3, and P(h3|D)=.3. Given a new instance x such that h1(x)="+", h2(x)="-", and h3(x)="-", what is the most probable classification for x?
Bayes Optimal Classifier. Assume the possible classification of a new instance can take any value v_j from a set V; the Bayes optimal classification is defined below, along with its value in the above example.
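A reconstruction of the definition and of the worked example (the formulas were images on the slide):

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$

In the example, the sum for "+" is P(+|h1)P(h1|D) = 0.4, while the sum for "-" is 0.3 + 0.3 = 0.6, so the Bayes optimal classification is "-", even though h_MAP = h1 predicts "+".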
Why "Optimal"? Optimal in the sense that no other classifier using the same hypothesis space H and prior knowledge can outperform it on average.
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Gibbs Algorithm. The Bayes optimal classifier is quite computationally expensive if H contains a large number of hypotheses. An alternative, less optimal classifier is the Gibbs algorithm, defined as follows: 1. Choose a hypothesis at random according to the posterior probability distribution P(h|D) over H. 2. Use it to classify the new instance.
Error of the Gibbs Algorithm. Surprising fact: assume the expected value is taken over target concepts drawn at random according to the prior probability distribution assumed by the learner; then the bound below holds (Haussler et al., 1994).
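The inequality the slide cites, reconstructed in standard form:

$$ \mathbb{E}\big[\mathrm{error}_{\mathrm{Gibbs}}\big] \;\le\; 2\,\mathbb{E}\big[\mathrm{error}_{\mathrm{BayesOptimal}}\big] $$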
Implications for Concept Learning. The version space VS_{H,D} is defined to be the set of all hypotheses h ∈ H that correctly classify all training examples in D. Corollary: if the learner assumes a uniform prior over H, and if target concepts are drawn according to that prior when presented to the learner, then classifying a new instance according to a hypothesis drawn uniformly at random from the version space has expected error at most twice that of the Bayes optimal classifier.
Ch6 Bayesian Learning: Introduction; Bayes' Theorem; Maximum likelihood and least-squared error; Maximum likelihood hypotheses for predicting probabilities; Minimum description length (MDL) principle; Bayes optimal classifier; Gibbs algorithm; Naive Bayes classifier
Problem Setting. Consider a learning task where each instance is described by a conjunction of attributes, and the target function f takes values from a finite set V. A training set is provided and a new instance x is presented; the learner is asked to predict f(x). (The example appeared as a figure on the original slide.)
Bayesian Approach. Estimating the full likelihood distribution over attribute conjunctions is not feasible unless we have a very large set of training data. However, it is rather easy to estimate a single P(a_i|v_j) simply by counting frequencies. This is where the Naïve Bayes classifier comes in.
Naïve Bayes Classifier. The Naïve Bayes classifier is based on the assumption that attribute values are conditionally independent given the target value (see the decision rule below). In Naïve Bayes learning there is no explicit search through the hypothesis space; all probabilities are estimated by counting frequencies.
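The resulting decision rule, reconstructed in the usual notation:

$$ v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j) $$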
An Illustrative Example. Consider the "PlayTennis" example from Ch. 2 (the training table appeared on the slide).
Example (2). Given a new instance, what target value does the Naïve Bayes classifier assign to it? (A worked sketch follows.)
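A worked sketch, assuming the standard 14-example PlayTennis table from Mitchell's text and the usual query instance (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong); the counts below come from that table rather than from the slide itself:

```python
# Naive Bayes on the PlayTennis query (counts assume Mitchell's 14-example table).
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}
query = ["sunny", "cool", "high", "strong"]

scores = {}
for label in priors:
    score = priors[label]
    for value in query:
        score *= likelihoods[label][value]
    scores[label] = score

print(scores)                       # yes ~0.0053, no ~0.0206
print(max(scores, key=scores.get))  # -> "no"
```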
Subtleties in Naïve Bayes. 1. Conditional independence is often violated; however, Naïve Bayes works surprisingly well anyway. Note: what we really need is not that the estimated posteriors be accurate, but only that their argmax agree with the argmax of the true posteriors. See Domingos & Pazzani (1996) for an analysis of the conditions.
Subtleties (2). 2. Naïve Bayes posteriors are often unrealistically close to 1 or 0. For example, what if none of the training instances with target value v_j has attribute value a_i? Then the product becomes 0, a "cold start" phenomenon. One typical solution is the m-estimate for P(a_i|v_j): P(a_i|v_j) = (n_c + mp)/(n + m), where n is the number of training examples for which v=v_j, n_c is the number of those examples for which additionally a=a_i, p is a prior estimate for P(a_i|v_j), and m is the weight given to the prior (the number of "virtual" examples).
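A minimal sketch of the m-estimate (the helper name is my own, not from the course materials):

```python
def m_estimate(n_c: int, n: int, p: float, m: float) -> float:
    """m-estimate of P(a_i | v_j): smooth the raw frequency n_c/n toward
    the prior p using m 'virtual' examples."""
    return (n_c + m * p) / (n + m)

# Without smoothing the estimate collapses to 0; with m = 3 and a uniform
# prior over three attribute values (p = 1/3) it stays strictly positive.
print(m_estimate(0, 5, 1 / 3, 0))  # 0.0
print(m_estimate(0, 5, 1 / 3, 3))  # 0.125
```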
Something Good about Naïve Bayes. Naïve Bayes, together with decision trees, neural networks, and SVMs, is among the most popular classifiers in machine learning. A boosted version of Naïve Bayes is one of the most effective general-purpose learning algorithms.
A Comparison. Naïve Bayes vs. decision tree on a problem similar to PlayTennis, from "Artificial Intelligence" by S. Russell. (Figure omitted.)
An Example: Text Classification. Sample classification problems: which news articles are of interest; classifying Web pages by topic. Approaches include classical (non-Bayesian) statistical text learning algorithms from information retrieval, such as TF-IDF (term frequency-inverse document frequency) and PRTFIDF (probabilistic TF-IDF), as well as the Naïve Bayes classifier and SVMs.
Statement of the Problem. Target function: doc → V. E.g., if the target concept is "interesting", then V={0,1}; if the target concept is the newsgroup title, then V={comp.graphics, sci.space, alt.atheism, …}. A natural way to represent a document is as a vector of words; the "naïve" representation uses one attribute per word position in the document. Learning: use the training examples to estimate P(v_j) and P(doc|v_j) for every v_j ∈ V.
Naïve Bayes Approach. The conditional independence assumption gives P(doc|v_j) = ∏_i P(a_i = w_k | v_j), a product over the word positions i in the document. To simplify this formula further, assume that P(a_i = w_k | v_j) is independent of the position i (the "bag of words" model). However, such a simplification is questionable in cases where, e.g., the document has a certain structure that we would like to exploit during learning.
The Algorithm. Learn_Naive_Bayes_Text(Examples, V)
1. Collect all words and other tokens that occur in Examples: Vocabulary ← all distinct words and other tokens in Examples.
2. Calculate the required P(v_j) and P(w_k|v_j) terms. For each target value v_j in V do:
docs_j ← the subset of Examples for which the target value is v_j
P(v_j) ← #docs_j / #Examples
Text_j ← a single document created by concatenating all members of docs_j
n ← total number of word positions in Text_j
for each word w_k in Vocabulary:
n_k ← number of times word w_k occurs in Text_j
P(w_k|v_j) ← (n_k + 1) / (n + #Vocabulary)
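A runnable sketch of this learner together with the classifier described on the next slide (a minimal reconstruction; the function and variable names are mine, not from the course code; it assumes every label in V occurs in the training set):

```python
from collections import Counter
from math import log

def learn_naive_bayes_text(examples, labels):
    """examples: list of (tokens, label) pairs; labels: the target-value set V.
    Returns the vocabulary, the class priors P(v_j), and smoothed P(w_k | v_j)."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    priors, word_probs = {}, {}
    for v in labels:
        docs_v = [tokens for tokens, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for tokens in docs_v for w in tokens]   # concatenation of docs_j
        counts, n = Counter(text_v), len(text_v)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc_tokens, vocabulary, priors, word_probs):
    """v_NB = argmax_v [log P(v) + sum over known positions of log P(w_i | v)]."""
    positions = [w for w in doc_tokens if w in vocabulary]
    return max(priors, key=lambda v: log(priors[v]) +
               sum(log(word_probs[v][w]) for w in positions))

# Example usage on toy data:
train = [(["cheap", "pills", "now"], "spam"), (["meeting", "at", "noon"], "ham")]
vocab, priors, probs = learn_naive_bayes_text(train, {"spam", "ham"})
print(classify_naive_bayes_text(["cheap", "meeting", "pills"], vocab, priors, probs))
```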
The Algorithm (2). Remark: the formula for P(w_k|v_j) assumes, a priori, that every distinct word is equally likely to appear in Examples; a more reasonable assumption is to replace the uniform 1/#Vocabulary with the frequency of word w_k in English literature. Classify_Naive_Bayes_Text(Doc): return the estimated target value for the document Doc. positions ← all word positions in Doc that contain tokens found in Vocabulary; return v_NB, computed over those positions.
Further Improvements to the Algorithm. Normalize input documents first (remove trivial words, normalize the non-trivial words, etc.). Consider the positions of words inside documents (such as the layout of Web pages), temporal patterns of words, etc. Consider correlations among words.
Summary. Bayes' theorem; ML and MAP hypotheses; the minimum description length (MDL) principle; the Bayes optimal classifier; the Bayesian learning framework; the Naïve Bayes classifier.
Further Reading. Domingos and Pazzani (1996): conditions under which Naïve Bayes outputs the optimal classification. Cestnik (1990): the m-estimate. Michie et al. (1994), Chauvin et al. (1995): Bayesian approaches to other learning algorithms such as decision trees, neural networks, etc. On the relation between Naïve Bayes and "bag of words": Lewis, David (1998). "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval". Proceedings of ECML-98, 10th European Conference on Machine Learning. Chemnitz, DE: Springer Verlag, Heidelberg, DE. pp. 4-15.
HW 6.5 (10 pt, due Monday, Oct 31)