
1 Machine Learning. Chen Yu (陈昱), Institute of Computer Science and Technology, Peking University, Research Center for Information Security Engineering

2 Course Information. Instructor: Chen Yu (陈昱), chen_yu@pku.edu.cn, Tel: 82529680. TA: Cheng Zaixing (程再兴), Tel: 62763742, wataloo@hotmail.com. Course webpage: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht

3 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 3

4 Introduction. Bayesian learning is based on two assumptions: quantities of interest are governed by probability distributions, and optimal decisions can be made by reasoning about these distributions together with the observed data.

5 Why Study Bayesian Learning? Certain Bayesian learning algorithms, such as the naive Bayes classifier, are among the most practical approaches to certain learning problems. Bayesian methods also provide a useful perspective for understanding many algorithms that do not explicitly manipulate probabilities.

6 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 6

7 Bayes' Theorem. Discrete case: given events A and B such that B has non-vanishing probability, P(A|B) = P(B|A) P(A) / P(B). P(A) is called the prior probability, in the sense that it is obtained without any information about B. P(A|B) is called the conditional, or posterior, probability of A given B. P(B) plays the role of a normalization constant.

8 Bayes' Theorem (2). For continuous distributions, replacing probabilities by probability density functions (p.d.f.), we have p(y|x) = p(x|y) p(y) / p(x), where p(x) = ∫ p(x|y) p(y) dy.

9 Bayes' Theorem (3). As a mathematical formula, Bayes' theorem is valid under all common interpretations of probability; however, the frequentist and Bayesian interpretations disagree on how (and to what) probabilities are assigned.

10 Bayes' Theorem (4). Frequentist view: probabilities are frequencies of occurrence of random events, as proportions of a whole ("objective"); on this view statistics applies only to reproducible situations and uses only empirical data. Bayesian view: probabilities are rationally coherent degrees of belief, i.e. degrees of belief in a proposition given a body of well-specified information ("subjective"); Bayes' theorem can then be understood as specifying how an ideally rational person responds to evidence, and real-life situations are always embedded in a context and cannot be exactly repeated. Bayes himself belongs to the "objective" camp!

11 Bayes' Theorem (5). In machine learning, we want to determine the best hypothesis h from some space H, given observed training data D. One way of specifying the "best" hypothesis is to interpret it as the most probable hypothesis, given the observed data and the prior probabilities of the various hypotheses in H.

12 Cox-Jaynes Axioms. Assume it is possible to compute a meaningful degree of belief in hypotheses h1, h2, and h3, given data D, by mathematical functions (not necessarily probabilities) of the form P(h1|D), P(h2|D), and P(h3|D). What are the minimum requirements we should demand of such functions? Cox-Jaynes axioms: 1. If P(h1|D) > P(h2|D) and P(h2|D) > P(h3|D), then P(h1|D) > P(h3|D). 2. P(¬h1|D) = f(P(h1|D)) for some function f of the degree of belief. 3. P(h1, h2|D) = g[P(h1|D), P(h2|D, h1)] for some function g of the degrees of belief.

13 Cox's Theorem. If the belief functions P, f, and g satisfy the Cox-Jaynes axioms, then we can choose a scaling such that the smallest value of any proposition is 0 and the largest is 1; furthermore, f(x) = 1 − x and g(x, y) = xy. Corollary: from f and g, the laws of probability follow. Negation and conjunction give disjunction, but only finite additivity, not countable additivity; this is, however, enough in practice.

14 Extended Reading. David Mumford, "The Dawning of the Age of Stochasticity", Mathematics: Frontiers and Perspectives, pp. 197-218, AMS, 2000. "Bayes rules", The Economist, Jan 5th, 2006: on whether the brain copes with everyday judgments in the real world in a Bayesian manner, and on Bayesian vs. frequentist.

15 Extended Reading (2). A book by Sharon Bertsch McGrayne: The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy.

16 MAP & ML Hypotheses. Maximum a posteriori (MAP) hypothesis: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h). Maximum likelihood (ML) hypothesis: h_ML = argmax_{h∈H} P(D|h). Remark: in case every hypothesis is equally likely a priori, the MAP hypothesis becomes the ML hypothesis.
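To make the two definitions concrete, here is a minimal Python sketch (not from the original slides) that selects the MAP and ML hypotheses over a tiny, hypothetical hypothesis space with assumed priors P(h) and likelihoods P(D|h):

```python
# Minimal sketch: MAP vs. ML hypothesis selection over a tiny hypothesis space.
# The hypothesis names, priors, and likelihoods are made-up illustrative numbers.

prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}          # assumed P(h)
likelihood = {"h1": 0.05, "h2": 0.40, "h3": 0.60}  # assumed P(D|h) for observed data D

# ML hypothesis: maximize P(D|h)
h_ml = max(likelihood, key=likelihood.get)

# MAP hypothesis: maximize P(D|h) * P(h); the normalizer P(D) does not affect the argmax
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])

print("ML hypothesis: ", h_ml)    # h3 (largest likelihood)
print("MAP hypothesis:", h_map)   # h2, because the prior re-weights the likelihoods
```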

17 A Simple Example. Consider an example from medical diagnosis: does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

18 A Simple Example (2). In other words, we are given the following information: P(cancer) = 0.008, P(+|cancer) = 0.98, and P(−|¬cancer) = 0.97. Q: Suppose we observe a new patient for whom the test result is positive; should we diagnose the patient as having cancer or not?

19 A Simple Example (3). Consider the MAP hypothesis: P(+|cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078, and P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298. Therefore h_MAP = ¬cancer. Remark: diagnostic knowledge is often more fragile than causal knowledge; Bayes' rule provides a way to update diagnostic knowledge.
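For readers who want the normalized posterior as well, here is a minimal Python sketch (added here, not part of the original slides) that reproduces the computation above and shows that even with a positive test the posterior probability of cancer is only about 21%:

```python
# Bayes' rule for the cancer-test example on the slide.
p_cancer = 0.008            # prior P(cancer)
p_pos_given_cancer = 0.98   # sensitivity P(+|cancer)
p_neg_given_healthy = 0.97  # specificity P(-|~cancer)

# Unnormalized posteriors P(+|h) * P(h)
joint_cancer = p_pos_given_cancer * p_cancer                # ~0.0078
joint_healthy = (1 - p_neg_given_healthy) * (1 - p_cancer)  # ~0.0298

# Normalize by P(+) to get proper posterior probabilities
p_pos = joint_cancer + joint_healthy
print("P(cancer|+)  =", joint_cancer / p_pos)   # ~0.21
print("P(~cancer|+) =", joint_healthy / p_pos)  # ~0.79 -> h_MAP = ~cancer
```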

20 ML & MAP Hypotheses for the Binomial. Problem: estimate θ in b(n, θ), given that the event has occurred r times in n trials. ML hypothesis: maximize the likelihood L(θ) = C(n, r) θ^r (1 − θ)^(n−r), or equivalently its logarithm ln L(θ) = ln C(n, r) + r ln θ + (n − r) ln(1 − θ).

21 ML Hypothesis for the Binomial (contd). Take the derivative of the log-likelihood and find its root: d/dθ [r ln θ + (n − r) ln(1 − θ)] = r/θ − (n − r)/(1 − θ) = 0, which gives θ_ML = r/n.

22 MAP Hypothesis for the Binomial. Consider the beta distribution with p.d.f. p(θ; α, β) = θ^(α−1) (1 − θ)^(β−1) / B(α, β), 0 ≤ θ ≤ 1. Notice that this beta density is the posterior p.d.f. of the parameter p in the binomial distribution b(α+β−2, p), assuming that the event has occurred α−1 times and that the prior p.d.f. of p is uniform.

23 More on the Beta Distribution. It is the conjugate prior of the binomial distribution (and a special case of the Dirichlet distribution with only two parameters).

24 MAP Hypothesis for the Binomial (2). Therefore it is reasonable to assume that the prior p.d.f. of θ is a beta distribution, say with parameters α and β. It follows that the posterior p.d.f. of θ is the beta distribution with parameters α + r and β + n − r, and θ_MAP = (α + r − 1)/(α + β + n − 2). Remark: when n is large enough the prior does not matter much, but it does matter when n is small.
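The following minimal Python sketch (an illustration added here, not from the slides) compares the ML estimate r/n with the MAP estimate (α + r − 1)/(α + β + n − 2) under an assumed Beta(α, β) prior, showing that the prior matters for small n and washes out for large n:

```python
def ml_estimate(r, n):
    """ML estimate of the binomial parameter theta."""
    return r / n

def map_estimate(r, n, alpha, beta):
    """MAP estimate of theta under a Beta(alpha, beta) prior: mode of Beta(alpha+r, beta+n-r)."""
    return (alpha + r - 1) / (alpha + beta + n - 2)

# Assumed prior Beta(5, 5): mild prior belief that theta is near 0.5.
alpha, beta = 5, 5

# Small sample: 3 successes in 4 trials -> the prior pulls the estimate toward 0.5.
print(ml_estimate(3, 4), map_estimate(3, 4, alpha, beta))           # 0.75 vs ~0.583
# Large sample: 750 successes in 1000 trials -> the prior barely matters.
print(ml_estimate(750, 1000), map_estimate(750, 1000, alpha, beta))  # 0.75 vs ~0.748
```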

25 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 25

26 Learning a Real-Valued Function. Consider an arbitrary real-valued target function f. Training examples (x_i, d_i) are assumed to have normally distributed noise e_i, with zero mean and variance σ², added to the true target value f(x_i); in other words, d_i is distributed as N(f(x_i), σ²). Furthermore, assume that e_i is drawn independently for each x_i.

27 Computing the ML Hypothesis. Under the Gaussian-noise assumption, h_ML = argmax_h p(D|h) = argmax_h ∏_i (1/√(2πσ²)) exp(−(d_i − h(x_i))²/(2σ²)). Taking logarithms and dropping the terms that do not depend on h gives h_ML = argmin_h Σ_i (d_i − h(x_i))², i.e. the ML hypothesis is the one that minimizes the sum of squared errors.

28 In the Case of Learning a Linear Function. In the figure on this slide (not reproduced in the transcript), the solid line denotes the target function, the dotted line denotes the learned ML hypothesis, and the dots denote the (noisy) training points.
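As a concrete illustration (added here; the slide itself only shows a plot), the following Python sketch fits a line to noisy samples of an assumed linear target function by minimizing the sum of squared errors, which under the Gaussian-noise model above is exactly the ML hypothesis:

```python
import random

# Assumed target function f(x) = 2x + 1, observed with Gaussian noise (illustrative choice).
random.seed(0)
xs = [i / 10 for i in range(50)]
ds = [2 * x + 1 + random.gauss(0.0, 0.5) for x in xs]

# Closed-form least-squares fit d ~ w*x + b, i.e. the ML hypothesis under Gaussian noise.
n = len(xs)
mean_x = sum(xs) / n
mean_d = sum(ds) / n
w = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds)) / sum((x - mean_x) ** 2 for x in xs)
b = mean_d - w * mean_x

print(f"learned h_ML(x) = {w:.2f}*x + {b:.2f}")  # close to the true 2x + 1
```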

29 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 29

30 Statement of the Problem. Assume that we want to learn a nondeterministic function f: X → {0,1}. For example, X might represent patients in terms of their medical symptoms, and f(x) = 1 if the patient survives the disease and 0 otherwise. f is nondeterministic in the sense that, among a collection of patients exhibiting the same set of observable symptoms, only a certain percentage survive. In this setting we can model the function to be learned as a probability function P(f(x) = 1) (i.e. the probability that a patient with symptoms x will survive), to be estimated from a set of training examples. We model the training set as {(x_i, d_i) | i = 1, …, n}, where both x_i and d_i are random variables and d_i takes the value 1 or 0.

31 ML Hypothesis for the Probability. Under these assumptions, h_ML = argmax_h Σ_i [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))]; equivalently, h_ML minimizes the cross entropy −Σ_i [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))].
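The following minimal Python sketch (added for illustration) computes this cross-entropy objective for two hypothetical hypotheses on a toy data set; the smaller value identifies the (approximate) ML hypothesis:

```python
import math

def cross_entropy(h_values, labels):
    """Negative log-likelihood (cross entropy) of Bernoulli labels under predicted probabilities."""
    return -sum(d * math.log(p) + (1 - d) * math.log(1 - p)
                for p, d in zip(h_values, labels))

# Toy data: labels d_i and two hypotheses' predicted probabilities h(x_i) (illustrative numbers).
labels = [1, 0, 1, 1, 0]
h1 = [0.9, 0.2, 0.8, 0.7, 0.1]   # confident and mostly right
h2 = [0.5, 0.5, 0.5, 0.5, 0.5]   # uninformative

print(cross_entropy(h1, labels))  # smaller -> h1 is preferred under the ML criterion
print(cross_entropy(h2, labels))
```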

32 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 32

33 Ockham's Razor. Q: How do we choose from among multiple consistent hypotheses? Ockham's razor: prefer the simplest hypothesis consistent with the data. (Portrait on the slide: William of Ockham.)

34 Pros and Cons of Ockham's Razor. Pro: there are fewer short hypotheses than long ones → it is less likely that a short hypothesis coincidentally fits the data. Con: there are many ways to define a small set of hypotheses. E.g. consider the following (peculiar) set of decision trees: those with 17 leaf and 11 non-leaf nodes, with attribute A1 as the root, testing A2 through A11 in numerical order. → What is so special about small sets based on the size of hypotheses? Moreover, the size of a hypothesis is determined by the representation used internally by the learner.

35 MAP from the Viewpoint of Information Theory. Shannon's optimal coding theorem: given a class of signals I, the coding scheme under which a random signal has the smallest expected length assigns to signal i a code of length −log2 P(i) bits.

36 MAP from the Viewpoint of Information Theory (2). −log2 P(h) is the description length of h under the optimal coding for the hypothesis space H. −log2 P(D|h) is the description length of the training data D given h, under its optimal coding. The MAP hypothesis is therefore the hypothesis that minimizes length(h) + length(misclassifications).
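To see the equivalence numerically, here is a small Python sketch (added as an illustration, with hypothetical priors and likelihoods) showing that choosing the hypothesis with the largest posterior score P(D|h) P(h) is the same as choosing the one with the smallest total description length −log2 P(h) − log2 P(D|h):

```python
import math

# Hypothetical priors P(h) and likelihoods P(D|h) for a tiny hypothesis space.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.02, "h2": 0.10, "h3": 0.12}

def description_length(h):
    """Bits to encode h plus bits to encode D given h, under optimal codes."""
    return -math.log2(prior[h]) - math.log2(likelihood[h])

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_mdl = min(prior, key=description_length)
print(h_map, h_mdl)  # the same hypothesis under both criteria
```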

37 MDL Principle. It recommends choosing the hypothesis that minimizes the sum of the two description lengths. Assuming we choose codings C1 and C2 for the hypothesis and for the training data given the hypothesis, respectively, the MDL principle can be stated as h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ].

38 Decision Trees as an Example. How do we select the "best" tree? Either measure performance over both training data and separate validation data, or apply the MDL principle: minimize size(tree) + size(misclassifications(tree)). MDL-based methods have produced learned trees whose accuracy is comparable to that of standard tree-pruning methods (Quinlan & Rivest, 1989; Mehta et al., 1995).

39 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 39

40 Most Probable Classification of a New Instance. Given a new instance x, what is its most probable classification? h_MAP(x) might not be the answer! Consider three possible hypotheses with P(h1|D) = .4, P(h2|D) = .3, and P(h3|D) = .3, and a new instance x such that h1(x) = "+", h2(x) = "−", and h3(x) = "−". What is the most probable classification of x?

41 Bayes Optimal Classifier. Assume the possible classification of a new instance can take any value v_j from a set V; it follows that P(v_j|D) = Σ_{h_i∈H} P(v_j|h_i) P(h_i|D). Bayes optimal classification: argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D). In the example above, P(+|D) = .4 and P(−|D) = .6, so the Bayes optimal classification is "−", even though the MAP hypothesis h1 predicts "+".
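Here is a minimal Python sketch (added for illustration) of the Bayes optimal classification rule, applied to the three-hypothesis example above; it confirms that the optimal label is "−" even though the MAP hypothesis predicts "+":

```python
# Posteriors over hypotheses and each hypothesis's (deterministic) prediction for x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def bayes_optimal(x_predictions, h_posterior, values=("+", "-")):
    """Return argmax_v sum_h P(v|h) * P(h|D); here P(v|h) is 1 if h predicts v, else 0."""
    score = {v: sum(p for h, p in h_posterior.items() if x_predictions[h] == v)
             for v in values}
    return max(score, key=score.get), score

label, score = bayes_optimal(prediction, posterior)
print(label, score)  # '-' with score {'+': 0.4, '-': 0.6}
```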

42 Why “Optimal”?  Optimal in the sense that no other classifier using the same H and prior knowledge can outperform it on average 42

43 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 43

44 Gibbs Algorithm. The Bayes optimal classifier is computationally expensive if H contains a large number of hypotheses. An alternative, less optimal classifier is the Gibbs algorithm, defined as follows: 1. Choose a hypothesis at random according to P(h|D), the posterior probability distribution over H. 2. Use it to classify the new instance.
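A minimal Python sketch of this two-step procedure (added here; the hypothesis space, posteriors, and predictions are hypothetical) might look like the following:

```python
import random

# Hypothetical posterior P(h|D) and each hypothesis's prediction for a new instance x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify(x_predictions, h_posterior, rng=random):
    """Step 1: sample h ~ P(h|D); step 2: return h's classification of the instance."""
    hypos = list(h_posterior)
    weights = [h_posterior[h] for h in hypos]
    h = rng.choices(hypos, weights=weights, k=1)[0]
    return x_predictions[h]

random.seed(0)
print([gibbs_classify(prediction, posterior) for _ in range(10)])  # a mix of '+' and '-'
```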

45 Error of the Gibbs Algorithm. Surprising fact: assuming the expected value is taken over target concepts drawn at random according to the prior probability distribution assumed by the learner, then E[error_Gibbs] ≤ 2 E[error_BayesOptimal] (Haussler et al., 1994).

46 Implication for Concept Learning. The version space VS_{H,D} is defined to be the set of all hypotheses h ∈ H that correctly classify all training examples in D. Corollary: if the learner assumes a uniform prior over H, and if target concepts are drawn according to this prior when presented to the learner, then classifying a new instance according to a hypothesis drawn uniformly at random from the version space has expected error at most twice that of the Bayes optimal classifier.

47 Ch6 Bayesian Learning  Introduction  Bayes’ Theorem  Maximum likelihood and least-squared error  Maximum likelihood hypotheses for predicting probabilities  Minimum description length (MDL) principle  Bayes optimal classifier  Gibbs algorithm  Naive Bayes classifier 47

48 Problem Setting. Consider a learning task where each instance is described by a conjunction of attribute values, and the target function f takes values from a finite set V. A training set is provided and a new instance x is presented; the learner is asked to predict f(x). An example is the PlayTennis data used below.

49 Bayesian Approach. Estimating the full likelihood P(a1, …, an | vj) is not feasible unless we have a very large set of training data. However, it is rather easy to estimate each individual P(ai|vj), simply by counting frequencies. This is where the naive Bayes classifier comes in.

50 Naïve Bayes Classifier. The naive Bayes classifier is based on the assumption that attribute values are conditionally independent given the target value, which yields v_NB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj). In naive Bayes learning there is no explicit search through the hypothesis space; all probabilities are estimated simply by counting frequencies in the training data.

51 An Illustrative Example  Consider “PlayTennis” example in Ch2 51

52 Example (2). Given a new instance, what target value does the naive Bayes classifier assign to it?
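As a worked illustration (added here), the following Python sketch applies naive Bayes to the standard 14-example PlayTennis data set with the sample query instance (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong); the data and query are the usual textbook ones and are included only to make the computation concrete:

```python
from collections import Counter, defaultdict

# Standard 14-example PlayTennis data: (Outlook, Temperature, Humidity, Wind) -> PlayTennis.
data = [
    (("sunny", "hot", "high", "weak"), "no"),
    (("sunny", "hot", "high", "strong"), "no"),
    (("overcast", "hot", "high", "weak"), "yes"),
    (("rain", "mild", "high", "weak"), "yes"),
    (("rain", "cool", "normal", "weak"), "yes"),
    (("rain", "cool", "normal", "strong"), "no"),
    (("overcast", "cool", "normal", "strong"), "yes"),
    (("sunny", "mild", "high", "weak"), "no"),
    (("sunny", "cool", "normal", "weak"), "yes"),
    (("rain", "mild", "normal", "weak"), "yes"),
    (("sunny", "mild", "normal", "strong"), "yes"),
    (("overcast", "mild", "high", "strong"), "yes"),
    (("overcast", "hot", "normal", "weak"), "yes"),
    (("rain", "mild", "high", "strong"), "no"),
]

# Estimate P(v) and P(a_i|v) by counting frequencies.
label_counts = Counter(v for _, v in data)
attr_counts = defaultdict(Counter)            # key (position, label) -> Counter of values
for attrs, v in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, v)][a] += 1

def naive_bayes(instance):
    scores = {}
    for v, nv in label_counts.items():
        p = nv / len(data)                    # P(v)
        for i, a in enumerate(instance):
            p *= attr_counts[(i, v)][a] / nv  # P(a_i|v)
        scores[v] = p
    return max(scores, key=scores.get), scores

query = ("sunny", "cool", "high", "strong")
print(naive_bayes(query))  # 'no', with unnormalized scores ~0.0053 (yes) and ~0.0206 (no)
```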

53 Subtleties in Naïve Bayes. 1. The conditional independence assumption is often violated. However, naive Bayes works surprisingly well anyway. Note that what we really need is not that P(a1, …, an|vj) equal ∏_i P(ai|vj), but only that argmax_{vj} P(vj) ∏_i P(ai|vj) = argmax_{vj} P(vj) P(a1, …, an|vj). See [Domingos & Pazzani, 1996] for an analysis of the conditions under which this holds.

54 Subtleties (2). 2. Naive Bayes posteriors are often unrealistically close to 1 or 0. Also, what if none of the training instances with target value vj have attribute value ai? Then the product becomes 0 (a "cold start" phenomenon). One typical solution is the m-estimate for P(ai|vj): P(ai|vj) = (nc + m·p)/(n + m), where n is the number of training examples for which v = vj, nc is the number of examples for which v = vj and a = ai, p is a prior estimate for P(ai|vj), and m is the weight given to the prior (the number of "virtual" examples).
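A one-function Python sketch of the m-estimate (added for illustration) follows; with p = 1/k for k possible attribute values and m = k it reduces to the familiar add-one (Laplace) smoothing:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(a_i|v_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# No training example with this attribute value under class v_j (n_c = 0, n = 5):
# the raw frequency would be 0, but the m-estimate keeps it strictly positive.
print(m_estimate(0, 5, p=1/3, m=3))     # 0.125 instead of 0.0
# Plenty of data: the m-estimate stays close to the raw frequency n_c/n.
print(m_estimate(40, 100, p=1/3, m=3))  # ~0.398 vs 0.40
```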

55 Something Good about Naïve Bayes. Naive Bayes, together with decision trees, neural networks, and SVMs, is among the most popular classifiers in machine learning. A boosted version of naive Bayes is one of the most effective general-purpose learning algorithms.

56 A Comparison. Naive Bayes vs. decision tree learning on a problem similar to PlayTennis, from "Artificial Intelligence: A Modern Approach" by S. Russell and P. Norvig (comparison figure not reproduced here).

57 An Example: Text Classification. Sample classification problems: deciding which news articles are of interest; classifying Web pages by topic. Classical (non-Bayesian) statistical text-learning approaches from information retrieval include TF-IDF (Term Frequency-Inverse Document Frequency) and PRTFIDF (Probabilistic TF-IDF). Other approaches: the naive Bayes classifier and SVMs.

58 Statement of the Problem. Target function: doc → V. E.g. if the target concept is "interesting", then V = {0, 1}; if the target concept is the newsgroup title, then V = {comp.graphics, sci.space, alt.atheism, …}. A natural way to represent a document is as a vector of words; the "naive" representation uses one attribute per word position in the document. Learning: use the training examples to estimate P(vj) and P(doc|vj) for every vj ∈ V.

59 Naïve Bayes Approach. The conditional independence assumption gives P(doc|vj) = ∏_i P(ai = wk | vj), where ai = wk means that the word in position i is the k-th word of the vocabulary. To simplify the formula further, assume that P(ai = wk | vj) is independent of the position i (the "bag of words" model). Such a simplification is questionable in some cases, e.g. when the document has a structure that we want to exploit during learning.

60 The Algorithm
Learn_Naive_Bayes_Text(Examples, V)
1. Collect all words and other tokens that occur in Examples: Vocabulary ← all distinct words and other tokens in Examples.
2. Calculate the required P(vj) and P(wk|vj) terms. For each target value vj in V do:
   docs_j ← the subset of Examples for which the target value is vj
   P(vj) ← #docs_j / #Examples
   Text_j ← a single document created by concatenating all members of docs_j
   n ← total number of distinct word positions in Text_j
   for each word wk in Vocabulary:
     n_k ← number of times word wk occurs in Text_j
     P(wk|vj) ← (n_k + 1) / (n + #Vocabulary)

61 The Algorithm (2). Remark: in the formula for calculating P(wk|vj), we assume that every distinct word is equally likely a priori; a more reasonable assumption is to replace the uniform 1/#Vocabulary prior by the frequency with which word wk appears in English text.
Classify_Naive_Bayes_Text(Doc): return the estimated target value for the document Doc.
   positions ← all word positions in Doc that contain tokens found in Vocabulary
   return v_NB = argmax_{vj∈V} P(vj) ∏_{i∈positions} P(ai|vj)
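A compact Python sketch of both procedures (added here; the tiny corpus and labels are invented purely for illustration) could look like this:

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, label). Returns vocabulary and the learned estimates."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    labels = {v for _, v in examples}
    prior, word_prob = {}, {}
    for v in labels:
        docs_v = [doc for doc, label in examples if label == v]
        prior[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()       # concatenate all docs with label v
        counts = Counter(text_v)
        n = len(text_v)
        # Laplace-smoothed estimate P(w|v) = (n_k + 1) / (n + |Vocabulary|)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, word_prob

def classify_naive_bayes_text(doc, vocabulary, prior, word_prob):
    """Return argmax_v of log P(v) plus the sum of log P(w|v) over known word positions."""
    words = [w for w in doc.split() if w in vocabulary]
    scores = {v: math.log(prior[v]) + sum(math.log(word_prob[v][w]) for w in words)
              for v in prior}
    return max(scores, key=scores.get)

# Invented toy corpus, for illustration only.
examples = [
    ("the game was great and the team won", "sports"),
    ("the team lost the final game", "sports"),
    ("the new phone has a fast processor", "tech"),
    ("the processor and memory were upgraded", "tech"),
]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("the team played a great game", *model))  # 'sports'
```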

62 Further Improvements to the Algorithm. Normalize the input documents first (remove trivial words, normalize the non-trivial words, etc.). Take into account the positions of words inside documents (such as the layout of Web pages), temporal patterns of words, etc. Take into account correlations between words.

63 Summary. Bayes' theorem; ML and MAP hypotheses; the minimum description length (MDL) principle; the Bayes optimal classifier; the Bayesian learning framework; the naive Bayes classifier.

64 Further Reading. Domingos and Pazzani (1996): conditions under which naive Bayes outputs the optimal classification. Cestnik (1990): the m-estimate. Michie et al. (1994), Chauvin et al. (1995): Bayesian perspectives on other learning algorithms such as decision trees, neural networks, etc. On the relation between naive Bayes and "bag of words": Lewis, David (1998), "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval", Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz: Springer, pp. 4-15.

65 HW  6.5(10pt, Due Monday, Oct 31) 65

