Bayesian Networks Martin Bachler MLA - VO
3 Overview „Microsoft's competitive advantage lies in its expertise in Bayesian networks“ (Bill Gates, quoted in LA Times, 1996)
4 Overview (Recap of) Definitions Naive Bayes –Performance/Optimality ? –How important is independence ? –Linearity ? Bayesian networks
5 Definitions Conditional probability Bayes theorem
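In standard notation, the two definitions referred to here are:

P(A \mid B) = \frac{P(A, B)}{P(B)}

P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}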
6 Definitions Bayes theorem Likelihood Prior probability normalization term
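With the roles of the terms made explicit:

P(C \mid x) = \frac{\overbrace{P(x \mid C)}^{\text{likelihood}} \; \overbrace{P(C)}^{\text{prior}}}{\underbrace{P(x)}_{\text{normalization term}}}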
7 Definitions Classification problem –Input space X = X_1 × X_2 × … × X_n –Output space Y = {0,1} –Target concept C: X → Y –Hypothesis space H Bayesian way of classifying an instance:
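The Bayesian way of classifying an instance x = (x_1, …, x_n) is to pick the most probable class:

c^*(x) = \arg\max_{y \in Y} P(y \mid x_1, \dots, x_n)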
8 Definitions Theoretically OPTIMAL! For large n the estimation of P(x_1, …, x_n | C) is very hard! => Assumption: pairwise conditional independence between the input variables given C:
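Under this assumption the joint likelihood factorizes, which gives the naive Bayes rule:

P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

c_{NB}(x) = \arg\max_{y \in Y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)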
9 Overview (Recap of) Definitions Naive Bayes –Performance/Optimality ? –How important is independence ? –Linearity ? Bayesian networks
10 Naive Bayes
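A minimal Python sketch of the resulting classifier for binary attributes; probabilities are estimated by plain counting (no smoothing), and all names and the toy data are illustrative, not from the slides:

def train_naive_bayes(X, y):
    """Estimate P(C) and P(x_i = 1 | C) from binary data by counting (no smoothing)."""
    n = len(X[0])
    classes = sorted(set(y))
    prior = {c: sum(1 for t in y if t == c) / len(y) for c in classes}
    cond = {}                                   # cond[c][i] = P(x_i = 1 | C = c)
    for c in classes:
        rows = [x for x, t in zip(X, y) if t == c]
        cond[c] = [sum(r[i] for r in rows) / len(rows) for i in range(n)]
    return prior, cond

def classify(x, prior, cond):
    """Return argmax_c P(c) * prod_i P(x_i | c)."""
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            p = cond[c][i]
            s *= p if xi == 1 else (1.0 - p)
        return s
    return max(prior, key=score)

# Tiny usage example with made-up data
X = [(1, 1), (1, 0), (0, 1), (0, 0)]
y = [1, 1, 0, 0]
prior, cond = train_naive_bayes(X, y)
print(classify((1, 0), prior, cond))   # -> 1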
11 Example (table) A small example dataset over binary attributes x_1, x_2 with class C, together with the estimated probabilities P(C), P(x_1 | C) and P(x_2 | C).
12 Naive Bayes - Independence The independence assumption is very strict! For most practical problems it is blatantly wrong! (not even fulfilled in the previous example!...see later) => Is naive Bayes a rather „academic“ algorithm ?
13 Naive Bayes - Independence For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities) Guess: For problems for which the independence assumption holds. Let's check… (empirically + theoretically)
14 Independence - Example (table) For each combination of x_1, x_2 and C, the joint probability P(x_1, x_2 | C) is compared with the product P(x_1 | C) P(x_2 | C).
15 Independence - Example
16 Independence - Example (table) The same comparison of P(x_1, x_2 | C) with the product P(x_1 | C) P(x_2 | C) for a second example dataset.
17 Independence - Example
18 Naive Bayes - Independence [1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
19 Naive Bayes - Independence
20 Naive Bayes - Independence For which problems is naive Bayes optimal ? Guess: For problems for which the independence assumption holds Empirical answer: Not really…. Theoretical answer ?
21 Naive Bayes - optimality Example: 3 features x_1, x_2, x_3; P(c=0) = P(c=1); x_1, x_3 independent; x_2 = x_1 (totally dependent) => compare optimal classification vs. naive Bayes: [1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
22 Naive Bayes - optimality Let p = P(1 | x_1), q = P(1 | x_3). (figure) Decision regions of the optimal and the naive Bayes classifier over (p, q): the independence assumption holds only in part of the plane, and the two classifiers disagree only in a small region.
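Following the example in [1], with equal class priors the two decision rules can be written as:

Optimal: predict class 1 iff \; p\,q > (1-p)(1-q), \; equivalently \; p + q > 1.

Naive Bayes (x_2 duplicates x_1, so x_1 is effectively counted twice): predict class 1 iff \; p^2 q > (1-p)^2 (1-q).

The two rules agree everywhere except in the narrow regions between the line p + q = 1 and the curve p^2 q = (1-p)^2 (1-q), even though the independence assumption is clearly violated.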
23 Naive Bayes - optimality In general, for an instance x = (x_1, …, x_n): Theorem 1: A naive Bayesian classifier is optimal for x iff
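Following [1]: let p denote the true posterior P(1 | x) and r the estimate of it produced under the independence assumption. The condition of Theorem 1 then reads

\left(p \ge \tfrac{1}{2} \;\wedge\; r \ge \tfrac{1}{2}\right) \;\vee\; \left(p \le \tfrac{1}{2} \;\wedge\; r \le \tfrac{1}{2}\right)

i.e. the estimate only has to fall on the same side of 1/2 as the true posterior.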
24 Naive Bayes - optimality (figure) The region of optimality in the (p, r) plane, which is much larger than the set of points where the independence assumption actually holds.
25 Naive Bayes - optimality This is a criterion for local optimality (per instance). What about global optimality? Theorem 2: The naive Bayesian classifier is globally optimal for a dataset S iff
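Completing the statement along the lines of [1]: the naive Bayesian classifier is globally optimal for a dataset S iff it is (locally) optimal, in the sense of Theorem 1, for every instance x in S.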
26 Naive Bayes - optimality What is the reason for this ? –Difference between classification and probability (distribution) estimation –I.e. for classification the perfect estimation of probabilities is not important as long as for each instance the maximum estimate corresponds to the maximum true probability. Problem with this result: Verification of global optimality (optimality for all instances) ?
27 Naive Bayes - optimality For which problems is naive Bayes optimal? Guess: For problems for which the independence assumption holds. Empirical answer: Not really… Theoretical answer no 1: For all problems for which Theorem 2 holds.
28 Naive Bayes - linearity Another question: how does the naive Bayes hypothesis depend on the input variables? Consider the simple case of binary variables only… It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!! [2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973
29 Naive Bayes - linearity Proof…
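A sketch of the argument for binary x_i ∈ {0,1}: take the log-odds of the naive Bayes posterior,

\log\frac{P(C{=}1 \mid x)}{P(C{=}0 \mid x)}
  = \log\frac{P(C{=}1)}{P(C{=}0)} + \sum_{i=1}^{n}\log\frac{P(x_i \mid C{=}1)}{P(x_i \mid C{=}0)}
  = \sum_{i=1}^{n} w_i x_i + b

with

w_i = \log\frac{P(x_i{=}1 \mid 1)\,P(x_i{=}0 \mid 0)}{P(x_i{=}0 \mid 1)\,P(x_i{=}1 \mid 0)},
\qquad
b = \log\frac{P(C{=}1)}{P(C{=}0)} + \sum_{i=1}^{n}\log\frac{P(x_i{=}0 \mid 1)}{P(x_i{=}0 \mid 0)}

so naive Bayes predicts class 1 exactly when w \cdot x + b > 0, i.e. its hypothesis is a linear threshold function.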
30 Naive Bayes – linearity - examples (figures) naive Bayes vs. perceptron
31 Naive Bayes – linearity - examples
32 Naive Bayes - linearity For boolean domains the naive Bayes hypothesis is a linear hyperplane! => It can only be globally optimal for linearly separable problems!! BUT: It is not optimal for all linearly separable problems! (e.g. not for certain m-out-of-n concepts)
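A small Python check of this equivalence, using the weights from the log-odds sketch above; the probability tables and all names are made up for illustration:

import math
from itertools import product

# Illustrative conditional probabilities P(x_i = 1 | c) for two binary features
p1 = [0.8, 0.3]   # class 1
p0 = [0.2, 0.6]   # class 0
prior1, prior0 = 0.5, 0.5

# Weights and bias of the equivalent linear classifier (log-odds form)
w = [math.log(p1[i] * (1 - p0[i]) / ((1 - p1[i]) * p0[i])) for i in range(2)]
b = math.log(prior1 / prior0) + sum(math.log((1 - p1[i]) / (1 - p0[i])) for i in range(2))

def nb_class(x):
    """Direct naive Bayes decision: argmax over P(c) * prod_i P(x_i | c)."""
    s1 = prior1 * math.prod(p1[i] if xi else 1 - p1[i] for i, xi in enumerate(x))
    s0 = prior0 * math.prod(p0[i] if xi else 1 - p0[i] for i, xi in enumerate(x))
    return 1 if s1 > s0 else 0

def linear_class(x):
    """Equivalent linear threshold decision: sign of w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

for x in product([0, 1], repeat=2):
    assert nb_class(x) == linear_class(x)   # the two decisions agree on every input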
33 Naive Bayes - optimality For which problems is naive Bayes optimal? Guess: For problems for which the independence assumption holds. Empirical answer: Not really… Theoretical answer no 1: For all problems for which Theorem 2 holds. Theoretical answer no 2: For a (large) subset of the set of linearly separable problems.
34 Naive Bayes - optimality (figure) The class of concepts for which naive Bayes is optimal, contained in the class of concepts for which the perceptron is optimal.
35 Overview (Recap of) Definitions Naive Bayes –Performance/Optimality ? –How important is independence ? –Linearity ? Bayesian networks
36 Bayesian networks The problem class for which naive Bayes is optimal is quite small… Idea: Relax the independence assumption to obtain a more general classifier, i.e. model conditional dependencies between variables. Different techniques exist (e.g. hidden variables, …); the most established are Bayesian networks.
37 Bayesian networks Bayesian network: –a tool for representing statistical dependencies between a set of random variables –an acyclic directed graph –one vertex for each variable –for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices –variables (vertices) that are not connected are independent! –each vertex has a table of local (conditional) probability distributions
38 Bayesian networks Each variable depends only on its parents in the network! (figure) Example network over the class y and attributes x_1, …, x_5; the „parents“ of x_4 are denoted Pa_4.
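Formally, the network encodes a factorization of the joint distribution into one local term per vertex:

P(v_1, \dots, v_m) = \prod_{j=1}^{m} P(v_j \mid Pa_j)

where v_1, …, v_m are the variables of the network (here y, x_1, …, x_5) and Pa_j denotes the parents of v_j.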
39 Bayesian networks Bayesian-network-based classifier: (figure) example network over the class y and attributes x_1, …, x_5.
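The classifier again picks the most probable class, now under the factorized joint (a sketch; the exact parent sets depend on the learned structure, and the class y is typically among the parents of each attribute):

c(x) = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid Pa_i)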
40 Bayesian networks In the case of boolean attributes this is again linear, but not in the input variables: it is linear in product features:
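A sketch of what is meant, assuming every attribute and its parents are boolean: expanding the log of the factorized posterior, each local table contributes one term per joint configuration of x_i and Pa_i, and the indicator of such a configuration is a product of input variables (and their complements):

\log\Bigl(P(y)\,\prod_i P(x_i \mid Pa_i)\Bigr)
  = \log P(y) + \sum_i \sum_{a,\,b} \mathbb{1}[x_i{=}a,\; Pa_i{=}b]\,\log P(x_i{=}a \mid Pa_i{=}b)

so the decision function is linear in the product features \mathbb{1}[x_i{=}a, Pa_i{=}b] rather than in the x_i themselves.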
41 Bayesian networks The difficulty here is to estimate the correct network-structure (and probability-parameters) from training data! For general Bayesian networks this problem is NP-hard! There exist numerous heuristics for learning Bayesian networks from data!
42 References [1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996 [2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973