Slide 1: Bayesian Networks
Martin Bachler, martin.bachler@igi.tugraz.at
MLA - VO, 06.12.2005
Slide 3: Overview
"Microsoft's competitive advantage lies in its expertise in Bayesian networks." (Bill Gates, quoted in the LA Times, 1996)
Slide 4: Overview
– (Recap of) definitions
– Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
– Bayesian networks
Slide 5: Definitions
Conditional probability: P(A|B) = P(A ∧ B) / P(B)
Bayes theorem: P(A|B) = P(B|A) · P(A) / P(B)
Slide 6: Definitions
Bayes theorem: P(h|D) = P(D|h) · P(h) / P(D)
where P(D|h) is the likelihood, P(h) is the prior probability, and P(D) is the normalization term.
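A minimal numeric sketch of the rule above in Python (all numbers are made up for illustration):

```python
# Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)
# Hypothetical numbers: prior P(h) and the two likelihoods.
p_h = 0.01           # prior probability of hypothesis h
p_d_given_h = 0.9    # likelihood of the data D under h
p_d_given_not_h = 0.05

# Normalization term P(D) via the law of total probability.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = p_d_given_h * p_h / p_d
print(f"posterior P(h|D) = {p_h_given_d:.3f}")  # ~0.154
```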
Slide 7: Definitions
Classification problem:
– Input space X = X1 × X2 × … × Xn
– Output space Y = {0, 1}
– Target concept C: X → Y
– Hypothesis space H
Bayesian way of classifying an instance x: choose argmax_{c ∈ Y} P(c | x) = argmax_{c ∈ Y} P(x | c) · P(c)
Slide 8: Definitions
This rule is theoretically OPTIMAL! But for large n, the estimation of P(x1, …, xn | C) is very hard!
=> Assumption: pairwise conditional independence between the input variables given C:
P(x1, …, xn | C) = ∏i P(xi | C)
Slide 9: Overview
– (Recap of) definitions
– Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
– Bayesian networks
Slide 10: Naive Bayes
The naive Bayes classifier: cNB = argmax_{c ∈ Y} P(c) · ∏i P(xi | c)
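A compact sketch of this classifier, with all probabilities estimated by simple frequency counts; the tiny dataset at the bottom is made up for illustration:

```python
from collections import defaultdict

def train_naive_bayes(data):
    """data: list of (feature_tuple, label) pairs with discrete features."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (index, value, label) -> count
    for x, c in data:
        class_counts[c] += 1
        for i, v in enumerate(x):
            feature_counts[(i, v, c)] += 1
    prior = {c: class_counts[c] / len(data) for c in class_counts}
    def cond(i, v, c):  # estimate of P(x_i = v | C = c)
        return feature_counts[(i, v, c)] / class_counts[c]
    return prior, cond

def predict(prior, cond, x):
    # argmax over classes c of P(c) * prod_i P(x_i | c)
    scores = dict(prior)
    for c in scores:
        for i, v in enumerate(x):
            scores[c] *= cond(i, v, c)
    return max(scores, key=scores.get)

# Made-up dataset of instances (x1, x2) with class C.
data = [((1, 1), 1), ((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]
prior, cond = train_naive_bayes(data)
print(predict(prior, cond, (1, 1)))  # -> 1
```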
Slide 11: Example
[Table: a small training set of instances (x1, x2) with class C, together with the estimated probabilities P(C) (3/4 and 1/4) and the conditional probabilities P(x1|C) and P(x2|C) (values such as 2/3 and 1/3).]
Slide 12: Naive Bayes - Independence
The independence assumption is very strict! For most practical problems it is blatantly wrong (it is not even fulfilled in the previous example; see later)!
=> Is naive Bayes a rather "academic" algorithm?
Slide 13: Naive Bayes - Independence
For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.)
Guess: for problems for which the independence assumption holds.
Let's check (empirically and theoretically).
Slide 14: Independence - Example
[Table: all combinations of x1, x2, C, listing the joint probability P(x1, x2 | C) next to the product P(x1|C) · P(x2|C); the two columns disagree (values such as 1/3 vs. 1/9, 2/9, 4/9), so the independence assumption does not hold for this dataset.]
Slide 15: Independence - Example
[Figure]
Slide 16: Independence - Example
[Table: all combinations of x1, x2, C, listing the joint probability P(x1, x2 | C) next to the product P(x1|C) · P(x2|C) (values 1/2 and 1/4).]
Slide 17: Independence - Example
[Figure]
Slide 18: Naive Bayes - Independence
[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
Slide 19: Naive Bayes - Independence
[Figure]
Slide 20: Naive Bayes - Independence
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer?
Slide 21: Naive Bayes - Optimality
Example [1]: 3 features x1, x2, x3 with P(C=0) = P(C=1); x1 and x3 are independent; x2 = x1 (totally dependent).
=> The optimal classifier uses x1 and x3 once each; naive Bayes additionally multiplies in x2 = x1, so it counts x1 twice.
[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
Slide 22: Naive Bayes - Optimality
Let p = P(1|x1), q = P(1|x3).
Optimal: classify as 1 iff p · q > (1 − p) · (1 − q), i.e. iff p + q > 1.
Naive Bayes: classify as 1 iff p² · q > (1 − p)² · (1 − q).
[Figure: in the (p, q) square the independence assumption holds only on a thin set, yet the optimal and the naive classifier disagree only in two small regions.]
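A small sketch comparing the two decision rules above on a grid of (p, q) values (equal class priors, as on the slide):

```python
import numpy as np

# p = P(1|x1), q = P(1|x3); x2 = x1 is the duplicated feature.
p, q = np.meshgrid(np.linspace(0.01, 0.99, 199), np.linspace(0.01, 0.99, 199))

optimal = p * q > (1 - p) * (1 - q)        # equivalent to p + q > 1
naive = p**2 * q > (1 - p)**2 * (1 - q)    # x1 enters twice via x2 = x1

disagree = optimal != naive
print(f"fraction of the (p, q) square with disagreement: {disagree.mean():.3f}")
```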
Slide 23: Naive Bayes - Optimality
In general, for an instance x = (x1, …, xn), let
p = P(C=1 | x), r = P(C=1) · ∏i P(xi | C=1), s = P(C=0) · ∏i P(xi | C=0).
Theorem 1 [1]: The naive Bayesian classifier is optimal for x iff (p ≥ 1/2 and r ≥ s) or (p ≤ 1/2 and r ≤ s).
Slide 24: Naive Bayes - Optimality
[Figure: the region of optimality from Theorem 1; the independence assumption holds only in a small part of it.]
Slide 25: Naive Bayes - Optimality
This is a criterion for local optimality (for a single instance). What about global optimality?
Theorem 2 [1]: The naive Bayesian classifier is globally optimal for a dataset S iff it is optimal (in the sense of Theorem 1) for every instance x ∈ S.
Slide 26: Naive Bayes - Optimality
What is the reason for this?
– The difference between classification and probability (distribution) estimation.
– I.e., for classification the perfect estimation of probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability.
Problem with this result: how to verify global optimality (optimality for all instances)?
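A tiny illustration of that point (the numbers are made up): the estimated distribution is badly calibrated, yet the argmax, and hence the classification, is unchanged:

```python
true_posterior = {0: 0.3, 1: 0.7}
naive_estimate = {0: 0.02, 1: 0.98}  # far from the true values, same winner

assert max(true_posterior, key=true_posterior.get) == \
       max(naive_estimate, key=naive_estimate.get)
print("different probabilities, same classification")
```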
Slide 27: Naive Bayes - Optimality
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer no. 1: for all problems for which Theorem 2 holds.
Slide 28: Naive Bayes - Linearity
Another question: how does naive Bayes' hypothesis depend on the input variables?
Consider the simple case of binary variables only. It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!
[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973
Slide 29: Naive Bayes - Linearity
Proof sketch: naive Bayes outputs 1 iff P(1) · ∏i P(xi|1) > P(0) · ∏i P(xi|0). Taking logarithms and using, for binary xi,
log P(xi|c) = xi · log P(xi=1|c) + (1 − xi) · log P(xi=0|c),
the decision function becomes w0 + Σi wi · xi > 0 with
wi = log [ P(xi=1|1) · P(xi=0|0) / (P(xi=1|0) · P(xi=0|1)) ],
w0 = log [ P(1) / P(0) ] + Σi log [ P(xi=0|1) / P(xi=0|0) ],
i.e. a linear hyperplane in (x1, …, xn).
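A sketch checking the derivation numerically: with arbitrary made-up parameters, the hyperplane weights from the proof reproduce the naive Bayes decision on every boolean input:

```python
import itertools, math

n = 4
prior = {0: 0.4, 1: 0.6}
p1 = [0.7, 0.2, 0.9, 0.4]  # P(x_i = 1 | C = 1), arbitrary values
p0 = [0.3, 0.5, 0.6, 0.8]  # P(x_i = 1 | C = 0), arbitrary values

w0 = math.log(prior[1] / prior[0]) + sum(
    math.log((1 - p1[i]) / (1 - p0[i])) for i in range(n))
w = [math.log(p1[i] * (1 - p0[i]) / (p0[i] * (1 - p1[i]))) for i in range(n)]

def naive_bayes(x):
    s1 = prior[1] * math.prod(p1[i] if x[i] else 1 - p1[i] for i in range(n))
    s0 = prior[0] * math.prod(p0[i] if x[i] else 1 - p0[i] for i in range(n))
    return int(s1 > s0)

def linear(x):
    return int(w0 + sum(w[i] * x[i] for i in range(n)) > 0)

assert all(naive_bayes(x) == linear(x)
           for x in itertools.product([0, 1], repeat=n))
print("naive Bayes agrees with the linear classifier on all inputs")
```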
Slide 30: Naive Bayes - Linearity - Examples
[Figure: decision boundaries of naive Bayes vs. a perceptron]
Slide 31: Naive Bayes - Linearity - Examples
[Figure]
Slide 32: Naive Bayes - Linearity
For boolean domains naive Bayes' hypothesis is a linear hyperplane!
=> It can only be globally optimal for linearly separable problems!
BUT: it is not optimal for all linearly separable problems (e.g. not for certain m-out-of-n concepts). A check of this claim is sketched below.
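A sketch that tests this on an m-out-of-n concept (class 1 iff at least m of the n bits are 1, which is linearly separable); m = 3, n = 7 are arbitrary choices, and naive Bayes is given the exact probabilities under the uniform input distribution, so any errors are due to the model, not the estimation:

```python
import itertools, math

m, n = 3, 7
inputs = list(itertools.product([0, 1], repeat=n))
label = {x: int(sum(x) >= m) for x in inputs}

prior = {c: sum(1 for x in inputs if label[x] == c) / len(inputs)
         for c in (0, 1)}

def cond(i, v, c):  # exact P(x_i = v | C = c) under the uniform distribution
    members = [x for x in inputs if label[x] == c]
    return sum(1 for x in members if x[i] == v) / len(members)

def naive_bayes(x):
    score = {c: prior[c] * math.prod(cond(i, x[i], c) for i in range(n))
             for c in (0, 1)}
    return max(score, key=score.get)

errors = sum(1 for x in inputs if naive_bayes(x) != label[x])
print(f"{errors} of {len(inputs)} inputs misclassified")
```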
Slide 33: Naive Bayes - Optimality
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer no. 1: for all problems for which Theorem 2 holds.
Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.
Slide 34: Naive Bayes - Optimality
[Figure: the class of concepts for which naive Bayes is optimal, shown inside the class of concepts for which the perceptron is optimal.]
Slide 35: Overview
– (Recap of) definitions
– Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
– Bayesian networks
Slide 36: Bayesian networks
The problem class for which naive Bayes is optimal is quite small.
Idea: relax the independence assumption to obtain a more general classifier, i.e. model conditional dependencies between variables.
There are different techniques (e.g. hidden variables, …); the most established one: Bayesian networks.
Slide 37: Bayesian networks
A Bayesian network is a tool for representing statistical dependencies between a set of random variables:
– an acyclic directed graph
– one vertex for each variable
– for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices
– variables (vertices) that are not connected are independent
– each vertex has a table of local probability distributions
Slide 38: Bayesian networks
Each variable depends only on its parents in the network!
[Figure: an example network with class variable y and features x1, …, x5; the nodes with edges into x4 are the "parents" of x4 (Pa4).]
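A minimal sketch of such a network over binary variables; the structure (y → x1, y → x2, x1 → x2) and all numbers are made up. The joint probability is the product of the local tables:

```python
# Each entry of cpt stores P(node = 1 | parent values).
parents = {"y": (), "x1": ("y",), "x2": ("y", "x1")}
cpt = {
    "y":  {(): 0.6},
    "x1": {(0,): 0.2, (1,): 0.7},
    "x2": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}

def joint(assignment):
    """P(assignment) = prod over nodes of P(node | its parents)."""
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[v] for v in pa)
        p1 = cpt[node][pa_vals]
        p *= p1 if assignment[node] else 1 - p1
    return p

print(joint({"y": 1, "x1": 1, "x2": 0}))  # 0.6 * 0.7 * (1 - 0.9) = 0.042
```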
Slide 39: Bayesian networks
Bayesian-network-based classifier: classify x as argmax_y P(y) · ∏i P(xi | Pai)
[Figure: the same kind of network, with y as the class variable.]
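Using the joint from the network sketch above, the classifier picks the value of y that maximizes the joint probability of y together with the observed features (equivalently, the posterior P(y | x)):

```python
def classify(features):
    scores = {y: joint({"y": y, **features}) for y in (0, 1)}
    return max(scores, key=scores.get)

print(classify({"x1": 1, "x2": 0}))  # -> 1 here (0.042 vs. 0.040)
```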
Slide 40: Bayesian networks
In the case of boolean attributes this is again linear, but not in the input variables: it is linear in product features, i.e. in indicator products built from each variable together with its parents.
Slide 41: Bayesian networks
The difficulty here is to estimate the correct network structure (and the probability parameters) from training data!
For general Bayesian networks this problem is NP-hard!
There exist numerous heuristics for learning Bayesian networks from data, e.g. the greedy search sketched below.
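A rough sketch of one such heuristic: greedy hill-climbing over structures with a BIC score, for fully observed binary data. Everything here (the scoring details, the search moves, the data) is an illustrative assumption, not a reference implementation:

```python
import itertools, math

def bic_score(data, parents):
    """Maximum-likelihood log-likelihood of the data minus a BIC penalty."""
    n = len(data)
    score = 0.0
    for node, pa in parents.items():
        for pa_vals in itertools.product([0, 1], repeat=len(pa)):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(pa, pa_vals))]
            ones = sum(r[node] for r in rows)
            for k in (ones, len(rows) - ones):
                if k:
                    score += k * math.log(k / len(rows))
        score -= 0.5 * math.log(n) * 2 ** len(pa)  # one parameter per parent configuration
    return score

def creates_cycle(parents, child, parent):
    # Adding parent -> child closes a cycle iff child is an ancestor of parent.
    stack, seen = [parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_structure(data, nodes):
    parents = {v: () for v in nodes}
    best, improved = bic_score(data, parents), True
    while improved:
        improved = False
        for child, parent in itertools.permutations(nodes, 2):
            if parent in parents[child] or creates_cycle(parents, child, parent):
                continue
            candidate = dict(parents)
            candidate[child] = parents[child] + (parent,)
            s = bic_score(data, candidate)
            if s > best:
                parents, best, improved = candidate, s, True
    return parents

# Made-up data: a and b always agree, c is independent of both.
data = [{"a": 1, "b": 1, "c": 1}, {"a": 0, "b": 0, "c": 1},
        {"a": 1, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 0}] * 5
print(greedy_structure(data, ["a", "b", "c"]))  # recovers an a-b edge
```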
Slide 42: References
[1] P. Domingos, M. Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. 1996.
[2] R. O. Duda, P. E. Hart: Pattern Classification and Scene Analysis. Wiley, 1973.