
Slide 1: Bayesian Networks. Martin Bachler (martin.bachler@igi.tugraz.at), MLA - VO, 06.12.2005

Slide 3: Overview. "Microsoft's competitive advantage lies in its expertise in Bayesian networks" (Bill Gates, quoted in the LA Times, 1996)

Slide 4: Overview
- (Recap of) Definitions
- Naive Bayes: Performance/Optimality? How important is independence? Linearity?
- Bayesian networks

Slide 5: Definitions. Conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)}$. Bayes' theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.

Slide 6: Definitions. Bayes' theorem, term by term: $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$, where $P(D \mid h)$ is the likelihood, $P(h)$ the prior probability, and $P(D)$ the normalization term.
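A quick numeric illustration of the theorem (all numbers made up for the example):

```python
# Worked Bayes'-theorem example with made-up numbers: a diagnostic test with
# P(pos | disease) = 0.99, P(pos | no disease) = 0.05, prevalence P(disease) = 0.01.
p_d, p_pos_d, p_pos_not_d = 0.01, 0.99, 0.05
p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)    # normalization term P(D)
posterior = p_pos_d * p_d / p_pos                  # likelihood * prior / P(D)
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.167
```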

Slide 7: Definitions. Classification problem: input space $X = X_1 \times X_2 \times \dots \times X_n$; output space $Y = \{0, 1\}$; target concept $C: X \to Y$; hypothesis space $H$. Bayesian way of classifying an instance $x$: predict $\arg\max_{y \in Y} P(y \mid x)$.

Slide 8: Definitions. This is theoretically OPTIMAL! But for large $n$ the estimation of $P(x_1, \dots, x_n \mid C)$ is very hard! Hence the assumption of pairwise conditional independence between the input variables given $C$: $P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$.
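As a minimal sketch (not from the original slides; the probability tables below are hypothetical placeholders), the resulting factorized classifier fits in a few lines of Python:

```python
import math

# Hypothetical toy tables for a two-class problem with two binary features.
prior = {0: 0.25, 1: 0.75}                        # P(C)
cond = [                                          # cond[i][c][v] = P(x_i = v | C = c)
    {0: {0: 2/3, 1: 1/3}, 1: {0: 1/3, 1: 2/3}},   # feature x_1
    {0: {0: 2/3, 1: 1/3}, 1: {0: 1/3, 1: 2/3}},   # feature x_2
]

def naive_bayes_predict(x):
    """Return argmax_c P(C=c) * prod_i P(x_i | C=c), computed in log space."""
    def log_score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[i][c][v]) for i, v in enumerate(x))
    return max(prior, key=log_score)

print(naive_bayes_predict((1, 0)))  # -> class with the larger posterior
```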

Slide 9: Overview
- (Recap of) Definitions
- Naive Bayes: Performance/Optimality? How important is independence? Linearity?
- Bayesian networks

Slide 10: Naive Bayes. [Formula slide; the classifier is $\hat{c}(x) = \arg\max_{y \in Y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$.]

Slide 11: Example. [Table residue, layout not recoverable: estimated values of P(C), P(x₁|C), and P(x₂|C) for a small example over two binary features (values such as 3/4, 1/4, 2/3, 1/3 appear).]

Slide 12: Naive Bayes - Independence. The independence assumption is very strict! For most practical problems it is blatantly wrong (not even fulfilled in the previous example; see later). So is naive Bayes a rather "academic" algorithm?

Slide 13: Naive Bayes - Independence. For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.) Guess: for problems for which the independence assumption holds. Let's check, empirically and theoretically.

Slide 14: Independence - Example. [Table residue: for every combination of x₁, x₂, and C it compared the joint P(x₁,x₂|C) with the product P(x₁|C)·P(x₂|C); the values differ (e.g. 1/3 vs. 1/9), so the independence assumption does not hold in this example.]

Slide 15: Independence - Example. [Figure only; no recoverable text.]

Slide 16: Independence - Example. [Table residue: a second example in which P(x₁,x₂|C) = P(x₁|C)·P(x₂|C) = 1/4 for every combination, i.e. the independence assumption does hold.]

Slide 17: Independence - Example. [Figure only; no recoverable text.]

Slide 18: Naive Bayes - Independence. [Results figure from [1].] [1] Domingos, P., Pazzani, M.: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. 1996.

Slide 19: Naive Bayes - Independence. [Figure only; no recoverable text.]

Slide 20: Naive Bayes - Independence. For which problems is naive Bayes optimal? Guess: for problems for which the independence assumption holds. Empirical answer: not really. Theoretical answer?

Slide 21: Naive Bayes - Optimality. Example from [1]: three features x₁, x₂, x₃; P(C=0) = P(C=1); x₁ and x₃ are independent given the class; x₂ = x₁ (totally dependent). The optimal classifier therefore uses x₁ and x₃ only: $\arg\max_c P(c)\,P(x_1 \mid c)\,P(x_3 \mid c)$. Naive Bayes counts the duplicated feature twice: $\arg\max_c P(c)\,P(x_1 \mid c)^2\,P(x_3 \mid c)$.

Slide 22: Naive Bayes - Optimality. Let p = P(C=1|x₁) and q = P(C=1|x₃). With equal priors, the optimal classifier predicts 1 iff $pq > (1-p)(1-q)$, i.e. iff $p + q > 1$; naive Bayes predicts 1 iff $p^2 q > (1-p)^2 (1-q)$. [Figure: both decision boundaries in the (p,q) plane, with annotations marking where the independence assumption holds and the small regions where the optimal and naive classifiers disagree.]
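A quick way to see how small the disagreement region is: scan the (p, q) square and compare the two rules (a sketch based on the decision rules reconstructed above):

```python
import numpy as np

# Compare the two decision rules over a grid on the (p, q) unit square:
# optimal predicts 1 iff p*q > (1-p)*(1-q); naive Bayes predicts 1 iff
# p^2 * q > (1-p)^2 * (1-q), because x_1 is effectively counted twice.
p, q = np.meshgrid(np.linspace(0.01, 0.99, 99), np.linspace(0.01, 0.99, 99))
optimal = p * q > (1 - p) * (1 - q)
naive = p**2 * q > (1 - p)**2 * (1 - q)
disagree = optimal != naive
print(f"classifiers disagree on {disagree.mean():.1%} of the grid")
```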

Slide 23: Naive Bayes - Optimality. In general: instance x = (x₁, …, xₙ). Let p = P(C=1|x) be the true posterior and r the corresponding estimate produced by naive Bayes. Theorem 1 [1]: the naive Bayesian classifier is optimal for x iff (p ≥ 1/2 and r ≥ 1/2) or (p ≤ 1/2 and r ≤ 1/2).
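The condition is easy to state as a predicate (a direct transcription of the theorem as reconstructed above):

```python
def nb_locally_optimal(p: float, r: float) -> bool:
    """Theorem 1: naive Bayes' decision on an instance is Bayes-optimal iff
    its estimate r lies on the same side of 1/2 as the true posterior p."""
    return (p >= 0.5 and r >= 0.5) or (p <= 0.5 and r <= 0.5)

print(nb_locally_optimal(0.6, 0.9))  # True: both favour class 1
print(nb_locally_optimal(0.6, 0.4))  # False: the decisions disagree
```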

Slide 24: Naive Bayes - Optimality. [Figure: the (p, r) plane; the region of optimality covers the two quadrants where p and r lie on the same side of 1/2, while the independence assumption holds only on a much smaller subset.]

Slide 25: Naive Bayes - Optimality. This is a criterion for local optimality (per instance). What about global optimality? Theorem 2 [1]: the naive Bayesian classifier is globally optimal for a dataset S iff it is optimal in the sense of Theorem 1 for every instance x in S.

Slide 26: Naive Bayes - Optimality. What is the reason for this? The difference between classification and probability (distribution) estimation: for classification, perfect estimation of the probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability. Problem with this result: how does one verify global optimality (optimality for all instances)?
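A toy illustration of this point (numbers made up):

```python
# Badly calibrated probability estimates can still yield the Bayes-optimal
# decision, because only the argmax over classes matters for classification.
true_posterior = {0: 0.4, 1: 0.6}
estimate = {0: 0.1, 1: 0.9}  # far from the truth, but the same argmax
same_decision = (max(true_posterior, key=true_posterior.get)
                 == max(estimate, key=estimate.get))
print(same_decision)  # True
```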

Slide 27: Naive Bayes - Optimality. For which problems is naive Bayes optimal? Guess: for problems for which the independence assumption holds. Empirical answer: not really. Theoretical answer no. 1: for all problems for which Theorem 2 holds.

Slide 28: Naive Bayes - Linearity. A different question: how does naive Bayes' hypothesis depend on the input variables? Consider the simple case of binary variables only. It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables! [2] Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, 1973.

Slide 29: Naive Bayes - Linearity. Proof.
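The proof itself did not survive in the transcript; a standard reconstruction of the argument (following [2]) goes like this:

```latex
% Log-odds of naive Bayes over binary features x_i \in \{0,1\},
% with p_i = P(x_i = 1 \mid C = 1) and q_i = P(x_i = 1 \mid C = 0):
\begin{align*}
\log \frac{P(C{=}1 \mid \mathbf{x})}{P(C{=}0 \mid \mathbf{x})}
  &= \log \frac{P(C{=}1)}{P(C{=}0)}
   + \sum_{i=1}^{n} \log \frac{P(x_i \mid C{=}1)}{P(x_i \mid C{=}0)} \\
  &= \log \frac{P(C{=}1)}{P(C{=}0)}
   + \sum_{i=1}^{n} \log \frac{p_i^{x_i} (1-p_i)^{1-x_i}}{q_i^{x_i} (1-q_i)^{1-x_i}} \\
  &= \underbrace{\log \frac{P(C{=}1)}{P(C{=}0)}
   + \sum_{i=1}^{n} \log \frac{1-p_i}{1-q_i}}_{w_0}
   + \sum_{i=1}^{n} \underbrace{\log \frac{p_i\,(1-q_i)}{q_i\,(1-p_i)}}_{w_i} x_i
\end{align*}
% Naive Bayes therefore classifies by the sign of w_0 + \sum_i w_i x_i:
% a linear decision rule over the inputs, just like a perceptron.
```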

Slide 30: Naive Bayes - Linearity - Examples. [Figures: decision boundaries produced by naive Bayes and by a perceptron on example data.]

Slide 31: Naive Bayes - Linearity - Examples. [Figure only; no recoverable text.]

Slide 32: Naive Bayes - Linearity. For boolean domains naive Bayes' hypothesis is a linear hyperplane! So it can only be globally optimal for linearly separable problems! BUT: it is not optimal for all linearly separable problems, e.g. not for certain m-out-of-n concepts (a small empirical check follows below).
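A sketch of such a check (the choice m = 3, n = 7 is illustrative): enumerate all instances of an m-of-n concept, give naive Bayes the *exact* probabilities of that distribution, and count disagreements.

```python
from itertools import product

# "At least m of the n bits are 1" is linearly separable, yet naive Bayes
# with exact probabilities (uniform over all 2^n instances) still errs.
m, n = 3, 7
X = list(product([0, 1], repeat=n))
y = [int(sum(x) >= m) for x in X]

p_c1 = sum(y) / len(X)                       # P(C=1)
# P(x_i = 1 | C = c); identical for every i by symmetry, so use bit 0.
p_x1 = {c: sum(x[0] for x, t in zip(X, y) if t == c) / y.count(c)
        for c in (0, 1)}

def nb_predict(x):
    score = {}
    for c in (0, 1):
        s = p_c1 if c == 1 else 1 - p_c1
        for bit in x:
            s *= p_x1[c] if bit == 1 else 1 - p_x1[c]
        score[c] = s
    return max(score, key=score.get)

errors = sum(nb_predict(x) != t for x, t in zip(X, y))
print(f"{errors} of {len(X)} instances misclassified")  # 21 of 128 here
```

All instances with exactly two 1-bits get misclassified: the duplicated evidence in the product tips the score past the decision threshold even though the concept is linearly separable.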

Slide 33: Naive Bayes - Optimality. For which problems is naive Bayes optimal? Guess: for problems for which the independence assumption holds. Empirical answer: not really. Theoretical answer no. 1: for all problems for which Theorem 2 holds. Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.

Slide 34: Naive Bayes - Optimality. [Venn diagram: the class of concepts for which naive Bayes is optimal, contained within the class of concepts for which the perceptron is optimal.]

Slide 35: Overview
- (Recap of) Definitions
- Naive Bayes: Performance/Optimality? How important is independence? Linearity?
- Bayesian networks

Slide 36: Bayesian networks. The problem class for which naive Bayes is optimal is quite small. Idea: relax the independence assumption to obtain a more general classifier, i.e. model conditional dependencies between variables. There are different techniques for this (e.g. hidden variables, …); the most established one is Bayesian networks.

Slide 37: Bayesian networks. A Bayesian network is a tool for representing statistical dependencies between a set of random variables:
- an acyclic directed graph
- one vertex for each variable
- for each pair of statistically dependent variables there is an edge between the corresponding vertices
- variables whose vertices are not connected are independent
- each vertex has a table of local probability distributions

Slide 38: Bayesian networks. Each variable depends only on its parents in the network! [Figure: example graph with class vertex y and feature vertices x₁ … x₅; the vertices with edges into x₄ are its "parents" Pa₄.]

Slide 39: Bayesian networks. A Bayesian-network-based classifier: predict $\arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}_i)$, where the parent set Paᵢ of each feature includes the class y. [Figure: the same example graph.]
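A minimal sketch of such a classifier (the structure y → x₁, y → x₂ plus x₁ → x₂, and all table entries, are hypothetical):

```python
import math

# Sketch of a Bayesian-network classifier with a made-up structure:
#   P(y, x1, x2) = P(y) * P(x1 | y) * P(x2 | x1, y)
p_y = {0: 0.5, 1: 0.5}
p_x1_given_y = {0: 0.3, 1: 0.8}                 # P(x1=1 | y)
p_x2_given_x1y = {(0, 0): 0.2, (0, 1): 0.6,     # P(x2=1 | x1, y)
                  (1, 0): 0.5, (1, 1): 0.9}

def bernoulli(p_one, value):
    """P(X = value) for a binary variable with P(X=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def bn_predict(x1, x2):
    """Return argmax_y P(y) * P(x1|y) * P(x2|x1,y), in log space."""
    def log_score(y):
        return (math.log(p_y[y])
                + math.log(bernoulli(p_x1_given_y[y], x1))
                + math.log(bernoulli(p_x2_given_x1y[(x1, y)], x2)))
    return max(p_y, key=log_score)

print(bn_predict(1, 1))  # -> 1 with these tables
```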

Slide 40: Bayesian networks. In the case of boolean attributes this is again linear, but not in the input variables: it is linear in product features, i.e. in products of each variable with its parents, such as $x_i \prod_{j \in \mathrm{Pa}_i} x_j$.

Slide 41: Bayesian networks. The difficulty here is to estimate the correct network structure (and the probability parameters) from training data! For general Bayesian networks this problem is NP-hard! Numerous heuristics exist for learning Bayesian networks from data, e.g. greedy hill-climbing over edge additions under a model-selection score (a sketch follows below).
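A sketch of one such heuristic, assuming binary variables and a BIC-style score; the scoring details and the search moves (only edge additions) are simplified for illustration:

```python
import math

def bic_score(data, structure):
    """BIC-style score: log-likelihood of each variable given its parents,
    minus a complexity penalty. structure[i] = tuple of parent indices."""
    n = len(data)
    score = 0.0
    for i, parents in enumerate(structure):
        counts = {}  # parent configuration -> [#(x_i=0), #(x_i=1)]
        for row in data:
            cell = counts.setdefault(tuple(row[p] for p in parents), [0, 0])
            cell[row[i]] += 1
        for c0, c1 in counts.values():
            for c in (c0, c1):
                if c:
                    score += c * math.log(c / (c0 + c1))
        score -= 0.5 * math.log(n) * len(counts)  # one free parameter per config
    return score

def has_cycle(structure):
    """Depth-first search over parent links to reject cyclic candidates."""
    state = {}  # 1 = on current path, 2 = done
    def visit(v):
        if state.get(v) == 1:
            return True
        if state.get(v) == 2:
            return False
        state[v] = 1
        if any(visit(p) for p in structure[v]):
            return True
        state[v] = 2
        return False
    return any(visit(v) for v in range(len(structure)))

def greedy_structure(data, n_vars):
    """Hill-climb: repeatedly add any single edge that improves the score."""
    structure = [() for _ in range(n_vars)]
    best = bic_score(data, structure)
    improved = True
    while improved:
        improved = False
        for i in range(n_vars):
            for j in range(n_vars):
                if i == j or j in structure[i]:
                    continue
                cand = list(structure)
                cand[i] = structure[i] + (j,)
                if has_cycle(cand):
                    continue
                s = bic_score(data, cand)
                if s > best:
                    best, structure, improved = s, cand, True
    return structure

# Tiny made-up dataset: x0 and x1 always agree, x2 is noise.
data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
print(greedy_structure(data, 3))  # learns a dependency between x0 and x1
```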

Slide 42: References
[1] Domingos, P., Pazzani, M.: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. 1996.
[2] Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, 1973.

