Bayesian Networks
Hankz Hankui Zhuo
http://plan.sysu.edu.cn

Axioms of Probability Theory
All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1.
A true proposition has probability 1 and a false proposition has probability 0: P(true) = 1, P(false) = 0.
The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).

Conditional Probability
P(A | B) is the probability of A given B.
It assumes that B is the only information known.
Defined by: P(A | B) = P(A ∧ B) / P(B).

Independence
A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
These two constraints are logically equivalent.
Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  P(A ∧ B) = P(A) P(B)

Joint Distribution
The joint probability distribution for a set of random variables X1,…,Xn gives P(X1,…,Xn) for every combination of values.
The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.
Therefore, all conditional probabilities can also be calculated.

  positive:        circle   square
          red      0.20     0.02
          blue     0.02     0.01

  negative:        circle   square
          red      0.05     0.30
          blue     0.20     0.20
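To make the "any query from the joint" point concrete, here is a minimal Python sketch over the table above (the dictionary layout and helper names are my own, not from the slides):

```python
# Joint distribution over (Color, Shape, Class), values from the example table above.
joint = {
    ("red",  "circle", "positive"): 0.20, ("red",  "square", "positive"): 0.02,
    ("blue", "circle", "positive"): 0.02, ("blue", "square", "positive"): 0.01,
    ("red",  "circle", "negative"): 0.05, ("red",  "square", "negative"): 0.30,
    ("blue", "circle", "negative"): 0.20, ("blue", "square", "negative"): 0.20,
}

def prob(event):
    """Marginal probability of a partial assignment, e.g. {'color': 'red'}."""
    total = 0.0
    for (color, shape, cls), p in joint.items():
        values = {"color": color, "shape": shape, "class": cls}
        if all(values[k] == v for k, v in event.items()):
            total += p
    return total

def conditional(query, evidence):
    """P(query | evidence), computed by summing out the joint: P(q, e) / P(e)."""
    return prob({**query, **evidence}) / prob(evidence)

print(prob({"color": "red"}))                                 # 0.57
print(conditional({"class": "positive"},
                  {"color": "red", "shape": "circle"}))       # 0.20 / 0.25 = 0.8
```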

Probabilistic Classification
Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
Let X be the random variable describing an instance consisting of a vector of values for n features <X1, X2, …, Xn>, and let xk be a possible value for X.
For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 − P(Y=pos | X=xk).

Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
  P(H | E) = P(H ∧ E) / P(E)      (def. cond. prob.)
  P(E | H) = P(H ∧ E) / P(H)      (def. cond. prob.)
Therefore P(H ∧ E) = P(E | H) P(H), and substituting gives:
  P(H | E) = P(E | H) P(H) / P(E)      QED

Bayesian Categorization
Determine the category of xk by determining, for each yi:
  P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
P(X=xk) can be determined since the categories are complete and disjoint:
  P(X=xk) = Σi P(Y=yi) P(X=xk | Y=yi)

Bayesian Categorization (cont.)
Need to know:
  Priors: P(Y=yi)
  Conditionals: P(X=xk | Y=yi)
P(Y=yi) are easily estimated from data: if ni of the examples in D are in yi, then P(Y=yi) = ni / |D|.
There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).
We still need to make some sort of independence assumption about the features to make learning tractable.

Naïve Bayesian Categorization
If we assume the features of an instance are independent given the category (conditionally independent):
  P(X | Y) = P(X1, X2, …, Xn | Y) = Πi P(Xi | Y)
Therefore, we only need to know P(Xi | Y) for each possible pair of a feature value and a category.
If Y and all Xi are binary, this requires specifying only 2n parameters:
  P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
  P(Xi=false | Y) = 1 − P(Xi=true | Y)
Compared to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Example

Test Instance: <medium, red, circle>

  Probability        positive   negative
  P(Y)               0.5        0.5
  P(small | Y)       0.4        0.4
  P(medium | Y)      0.1        0.2
  P(large | Y)       0.5        0.4
  P(red | Y)         0.9        0.3
  P(blue | Y)        0.05       0.3
  P(green | Y)       0.05       0.4
  P(square | Y)      0.05       0.4
  P(triangle | Y)    0.05       0.3
  P(circle | Y)      0.9        0.3

Naïve Bayes Example

Test Instance: <medium, red, circle>

  Probability      positive   negative
  P(Y)             0.5        0.5
  P(medium | Y)    0.1        0.2
  P(red | Y)       0.9        0.3
  P(circle | Y)    0.9        0.3

P(positive | X) = P(positive) P(medium | positive) P(red | positive) P(circle | positive) / P(X)
                = 0.5 * 0.1 * 0.9 * 0.9 / P(X) = 0.0405 / P(X) = 0.0405 / 0.0495 = 0.8181

P(negative | X) = P(negative) P(medium | negative) P(red | negative) P(circle | negative) / P(X)
                = 0.5 * 0.2 * 0.3 * 0.3 / P(X) = 0.009 / P(X) = 0.009 / 0.0495 = 0.1818

Since P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1,
P(X) = 0.0405 + 0.009 = 0.0495.
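A short Python sketch that reproduces this calculation, using only the numbers in the table above (the variable names are illustrative):

```python
# Per-class parameters for the test instance <medium, red, circle>, from the table above.
params = {
    "positive": {"prior": 0.5, "medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"prior": 0.5, "medium": 0.2, "red": 0.3, "circle": 0.3},
}

# Unnormalized posterior: P(Y=y) * product of P(x_i | Y=y) over the instance's features.
scores = {}
for y, p in params.items():
    scores[y] = p["prior"] * p["medium"] * p["red"] * p["circle"]

# Normalize by P(X) = sum over classes (the categories are complete and disjoint).
p_x = sum(scores.values())
posteriors = {y: s / p_x for y, s in scores.items()}

print(posteriors)   # {'positive': 0.8181..., 'negative': 0.1818...}
```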

Estimating Probabilities
Normally, probabilities are estimated based on observed frequencies in the training data.
If D contains nk examples in category yk, and nijk of these nk examples have the jth value for feature Xi, xij, then:
  P(Xi=xij | Y=yk) = nijk / nk
However, estimating such probabilities from small training sets is error-prone.
If, due only to chance, a rare feature Xi is always false in the training data, then for all yk: P(Xi=true | Y=yk) = 0.
If Xi=true then occurs in a test example X, the result is that for all yk: P(X | Y=yk) = 0, and hence for all yk: P(Y=yk | X) = 0.

Probability Estimation Example

Training examples:
  Ex   Size    Color   Shape      Category
  1    small   red     circle     positive
  2    large   red     circle     positive
  3    small   red     triangle   negative
  4    large   blue    circle     negative

Estimated probabilities:
  Probability        positive   negative
  P(Y)               0.5        0.5
  P(small | Y)       0.5        0.5
  P(medium | Y)      0.0        0.0
  P(large | Y)       0.5        0.5
  P(red | Y)         1.0        0.5
  P(blue | Y)        0.0        0.5
  P(green | Y)       0.0        0.0
  P(square | Y)      0.0        0.0
  P(triangle | Y)    0.0        0.5
  P(circle | Y)      1.0        0.5

Test Instance X: <medium, red, circle>
P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0
P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0
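A minimal Python sketch of this relative-frequency estimation and the resulting zero-probability problem, assuming the four training examples as laid out above (the helper names are my own):

```python
from collections import Counter, defaultdict

# Training data as laid out above: (size, color, shape) -> category.
data = [
    (("small", "red",  "circle"),   "positive"),
    (("large", "red",  "circle"),   "positive"),
    (("small", "red",  "triangle"), "negative"),
    (("large", "blue", "circle"),   "negative"),
]
features = ["size", "color", "shape"]

class_counts = Counter(y for _, y in data)
value_counts = defaultdict(Counter)            # (feature, class) -> Counter of values
for x, y in data:
    for f, v in zip(features, x):
        value_counts[(f, y)][v] += 1

def p(feature, value, y):
    """Relative-frequency estimate P(feature=value | Y=y)."""
    return value_counts[(feature, y)][value] / class_counts[y]

# Test instance <medium, red, circle>: 'medium' never occurs, so both products collapse to 0.
x = ("medium", "red", "circle")
for y in class_counts:
    score = class_counts[y] / len(data)
    for f, v in zip(features, x):
        score *= p(f, v, y)
    print(y, score)   # positive 0.0, negative 0.0
```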

Smoothing
To account for estimation from small samples, probability estimates are adjusted or smoothed.
Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m.
For binary features, p is simply assumed to be 0.5.
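For concreteness, the m-estimate can be written as follows; this is the standard smoothed-estimate formula, stated in the nijk / nk notation introduced two slides earlier (the next slide instantiates it with m = 1 and p = 1/3):

```latex
% m-estimate of the conditional probability, with prior p and virtual sample size m:
P(X_i = x_{ij} \mid Y = y_k) \;=\; \frac{n_{ijk} + m\,p}{n_k + m}
```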

Laplace Smoothing Example
Assume the training set contains 10 positive examples:
  4 small, 0 medium, 6 large
Estimate the parameters as follows (with m = 1, p = 1/3):
  P(small | positive)  = (4 + 1/3) / (10 + 1) = 0.394
  P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
  P(large | positive)  = (6 + 1/3) / (10 + 1) = 0.576
  P(small or medium or large | positive) = 1.0
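A quick check of these numbers in Python (purely illustrative):

```python
counts, m, p = {"small": 4, "medium": 0, "large": 6}, 1, 1/3
n = sum(counts.values())
smoothed = {v: (c + m * p) / (n + m) for v, c in counts.items()}
print(smoothed)                # {'small': 0.3939..., 'medium': 0.0303..., 'large': 0.5757...}
print(sum(smoothed.values()))  # ≈ 1.0 (the three estimates still sum to one)
```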

Continuous Attributes
If Xi is a continuous feature rather than a discrete one, we need another way to calculate P(Xi | Y).
Assume that Xi has a Gaussian distribution whose mean and variance depend on Y.
During training, for each combination of a continuous feature Xi and a class value yk of Y, estimate a mean μik and a standard deviation σik from the values of feature Xi in class yk in the training data.
During testing, estimate P(Xi | Y=yk) for a given example using the Gaussian distribution defined by μik and σik.
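A minimal sketch of the Gaussian estimate in Python, assuming the usual normal density; the sample values are hypothetical, made up only to show the training and testing steps:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as the estimate of P(Xi = x | Y = yk)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Training: estimate mu_ik and sigma_ik from the feature values observed in class yk.
values_in_class = [4.9, 5.1, 5.4, 5.0, 5.2]          # hypothetical feature values for one class
mu = sum(values_in_class) / len(values_in_class)
sigma = math.sqrt(sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class))

# Testing: plug the test example's feature value into the fitted Gaussian.
print(gaussian_pdf(5.05, mu, sigma))
```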

Bayesian Networks

Bayesian Networks
Directed Acyclic Graph (DAG):
  Nodes are random variables.
  Edges indicate causal influences.
Example: the burglary network with nodes Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, where Burglary and Earthquake are parents of Alarm, and Alarm is the parent of both JohnCalls and MaryCalls.

Conditional Probability Tables
Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (the conditioning case).
Roots (sources) of the DAG, which have no parents, are given prior probabilities.

  Burglary:    P(B) = .001        Earthquake:   P(E) = .002

  Alarm:       B  E  P(A)
               T  T  .95
               T  F  .94
               F  T  .29
               F  F  .001

  JohnCalls:   A  P(J)            MaryCalls:    A  P(M)
               T  .90                           T  .70
               F  .05                           F  .01

CPT Comments
The probability of false is not given, since each row must sum to 1.
The example requires 10 parameters rather than the 2^5 − 1 = 31 needed to specify the full joint distribution.
The number of parameters in the CPT for a node is exponential in the number of parents (fan-in).

Joint Distributions for Bayes Nets
A Bayesian network implicitly defines a joint distribution:
  P(x1, x2, …, xn) = Πi P(xi | Parents(Xi))
Example:
  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
                         = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
Therefore, an inefficient approach to inference is:
1) Compute the joint distribution using this equation.
2) Compute any desired conditional probability using the joint distribution.
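A small Python sketch that encodes the CPTs given two slides back and evaluates this product (the dictionary layout and names are my own):

```python
# CPTs of the burglary network above (probabilities of True; P(false) = 1 - P(true)).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of P(node | parents)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
    return p

# The worked example above: P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
print(joint(b=False, e=False, a=True, j=True, m=True))   # ≈ 0.00063
```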

Naïve Bayes as a Bayes Net
Naïve Bayes is a simple Bayes net in which the class Y is the sole parent of each feature X1, X2, …, Xn.
The priors P(Y) and conditionals P(Xi | Y) for Naïve Bayes provide the CPTs for the network.

Bayes Net Inference
Given known values for some evidence variables, determine the posterior probability of some query variables.
Example: Given that John calls, what is the probability that there is a burglary, P(Burglary | JohnCalls)?
John calls 90% of the time there is an alarm, and the alarm detects 94% of burglaries, so people generally think this probability should be fairly high.
However, this ignores the prior probability of John calling.
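Building on the joint() helper and CPT values from the sketch two slides back, here is a minimal inference-by-enumeration sketch for this query; the answer, roughly 0.016, is indeed far lower than the intuitive guess:

```python
from itertools import product

# P(Burglary=true | JohnCalls=true) by enumeration: sum the full joint
# over the hidden variables Earthquake, Alarm, and MaryCalls.
p_j = 0.0          # P(J=true)
p_b_and_j = 0.0    # P(B=true, J=true)
for b, e, a, m in product([True, False], repeat=4):
    p = joint(b, e, a, j=True, m=m)   # joint() as defined in the previous sketch
    p_j += p
    if b:
        p_b_and_j += p

print(p_b_and_j / p_j)   # ≈ 0.016
```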

The End!