Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesianism vs. Frequentism Classical probability: Frequentists –Probability of a particular event is defined relative to its frequency in a sample space of events. –E.g., probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.

Bayesian probability: –Combine a measure of “prior” belief you have in a proposition with your subsequent observations of events. Example: a Bayesian can assign a probability to the statement “The first message ever written was not spam,” but a frequentist cannot.

Bayesian Knowledge Representation and Reasoning Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)

Bayesian terminology (example -- spam recognition) –Random variable X: returns one of a set of values {x_1, x_2, ..., x_m}, or a continuous value in an interval [a, b], with probability distribution D(X). –Data D: {v_1, v_2, v_3, ...}, the set of observed values of random variables X_1, X_2, X_3, ...

–Hypothesis h: Function taking instance j and returning classification of j (e.g., “spam” or “not spam”). –Space of hypotheses H: Set of all possible hypotheses

–Prior probability of h, P(h): probability that hypothesis h is true given our prior knowledge. If there is no prior knowledge, all h ∈ H are equally probable. –Posterior probability of h, P(h|D): probability that hypothesis h is true, given the data D. –Likelihood of D, P(D|h): probability that we will see data D, given that hypothesis h is true.

Recall the definition of conditional probability (reconstructed below). [Figure: Venn diagram of events X and Y in the event space. Event space = all messages; X = all spam messages; Y = all messages containing the word “v1agra”.]
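The definition the slide refers to was shown as a figure; a standard reconstruction (an assumption about what was displayed) is:

P(X | Y) = P(X ∧ Y) / P(Y), provided P(Y) > 0.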

Bayes Rule (reconstructed below). [Figure: the same Venn diagram of events X and Y in the event space.]
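The rule itself was also shown as a figure; presumably it was the standard form:

P(X | Y) = P(Y | X) P(X) / P(Y).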

Example: Using Bayes Rule. Hypotheses: h = “message m is spam”; ¬h = “message m is not spam”. Data: + = message m contains “viagra”; – = message m does not contain “viagra”. Prior probabilities: P(h) = 0.1, P(¬h) = 0.9. Likelihoods: P(+ | h) = 0.6, P(– | h) = 0.4; P(+ | ¬h) = 0.03, P(– | ¬h) = 0.97.

P(+) = P(+ | h) P(h) + P(+ | ¬h) P(¬h) = 0.6 × 0.1 + 0.03 × 0.9 = 0.087 ≈ 0.09, so P(–) ≈ 0.91. P(h | +) = P(+ | h) P(h) / P(+) = 0.6 × 0.1 / 0.09 ≈ 0.67. How would we learn these prior probabilities and likelihoods from past examples of spam and not spam?
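A minimal sketch of this calculation in Python (not from the slides; it just reproduces the arithmetic above):

```python
# Sketch: Bayes-rule arithmetic for the spam/"viagra" example above.
p_h, p_not_h = 0.1, 0.9              # priors P(h), P(~h)
p_plus_given_h = 0.6                 # P(+ | h): "viagra" appears given spam
p_plus_given_not_h = 0.03            # P(+ | ~h)

# Total probability of observing "viagra"
p_plus = p_plus_given_h * p_h + p_plus_given_not_h * p_not_h    # 0.087 (~0.09)

# Bayes rule: posterior probability of spam given "viagra"
p_h_given_plus = p_plus_given_h * p_h / p_plus                  # ~0.69 (0.67 if P(+) is rounded to 0.09)

print(f"P(+) = {p_plus:.3f}, P(h | +) = {p_h_given_plus:.2f}")
```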

Full joint probability distribution (CORRECTED). [Table: joint probabilities over “viagra”/¬“viagra” × Spam/¬Spam.] Notation: P(h, D) ≡ P(h ∧ D), so P(h ∧ +) = P(h | +) P(+), P(h ∧ –) = P(h | –) P(–), etc.

Now suppose a second feature is examined: does the message contain the word “offer”? [Table: joint distribution over spam/¬spam × “viagra”/¬“viagra” × offer/¬offer, e.g. P(m = spam, viagra, offer).] The number of parameters in the full joint distribution scales exponentially with the number of features.

Bayes optimal classifier for spam (reconstructed below), where f_i is a feature (here, could be a “keyword”). In general, intractable.
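The formula was an image on the slide; the standard Bayes optimal classifier it presumably showed is:

h* = argmax_{h ∈ H} P(h | f_1, ..., f_n) = argmax_{h ∈ H} P(f_1, ..., f_n | h) P(h),

which is intractable in general because the joint likelihood P(f_1, ..., f_n | h) has exponentially many entries.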

Classification using “naive Bayes”. Assumes that all features are conditionally independent of one another given the class. How do we learn the naive Bayes model from data? How do we apply the naive Bayes model to a new instance?

Example: Training and Using Naive Bayes for Classification Features: –CAPS: Boolean (longest contiguous string of capitalized letters in message is longer than 3) –URL: Boolean (0 if no URL in message, 1 if at least one URL in message) –$: Boolean (0 if $ does not appear at least once in message; 1 otherwise)

Training data:
M1 (spam): “DON’T MISS THIS AMAZING OFFER $$$!”
M2 (spam): “Dear mm, for more $$, check this out:
M3 (not spam): “I plan to offer two sections of CS 250 next year”
M4 (not spam): “Hi Mom, I am a bit short on $$ right now, can you send some ASAP? Love, me”

Training a Naive Bayes Classifier. Two hypotheses: spam or not spam. Estimates:
P(spam) = .5             P(¬spam) = .5
P(CAPS | spam) = .5      P(¬CAPS | spam) = .5
P(URL | spam) = .5       P(¬URL | spam) = .5
P($ | spam) = .75        P(¬$ | spam) = .25
P(CAPS | ¬spam) = .5     P(¬CAPS | ¬spam) = .5
P(URL | ¬spam) = .25     P(¬URL | ¬spam) = .75
P($ | ¬spam) = .5        P(¬$ | ¬spam) = .5

m-estimate of probability ( to fix cases where one of the terms in the product is 0):
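The formula was an image; assuming the usual m-estimate (Mitchell's formulation), it is:

P(f | h) ≈ (n_c + m·p) / (n + m),

where n is the number of training examples of class h, n_c is the number of those with feature value f, p is a prior estimate of the probability (e.g., uniform), and m is the equivalent sample size.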

Now classify a new message, M5: “This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on
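A minimal sketch of applying the trained classifier to this message, assuming its features are CAPS = true, URL = true, $ = true (not code from the slides; the tables mirror the estimates above):

```python
# Sketch: naive Bayes classification with the Boolean features and estimates above.
priors = {"spam": 0.5, "not_spam": 0.5}
# P(feature = True | class); P(feature = False | class) = 1 - value
likelihoods = {
    "spam":     {"CAPS": 0.5, "URL": 0.5,  "$": 0.75},
    "not_spam": {"CAPS": 0.5, "URL": 0.25, "$": 0.5},
}

def classify(features):
    """features: dict of Boolean feature values, e.g. {"CAPS": True, "URL": True, "$": True}."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for name, value in features.items():
            p_true = likelihoods[cls][name]
            score *= p_true if value else (1.0 - p_true)
        scores[cls] = score
    total = sum(scores.values())                    # normalize to get posteriors
    return {cls: s / total for cls, s in scores.items()}

print(classify({"CAPS": True, "URL": True, "$": True}))   # {'spam': 0.75, 'not_spam': 0.25}
```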

Information Retrieval Most important concepts: –Defining features of a document –Indexing documents according to features –Retrieving documents in response to a query –Ordering retrieved documents by relevance Early search engines: –Features: List of all terms (keywords) in document (minus “a”, “the”, etc.) –Indexing: by keyword –Retrieval: by keyword match with query –Ordering: by number of keywords matched Problems with this approach

Naive Bayesian Document Retrieval. Let D be a document (“bag of words”), Q be a query (“bag of words”), and r be the event that D is relevant to Q. In document retrieval, we want to compute P(r | D, Q), or the “odds ratio” P(r | D, Q) / P(¬r | D, Q). In the book, they show (via a lot of algebra) that this reduces to the product form on the next slide. Chain rule: P(A, B) = P(A | B) P(B).

Naive Bayesian Document Retrieval (continued), where Q_j is the jth keyword in the query. The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document (reconstructed below). How to learn these probabilities?
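A plausible reconstruction of the missing formula, consistent with the surrounding prose:

P(Q | D, r) = ∏_j P(Q_j | D, r),

i.e., the product over the query's keywords Q_j.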

Evaluating Information Retrieval Systems: precision and recall. Example: out of a corpus of 100 documents, a query has the following results:

                  In results set    Not in results set
Relevant                30                  20
Not relevant            10                  40

Precision: fraction of the results set that is relevant = 30/40 = 0.75 (“How precise is the results set?”). Recall: fraction of the relevant documents in the whole corpus that are in the results set = 30/50 = 0.60 (“How many relevant documents were recalled?”).
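A tiny sketch of the computation (the helper name is my own, not from the slides):

```python
# Sketch: precision and recall from the 2x2 results table above.
def precision_recall(relevant_retrieved, irrelevant_retrieved, relevant_missed):
    precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)
    recall = relevant_retrieved / (relevant_retrieved + relevant_missed)
    return precision, recall

print(precision_recall(30, 10, 20))   # (0.75, 0.6)
```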

Tradeoff between recall and precision: if we want to ensure that recall is high, just return a lot of documents; then precision may be low. If we return 100% of the documents but only 50% of them are relevant, recall is 1 but precision is 0.5. If we want a high chance that precision is high, return only the single document judged most relevant (“I’m feeling lucky” in Google); then precision will (likely) be 1.0, but recall will be low. When do you want high precision? When do you want high recall?

Bayesian approaches to knowledge representation and reasoning Part 2 (Chapter 14, sections 1-4)

Recall Naive Bayes method: This can also be written in terms of “cause” and “effect”:
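The two formulas were figures on the slide; the standard forms they presumably showed are:

P(h | f_1, ..., f_n) ∝ P(h) ∏_i P(f_i | h)

and, in cause/effect terms,

P(cause | effect_1, ..., effect_n) ∝ P(cause) ∏_i P(effect_i | cause).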

[Figure: two graphical models over the variables Spam, v1agra, stock, offer. Left: naive Bayes, with Spam as the cause and v1agra, stock, offer as effects. Right: a Bayesian network over the same variables, whose structure (judging from the CPTs on the next slide) has Spam as a parent of v1agra and stock, and stock as a parent of offer.]

Each node has a “conditional probability table” (CPT) that gives its dependencies on its parents:

P(Spam) = 0.1

Spam  P(v1agra)      Spam  P(stock)      stock  P(offer)
 t      0.6            t     0.2            t      0.6
 f      0.03           f     0.3            f      0.1

Semantics of Bayesian networks: if the network is correct, we can calculate the full joint probability distribution from the network as P(x_1, ..., x_n) = ∏_i P(x_i | parents(X_i)), where parents(X_i) denotes specific values of the parents of X_i. [Table: the resulting joint distribution over Spam/¬Spam, stock/¬stock, offer/¬offer; the sum of all boxes is 1.]
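A minimal sketch of this semantics for the small spam network above; the CPT numbers come from the earlier slide, and the structure Spam → v1agra, Spam → stock, stock → offer is inferred from those tables:

```python
# Sketch: full joint of a tiny Bayesian network as the product of CPT entries.
# Assumed structure (from the CPT slide): Spam -> v1agra, Spam -> stock, stock -> offer.
from itertools import product

P_SPAM = 0.1
P_V1AGRA = {True: 0.6, False: 0.03}    # P(v1agra = true | Spam)
P_STOCK  = {True: 0.2, False: 0.3}     # P(stock = true | Spam)
P_OFFER  = {True: 0.6, False: 0.1}     # P(offer = true | stock)

def joint(spam, v1agra, stock, offer):
    p = P_SPAM if spam else 1 - P_SPAM
    p *= P_V1AGRA[spam] if v1agra else 1 - P_V1AGRA[spam]
    p *= P_STOCK[spam] if stock else 1 - P_STOCK[spam]
    p *= P_OFFER[stock] if offer else 1 - P_OFFER[stock]
    return p

# The 16 joint entries sum to 1, as the slide notes.
print(sum(joint(*v) for v in product([True, False], repeat=4)))   # ~1.0
```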

Example from textbook I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls Network topology reflects "causal" knowledge: –A burglar can set the alarm off –An earthquake can set the alarm off –The alarm can cause Mary to call –The alarm can cause John to call

Example continued

Complexity of Bayesian Networks. For n random Boolean variables: –Full joint probability distribution: 2^n entries. –Bayesian network with at most k parents per node: each conditional probability table has at most 2^k entries, and the entire network has n·2^k entries.

Exact inference in Bayesian networks. Query: what is P(Burglary | JohnCalls = true ∧ MaryCalls = true)? Notation: capital letters are distributions; lower-case letters are values or variables, depending on context. We have:

Let’s calculate this for b = “Burglary = true” (see the reconstruction below). Worst-case complexity: O(n·2^n), where n is the number of Boolean variables. We can simplify:
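The missing expressions are presumably the textbook's enumeration for the burglary network:

P(b | j, m) = α P(b, j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)
            = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a),

where the second line is the simplification: factors that do not depend on a summation variable are pulled outside that sum.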

A. Onisko et al., A Bayesian network model for diagnosis of liver disorders

Can speed up further via “variable elimination”. However, bottom line on exact inference: In general, it’s intractable. (Exponential in n.) Solution: Approximate inference, by sampling.

Bayesian approaches to knowledge representation and reasoning Part 3 (Chapter 14, section 5)

What are the advantages of Bayesian networks? Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables. Represents “beliefs and knowledge” about a particular class of situations. Efficient (?) (approximate) inference algorithms Efficient, effective learning algorithms

Review of exact inference in Bayesian networks General question: What is P(x|e)? Example Question: What is P(c| r,w)?

General question: What is P(x|e)?

[Figure sequence: event-space (Venn) diagrams built up incrementally over the variables Cloudy, Rain, Sprinkler, and Wet Grass.]

Draw the expression tree for the enumeration of P(c | r, w). Worst-case complexity is exponential in n (the number of nodes); the problem is having to enumerate all possibilities for many variables.

Issues in Bayesian Networks Building / learning network topology Assigning / learning conditional probability tables Approximate inference via sampling

Real-World Example 1: The Lumière Project at Microsoft Research. A Bayesian network approach to answering user queries about Microsoft Office. “At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently.” “As an example, users working with the Excel spreadsheet might have required assistance with formatting ‘a graph’. Unfortunately, Excel has no knowledge about the common term, ‘graph,’ and only considered in its keyword indexing the term ‘chart’.”

Networks were developed by experts from user modeling studies.

Offspring of project was Office Assistant in Office 97.

Real-World Example 2: Diagnosing liver disorders with Bayesian networks Variables: “disorder class” (16 possibilities) plus 93 features from existing database of patient records. Data: 600 patient records, which used those features Network structure: designed by “domain experts” (30 hours)

A. Onisko et al., A Bayesian network model for diagnosis of liver disorders

Prior and conditional probability distributions were learned from data in liver-disorders database. Problem: Data doesn’t give enough samples for good conditional probability estimates. For combinations of parent values that are not adequately sampled, assume uniform distribution over those values.

Results: “number of observations” = the number of evidence variables in the query; “window = n” means that a classification is counted as correct if the true diagnosis is among the n most probable diagnoses given by the network for the given evidence values.

Approximate inference in Bayesian networks: instead of enumerating all possibilities, sample to estimate probabilities. [Figure: a network over variables X_1, X_2, X_3, ..., X_n.]

Direct Sampling. Suppose we have no evidence, but we want to determine P(c, s, r, w) for all c, s, r, w. Direct sampling: –Sample each variable in topological order, conditioned on the values of its parents. –I.e., always sample from P(X_i | parents(X_i)).

Example:
1. Sample from P(Cloudy); suppose it returns true.
2. Sample from P(Sprinkler | Cloudy = true); suppose it returns false.
3. Sample from P(Rain | Cloudy = true); suppose it returns true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true); suppose it returns true.
Here is the sampled event: [true, false, true, true].
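A minimal sketch of direct (prior) sampling for this network. The CPT values below are the usual textbook numbers for the sprinkler example and are an assumption here, not taken from these slides:

```python
# Sketch: direct (prior) sampling for the Cloudy/Sprinkler/Rain/WetGrass network.
# CPT values are the usual textbook numbers (an assumption, not from the slides).
import random
from collections import Counter

def bern(p):                       # Bernoulli sample: True with probability p
    return random.random() < p

def prior_sample():
    c = bern(0.5)                                               # P(Cloudy)
    s = bern(0.1 if c else 0.5)                                 # P(Sprinkler | Cloudy)
    r = bern(0.8 if c else 0.2)                                 # P(Rain | Cloudy)
    w = bern({(True, True): 0.99, (True, False): 0.90,
              (False, True): 0.90, (False, False): 0.0}[(s, r)])   # P(WetGrass | S, R)
    return (c, s, r, w)

# Estimate P(c, s, r, w) by the observed frequency N_S(c, s, r, w) / N
N = 100_000
counts = Counter(prior_sample() for _ in range(N))
print(counts[(True, False, True, True)] / N)    # estimate of P(c, ~s, r, w)
```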

Suppose there are N total samples, and let N_S(x_1, ..., x_n) be the observed frequency of the specific event x_1, ..., x_n; then P(x_1, ..., x_n) is estimated as N_S(x_1, ..., x_n) / N. With N samples and n nodes, the complexity is O(Nn). Problem 1: we need lots of samples to get good probability estimates. Problem 2: many samples are not realistic; they have low likelihood.

Likelihood weighting. Now suppose we have evidence e, so the values of the evidence variables E are fixed. We want to estimate P(X | e). We need to sample X and Y, where Y is the set of non-evidence variables. Each sampled event is weighted by the likelihood that the event accords with the evidence; i.e., events in which the actual evidence appears unlikely should be given less weight.

Example: estimate P(Rain | Sprinkler = true, WetGrass = true). WeightedSample algorithm: 1. Set weight w = 1.0. 2. Sample from P(Cloudy); suppose it returns true. 3. Sprinkler is an evidence variable with value true, so update the likelihood weight: w ← w · P(Sprinkler = true | Cloudy = true). The likelihood of the sprinkler being on is low if Cloudy is true, so this sample gets a lower weight.

4. Sample from P(Rain | Cloudy = true); suppose this returns true. 5. WetGrass is an evidence variable with value true, so update the likelihood weight: w ← w · P(WetGrass = true | Sprinkler = true, Rain = true). 6. Return the event [true, true, true, true] with weight w. The weight is low because Cloudy = true, so Sprinkler is unlikely to be true.
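A minimal sketch of likelihood weighting for this query, reusing the assumed CPT values from the direct-sampling sketch above:

```python
# Sketch: likelihood weighting for P(Rain | Sprinkler = true, WetGrass = true).
# CPT values are the same assumed textbook numbers as in the direct-sampling sketch.
import random

P_WET = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.0}

def weighted_sample():
    w = 1.0
    c = random.random() < 0.5                      # sample Cloudy
    w *= 0.1 if c else 0.5                         # evidence: Sprinkler = true
    s = True
    r = random.random() < (0.8 if c else 0.2)      # sample Rain | Cloudy
    w *= P_WET[(s, r)]                             # evidence: WetGrass = true
    return r, w

num = den = 0.0
for _ in range(100_000):
    r, w = weighted_sample()
    den += w
    if r:
        num += w
print(num / den)    # estimate of P(Rain = true | s, w)
```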

Problem with likelihood weighting: as the number of evidence variables increases, performance degrades. This is because most samples will have very low weights, so the weighted estimate will be dominated by the small fraction of samples that accord more than an infinitesimal likelihood to the evidence.

Markov Chain Monte Carlo (MCMC) Sampling. One of the most common methods used in real applications. Uses the idea of the “Markov blanket” of a variable X_i: its parents, children, and children’s parents. Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.

Proposition: a node X_i is conditionally independent of all other nodes in the network, given its Markov blanket. –Example. –Need to show that X_i is conditionally independent of nodes outside its Markov blanket. –Need to show that X_i can be conditionally dependent on its children’s parents.

[Figure: a small network over nodes A, B, C, E, F.] Example: the proposition says that B is conditionally independent of F given A, C, E. This can only be true if P(B | A, C, E, F) = P(B | A, C, E).

Prove: We know, by definition of conditional probability: From tree we have:

Thus:

Now compute P(B | A, C, E): Thus: Q.E.D.

Markov Chain Monte Carlo Sampling. Start with a random sample of the variables: (x_1, ..., x_n). This is the current “state” of the algorithm. Next state: randomly sample a value for one non-evidence variable X_i, conditioned on the current values in the “Markov blanket” of X_i.

Example. Query: what is P(Rain | Sprinkler = true, WetGrass = true)? MCMC:
–Random sample, with evidence variables fixed: [true, true, false, true].
–Repeat:
1. Sample Cloudy, given the current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: [false, true, false, true].
2. Sample Rain, given the current values of its Markov blanket: Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the result is true. New state: [false, true, true, true].

Each sample contributes to the estimate for the query P(Rain | Sprinkler = true, WetGrass = true). Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false. Then the answer to the query is Normalize(⟨20, 80⟩) = ⟨0.20, 0.80⟩. Claim: “The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.” Proof of claim is on pp.
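A minimal sketch of this procedure as Gibbs sampling for the same query. For a four-node network, P(X_i | Markov blanket) can be obtained by renormalizing the full joint; the CPT values are again the assumed textbook numbers, not taken from the slides:

```python
# Sketch: Gibbs sampling (MCMC) for P(Rain | Sprinkler = true, WetGrass = true).
import random

P_WET = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.0}

def joint(c, s, r, w):
    p = 0.5                                               # P(Cloudy) = 0.5 either way
    p_s = 0.1 if c else 0.5
    p_r = 0.8 if c else 0.2
    p *= p_s if s else 1 - p_s
    p *= p_r if r else 1 - p_r
    p *= P_WET[(s, r)] if w else 1 - P_WET[(s, r)]
    return p

def gibbs(n_samples=100_000):
    c, r = True, False                                    # initial non-evidence values
    s = w = True                                          # evidence, held fixed
    rain_true = 0
    for _ in range(n_samples):
        # Resample Cloudy given everything else (proportional to the joint)
        p1, p0 = joint(True, s, r, w), joint(False, s, r, w)
        c = random.random() < p1 / (p1 + p0)
        # Resample Rain given everything else
        p1, p0 = joint(c, s, True, w), joint(c, s, False, w)
        r = random.random() < p1 / (p1 + p0)
        rain_true += r
    return rain_true / n_samples

print(gibbs())    # estimate of P(Rain = true | s, w)
```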

Claim (again). Claim: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence. That is, for all variables X_i, the probability of the value x_i of X_i appearing in a sample is equal to P(x_i | e).

Proof of Claim (outline). First, give an example of a Markov chain. Now: let x be a state, with x = (x_1, ..., x_n). Let q(x → x′) be the transition probability from state x to state x′. Let π_t(x) be the probability that the system will be in state x after t time steps, starting from state x_0. Let π_{t+1}(x′) be the probability that the system will be in state x′ after t + 1 time steps, starting from state x_0.

We have:
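The missing equation is presumably the standard one-step update:

π_{t+1}(x′) = Σ_x π_t(x) q(x → x′).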

Definition: π is called the Markov process’s stationary distribution if π_t = π_{t+1} for all x. The defining equation for the stationary distribution is π(x′) = Σ_x π(x) q(x → x′) for all x′ (equation 1). Result from Markov chain theory: given q, there is exactly one such stationary distribution π (assuming q is “ergodic”).

One way to satisfy equation 1 is the property of detailed balance (reconstructed below). Detailed balance implies stationarity:
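The missing formulas are presumably the standard ones. Detailed balance:

π(x) q(x → x′) = π(x′) q(x′ → x) for all x, x′.

Summing both sides over x gives

Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′),

which is exactly equation 1, so detailed balance implies stationarity.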

Proof of claim: show that the transition probability q(x → x′) defined by MCMC sampling satisfies the detailed balance equation, with a stationary distribution equal to P(x | e). Let X_i be the variable to be sampled, let e be the values of the evidence variables, and let Y be the other non-evidence variables. Current sample: x = (x_i, y), with fixed evidence variable values e. We have, by the definition of the MCMC algorithm:

Now, show this transition probability produces detailed balance. We want to show:
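A reconstruction of the missing algebra, assuming the standard Gibbs-sampling argument. The MCMC transition resamples X_i from its posterior given everything else:

q(x → x′) = q((x_i, y) → (x_i′, y)) = P(x_i′ | y, e).

Then, with π(x) = P(x | e),

π(x) q(x → x′) = P(x_i, y | e) P(x_i′ | y, e) = P(x_i | y, e) P(y | e) P(x_i′ | y, e),

which is symmetric under swapping x_i and x_i′, so it equals π(x′) q(x′ → x). Hence detailed balance holds with stationary distribution P(x | e).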

Speech Recognition (Section 15.6) Task: Identify sequence of words uttered by speaker, given acoustic signal. Uncertainty introduced by noise, speaker error, variation in pronunciation, homonyms, etc. Thus speech recognition is viewed as problem of probabilistic inference.

Speech Recognition So far, we’ve looked at probabilistic reasoning in static environments. Speech: Time sequence of “static environments”. –Let X be the “state variables” (i.e., set of non-evidence variables) describing the environment (e.g., Words said during time step t) –Let E be the set of evidence variables (e.g., S = features of acoustic signal).

–The joint probability distribution over the evidence values E and the state X changes over time: t_1: X_1, e_1; t_2: X_2, e_2; etc.

At each t, we want to compute P(Words | S). We know from Bayes rule that P(Words | S) ∝ P(S | Words) P(Words). P(S | Words), for all words, is a previously learned “acoustic model”; e.g., for each word, a probability distribution over phones, and for each phone, a probability distribution over acoustic signals (which can vary in pitch, speed, volume). P(Words), for all words, is the “language model”, which specifies the prior probability of each utterance; e.g., a “bigram model” gives the probability of each word following each other word.

Speech recognition typically makes three assumptions: 1. The process underlying change is itself “stationary”, i.e., the state transition probabilities don’t change. 2. The current state X depends on only a finite history of previous states (the “Markov assumption”); a Markov process of order n depends only on the n previous states. 3. The values e_t of the evidence variables depend only on the current state X_t (the “sensor model”).

Example: “I’m firsty, um, can I have something to dwink?”