UNIT IV: UNCERTAIN KNOWLEDGE AND REASONING

Uncertain Knowledge and Reasoning
- Uncertainty
- Review of Probability
- Probabilistic Reasoning
- Bayesian networks
- Inference in Bayesian networks
- Temporal Models
- Hidden Markov Models

Introduction
The world is not a well-defined place. There is uncertainty in the facts we know:
- What's the temperature? (imprecise measures)
- Is Bush a good president? (imprecise definitions)
- Where is the pit? (imprecise knowledge)
There is also uncertainty in our inferences: if I have a blistery, itchy rash and was gardening all weekend, I probably have poison ivy. Yet people make successful decisions all the time anyhow.

Sources of Uncertainty
- Uncertain data: missing, unreliable, ambiguous, imprecisely represented, inconsistent, subjective, derived from defaults, noisy…
- Uncertain knowledge: multiple causes lead to multiple effects; incomplete knowledge of causality in the domain; probabilistic/stochastic effects.
- Uncertain knowledge representation: restricted model of the real system; limited expressiveness of the representation mechanism.
- Uncertainty in the inference process: a derived result may be formally correct but wrong in the real world; new conclusions are not well-founded (e.g., inductive reasoning); incomplete, default reasoning methods.

Reasoning Under Uncertainty
So how do we reason under uncertainty and with inexact knowledge?
- Heuristics: ways to mimic the heuristic knowledge-processing methods used by experts
- Empirical associations: experiential reasoning based on limited observations
- Probabilities: objective (frequency counting) or subjective (human experience)

Decision making with uncertainty
Rational behavior:
- For each possible action, identify the possible outcomes
- Compute the probability of each outcome
- Compute the utility of each outcome
- Compute the probability-weighted (expected) utility over possible outcomes for each action
- Select the action with the highest expected utility (principle of Maximum Expected Utility)
A small sketch of this computation follows.
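
A minimal Python sketch of the Maximum Expected Utility principle. The actions, outcome probabilities, and utilities below are hypothetical, purely for illustration:

```python
# Each action maps to a list of (P(outcome), utility) pairs (hypothetical values).
actions = {
    "leave_90_min_early": [(0.95, 100), (0.05, -500)],
    "leave_30_min_early": [(0.60, 120), (0.40, -500)],
}

def expected_utility(outcomes):
    """Probability-weighted sum of utilities over the possible outcomes."""
    return sum(p * u for p, u in outcomes)

# MEU: pick the action whose expected utility is highest.
best = max(actions, key=lambda a: expected_utility(actions[a]))
for a, outs in actions.items():
    print(a, expected_utility(outs))
print("MEU action:", best)
```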

Probability theory
- Random variables have a domain; Alarm, Burglary, and Earthquake here are Boolean (variables may also be discrete or continuous).
- Atomic event: a complete specification of the state, e.g. Alarm=True ∧ Burglary=True ∧ Earthquake=False, abbreviated alarm ∧ burglary ∧ ¬earthquake.
- Prior probability: degree of belief without any other evidence, e.g. P(Burglary) = .1.
- Joint probability: matrix of combined probabilities of a set of variables, e.g. P(Alarm, Burglary):

               alarm    ¬alarm
  burglary      .09       .01
  ¬burglary     .10       .80

Probability theory (cont.)
- Conditional probability: probability of an effect given its causes.
- Computing conditional probabilities: P(a | b) = P(a ∧ b) / P(b), where P(b) acts as a normalizing constant.
- Product rule: P(a ∧ b) = P(a | b) P(b).
From the joint table above:
P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = .09 / .19 ≈ .47
P(alarm | burglary) = .09 / .1 = .9
P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = .47 × .19 ≈ .09
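
As a sanity check, a minimal Python sketch of these calculations, reading the entries off the joint table above (the dictionary encoding is ours):

```python
# Joint distribution P(Burglary, Alarm), keyed as (burglary, alarm).
joint = {(True, True): 0.09, (True, False): 0.01,
         (False, True): 0.10, (False, False): 0.80}

# Marginalize out Burglary to get the normalizing constant P(alarm).
p_alarm = sum(p for (b, a), p in joint.items() if a)      # 0.19

# Conditional probability: P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm)
p_burg_given_alarm = joint[(True, True)] / p_alarm        # 0.09 / 0.19
print(round(p_burg_given_alarm, 2))                       # 0.47
```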

Independence
When two sets of propositions do not affect each other's probabilities, they are independent, and we can easily compute their joint and conditional probabilities:
Independent(A, B) iff P(A ∧ B) = P(A) P(B), equivalently P(A | B) = P(A).
For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake}.
We need a more refined notion of independence, and methods for reasoning about these kinds of relationships.

Bayes' Rule: P(b | a) = P(a | b) P(b) / P(a)

Bayes Example: Diagnosing Meningitis
Suppose we know that:
- a stiff neck is a symptom in 50% of meningitis cases: P(s | m) = 0.5
- meningitis (m) occurs in 1 in 50,000 patients: P(m) = 1/50,000
- a stiff neck (s) occurs in 1 in 20 patients: P(s) = 1/20
Then P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50,000) / (1/20) = 0.0002.
So we expect one in 5,000 patients with a stiff neck to have meningitis.
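
A minimal sketch of Bayes' rule applied to the meningitis numbers on this slide:

```python
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # P(meningitis)
p_s = 1 / 20             # P(stiff neck)

# Bayes' rule: P(m | s) = P(s | m) P(m) / P(s)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002, i.e. 1 in 5000
```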

Conditional independence
Absolute independence: A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A).
A and B are conditionally independent given C if P(A ∧ B | C) = P(A | C) P(B | C).
This lets us decompose the joint distribution: P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C).
For example, Moon-Phase and Burglary are conditionally independent given Light-Level.
Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution.

Probabilistic Reasoning

Outline Introducing Bayesian Networks Constructing Bayesian Networks Representing Bayesian Networks Inference in Bayesian Networks

Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
- a set of nodes, one per variable
- a directed, acyclic graph (link ≈ "directly influences")
- a conditional distribution for each node given its parents: P(Xi | Parents(Xi))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.

Example
The topology of the network encodes conditional independence assertions:
- Weather is independent of the other variables
- Toothache and Catch are conditionally independent given Cavity

Example
I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call

Example contd. [Figure: the burglary network with its conditional probability tables]

Semantics
The full joint distribution is defined as the product of the local conditional distributions:
P(X1, …, Xn) = ∏i=1..n P(Xi | Parents(Xi))
e.g., P(j ∧ m ∧ a ∧ b ∧ e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)
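
A minimal sketch of this semantics in Python. The CPT values are assumptions taken from the standard textbook version of the burglary network (the figure itself is not reproduced on these slides):

```python
# Assumed CPT values for the burglary network (Russell & Norvig's example).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_m = {True: 0.70, False: 0.01}                      # P(M=true | A)

# P(j ∧ m ∧ a ∧ b ∧ e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)
p = P_j[True] * P_m[True] * P_a[(True, True)] * P_b * P_e
print(p)   # ≈ 1.2e-06
```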

Constructing Bayesian networks
1. Choose an ordering of variables X1, …, Xn
2. For i = 1 to n:
   add Xi to the network
   select parents from X1, …, Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi-1)
Parents are the variables that 'directly influence' Xi.
This choice of parents guarantees:
P(X1, …, Xn) = ∏i=1..n P(Xi | X1, …, Xi-1)   (chain rule)
             = ∏i=1..n P(Xi | Parents(Xi))    (by construction)
The ordering of variables is crucial: causal models generally give good orderings; if the ordering is chosen badly, the network will typically be more complex than necessary.

Conditional Independence in Bayesian Networks Conditional upon its parents, a node is independent of all other nodes in the network except its descendants

Independence in Bayesian Networks The Markov Blanket of a node consists of its parents, its children, and the other parents of those children Conditional upon its Markov Blanket, a node is independent of all other nodes

Representing Bayesian Networks
Representing the dependency graph is relatively straightforward: use any standard graph representation.
Representing the form of the dependencies is less obvious: for a Boolean node with k parents, the Conditional Probability Table has 2^k rows, so more compact ways of representing the CPT are highly desirable.

Inference in Bayesian Networks
In theory, the conditional probability of any query on a Bayesian network can be computed from the inputs, using classical probabilistic arithmetic.
Unfortunately, the time complexity is O(2^n). In fact, any exact inference procedure for arbitrary Bayesian networks must take exponential time in the worst case, because Boolean satisfiability is a special case.
As with Boolean logic, some special networks are faster: a polytree has at most one (undirected) path between any pair of nodes, and exact probabilistic inference in polytrees can be computed in linear time.

Exact Inference vs Sampling
In general, exact inference in Bayesian networks is too expensive; the alternative is to use Monte Carlo (sampling) methods.

Direct Sampling
Sampling is relatively straightforward when there is no evidence about the network: we generate samples from a known probability distribution using a simple non-deterministic algorithm:
- sample each node without parents according to its distribution
- sample each child, conditional upon the sample results already obtained for its parents
As the number of samples increases, the sampled frequency of each event converges toward its expected value. A sketch follows.
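
A minimal sketch of direct (prior) sampling on a hypothetical two-node network Burglary → Alarm; the probabilities here are illustrative assumptions, not values from the slides:

```python
import random

def sample_bernoulli(p):
    """Return True with probability p."""
    return random.random() < p

def prior_sample():
    # Sample parents before children, following the topological order.
    b = sample_bernoulli(0.001)                 # root node: P(B) (assumed)
    a = sample_bernoulli(0.94 if b else 0.01)   # child: P(A | B) (assumed)
    return {"Burglary": b, "Alarm": a}

# Sampled frequencies converge toward the corresponding probabilities.
N = 100_000
count = sum(prior_sample()["Alarm"] for _ in range(N))
print(count / N)   # ≈ P(alarm) for this toy network
```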

Direct Sampling Example [a run of figure-only slides stepping through the generation of a single sample, node by node]

Direct Sampling with Evidence
The simplest approach is to use direct sampling, but reject all samples that conflict with the evidence.
However, the proportion of samples kept is proportional to the probability of the evidence, which decreases exponentially with the number of evidence variables. The method is therefore unusable with a significant number of evidence variables.
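
A minimal sketch of rejection sampling, reusing prior_sample from the direct-sampling sketch above, with the assumed evidence Alarm = true:

```python
def rejection_sample_burglary_given_alarm(n):
    """Estimate P(Burglary | alarm) by discarding samples that conflict with the evidence."""
    kept, hits = 0, 0
    for _ in range(n):
        s = prior_sample()
        if not s["Alarm"]:
            continue            # reject: sample conflicts with evidence
        kept += 1
        hits += s["Burglary"]
    return hits / kept if kept else float("nan")

print(rejection_sample_burglary_given_alarm(100_000))   # ≈ P(burglary | alarm)
```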

Likelihood weighting
An alternative is to sample as before, except that the evidence variables are not sampled: instead, the probability of the evidence given the other variables is computed and used to weight the sample.
More efficient than rejection sampling, because all samples are used.
May still be slow if the evidence is unlikely, because the weight of each sample will then be low.
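
A minimal sketch of likelihood weighting on the same toy Burglary → Alarm network, with the assumed evidence Alarm = true (probabilities as in the earlier sketch):

```python
import random

def weighted_sample():
    """Sample the non-evidence variable; weight by the likelihood of the evidence."""
    w = 1.0
    b = random.random() < 0.001      # sample Burglary (non-evidence)
    w *= 0.94 if b else 0.01         # weight by P(alarm=true | b); Alarm is not sampled
    return b, w

num = den = 0.0
for _ in range(100_000):
    b, w = weighted_sample()
    den += w
    num += w * b
print(num / den)   # ≈ P(burglary | alarm); every sample contributes
```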

Likelihood Weighting Example [a run of figure-only slides stepping through one weighted sample]

Markov Chain Monte Carlo
The previous two algorithms generate each event from scratch; MCMC generates each event by making a random change to the preceding event.
Algorithm:
1. Generate an initial sample
2. Perturb the sample by resampling one of the non-evidence variables, conditional upon its Markov blanket
3. Repeat
The sample frequency converges in the limit to the posterior distribution (under fairly weak assumptions).
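
A minimal sketch of MCMC (Gibbs sampling) on the sprinkler network from the next slide, with evidence Sprinkler = true and WetGrass = true. The CPT values are the commonly used ones from the textbook version of this example; they are assumptions here, since the figure is not reproduced:

```python
import random

# Assumed CPTs for the sprinkler network (Cloudy, Sprinkler, Rain, WetGrass).
P_c = 0.5
P_s = {True: 0.1, False: 0.5}                    # P(S=true | C)
P_r = {True: 0.8, False: 0.2}                    # P(R=true | C)
P_w = {(True, True): 0.99, (True, False): 0.90,  # P(W=true | S, R)
       (False, True): 0.90, (False, False): 0.0}

def bern(p):
    return random.random() < p

def gibbs(n):
    c, r = True, True        # arbitrary initial state; S and W are fixed evidence
    rain = 0
    for _ in range(n):
        # Resample Cloudy given its Markov blanket {Sprinkler, Rain}:
        # P(c | s, r) ∝ P(c) P(s=true | c) P(r | c)
        wt = P_c * P_s[True] * (P_r[True] if r else 1 - P_r[True])
        wf = (1 - P_c) * P_s[False] * (P_r[False] if r else 1 - P_r[False])
        c = bern(wt / (wt + wf))
        # Resample Rain given its Markov blanket {Cloudy, Sprinkler, WetGrass}:
        # P(r | c, s, w) ∝ P(r | c) P(w=true | s=true, r)
        wt = P_r[c] * P_w[(True, True)]
        wf = (1 - P_r[c]) * P_w[(True, False)]
        r = bern(wt / (wt + wf))
        rain += r
    return rain / n

print(gibbs(100_000))   # ≈ P(rain | sprinkler, wet grass)
```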

MCMC Example
With the evidence Sprinkler = true and WetGrass = true, the two non-evidence variables Cloudy and Rain give four possible states.

MCMC Example (cont.)
The algorithm is essentially "wander about a bit until the probability estimates stabilise".
Markov blankets:
- for Cloudy: Sprinkler, Rain
- for Rain: Cloudy, Sprinkler, WetGrass

Summary
- Introducing, constructing, and representing Bayesian networks
- Exact inference: polynomial time on polytrees, NP-hard on general graphs; very sensitive to topology
- Approximate inference by likelihood weighting: poor when there is much evidence
- LW and MCMC are generally insensitive to topology, but convergence can be very slow when probabilities are close to 1 or 0
- Bayesian networks can handle arbitrary combinations of discrete and continuous variables

Temporal Models Hidden Markov Models

Temporal Probabilistic Agent [Figure: an agent connected to its environment through sensors and actuators, acting over time steps t1, t2, t3, …]

Time and uncertainty
The world changes; we need to track and predict it: probabilistic reasoning for a dynamic world.
Contrast repairing a car (a static diagnosis problem) with treating a diabetic patient (a dynamic one).
Basic idea: copy the state and evidence variables for each time step.

States and Observations
The process of change is viewed as a series of snapshots, each describing the state of the world at a particular time.
Each time slice involves a set of random variables indexed by t:
- the set of unobservable state variables Xt
- the set of observable evidence variables Et
The observation at time t is Et = et for some set of values et.
The notation Xa:b denotes the set of variables from Xa to Xb.

Markov processes (Markov chains)

Simplifying assumptions and notation
States are our "events"; (partial) states can be measured at reasonable time intervals.
Xt: unobservable state variables at time t.
Et ("evidence"): observable variables at time t.
Vm:n: the variables Vm, Vm+1, …, Vn.

Stationary, Markovian (the transition model)
Stationary: the laws of probability don't change over time.
Markovian: the current unobservable state depends on a finite number of past states.
First-order: the current state depends only on the previous state: P(Xt | X0:t-1) = P(Xt | Xt-1).
Second-order: the current state depends on the previous two states, and so on.

Observable variables (the sensor model)
Observable variables depend only on the current state (essentially by definition); these are the "sensors". The current state causes the sensor values:
P(Et | X0:t, E0:t-1) = P(Et | Xt)

Start it up (the prior probability model)
What is P(X0)?
Given the transition model P(Xt | Xt-1), the sensor model P(Et | Xt), and the prior probability P(X0), we can specify the complete joint distribution. At time t, the joint is completely determined:
P(X0, X1, …, Xt, E1, …, Et) = P(X0) ∏i=1..t P(Xi | Xi-1) P(Ei | Xi)

Inference tasks

Inference Tasks
- Filtering or monitoring: P(Xt | e1, …, et), computing the current belief state given all evidence to date. What is the probability that it is raining today, given all the umbrella observations up through today?
- Prediction: P(Xt+k | e1, …, et), computing the probability of some future state. What is the probability that it will rain the day after tomorrow, given all the umbrella observations up through today?
- Smoothing: P(Xk | e1, …, et) for k < t, computing the probability of a past state (hindsight). What is the probability that it rained yesterday, given all the umbrella observations through today?
- Most likely explanation: argmax over x1, …, xt of P(x1, …, xt | e1, …, et). Given a sequence of observations, find the sequence of states most likely to have generated those observations.

Filtering
We use recursive estimation to compute P(Xt+1 | e1:t+1) as a function of et+1 and P(Xt | e1:t).
This leads to a recursive definition: f1:t+1 = FORWARD(f1:t, et+1)

Filtering example
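
A minimal sketch of the forward recursion for the standard umbrella example. The model parameters are the usual textbook values and are assumptions here: P(Rt | Rt-1) = 0.7, P(Rt | ¬Rt-1) = 0.3, P(Ut | Rt) = 0.9, P(Ut | ¬Rt) = 0.2, prior P(R0) = 0.5:

```python
# State index 0 = rain, 1 = no rain.
T = [[0.7, 0.3],                            # transition P(X_t | X_{t-1})
     [0.3, 0.7]]
O = {True: [0.9, 0.2], False: [0.1, 0.8]}   # sensor P(u_t | X_t) per evidence value

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def forward(f, umbrella):
    """One FORWARD step: predict from the previous belief, then condition on evidence."""
    pred = [sum(f[i] * T[i][j] for i in range(2)) for j in range(2)]
    return normalize([O[umbrella][j] * pred[j] for j in range(2)])

f = [0.5, 0.5]              # prior P(R_0)
for u in [True, True]:      # umbrella observed on days 1 and 2
    f = forward(f, u)
    print(f)                # day 1 ≈ [0.818, 0.182]; day 2 ≈ [0.883, 0.117]
```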

Smoothing
Compute P(Xk | e1:t) for 0 ≤ k < t.
Using a backward message bk+1:t = P(ek+1:t | Xk), we obtain P(Xk | e1:t) = α f1:k × bk+1:t.
This leads to a recursive definition: bk+1:t = BACKWARD(bk+2:t, ek+1)

Smoothing example
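
A minimal sketch of the backward message and the forward-backward combination, reusing T, O, normalize, and forward from the filtering sketch above (same assumed umbrella parameters, both observations = umbrella):

```python
def backward(b, umbrella):
    """One BACKWARD step: b_{k+1:t} -> b_{k:t}, summing over the next state."""
    return [sum(T[i][j] * O[umbrella][j] * b[j] for j in range(2))
            for i in range(2)]

# Smooth P(R_1 | u_1, u_2): run forward to day 1, backward from day 2.
f1 = forward([0.5, 0.5], True)     # filtered estimate for day 1
b = backward([1.0, 1.0], True)     # backward message carrying the day-2 evidence
smoothed = normalize([f1[j] * b[j] for j in range(2)])
print(smoothed)                    # ≈ [0.883, 0.117]
```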

Most likely explanation

Viterbi example
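
A minimal sketch of the Viterbi recursion for the same umbrella model (reusing T and O from the filtering sketch, and assuming an effective prior of (0.5, 0.5) over the first state): for each state we track the probability of the most likely path ending there, and the path itself.

```python
def viterbi(observations):
    """Return the most likely state sequence for a list of umbrella observations."""
    m = [0.5 * O[observations[0]][j] for j in range(2)]   # day-1 path probabilities
    paths = [[j] for j in range(2)]
    for u in observations[1:]:
        # For each next state j, keep the best predecessor i.
        step = [max(((m[i] * T[i][j] * O[u][j], paths[i] + [j]) for i in range(2)),
                    key=lambda t: t[0])
                for j in range(2)]
        m = [s[0] for s in step]
        paths = [s[1] for s in step]
    best = max(range(2), key=lambda j: m[j])
    return ["rain" if s == 0 else "no rain" for s in paths[best]]

# Most likely weather sequence for umbrella, umbrella, no umbrella:
print(viterbi([True, True, False]))   # ['rain', 'rain', 'no rain']
```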

Markov Models
Like a Bayesian network, a Markov model is a graph composed of:
- states that represent the state of a process
- edges that indicate how to move from one state to another, where each edge is annotated with the probability of taking that transition
Unlike a Bayesian network, the Markov model's nodes are meant to convey temporal states.
In an ordinary Markov model the states are observable, so the transition probabilities are the only mechanism determining state transitions. We will find a more useful version of the Markov model to be the hidden Markov model.

HMM
Most interesting AI problems cannot be solved by an ordinary Markov model because there are unknown states in real-world problems. In speech recognition, for instance, we can build a Markov model to predict the next word in an utterance from the probabilities of how often any given word follows another (how often does "lamb" follow "little"?).
A hidden Markov model (HMM) is a Markov model where the probabilities are probabilistic functions based in part on the current state, which is hidden (unknown or unobservable): determining which transition to take requires more knowledge than the state-transition probabilities alone.

Example: Speech Recognition
We have observations, the acoustic signal, but hidden from us is the intention that created the signal. For instance, at time t1 we know what the signal looks like as data, but we don't know what the intended sound was (the phoneme, letter, or word).
The goal in speech recognition is to identify the actual utterance (in terms of phonetic units or words), but the phonemes/words are hidden from us.
We add hidden (unobservable) states to our model, with appropriate transition probabilities:
- the observables are not states in our network, but transition links
- the hidden states are the elements of the utterance (e.g., phonemes), which is what we are trying to identify
- we must search the HMM to determine which hidden state sequence best represents the input utterance

Hidden Markov models
An HMM is the simplest dynamic Bayesian network: one discrete hidden node and one discrete or continuous observed node per time slice.
The possible values of the hidden variable are the possible states of the world, e.g. Raint.
If there are more state variables, they are combined into a single "megavariable".