Abduction, Uncertainty, and Probabilistic Reasoning


Abduction, Uncertainty, and Probabilistic Reasoning Chapters 13, 14, and more

Introduction
- Abduction is a reasoning process that tries to form plausible explanations for abnormal observations
  - Abduction is distinct from deduction and induction
  - Abduction is inherently uncertain
- Uncertainty is an important issue in AI research
- Some major formalisms for representing and reasoning about uncertainty
  - Mycin’s certainty factor (an early representative)
  - Probability theory (esp. Bayesian networks)
  - Dempster-Shafer theory
  - Fuzzy logic
  - Truth maintenance systems

Abduction
- Definition (Encyclopedia Britannica): reasoning that derives an explanatory hypothesis from a given set of facts
  - The inference result is a hypothesis which, if true, could explain the occurrence of the given facts
- Example: Dendral, an expert system to construct the 3D structure of chemical compounds
  - Facts: mass spectrometer data of the compound and the chemical formula of the compound
  - KB: chemistry, esp. the strength of different types of bonds
  - Reasoning: form a hypothetical 3D structure that meets the given chemical formula and would most likely produce the given mass spectrum if subjected to electron beam bombardment

Medical diagnosis
- Facts: symptoms, lab test results, and other observed findings (called manifestations)
- KB: causal associations between diseases and manifestations
- Reasoning: find one or more diseases whose presence would causally explain the occurrence of the given manifestations
- Many other reasoning processes (e.g., word sense disambiguation in natural language processing, image understanding, detective’s work) can also be seen as abductive reasoning

Comparing abduction, deduction and induction
- Deduction: from A => B and A, conclude B
  - major premise: All balls in the box are black
  - minor premise: These balls are from the box
  - conclusion: These balls are black
- Abduction: from A => B and B, conclude possibly A
  - rule: All balls in the box are black
  - observation: These balls are black
  - explanation: These balls are from the box
- Induction: from "whenever A then B" (but not vice versa), conclude possibly A => B
  - case: These balls are from the box
  - hypothesized rule: All balls in the box are black
- Induction: from specific cases to general rules
- Abduction and deduction: both go from one part of a specific case to another part of the case using general rules (in different ways)

Characteristics of abductive reasoning
- Reasoning results are hypotheses, not theorems (they may be false even if rules and facts are true), e.g., misdiagnosis in medicine
- There may be multiple plausible hypotheses
  - Given rules A => B and C => B, and fact B, both A and C are plausible hypotheses
  - Abduction is inherently uncertain
  - Hypotheses can be ranked by their plausibility, if that can be determined
- Reasoning is often a hypothesize-and-test cycle
  - hypothesize phase: postulate possible hypotheses, each of which could explain the given facts (or explain most of the important facts)
  - test phase: test the plausibility of all or some of these hypotheses

Reasoning is non-monotonic
- One way to test a hypothesis H is to query whether something that is currently unknown but can be predicted from H is actually true
  - If we also know A => D and C => E, then ask if D and E are true
  - If it turns out D is true and E is false, then hypothesis A becomes more plausible (support for A increased, support for C decreased)
- Alternative hypotheses compete with each other (Occam’s razor)
- Non-monotonicity
  - Plausibility of hypotheses can increase/decrease as new facts are collected (deductive inference determines if a sentence is true but would never change its truth value)
  - Some hypotheses may be discarded/defeated, and new ones may be formed when new observations are made

Sources of Uncertainty
- Uncertain data (noise)
- Uncertain knowledge (e.g., causal relations)
  - A disorder may cause any and all POSSIBLE manifestations in a specific case
  - A manifestation can be caused by more than one POSSIBLE disorder
- Uncertain reasoning results
  - Abduction and induction are inherently uncertain
  - Default reasoning, even in deductive fashion, is uncertain
  - Incomplete deductive inference may be uncertain

Probabilistic Inference
- Based on probability theory (especially Bayes’ theorem)
  - Well established discipline about uncertain outcomes
  - Empirical science like physics/chemistry; can be verified by experiments
- Probability theory is too rigid to apply directly in many knowledge-based applications
  - Some assumptions have to be made to simplify the reality
  - Different formalisms have been developed in which some aspects of probability theory are changed/modified
- We will briefly review the basics of probability theory before discussing different approaches to uncertainty
- The presentation uses the diagnostic process (an abductive and evidential reasoning process) as an example

Probability of Events
- Sample space and events
  - Sample space S: (e.g., all people in an area)
  - Events E1 ⊆ S: (e.g., all people having cough); E2 ⊆ S: (e.g., all people having cold)
- Prior (marginal) probabilities of events
  - P(E) = |E| / |S| (frequency interpretation)
  - P(E) = 0.1 (subjective probability)
  - 0 <= P(E) <= 1 for all events
  - Two special events: ∅ and S, with P(∅) = 0 and P(S) = 1.0
- Boolean operators between events (to form compound events)
  - Conjunctive (intersection): E1 ^ E2 (E1 ∩ E2)
  - Disjunctive (union): E1 v E2 (E1 ∪ E2)
  - Negation (complement): ~E (E^C = S - E)

Probabilities of compound events
- P(~E) = 1 - P(E) because P(~E) + P(E) = 1
- P(E1 v E2) = P(E1) + P(E2) - P(E1 ^ E2)
- But how to compute the joint probability P(E1 ^ E2)?
- Conditional probability (of E1, given E2): P(E1 | E2) = P(E1 ^ E2) / P(E2), hence P(E1 ^ E2) = P(E1 | E2) P(E2)
  - How likely E1 is to occur in the subspace of E2

Independence assumption
- Two events E1 and E2 are said to be independent of each other if P(E1 | E2) = P(E1) (given E2 does not change the likelihood of E1)
- Computation can be simplified with independent events: P(E1 ^ E2) = P(E1) P(E2)
- Mutually exclusive (ME) and exhaustive (EXH) set of events E1, ..., En
  - ME: P(Ei ^ Ej) = 0 for all i ≠ j
  - EXH: P(E1 v ... v En) = 1
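
The definitions on the last three slides can be checked on a tiny finite sample space. The following is a minimal sketch under the frequency interpretation; the 20-person population and the two event sets are made-up illustration data, not taken from the slides.

```python
from fractions import Fraction

# Hypothetical sample space: 20 people, identified by number (assumption).
S = set(range(20))
cough = {0, 1, 2, 3, 4, 5, 6, 7}   # E1: people having cough (made-up data)
cold = {4, 5, 6, 7, 8, 9}          # E2: people having cold (made-up data)

def P(event):
    """Frequency interpretation: P(E) = |E| / |S|."""
    return Fraction(len(event), len(S))

def P_given(e1, e2):
    """Conditional probability P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
    return P(e1 & e2) / P(e2)

# Union rule: P(E1 v E2) = P(E1) + P(E2) - P(E1 ^ E2)
p_union = P(cough) + P(cold) - P(cough & cold)

# Independence test: P(E1 ^ E2) == P(E1) P(E2)?
independent = P(cough & cold) == P(cough) * P(cold)

print(p_union, P_given(cough, cold), independent)
```

Exact fractions are used so that the equalities can be tested without floating-point tolerance. In this made-up population cough and cold are not independent, since knowing that someone has a cold raises the probability of a cough.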

Bayes’ Theorem
- In the setting of diagnostic/evidential reasoning
  - Know: the prior probability of each hypothesis, P(Hi), and the conditional probability P(Ej | Hi)
  - Want to compute: the posterior probability P(Hi | Ej)
- Bayes’ theorem (formula 1): P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)
- If the purpose is to find which of the n hypotheses H1, ..., Hn is more plausible given Ej, then we can ignore the denominator and rank them using the relative likelihood rel(Hi | Ej) = P(Ej | Hi) P(Hi)

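- P(Ej) can be computed from P(Ej | Hi) and P(Hi), if we assume all hypotheses H1, ..., Hn are ME and EXH: P(Ej) = Σ_i P(Ej | Hi) P(Hi)
- Then we have another version of Bayes’ theorem: P(Hi | Ej) = P(Ej | Hi) P(Hi) / Σ_k P(Ej | Hk) P(Hk)
  - where Σ_k P(Ej | Hk) P(Hk), the sum of the relative likelihoods of all n hypotheses, is a normalization factor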

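Probabilistic Inference for simple diagnostic problems
- Knowledge base: hypotheses H1, ..., Hn with priors P(Hi); pieces of evidence E1, ..., Em with conditional probabilities P(Ej | Hi)
- Case input: the observed evidence E1, ..., El
- Find the hypothesis Hi with the highest posterior probability P(Hi | E1, ..., El)
- By Bayes’ theorem: P(Hi | E1, ..., El) = P(E1, ..., El | Hi) P(Hi) / P(E1, ..., El)
- Assume all pieces of evidence are conditionally independent, given any hypothesis: P(E1, ..., El | Hi) = Π_j P(Ej | Hi)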

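- The relative likelihood: rel(Hi | E1, ..., El) = P(Hi) Π_j P(Ej | Hi)
- The absolute posterior probability: P(Hi | E1, ..., El) = rel(Hi | E1, ..., El) / Σ_k rel(Hk | E1, ..., El)
- Evidence accumulation (when new evidence E(l+1) is discovered): rel(Hi | E1, ..., El, E(l+1)) = P(E(l+1) | Hi) rel(Hi | E1, ..., El)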
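
The simple Bayesian diagnostic scheme can be sketched in a few lines. The hypotheses, the evidence names and every number below are hypothetical illustration values (the slides give no knowledge base); the code assumes ME and EXH hypotheses and conditionally independent evidence, as stated on the slides.

```python
from math import prod

# Hypothetical knowledge base: priors P(Hi) and likelihoods P(Ej | Hi).
# Hypotheses are assumed mutually exclusive and exhaustive.
priors = {"cold": 0.6, "flu": 0.3, "allergy": 0.1}
likelihood = {
    "cold":    {"cough": 0.8, "fever": 0.1},
    "flu":     {"cough": 0.7, "fever": 0.9},
    "allergy": {"cough": 0.3, "fever": 0.05},
}

def relative_likelihood(evidence):
    """rel(Hi | E1..El) = P(Hi) * product over j of P(Ej | Hi)."""
    return {h: priors[h] * prod(likelihood[h][e] for e in evidence)
            for h in priors}

def posterior(evidence):
    """Normalize the relative likelihoods so they sum to one."""
    rel = relative_likelihood(evidence)
    z = sum(rel.values())              # normalization factor
    return {h: r / z for h, r in rel.items()}

def accumulate(rel, new_evidence):
    """Evidence accumulation: multiply in P(E_new | Hi) for each hypothesis."""
    return {h: r * likelihood[h][new_evidence] for h, r in rel.items()}

post = posterior(["cough", "fever"])
best = max(post, key=post.get)
print(best, post)
```

Ranking only needs `relative_likelihood`; the normalization in `posterior` is required only when absolute probabilities are wanted. `accumulate` shows that a new finding updates the old relative likelihoods without revisiting earlier evidence.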

Assessment of Assumptions
- Assumption 1: hypotheses are mutually exclusive and exhaustive
  - Single fault assumption (one and only one hypothesis must be true)
  - Multi-faults do exist in individual cases
  - Can be viewed as an approximation of situations where hypotheses are independent of each other and their prior probabilities are very small
- Assumption 2: pieces of evidence are conditionally independent of each other, given any hypothesis
  - Manifestations themselves are not independent of each other; they are correlated by their common causes
  - Reasonable under the single fault assumption
  - Not so when multi-faults are to be considered

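Limitations of the simple Bayesian system
- Cannot handle hypotheses of multiple disorders well
  - Suppose H1, ..., Hn are independent of each other
  - Consider a composite hypothesis H1 ^ H2
  - How to compute the posterior probability (or relative likelihood) P(H1 ^ H2 | E1, ..., El)?
  - Using Bayes’ theorem: P(H1 ^ H2 | E1, ..., El) = P(E1, ..., El | H1 ^ H2) P(H1 ^ H2) / P(E1, ..., El)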

- Assuming that the hypotheses remain independent once the evidence is given would simplify the computation, but this is a very unreasonable assumption
  - Ex. E: earthquake, B: burglar, A: alarm set off
  - E and B are independent
  - But when A is given, they are (adversely) dependent because they become competitors to explain A: P(B | A, E) << P(B | A)
- Cannot handle causal chaining
  - Ex. A: weather of the year, B: cotton production of the year, C: cotton price of next year
  - Observed: A influences C
  - The influence is not direct (A -> B -> C)
  - P(C | B, A) = P(C | B): instantiation of B blocks the influence of A on C
- Need a better representation and a better assumption

Bayesian Belief Networks (BNs)
- Definition: BN = (DAG, CPD)
  - DAG: directed acyclic graph (BN’s structure)
    - Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)
    - Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies conditional independence)
  - CPD: conditional probability distribution (BN’s parameters)
    - Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT): P(xi | πi), where πi is the set of parents of xi
    - Root nodes are a special case: they have no parents, so just use the priors P(xi) in the CPD

Example BN
- Structure: A -> B, A -> C, B -> D, C -> D, C -> E
- CPDs
  - P(a) = 0.001
  - P(b|a) = 0.3, P(b|¬a) = 0.001
  - P(c|a) = 0.2, P(c|¬a) = 0.005
  - P(d|b,c) = 0.1, P(d|b,¬c) = 0.01, P(d|¬b,c) = 0.01, P(d|¬b,¬c) = 0.00001
  - P(e|c) = 0.4, P(e|¬c) = 0.002
- Uppercase: variables (A, B, …)
- Lowercase: values/states of variables (A has two states, a and ¬a)
- Note that we only specify P(a) etc., not P(¬a), since they have to add to one

Conditional independence and chaining
- Conditional independence assumption: P(xi | πi, q) = P(xi | πi), where q is any set of variables (nodes) other than xi and its successors
  - πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)
- With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) local CPDs by chaining these CPDs: P(x1, ..., xn) = Π_i P(xi | πi)

Chaining: Example
- Computing the joint probability for all variables is easy. The joint distribution of all variables:
  P(A, B, C, D, E)
  = P(E | A, B, C, D) P(A, B, C, D)    by the product rule
  = P(E | C) P(A, B, C, D)             by cond. indep. assumption
  = P(E | C) P(D | A, B, C) P(A, B, C)
  = P(E | C) P(D | B, C) P(C | A, B) P(A, B)
  = P(E | C) P(D | B, C) P(C | A) P(B | A) P(A)
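
The chained factorization can be evaluated directly. This is a minimal sketch using the CPT values of the example BN; it assumes that the second value in each repeated pair on the example slide belongs to the negated parent state (e.g. P(b|¬a) = 0.001), and the helper names are illustrative.

```python
from itertools import product

# CPTs of the example network (A -> B, A -> C, {B, C} -> D, C -> E).
# Assumption: the repeated entries on the slide refer to negated parents.
P_A = 0.001
P_B = {True: 0.3, False: 0.001}                         # P(b | A)
P_C = {True: 0.2, False: 0.005}                         # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}    # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                         # P(e | C)

def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(E|C) P(D|B,C) P(C|A) P(B|A) P(A)."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

# The 32 atomic events must sum to one.
total = sum(joint(*world) for world in product([True, False], repeat=5))
p_all_true = joint(True, True, True, True, True)
print(total, p_all_true)
```

Only 10 numbers are stored, whereas a full joint table over five binary variables would need 31.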

Topological semantics
- A node is conditionally independent of its non-descendants given its parents
- A node is conditionally independent of all other nodes in the network given its parents, children, and children’s parents (also known as its Markov blanket)
- The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z
- Three basic connections among nodes A, B, C
  - Chain: A and C are independent, given B
  - Converging: B and C are independent, NOT given A
  - Diverging: B and C are independent, given A

Inference tasks
- Simple queries: compute the posterior marginal P(Xi | E=e)
  - E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
  - Posteriors for ALL nonevidence nodes
  - Priors for any/all nodes (E = ∅)
- Conjunctive queries: P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
- Optimal decisions: decision networks or influence diagrams include utility information and actions; probabilistic inference is required to find P(outcome | action, evidence)
- Value of information: Which evidence should we seek next?
- Sensitivity analysis: Which probability values are most critical?
- Explanation: Why do I need a new starter motor?

MAP problems (explanation) The solution provides a good explanation for your action This is an optimization problem

Approaches to inference
- Exact inference
  - Enumeration
  - Variable elimination
  - Belief propagation in polytrees (singly connected BNs)
  - Clustering / join tree algorithms
- Approximate inference
  - Stochastic simulation / sampling methods
  - Markov chain Monte Carlo methods
  - Loopy propagation
  - Mean field theory
  - Simulated annealing
  - Genetic algorithms
  - Neural networks

Direct inference with BNs
- Instead of computing the joint, suppose we just want the probability for one variable
- Exact methods of computation
  - Enumeration
  - Variable elimination
  - Belief propagation (for polytrees only)
  - Join/junction trees: clustering closely associated nodes into a big node in the JT

Inference by enumeration
- Add all of the terms (atomic event probabilities) from the full joint distribution
- If E are the evidence (observed) variables and Y are the other (unobserved) variables, excluding X, then the posterior distribution is P(X | E=e) = α P(X, e) = α Σ_y P(X, e, y)
  - The sum is over all possible instantiations of the variables in Y
- Each P(X, e, y) term can be computed using the chain rule
- Computationally expensive!

Example: Enumeration
- P(xi) = Σ_πi P(xi | πi) P(πi)
- Suppose we want P(D), and only the value of E is given as true
- P(d | e) = α Σ_A Σ_B Σ_C P(a, b, c, d, e) = α Σ_A Σ_B Σ_C P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
- With simple iteration to compute this expression, there’s going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C for all possible assignments of A and B)
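
The enumeration query P(D | e) can be sketched as follows. The CPT values are those of the example BN, read under the assumption that the repeated entries on that slide refer to negated parent states; the function names are illustrative.

```python
from itertools import product

# Example BN CPTs (assumption: repeated slide entries are for negated parents).
P_A = 0.001
P_B = {True: 0.3, False: 0.001}                         # P(b | A)
P_C = {True: 0.2, False: 0.005}                         # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}    # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                         # P(e | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

def enumerate_d_given_e(e=True):
    """P(D | E=e) = alpha * sum over A, B, C of the full joint."""
    unnorm = {d: sum(joint(a, b, c, d, e)
                     for a, b, c in product([True, False], repeat=3))
              for d in (True, False)}
    alpha = 1.0 / sum(unnorm.values())     # normalization constant
    return {d: alpha * p for d, p in unnorm.items()}

dist = enumerate_d_given_e(True)
print(dist[True])
```

Every one of the 8 assignments to (A, B, C) recomputes the whole product, which is the repetition the slide points out.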

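Exercise: Enumeration
- Network: smart -> prepared, study -> prepared; smart -> pass, prepared -> pass, fair -> pass
- Priors: p(smart) = .8, p(study) = .6, p(fair) = .9
- p(prep | smart, study):
              smart   ¬smart
    study      .9      .7
    ¬study     .5      .1
- p(pass | smart, prep, fair), with columns over smart/¬smart and prep/¬prep: fair: .9 .7 .2; ¬fair: .1
- Query: What is the probability that a student studied, given that they pass the exam?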

Variable elimination
- Basically just enumeration, but with caching of local calculations
- Linear for polytrees
- Potentially exponential for multiply connected BNs
  - Exact inference in Bayesian networks is NP-hard!

Variable elimination
- General idea: write the query in the form P(X, e) = Σ_xk ... Σ_x3 Σ_x2 Π_i P(xi | πi)
- Iteratively
  - Move all irrelevant terms outside of the innermost sum
  - Perform the innermost sum, getting a new term
  - Insert the new term into the product

Variable elimination
- Example: Σ_A Σ_B Σ_C P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)    (8 x 4 multiplications)
  = Σ_A Σ_B P(a) P(b|a) Σ_C P(c|a) P(d|b,c) P(e|c)
  = Σ_A P(a) Σ_B P(b|a) Σ_C P(c|a) P(d|b,c) P(e|c)    (8 x 2 + 4 x 2 + 2 = 26 multiplications)
- Procedure
    for each state of A = a
      for each state of B = b
        compute fC(a, b) = Σ_C P(c|a) P(d|b,c) P(e|c)    (variable C is summed out)
      compute fB(a) = Σ_B P(b|a) fC(a, b)                (variable B is summed out)
    compute result = Σ_A P(a) fB(a)
- Here fC(a, b) and fB(a) are called factors, which are vectors or matrices
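
The factor computation above can be sketched and checked against brute-force summation. The CPT values are those of the example BN (assuming the repeated slide entries refer to negated parents); the elimination order C, B, A follows the slide, and the function names are illustrative.

```python
from itertools import product

VALS = (True, False)

# Example BN CPTs (assumption: repeated slide entries are for negated parents).
P_A = 0.001
P_B = {True: 0.3, False: 0.001}                         # P(b | A)
P_C = {True: 0.2, False: 0.005}                         # P(c | A)
P_D = {(True, True): 0.1, (True, False): 0.01,
       (False, True): 0.01, (False, False): 0.00001}    # P(d | B, C)
P_E = {True: 0.4, False: 0.002}                         # P(e | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def eliminate(d, e):
    """P(d, e) via factors: sum out C, then B, then A."""
    # f_C(a, b) = sum_c P(c|a) P(d|b,c) P(e|c)
    f_C = {(a, b): sum(bern(P_C[a], c) * bern(P_D[(b, c)], d) * bern(P_E[c], e)
                       for c in VALS)
           for a in VALS for b in VALS}
    # f_B(a) = sum_b P(b|a) f_C(a, b)
    f_B = {a: sum(bern(P_B[a], b) * f_C[(a, b)] for b in VALS) for a in VALS}
    # result = sum_a P(a) f_B(a)
    return sum(bern(P_A, a) * f_B[a] for a in VALS)

def brute_force(d, e):
    """Same quantity by plain enumeration of the full joint."""
    return sum(bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
               * bern(P_D[(b, c)], d) * bern(P_E[c], e)
               for a, b, c in product(VALS, repeat=3))

p_d_e = eliminate(True, True)
p_d_given_e = p_d_e / (p_d_e + eliminate(False, True))
print(p_d_given_e)
```

Both functions return the same number; the difference is that `eliminate` computes each inner sum once per (a, b) pair and reuses it.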

Variable elimination: Example
- Sprinkler network (figure) with nodes Cloudy, Sprinkler, Rain, WetGrass

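Computing factors
- f1(R, S) = Σ_c P(R|c) P(S|c) P(c)
- Intermediate table (figure): columns R, S, C, P(R|C), P(S|C), P(C), and the product P(R|C) P(S|C) P(C), one row per T/F assignment
- Result table (figure): columns R, S, f1(R, S), one row per T/F assignment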

Variable elimination algorithm
- Let X1, …, Xm be an ordering on the non-query variables
- For i = m, …, 1
  - Leave in the summation for Xi only factors mentioning Xi
  - Multiply the factors, getting a factor that contains a number for each value of the variables mentioned, including Xi
  - Sum out Xi, getting a factor f that contains a number for each value of the variables mentioned, not including Xi
  - Replace the multiplied factor in the summation

Complexity of variable elimination
- Suppose in one elimination step we compute f(y1, …, yk) = Σ_x f1(x, …) · … · fm(x, …), a sum over x of a product of m factors over x and y1, …, yk
- This requires m · |Val(X)| · Π_i |Val(Yi)| multiplications (for each value of x, y1, …, yk, we do m multiplications) and |Val(X)| · Π_i |Val(Yi)| additions (for each value of y1, …, yk, we do |Val(X)| additions)
- Complexity is exponential in the number of variables in the intermediate factors
- Finding an optimal ordering is NP-hard

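Exercise: Variable elimination
- Same student network as in the enumeration exercise: p(smart) = .8, p(study) = .6, p(fair) = .9
- p(prep | smart, study) = .9, p(prep | ¬smart, study) = .7, p(prep | smart, ¬study) = .5, p(prep | ¬smart, ¬study) = .1
- p(pass | smart, prep, fair), with columns over smart/¬smart and prep/¬prep: fair: .9 .7 .2; ¬fair: .1
- Query: What is the probability that a student is smart, given that they pass the exam?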

Belief Propagation
- Singly connected network, SCN (also known as polytree)
  - There is at most one undirected path between any two nodes (i.e., the network is a tree if the directions of the arcs are ignored)
  - The influence of the instantiated variable (evidence, e.g. E = e) spreads to the rest of the network along the arcs
    - The instantiated variable influences its predecessors and successors differently (using CPTs along opposite directions)
    - Computation is linear in the diameter of the network (the longest undirected path)
  - Update the belief (posterior) of every non-evidence node in one pass
- For multiply connected nets: conditioning

Conditioning
- Find the network’s smallest cutset S (a set of nodes whose removal renders the network singly connected)
  - In this network, S = {A} or {B} or {C} or {D}
- For each instantiation of S, compute the belief update with the belief propagation algorithm
- Combine the results from all instantiations of S (each is weighted by P(S = s))
- Computationally expensive (finding the smallest cutset is in general NP-hard, and the total number of possible instantiations of S is O(2^|S|))

Junction Tree
- Convert a BN to a junction tree
  - Moralization: add an undirected edge between every pair of parents of a common child, then drop the directions of all arcs (moralized graph)
  - Triangulation: add a chord to any cycle of length > 3 (triangulated graph)
  - A junction tree is a tree of cliques of the triangulated graph
  - Cliques are connected by links
    - A link stands for the set of all variables S shared by the two cliques it connects
  - Each clique has a CPT, constructed from the CPTs of the variables in the original BN
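
The moralization step can be illustrated on the five-node example BN. This is a minimal sketch that covers moralization only (no triangulation or clique extraction); the parent lists are read off the example BN's CPDs, and the function name is illustrative.

```python
from itertools import combinations

# Parent lists of the example BN: A -> B, A -> C, {B, C} -> D, C -> E.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}

def moralize(parents):
    """Return the undirected edge set of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))   # drop the arc direction
        for p, q in combinations(ps, 2):
            edges.add(frozenset((p, q)))       # connect co-parents
    return edges

moral = moralize(parents)
print(sorted("".join(sorted(e)) for e in moral))
```

For this network the only added edge is B-C (the two parents of D). That edge is also a chord of the cycle A-B-D-C, so the moral graph here happens to be triangulated already.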

Junction Tree Reasoning
- Since it is now a tree, the polytree algorithm can be applied, but now two cliques exchange P(S), the distribution of S
- Complexity
  - O(n) steps, where n is the number of cliques
  - Each step is expensive if cliques are large (CPT size is exponential in the clique size)
- Construction of the CPTs of the JT is expensive as well, but they need to be computed only once

Approximate inference: Direct sampling
- Suppose you are given values for some subset of the variables, E, and want to infer values for unknown variables, Z
- Randomly generate a very large number of instantiations from the BN
  - Generate instantiations for all variables: start at the root variables and work your way “forward” in topological order
- Rejection sampling: only keep those instantiations that are consistent with the values for E
- Use the frequency of values for Z to get estimated probabilities
- Accuracy of the results depends on the size of the sample (asymptotically approaches exact results)
- Very expensive
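
Rejection sampling can be sketched on the sprinkler network (Cloudy -> Sprinkler, Cloudy -> Rain, {Sprinkler, Rain} -> WetGrass). The slides give no CPTs for that network, so the textbook-style numbers below are assumptions, as are the function names and the fixed seed.

```python
import random

def sample_world(rng):
    """Forward-sample all variables in topological order C, S, R, W."""
    # All probabilities below are assumed textbook-style values.
    c = rng.random() < 0.5
    s = rng.random() < (0.1 if c else 0.5)
    r = rng.random() < (0.8 if c else 0.2)
    p_w = {(True, True): 0.99, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.0}[(s, r)]
    w = rng.random() < p_w
    return {"C": c, "S": s, "R": r, "W": w}

def rejection_sample(query, evidence, n, seed=0):
    """Estimate P(query=True | evidence) from the samples that match evidence."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        world = sample_world(rng)
        if all(world[var] == val for var, val in evidence.items()):
            kept += 1                 # sample is consistent with the evidence
            hits += world[query]      # count samples where the query is True
    return hits / kept, kept

estimate, kept = rejection_sample("R", {"S": True}, 100_000)
print(estimate, kept)
```

With these assumed CPTs the exact answer is P(Rain | Sprinkler = true) = 0.09 / 0.3 = 0.3, and roughly 70% of the generated samples are thrown away. Rarer evidence wastes even more, which is why the slide calls the method very expensive.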

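Exercise: Direct sampling
- Same student network as in the enumeration exercise: p(smart) = .8, p(study) = .6, p(fair) = .9
- p(prep | smart, study) = .9, p(prep | ¬smart, study) = .7, p(prep | smart, ¬study) = .5, p(prep | ¬smart, ¬study) = .1
- p(pass | smart, prep, fair), with columns over smart/¬smart and prep/¬prep: fair: .9 .7 .2; ¬fair: .1
- Topological order = …?
- Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42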

Likelihood weighting
- Idea: don’t generate samples that need to be rejected in the first place!
- Sample only from the unknown variables Z
- Weight each sample according to the likelihood that it would occur, given the evidence E
  - A weight w is associated with each sample (w initialized to 1)
  - When an evidence node (say E = e) is selected for sampling, its parents are already sampled (say parents A and B are assigned states a and b)
  - Modify w = w * P(e | a, b) based on E’s CPT
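
Likelihood weighting can be sketched on the sprinkler network. The CPT numbers are assumed textbook-style values (the slides do not give them), and the function names and seed are illustrative.

```python
import random

ORDER = ["C", "S", "R", "W"]   # topological order

def p_true(var, world):
    """P(var = True | parents); all numbers are assumed textbook-style values."""
    if var == "C":
        return 0.5
    if var == "S":
        return 0.1 if world["C"] else 0.5
    if var == "R":
        return 0.8 if world["C"] else 0.2
    return {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.9, (False, False): 0.0}[(world["S"], world["R"])]

def weighted_sample(evidence, rng):
    """Fix evidence variables, sample the rest, and accumulate the weight."""
    world, weight = {}, 1.0
    for var in ORDER:
        p = p_true(var, world)            # parents are already assigned
        if var in evidence:
            world[var] = evidence[var]
            weight *= p if evidence[var] else 1.0 - p   # w = w * P(e | parents)
        else:
            world[var] = rng.random() < p
    return world, weight

def likelihood_weighting(query, evidence, n, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        world, weight = weighted_sample(evidence, rng)
        den += weight
        if world[query]:
            num += weight
    return num / den

estimate = likelihood_weighting("R", {"S": True, "W": True}, 100_000)
print(estimate)
```

No sample is discarded; with the assumed CPTs the exact value of P(Rain | Sprinkler = true, WetGrass = true) is 0.0891 / 0.2781, about 0.32.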

Markov chain Monte Carlo algorithm
- So called because
  - Markov chain: each instance generated in the sample is dependent on the previous instance
  - Monte Carlo: statistical sampling method
- Perform a random walk through the variable assignment space, collecting statistics as you go
  - Start with a random instantiation, consistent with the evidence variables
  - At each step, for some nonevidence variable x, randomly sample its value from P(x | mb(x)), its distribution given the current values of its Markov blanket
- Given enough samples, MCMC gives an accurate estimate of the true distribution of values (no need for importance sampling because of the Markov blanket)
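
A Gibbs-sampling variant of MCMC can be sketched on the sprinkler network. The CPT numbers are assumed textbook-style values, and the burn-in length, seed and function names are illustrative choices. For simplicity the sketch resamples each variable in proportion to the full joint; terms outside the Markov blanket cancel, so this equals sampling from P(x | mb(x)).

```python
import random

ORDER = ["C", "S", "R", "W"]

def p_true(var, world):
    """P(var = True | parents); all numbers are assumed textbook-style values."""
    if var == "C":
        return 0.5
    if var == "S":
        return 0.1 if world["C"] else 0.5
    if var == "R":
        return 0.8 if world["C"] else 0.2
    return {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.9, (False, False): 0.0}[(world["S"], world["R"])]

def joint(world):
    """Product of the local CPT entries for a full assignment."""
    result = 1.0
    for var in ORDER:
        p = p_true(var, world)
        result *= p if world[var] else 1.0 - p
    return result

def gibbs(query, evidence, n, burn_in=1000, seed=0):
    rng = random.Random(seed)
    state = dict(evidence)
    hidden = [v for v in ORDER if v not in evidence]
    for v in hidden:                      # random start, consistent with evidence
        state[v] = rng.random() < 0.5
    hits = 0
    for step in range(burn_in + n):
        for v in hidden:                  # resample each nonevidence variable
            state[v] = True
            p_t = joint(state)
            state[v] = False
            p_f = joint(state)
            state[v] = rng.random() < p_t / (p_t + p_f)
        if step >= burn_in:
            hits += state[query]
    return hits / n

estimate = gibbs("R", {"S": True, "W": True}, 20_000)
print(estimate)
```

Successive states are correlated, so more steps are needed than with independent samples for the same accuracy. With the assumed CPTs the estimate should settle near 0.32.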

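Exercise: MCMC sampling
- Same student network as in the enumeration exercise: p(smart) = .8, p(study) = .6, p(fair) = .9
- p(prep | smart, study) = .9, p(prep | ¬smart, study) = .7, p(prep | smart, ¬study) = .5, p(prep | ¬smart, ¬study) = .1
- p(pass | smart, prep, fair), with columns over smart/¬smart and prep/¬prep: fair: .9 .7 .2; ¬fair: .1
- Topological order = …?
- Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42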

Loopy Propagation
- Belief propagation
  - Works only for polytrees (exact solution)
  - Each piece of evidence propagates once throughout the network
- Loopy propagation
  - Let the propagation continue until the network stabilizes (one hopes)
- Experiments show
  - Many BNs stabilize with loopy propagation
  - If the network stabilizes, it often yields exact or very good approximate solutions
- Analysis
  - Conditions for convergence and for the quality of the approximation are under intense investigation

Noisy-Or BN
- A special BN of binary variables (Peng & Reggia, Cooper)
- Causation independence: parent nodes influence a child independently
- Advantages
  - One-to-one correspondence between causal links and causal strengths
  - Easy for humans to understand (acquire and evaluate KB)
  - Fewer probabilities needed in KB
  - Computation is less expensive
- Disadvantage: less expressive (less general)
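
The slide does not spell out the noisy-or combination rule. The sketch below uses the standard leak-free form, in which an effect fails to occur only if every present cause independently fails to produce it. The causes and causal strengths are hypothetical.

```python
from math import prod

# Hypothetical causal strengths c_i = P(effect | only cause i is present).
strengths = {"flu": 0.6, "cold": 0.4, "allergy": 0.2}

def noisy_or(strengths, present):
    """P(effect | present causes) = 1 - product of (1 - c_i) over present causes."""
    return 1.0 - prod(1.0 - strengths[c] for c in present)

p_none = noisy_or(strengths, [])
p_two = noisy_or(strengths, ["flu", "cold"])

# Parameters needed for a child with n binary parents:
n = len(strengths)
noisy_or_params = n          # one strength per causal link
full_cpt_rows = 2 ** n       # one row per parent configuration

print(p_none, p_two, noisy_or_params, full_cpt_rows)
```

The parameter count is what the slide means by "fewer probabilities needed": the number grows linearly with the number of parents instead of exponentially.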

Learning BN (from case data)
- Need for learning
  - Experts’ opinions are often biased, inaccurate, and incomplete
  - Large databases of cases become available
- What to learn
  - Parameter learning: learning CPTs when the DAG is known (easy)
  - Structural learning: learning the DAG (hard)
- Difficulties in learning a DAG from case data
  - There are too many possible DAGs when the number of variables is large (more than exponential)
      n     # of possible DAGs
      3     25
      10    4*10^18
  - Missing values in the database
  - Noisy data
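
The DAG counts in the table can be reproduced with Robinson's recurrence for the number of labeled DAGs on n nodes. The recurrence is not given on the slide; it is brought in here only to check the two figures.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming arcs.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in (3, 10):
    print(n, num_dags(n))
```

The value for n = 10 is about 4.2 * 10^18, in line with the slide, which shows why exhaustive search over structures is not an option.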

BN Learning Approaches
- Early effort: based on variable dependencies (Pearl)
  - Find all pairs of variables that are dependent on each other (applying standard statistical methods to the database)
  - Eliminate (as much as possible) indirect dependencies
  - Determine the directions of the dependencies
  - Learning results are often incomplete (the learned BN contains indirect dependencies and undirected links)

BN Learning Approaches
- Bayesian approach (Cooper)
  - Find the most probable DAG, given database DB, i.e., max(P(DAG|DB)) or max(P(DAG, DB))
  - Based on some assumptions, a formula is developed to compute P(DAG, DB) for a given pair of DAG and DB
  - A hill-climbing algorithm (K2) is developed to search for a (sub)optimal DAG
  - Extensions handle some forms of missing values

BN Learning Approaches
- Minimum description length (MDL) (Lam, etc.)
  - Sacrifices accuracy for a simpler (less dense) structure
    - Case data are not always accurate
    - Fewer links imply smaller CPD tables and less expensive inference
  - L = L1 + L2, where
    - L1: the length of the encoding of the DAG (smaller for a simpler DAG)
    - L2: the length of the encoding of the difference between DAG and DB (smaller for a better match of DAG with DB)
    - Smaller L2 implies a more accurate (and more complex) DAG, and thus larger L1
  - Find the DAG by heuristic best-first search that minimizes L

BN Learning Approaches Neural network approach (Neal, Peng) For noisy-or BN

Compare Neural network approach with K2
    # cases   missing links   extra links   time
    500       2/0             2/6           63.76/5.91
    1000      0/0             1/1           69.62/6.04
    2000                                    77.45/5.86
    10000                                   161.97/5.83

Current research in BN
- Missing data
  - Missing values: EM (expectation maximization)
  - Missing (hidden) variables are harder to handle
- BN with time
  - Dynamic BN: assuming the temporal relations obey a Markov chain
- Cyclic relations
  - Often found in social-economic analysis
  - Using dynamic BN?
- Continuous variables
  - Some work on variables obeying a Gaussian distribution
- Connecting to other fields
  - Databases
  - Statistics
  - Symbolic AI

Other formalisms for Uncertainty
- Fuzzy sets and fuzzy logic
  - Ordinary set theory
  - There are sets that are described by vague linguistic terms (sets without hard, clearly defined boundaries), e.g., tall-person, fast-car
    - Continuous
    - Subjective (context dependent)
    - Hard to define a clear-cut 0/1 membership function

Fuzzy set theory
- height(john) = 6’5”, Tall(john) = 0.9
- height(harry) = 5’8”, Tall(harry) = 0.5
- height(joe) = 5’1”, Tall(joe) = 0.1
- Examples of membership functions (figure): set of teenagers (membership between 0 and 1, ages 12 to 19), set of young people, set of mid-age people; age axis marked at 20, 35, 50, 65, 80

Fuzzy logic: many-value logic
- Fuzzy predicates (degree of truth)
- Connectors/Operators
- Compare with probability theory
  - Probability: uncertainty of outcome
    - Based on a large number of repetitions or instances
    - For each experiment (instance), the outcome is either true or false (without uncertainty or ambiguity): unsure before it happens but sure after it happens
  - Fuzzy: vagueness of conceptual/linguistic characteristics
    - Unsure even after it happens, e.g., whether a child of a tall mother and a short father is tall: unsure before the child is born, and still unsure after the child has grown up (height = 5’6”)

- Empirical vs subjective (testable vs agreeable)
- Fuzzy set connectors may lead to unreasonable results
  - Consider two events A and B with P(A) < P(B)
  - If A => B (or A ⊆ B) then
    - P(A ^ B) = P(A) = min{P(A), P(B)}
    - P(A v B) = P(B) = max{P(A), P(B)}
  - Not the case in general
    - P(A ^ B) = P(A) P(B|A) ≤ P(A)
    - P(A v B) = P(A) + P(B) - P(A ^ B) ≥ P(B)
    - (equality holds only if P(B|A) = 1, i.e., A => B)
- Something prob. theory cannot represent
  - Tall(john) = 0.9, ~Tall(john) = 0.1
  - Tall(john) ^ ~Tall(john) = min{0.1, 0.9} = 0.1: john’s degree of membership in the fuzzy set of “median-height people” (both Tall and not-Tall)
  - In prob. theory: P(john ∈ Tall ^ john ∉ Tall) = 0
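
The contrast between the fuzzy connectors and the probabilistic rules can be made concrete. In this sketch the fuzzy operators are min, max and 1 - x, matching the min/max usage on the slide; the probability values for A and B are hypothetical.

```python
def f_and(x, y):
    """Fuzzy conjunction: minimum of the two membership degrees."""
    return min(x, y)

def f_or(x, y):
    """Fuzzy disjunction: maximum of the two membership degrees."""
    return max(x, y)

def f_not(x):
    """Fuzzy negation."""
    return 1.0 - x

# Fuzzy: john is Tall to degree 0.9, so he is "Tall and not Tall" to degree 0.1.
tall_john = 0.9
tall_and_not_tall = f_and(tall_john, f_not(tall_john))

# Probability (hypothetical numbers): P(A) < P(B), and A does not imply B.
p_a, p_b, p_b_given_a = 0.3, 0.5, 0.6
p_and = p_a * p_b_given_a            # P(A ^ B) = P(A) P(B|A)
p_or = p_a + p_b - p_and             # P(A v B)

print(tall_and_not_tall, p_and, f_and(p_a, p_b), p_or, f_or(p_a, p_b))
```

With P(B|A) < 1 the probabilistic conjunction falls strictly below min and the disjunction strictly above max, which is the "not the case in general" point on the slide.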

Uncertainty in rule-based systems
- Elements in Working Memory (WM) may be uncertain because
  - Case input (initial elements in WM) may be uncertain
    - Ex: the CD-Drive does not work 70% of the time
  - A decision from a rule application may be uncertain even if the rule’s conditions are met by WM with certainty
    - Ex: flu => sore throat with high probability
- Combining symbolic rules with numeric uncertainty: Mycin’s Certainty Factor (CF)
  - An early attempt to incorporate uncertainty into KB systems
  - CF ∈ [-1, 1]
  - Each element in WM is associated with a CF: the certainty of that assertion
  - Each rule C1, ..., Cn => Conclusion is associated with a CF: the certainty of the association (between C1, ..., Cn and Conclusion)

CF propagation
- Within a rule: each Ci has CFi, then the certainty of the conclusion is min{CF1, ..., CFn} * CF-of-the-rule
- When more than one rule can apply to the current WM for the same Conclusion with different CFs, the largest of these CFs will be assigned as the CF for Conclusion
- Similar to the fuzzy rules for conjunctions and disjunctions
Good things of Mycin’s CF method
- Easy to use
- CF operations are reasonable in many applications
- Probably the only method for uncertainty used in real-world rule-based systems
Limitations
- It is in essence an ad hoc method (it can be viewed as a probabilistic inference system with some strong, sometimes unreasonable assumptions)
- May produce counter-intuitive results
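
The propagation rules as stated on this slide (min within a rule, max across rules) can be sketched as follows. The rules and CF values are hypothetical, and the sketch follows the slide's description rather than any particular Mycin implementation.

```python
def rule_cf(condition_cfs, rule_strength):
    """CF of a conclusion from one rule: min{CF1..CFn} * CF-of-the-rule."""
    return min(condition_cfs) * rule_strength

def combine_rules(cfs):
    """Several rules support the same conclusion: take the largest CF."""
    return max(cfs)

# Hypothetical working memory and rules for the same conclusion.
cf_rule1 = rule_cf([0.8, 0.6], rule_strength=0.9)   # two conditions
cf_rule2 = rule_cf([0.7], rule_strength=0.5)        # one condition
cf_conclusion = combine_rules([cf_rule1, cf_rule2])

print(cf_rule1, cf_rule2, cf_conclusion)
```

The weakest condition bounds what a rule can conclude, and a second, weaker rule adds nothing to the final CF. This simplicity is part of why the method is easy to use and also why it can give counter-intuitive results.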

Dempster-Shafer theory
A variation of Bayes' theorem to represent ignorance
Uncertainty and ignorance
  Suppose two events A and B are mutually exclusive (ME) and exhaustive (EXH), and E is a piece of evidence
    A: having cancer
    B: not having cancer
    E: smoking
  By Bayes' theorem: our beliefs on A and B, given E, are measured by P(A|E) and P(B|E), and P(A|E) + P(B|E) = 1
  In reality
    I may have some belief in A, given E
    I may have some belief in B, given E
    I may have some belief not committed to either one
  The uncommitted belief (ignorance) should not be given to either A or B, even though I know one of the two must be true; rather, it should be given to "A or B", denoted {A, B}
  Uncommitted belief may be given to A and B when new evidence is discovered

Representing ignorance
Ex: Θ = {A,B,C}
Belief function (lattice of the subsets of Θ, with the value attached to each node)
  {A,B,C} 0.15
  {A,B} 0.1   {A,C} 0.1   {B,C} 0.05
  {A} 0.1     {B} 0.2     {C} 0.3
  {} 0

Plausibility (upper bound of belief of a node)
  {A,B,C} 0.15
  {A,B} 0.1   {A,C} 0.1   {B,C} 0.05
  {A} 0.1     {B} 0.2     {C} 0.3
  {} 0
Lower bound (known belief)
Upper bound (maximally possible)
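The lower and upper bounds can be computed directly from the lattice. The sketch below rests on two assumptions that the slide text does not spell out: the numbers on the lattice nodes are basic mass assignments m(.) (they sum to 1), and the bounds follow the usual Dempster-Shafer definitions, where Bel(S) sums the masses of the non-empty subsets of S and Pl(S) sums the masses of all sets that overlap S. The names `bel` and `pl` are illustrative.

```python
# Belief (lower bound) and plausibility (upper bound) over the example lattice.
# Assumption: the node values are basic mass assignments m(.).
m = {
    frozenset("ABC"): 0.15,
    frozenset("AB"): 0.10,
    frozenset("AC"): 0.10,
    frozenset("BC"): 0.05,
    frozenset("A"): 0.10,
    frozenset("B"): 0.20,
    frozenset("C"): 0.30,
}

def bel(s, m):
    """Lower bound: total mass committed to non-empty subsets of s."""
    return sum(v for x, v in m.items() if x and x <= s)

def pl(s, m):
    """Upper bound: total mass of the sets that overlap s."""
    return sum(v for x, v in m.items() if x & s)

for name in ("A", "AB", "ABC"):
    s = frozenset(name)
    print(sorted(s), "Bel =", round(bel(s, m), 3), "Pl =", round(pl(s, m), 3))
```

Under these definitions Pl(S) equals 1 minus the belief in the complement of S, so the gap between the two bounds is the mass not yet committed for or against S.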

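Evidence combination (how to use D-S theory)
Each piece of evidence has its own m(.) function for the same Θ
  m1: {A,B} 0.3   {A} 0.2   {B} 0.5   {} 0
  m2: {A,B} 0.1   {A} 0.7   {B} 0.2   {} 0
Belief based on combined evidence can be computed from m1 and m2: for each set S, sum the products m1(X)m2(Y) over all pairs with X ∩ Y = S, then divide by a normalization factor that removes the mass of incompatible combinations (pairs with X ∩ Y = {})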
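Written as a formula, the combination takes the form below. This is the standard statement of Dempster's rule of combination, given here as a reconstruction because the slide's own equation is not in the text; it reproduces the combined values 0.049, 0.607, and 0.344 that appear in the example.

```latex
m(S) \;=\; \frac{\displaystyle\sum_{X \cap Y = S} m_1(X)\, m_2(Y)}
                {1 \;-\; \displaystyle\sum_{X \cap Y = \emptyset} m_1(X)\, m_2(Y)},
\qquad S \neq \emptyset, \qquad m(\emptyset) = 0
```

The denominator is the normalization factor. The sum it subtracts is the total mass of the incompatible combinations, which for the two functions above is 0.2 × 0.2 + 0.5 × 0.7 = 0.39.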

  E1:       {A,B} 0.3     {A} 0.2     {B} 0.5     {} 0
  E2:       {A,B} 0.1     {A} 0.7     {B} 0.2
  E1 ^ E2:  {A,B} 0.049   {A} 0.607   {B} 0.344

Ignorance is reduced from m1({A,B}) = 0.3 to m({A,B}) = 0.049
Belief interval is narrowed
  A: from [0.2, 0.5] to [0.607, 0.656]
  B: from [0.5, 0.8] to [0.344, 0.393]
Advantages
  The only formal theory about ignorance
  Disciplined way to handle evidence combination
Disadvantages
  Computationally very expensive (lattice size 2^|Θ|)
  Assuming hypotheses are ME and EXH
  How to obtain m(.) for each piece of evidence is not clear, except subjectively
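The numbers in this example can be reproduced with a short script. The sketch assumes that the combination is Dempster's rule (products of masses collected by set intersection, then renormalized after discarding empty intersections) and that a belief interval is [Bel, Pl]; the function names `combine` and `interval` are illustrative.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: multiply masses, group by intersection, renormalize."""
    raw, conflict = {}, 0.0
    for (x, px), (y, py) in product(m1.items(), m2.items()):
        z = x & y
        if z:
            raw[z] = raw.get(z, 0.0) + px * py
        else:
            conflict += px * py          # incompatible combination
    norm = 1.0 - conflict                # normalization factor
    if norm <= 0.0:
        raise ValueError("the two pieces of evidence are totally conflicting")
    return {s: v / norm for s, v in raw.items()}

def interval(s, m):
    """Belief interval [Bel(s), Pl(s)] for a non-empty set s."""
    bel = sum(v for x, v in m.items() if x and x <= s)
    pl = sum(v for x, v in m.items() if x & s)
    return bel, pl

A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")
m1 = {AB: 0.3, A: 0.2, B: 0.5}   # evidence E1
m2 = {AB: 0.1, A: 0.7, B: 0.2}   # evidence E2
m12 = combine(m1, m2)

for s in (AB, A, B):
    print(sorted(s), round(m12[s], 3))
for s in (A, B):
    before = tuple(round(v, 3) for v in interval(s, m1))
    after = tuple(round(v, 3) for v in interval(s, m12))
    print(sorted(s), "interval", before, "->", after)
```

Rounded to three decimals, the combined masses and the narrowed intervals match the figures quoted on the slides.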

Appendix: A more complex example for variable elimination in BN
"Asia" network nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Abnormality in Chest (A), Bronchitis (B), X-Ray (X), Dyspnea (D)

We want to compute P(d)
Need to eliminate: v,s,x,t,l,a,b
Initial factors: the conditional probability tables of the network, one per node
Each elimination step multiplies the current factors that mention the variable, sums that variable out, and keeps the resulting factor in place of the ones it consumed

Eliminate: v (still to eliminate afterwards: s,x,t,l,a,b)
  Compute: the factor fv(t)
  Note: fv(t) = P(t)
  In general, result of elimination is not necessarily a probability term

Eliminate: s (still to eliminate afterwards: x,t,l,a,b)
  Compute: the factor fs(b,l)
  Summing on s results in a factor with two arguments fs(b,l)
  In general, result of elimination may be a function of several variables

Eliminate: x (still to eliminate afterwards: t,l,a,b)
  Compute: the factor fx(a)
  Note: fx(a) = 1 for all values of a

Eliminate: t (still to eliminate afterwards: l,a,b)
  Compute: the factor ft

Eliminate: l (still to eliminate afterwards: a,b)
  Compute: the factor fl

Eliminate: a,b
  Compute: the factors fa and fb; the final factor fb is a function of d alone, which is P(d)
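The elimination steps above can be written out as follows. This is a reconstruction under an assumption: the network has the usual Asia structure (V→T, S→L, S→B, T→A, L→A, A→X, A→D, B→D), which agrees with the notes that fv(t) = P(t), that summing out s yields fs(b,l), and that fx(a) = 1. The argument lists of the later factors follow from that assumed structure.

```latex
\begin{align*}
P(d) &= \sum_{v,s,x,t,l,a,b} P(v)\,P(s)\,P(t \mid v)\,P(l \mid s)\,P(b \mid s)\,
        P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b) \\
f_v(t)   &= \sum_v P(v)\,P(t \mid v) \\
f_s(b,l) &= \sum_s P(s)\,P(b \mid s)\,P(l \mid s) \\
f_x(a)   &= \sum_x P(x \mid a) = 1 \\
f_t(a,l) &= \sum_t f_v(t)\,P(a \mid t,l) \\
f_l(a,b) &= \sum_l f_s(b,l)\,f_t(a,l) \\
f_a(b,d) &= \sum_a f_l(a,b)\,f_x(a)\,P(d \mid a,b) \\
f_b(d)   &= \sum_b f_a(b,d) = P(d)
\end{align*}
```

Each line consumes every remaining factor that mentions the eliminated variable, so after the last step only the factor over d is left.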

Dealing with evidence
How do we deal with evidence?
Suppose we are given evidence V = t, S = f, D = t
We want to compute P(L, V = t, S = f, D = t)

Dealing with evidence
We start by writing the initial factors (the network's conditional probability tables)
Since we know that V = t, we don't need to eliminate V
Instead, we can replace the factors P(V) and P(T|V) with fP(V) = P(V = t) and fP(T|V)(T) = P(T | V = t)
These "select" the appropriate parts of the original factors given the evidence
Note that fP(V) is a constant, and thus does not appear in elimination of other variables
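The idea of selecting the part of each factor that agrees with the evidence, followed by ordinary sum-product elimination, can be sketched as below. Everything in this block is an assumption made for illustration: the factor representation (a tuple of variable names plus a table), the helper names, the network structure (the usual Asia graph), and all probability values, which are placeholders rather than numbers from the slides.

```python
from itertools import product

def make_factor(variables, fn):
    """Tabulate fn over all True/False assignments of the given variables."""
    variables = tuple(variables)
    return variables, {vals: fn(dict(zip(variables, vals)))
                       for vals in product((True, False), repeat=len(variables))}

def restrict(factor, var, value):
    """Keep only the rows where var == value and drop var (evidence selection)."""
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    return (variables[:i] + variables[i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value})

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    (fv, ft), (gv, gt) = f, g
    variables = fv + tuple(v for v in gv if v not in fv)
    return make_factor(variables, lambda a: ft[tuple(a[v] for v in fv)]
                                            * gt[tuple(a[v] for v in gv)])

def sum_out(factor, var):
    """Sum var out of a factor."""
    variables, table = factor
    i = variables.index(var)
    out = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return variables[:i] + variables[i + 1:], out

def eliminate(factors, order, evidence=None):
    """Restrict to the evidence, then eliminate the variables in the given order."""
    for var, value in (evidence or {}).items():
        factors = [restrict(f, var, value) for f in factors]
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = [f for f in factors if var not in f[0]] + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

def cpt(child, parents, p_true):
    """P(child | parents); p_true maps a tuple of parent values to P(child = True)."""
    def fn(a):
        p = p_true[tuple(a[x] for x in parents)]
        return p if a[child] else 1.0 - p
    return make_factor((child,) + tuple(parents), fn)

T, F = True, False
# Placeholder numbers on the assumed Asia structure; not taken from the slides.
asia = [
    cpt("v", (), {(): 0.01}),
    cpt("s", (), {(): 0.5}),
    cpt("t", ("v",), {(T,): 0.05, (F,): 0.01}),
    cpt("l", ("s",), {(T,): 0.1, (F,): 0.01}),
    cpt("b", ("s",), {(T,): 0.6, (F,): 0.3}),
    cpt("a", ("t", "l"), {(T, T): 1.0, (T, F): 1.0, (F, T): 1.0, (F, F): 0.0}),
    cpt("x", ("a",), {(T,): 0.98, (F,): 0.05}),
    cpt("d", ("a", "b"), {(T, T): 0.9, (T, F): 0.7, (F, T): 0.8, (F, F): 0.1}),
]

p_d = eliminate(asia, "vsxtlab")                                     # P(d)
p_l_ev = eliminate(asia, "xtab", evidence={"v": T, "s": F, "d": T})  # P(L, V=t, S=f, D=t)
print(p_d)
print(p_l_ev)
```

Restricting a factor removes the observed variable from its argument list, which is why the restricted P(V) becomes a constant and never takes part in a later elimination step.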

Dealing with evidence
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t)
Initial factors, after setting evidence: each factor that mentions V, S, or D is restricted to the observed value
Eliminating x, then t, then a, then b, we get a factor over L alone, which is P(L, V = t, S = f, D = t)
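The evidence example can be written out step by step as follows. This is a reconstruction that assumes the usual Asia structure (V→T, S→L, S→B, T→A, L→A, A→X, A→D, B→D) and uses the restricted-factor notation introduced for P(V) and P(T|V). The first line is the product of initial factors after setting the evidence, and the remaining lines follow the elimination order x, t, a, b.

```latex
\begin{align*}
& f_{P(v)}\; f_{P(s)}\; f_{P(t \mid v)}(t)\; f_{P(l \mid s)}(l)\; f_{P(b \mid s)}(b)\;
  P(a \mid t,l)\; P(x \mid a)\; f_{P(d \mid a,b)}(a,b) \\
f_x(a)   &= \sum_x P(x \mid a) \\
f_t(a,l) &= \sum_t f_{P(t \mid v)}(t)\, P(a \mid t,l) \\
f_a(b,l) &= \sum_a f_t(a,l)\, f_x(a)\, f_{P(d \mid a,b)}(a,b) \\
f_b(l)   &= \sum_b f_{P(b \mid s)}(b)\, f_a(b,l) \\
P(L = l,\, V = \text{true},\, S = \text{false},\, D = \text{true})
         &= f_{P(v)}\; f_{P(s)}\; f_{P(l \mid s)}(l)\; f_b(l)
\end{align*}
```

The constants f_{P(v)} and f_{P(s)} are carried along untouched and only scale the final factor over L.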