Updating with incomplete observations (UAI-2003)

Updating with incomplete observations (UAI-2003)
Gert de Cooman, SYSTeMS research group, BELGIUM, http://ippserv.ugent.be/~gert, gert.decooman@ugent.be
Marco Zaffalon, IDSIA, “Dalle Molle” Institute for Artificial Intelligence, SWITZERLAND, http://www.idsia.ch/~zaffalon, zaffalon@idsia.ch

What are incomplete observations? A simple example
- C (class) and A (attribute) are Boolean random variables
- C = 1 is the presence of a disease; A = 1 is a positive result of a medical test
- Let us do diagnosis
- Good point: you know that p(C = 0, A = 0) = 0.99 and p(C = 1, A = 1) = 0.01, whence p(C = 0 | A = a) allows you to make a sure diagnosis
- Bad point: the test result can be missing; this is an incomplete, or set-valued, observation {0,1} for A
- What is p(C = 0 | A is missing)?
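
A minimal sketch of this example (an illustration added here, not material from the talk): encode the joint above as a Python dictionary and condition on a complete test result; the numbers are exactly those on the slide.

```python
# Joint p(C, A) from the slide: all mass sits on (C=0, A=0) and (C=1, A=1).
joint = {(0, 0): 0.99, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.01}

def p_c_given_a(c, a):
    """p(C = c | A = a), obtained by conditioning the joint on A = a."""
    p_a = sum(p for (ci, ai), p in joint.items() if ai == a)
    return joint[(c, a)] / p_a

print(p_c_given_a(0, 0))  # 1.0 -> a negative test: the patient is certainly healthy
print(p_c_given_a(0, 1))  # 0.0 -> a positive test: the patient is certainly ill
```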

Example (ctd.)
- Kolmogorov’s definition of conditional probability seems to say p(C = 0 | A ∈ {0,1}) = p(C = 0) = 0.99
- i.e., with high probability the patient is healthy
- Is this right? In general, it is not
- Why?

Why?
- Because A can be selectively reported
- E.g., the medical test machine is broken; it produces an output if and only if the test is negative (A = 0)
- In this case p(C = 0 | A is missing) = p(C = 0 | A = 1) = 0
- The patient is definitely ill!
- Compare this with the former naive application of Kolmogorov’s updating (or naive updating, for short)
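
Continuing the sketch above (again an added illustration, reusing the `joint` table defined earlier): make the incompleteness mechanism of the broken machine explicit as p(O | C, A) and compare naive updating with updating that conditions on the actual observation O = '*'.

```python
# IM of the broken machine: it reports A exactly when A = 0, so the output is '*' iff A = 1.
def p_o_given_ca(o, c, a):
    """p(O = o | C = c, A = a) for the broken test machine."""
    if a == 0:
        return 1.0 if o == 0 else 0.0
    return 1.0 if o == '*' else 0.0

# Naive updating pretends that p(C = 0 | A missing) = p(C = 0).
p_c0_naive = sum(p for (c, a), p in joint.items() if c == 0)

# Updating in the overall model p(C, A) p(O | C, A): condition on the observation O = '*'.
num = sum(p * p_o_given_ca('*', c, a) for (c, a), p in joint.items() if c == 0)
den = sum(p * p_o_given_ca('*', c, a) for (c, a), p in joint.items())

print(p_c0_naive)  # 0.99 -> naive updating says "probably healthy"
print(num / den)   # 0.0  -> updating with the IM says the patient is definitely ill
```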

Modeling it the right way
- Observations-generating model: o is a generic value for O, another random variable; o can be 0, 1, or * (i.e., missing value for A)
- IM = p(O | C,A) should not be neglected!
- The correct overall model we need is p(C,A) p(O | C,A)
[Diagram: the distribution p(C,A) generates a complete pair (c,a), which is not observed; the Incompleteness Mechanism (IM) then produces the actual observation o about A]

What about Bayesian nets (BNs)?
- Asia net
- Let us predict C on the basis of the observation (L,S,T) = (y,y,n)
- BN updating instructs us to use p(C | L = y, S = y, T = n) to predict C
[Figure: Asia net with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]

Asia (ctd.)
- Should we really use p(C | L = y, S = y, T = n) to predict C?
- (V,H,D) is missing: (L,S,T,V,H,D) = (y,y,n,*,*,*) is an incomplete observation
- p(C | L = y, S = y, T = n) is just the naive updating
- By using the naive updating, we are neglecting the IM!
- Wrong inference in general

New problem?
- Problems with naive updating have been clear at least since 1985 (Shafer)
- Practical consequences were not so clear
- How often does naive updating cause problems?
- Perhaps it is not a problem in practice?

Grünwald & Halpern (UAI-2002) on naive updating
Three points made strongly:
1. Naive updating works ⇔ CAR holds, i.e., neglecting the IM is correct ⇔ CAR holds. With missing data: CAR (coarsening at random) = MAR (missing at random) = p(A is missing | c,a) is the same for all pairs (c,a)
2. CAR holds rather infrequently
3. The IM, p(O | C,A), can be difficult to model
Points 2 & 3 = a serious theoretical & practical problem. How should we do updating given 2 & 3?
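
As an added illustration of the MAR condition just stated, in the two-variable setting of the earlier example and reusing `p_o_given_ca` from the sketch above: MAR holds exactly when p(A is missing | c, a) is the same number for every pair (c, a), which the broken-machine IM violates.

```python
def is_mar(p_missing_given_ca, pairs, tol=1e-12):
    """MAR/CAR here: p(A is missing | c, a) is the same number for every pair (c, a)."""
    vals = [p_missing_given_ca(c, a) for (c, a) in pairs]
    return max(vals) - min(vals) <= tol

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# The broken machine hides A exactly when A = 1, so missingness depends on a: not MAR.
print(is_mar(lambda c, a: p_o_given_ca('*', c, a), pairs))  # False
```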

What this paper is about
- Have a conservative (i.e., robust) point of view: deliberately worst case, as opposed to the MAR best case
- Assume little knowledge about the IM: you are not allowed to assume MAR, and you are not able/willing to model the IM explicitly
- Derive an updating rule for this important case: the conservative updating rule

1st step: plug ignorance into your model
- Fact: the IM is unknown
- The only constraint on p(O | C,A) is p(O ∈ {0,1,*} | C,A) = 1, i.e., any distribution p(O | C,A) is possible
- This is too conservative; to draw useful conclusions we need a little less ignorance
- Consider the set of all p(O | C,A) s.t. p(O | C,A) = p(O | A), i.e., all the IMs which do not depend on what you want to predict
- Use this set of IMs jointly with the prior information p(C,A)
[Diagram: the known prior distribution p(C,A) generates a complete pair (c,a), which is not observed; an unknown Incompleteness Mechanism then produces the actual observation o about A]

2nd step: derive the conservative updating
- Let E = evidence = observed variables, in state e
- Let R = remaining unobserved variables (except C)
- Formal derivation yields: all the values for R should be considered
- In particular, updating becomes the Conservative Updating Rule (CUR):
  min_{r} p(c | E = e, R = r)  ≤  p(c | o)  ≤  max_{r} p(c | E = e, R = r)
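
A brute-force rendering of the CUR (an added sketch, not the paper's algorithm): given any routine that computes the ordinary posterior p(c | E = e, R = r) with the unobserved variables clamped, the conservative posterior is just the min and max of that quantity over all joint values of R. The callable `p_c_given_er` is a hypothetical stand-in for such a routine; the usage line reuses `p_c_given_a` from the medical example above.

```python
from itertools import product

def cur_interval(p_c_given_er, c, e, r_domains):
    """Conservative Updating Rule by brute force: (min, max) of p(c | E = e, R = r)
    over all joint values r of the unobserved variables R;
    r_domains lists each unobserved variable's possible values."""
    values = [p_c_given_er(c, e, r) for r in product(*r_domains)]
    return min(values), max(values)

# Usage on the two-variable medical example (E is empty and R = {A}):
posterior = lambda c, e, r: p_c_given_a(c, r[0])
print(cur_interval(posterior, 0, {}, [(0, 1)]))  # (0.0, 1.0): vacuous here, both diagnoses remain possible
```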

CUR & Bayesian nets
- Evidence: (L,S,T) = (y,y,n). What is your posterior confidence on C = y?
- Consider all the joint values of the nodes in R and take min & max of p(C = y | L = y, S = y, T = n, v, h, d)
- Posterior confidence ∈ [0.42, 0.71]
- Computational note: only the Markov blanket matters!
[Figure: Asia net with (S)moking = y, (T)uberculosis = n, Abnorma(L) X-rays = y observed; (V)isit to Asia, Bronc(H)itis, (D)yspnea unobserved; Lung (C)ancer is the query]

A few remarks
The CUR…
- is based only on p(C,A), like the naive updating
- produces lower & upper probabilities
- can produce indecision

CUR & decision-making
- Decisions: c’ dominates c’’ (c’, c’’ ∈ C) if p(c’ | E = e, R = r) > p(c’’ | E = e, R = r) for all r ∈ R
- Indecision? It may happen that there are r’, r’’ ∈ R such that
  p(c’ | E = e, R = r’) > p(c’’ | E = e, R = r’) and p(c’ | E = e, R = r’’) < p(c’’ | E = e, R = r’’)
- Then there is no evidence that you should prefer c’ to c’’ or vice versa (= keep both)
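
A direct, brute-force rendering of this dominance criterion (an added sketch; the paper's own test is far more efficient, see the algorithmic facts below): c’ dominates c’’ when its posterior is strictly larger for every joint value of the unobserved variables, and the classes returned are those not dominated by any other. `p_c_given_er` is again a hypothetical posterior routine, and the enumeration is exponential in general.

```python
from itertools import product

def dominates(p_c_given_er, c1, c2, e, r_domains):
    """c1 dominates c2 iff p(c1 | E = e, R = r) > p(c2 | E = e, R = r) for every joint value r."""
    return all(p_c_given_er(c1, e, r) > p_c_given_er(c2, e, r)
               for r in product(*r_domains))

def undominated(classes, p_c_given_er, e, r_domains):
    """Classes kept by the CUR decision criterion: those not dominated by any other class
    (possibly more than one, which is the 'keep both' indecision case)."""
    return [c for c in classes
            if not any(dominates(p_c_given_er, d, c, e, r_domains)
                       for d in classes if d != c)]
```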

Decision-making example
- Evidence: E = (L,S,T) = (y,y,n) = e. What is your diagnosis for C?
- p(C = y | E = e, H = n, D = y) > p(C = n | E = e, H = n, D = y)
- p(C = y | E = e, H = n, D = n) < p(C = n | E = e, H = n, D = n)
- Both C = y and C = n are plausible
- Evidence: E = (L,S,T) = (y,y,y) = e. Now C = n dominates C = y: “cancer” is ruled out
[Figure: Asia net with (S)moking = y and Abnorma(L) X-rays = y observed; Lung (C)ancer is the query]

Algorithmic facts
- CUR ⇒ restrict attention to the Markov blanket
- State enumeration still prohibitive in some cases (e.g., naive Bayes)
- Dominance test based on dynamic programming: linear in the number of children of the class node C
- However: decision-making is possible in linear time, by the provided algorithm, even on some multiply connected nets!

On the application side
Important characteristics of the present approach:
- Robust approach, easy to implement
- Does not require changes in pre-existing BN knowledge bases: based on p(C,A) only!
- Markov blanket ⇒ favors low computational complexity
- If you can write down the IM explicitly, your decisions/inferences will be contained in ours
By-product for large networks:
- Even when naive updating is OK, CUR can serve as a useful preprocessing phase
- Restricting attention to the Markov blanket may produce strong enough inferences and decisions

What we did in the paper
- Theory of coherent lower previsions (imprecise probabilities)
- Coherence: equivalent to a large extent to sets of probability distributions
- Weaker assumptions
- CUR derived in quite a general framework

Concluding notes
- There are cases when the IM is unknown/difficult to model and MAR does not hold: a serious theoretical and practical problem
- CUR applies: robust to the unknown IM, computationally easy decision-making with BNs
- CUR works with credal nets, too, with the same complexity
- Future: how to make stronger inferences and decisions; hybrid MAR/non-MAR modeling?