Updating with incomplete observations (UAI-2003)
Gert de Cooman, SYSTeMS research group, BELGIUM, http://ippserv.ugent.be/~gert, gert.decooman@ugent.be
Marco Zaffalon, "Dalle Molle" Institute for Artificial Intelligence (IDSIA), SWITZERLAND, http://www.idsia.ch/~zaffalon, zaffalon@idsia.ch
What are incomplete observations? A simple example
- C (class) and A (attribute) are Boolean random variables
- C = 1 is the presence of a disease; A = 1 is the positive result of a medical test
Let us do diagnosis
- Good point: you know that p(C = 0, A = 0) = 0.99 and p(C = 1, A = 1) = 0.01, whence p(C = 0 | A = a) allows you to make a sure diagnosis (it is 1 for a = 0 and 0 for a = 1; see the sketch below)
- Bad point: the test result can be missing
- This is an incomplete, or set-valued, observation A ∈ {0,1}
- What is p(C = 0 | A is missing)?
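As a minimal illustration (plain Python; the dictionary encoding of the joint is ours, not the paper's), observing A makes the diagnosis deterministic:

```python
# Joint distribution p(C, A) from the example: all mass on (0,0) and (1,1).
joint = {(0, 0): 0.99, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.01}

def p_c_given_a(c, a):
    """p(C = c | A = a), by normalizing the joint over the observed value of A."""
    p_a = sum(p for (_, ai), p in joint.items() if ai == a)
    return joint[(c, a)] / p_a

print(p_c_given_a(0, 0))  # 1.0: a negative test means the patient is healthy
print(p_c_given_a(0, 1))  # 0.0: a positive test means the patient is ill
```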
Example ctd
- Kolmogorov's definition of conditional probability seems to say p(C = 0 | A ∈ {0,1}) = p(C = 0) = 0.99
- i.e., with high probability the patient is healthy
- Is this right? In general, it is not. Why?
Why?
- Because A can be selectively reported
- e.g., the medical test machine is broken; it produces an output only when the test is negative (A = 0)
- In this case p(C = 0 | A is missing) = p(C = 0 | A = 1) = 0
- The patient is definitely ill!
- Compare this with the former naive application of Kolmogorov's updating (naive updating, for short)
Modeling it the right way
Observations-generating model
- o is a generic value for O, another random variable; o can be 0, 1, or * (i.e., missing value for A)
- IM = p(O | C,A) should not be neglected!
- The correct overall model we need is p(C,A) p(O | C,A); a sketch follows below
[Diagram: p(C,A), the distribution generating pairs for (C,A) -> (c,a), the complete pair (not observed) -> IM, the Incompleteness Mechanism p(O | C,A) -> o, the actual observation about A]
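A hedged sketch of the overall model p(C,A) p(O | C,A) in Python, with the broken-machine IM from the previous slide hard-coded (report A only when A = 0, emit '*' otherwise); the function names are ours:

```python
joint_ca = {(0, 0): 0.99, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.01}

def im(o, c, a):
    """One possible IM p(O = o | C = c, A = a): deterministic selective
    reporting, i.e., output A when A = 0 and the missing symbol '*' when A = 1."""
    reported = 0 if a == 0 else '*'
    return 1.0 if o == reported else 0.0

def p_c_given_o(c, o):
    """p(C = c | O = o) under the full model p(c, a) * p(o | c, a)."""
    num = sum(p * im(o, ci, ai) for (ci, ai), p in joint_ca.items() if ci == c)
    den = sum(p * im(o, ci, ai) for (ci, ai), p in joint_ca.items())
    return num / den

print(p_c_given_o(0, '*'))  # 0.0, not the naive 0.99: a missing output implies A = 1
```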
What about Bayesian nets (BNs)? Asia net
- Let us predict C on the basis of the observation (L,S,T) = (y,y,n)
- BN updating instructs us to use p(C | L = y, S = y, T = n) to predict C
[Asia net diagram; nodes: (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]
Asia ctd
- Should we really use p(C | L = y, S = y, T = n) to predict C?
- (V,H,D) is missing: (L,S,T,V,H,D) = (y,y,n,*,*,*) is an incomplete observation
- p(C | L = y, S = y, T = n) is just the naive updating
- By using the naive updating, we are neglecting the IM! Wrong inference in general
New problem?
- Problems with naive updating have been clear since at least 1985 (Shafer)
- Practical consequences were not so clear
- How often does naive updating cause problems? Perhaps it is not a problem in practice?
Grünwald & Halpern (UAI-2002) on naive updating Three points made strongly naive updating works CAR holds i.e., neglecting the IM is correct CAR holds With missing data: CAR (coarsening at random) = MAR (missing at random) = p(A is missing | c,a) is the same for all pairs (c,a) CAR holds rather infrequently The IM, p(O | C,A), can be difficult to model 2 & 3 = serious theoretical & practical problem How should we do updating given 2 & 3?
What this paper is about
- Have a conservative (i.e., robust) point of view: deliberately worst case, as opposed to the MAR best case
- Assume little knowledge about the IM: you are not allowed to assume MAR, and you are not able/willing to model the IM explicitly
- Derive an updating rule for this important case: the Conservative Updating Rule
1st step: plug ignorance into your model
- Fact: the IM is unknown
- p(O ∈ {0,1,*} | C,A) = 1 is the only constraint on p(O | C,A), i.e., any distribution p(O | C,A) is possible
- This is too conservative; to draw useful conclusions we need a little less ignorance
- Consider the set of all p(O | C,A) s.t. p(O | C,A) = p(O | A), i.e., all the IMs which do not depend on what you want to predict
- Use this set of IMs jointly with prior information p(C,A)
[Diagram: p(C,A), the known prior distribution -> (c,a), the complete pair (not observed) -> IM, the unknown Incompleteness Mechanism -> o, the actual observation about A]
2nd step: derive the conservative updating
- Let E = evidence = observed variables, in state e
- Let R = remaining unobserved variables (except C)
- Formal derivation yields: all the values for R should be considered
- In particular, updating becomes the Conservative Updating Rule (CUR), sketched in code below:
  min_{r ∈ R} p(c | E = e, R = r) ≤ p(c | o) ≤ max_{r ∈ R} p(c | E = e, R = r)
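A brute-force sketch of the CUR (the helper names are ours; p_c_given_er stands for any routine computing p(c | E = e, R = r), e.g., by standard BN inference):

```python
from itertools import product

def cur_interval(p_c_given_er, r_domains, c, e):
    """Conservative Updating Rule: bound p(c | o) by the min and max of
    p(c | E = e, R = r) over all joint values r of the unobserved variables."""
    vals = [p_c_given_er(c, e, r) for r in product(*r_domains)]
    return min(vals), max(vals)

# Usage with a made-up conditional over two Boolean unobserved variables:
toy = lambda c, e, r: 0.42 if r == (0, 0) else 0.71  # hypothetical numbers
print(cur_interval(toy, [(0, 1), (0, 1)], c='y', e=('y', 'y', 'n')))  # (0.42, 0.71)
```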
CUR & Bayesian nets
- Evidence: (L,S,T) = (y,y,n). What is your posterior confidence on C = y?
- Consider all the joint values of nodes in R
- Take min & max of p(C = y | L = y, S = y, T = n, v, h, d)
- Posterior confidence: [0.42, 0.71]
- Computational note: only the Markov blanket matters!
[Asia net diagram; nodes: (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]
A few remarks
The CUR…
- is based only on p(C,A), like the naive updating
- produces lower & upper probabilities
- can produce indecision
CUR & decision-making Decisions Indecision? c’ dominates c’’ (c’,c’’ C) if for all r R , p(c’ | E = e, R = r) > p(c’’ | E = e, R = r) Indecision? It may happen that r’,r’’ R so that: p(c’ | E = e, R = r’) > p(c’’ | E = e, R = r’) and p(c’ | E = e, R = r’’) < p(c’’ | E = e, R = r’’) There is no evidence that you should prefer c’ to c’’ and vice versa (= keep both)
Decision-making example
- Evidence: E = (L,S,T) = (y,y,n) = e. What is your diagnosis for C?
- p(C = y | E = e, H = n, D = y) > p(C = n | E = e, H = n, D = y)
- p(C = y | E = e, H = n, D = n) < p(C = n | E = e, H = n, D = n)
- Both C = y and C = n are plausible
- Evidence: E = (L,S,T) = (y,y,y) = e. Now C = n dominates C = y: "cancer" is ruled out
[Asia net diagram; nodes: (V)isit to Asia, (S)moking = y, (T)uberculosis, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]
Algorithmic facts
- CUR restricts attention to the Markov blanket
- State enumeration is still prohibitive in some cases, e.g., naive Bayes
- Dominance test based on dynamic programming: linear in the number of children of the class node C
- However: decision-making is possible in linear time, by the provided algorithm, even on some multiply connected nets!
On the application side
Important characteristics of the present approach
- Robust approach, easy to implement
- Does not require changes in pre-existing BN knowledge bases: based on p(C,A) only!
- The Markov blanket favors low computational complexity
- If you can write down the IM explicitly, your decisions/inferences will be contained in ours
By-product for large networks
- Even when naive updating is OK, CUR can serve as a useful preprocessing phase
- Restricting attention to the Markov blanket may produce strong enough inferences and decisions
What we did in the paper
- Theory of coherent lower previsions (imprecise probabilities)
- Coherence: equivalent to a large extent to sets of probability distributions
- Weaker assumptions: the CUR is derived in quite a general framework
Concluding notes
There are cases when:
- the IM is unknown/difficult to model
- MAR does not hold
This is a serious theoretical and practical problem, and the CUR applies:
- robust to the unknown IM
- computationally easy decision-making with BNs
- CUR works with credal nets, too, with the same complexity
Future: how to make stronger inferences and decisions; hybrid MAR/non-MAR modeling?