Checking for Prior-Data Conflict
Michael Evans and Hadas Moshonov, University of Toronto
Introduction
Statistical analyses are based on inputs from the analyst. In a Bayesian context these are:
- assumptions about the sampling model (S, {fθ : θ ∈ Ω}, μ)
- the choice of a prior (Ω, π, ν)
If these are in "error" then subsequent inferences about model components are at least suspect. So checking the validity of these components is a necessary step in a statistical analysis.
Two types of error
(1) The sampling model is in error if the observed data s0 ∈ S is surprising for each fθ.
(2) A prior-data conflict exists when the prior places most of its mass on θ values for which s0 is surprising.
Several papers have considered model checking in this context:
- Guttman (1967)
- Box (1980)
- Rubin (1984)
- Gelman, Meng and Stern (1996)
But these do not really distinguish between the different types of error. These errors should be assessed separately.
Why?
If the sampling model is wrong it must be modified (ignoring practical versus statistical significance).
If a prior-data conflict exists it may be possible to ignore it when the sampling model is viewed as correct and the amount of data is large enough. So it is sometimes possible to correct for the effects of prior-data conflict simply by increasing the amount of data.
First check for the failure of the sampling model:
- many frequentist methods exist for this
- Bayesian methods, Bayarri and Berger (2000)
If the sampling model is in error there is no point in checking for prior-data conflict.
Notation
How do we assess whether or not a prior-data conflict exists?
The prior-predictive measure M for s has density, wrt μ,
    m(s) = ∫Ω fθ(s) π(θ) ν(dθ).
If Π is proper then M is a probability measure. The posterior probability measure Π(· | s0) has density, wrt ν,
    π(θ | s0) = fθ(s0) π(θ) / m(s0).
For T : S → 𝒯 denote the marginal densities of T by fθT, wrt a support measure λ on 𝒯. This leads to the marginal prior-predictive density of T, wrt λ,
    mT(t) = ∫Ω fθT(t) π(θ) ν(dθ),
and the posterior-predictive distribution of T has density, wrt λ,
    mT(t | s0) = ∫Ω fθT(t) π(θ | s0) ν(dθ).
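As an illustration only (not from the slides), the following minimal sketch computes the marginal prior-predictive density mT by numerical integration for an assumed N(θ, 1) sample with a N(θ0, τ0²) prior, and compares it with the closed form; the names n, theta0 and tau0 are illustrative assumptions.

```python
# Hypothetical illustration: marginal prior-predictive density of T = mean(s)
# for s_1,...,s_n ~ N(theta, 1) with prior theta ~ N(theta0, tau0^2).
import numpy as np
from scipy import stats, integrate

n, theta0, tau0 = 10, 0.0, 2.0

def m_T(t):
    # m_T(t) = integral over theta of f_{theta,T}(t) * pi(theta), by quadrature
    integrand = lambda theta: stats.norm.pdf(t, loc=theta, scale=1/np.sqrt(n)) \
                              * stats.norm.pdf(theta, loc=theta0, scale=tau0)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

# Closed form: T ~ N(theta0, tau0^2 + 1/n) under the prior predictive.
t = 1.3
print(m_T(t), stats.norm.pdf(t, loc=theta0, scale=np.sqrt(tau0**2 + 1/n)))
```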
Prior-data Conflict: Sufficiency
Basic idea: a prior-data conflict exists whenever the data provide little or no support to those values of θ where the prior places its mass. Compare the effective support of the prior with the region where the likelihood is high. How?
If the observed likelihood L(· | s0) = f·(s0) is a surprising value from M then this would seem to indicate that a prior-data conflict exists. The likelihood is equivalent to a minimal sufficient statistic T, so we compare T(s0) to MT. It is appropriate to restrict attention to T:
Theorem 1. Suppose T is a sufficient statistic for the model {fθ : θ ∈ Ω} for data s. Then the conditional prior-predictive distribution of the data s given T is independent of the prior π.
Also:
Theorem 2. If L(· | s0) is nonzero only on a θ-region where π places no mass then T(s0) is an impossible outcome for MT.
Why compare T(s0) with MT rather than MT(· | s0)?
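A minimal Monte Carlo sketch of the check, under an assumed N(θ, 1) model with a N(θ0, τ0²) prior: draw θ from the prior, T from the model, and report the prior-predictive tail probability MT(mT(T) ≤ mT(T(s0))). The model, prior and sample sizes are illustrative assumptions, not from the slides.

```python
# Hypothetical sketch of the prior-predictive check based on the minimal
# sufficient statistic T: here T(s) is the sample mean of a N(theta, 1) sample
# with a N(theta0, tau0^2) prior, so m_T has a closed form; the tail
# probability is estimated by simulating T from M_T.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, theta0, tau0 = 20, 0.0, 1.0

def m_T(t):
    # density of T under the prior predictive: N(theta0, tau0^2 + 1/n)
    return stats.norm.pdf(t, loc=theta0, scale=np.sqrt(tau0**2 + 1/n))

# observed data (generated here from a theta far from the prior's mass)
s0 = rng.normal(loc=4.0, scale=1.0, size=n)
t0 = s0.mean()

# simulate T from M_T: theta ~ prior, then T | theta ~ N(theta, 1/n)
theta = rng.normal(theta0, tau0, size=100_000)
T = rng.normal(theta, 1/np.sqrt(n))

p_value = np.mean(m_T(T) <= m_T(t0))   # a small value suggests prior-data conflict
print(p_value)
```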
Example - Location-normal
Suppose s = (s1, ..., sn) is a sample from a N(θ, 1) distribution with θ ∈ R1, and we place a N(θ0, σ0²) prior on θ.
With T(s) = (s1 + ... + sn)/n, the sample mean, MT is given by T(s) ~ N(θ0, σ0² + 1/n).
Prior-predictive P-value:
    2(1 − Φ(|T(s0) − θ0| / (σ0² + 1/n)^(1/2)))   (1)
The prior predictive results in standardization by (σ0² + 1/n)^(1/2) rather than by n^(−1/2), the standardization used when testing H0 : θ = θ0.
If the true value of θ is θ*, then as n → ∞ (1) converges almost surely to 2(1 − Φ(|θ* − θ0| / σ0)).
When σ0 → ∞ this limit is 1 (a diffuse prior simply indicates that all values are equally likely) and no conflict can be found.
As σ0 → 0, so we have a very precise prior, (1) will definitely find evidence of a prior-data conflict for large n unless θ0 is the true value.
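The short sketch below evaluates the reconstructed P-value (1) and its limiting behaviour; the numerical settings (theta_star, theta0, the grid of sigma0 and n values) are assumptions chosen purely for illustration.

```python
# Sketch of the prior-predictive P-value (1) for the location-normal example:
# 2 * (1 - Phi(|T(s0) - theta0| / sqrt(sigma0^2 + 1/n))).
import numpy as np
from scipy.stats import norm

def prior_pred_pvalue(tbar, n, theta0, sigma0):
    return 2 * (1 - norm.cdf(abs(tbar - theta0) / np.sqrt(sigma0**2 + 1/n)))

theta_star, theta0 = 2.0, 0.0          # true value vs. prior location
for sigma0 in (10.0, 1.0, 0.1):        # diffuse, moderate, very precise prior
    for n in (10, 1000):
        # for large n the sample mean is close to theta_star, used here directly
        print(sigma0, n, prior_pred_pvalue(theta_star, n, theta0, sigma0))
# As n grows the value approaches 2(1 - Phi(|theta* - theta0| / sigma0)):
# near 1 for a diffuse prior, near 0 for a precise prior with theta0 != theta*.
```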
Posterior-predictive P-value:
    2(1 − Φ(|T(s0) − θn| / (vn + 1/n)^(1/2)))   (2)
where θn = (θ0/σ0² + n T(s0)) / (1/σ0² + n) and vn = (1/σ0² + n)^(−1) are the posterior mean and variance.
As n → ∞, (2) converges almost surely to 1, irrespective of the true value of θ. So if we were to use the posterior predictive we would never conclude that a prior-data conflict exists.
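The following sketch evaluates the posterior-predictive P-value (2) for the same example. The closed forms are a reconstruction from standard normal-normal conjugacy, and the numbers (xbar = 3, sigma0 = 0.5) are illustrative assumptions; the point is that the value climbs toward 1 as n grows even though the data sit far from the prior.

```python
# Sketch of the posterior-predictive P-value (2): it tends to 1 as n grows,
# so this check cannot detect prior-data conflict.
import numpy as np
from scipy.stats import norm

def post_pred_pvalue(xbar, n, theta0, sigma0):
    post_var = 1 / (1/sigma0**2 + n)                 # posterior variance of theta
    post_mean = post_var * (theta0/sigma0**2 + n*xbar)
    sd = np.sqrt(post_var + 1/n)                     # sd of T under the posterior predictive
    return 2 * (1 - norm.cdf(abs(xbar - post_mean) / sd))

for n in (10, 100, 10_000):
    print(n, post_pred_pvalue(xbar=3.0, n=n, theta0=0.0, sigma0=0.5))
```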
Prior-data Conflict: Ancillarity
We could compare the observed value of U(T(s)) with its marginal prior-predictive distribution for any function U. For certain choices of U this will clearly not be appropriate: e.g., if U is ancillary then the marginal prior-predictive distribution of U does not depend on the prior.
T(s0) may be a surprising value simply because U(T(s0)) is surprising for some ancillary U, and we want to avoid this. So remove the variation in the prior-predictive distribution of T that is associated with U by conditioning on U.
We avoid the necessity of conditioning on an ancillary whenever we have a complete minimal sufficient statistic T (Basu's theorem).
If U1(T) and U2(T) are both ancillary and U1 = h(U2) for some h, then condition on U2(T(s0)), the larger ancillary.
Condition on maximal ancillaries whenever possible.
There is a lack of a general method for obtaining maximal ancillaries; see Lehmann and Scholz (1992).
The lack of a unique maximal ancillary can cause problems in frequentist approaches to inference, but it is not a problem here.
Example - Mixtures (Cox and Hinkley, 1974)
The response x is either from a N(θ, σ1²) or a N(θ, σ2²) distribution, where θ ∈ R1 is unknown and σ1², σ2² are both known and unequal. The particular instrument used is chosen according to c ~ Bernoulli(p) where p is known.
(c, x) is minimal sufficient and c is ancillary. When c = i we would use the conditional prior-predictive distribution of x given c = i to check for prior-data conflict.
Generally, if (x, u) is minimal sufficient for a model with x | u ~ fθ(· | u) and u ~ h a maximal ancillary, use the conditional prior-predictive distribution of x given the observed u0.
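A hedged sketch of the conditional check for this mixture example, with a N(θ0, τ0²) prior: given the observed indicator c = i, x is compared with its conditional prior predictive N(θ0, τ0² + σi²). The instrument variances, prior settings and observed value are assumptions for illustration.

```python
# Hypothetical sketch of the conditional prior-predictive check for the
# Cox-Hinkley mixture example: x | c = i ~ N(theta, sigma_i^2), c ~ Bernoulli(p),
# prior theta ~ N(theta0, tau0^2), so x | c = i ~ N(theta0, tau0^2 + sigma_i^2)
# under the prior predictive.
import numpy as np
from scipy.stats import norm

theta0, tau0 = 0.0, 1.0
sigma = {0: 0.1, 1: 10.0}      # the two known, unequal instrument standard deviations

def conditional_check(x_obs, c_obs):
    sd = np.sqrt(tau0**2 + sigma[c_obs]**2)
    return 2 * (1 - norm.cdf(abs(x_obs - theta0) / sd))

# The same observation x0 = 3 is surprising when the precise instrument was
# used but not when the imprecise one was:
print(conditional_check(3.0, c_obs=0))   # small -> conflict indicated
print(conditional_check(3.0, c_obs=1))   # large -> no conflict indicated
```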
Noninformative Priors
Various definitions are available for expressing what it means for a prior to be noninformative:
- Kass and Wasserman (1996)
- Bernardo (1979)
- Berger and Bernardo (1992)
A somewhat different requirement for noninformativity arises from considering the existence of prior-data conflict. For if a prior is such that we would never conclude that a prior-data conflict exists, no matter what data are obtained, then it seems reasonable to say that such a prior is at least a candidate for being called noninformative.
So we consider the absence of the possibility of any prior-data conflict as a necessary characteristic of noninformativity rather than as a characterization of this concept.
Diagnostics for Ignoring Prior-data Conflict
Suppose we have found evidence of a prior-data conflict. What do we do next? Use a different prior. But how do we choose such a prior?
In some circumstances the answer is to collect more data, for with a sufficient amount of data the effect of the prior on our inferences is immaterial. But in some circumstances this is not possible. So, in a given context, we would like to know if we have enough data to ignore the prior-data conflict and, if not, how much more data we need.
Intuitively, if the inferences about θ that we are interested in making are not strongly dependent on the prior, then we might feel that we can ignore any prior-data conflict.
We need some quantitative assessment of this, and there are several possibilities.
When there is a prior that is noninformative, we can compare the posterior inferences obtained under the two priors. If these inferences do not differ by an amount that is considered of practical importance then it seems reasonable to ignore the prior-data conflict. The amount of difference that matters will depend on the particular application.
In general we would compute the divergence between the posteriors under the informative and noninformative priors. However, we need to state a cut-off value for the divergence, below which we would not view the difference between the distributions as material.
It is not always obvious how to obtain a noninformative prior. In these circumstances we can simply select another prior that seems reasonable for the problem and compare inferences, as in the sketch below.
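A minimal sketch of this diagnostic for the location-normal setting: the posterior under the informative prior is compared with the posterior under a much flatter prior via the Kullback-Leibler divergence. Using KL, the specific flat prior (sd 100) and the cut-off idea are assumptions for illustration, not prescriptions from the slides.

```python
# Compare posteriors under an informative and a near-flat prior via KL divergence.
import numpy as np

def normal_posterior(xbar, n, theta0, sigma0):
    # normal-normal conjugacy: returns (posterior mean, posterior variance)
    var = 1 / (1/sigma0**2 + n)
    return var * (theta0/sigma0**2 + n*xbar), var

def kl_normal(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) )
    return 0.5 * (np.log(v2/v1) + (v1 + (m1 - m2)**2) / v2 - 1)

xbar, theta0, sigma0 = 2.0, 0.0, 0.5
for n in (5, 50, 5000):
    m_inf, v_inf = normal_posterior(xbar, n, theta0, sigma0)      # informative prior
    m_flat, v_flat = normal_posterior(xbar, n, theta0, 100.0)     # near-flat prior
    print(n, kl_normal(m_inf, v_inf, m_flat, v_flat))
# The divergence shrinks as n grows; once it falls below a practically
# negligible cut-off, the prior-data conflict can arguably be ignored.
```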
Factoring the Joint Distribution
The joint distribution of (θ, s) is given by the measure Pθ × Π.
A minimal sufficient statistic T leads to the factorization P(· | T) × PθT × Π.
- P(· | T) depends only on the choice of the sampling model {Pθ : θ ∈ Ω}, so we compare the observed data s0 against P(· | T) to assess the sampling model.
- PθT × Π can be written as MT × Π(· | T), and we compare the observed value T(s0) with the distribution MT to check for prior-data conflict.
- For a maximal ancillary U we can factor MT × Π(· | T) as PU × MT(· | U) × Π(· | T). PU depends only on the choice of the sampling model {Pθ : θ ∈ Ω}, so we can also compare the observed value U(s0) with PU to check the sampling model. When U is not independent of T, instead compare T(s0) with MT(· | U) to check for prior-data conflict.
- Inferences about θ use the posterior distribution Π(· | T).
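For concreteness, a short sketch of how the three factors separate in the location-normal case; the data-generating values and the particular residual check are illustrative assumptions only.

```python
# Hypothetical illustration of the factorization for the location-normal case:
# s | T checks the sampling model, T against M_T checks for prior-data conflict,
# and Pi(. | T) is used for inference about theta.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, theta0, sigma0 = 25, 0.0, 1.0
s0 = rng.normal(2.0, 1.0, size=n)
T0 = s0.mean()

# 1. Model check via P(. | T): the residuals s - T0 do not depend on theta or
#    on the prior; their sum of squares is chi-square(n-1) if the model holds.
residual_ss = np.sum((s0 - T0)**2)

# 2. Prior-data conflict check via M_T: T ~ N(theta0, sigma0^2 + 1/n).
p_conflict = 2 * (1 - norm.cdf(abs(T0 - theta0) / np.sqrt(sigma0**2 + 1/n)))

# 3. Inference via the posterior Pi(. | T).
post_var = 1 / (1/sigma0**2 + n)
post_mean = post_var * (theta0/sigma0**2 + n*T0)
print(residual_ss, p_conflict, (post_mean, post_var))
```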