Published by Kevin Blake. Modified over 6 years ago.
1
Uncertainty in Statistical Matching. Training Course «Statistical Matching», Rome, 6-8 November 2013
Marco Di Zio, Dept. Integration, Quality, Research and Production Networks Development Department, Istat. dizio [at] istat.it
2
Outline: The problem; Identification problem; A formal definition of uncertainty in SM; The Normal Case; The Multinomial Case; Estimation of uncertainty; Reduction of uncertainty: logical constraints
3
The problem. Information is requested on variables that are not jointly observed. It is a statistical problem with partial knowledge. Approaches to fill the lack of knowledge: introduce the conditional independence assumption (CIA); make use of auxiliary information. What to do when the CIA and auxiliary information cannot be used?
4
Identification problem - Example
Let X be a dichotomous r.v. with P(X = 1) = θ and P(X = 0) = 1 − θ, and let R be the corresponding indicator of missingness, i.e., R = 1 if X is observed and R = 0 otherwise. The probability θ can be written as: θ = P(R = 1)P(X = 1|R = 1) + P(R = 0)P(X = 1|R = 0). The critical probability is P(X = 1|R = 0): the data carry no information about it, because X is missing when R = 0. Under an assumption on the missing-data mechanism (MCAR, MAR), this probability can be estimated from the observed dataset; e.g. under MCAR, P(X = 1|R = 0) = P(X = 1|R = 1).
5
Identification problem - Example
If we cannot use 'external' information, the idea is to analyze all the possible solutions. Since 0 ≤ P(X = 1|R = 0) ≤ 1, θ may take any value in the following interval: P(R = 1)P(X = 1|R = 1) ≤ θ ≤ P(R = 1)P(X = 1|R = 1) + P(R = 0).
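The interval for θ depends only on two quantities that can be estimated from the observed data, P(R = 1) and P(X = 1|R = 1). A minimal sketch in Python; the function name `theta_bounds` and the example figures are illustrative assumptions, not taken from the slides:

```python
def theta_bounds(p_observed, p_x1_given_observed):
    """Bounds on theta = P(X = 1) when nothing is assumed about the missing part.

    p_observed          -- P(R = 1), the probability that X is observed
    p_x1_given_observed -- P(X = 1 | R = 1), estimable from the observed data
    """
    if not (0.0 <= p_observed <= 1.0 and 0.0 <= p_x1_given_observed <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    # Lower bound: P(X = 1 | R = 0) = 0; upper bound: P(X = 1 | R = 0) = 1.
    lower = p_observed * p_x1_given_observed
    upper = lower + (1.0 - p_observed)
    return lower, upper


# Hypothetical figures: 80% of units observed, 30% of them with X = 1.
lower, upper = theta_bounds(0.8, 0.3)
mcar_value = 0.3  # under MCAR, theta equals P(X = 1 | R = 1)
print(lower, upper)
```

The width of the interval equals P(R = 0), so it shrinks to a point only when nothing is missing; the MCAR value always falls inside it.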
6
Identification problem in Statistical Matching
For statistical matching the idea is similar. This kind of analysis is important for exploratory analysis, and it can give an indirect justification for the use of models based on the CIA. This approach has been used by Kadane (1978), Rubin (1986), Moriarity and Scheuren (2001, 2003) and, more explicitly, by Raessler (2002) and D'Orazio, Di Zio and Scanu (2006). The identification problem for missing data is discussed by Manski (1995).
7
A formal definition of uncertainty in SM
Let f(x, y, z; θ*) be the probability distribution of (X, Y, Z), where θ* ∈ Θ is the unknown parameter. Without any information, the uncertainty is given by the whole parameter space Θ.
8
A formal definition of uncertainty in SM
Suppose we know the partial distributions of (X, Y) and (X, Z), i.e. the parameters θXY and θXZ are known and equal to θ*XY and θ*XZ. With this information, the uncertainty on Θ decreases, as the parameters may assume only the values compatible with the constraints: {θ ∈ Θ : θXY = θ*XY, θXZ = θ*XZ}.
9
The Normal Case (trivariate case)
Let (X, Y, Z) be a trivariate normal with mean vector μ = (μX, μY, μZ) and correlation matrix ρ with rows (1, ρXY, ρXZ), (ρXY, 1, ρYZ), (ρXZ, ρYZ, 1).
10
The Normal Case (trivariate case)
Suppose we know the bivariate distributions of (Y, X) and (Z, X), so that ρXY and ρXZ are known, but we have no joint information on (Y, Z): ρYZ is unknown.
11
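The Normal Case (trivariate case)
Among the values −1 ≤ ρYZ ≤ 1, those that determine a valid distribution are the ones for which the matrix ρ is positive semidefinite, i.e. its determinant is non-negative: 1 − ρXY² − ρXZ² − ρYZ² + 2 ρXY ρXZ ρYZ ≥ 0. This implies that ρYZ belongs to the interval ρXY ρXZ − √((1 − ρXY²)(1 − ρXZ²)) ≤ ρYZ ≤ ρXY ρXZ + √((1 − ρXY²)(1 − ρXZ²)).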
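For a 3x3 correlation matrix with valid ρXY and ρXZ, positive semidefiniteness reduces to a non-negative determinant, which gives the bounds ρXY ρXZ ± √((1 − ρXY²)(1 − ρXZ²)) for ρYZ. A minimal sketch in Python; the function names and the input correlations are hypothetical:

```python
import math


def rho_yz_interval(rho_xy, rho_xz):
    """Interval of rho_YZ values that keep the 3x3 correlation matrix PSD."""
    for r in (rho_xy, rho_xz):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    centre = rho_xy * rho_xz  # also the value of rho_YZ implied by the CIA
    half_width = math.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    return centre - half_width, centre + half_width


def determinant(rho_xy, rho_xz, rho_yz):
    """Determinant of the correlation matrix of (X, Y, Z)."""
    return (1.0 - rho_xy ** 2 - rho_xz ** 2 - rho_yz ** 2
            + 2.0 * rho_xy * rho_xz * rho_yz)


low, high = rho_yz_interval(0.5, 0.5)  # hypothetical correlations
print(low, high)
```

The determinant is zero at both endpoints and positive in between. In the normal case the CIA value ρXY ρXZ is the midpoint of the interval, and the interval narrows as either known correlation approaches ±1.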
12
Numerical example. Let θ*XY and θ*XZ define the following correlation matrix: [matrix not recovered]. The uncertainty region is [lower bound not recovered] ≤ ρYZ ≤ [upper bound not recovered]. Under CIA: [value not recovered].
13
Example: multivariate case
Suppose: [values not recovered]. Uncertainty intervals are: [intervals not recovered].
14
Example: multivariate case
Set of admissible values for (ρYZ1, ρYZ2). In the picture, ρL and ρU are the extremes for ρYZ2 when ρYZ1 = 0.7.
15
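Multinomial case. Let (X, Y, Z) be a multinomial r.v. with true (unknown) parameters θ*ijk = P(X = i, Y = j, Z = k), i = 1, ..., I, j = 1, ..., J, k = 1, ..., K, with θ*ijk ≥ 0 and Σijk θ*ijk = 1. The natural space of the parameters is Θ = {θ : θijk ≥ 0, Σijk θijk = 1}. With no information, it describes the uncertainty on the parameters.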
16
Multinomial case. Suppose we know the marginal distributions of (X, Y) and (X, Z). This information reduces the parameter space (and thus the uncertainty) to all the distributions such that: Σk θijk = θ*ij. and Σj θijk = θ*i.k for all i, j, k.
17
Multinomial case. Without information the limits are θLijk = 0, θUijk = 1.
With information, the limits are such that θLijk > 0 and θUijk < 1 for some (i, j, k). The parameters under the CIA lie in the interval of admissible values, but not necessarily at the midpoint of the interval [θLijk, θUijk].
18
Fréchet bounds. Bounds can be obtained by means of the Fréchet bounds for joint distributions. Let F(x) and G(y) be the marginal distributions of H(x, y); then max{0, F(x) + G(y) − 1} ≤ H(x, y) ≤ min{F(x), G(y)}. In the case of categorical variables we have (using the conditional distributions): max{0, θj|i + θk|i − 1} ≤ θjk|i ≤ min{θj|i, θk|i}.
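For categorical variables, the Fréchet bounds within one category i of X are max{0, P(Y = j|X = i) + P(Z = k|X = i) − 1} and min{P(Y = j|X = i), P(Z = k|X = i)}. A minimal sketch in Python; the function name and the conditional distributions are hypothetical, and the CIA value is reported alongside for comparison:

```python
def frechet_bounds(p_y_given_x, p_z_given_x):
    """Fréchet bounds for P(Y = j, Z = k | X = i) within one category of X.

    p_y_given_x -- list of P(Y = j | X = i) over j
    p_z_given_x -- list of P(Z = k | X = i) over k
    Returns a dict {(j, k): (lower, cia, upper)}.
    """
    bounds = {}
    for j, py in enumerate(p_y_given_x):
        for k, pz in enumerate(p_z_given_x):
            lower = max(0.0, py + pz - 1.0)
            upper = min(py, pz)
            cia = py * pz  # value under conditional independence
            bounds[(j, k)] = (lower, cia, upper)
    return bounds


# Hypothetical conditional distributions for a single category of X.
result = frechet_bounds([0.7, 0.3], [0.6, 0.4])
for cell, (lower, cia, upper) in sorted(result.items()):
    print(cell, round(lower, 3), round(cia, 3), round(upper, 3))
```

Multiplying each bound by P(X = i) gives bounds on the joint cell probability. The CIA value always lies between the two bounds, but generally not halfway between them.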
19
Example. The marginal distribution f(X, Y) is known: [table not recovered].
The marginal distribution f(X, Z) is known: [table not recovered].
20
Example. The uncertainty intervals for the joint distribution are: [table not recovered].
21
Stat. Matching FADN and FSS: variables
22
FADN and FSS: a set of estimates
Knowledge of the marginal distribution of X and of the conditional distributions Y|X and Z|X, together with the Fréchet bounds, implies that all the estimates of the parameter θ.1k in the interval ( , ) are equally plausible. For the whole table (Y, Z), the intervals of estimates are the following:
23
Contingency table of (Y,Z)
24
FADN and FSS. CIA estimates
The CIA estimates are (all included in the intervals):
25
Evaluation of uncertainty
The word uncertainty refers to the set of all the values of the inestimable parameters that are compatible with the estimated values of the estimable parameters. The objective is not a point estimate, but a set estimate. The length/volume of this set depends on: the strength of the relationship between the matching and the target variables; possible constraints.
26
Formal assessment of uncertainty
A natural measure of uncertainty is given by the sum of the interval widths, Σijk (θUijk − θLijk), divided by the number of uncertain parameters, i.e., those parameters such that θLijk < θUijk.
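A minimal sketch of such a measure in Python, assuming it is the sum of the interval widths θUijk − θLijk divided by the number of parameters whose lower and upper bounds differ. The function name, the tolerance and the input bounds are hypothetical:

```python
def average_uncertainty(bounds, tol=1e-12):
    """Average interval width over the uncertain parameters.

    bounds -- iterable of (lower, upper) pairs, one per parameter
    A parameter counts as uncertain when upper - lower exceeds tol.
    """
    widths = [upper - lower for lower, upper in bounds]
    if any(w < -tol for w in widths):
        raise ValueError("each lower bound must not exceed its upper bound")
    uncertain = [w for w in widths if w > tol]
    if not uncertain:
        return 0.0  # every parameter is identified
    return sum(uncertain) / len(uncertain)


# Hypothetical bounds: three uncertain cells and one fully identified cell.
cells = [(0.3, 0.6), (0.1, 0.4), (0.0, 0.3), (0.2, 0.2)]
print(average_uncertainty(cells))
```

Dividing by the number of uncertain parameters only, rather than by all parameters, keeps fully identified cells from diluting the measure.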
27
Estimate of uncertainty: a likelihood based approach
For multinormal and multinomial distributions, the estimate of the likelihood ridge is the set of all parameters θ compatible with the maximum likelihood estimates of the estimable parameters.
28
Example: multinomial case
The set of solutions of [system of equations not recovered] is the estimated likelihood ridge. Note that [formula not recovered].
29
External partial information may decrease uncertainty
Example: analyse the variables 'Age' and 'Marital status'. Constraint between age and marital status: the probability of 'Age = younger than a certain legal age' and 'Marital status = married' must be zero. The effect of constraints is to reduce the set of possible values for the inestimable parameters, i.e. to reduce the uncertainty.
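The effect of a structural zero can be seen in the smallest case, a 2x2 table for (Y, Z) within one category of X. With margins a = P(Y = 1|X) and b = P(Z = 1|X), the table has a single free cell t = P(Y = 1, Z = 1|X), bounded by the Fréchet limits; forcing one cell to zero collapses the interval to a point. A minimal sketch in Python, where the function name, the margins and the coding of the cells are hypothetical:

```python
def cell_interval(a, b, zero_cell=None):
    """Admissible interval for t = P(Y=1, Z=1 | X) in a 2x2 table.

    a, b      -- margins P(Y=1 | X) and P(Z=1 | X)
    zero_cell -- optional structural zero, one of "11", "10", "01", "00"
                 (first digit is Y, second is Z)
    The four cells are t, a - t, b - t and 1 - a - b + t.
    """
    lower, upper = max(0.0, a + b - 1.0), min(a, b)
    if zero_cell is None:
        return lower, upper
    # A structural zero fixes t at the value that empties the chosen cell.
    forced = {"11": 0.0, "10": a, "01": b, "00": a + b - 1.0}[zero_cell]
    if not (lower - 1e-12 <= forced <= upper + 1e-12):
        raise ValueError("structural zero is incompatible with the margins")
    return forced, forced


# Hypothetical margins: Y = 1 is 'married', Z = 1 is 'below the legal age'.
print(cell_interval(0.6, 0.1))        # no constraint
print(cell_interval(0.6, 0.1, "11"))  # 'married and below the legal age' is impossible
```

In larger tables a structural zero usually narrows the intervals of the remaining cells rather than collapsing them, and computing the new bounds generally requires solving a linear program over the constrained cells.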
30
Simulated example. Variables 'Age' (AGE), 'Educational Level' (EDU) and 'Professional Status' (PRO) are observed on 2313 employees (people at least 15 years old). The original file has been randomly split into two almost equal subsets. 'Professional Status' has been removed from the first subset (file A), containing 1165 units. 'Educational Level' has been removed from the second subset (file B), containing 1148 units.
31
Simulated example
32
Simulated example Contingency table of Prof. Status vs Age in file A
33
Simulated example. Contingency table of Educ. vs Age in file B
34
Simulated example: constraints
Structural zeros - Some structural zeros are induced by the observed tables: e.g. in Italy a 17-year-old person cannot have a university degree. Structural zeros on (Y, Z) must also be set: the probability of managers (PRO = 'M') with at most a compulsory school educational level (EDU = 'C') should be set to zero.
35
Simulated example: constraints
Inequality constraints - E.g., in this population, units aged 23 to 64 who have a degree and are managers are more frequent than units of the same age and educational level whose professional status is clerk.
36
Simulated example: constraints
We study how the likelihood ridge varies in the following three situations: S0: unrestricted; S1: only structural zeros; S2: structural zeros and inequality constraints.
37
38
Selected references
Rubin, D. B. (1987), Multiple Imputation for Non-Response in Surveys, Wiley.
Manski, C. F. (1995), Identification Problems in the Social Sciences, Cambridge, Massachusetts: Harvard University Press.
Kadane, J. B. (1978), "Some Statistical Problems in Merging Data Files", in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C. (Reprinted in 2001, Journal of Official Statistics, 17).
Moriarity, C., and Scheuren, F. (2001), "Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure", Journal of Official Statistics, 17.
Moriarity, C., and Scheuren, F. (2003), "A Note on Rubin's Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation", Journal of Business & Economic Statistics, 21(1).