Marco Di Zio Dept. Integration, Quality, Research and Production


Uncertainty in Statistical Matching
Training Course «Statistical Matching», Rome, 6-8 November 2013
Marco Di Zio, Dept. Integration, Quality, Research and Production Networks Development Department, Istat
dizio [at] istat.it

Outline
The problem
Identification problem
A formal definition of uncertainty in SM
The Normal Case
The Multinomial Case
Estimation of uncertainty
Reduction of uncertainty: logical constraints

The problem
Information is requested on variables that are not jointly observed.
It is a statistical problem with partial knowledge.
Approaches to fill the lack of knowledge: introduce the conditional independence assumption (CIA), or make use of auxiliary information.
What to do when the CIA and auxiliary information cannot be used?

Identification problem - Example
Let X be a dichotomous r.v., P(X = 1) = θ, P(X = 0) = 1 − θ, and R the corresponding indicator of missingness, i.e., R = 1 if X is observed and 0 otherwise. The probability θ can be written as: θ = P(R = 1)P(X = 1|R = 1) + P(R = 0)P(X = 1|R = 0).
The critical probability is P(X = 1|R = 0): we have no information about it, because X is missing whenever R = 0. Under an assumption on the missing-data mechanism (MCAR, MAR), this probability can be estimated from the observed dataset: e.g. under MCAR, P(X = 1|R = 0) = P(X = 1|R = 1).

Identification problem - Example
If we cannot use 'external' information, the idea is to analyze all the possible solutions. Since 0 ≤ P(X = 1|R = 0) ≤ 1, θ may take values in the following interval: P(R = 1)P(X = 1|R = 1) ≤ θ ≤ P(R = 1)P(X = 1|R = 1) + P(R = 0).
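
The interval above can be computed directly from the two estimable quantities. The following is a minimal sketch, not part of the original slides; the function names and the input values 0.8 and 0.5 are illustrative assumptions.

```python
def theta_bounds(p_observed, p_x1_given_observed):
    """Bounds for theta = P(X = 1) when P(X = 1 | R = 0) is unknown.

    p_observed          : P(R = 1), the probability that X is observed
    p_x1_given_observed : P(X = 1 | R = 1), estimable from the observed data
    """
    for p in (p_observed, p_x1_given_observed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    lower = p_observed * p_x1_given_observed   # P(X = 1 | R = 0) = 0
    upper = lower + (1.0 - p_observed)         # P(X = 1 | R = 0) = 1
    return lower, upper


def theta_mcar(p_x1_given_observed):
    """Point value of theta under MCAR: P(X = 1 | R = 0) = P(X = 1 | R = 1)."""
    return p_x1_given_observed


# Hypothetical inputs: 80% of the units observed, half of them with X = 1.
lower, upper = theta_bounds(0.8, 0.5)
print(lower, upper, theta_mcar(0.5))
```

The width of the interval equals P(R = 0), so the bounds tighten as the share of missing values shrinks, and the MCAR value always lies inside them.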

Identification problem in Statistical Matching
For statistical matching the idea is similar.
This kind of analysis is important for exploratory analysis, and it can give an indirect justification for the use of models based on the CIA.
This approach has been used by Kadane (1978), Rubin (1986), Moriarity and Scheuren (2001, 2003) and, more explicitly, by Raessler (2002) and D’Orazio, Di Zio and Scanu (2006). The identification problem for missing data is treated by Manski (1995).

A formal definition of uncertainty in SM
Let f(x, y, z; θ*) be the probability distribution of (X, Y, Z), where θ* is the unknown parameter. Without any information, the uncertainty is given by the whole parameter space Θ.

A formal definition of uncertainty in SM
Suppose we know the partial distributions of (X, Y) and (X, Z), i.e. the parameters θXY and θXZ are known and equal to θ*XY and θ*XZ.
With this information, the uncertainty on θ decreases, as the parameters may assume only the values compatible with the constraints θXY = θ*XY and θXZ = θ*XZ.

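The Normal Case (trivariate case)
Let (X, Y, Z) be a trivariate normal with mean vector μ = (μX, μY, μZ) and correlation matrix
\[
\rho = \begin{pmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{pmatrix}
\]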

The Normal Case (trivariate case)
Suppose we know the bivariate distributions of (Y, X) and (Z, X) but have no joint information on (Y, Z): ρXY and ρXZ are known, while ρYZ is not.

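The Normal Case (trivariate case)
The values of ρYZ in [−1, 1] that determine a valid distribution are those for which the matrix ρ is positive semidefinite, i.e. det ρ ≥ 0. This implies that ρYZ belongs to the interval
\[
\rho_{XY}\rho_{XZ} - \sqrt{(1-\rho_{XY}^2)(1-\rho_{XZ}^2)} \;\le\; \rho_{YZ} \;\le\; \rho_{XY}\rho_{XZ} + \sqrt{(1-\rho_{XY}^2)(1-\rho_{XZ}^2)}
\]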
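
The step from positive semidefiniteness to an interval for ρYZ can be spelled out. This derivation is a sketch added for clarity, not taken from the slides. Since the 1×1 and 2×2 principal minors of a correlation matrix with |ρXY| ≤ 1 and |ρXZ| ≤ 1 are already non-negative, the only binding condition is the determinant, which is quadratic in ρYZ:

```latex
\begin{align*}
\det \rho \ge 0
  &\iff 1 - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2
        + 2\,\rho_{XY}\rho_{XZ}\rho_{YZ} \ge 0 \\
  &\iff \rho_{YZ}^2 - 2\,\rho_{XY}\rho_{XZ}\,\rho_{YZ}
        + \rho_{XY}^2 + \rho_{XZ}^2 - 1 \le 0 \\
  &\iff \left(\rho_{YZ} - \rho_{XY}\rho_{XZ}\right)^2
        \le \left(1 - \rho_{XY}^2\right)\left(1 - \rho_{XZ}^2\right) \\
  &\iff \left|\rho_{YZ} - \rho_{XY}\rho_{XZ}\right|
        \le \sqrt{\left(1 - \rho_{XY}^2\right)\left(1 - \rho_{XZ}^2\right)}
\end{align*}
```

The interval is therefore centred at ρXY ρXZ, which is the value of ρYZ under conditional independence of Y and Z given X.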

Numerical example
Let θ*XY and θ*XZ define the following correlation matrix:
The uncertainty region is −0.1676 ≤ ρYZ ≤ 0.9992. Under the CIA, ρYZ = ρXY ρXZ, which is the midpoint of this interval (0.4158).
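
The correlation matrix of this example is not preserved in the transcript. As a hedged illustration, the sketch below assumes the two known correlations are 0.66 and 0.63 (which one belongs to (X, Y) and which to (X, Z) does not affect the result); these assumed values reproduce the interval quoted above to four decimals. The function name is hypothetical.

```python
import math


def rho_yz_interval(rho_xy, rho_xz):
    """Admissible interval for rho_YZ in a trivariate normal model,
    given the two known correlations rho_XY and rho_XZ."""
    centre = rho_xy * rho_xz  # value of rho_YZ under the CIA
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width


# Assumed inputs (not given in the transcript).
rho_xy, rho_xz = 0.66, 0.63
low, high = rho_yz_interval(rho_xy, rho_xz)
cia = rho_xy * rho_xz

print(round(low, 4), round(high, 4), round(cia, 4))
```

The interval is wide because X explains only part of the variability of Y and Z; it shrinks towards the CIA value as either known correlation approaches ±1.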

Example: multivariate case
Let us suppose:
Uncertainty intervals are:

Example: multivariate case
Set of admissible values for (ρYZ1, ρYZ2). In the picture, ρL and ρU are the extremes for ρYZ2 when ρYZ1 = 0.7.

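Multinomial case
Let (X, Y, Z) be a multinomial r.v. with true (unknown) parameters
\[
\theta^*_{ijk} = P(X = i, Y = j, Z = k), \quad i = 1, \dots, I,\; j = 1, \dots, J,\; k = 1, \dots, K.
\]
The natural space of the parameters is
\[
\Theta = \Big\{ \theta : \theta_{ijk} \ge 0,\; \sum_{i,j,k} \theta_{ijk} = 1 \Big\}.
\]
With no information, it describes the uncertainty on the parameters.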

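Multinomial case
Suppose we know the marginal distributions of (X, Y) and (X, Z).
This information reduces the parameter space (and thus the uncertainty) to all the distributions such that:
\[
\sum_{k} \theta_{ijk} = \theta^*_{ij\cdot}, \qquad \sum_{j} \theta_{ijk} = \theta^*_{i\cdot k} \qquad \text{for all } i, j, k.
\]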

Multinomial case
Without information the limits are 0 ≤ θijk ≤ 1.
With information, there are limits θLijk ≤ θijk ≤ θUijk such that θLijk > 0 and θUijk < 1 for some (i, j, k).
The parameters under the CIA lie in the interval of acceptable values [θLijk, θUijk], but not necessarily at its central point.

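Fréchet bounds
Bounds can be obtained by means of the Fréchet bounds for joint distributions. Let F(x) and G(y) be the marginal distributions of H(x, y); then
\[
\max\{0,\, F(x) + G(y) - 1\} \;\le\; H(x, y) \;\le\; \min\{F(x),\, G(y)\}.
\]
In the case of categorical variables we have (using the conditional distributions)
\[
\max\{0,\, \theta_{j|i} + \theta_{k|i} - 1\} \;\le\; \theta_{jk|i} \;\le\; \min\{\theta_{j|i},\, \theta_{k|i}\}.
\]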
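
For categorical variables, the Fréchet bounds on each conditional cell can be averaged over the distribution of X to bound the cells of the (Y, Z) table. The sketch below is an illustration under assumed inputs: the function names and all probabilities are hypothetical, not data from the course.

```python
def frechet_bounds(p_j, p_k):
    """Fréchet bounds for a joint cell probability with margins p_j and p_k."""
    return max(0.0, p_j + p_k - 1.0), min(p_j, p_k)


def yz_bounds(p_x, p_y_given_x, p_z_given_x):
    """Cell-wise bounds for P(Y = j, Z = k) = sum_i P(X = i) P(Y = j, Z = k | X = i).

    p_x         : list of P(X = i)
    p_y_given_x : p_y_given_x[i][j] = P(Y = j | X = i)
    p_z_given_x : p_z_given_x[i][k] = P(Z = k | X = i)
    """
    n_j, n_k = len(p_y_given_x[0]), len(p_z_given_x[0])
    lower = [[0.0] * n_k for _ in range(n_j)]
    upper = [[0.0] * n_k for _ in range(n_j)]
    for i, weight in enumerate(p_x):
        for j in range(n_j):
            for k in range(n_k):
                lo, up = frechet_bounds(p_y_given_x[i][j], p_z_given_x[i][k])
                lower[j][k] += weight * lo
                upper[j][k] += weight * up
    return lower, upper


# Hypothetical distributions with two categories per variable.
p_x = [0.4, 0.6]
p_y_given_x = [[0.9, 0.1], [0.3, 0.7]]
p_z_given_x = [[0.8, 0.2], [0.5, 0.5]]

lower, upper = yz_bounds(p_x, p_y_given_x, p_z_given_x)

# CIA value of the first cell: sum_i P(X = i) P(Y = 0 | X = i) P(Z = 0 | X = i)
cia_00 = sum(w * py[0] * pz[0] for w, py, pz in zip(p_x, p_y_given_x, p_z_given_x))
print(lower[0][0], upper[0][0], cia_00)
```

With these inputs the first cell is bounded by roughly 0.28 and 0.50, and the CIA value (about 0.378) falls inside the interval without coinciding with its midpoint (0.39).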

Example
The marginal distribution f(X, Y) is known:
The marginal distribution f(X, Z) is known:

Example
The uncertainty intervals for the joint distribution are:

Statistical matching of FADN and FSS: variables

FADN and FSS: a set of estimates
Knowledge of the marginal distribution of X and of the conditional distributions Y|X and Z|X, together with the Fréchet bounds, implies that all the estimates of the parameter θ.1k in (0.02959, 0.04903) are equally plausible. For the whole table (Y, Z), the intervals of estimates are the following:

Contingency table of (Y, Z)

FADN and FSS: CIA estimates
The CIA estimates (all included in the corresponding intervals) are:

Evaluation of uncertainty
The word uncertainty refers to the set of all the values of the inestimable parameters that are compatible with the estimated values of the estimable parameters.
The objective is not a point estimate, but a set estimate.
The length/volume of this set depends on:
the strength of the relationship between the matching and the target variables;
possible constraints.

Formal assessment of uncertainty
A natural measure of uncertainty is given by divided by the number of uncertain parameters, i.e., those parameters such that
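
The formula on this slide is not preserved in the transcript. One plausible reading, stated here as an assumption rather than as the author's definition, is the sum of the interval widths θU − θL divided by the number of parameters whose interval is not degenerate (θL < θU). A minimal sketch of that reading, with a hypothetical function name and made-up bounds:

```python
def average_interval_width(lower, upper, tol=1e-12):
    """Average width of the uncertainty intervals over the uncertain parameters.

    ASSUMPTION: the measure is sum(upper - lower) / (number of parameters
    with lower < upper); the slide's own formula is not available.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    widths = [u - l for l, u in zip(lower, upper)]
    if any(w < -tol for w in widths):
        raise ValueError("each lower bound must not exceed its upper bound")
    uncertain = [w for w in widths if w > tol]
    if not uncertain:
        return 0.0  # every parameter is point-identified
    return sum(uncertain) / len(uncertain)


# Made-up bounds for three parameters; the second one is point-identified.
lower = [0.1, 0.2, 0.3]
upper = [0.3, 0.2, 0.6]
measure = average_interval_width(lower, upper)
print(measure)
```

Under this reading the measure is 0 when all parameters are identified and grows with the average width of the set estimate.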

Estimate of uncertainty: a likelihood-based approach
For multinormal and multinomial distributions, the estimated likelihood ridge is the set of all parameters θ compatible with the maximum likelihood estimates of the estimable parameters.

Example: multinomial case
The set of solutions of is the estimated likelihood ridge. Note that

External partial information may decrease uncertainty
Example: analyse the variables 'Age' and 'Marital status'. Constraint between age and marital status: the probability of 'Age = younger than a certain legal age' and 'Marital status = married' must be zero.
The effect of constraints is to reduce the set of possible values for the inestimable parameters, i.e. to reduce the uncertainty.
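
The effect of a structural zero can be seen in the simplest setting: a 2×2 table of age class by marital status with known margins and no matching variable. The sketch below is illustrative only; the margins 0.2 and 0.5, the category labels and the function name are assumptions.

```python
def table_bounds(p_young, p_married, structural_zero=False):
    """Bounds for the four cells of a 2x2 (age class, marital status) table
    with known margins. With structural_zero=True, P(young, married) = 0."""
    lo = max(0.0, p_young + p_married - 1.0)           # Fréchet lower bound
    up = 0.0 if structural_zero else min(p_young, p_married)
    if lo > up:
        raise ValueError("margins are incompatible with the structural zero")
    rest = 1.0 - p_young - p_married
    # With fixed margins, the (young, married) cell determines the other three.
    return {
        ("young", "married"): (lo, up),
        ("young", "unmarried"): (p_young - up, p_young - lo),
        ("adult", "married"): (p_married - up, p_married - lo),
        ("adult", "unmarried"): (rest + lo, rest + up),
    }


# Assumed margins: 20% below the legal age, 50% married.
free = table_bounds(0.2, 0.5)
constrained = table_bounds(0.2, 0.5, structural_zero=True)

for cell in free:
    print(cell, free[cell], constrained[cell])
```

Without the constraint every cell has an interval of width 0.2; with the structural zero all four intervals collapse to a point, so in this small case the constraint removes the uncertainty entirely.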

Simulated example
The variables 'Age' (AGE), 'Educational Level' (EDU) and 'Professional Status' (PRO) are observed on 2313 employees (people at least 15 years old).
The original file was randomly split into two almost equal subsets.
'Professional Status' was removed from the first subset (file A), containing 1165 units.
'Educational Level' was removed from the second subset (file B), containing 1148 units.

Simulated example
Contingency table of Professional Status vs Age in file A

Simulated example
Contingency table of Educational Level vs Age in file B

Simulated example: constraints
Structural zeros. Some structural zeros are induced by the observed tables: e.g. in Italy a 17-year-old person cannot have a university degree:
Structural zeros on (Y, Z) must also be set: the probability of managers (PRO = 'M') with at most a compulsory school educational level (EDU = 'C') should be set to zero:

Simulated example: constraints
Inequality constraints. E.g., in this population, among units aged 23 to 64 with a degree, managers are more frequent than clerks:

Simulated example: constraints
We study how the likelihood ridge varies in the following three situations:
S0: unrestricted;
S1: only structural zeros;
S2: structural zeros and inequality constraints.

Selected references
Rubin, D. B. (1987), Multiple Imputation for Non-Response in Surveys, Wiley.
Manski, C. F. (1995), Identification Problems in the Social Sciences, Cambridge, Massachusetts: Harvard University Press.
Kadane, J. B. (1978), “Some Statistical Problems in Merging Data Files”, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159-179 (reprinted in 2001, Journal of Official Statistics, 17, 423-433).
Moriarity, C., and Scheuren, F. (2001), “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Journal of Official Statistics, 17, 407-422.
Moriarity, C., and Scheuren, F. (2003), “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Journal of Business & Economic Statistics, 21(1), 65-73.