Published by Edwin Wilcox. Modified over 8 years ago.
1
Outlier Identification Procedures for Contingency Tables in Longitudinal Data
Danila Filipponi, Simonetta Cozzi (ISTAT, Italy)
Roma, 8-11 July 2008
2
What is the problem?
► Since December 2006, ISTAT has released a statistical register of local units (LU) of enterprises (ASIA-LU). It supplies yearly information on local units that, until 2001, was available only every ten years (Industry and Services Census).
► The register has been set up starting from an administrative/statistical information base of addresses, using statistical models to estimate the activity status and other attributes of the local units.
► ASIA-LU provides (mainly) the number of local units and of local-unit employees by municipality and economic activity.
Outline: What is the problem? · Outliers in contingency tables · Non-parametric approach: median polish · Correlated count data · Correlated count data: GEE · Correlated count data: REGEE · Simulation of correlated count data · Outlier detection in ASIA-UL · Results
3
What is the problem?
► Because of the nature of the available information, selective editing is indispensable to identify possibly anomalous counts (LU/employees) in some combinations of the classification variables.
► The objective is to identify anomalous numbers of employees and/or local units classified by municipality and economic activity, taking into account the longitudinal information on LU, i.e. the local units registers (2004-2005) and the Census surveys (1991-1996-2001).
The contingency table is: [table not reproduced]
4
What is the problem?
► Outlying observations in a set of data are generally viewed as deviations from a model assumption: the majority of the observations (inliers) are assumed to come from a selected model (the null model), while a few units (outliers) are thought of as coming from a different model.
► The outlier identification problem then becomes the problem of identifying those observations that lie in an outlier region defined according to the selected null model:
$$\mathrm{out}(\alpha, P) = \{x : f(x) < K(\alpha)\},$$
where $\mathcal{P}$ is a distribution family such that $P \in \mathcal{P}$, $P$ has density $f$, and $K(\alpha) = \sup\{K > 0 : P(\{y : f(y) < K\}) \le \alpha\}$.
5
Outliers in Contingency Tables: Some Notation
Consider $T$ categorical variables $X_1, \dots, X_T$, where $X_t$ has $I_t$ possible outcomes. Each combination of outcomes defines a cell of a contingency table. Given a set of data, each observation belongs to one combination, and the frequency count of cell $i$ is denoted $y_i$, $i = 1, \dots, n$.
Under a loglinear Poisson model, the cell counts are considered realizations of independent Poisson variables $Y_i$ with expected values $\mu_i$.
In a contingency table, a cell count $y_i$ is viewed as an outlier if it occurs with small probability under the null model.
6
Outliers in Contingency Tables
► Assuming a loglinear Poisson model, the outlier region for each cell count $y_i$ is defined as
$$\mathrm{out}(\alpha_i, \mu_i) = \{y \in \mathbb{N} : p(y; \mu_i) < K(\alpha_i)\},$$
where $\mathbb{N}$ is the set of all non-negative integers, $p(y; \mu_i) = e^{-\mu_i} \mu_i^{y} / y!$ is the Poisson probability function, and $K(\alpha_i) = \sup\{K > 0 : \sum_{y \in \mathbb{N}} p(y; \mu_i)\, 1[p(y; \mu_i) < K] \le \alpha_i\}$. Its complement in $\mathbb{N}$ is the inlier region $\mathrm{inl}(\alpha_i, \mu_i)$.
► The cell count $y_i$ is then an $\alpha_i$-outlier if it lies in the $\alpha_i$-outlier region of the Poisson distribution with parameter $\mu_i$.
► The values $\alpha_i$ should be chosen so that the probability that one or more outliers occur in the contingency table does not exceed a given value $\alpha$. Assuming all the $\alpha_i$ to be the same, it can be shown that $\alpha_i = 1 - (1 - \alpha)^{1/n}$, where $n$ is the number of cells.
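The outlier region of a Poisson cell can be computed numerically. The following is a minimal sketch, not the authors' implementation: it assumes the usual reading of the definition (the least probable counts are collected until their total probability would exceed the level), truncates the support far in the upper tail, and ignores exact ties between probabilities. The function names `inlier_bounds`, `is_outlier` and `cell_level` are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability P(Y = y) for mean mu > 0, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def inlier_bounds(alpha, mu):
    """Return (lo, hi): the smallest and largest counts that are not alpha-outliers."""
    # Assumption: counts above y_max are treated as outlying, and their
    # (tiny) total probability is counted against alpha.
    y_max = int(mu + 40.0 * math.sqrt(mu) + 60)
    pmf = [poisson_pmf(y, mu) for y in range(y_max + 1)]
    mass = max(0.0, 1.0 - sum(pmf))  # probability beyond y_max
    outlying = set()
    # Add the least probable counts while the outlying mass stays <= alpha.
    for y in sorted(range(y_max + 1), key=lambda k: pmf[k]):
        if mass + pmf[y] > alpha:
            break
        mass += pmf[y]
        outlying.add(y)
    inliers = [y for y in range(y_max + 1) if y not in outlying]
    return inliers[0], inliers[-1]

def is_outlier(y, alpha, mu):
    """True if the count y lies outside the inlier interval for (alpha, mu)."""
    lo, hi = inlier_bounds(alpha, mu)
    return y < lo or y > hi

def cell_level(alpha, n_cells):
    """Per-cell level 1 - (1 - alpha)^(1/n) for n independent cells."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_cells)
```

For a cell with mean 5 and level 0.01, this sketch gives the inlier interval 0 to 11, so a count of 12 or more would be flagged. Because the Poisson distribution is unimodal, the inlier region is an interval, which is why two bounds are enough to describe it.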
7
Outliers in Contingency Tables
► Loglinear models for contingency tables are Generalized Linear Models (GLM) in which the expected cell count is $\mu_i = \exp(x_i^\top \beta)$, where $x_i^\top$ is the $i$-th row of a full-rank design matrix $X$ and $\beta$ is a parameter vector.
► In practice, to define the outlier region and identify the outlying cells, it is necessary to estimate the parameter vector $\beta$.
► In the situation with only one measurement for each subject, i.e. without a correlation structure, the classical estimator for GLMs is the maximum likelihood (ML) estimator. Because of the nature of the ML estimator, the regression parameter estimates can be highly influenced by the presence of outlying cells. Some robust alternatives have been proposed in the literature.
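The sensitivity of ML to a single outlying cell can be seen on a small example. The sketch below is illustrative only: it assumes an independence (row plus column effects) loglinear model with dummy coding, fits it by Newton-Raphson (equivalent to IRLS for the log link), and uses a made-up 3x3 table. The helper names are hypothetical.

```python
import numpy as np

def independence_design(n_rows, n_cols):
    """Dummy-coded design matrix: intercept, row effects, column effects."""
    rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
    columns = [np.ones(n_rows * n_cols)]
    columns += [(rows == r).astype(float) for r in range(1, n_rows)]
    columns += [(cols == c).astype(float) for c in range(1, n_cols)]
    return np.column_stack(columns)

def fit_poisson_loglinear(X, y, n_iter=100, tol=1e-10):
    """ML estimate of beta in mu = exp(X beta), by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from a constant fit (first column is the intercept)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # gradient of the log-likelihood
        info = X.T @ (mu[:, None] * X)    # Fisher information
        step = np.linalg.solve(info, score)
        beta += step
        if np.abs(step).max() < tol:
            break
    return beta

def fitted_table(table):
    """Fitted expected counts under the independence loglinear model."""
    table = np.asarray(table, dtype=float)
    X = independence_design(*table.shape)
    beta = fit_poisson_loglinear(X, table.ravel())
    return np.exp(X @ beta).reshape(table.shape)

clean = np.array([[20, 30, 50], [10, 15, 25], [40, 60, 100]])
contaminated = clean.copy()
contaminated[1, 1] = 150  # one planted outlying cell

fitted_clean = fitted_table(clean)
fitted_contaminated = fitted_table(contaminated)
```

In this toy table the clean counts are exactly multiplicative, so the ML fit reproduces them. After one cell is contaminated, the fitted values of cells in other rows and columns move as well, which is the influence the slide refers to.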
8
Non parametric approach: Median Polish
► A procedure that supplies robust estimates in the analysis of contingency tables is the median polish method (Mosteller & Tukey, 1977; Emerson & Hoaglin, 1983).
► Given a contingency table with two factors, if an additive model is assumed, the value $y_{ij}$ can be expressed as the sum of a constant term, an effect for level $i$ of the row factor, an effect for level $j$ of the column factor, and a random term: $y_{ij} = m + a_i + b_j + e_{ij}$.
► The median polish procedure operates iteratively on the table, calculating and subtracting row and column medians, and ends when all the rows and columns have a median equal to zero.
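The iteration described above can be sketched in a few lines. This is a generic version of Tukey's median polish under the additive model, not the code used by the authors; the function name, the iteration cap and the tolerance are assumptions made for the example.

```python
import numpy as np

def median_polish(table, max_iter=20, tol=1e-8):
    """Decompose table[i, j] = overall + row_eff[i] + col_eff[j] + resid[i, j]."""
    resid = np.asarray(table, dtype=float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Row sweep: move each row median from the residuals into the row effect.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        shift = np.median(col_eff)
        col_eff -= shift
        overall += shift
        # Column sweep: same operation on the columns.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
        shift = np.median(row_eff)
        row_eff -= shift
        overall += shift
        # Stop when every row and column median is (numerically) zero.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, row_eff, col_eff, resid

# Additive 4x4 table with one contaminated cell at position (1, 2).
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 2.0, 4.0, 6.0])
table = 10.0 + a[:, None] + b[None, :]
table[1, 2] += 50.0

overall, row_eff, col_eff, resid = median_polish(table)
```

Because medians are used instead of means, the contaminated cell does not leak into the row and column effects, and it shows up as the single large residual.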
9
Correlated count data
► There are several ways to extend GLMs to take into account the correlation among repeated responses: the marginal modeling approach (GEE), random effects models for categorical responses (GLMM), and transitional models.
► In longitudinal studies the data consist of repeated responses $y_{i1}, \dots, y_{in_i}$ on each subject $i$, together with the corresponding covariates.
► Repeated responses on the same subject tend to be more alike (generally positively correlated) than responses on different subjects. Standard statistical procedures that ignore this within-subject correlation may produce invalid results.
10
Correlated count data - GEE
► A reasonable alternative to ML estimation for longitudinal count data is a multivariate generalization of the quasi-likelihood. Let $Y_i$ be the $n_i$ vector of outcomes and $X_i$ the $n_i \times p$ matrix of covariates for subject $i$.
► Rather than assuming a distribution for the response variable $Y$, the quasi-likelihood method specifies only the moments:
the mean $E(y_{ij}) = \mu_{ij}$, which is a function of the linear predictor, $g(\mu_{ij}) = x_{ij}^\top \beta$;
the variance $\mathrm{Var}(y_{ij}) = \phi\, v(\mu_{ij})$, which depends on the mean and on a scale parameter $\phi$.
11
Correlated count data - GEE
In the quasi-likelihood method, the estimates of the regression and nuisance parameters are the solutions of the generalized quasi-score function, called the Generalized Estimating Equation (GEE):
$$\sum_i D_i^\top V_i^{-1} (Y_i - \mu_i) = 0,$$
where $D_i = \partial \mu_i / \partial \beta$ and the covariance matrix is $V_i = \phi\, A_i^{1/2} R(\rho) A_i^{1/2}$, in which:
$A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the $j$-th diagonal element;
$R(\rho)$ is an $n_i \times n_i$ correlation matrix.
12
Correlated count data - REGEE
► Because the QL estimators have properties similar to those of the ML estimators, the estimates of the regression and nuisance parameters can be influenced by outliers.
► To provide robust estimation of $\beta$, Preisser and Qaqish (1999) introduced a generalization of GEE that includes weights in the estimating equations in order to downweight influential observations.
► They define the resistant generalized estimating equation (REGEE) as:
$$\sum_i D_i^\top V_i^{-1} \left[ W_i (Y_i - \mu_i) - c_i \right] = 0.$$
13
Correlated count data - REGEE
where:
$W_i$ is an $n_i \times n_i$ diagonal weight matrix containing the robustness weights $w_{ij}$;
$c_i = E[W_i (Y_i - \mu_i)]$ is a bias-eliminating constant determined by the marginal distribution of $Y$.
The weights have been chosen as a function of the Pearson residuals $r_{ij} = (y_{ij} - \mu_{ij}) / \sqrt{\phi\, v(\mu_{ij})}$, to ensure robustness with respect to outlying points in the y-space.
14
Correlated count data - REGEE
► Robust estimators are also needed for the nuisance parameters $\phi$ and $\rho$, to avoid consequences on the regression parameter estimates.
► Moment estimators of $\phi$ and $\rho$ are used, where an autoregressive AR(1) working correlation matrix has been specified, i.e. $\mathrm{corr}(y_{ij}, y_{ik}) = \rho^{|j-k|}$.
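To make the AR(1) working correlation concrete, the sketch below builds the matrix with entries $\rho^{|j-k|}$ and the corresponding working covariance for one subject. It assumes the standard GEE form, scale times $A^{1/2} R A^{1/2}$, with the Poisson variance function $v(\mu) = \mu$; the function names are hypothetical.

```python
import numpy as np

def ar1_correlation(n, rho):
    """n x n AR(1) correlation matrix with entries rho ** |j - k|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def working_covariance(mu, rho, phi=1.0):
    """phi * A^(1/2) R(rho) A^(1/2), with A = diag(mu) (Poisson variance function)."""
    sd = np.sqrt(np.asarray(mu, dtype=float))
    return phi * sd[:, None] * ar1_correlation(len(sd), rho) * sd[None, :]

# Five repeated measurements on one cell with constant mean 10 and rho = 0.8.
R = ar1_correlation(5, 0.8)
V = working_covariance(np.full(5, 10.0), 0.8)
```

With this structure the correlation between two measurements decays geometrically with their distance in the sequence, so only one parameter has to be estimated regardless of the number of repeated measurements.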
15
Simulation of Correlated count data
Outlier identification procedures based on parameters estimated with the three different estimation methods have been compared in a simulation study.
In the study, 4x4x5 tables are simulated, where $\log \mu_{ij} = x_{ij}^\top \beta$, $\beta$ is the parameter vector, and $x_{ij}^\top$ is a row of the design matrix $X$ obtained by dummy coding.
16
Simulation of Correlated count data
► Correlated Poisson variables are simulated using the overlapping sum (OS) algorithm (Park and Shin, 1998).
► If $Y$ is a random vector of length $n$ with mean $\mu$ and covariance matrix $\Sigma$, in the OS method $Y$ is decomposed as $Y = TZ$, where $T$ is an $n \times l$ matrix of 0's and 1's and $Z$ is an $l$-vector of independent Poisson variables. The dimension $l$ depends on the structure of the covariance matrix, and the matrix $T$ is defined so that $Y$ has the proper mean vector and covariance matrix.
► Once $T$ is defined, the means $\lambda$ of $Z$ can be obtained by solving the equation $T\lambda = \mu$.
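The idea of building correlated counts from overlapping sums of independent Poisson variables can be illustrated on the smallest case. The sketch below is not the Park and Shin algorithm for choosing the 0/1 matrix; it assumes a hand-picked matrix for two counts that share one common component, and the names `T`, `lam` and `simulate_overlapping_sums` are chosen for the example.

```python
import numpy as np

def simulate_overlapping_sums(T, lam, n_draws, rng):
    """Draw n_draws vectors Y = T Z, with Z_k independent Poisson(lam_k)."""
    T = np.asarray(T)
    Z = rng.poisson(lam, size=(n_draws, len(lam)))
    return Z @ T.T

# Two correlated counts sharing one common component:
# Y1 = Z1 + Z3 and Y2 = Z2 + Z3.
T = np.array([[1, 0, 1],
              [0, 1, 1]])
mu = np.array([6.0, 4.0])  # target means
cov12 = 2.0                # target covariance between Y1 and Y2

# The shared component carries the covariance; the rest completes the means.
lam = np.array([mu[0] - cov12, mu[1] - cov12, cov12])

implied_mean = T @ lam                 # equals mu
implied_cov = T @ np.diag(lam) @ T.T   # Var(Y) = T diag(lam) T'

rng = np.random.default_rng(0)
Y = simulate_overlapping_sums(T, lam, 200_000, rng)
```

Since every component mean must be non-negative, this construction can only produce non-negative covariances, and a covariance cannot exceed the smaller of the two means involved.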
17
Simulation of Correlated count data
Outliers in the simulated tables are produced by replacing the selected cell $Y_{ijt}$ by $\max(\mathrm{inl}(\alpha, \mu_{ij})) + 1$ or $\min(\mathrm{inl}(\alpha, \mu_{ij})) - 1$, where $\alpha$ has been chosen as $(10^{-2}, 10^{-4}, 10^{-8})$.
18
Results
19
Results
[Figure panels: ρ = 0.1, % outliers = 0.05; ρ = 0.1, % outliers = 0.01; ρ = 0.8, % outliers = 0.05; ρ = 0.8, % outliers = 0.01]
20
Outliers Detection in ASIA-UL
The outlier identification procedures have been applied in the control process of the Statistical Register of the Local Units (ASIA-UL).
21
Outliers Detection in ASIA-UL