Published by Calvin Wilcox. Modified over 9 years ago.
1
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck
Jennifer Huckett
Iowa State University
June 20, 2007
2
Outline
Motivation
Disclosure Limitation Methods
Risk Assessment
Simulation Study
Results & Conclusions
3
Motivation
Iowa Department of Revenue (IDR)
–Collects and maintains individual tax return data
Legislative Services Agency (LSA)
–Examines impact of tax law changes on liability
Current system
–LSA submits requests to IDR
–IDR computes liability, reports to LSA
–Occurs several times each year
–Inefficient for both IDR and LSA
4
Solutions
–Secure/remote access server
    Data are not released
    Some analyses suppressed
–Statistical disclosure limitation (SDL)
    Tabular
    Microdata
–enable IDR to provide LSA with data set
–allow LSA to compute liability with ease and accuracy
–MUST ENSURE CONFIDENTIALITY of RECORDS!
5
Establishment Connection
Very skewed distributions, unusual associations among distributions
Groups of variables are related to one another in unusual ways
Similar to business tax data or business expenditure/revenue data
Confidentiality is critical
6
Traditional Approaches
Recoding (e.g. aggregation)
Noise addition
Data swapping
Data suppression
Imputation
Combinations of these
7
Our Approach
Synthetic microdata simulation
–Retain key demographic variables
–Simulate values for some variables
    Quantile regression conditional on key variables
    Compute fitted values at selected quantiles
–Impute values for remaining variables
    Hot deck + rank swap
    Hot deck based on simulated income variables
8
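Quantile Regression
$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i} \rho_\tau(y_i - x_i^\top \beta)$
–$\rho_\tau$ = “tilted absolute value function” for quantile $\tau$
–$x_i^\top \beta$ = linear function of predictors ($x_i$)
Performed in R
–quantreg package
–rq function
Quantile Regression, Koenker 2005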
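The “tilted absolute value function” can be made concrete with a small sketch. The deck's own computation was done in R with `quantreg::rq`; the Python below is only an illustration, not that workflow. It covers just the intercept-only case (no predictors), where minimizing the summed loss over a constant recovers a sample quantile. The function names and the toy data are assumptions made for the example.

```python
def check_loss(u, tau):
    """Tilted absolute value: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(y, q, tau):
    """Summed check loss of the constant fit q over the sample y."""
    return sum(check_loss(yi - q, tau) for yi in y)


def best_constant(y, tau):
    """Intercept-only quantile fit.

    The summed check loss is piecewise linear in q with kinks at the data
    points, so a minimizer can always be found among the observed values.
    """
    return min(sorted(set(y)), key=lambda q: total_loss(y, q, tau))


# Hypothetical, heavily skewed toy sample.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(y, 0.5)
upper_fit = best_constant(y, 0.9)
```

With predictors, the same loss is minimized over the coefficient vector rather than a single constant, which is a linear programming problem that dedicated software solves.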
9
Simulate via Quantile Regression
Estimate $\hat{\beta}(\tau)$ for quantiles $\tau$ from the selected set
For each record on variable y
–Randomly select $u$ ~ Uniform(0,1)
–Compute fitted $\hat{y}$ given x at the quantiles $\tau$ just above and below $u$
–Interpolate to obtain $\tilde{y}$ = simulated value
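The draw-and-interpolate step can be sketched as follows, assuming the fitted values for one record are already available on a grid of quantiles. The grid, the fitted values, and the function names are hypothetical. Linear interpolation and clamping to the end points when the uniform draw falls outside the grid are assumptions, since the slide does not say how the tails are handled.

```python
import bisect
import random


def interpolate_quantile(taus, fitted, u):
    """Linearly interpolate the fitted quantile curve at probability u.

    taus   -- ascending quantile levels at which the model was fitted
    fitted -- fitted values for one record at those levels
    """
    if u <= taus[0]:          # assumption: clamp below the lowest quantile
        return fitted[0]
    if u >= taus[-1]:         # assumption: clamp above the highest quantile
        return fitted[-1]
    hi = bisect.bisect_right(taus, u)   # first grid point strictly above u
    lo = hi - 1                         # last grid point at or below u
    weight = (u - taus[lo]) / (taus[hi] - taus[lo])
    return fitted[lo] + weight * (fitted[hi] - fitted[lo])


def simulate_value(taus, fitted, rng):
    """Draw u ~ Uniform(0,1) and return the interpolated simulated value."""
    return interpolate_quantile(taus, fitted, rng.random())


# Hypothetical quantile grid and fitted values for a single record.
taus = [0.1, 0.25, 0.5, 0.75, 0.9]
fitted = [10.0, 20.0, 30.0, 50.0, 90.0]
simulated = simulate_value(taus, fitted, random.Random(2007))
```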
10
IDR Application: Key Demographic Variables
Number of dependents
–0, 1, 2,…
–Categorized into 0; 1; ≥2
County
–1,…,99
–Categorized into 4 population size groups
State filing status
1. single
2. married filing joint
3. married filing separate on combined return
4. married filing separate returns
5. head of household
6. widow(er) with dependent child
–Categorized into 1; 2 and 3; 4, 5, and 6
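A minimal sketch of this recoding, assuming the filing-status grouping reads as {1}, {2, 3}, {4, 5, 6}. The group labels are made up for the example. The population cutoffs for the four county groups are not given on the slide, so they are passed in as a hypothetical argument.

```python
import bisect


def dependents_group(n_dependents):
    """Collapse the dependent count into the categories 0, 1, >=2."""
    if n_dependents < 0:
        raise ValueError("dependent count cannot be negative")
    if n_dependents == 0:
        return "0"
    if n_dependents == 1:
        return "1"
    return ">=2"


# Filing-status codes 1..6 collapsed into three groups (labels are hypothetical).
FILING_STATUS_GROUP = {1: "g1", 2: "g2_3", 3: "g2_3", 4: "g4_5_6", 5: "g4_5_6", 6: "g4_5_6"}


def county_size_group(population, cutoffs):
    """Return a group index 0..3 given three ascending population cutoffs.

    The cutoffs are an assumption; the slide only says four size groups.
    """
    if len(cutoffs) != 3:
        raise ValueError("need exactly three cutoffs for four groups")
    return bisect.bisect_right(cutoffs, population)
```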
11
IDR Application: Quantile Regression for wages
12
IDR Application: Hot Deck and Rank Swap for Federal Tax
Hot Deck
–Mahalanobis distance
–closest 20 records
Rank Swap
–compute sample rank, r
–draw random rank, r*, from discrete Uniform[r-10, r+10]
–impute value from record with rank r*
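The two steps can be sketched separately. The slide does not say how they are chained, so the code below keeps them as independent helpers. Assumptions made for the example: two matching variables (so the 2×2 covariance can be inverted by hand), a uniformly random donor among the k closest, and clipping r* to the valid rank range at the ends of the sample. All names and data are hypothetical.

```python
import random


def cov_inverse_2d(points):
    """Inverse of the 2x2 sample covariance matrix of the points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    return ((syy / det, -sxy / det), (-sxy / det, sxx / det))


def mahalanobis_sq(a, b, cov_inv):
    """Squared Mahalanobis distance between two 2-D points."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (dx * (cov_inv[0][0] * dx + cov_inv[0][1] * dy)
            + dy * (cov_inv[1][0] * dx + cov_inv[1][1] * dy))


def hot_deck_draw(target_x, donors_x, donor_values, k, rng):
    """Pick the value of a random donor among the k closest donors."""
    cov_inv = cov_inverse_2d(donors_x)
    ranked = sorted(range(len(donors_x)),
                    key=lambda i: mahalanobis_sq(target_x, donors_x[i], cov_inv))
    return donor_values[rng.choice(ranked[:k])]


def rank_swap(values, window, rng):
    """Replace each value with the value held at a nearby random rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # order[r] = record at rank r
    rank_of = {idx: r for r, idx in enumerate(order)}
    swapped = []
    for i in range(n):
        r = rank_of[i]
        r_star = rng.randint(r - window, r + window)    # discrete Uniform[r-w, r+w]
        r_star = min(max(r_star, 0), n - 1)             # assumption: clip at the ends
        swapped.append(values[order[r_star]])
    return swapped


rng = random.Random(0)
donors_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
donor_values = [10.0, 20.0, 30.0, 40.0, 50.0]
nearest_value = hot_deck_draw((4.8, 5.1), donors_x, donor_values, k=1, rng=rng)

federal_tax = [float(v) for v in range(50)]   # hypothetical values; rank equals value
swapped = rank_swap(federal_tax, window=10, rng=rng)
```

In the deck's setting k would be 20 and the window 10; k=1 is used in the demo only so that the result is deterministic.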
13
Disclosure Risk Measurement
Using methods detailed in Reiter (2005) and Duncan and Lambert (1986, 1989)
Examine specific records
–Original records
–Released records
–Model intruder behavior to assess disclosure risk
Simulation Study
14
Original and Released Records
15
Intruder Behavior
Target record, t
–Intruder has information on target
–Attempts to match t in released records
Released records j=1,…,r in Z
Probability that record j belongs to target t is Pr(J = j | t, Z)
As
–probability decreases
–disclosure risk decreases
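A deliberately simplified sketch of the matching step: the intruder compares the target's known fields with every released record and spreads probability evenly over the records that agree. This uniform-over-matches rule is an assumption for illustration, not the full formulation in Reiter (2005), which also models how perturbed fields may differ from the truth. Field names and records are hypothetical.

```python
def match_probabilities(target, released, known_fields):
    """Probability that each released record belongs to the target.

    Records agreeing with the target on every known field share the
    probability equally; all other records get zero.
    """
    matches = [all(rec[f] == target[f] for f in known_fields) for rec in released]
    n_matches = sum(matches)
    if n_matches == 0:
        return [0.0] * len(released)
    return [1.0 / n_matches if m else 0.0 for m in matches]


# Hypothetical released file and target.
released = [
    {"age": 34, "marital": "single", "county": 7},
    {"age": 34, "marital": "single", "county": 7},
    {"age": 34, "marital": "single", "county": 7},
    {"age": 51, "marital": "married", "county": 2},
]
target = {"age": 34, "marital": "single", "county": 7}
probs = match_probabilities(target, released, ["age", "marital", "county"])
```

Under this rule a target that matches a single released record is identified with probability 1, and the probability falls as more released records share the target's known values.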
16
Simulation Study
Schemes for SDL influence divisions of A into Ap (available, perturbed) and Ad (available, unperturbed).
17
SDL Schemes in Simulation Study
No SDL
Swap 30% marital status
Swap 30% marital status and minority
Recode age into 5 year intervals
Recode age into 5 year intervals and swap 30% marital status and minority
Simulation via quantile regression and hot deck
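Two of the traditional schemes are easy to sketch. The slide does not say how the 30% swap was carried out, so the version below is one possible reading: pick 30% of the records at random and permute the chosen field among them. The interval labels for the age recode are likewise an assumption, and all names and data are hypothetical.

```python
import random


def swap_fraction(values, fraction, rng):
    """Permute the values of a randomly chosen fraction of the records.

    The overall distribution of the field is preserved; only its
    assignment to records changes.
    """
    n = len(values)
    k = round(fraction * n)
    chosen = rng.sample(range(n), k)    # records taking part in the swap
    shuffled = chosen[:]
    rng.shuffle(shuffled)               # where each chosen value ends up
    out = list(values)
    for src, dst in zip(chosen, shuffled):
        out[dst] = values[src]
    return out


def recode_age(age, width=5):
    """Recode an age into its interval, returned as (low, high) inclusive."""
    low = (age // width) * width
    return (low, low + width - 1)


rng = random.Random(30)
marital = ["single", "married"] * 5     # hypothetical field for 10 records
marital_swapped = swap_fraction(marital, 0.30, rng)
age_interval = recode_age(37)
```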
18
Targets
Intruder has information on target, t, and wants to match with released records
Consider a few targets
–Unique record
–Rare record
–Common record
19
Results from Simulation Study
target | No SDL | Marital swap | Marital and minority swap | Age recode | Swaps and recode | Quantile regression and hot deck
unique | 1      | 1            | 0.1046                    | 1          | 0.0178           | 0.0895
rare   | 0.3333 | 0.1044       | 0.1304                    | 0.0526     | 0.0225           | 0.0016
common | 0.0385 | 0.0320       |                           | 0.0068     | 0.0055           | 0.0008
20
Conclusions & Future Work
Risk behaves as we expect
–increased SDL
–decreased disclosure risk (except for unique!)
Apply SDL techniques to American Community Survey data at US Census Bureau
Compare traditional techniques to quantile regression and hot deck by computing risk
Measure utility of released data
21
Acknowledgements
Iowa Department of Revenue
Iowa’s Legislative Services Agency
National Institute of Statistical Sciences
US Census Bureau Dissertation Fellowship Award
22
References
Duncan, G.T. and Lambert, D. 1986. “Disclosure-Limited Data Dissemination,” Journal of the American Statistical Association, 81, 10-28.
Duncan, G.T. and Lambert, D. 1989. “The Risk of Disclosure for Microdata,” Journal of Business and Economic Statistics, 7, 207-217.
Koenker, R. 2005. “Introduction,” Quantile Regression, Econometric Society Monograph Series, Cambridge University Press.
Reiter, J.P. 2005. “Estimating Risks of Identification Disclosure in Microdata,” Journal of the American Statistical Association, 100, 472, 1103-1113.