Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 5-7 September 2015, Helsinki Beata Nowok Administrative Data Research Centre – Scotland.

Presentation transcript:

Utility of synthetic microdata generated using tree-based methods
Beata Nowok, Administrative Data Research Centre – Scotland
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 5-7 September 2015, Helsinki

DATA

RESEARCH TOPIC

Generating synthetic data: tree-based methods, Y_j ~ (Y_0, Y_1, ..., Y_{j-1})
Utility evaluation: synthetic vs. observed

Tree-based methods
- Classification and regression trees (CART)
- Bagging
- Random forest

CART
- Build a tree (binary splits): Y_j ~ (Y_0, ..., Y_{j-1})
- Generate Y_j by:
  - Running Y_0, ..., Y_{j-1} down the tree
  - Sampling Y_j from the leaves

[Tree diagram: the response is life satisfaction; the splits are health status = fair / poor, age < 50, and sex = male. Example: a healthy male aged 56 is run down the tree to a leaf, which supplies the synthetic value of life satisfaction.]
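The two CART steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `build_tree` and `synthesise`, the variance-based split criterion, the `min_leaf` setting and the toy data are all assumptions made for the example. Predictors are coded numerically here (age, sex, fair/poor health), and the rows run down the tree stand in for the already synthesised Y_0, ..., Y_{j-1}.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean, used as node impurity."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(X, y, min_leaf=5):
    """Grow a regression tree with binary splits; each leaf keeps its observed y values."""
    best = None  # (impurity after split, column, threshold)
    for col in range(len(X[0])):
        for thr in sorted({row[col] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[col] < thr]
            right = [yi for row, yi in zip(X, y) if row[col] >= thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, col, thr)
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}  # no useful split: store the donors
    _, col, thr = best
    left_idx = [i for i, row in enumerate(X) if row[col] < thr]
    right_idx = [i for i, row in enumerate(X) if row[col] >= thr]
    return {
        "col": col,
        "thr": thr,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx], min_leaf),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx], min_leaf),
    }

def synthesise(tree, row, rng):
    """Run one predictor row down the tree and sample a value from its leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["col"]] < tree["thr"] else tree["right"]
    return rng.choice(tree["leaf"])

# Toy data (hypothetical): columns are age, male (0/1), fair_or_poor_health (0/1).
rng = random.Random(0)
X_obs = [[rng.randint(16, 90), rng.randint(0, 1), rng.randint(0, 1)] for _ in range(200)]
y_obs = [round(8 - 2 * poor - 0.01 * age + rng.gauss(0, 1), 1) for age, male, poor in X_obs]

tree = build_tree(X_obs, y_obs, min_leaf=10)
y_syn = [synthesise(tree, row, rng) for row in X_obs]
```

Because values are sampled from the leaves rather than predicted as leaf means, every synthetic value is one that actually occurs in the observed data.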

CART splitting rules
- Association with the response variable
- Gini index and deviance
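For a categorical response, the two impurity measures named on this slide can be computed per node as follows. The formulas are the standard textbook definitions (Gini index 1 - sum of p_k^2; multinomial deviance -2 * sum of n_k * log p_k); the slides do not spell them out, so treat the exact forms and the function names as assumptions.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Share of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_index(labels):
    """Gini index of a node: 1 - sum_k p_k^2 (0 for a pure node)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def deviance(labels):
    """Multinomial deviance of a node: -2 * sum_k n_k * log(p_k)."""
    n = len(labels)
    return -2.0 * sum(n * p * math.log(p) for p in class_proportions(labels))

mixed = ["high", "high", "low", "low"]
pure = ["high", "high", "high", "high"]
print(gini_index(mixed), deviance(mixed))
print(gini_index(pure), deviance(pure))
```

A split is preferred when it reduces the chosen measure the most when moving from the parent node to its two children.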

Random forest and bagging
- Collection of trees:
  - Bootstrap samples from the original data set
  - Sample of predictors as split candidates
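The two sources of randomness listed above can be sketched as two small helpers. The names `bootstrap_indices` and `split_candidates` are hypothetical, and the choice of the square root of the number of predictors as the subset size is a common default assumed here, not something stated in the slides.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Row indices for one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def split_candidates(n_predictors, rng, mtry=None):
    """Predictors considered at one split.

    mtry=None keeps all predictors (bagging); a smaller mtry draws a
    random subset without replacement (random forest).
    """
    if mtry is None:
        mtry = n_predictors
    return sorted(rng.sample(range(n_predictors), mtry))

rng = random.Random(42)
n_rows, n_predictors = 8, 9
bag_rows = bootstrap_indices(n_rows, rng)
bag_candidates = split_candidates(n_predictors, rng)  # bagging: all predictors
rf_candidates = split_candidates(n_predictors, rng, mtry=math.isqrt(n_predictors))
```

In a full implementation each of the trees (100 per method in this study) would be grown on its own bootstrap sample, with `split_candidates` called afresh at every split.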

Synthetic data utility
- General measures:
  - consider overall microdata similarity
  - predict effectiveness without fitting all models
- Specific measures:
  - consider analysis-specific similarity
  - high specific utility is desirable for researchers in preliminary analysis

General utility measures
- Probability of being in the synthetic data: Group ~ (Y_0, Y_1, ..., Y_j), where Group is observed or synthetic
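One way to turn the probability of being in the synthetic data into a single number is the propensity score mean squared error (pMSE). The slide does not name the summary statistic it uses, so the pMSE and the cell-frequency propensity estimate below are assumptions chosen for illustration. Records are represented as tuples of categorical values.

```python
from collections import Counter

def propensity_scores(observed, synthetic):
    """Estimate P(record is synthetic) as the synthetic share within each cell.

    A cell is one distinct combination of values; both data sets are stacked
    (observed first, then synthetic) before the shares are computed.
    """
    syn_counts = Counter(synthetic)
    all_counts = Counter(observed) + syn_counts
    stacked = list(observed) + list(synthetic)
    return [syn_counts[rec] / all_counts[rec] for rec in stacked]

def pmse(observed, synthetic):
    """Mean squared distance between the scores and the overall synthetic share."""
    scores = propensity_scores(observed, synthetic)
    share = len(synthetic) / (len(observed) + len(synthetic))
    return sum((p - share) ** 2 for p in scores) / len(scores)

observed = [("male", "good")] * 3 + [("female", "poor")] * 2
same = list(observed)                       # indistinguishable from observed
different = [("male", "poor")] * 5          # no cell in common with observed

print(pmse(observed, same))       # 0.0: group membership cannot be predicted
print(pmse(observed, different))  # 0.25: group membership is fully predictable
```

Values near zero indicate that observed and synthetic records cannot be told apart, which corresponds to high general utility.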

Analysis-specific utility measures
- Comparison of models fit with both the observed data and the synthetic data
- Overlap of coefficient confidence intervals
- Standardized difference of coefficient estimates
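Both measures can be computed from a coefficient's estimate, standard error and confidence interval in the two fits. The definitions below are assumptions, since the slides give no formulas: the overlap is taken as the average share of each interval covered by their intersection, and the standardized difference is scaled by the standard error from the observed-data fit.

```python
def ci_overlap(obs_ci, syn_ci):
    """Average proportion of each interval covered by their intersection.

    Returns 1.0 for identical intervals and a negative value when the
    intervals do not intersect.
    """
    lo_obs, up_obs = obs_ci
    lo_syn, up_syn = syn_ci
    intersection = min(up_obs, up_syn) - max(lo_obs, lo_syn)
    return 0.5 * (intersection / (up_obs - lo_obs) + intersection / (up_syn - lo_syn))

def standardized_difference(beta_obs, beta_syn, se_obs):
    """Absolute difference of estimates in units of the observed-data standard error."""
    return abs(beta_syn - beta_obs) / se_obs

print(ci_overlap((0.0, 1.0), (0.0, 1.0)))        # 1.0
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))        # 0.5
print(standardized_difference(1.0, 1.5, 0.25))   # 2.0
```

The hypothetical numbers are for illustration only; in practice the inputs come from fitting the same model to the observed data and to each synthetic data set.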

Synthetic data
- SHeS 2008 & 2010, 46 variables
- Replacing all values for all variables
- 10 synthetic data sets for each method:
  - CART 1 (correlation)
  - CART 2 (Gini index and deviance)
  - [RF] Random forest (100 trees)
  - [BAG] Bagging (100 trees)
  - [PARA] Parametric
  - [SAMP] Sampling

Estimated regression models
- Linear: mental well-being
  Ywemwbs ~ Sex + age + age2 + sptinPAg + hrstot10 + AllNature + Gym + Other + Longill08 + simd + maritalg
- Logistic: life satisfaction
  Ylsg ~ Sex + ag16g10 + ParentInf + hrsptg10 + eqvinc + maritalg + XOwnRnt08 + econac08 + drkcat3 + cigst3 + porftvg3 + AllNature
- Poisson: time of sport activities
  hrsspt10 ~ Sex + ag16g10 + hrstot10 + URINDSC + XCare + Longill08 + eqvinc + maritalg + Active + ParentInf + econac08

General and analysis-specific utility

Analysis-specific utility

Conclusions
- Useful completely synthetic data can be generated using automated methods
- More research needed to validate synthesising methods in complex settings
- Tree-building procedure matters