Artificial data in social science

Artificial data in social science Richard Skeggs rskeggs@essex.ac.uk @rickskeggs

What is Artificial Data? First proposed in 1993. Often referred to as synthetic data. In effect, made-up data that mimics real data.

Why Use Artificial Data? Reduces privacy concerns. Bolsters small datasets for models. Used for developing and training a model. Can be cheaper to obtain.

Creating Artificial Data Python synthetic data tools: Synthetic Data Vault; Python & NumPy. R synthetic data tools: synthpop.

Synthetic Data Vault Patki, N., R. Wedge, and K. Veeramachaneni. “The Synthetic Data Vault.” In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 399–410, 2016. https://doi.org/10.1109/DSAA.2016.49.

Synthetic Data Vault 34 data scientists were given datasets. No statistically significant difference was found between results on synthetic and control data.

Synthetic Data in Python Trumania: creates test data based on a scenario. Synthetic-data-generator: generates random data that follows a uniform distribution.
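As a rough illustration of the second idea on this slide (random data following a uniform distribution), here is a minimal sketch using only the Python standard library. It is not the API of Trumania or Synthetic-data-generator; the function name `uniform_rows` and the `schema` columns are hypothetical.

```python
import random

def uniform_rows(n_rows, columns, seed=None):
    """Return n_rows dicts; each value is drawn uniformly from its column's (low, high) range."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in columns.items()}
        for _ in range(n_rows)
    ]

# Hypothetical schema: column name -> (minimum, maximum)
schema = {"age": (18.0, 90.0), "income": (0.0, 100000.0)}

rows = uniform_rows(5, schema, seed=42)
for row in rows:
    print(row)
```

Data made this way respects the ranges but carries no relationship between columns, which is the main gap that model-based tools such as synthpop try to close.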

Synthetic Data in R SynthPop – R package. Uses a logistic model to predict variables based on real data. Models are built recursively, adding one variable at a time.
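The idea of building variables up one at a time can be sketched as follows. This is a toy, standard-library illustration and not synthpop's code: the first variable is sampled from the observed values, then a logistic model fitted to the real data generates the second variable. The data, the function names and the gradient-descent settings are all assumptions made for the example.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, steps=2000, lr=0.1):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by plain gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def synthesise(xs, ys, k, seed=None):
    """Synthesise k (x, y) pairs: x first, then y conditional on x."""
    rng = random.Random(seed)
    a, b = fit_logistic(xs, ys)
    # Step 1: the first variable is sampled from the observed values with replacement.
    syn_x = [rng.choice(xs) for _ in range(k)]
    # Step 2: the second variable is drawn from the model fitted to the real data.
    syn_y = [1 if rng.random() < sigmoid(a + b * x) else 0 for x in syn_x]
    return syn_x, syn_y, (a, b)

# Hypothetical "real" data: a numeric predictor and a binary outcome.
real_x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.2, 0.8, 0.3]
real_y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1]

syn_x, syn_y, coef = synthesise(real_x, real_y, k=20, seed=1)
print(coef)
print(list(zip(syn_x, syn_y)))
```

With more columns, the same pattern would repeat: each new variable gets a model that conditions on the variables already synthesised.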

SynthPop Variables are synthesised using CART (classification and regression trees). The model is created by binary recursive partitioning. Missing values are modelled. A model is applied to each column.
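Binary recursive partitioning can be sketched in a few lines. The example below is an illustration under simplifying assumptions (one numeric predictor, a variance-based split rule, a hypothetical `min_leaf` setting), not synthpop's implementation. A synthetic value is produced by finding the leaf for a given x and sampling one of the observed y values stored there.

```python
import random

def sse(rows):
    """Sum of squared deviations of y around its mean."""
    ys = [y for _, y in rows]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(rows, min_leaf=3):
    """Recursively split (x, y) rows on x wherever the split most reduces the spread of y."""
    rows = sorted(rows)
    best = None
    if len(rows) >= 2 * min_leaf:
        for i in range(min_leaf, len(rows) - min_leaf + 1):
            if rows[i - 1][0] == rows[i][0]:
                continue  # cannot split between equal x values
            left, right = rows[:i], rows[i:]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                threshold = (rows[i - 1][0] + rows[i][0]) / 2
                best = (cost, threshold, left, right)
    if best is None or best[0] >= sse(rows):
        return {"leaf": [y for _, y in rows]}  # keep the observed values as donors
    _, threshold, left, right = best
    return {
        "threshold": threshold,
        "left": build_tree(left, min_leaf),
        "right": build_tree(right, min_leaf),
    }

def draw(tree, x, rng):
    """Walk down to the leaf for x and sample one observed y from it."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return rng.choice(tree["leaf"])

# Hypothetical "real" data with three clear groups.
real = [(1, 10), (2, 12), (3, 11), (4, 30), (5, 32), (6, 31), (7, 50), (8, 52), (9, 51)]

rng = random.Random(0)
tree = build_tree(real)
syn_x = [rng.choice([x for x, _ in real]) for _ in range(9)]
syn = [(x, draw(tree, x, rng)) for x in syn_x]
print(syn)
```

Because the leaves hold real observed values, every synthetic y is a value that actually occurred for similar x, which keeps the synthetic data plausible without assuming a parametric form.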

SynthPop
syn(data, visit.sequence, k)
Usage:
data – a data frame or matrix containing the original data.
visit.sequence – column indices specifying the order of synthesis.
k – size of the synthetic data (optional).
method – method used for synthesising; CART by default (optional).
Example:
    syn.data <- syn(orig.data[c(7:28),], visit.sequence=c(7:28), k=39)

SynthPop Administrative Data Research Centre – Scotland | Chris Dibben | 20 June 2014

SynthPop The diagram shows the estimates and 95% confidence intervals for Z statistics from a logistic regression of intention to go abroad for work, for observed and synthetic data. It shows how closely the synthetic data matches the observed results.
Source: synthpop: Bespoke Creation of Synthetic Data in R. Beata Nowok, Gillian M. Raab, Chris Dibben. Published 2015.
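The comparison in the diagram rests on placing an estimate and its 95% confidence interval from the observed data next to the same quantities from the synthetic data. The sketch below shows that idea for a much simpler estimate than the logistic regression on the slide, namely a single proportion with a normal-approximation interval. The two datasets are made up for the example and the function names are hypothetical.

```python
import math

def proportion_ci(values, z=1.96):
    """Return (estimate, lower, upper) for a proportion with a normal-approximation 95% CI."""
    n = len(values)
    p = sum(values) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

def intervals_overlap(a, b):
    """True when the two (estimate, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]

# Made-up binary outcomes (1 = intends to work abroad), for illustration only.
observed = [1] * 30 + [0] * 70
synthetic = [1] * 34 + [0] * 66

obs_ci = proportion_ci(observed)
syn_ci = proportion_ci(synthetic)
print("observed :", obs_ci)
print("synthetic:", syn_ci)
print("overlap  :", intervals_overlap(obs_ci, syn_ci))
```

Overlapping intervals are only a quick check; a fuller comparison would look at every coefficient of the fitted model, as the diagram does.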

SynthPop
    # Install synthpop on first use, then load it
    if (!require(synthpop)) {
      install.packages("synthpop")
    }
    library(synthpop)

    # visit.sequence takes column indices; iris has 5 columns
    syn.data <- syn(iris, visit.sequence = 1:5, k = 150, proper = TRUE)

    # syn.data$syn already holds the synthetic rows as a data frame
    synth.data <- syn.data$syn

    View(synth.data)
    plot(synth.data$Petal.Length, synth.data$Petal.Width)

SynthPop [Figure panels: Original Data; Synthetic Data]

SynthPop
    # visit.sequence takes column indices; iris has 5 columns
    syn.data <- syn(iris, visit.sequence = 1:5, k = 150, proper = TRUE)

SynthPop
    # Parametric synthesis; default.method sets the model used for each variable type
    syn.data <- syn(iris, method = "parametric", visit.sequence = 1:5, k = 150,
                    proper = TRUE,
                    default.method = c("normrank", "logreg", "polyreg", "polr"))

SynthPop
    # Same parametric call with a fixed seed so the synthetic data can be reproduced
    syn.data <- syn(iris, method = "parametric", visit.sequence = 1:5, k = 150,
                    proper = TRUE,
                    default.method = c("normrank", "logreg", "polyreg", "polr"),
                    seed = 6000)
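Fixing a seed makes a synthesis run repeatable, which matters when results have to be checked or shared. The sketch below shows the general principle with Python's standard library rather than synthpop itself; the `resample` helper and the data are hypothetical, and the value 6000 simply echoes the slide.

```python
import random

def resample(values, k, seed=None):
    """Draw k values with replacement; the same seed always gives the same draw."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(k)]

# Hypothetical observed measurements.
observed = [4.9, 5.1, 5.8, 6.3, 6.7, 7.0]

first = resample(observed, 5, seed=6000)
second = resample(observed, 5, seed=6000)

print(first)
print(second)
print(first == second)  # True: identical seeds give identical synthetic draws
```

Leaving the seed unset gives a fresh draw on every run, which is fine for exploration but makes a reported result hard to reproduce.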

Any Questions? Richard Skeggs rskeggs@essex.ac.uk 5 April 2018