Slide 1: For the e-Stat meeting of 27 Sept 2010
Paul Lambert / DAMES Node inputs
Slide 2: 1) Progress updates
DAMES Node services of a hopefully generic/transferable nature:
– GESDE services on occupations, educational qualifications and ethnicity (www.dames.org.uk)
– Data curation tool
– Data fusion tool for merging data and recoding/standardising variables
Slide 3: GESDE: online services for data coordination/organisation
Tools for handling variables in social science data:
– Recoding measures; standardisation / harmonisation; linking; curating
Slide 4: The data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way.
Slide 5: Fusion tool (invoking R) – scenarios

Component | Description | Fusion Tool Requirements
Input 1 = Data 1 | e.g. working data | Declaration of file location (online or in iRODS system)
Input 2 = Data 2 | e.g. external information | Ditto (or other expert input)
Linkage mechanism | 1. Deterministic; 2. Probabilistic; 3. Recode/Transform | Declaration of which of the three types of link is to be used, and the file formats involved
Argument specification | | Listing of required arguments to the mechanism (e.g. input and output files; variable names, linked to standard classifications)
R script invocation | e.g. Condor submission | API or other device collects the above inputs and applies them to an R template
Output = Data 3 | File 1 + [a bit of File 2] | New file is generated, to be supplied to user
Slide 6: Currently: expected inputs to e-Stat, Autumn 2010
First applications in integrating DAMES data preparation tools with e-Stat model-building systems:
1) {Coordination/planning on WP1.6 workflow tools for pre-analysis} (?De Roure, McDonald, Michaelides, Lambert, Goldstein, Southampton RA?)
2) Template construction with applications using variable recodes and other pre-analysis adjustments from DAMES systems, with a view to generating generic template facilities
3) Preparation of some typical example survey data/models (e.g. 10k+ cases, 50+ variables) and their implementation in e-Stat, e.g. cross-national/longitudinal comparability examples
4) Possible e-Stat inputs to DAMES workshops (Nov 24-5 / Jan 25-6)
Slide 7: 7a) Links with DAMES
– DAMES Node core funding period is Feb 2008 - Jan 2011
– Further discussion of integrating pre-analysis services from DAMES into e-Stat facilities and templates
– Appetite for other application-oriented contributions?
  – Alternative measures for the 'changing circumstances during childhood' application?
  – ?Preparation of illustrative application(s) with complex survey data; would need a data spec. and broad analytical plan
Slide 8: Pre-analysis options associated with DAMES
Things that could be facilitated by the fusion tool (R scripts) in combination with the curation tool and, if relevant, specialist data (e.g. from GESDE):
– Alternative measures/derived data [via deterministic matches/variable transformation routines]
  – Using GESDE: occupations, educational quals, ethnicity (?health-oriented measures using the Obesity e-Lab dropbox facility?)
  – Generic routines: arithmetic standardisation tools
– Replicability of measurement construction (e.g. syntax log of tasks)
– Other possible data/review possibilities:
  – [new but easy] Routine for summarizing data (see wish list)
  – [new, probably not easy] Weighting data options; routine for identifying values with high leverage / high residuals (?provided elsewhere); a base-R sketch follows below
  – Probabilistic matching routines
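Such a leverage/residual routine might, for a simple linear model, rest on base R's standard diagnostics. A minimal sketch (the model, data and variable names here are hypothetical placeholders, not DAMES code):

# flag influential cases via leverage and standardised residuals
fit <- lm(y ~ x1 + x2, data = dat)
lev <- hatvalues(fit)                          # leverage of each case
flagged_lev <- which(lev > 2 * mean(lev))      # common rule of thumb for high leverage
flagged_res <- which(abs(rstandard(fit)) > 2)  # unusually large standardised residuals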
Slide 9: Model for data locations?
– Curation tool can be used to attach variable names and metadata to facilitate variable processing
– We then have a model of storing the data in a secure remote location (iRODS server), from where jobs can be run on it (e.g. in R)
  – Is this a suitable model for e-Stat?
  – Is there another data location model?
  – Or is it better to supply scripts to run on files in an unspecified location? (A sketch follows.)
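For the last option, the same R template could simply be pointed at whatever path or URL is supplied at run time, since base R readers accept both. A sketch only (the URL is a made-up placeholder):

src <- "https://example.org/data/wave1.csv"  # hypothetical; could equally be a local or iRODS-mounted path
dat <- read.csv(src)                         # read.csv handles URLs as well as file paths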
Slide 10: Fusion tool (invoking R) – scenarios
[The table from Slide 5 is shown again; the 'Argument specification' row now adds '(example overleaf)'.]
Slide 11: Mechanism 1: Deterministic link
Here information is joined on the basis of exact matching of values.

Example Condor job:

universe = vanilla
executable = /usr/bin/R
# --args passes: input file A, input file B, the output file, then the linking
# variable names for file A followed by those for file B
arguments = --slave --vanilla --file=bhps_test.R --args /home/pl3/condor/condor_5/wave1.dta /home/pl3/condor/condor_5/wave17.dta /home/pl3/condor/condor_5/bhps_combined.dat pid wave file pid wave file
notification = Never
log = test1.log
output = test1.out
error = test1.err
queue
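Saved as a submit description file, the job would go to the pool via HTCondor's standard command (the file name is a hypothetical example):

condor_submit bhps_test.sub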
Slide 12:
– The input files here are Stata-format data; the output is plain-text-format data
– There are 3 linking variables, which happen to have the same names on both files, i.e. pid wave file on file 1, and also pid wave file on file 2
  – Different names would be fine, but the same number of linking variables on both files is essential
  – Different total numbers of linking variables are fine (most often there is only 1)
– Different R templates can be used to read data in different formats (e.g. Stata, SPSS, plain text), though exported data can only readily be supplied in plain text (see the reader sketch below)
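As a sketch of what those alternative templates would swap in, all three readers are available from the foreign package or base R (file names here are hypothetical):

library(foreign)
d1 <- read.dta("file1.dta")                         # Stata
d2 <- read.spss("file2.sav", to.data.frame = TRUE)  # SPSS
d3 <- read.table("file3.dat", header = TRUE)        # plain text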
Slide 13: The R template being run in the above application is:

args <- as.factor(commandArgs(trailingOnly = TRUE))
options(useFancyQuotes = TRUE)
# the first three arguments: input file A, input file B, output file
fileAinp <- as.character(args[1])
fileBinp <- as.character(args[2])
fileCout <- as.character(args[3])
##
library(foreign)
fileA <- read.dta(fileAinp, convert.factors = F)
fileB <- read.dta(fileBinp, convert.factors = F)
# the remaining arguments are the linking variables: the first half name the
# variables on file A, the second half the corresponding variables on file B
nargs <- sum(!is.na(args))
allvars <- args[4:nargs]
nargs2 <- sum(!is.na(allvars))
first_vars <- as.character(allvars[1:(nargs2/2)])
second_vars <- as.character(allvars[((nargs2/2)+1):nargs2])
######
# deterministic (exact-value) link, keeping all cases from file A
combined2 <- merge(fileA, fileB, by.x = c(first_vars), by.y = c(second_vars),
                   all.x = T, all.y = F, sort = F, suffixes = c(".x", ".y"))
######
write.table(combined2, file = fileCout, col.names = TRUE, sep = ",")
###
Slide 14: Mechanism 2: Probabilistic link
This is when data from different files are linked on criteria which are not just an exact match of values, but include some probabilistic algorithm.
– E.g. for each person in data 1 with the same characteristics, select a random person from the pool of people in data 2 who are aged 35-40, male, education = high, marital status = married, and link their voting preference data to the person in data 1
– Other implementation requirements are equivalent to deterministic matching, so long as the criteria for the matching algorithm are determined
– Status: we don't yet have a pool of probabilistic matching algorithms; we've one so far, which is random matching as in the above example (sketched below)
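A minimal sketch of that random-matching idea in R (not the DAMES implementation; the matching variables agegrp, sex, educ, marstat and the donor variable vote are hypothetical names following the example above):

set.seed(42)  # make the random draws reproducible
random_match <- function(data1, data2, by, donor) {
  # build a matching-cell key from the shared characteristics in each file
  key1 <- as.character(interaction(data1[, by]))
  key2 <- as.character(interaction(data2[, by]))
  # for each case in data 1, draw one random donor from the matching pool in data 2
  data1[[donor]] <- sapply(key1, function(k) {
    pool <- data2[[donor]][key2 == k]
    if (length(pool) == 0) return(NA)
    pool[sample.int(length(pool), 1)]  # safe even when the pool has one member
  })
  data1
}
# usage:
# linked <- random_match(data1, data2,
#                        by = c("agegrp", "sex", "educ", "marstat"), donor = "vote")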
Slide 15: Mechanism 3: Recoding/Transforming
Here the scenario is the application of an externally provided data recode, or other externally instructed arithmetic operation, onto a variable within data 1.
– E.g. take the educational qualifications measure which is coded 1 to 20 in data 1; recode 1 thru 5 to the value 1, 6 thru 10 to the value 2, and all others to the value 3 (this is statistically equivalent to a deterministic match, but some recode inputs may not list every possible value)
– E.g. take the measure of income and calculate its mean-standardised values within subgroups defined by regions (e.g. minus regional mean, divided by regional standard deviation); a sketch follows below
– Status/requirement: we need to develop a suitable mechanism to take recode-style information/instructions from relevant external sources, and convert it into a suitable format for applying either a recode or merge routine in R
– We'd like to support:
  – Recode information supplied via SPSS and Stata syntax specifications; data file matrices; and, potentially, manual specifications
  – Other transformation procedures supplied in advance from a small range of possibilities (e.g. mean standardisation; log transformation; cropping of extreme values) plus a small set of related arguments (e.g. category variables)
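The second example could reduce to a short base R operation inside the template. A sketch, where income and region are hypothetical variable names:

# subtract the regional mean, then divide by the regional standard deviation
data1$inc_std <- (data1$income - ave(data1$income, data1$region)) /
                 ave(data1$income, data1$region, FUN = sd)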
Slide 16: Recode examples
Stata syntax: recode var1 1/5=1 6/10=2 *=3, generate(var2)
SPSS syntax: recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) /into=var2.
Data matrix format: [shown as an image on the original slide]
Manual entry interface (SPSS example): [shown as an image on the original slide]
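For comparison, the same recode as it might be rendered in R (a sketch only; the fusion tool's actual translation mechanism is the development item on the previous slide):

# with the car package, whose recode string mirrors the SPSS/Stata forms
library(car)
data1$var2 <- recode(data1$var1, "1:5 = 1; 6:10 = 2; else = 3")
# or in base R, with no extra packages
data1$var2 <- ifelse(data1$var1 %in% 1:5, 1,
                     ifelse(data1$var1 %in% 6:10, 2, 3))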
Slide 17: => Linking data management services into the e-Stat template
Add data review and data construction elements, plus possible additional requests for modelling options:
– Data review: single script with minor variations on data
– Data construction: as above, these involve variable operations and linkages with other files/resources
  – Derive measures on occupations, educational qualifications or ethnicity given information on the character of existing data; collected via the curation tool or, more realistically, from a short range of pre-supplied alternatives?
  – Distributional transformations including standardisation; numeric transformation; review variable distribution
– Model extensions: weight cases options; leverage review (a weights sketch follows below)
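In its simplest form, the 'weight cases' option might just pass a weight variable through to the fitting call. A sketch (wt, dat and the model are hypothetical; proper survey-design weighting would instead use the survey package's svydesign() and svyglm()):

fit_w <- lm(y ~ x1 + x2, data = dat, weights = wt)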
Slide 18: 8d) Wish lists/Suggestions
– Include tools for describing/summarizing data: outputs from generic summarize commands in R linked to all templates
– Tool for reviewing model results / leverage, feeding back into model re-specification
– Tools for applying survey weight variables to analysis(?)
– User notes for models constructed ('What was that?')
  – Of benefit to novice and advanced practitioners
  – Potentially a part of the e-notebook, but could be a linked online guide (static)
  – E-Stat commands to provide documentation for replication
  – Terminologies used for the model / other user notes
  – Software equivalents or near equivalents (including estimator specs)
  – Algebraic expression and model abstract
– Tools for storing/compiling multiple model results (mentioned previously, cf. est table in Stata; an R analogue is sketched below)
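A minimal R analogue of the Stata est store / est table pattern (model formulas and data names are hypothetical):

# keep fitted models in a named list, then compare them side by side
models <- list(m1 = lm(y ~ x1, data = dat),
               m2 = lm(y ~ x1 + x2, data = dat))
t(sapply(models, function(m) c(AIC = AIC(m), r2 = summary(m)$r.squared)))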
Slide 19: Possible components of model description user notes

1) E-Stat model syntax:

model {
  for (i in 1:length(y36)) {
    y36[i] ~ dnorm(mu[i], tau)
    mu[i] <- cons[i] * beta0 + y8[i] * beta1
  }
  # Priors
  beta0 ~ dflat()
  beta1 ~ dflat()
  tau ~ dgamma(0.001000, 0.001000)
}

2) E-Stat model and name: Template1Lev = Linear regression using MCMC

3) Model abstract, e.g. something like: "This model is suitable for a single outcome measure with a continuous distribution. It is comparable to the widely used OLS regression model, and usually leads to identical results, but by using the MCMC estimation method it can lead to different parameter estimates in some circumstances. The model presumes no structured relationship between different cases in the data. See … for further description."

4) Other common names for this model: Bayesian regression; etc.

5) Specification of the model in other popular packages:
   BUGS syntax: [as E-Stat syntax]
   MLwiN syntax: [input here]
   R: [input here]
   Stata: MCMC estimation routines not available
   SPSS: …
   Etc.

6) Algebraic representation: [Y = bX + e, etc.; see the sketch below]
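For item 6, the algebraic representation implied by the model syntax above might read as follows (a sketch in standard notation, not E-Stat output):

\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + e_i, & e_i &\sim N(0, \tau^{-1}) \\
\beta_0, \beta_1 &\propto 1 \ \text{(flat priors)}, & \tau &\sim \mathrm{Gamma}(0.001, 0.001)
\end{aligned}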
Slide 20: Est store demo here