For the e-Stat meeting of 6-7 April 2011
Paul Lambert / DAMES Node inputs
1) Updates on DAMES
2) Bringing DAMES inputs to e-Stat
3) Misc. feedback - Stat-JR
4) Outputs / applications
1) Updates on DAMES
– DAMES Node extended period ends 31st July 2011
– Some ongoing funding in e-Stat & NeISS projects until 2012
– Dissemination workshop in Oxford in June 2011
– Most funded posts have ended (1 programmer still funded)
Our main contributions have been:
– GESDE services for specialist data resources and the data services supporting them (recent paper)
– Training events / online materials
– Social care and e-Health application projects
GESDE: online services for data coordination/organisation
Tools for handling variables in social science data
– Recoding measures
– Standardisation / harmonisation
– Linking
– Curating
17/MAR/2010 DIR workshop: Handling Social Science Data
The data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way.
It includes a file storage system allowing users to upload files and access their own and others' files.
2) Bringing DAMES inputs to e-Stat
a) Possible mechanisms for data linking & connecting Stat-JR with GESDE resources (and/or other files)
   – Filestore system, or manual inputs
b) Data-management templates / pre-analysis functionality
c) Workflows / e-book inputs
   – Documentation for replication
Supporting data linkage
– The current framework needs manual linkage (e.g. 2 files on pc)
– Templates could be written to link with fixed file(s), or with named files + fixed qualities (e.g. matching vars with gb91soc90.dta)

** Sample Stata code:
    global soc occ
    global new occ
    * Set aside the working (master) data
    sav temp.dta, replace
    * Open the lookup file (file name missing here)
    use clear
    keep if ukempst==0
    keep soc90 mcamsis
    rename soc90 $soc
    rename mcamsis ${new}_mcamsis
    sort $soc
    sav temp2.dta, replace
    * Return to the master data and attach the scores
    use temp.dta
    sort $soc
    merge $soc using temp2.dta
    * Keep master-only and matched cases; drop lookup-only cases
    keep if _merge==1 | _merge==3
    drop _merge
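To make the linkage logic concrete outside Stata, the following is a minimal Python sketch of the same steps, under stated assumptions: the two "files" are small in-memory lists with invented values, and the function name `link_scores` is hypothetical. It filters the lookup on `ukempst == 0`, matches on the occupation code, and keeps every master record whether or not it matched (the equivalent of `_merge==1 | _merge==3`).

```python
# Hypothetical in-memory stand-ins for the two files; all values are invented.
survey = [
    {"id": 1, "occ": 100},
    {"id": 2, "occ": 200},
    {"id": 3, "occ": 999},  # no match in the lookup
]
lookup = [
    {"soc90": 100, "mcamsis": 70.0, "ukempst": 0},
    {"soc90": 200, "mcamsis": 45.0, "ukempst": 0},
    {"soc90": 200, "mcamsis": 40.0, "ukempst": 1},  # dropped: ukempst != 0
    {"soc90": 300, "mcamsis": 30.0, "ukempst": 0},  # lookup-only, not kept
]


def link_scores(master, using, key="occ", new="occ"):
    """Attach mcamsis scores to master rows, keeping every master row."""
    # Keep only the ukempst == 0 rows of the lookup, indexed by occupation code.
    scores = {row["soc90"]: row["mcamsis"] for row in using if row["ukempst"] == 0}
    linked = []
    for row in master:
        out = dict(row)
        # None marks a master-only case (no matching occupation code).
        out[f"{new}_mcamsis"] = scores.get(row[key])
        linked.append(out)
    return linked


linked = link_scores(survey, lookup)
for row in linked:
    print(row)
```

A template built along these lines would only need the name of the user's occupation variable as input, which is the "named files + fixed qualities" case.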
Stat-JR within the DAMES filestore?
Could we install Stat-JR on the Stirling Unix system and allow it to be invoked, via our portal, on datasets/templates within the portal?
– Would allow users to add own data & link with our data
– (Would need programmers; could give team here access to portal)
– (Uploading through file browser; could potentially also use curation tool)
– (Actually, templates, too, could be placed online/shared in this manner)
b) Data-oriented templates
– Deterministic file matching routines
– E.g. BHPS file matching routine for compiling data across multiple files (cf. PanelWhiz)
– Recodes (manual input or external file input)
– Aggregating/standardising variables
– Templates for weighted models in relevant packages
– {Perhaps: responding to leverage/diagnostics}
I could do many of these via templates which compile and run Stata/R command files.
Is there any value in that (cf. just doing them in Stata/R)?
Is there value in writing code for the e-Stat engine itself?
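As an illustration of a template that compiles a Stata command file, here is a minimal Python sketch. It is not Stat-JR's template interface: the function `compile_do_file`, the file names and the parameter names are all hypothetical, and the sketch only builds the command text (running it would need a local Stata installation, which is not shown).

```python
from string import Template

# Hypothetical command-file skeleton; $name fields are filled in by the template.
DO_FILE = Template(
    'use "$infile", clear\n'
    "recode $var $rules, generate($newvar)\n"
    'save "$outfile", replace\n'
)


def compile_do_file(infile, var, rules, newvar, outfile):
    """Return the text of a Stata command file for one recode task."""
    return DO_FILE.substitute(
        infile=infile, var=var, rules=rules, newvar=newvar, outfile=outfile
    )


text = compile_do_file(
    infile="survey.dta",            # hypothetical file names
    var="var1",
    rules="1/5=1 6/10=2 *=3",
    newvar="var2",
    outfile="survey_recoded.dta",
)
print(text)
```

The compiled text doubles as a syntactical record of what was done, which bears on the "is there any value in that" question.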
E.g.: BHPS panel merge macro (similar to PanelWhiz)
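The macro itself is not reproduced in these notes. As a rough Python sketch of the kind of deterministic matching involved, the example below assumes that each wave's file carries a wave-letter prefix on its variable names (`ajbstat`, `bjbstat`, ...) and a fixed person identifier `pid`. The data values and the function name `stack_waves` are hypothetical.

```python
import string

# Hypothetical miniature wave files: wave letter -> rows. Values are invented.
waves = {
    "a": [{"pid": 1, "ajbstat": 2}, {"pid": 2, "ajbstat": 1}],
    "b": [{"pid": 1, "bjbstat": 2}, {"pid": 3, "bjbstat": 4}],
}


def stack_waves(wave_files, varnames):
    """Build a long-format panel: one row per person per wave."""
    panel = []
    for letter in sorted(wave_files):
        wave_number = string.ascii_lowercase.index(letter) + 1
        for row in wave_files[letter]:
            record = {"pid": row["pid"], "wave": wave_number}
            for name in varnames:
                # Strip the wave prefix so the variable has one name in all waves.
                record[name] = row.get(letter + name)
            panel.append(record)
    return sorted(panel, key=lambda r: (r["pid"], r["wave"]))


panel = stack_waves(waves, ["jbstat"])
for record in panel:
    print(record)
```

The user supplies only the list of unprefixed variable names; the routine works out the per-wave names itself.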
E.g. recode examples (shown before)
Stata syntax:
    recode var1 1/5=1 6/10=2 *=3, generate(var2)
SPSS syntax:
    recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) into var2.
Data matrix format: -> manual entry available in Stat-JR, but doesn't seem to preserve metadata?
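The same recode can be written as a small range table, which is roughly what an external-file input would supply. This is a minimal Python sketch; the function name `recode` and the sample values are illustrative only.

```python
def recode(value, rules, default):
    """Return the new code for value: first matching inclusive range, else default."""
    for (low, high), new in rules:
        if low <= value <= high:
            return new
    return default


# 1 thru 5 -> 1, 6 thru 10 -> 2, everything else -> 3
rules = [((1, 5), 1), ((6, 10), 2)]

var1 = [1, 5, 6, 10, 11, 0]  # invented sample values
var2 = [recode(v, rules, default=3) for v in var1]
print(var2)
```

Holding the rules as data rather than as syntax is one way to let the same recode be entered manually or read from a file.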
c) Workflows / e-books
Two main objectives:
– Documentation for replication (I think syntax for Stat-JR would help here)
– Sensitivity analysis across multiple measures / models / data permutations
   – Data storage/access
   – Linking different variables
   – Compiling results across many models
Idea of auto-compiled user notes?
Full account of models constructed ("What was that?")
– Of benefit to novice and advanced practitioners
– Potentially a part of the e-notebook, but could be a linked online guide (static)
– E-Stat commands to provide documentation for replication
– Terminologies used for the model/other user notes
– Software equivalents or near equivalents (including estimator specs)
– Algebraic expression and model abstract
? Possible tools for storing/compiling multiple model results
– (mentioned previously, cf. est table in Stata)
Any missing components of model description user notes? (slight modification from Sept 2010)
1) E-Stat model syntax:
    model{
        for (i in 1:length(y36)) {
            y36[i] ~ dnorm(mu[i], tau)
            mu[i] <- cons[i] * beta0 + y8[i] * beta1
        }
        ….
2) E-Stat model: Template1Lev = Linear regression using MCMC
3) Model abstract/background information: e.g. something like: "This model is suitable for a single outcome measure with a continuous distribution. It is comparable to the widely used OLS regression model, and usually leads to near-identical results [etc]. See … for further description."
4) Algebraic representation: [Image from LaTeX code]
5) Specification of the model in other popular packages:
    BUGS syntax: [input here]
    MLwiN syntax: [input here]
    R: [input here]
    Stata: MCMC estimation routines not available
6) Data copy [Data after model, e.g. including new variables]
7) Outputs from model: log file; images
8) Variables summary [Summary stats]
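For item 4, the algebraic representation implied by the model syntax in item 1 could be written as follows. This is a sketch read off the two model lines shown: n stands for length(y36), the second argument of dnorm is taken to be a precision (as in BUGS), so the variance is 1/tau, and the priors are omitted because they fall in the elided part of the syntax.

```latex
\begin{align}
\mathit{y36}_i &\sim \mathrm{N}\!\left(\mu_i,\ \tau^{-1}\right), \\
\mu_i &= \beta_0\,\mathit{cons}_i + \beta_1\,\mathit{y8}_i, \qquad i = 1, \dots, n .
\end{align}
```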
Est store demo here
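In place of the live demo, here is a minimal Python sketch of the store-then-tabulate idea behind Stata's estimates store and estimates table. The function names `est_store` and `est_table` are hypothetical stand-ins rather than any package's API, and the coefficient values are invented.

```python
# Hypothetical results store: model name -> {term: coefficient}.
store = {}


def est_store(name, coefficients):
    """Save one model's coefficients under a name."""
    store[name] = dict(coefficients)


def est_table(names):
    """Return a side-by-side text table of the named models."""
    terms = []
    for name in names:
        for term in store[name]:
            if term not in terms:
                terms.append(term)
    lines = [f"{'term':<8}" + "".join(f"{n:>10}" for n in names)]
    for term in terms:
        cells = ""
        for n in names:
            value = store[n].get(term)
            # Leave the cell blank when a model does not include the term.
            cells += f"{value:>10.3f}" if value is not None else f"{'':>10}"
        lines.append(f"{term:<8}{cells}")
    return "\n".join(lines)


est_store("m1", {"cons": 1.25, "y8": 0.5})                  # invented values
est_store("m2", {"cons": 1.1, "y8": 0.45, "female": -0.2})  # invented values
table = est_table(["m1", "m2"])
print(table)
```

Something of this shape would cover the "compiling results across many models" objective for sensitivity analysis.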
3) Some feedback on Stat-JR
My own current thoughts {see sep. review notes file}
– Look and feel
– A syntactical record of the model specification?
– Back and forward options; add # categories to summary; pre-specified default settings (e.g. burn-in, cons, etc.)
– Make links to users' datasets easy
– Data entry template(?)
– Export data as part of output in popular formats
– Handling large numbers of data files & folders
– Any way to tie in metadata about the records, e.g. variable labels?
Dataset metadata in Stat-JR?
– Comparable options for variable labels, value labels, missing data are widely used/desirable
– Effort to bring these in could help
– Also relates to having data open in other package at same time
Could a functional form tool be incorporated?
– For every dataset, associate each variable with a basic functional form, i.e. metric, nominal or ordinal, that the user can set/change
– Impacts on data options: e.g. separate summary window to summarise categorical variables, such as frequency table/bar chart; options to derive dummy variables and recode values for categorical variables (some of this is similar to what's available on NESSTAR)
– Use this data in some models' options (or pref. just let the user decide)?
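A functional form tool of this kind could be sketched as below. This is an illustration in Python, not a description of anything in Stat-JR: the class `Dataset`, its method names and the example data are all hypothetical.

```python
from collections import Counter

FORMS = {"metric", "nominal", "ordinal"}


class Dataset:
    """Columns plus a user-settable functional form for each variable."""

    def __init__(self, columns):
        self.columns = columns
        self.forms = {}

    def set_form(self, name, form):
        if form not in FORMS:
            raise ValueError(f"unknown functional form: {form}")
        self.forms[name] = form

    def summarise(self, name):
        """Mean for metric variables; frequency table for anything else."""
        values = self.columns[name]
        if self.forms.get(name) == "metric":
            return {"n": len(values), "mean": sum(values) / len(values)}
        return dict(Counter(values))

    def dummies(self, name):
        """Derive one 0/1 indicator per category of a non-metric variable."""
        if self.forms.get(name) == "metric":
            raise ValueError("dummy variables are for categorical variables")
        values = self.columns[name]
        return {
            f"{name}_{level}": [int(v == level) for v in values]
            for level in sorted(set(values))
        }


data = Dataset({"age": [20, 30, 40], "region": ["n", "s", "n"]})  # invented data
data.set_form("age", "metric")
data.set_form("region", "nominal")
print(data.summarise("age"))
print(data.summarise("region"))
print(data.dummies("region"))
```

Because the form is stored with the dataset and not inferred, the user stays in control of how each variable is treated.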
Social science users
I've shown the Alpha version to a couple of colleagues {Comments notes doc from Chris Playford}
– Impressed by the range of options and potential for software comparisons
– Frightened by the specification options/terms; statistical outputs; point and click format; and the current installation requirements
– The most common critical comment has been "why?", as in "[Stata] already does everything I need" and/or "I bet this doesn't work with large and complex data!!"
– Think about niche: sophisticated users can already use software, whilst basic users don't want advanced options?
– I suspect that training / pedagogical value is relevant here
4) Outputs / Applications
Applications I'd most like to test:
– Evaluating different socio-economic measures for model performance (cf. GESDE services)
– Large scale data compilation/analysis
To highlight some output opportunities:
– LWS/E-Stat/DAMES (NCRM/DRS) collaborative research seminar + book proposal, Sept/Oct 2011, on "Modelling key variables in social science research"
– Social stratification research conference, Sept 2011
– Training support: an installation package plus good illustrative template for use at workshops, e.g. Essex Summer School course, July 2011?