Scottish Social Survey Network: Master Class 1, Data Analysis with Stata. Dr Vernon Gayle and Dr Paul Lambert, 23rd January 2008, University of Stirling.

Presentation transcript:

1 Scottish Social Survey Network: Master Class 1, Data Analysis with Stata
Dr Vernon Gayle and Dr Paul Lambert
23rd January 2008, University of Stirling
The SSSN is funded under Phase II of the ESRC Research Development Initiative

2 Master Class 1: Data Analysis with Stata, 23/1/08
Introductions and generic resources
– 2V1: Data Analysis and Data Management with Stata (PL)
– 2V1: Introduction to the Stata interface (VG)
– 2A21: Computer Lab: Data analysis and data construction for complex survey data
[Lunch in 2X6]
Specialist topics and illustrative examples (2V1 and 2A21)
– Handling coefficients (VG)
– Sample selected data (VG)
[Coffee in 2X6]
– Multilevel data and analysis (PL)
– Handling occupational data (PL)
Reminder: Scottish Social Survey Network seminar on ‘Scotland’s Large Scale Datasets’, on 24th January 2008, University of Stirling

3 Data Analysis and Data Management with Stata
1) Background: Integrating data analysis and data management
2) Stata and data management
– Lab: Some useful Stata routines / functions

4 Background: Integrating data management and data analysis
By data management we mean:
→ Matching data files together
→ ‘Cleaning’ data
→ Operationalising variables
→ Accessing and reviewing data
“A programme like SPSS … has two main components: the statistical routines, that do the numerical calculations…, and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” (Procter, 2001:253).

5 Research interests, data analysis and data management (1)
1) Research-led pressures for large and complex survey data
– Longitudinal surveys
– Linked data projects, e.g. administrative data; health data; GIS
– Comparative research, e.g. cross-national, historical
→ Social survey researchers enjoy access to a vast array of micro-data resources, many of which have (sometimes hidden) complexity

6 Check: what is large and complex social survey data?
1. Array of variables / operationalisations
→ Competing measures; interaction effects; latent variables
2. Multiple related data files
→ Linked component datasets
→ External data (e.g. aggregate and micro-data)
3. {Large volumes of cases}
4. Relations between cases
5. Multiple hierarchies of measurement
6. Multiple points of measurement
→ Unbalanced repeated contacts
→ {Censored} duration data
→ International comparative survey designs
7. Sample collection and weighting data

7 Example: Multiple measurement points (BHPS Unbalanced panel)
[Table not recovered: one row per person per wave, with columns Wave, Person and person-level variables; N_w=3, N_p=3]
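A minimal Stata sketch of the unbalanced person-by-wave layout this slide refers to. The variable names (pid, wave, inc) and all values are invented for illustration; they are not BHPS data.

```stata
* Illustrative unbalanced panel: 3 persons, up to 3 waves (invented values)
clear
input pid wave inc
1 1 100
1 2 110
1 3 120
2 1 200
2 3 210
3 2 150
end

* Declare the panel structure: pid identifies persons, wave identifies time
xtset pid wave

* Summarise participation patterns; persons 2 and 3 have gaps
xtdescribe

* Count the number of waves each person contributes
bysort pid: generate nwaves = _N
list, sepby(pid)
```

Only person 1 is observed at every wave, which is what makes the panel unbalanced.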

8 E.g.: array of variables and sample selection (BHPS occ data)

9 Example: Relations between cases

10 Check: Variable operationalisations?
Variable operationalisations: the processes by which survey measures are defined and subsequently interpreted by research analysts
Some prescriptive advice (e.g. ONS, EU)
Variable operationalisations in longitudinal research
– Themes from comparative research
– ‘universality’ and ‘specificity’
– Importance of documentation / metadata
– {See Scottish Social Survey Network seminar tomorrow, 24th Jan}
– {See example on occupations this afternoon}
Student’s Law: … In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now …

11 Research interests, data analysis and data management (2)
2) Availability and advocacy of complex methods of data analysis
– Complex statistical approaches
   Multi-process models (CQeSS)
   Latent variable and multilevel analysis
   Missing data analysis
   (e.g. see the SSSN Master Class programme!)
– Challenging methodological approaches
   Mixed methods research
   See esp. the ESRC NCRM
→ The daily work of survey researchers straddles social science and statistical traditions

12 A research capacity shortfall?
Concern that the UK lacks sufficient trained social researchers with quantitative analytical skills
Criticism that social scientists don’t sufficiently exploit empirical survey data
– Insufficient impact of published analyses
– Published analyses are too simple and crude
– {this doesn’t really apply to economics!}
→ This is in some ways a puzzle, given dramatic progress in the availability of survey data (e.g. www.data-archive.ac.uk) and in resources for statistical analysis

13 Returning to survey data management…
Simple survey data management
– Short recodes; selecting cases; one small data file
→ Taught in many textbooks and reasonably widely understood by most users of SPSS, Stata, etc.
Complex survey data management
– Matching multiple data files; complex variable operationalisations; complex relations between cases
→ Is rarely taught in textbooks/courses
→ Is usually required at some stage
→ Often puts off non-specialists

14 A substantial social science need for improved standards and resources in data management
→ In practice, social researchers often spend more time on data management than any other part of the research process
→ A ‘methodology’ of data management is relevant to social science literatures on ‘harmonisation’, ‘comparability’
[Diagram: Data access / collection, Data Management, Data Analysis; labelled with UK Data Archive, Qualidata, Flagship social surveys, Office for National Statistics, Administrative data, Specialist academic outputs, DAMES, ONS support, ESDS support, NCRM workshops, Essex summer school, ESRC RDI initiatives, CQeSS]

15 Confronting complex data management…
There are two related possibilities
i. Generic resources and services for (survey) data management
→ Format independence
→ Computer science research (e-science)
ii. Specialist support for key social survey data management approaches
→ Directed to specific software formats
→ Directed to specific example datasets

16 (i) DAMES – Data Management through e-Social Science
ESRC National Centre for e-Social Science research Node, University of Stirling / University of Glasgow
Case studies, provision and support for data management in the social sciences
4 social science themes
1) Grid Enabled Specialist Data Environments: occupations; education; ethnicity
2) Micro-simulation on social care data
3) Linking e-Health and social science databases
4) Training and interfaces for data management support
Underlying computer science research themes
– Linking heterogeneous and distributed data; metadata; data abstraction and data fusion; workflow modelling; data security

17 (ii) Specialist support for survey research communities
– Scottish Social Survey Network
– Focussed advice on a smallish range of
   Key surveys
   Key variables
Stata and survey data management
– Stata combines extensive routines for data analysis with extensive routines for data management

18 Data Analysis and Data Management with Stata
1) Background: Integrating data analysis and data management
2) Stata and data management
– Lab: Some useful Stata routines / functions

19 Stata and its competitors (1)
Claim: Stata offers unparalleled convenience in combining pre-programmed data analytical and data management functionality
Ease of data access, manipulation and review
– Conditional processing (‘if’, ‘by’)
– Succinct command syntax
– Ability to read online files
Exporting / saving results and graphs
– Regression model outputs
– Matrix manipulation of model results
Development of new analytical routines
– Research community posting new models (researcher driven)
– Complex data estimators (svy; cluster; xt; xtmixed)
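A short sketch of the conditional processing and stored-results features listed on this slide. It uses the auto example dataset that ships with Stata rather than any survey data from the Master Class, so the variables are purely illustrative.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* 'if' restricts a command to cases meeting a condition
summarize price if mpg > 25

* 'by' repeats a command within groups (bysort sorts the data first)
bysort foreign: summarize price

* 'by' combined with egen: store each group's mean on every case
bysort foreign: egen mean_price = mean(price)

* Fit a model, then pull the coefficients out of the stored results
regress price mpg weight
matrix b = e(b)
matrix list b
display _b[mpg]

* Online files can be read in the same way as local ones, for example
* (needs an internet connection, so left commented out here):
* webuse nlswork, clear
```

The coefficient vector in e(b) is an ordinary Stata matrix, so it can be manipulated or exported like any other.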

20 Stata and its competitors (2)
Claim: Stata is ultimately much more powerful, but it is not always well designed
Batch files / interactive syntax / programs:
– Stata has more flexibility, but SPSS interactive syntax is easier (e.g. delimiters)
Direct data entry / browsing
– Stata is clumsy; easier to use SPSS or another package
Variable and value labels and presenting outputs
– SPSS quicker and better presentation; Stata needs more effort
Computing / recoding / conditional processing
– Stata more extensive (e.g. ‘by’ and ‘if’); SPSS easier to use, e.g. Stata’s ‘generate’ won’t overwrite an existing variable (‘replace’ is needed)
Missing values / weighting data
– Stata’s default settings cause more confusion than SPSS
– Stata has some restrictions on its weights / SPSS easier
Complex data estimators (svy; cluster; xt; xtmixed)
– Unique and advantageous feature of Stata
– But many Stata models are very slow to estimate, e.g. GLLAMM
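Two of the points on this slide can be seen in a few lines: overwriting an existing variable, and the default treatment of missing values. This is an illustrative sketch using Stata's bundled auto dataset, not an example from the original lab files.

```stata
sysuse auto, clear

* generate refuses to overwrite an existing variable, so this fails;
* capture traps the error and stores a non-zero return code in _rc
capture generate price = price / 1000
display _rc

* Changes to an existing variable must use replace instead
replace price = price / 1000

* Numeric missing values are treated as larger than any number, so a
* bare '>' test silently includes them (rep78 has missing values here)
count if rep78 > 3
count if rep78 > 3 & !missing(rep78)
```

The two counts differ by the number of cases with rep78 missing, which is the kind of default behaviour that tends to surprise users coming from SPSS.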

21 Some existing resources on data management
Stata’s files:
LDA WebCT site
Worked examples of data management on complex survey data using SPSS and Stata:
– ‘introductory training in data analysis’
– ‘longitudinal research resources’
– Model: ‘learn by doing’…
Researcher input:
– Importance of logging your work (‘syntax’ / ‘do’ files)
– Consistent use of file paths / annotation of command files
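One possible shape for a logged, annotated ‘do’ file along the lines recommended above. The file name analysis1 and the data folder are hypothetical placeholders, and the layout is a suggestion rather than the template used in the LDA materials.

```stata
* analysis1.do (hypothetical file name)
* Purpose: illustrative template for a logged, annotated command file
clear
set more off

* Define file locations once, at the top of the file
* datadir is a placeholder path; outdir is set to the current folder
global datadir "C:/data/bhps"
global outdir "`c(pwd)'"

* Keep a text log of everything that follows
capture log close
log using "$outdir/analysis1.log", replace text

* ... data management and analysis commands go here, for example:
* use "$datadir/file1.dta", clear

log close
```

Defining paths once as global macros means that moving the project to another machine requires editing only the top of the file.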

22 Stata lab 23/1/08: illustrating integrated data management and analysis
Example files from ‘Longitudinal data analysis’
– 4 LDA files with extended examples
– {Data (from UKDA) should be in place on machines for today}
→ First lab: a selective summary file
→ Concentrates on matching data and manipulating variables

23 Variable management in Stata
Painful text value label processes…
Recoding data examples
Use of ‘do’ and ‘ado’ batch files
Matching with aggregate datasets
Further resources on operationalising variables: see talk on ‘Handling occupational data’
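A brief sketch of the value-label and recoding steps mentioned on this slide, again using Stata's bundled auto dataset. The label name replbl, the new variable rep3 and the label text are invented for illustration.

```stata
sysuse auto, clear

* Value labels take two steps: define the label set, then attach it
label define replbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"
label values rep78 replbl
tabulate rep78, missing

* Recode into a new labelled variable, leaving the original intact
recode rep78 (1 2 = 1 "Below average") (3 = 2 "Average") ///
    (4 5 = 3 "Above average"), generate(rep3)

* Cross-tabulate old against new to check the recode
tabulate rep78 rep3, missing
```

Writing to a new variable with generate() and then cross-tabulating is a cheap way to verify a recode before the original is dropped.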

24 Matching files
Complex data inevitably involves more than one related data file
– Multiple related files are almost inevitable with longitudinal data collections
A vital data analysis skill!!
– Link data between files by connecting them according to key linking variable(s)
– E.g. ‘person identifier’ variable ‘pid’
– E.g.: iserwww.essex.ac.uk/ulsc/bhps/doc/
See SPSS and Stata example command files within LDA Website

25 Types of file matching
1. Addition of files
– E.g. two files with same variables for different people
Stata: append using file2.dta
SPSS: add files file="file1.sav" /file="file2.sav".
2. Case-to-case matching
– One-to-one link, e.g. two files with different sets of variables for same people
Stata: merge pid using file2.dta
SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
3. Table distribution
– One-to-many link, e.g. one file has individuals, another has households, and match household info to the individuals
Stata: merge hid using file2.dta
SPSS: match files file="file1.sav" /table="file2.sav" /by=hid.
(The household-level key is written here as ‘hid’. Both files must be sorted on the key variable before matching.)
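The three match types above can be tried on toy data as follows. This sketch assumes Stata 11 or later, where the merge type is stated explicitly (merge 1:1, merge m:1) and pre-sorting is not required; the slide itself shows the earlier syntax. All variable names and values are invented, and the code should be run as a do-file because it relies on temporary files.

```stata
* Person file for two people (pid = person id, hid = household id)
clear
input pid hid age
1 10 34
2 10 36
end
tempfile persons_a
save `persons_a'

* 1. Addition of files: same variables, different people
clear
input pid hid age
3 20 51
end
append using `persons_a'
sort pid
tempfile persons
save `persons'

* Second person-level file: different variables, same people
clear
input pid inc
1 1200
2 900
3 1500
end
tempfile incomes
save `incomes'

* Household-level file: one row per household
clear
input hid hhsize
10 2
20 1
end
tempfile households
save `households'

* 2. Case-to-case matching: one-to-one on pid
use `persons', clear
merge 1:1 pid using `incomes'
drop _merge

* 3. Table distribution: many persons to one household record, on hid
merge m:1 hid using `households'
drop _merge
list
```

Stating 1:1 or m:1 makes Stata stop with an error if the key is not unique where it is expected to be, which catches many matching mistakes early.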

26 Types of file matching, ctd.
4. Aggregating
– Summarise over multiple cases
– Stata: collapse (mean) inc, by(pid)
   or: egen avinc=mean(inc), by(pid)
– SPSS: aggregate outfile="file2.sav" /break=pid /avinc=mean(inc).
– Output files from aggregate / collapse are often linked back into the micro-data from which they are derived
5. Related cases matching
– Link info from one related case to another case, e.g. info on spouse put on own case
– Stata: merge pid using file2.dta
   or: joinby …
– SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
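The two Stata routes to aggregation behave differently: collapse replaces the data in memory with one row per group, while egen adds the group summary to every existing record. A sketch on invented data, assuming Stata 11 or later for the merge m:1 syntax and run as a do-file:

```stata
* Repeated income records per person (invented values)
clear
input pid wave inc
1 1 100
1 2 120
2 1 200
2 2 220
2 3 240
end
tempfile micro
save `micro'

* Option A: collapse reduces the data to one row per pid...
collapse (mean) avinc = inc, by(pid)
tempfile means
save `means'

* ...which is then linked back to the micro-data (many-to-one)
use `micro', clear
merge m:1 pid using `means'
drop _merge

* Option B: egen adds the group mean directly, keeping every record
egen avinc2 = mean(inc), by(pid)
list, sepby(pid)
```

Both routes give the same person-level mean here; egen is shorter when the summary is only needed on the micro-data, while collapse is the natural choice when the aggregate file is wanted in its own right.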

27 File matching crib
Stata: _merge = indicator of cases present for:
1 = Master file but not input (‘using’) file
2 = Input (‘using’) file but not Master file
3 = Master and input (‘using’) file
Remember to drop the auto-generated _merge before performing the next merge command
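The crib can be checked on a small example in which the two files only partly overlap. The data are invented, the merge 1:1 syntax assumes Stata 11 or later, and the code should be run as a do-file.

```stata
* Master file: persons 1-3
clear
input pid age
1 34
2 36
3 51
end
tempfile master
save `master'

* Using (input) file: persons 2-4
clear
input pid inc
2 900
3 1500
4 700
end
tempfile using_file
save `using_file'

use `master', clear
merge 1:1 pid using `using_file'

* Inspect how many cases fall in each category before deciding what to keep
tabulate _merge

* Keep only cases found in both files, then drop _merge so that the
* next merge command can create its own indicator
keep if _merge == 3
drop _merge
```

Here person 1 gets _merge equal to 1, person 4 gets 2, and persons 2 and 3 get 3, so two cases remain after the keep.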