Scottish Social Survey Network: Master Class 1: Data Analysis with Stata. Dr Vernon Gayle and Dr Paul Lambert, 23rd January 2008, University of Stirling.


1 Scottish Social Survey Network: Master Class 1: Data Analysis with Stata
Dr Vernon Gayle and Dr Paul Lambert
23rd January 2008, University of Stirling
The SSSN is funded under Phase II of the ESRC Research Development Initiative

2 Master Class 1: Data Analysis with Stata, 23/1/08
Introductions and generic resources
- 1030-1100, 2V1: Data Analysis and Data Management with Stata (PL)
- 1100-1130, 2V1: Introduction to the Stata interface (VG)
- 1130-1300, 2A21: Computer Lab: Data analysis and data construction for complex survey data
[Lunch in 2X6]
Specialist topics and illustrative examples (2V1 and 2A21)
- 1400-1445: Handling coefficients (VG)
- 1445-1515: Sample selected data (VG)
[Coffee in 2X6]
- 1545-1615: Multilevel data and analysis (PL)
- 1615-1645: Handling occupational data (PL)
Reminder: Scottish Social Survey Network seminar on ‘Scotland’s Large Scale Datasets’, 1500-1700 on 24th January 2008, University of Stirling

3 Data Analysis and Data Management with Stata
1) Background: Integrating data analysis and data management
2) Stata and data management
   - Lab: Some useful Stata routines / functions

4 Background: Integrating data management and data analysis
By data management we mean:
- Matching data files together
- ‘Cleaning’ data
- Operationalising variables
- Accessing and reviewing data
“A programme like SPSS … has two main components: the statistical routines, that do the numerical calculations…, and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” (Procter, 2001:253).

5 Research interests, data analysis and data management (1)
1) Research-led pressures for large and complex survey data
   - Longitudinal surveys
   - Linked data projects, e.g. administrative data; health data; GIS
   - Comparative research, e.g. cross-national, historical
→ Social survey researchers enjoy access to a vast array of micro-data resources, many of which have (sometimes hidden) complexity

6 Check: what is large and complex social survey data?
1. Array of variables / operationalisations
   - Competing measures; interaction effects; latent variables
2. Multiple related data files
   - Linked component datasets
   - External data (e.g. aggregate and micro-data)
3. {Large volumes of cases}
4. Relations between cases
5. Multiple hierarchies of measurement
6. Multiple points of measurement
   - Unbalanced repeated contacts
   - {Censored} duration data
   - International comparative survey designs
7. Sample collection and weighting data

7 Example: Multiple measurement points (BHPS unbalanced panel)
Wave  Person  Person-level vars
1     1       1   38   1   36
1     2       2   34   2   0
1     3       2   69   -
2     1       1   39   1   38
2     2       2   35   1   16
3     1       1   40   1   36
3     2       2        1   18
3     3       2   89   -
N_w = 3, N_p = 3
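A minimal Stata sketch of how an unbalanced panel of this kind might be declared and inspected. The variable names (wave, person, age) and all values are made up for illustration and are not taken from the BHPS; person 3 is deliberately absent at wave 2.

```stata
* Illustrative data only: names and values are assumptions
clear
input wave person age
1 1 38
1 2 34
1 3 69
2 1 39
2 2 35
3 1 40
3 2 36
3 3 71
end

* Declare the panel structure: person is the unit, wave is the time index
xtset person wave

* Summarise the pattern of participation across waves
xtdescribe

* Count the number of records contributed by each person
bysort person: generate nwaves = _N
tabulate nwaves
```

Here xtdescribe reports which wave patterns occur, which is a quick way to see that the panel is unbalanced before any modelling.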

8 E.g.: array of variables and sample selection (BHPS occ data)

9 Example: Relations between cases

10 Check: Variable operationalisations?
Variable operationalisations: processes by which survey measures are defined and subsequently interpreted by research analysts
- Some prescriptive advice (e.g. ONS, EU)
- Variable operationalisations in longitudinal research
  - http://www.longitudinal.stir.ac.uk/variables/
- Themes from comparative research
  - ‘universality’ and ‘specificity’
  - Importance of documentation / metadata
  - {See Scottish Social Survey Network seminar tomorrow, 24th Jan}
  - {See example on occupations this afternoon}
Student’s Law: … In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now …

11 Research interests, data analysis and data management (2)
2) Availability and advocacy of complex methods of data analysis
   - Complex statistical approaches
     - Multi-process models (CQeSS, http://e-science.lancs.ac.uk/cqess/)
     - Latent variable and multilevel analysis
     - Missing data analysis (e.g. www.missingdata.org.uk)
     - See the SSSN Master Class programme
   - Challenging methodological approaches
     - Mixed methods research
     - See esp. the ESRC NCRM (http://www.ncrm.ac.uk/)
→ The daily work of survey researchers straddles social science and statistical traditions

12 A research capacity shortfall?
- Concern that the UK lacks sufficient trained social researchers with quantitative analytical skills
- Criticism that social scientists don’t sufficiently exploit empirical survey data
  - Insufficient impact of published analyses
  - Published analyses are too simple and crude
  - {this doesn’t really apply to economics!}
→ This is in some ways a puzzle, given dramatic progress in the availability of survey data (e.g. www.data-archive.ac.uk) and in resources for statistical analysis

13 Returning to survey data management…
Simple survey data management
- Short recodes; selecting cases; one small data file
→ Taught in many textbooks and reasonably widely understood by most users of SPSS, Stata, etc.
Complex survey data management
- Matching multiple data files; complex variable operationalisations; complex relations between cases
→ Is rarely taught in textbooks/courses
→ Is usually required at some stage
→ Often puts off non-specialists

14 A substantial social science need for improved standards and resources in data management
→ In practice, social researchers often spend more time on data management than any other part of the research process
→ A ‘methodology’ of data management is relevant to social science literatures on ‘harmonisation’, ‘comparability’
[Diagram: three stages, Data access / collection, Data Management, Data Analysis, with associated resources: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]

15 Confronting complex data management…
There are two related possibilities:
i. Generic resources and services for (survey) data management
   - Format independence
   - Computer science research (e-science)
ii. Specialist support for key social survey data management approaches
   - Directed to specific software formats
   - Directed to specific example datasets

16 (i) DAMES: Data Management through e-Social Science
ESRC National Centre for e-Social Science research Node, University of Stirling / University of Glasgow, 2008-2011
Case studies, provision and support for data management in the social sciences
4 social science themes
1) Grid Enabled Specialist Data Environments: occupations; education; ethnicity
2) Micro-simulation on social care data
3) Linking e-Health and social science databases
4) Training and interfaces for data management support
Underlying computer science research themes
- Linking heterogeneous and distributed data; metadata; data abstraction and data fusion; workflow modelling; data security

17 (ii) Specialist support for survey research communities
- Scottish Social Survey Network
- Focussed advice on a smallish range of
  - Key surveys
  - Key variables
Stata and survey data management
- Stata combines extensive routines for data analysis with extensive routines for data management

18 Data Analysis and Data Management with Stata
1) Background: Integrating data analysis and data management
2) Stata and data management
   - Lab: Some useful Stata routines / functions

19 Stata and its competitors (1)
Claim: Stata offers unparalleled convenience in combining pre-programmed data analytical and data management functionality
Ease of data access, manipulation and review
- Conditional processing (‘if’, ‘by’)
- Succinct command syntax
- Ability to read online files
Exporting / saving results and graphs
- Regression model outputs
- Matrix manipulation of model results
Development of new analytical routines
- Research community posting new models (researcher driven)
- Complex data estimators (svy; cluster; xt; xtmixed)
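A short sketch of the features listed on this slide, using the auto example dataset that ships with Stata. The dataset and the particular variables are chosen here for illustration only; they are not the Master Class files.

```stata
* Illustrative sketch using an example dataset installed with Stata
sysuse auto, clear

* 'if' restricts a command to cases meeting a condition
summarize price if foreign == 1

* 'by' repeats a command within groups (bysort sorts the data first)
bysort foreign: summarize price mpg

* 'by' and 'if' can be combined in one line
bysort foreign: summarize mpg if price < 6000

* Model results are stored after estimation and can be reused
regress price mpg weight
display _b[mpg]

* The coefficient vector can be copied into a matrix for manipulation
matrix b = e(b)
matrix list b
```

The use command also accepts a web address in place of a local file name, which is what "ability to read online files" refers to.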

20 Stata and its competitors (2)
Claim: Stata is ultimately much more powerful, but it is not always well designed
Batch files / interactive syntax / programs:
- Stata has more flexibility, but SPSS interactive syntax is easier (e.g. delimiters)
Direct data entry / browsing
- Stata is clumsy; easier to use SPSS or another package
Variable and value labels and presenting outputs
- SPSS quicker and better presentation; Stata needs more effort
Computing / recoding / conditional processing
- Stata more extensive (e.g. ‘by’ and ‘if’); SPSS easier to use, e.g. Stata won’t allow overwriting an existing variable
Missing values / weighting data
- Stata’s default settings cause more confusion than SPSS
- Stata has some restrictions on its weights / SPSS easier
Complex data estimators (svy; cluster; xt; xtmixed)
- Unique and advantageous feature of Stata
- But many Stata models are very slow to estimate, e.g. GLLAMM
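Two of the points on this slide can be illustrated with a brief sketch: generate refuses to overwrite an existing variable (replace is needed instead), and missing values are treated as larger than any number in comparisons. The auto example dataset and the variable name highprice are used purely for illustration.

```stata
sysuse auto, clear

* generate creates a new variable
generate highprice = price > 6000

* A second generate with the same name fails; capture traps the error
capture generate highprice = price > 8000
* A non-zero return code shows that the command was refused
display _rc

* replace is the command for overwriting an existing variable
replace highprice = price > 8000

* Missing values count as larger than any number, so this
* condition also selects cars with missing rep78
count if rep78 > 3

* Excluding missing values has to be made explicit
count if rep78 > 3 & !missing(rep78)
```

The two counts differ by the number of cases with missing rep78, which is the kind of default behaviour that tends to surprise users coming from SPSS.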

21 Some existing resources on data management
Stata’s files: http://www.stata.com/support/faqs/data/
LDA WebCT site www.longitudinal.stir.ac.uk, worked examples of data management on complex survey data using SPSS and Stata:
- ‘introductory training in data analysis’
- ‘longitudinal research resources’
- Model: ‘learn by doing’…
Researcher input:
- Importance of logging your work (‘syntax’ / ‘do’ files)
- Consistent use of file paths / annotation of command files
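One possible layout for the top of a do-file that follows the advice on logging, file paths and annotation. This is a sketch rather than the layout used in the LDA example files; the macro name path_work and the log file name are hypothetical.

```stata
* Sketch of a do-file header; macro and file names are hypothetical
version 10
clear
set more off

* Define file paths once, at the top, so they are easy to change
* (here the current working directory stands in for a project folder)
global path_work "`c(pwd)'"

* Keep a log of every command and its output
capture log close
log using "$path_work/lab1_example.log", replace text

* Annotate each step of the data management
* Step 1: open the data (an example dataset installed with Stata)
sysuse auto, clear
describe

* Step 2: close the log when the work is finished
log close
```

Forward slashes are used in the path because Stata accepts them on all platforms.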

22 Stata lab 23/1/08: illustrating integrated data management and analysis
Example files from ‘Longitudinal data analysis’, www.longitudinal.stir.ac.uk
- 4 LDA files with extended examples
- {Data (from UKDA) should be in place on machines for today}
→ First lab: a selective summary file
→ Concentrates on matching data and manipulating variables

23 Variable management in Stata
- Painful text value label processes
- Recoding data examples
- Use of ‘do’ and ‘ado’ batch files
- Matching with aggregate datasets
- Further resources on operationalising variables: see talk on ‘Handling occupational data’
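A brief sketch of recoding and value labelling in Stata, using the auto example dataset. The new variable rep3, the label name rep3lab and the category wording are invented for illustration.

```stata
sysuse auto, clear

* Recode a numeric variable into a new three-category variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Value labels are defined first, then attached to the variable
label define rep3lab 1 "Poor" 2 "Average" 3 "Good"
label values rep3 rep3lab
label variable rep3 "Repair record (3 categories)"

* Check the new variable against the original, including missing values
tabulate rep78 rep3, missing
```

The two-step label define / label values process is part of what makes value labelling more laborious in Stata than in SPSS.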

24 Matching files
Complex data inevitably involves more than one related data file
- Multiple related files are almost inevitable with longitudinal data collections
A vital data analysis skill!
- Link data between files by connecting them according to key linking variable(s)
- E.g. ‘person identifier’ variable ‘pid’
- E.g. iserwww.essex.ac.uk/ulsc/bhps/doc/
See SPSS and Stata example command files within the LDA website

25 Types of file matching
1. Addition of files
   - E.g. two files with the same variables for different people
   Stata: append using file2.dta
   SPSS: add files file="file1.sav" /file="file2.sav".
2. Case-to-case matching
   - One-to-one link, e.g. two files with different sets of variables for the same people
   Stata: merge pid using file2.dta
   SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
3. Table distribution
   - One-to-many link, e.g. one file has individuals, another has households, and household info is matched to the individuals
   Stata: merge pid using file2.dta
   SPSS: match files file="file1.sav" /table="file2.sav" /by=pid.
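The three types of matching can be sketched on small made-up files. All variable names other than pid (hid, age, income, region) and all values are assumptions for illustration, and the household link is made here on a hypothetical household identifier hid. The merge syntax is the one shown on the slide, which requires both files to be sorted on the key; from Stata 11 onwards the equivalent commands are merge 1:1 and merge m:1. The block should be run as a whole do-file because it relies on temporary files.

```stata
* Illustrative data: all names except pid, and all values, are made up
clear
input pid hid age
1 10 38
2 10 34
3 20 69
end
sort pid
tempfile people
save `people'

clear
input pid income
1 1200
2 900
3 400
end
sort pid
tempfile incomes
save `incomes'

clear
input hid region
10 1
20 2
end
sort hid
tempfile households
save `households'

* 2. Case-to-case matching: same people, different variables
*    (Stata 11+: merge 1:1 pid using ...)
use `people', clear
merge pid using `incomes'
tabulate _merge
drop _merge

* 3. Table distribution: household records spread over individuals
*    (Stata 11+: merge m:1 hid using ...)
sort hid
merge hid using `households'
tabulate _merge
drop _merge
list

* 1. Addition of files: same variables, different people
clear
input pid hid age
4 30 51
end
append using `people'
list
```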

26 Types of file matching, ctd.
4. Aggregating
   - Summarise over multiple cases
   Stata: collapse (mean) inc, by(pid)
      or: egen avinc=mean(inc), by(pid)
   SPSS: aggregate outfile="file2.sav" /break=pid /avinc=mean(inc).
   - Output files from aggregate / collapse are often linked back into the micro-data from which they are derived
5. Related cases matching
   - Link info from one related case to another case, e.g. info on spouse put on own case
   Stata: merge pid using file2.dta
      or: joinby …
   SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
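A small sketch of the two Stata routes to aggregation, on made-up repeated income records. The variable names wave, avinc2 and nwaves and all values are assumptions for illustration.

```stata
* Illustrative data: repeated income records per person (values made up)
clear
input pid wave inc
1 1 1000
1 2 1200
2 1 800
2 2 900
2 3 1000
end

* egen adds the person-level mean to every record, keeping the micro-data
egen avinc = mean(inc), by(pid)
list, sepby(pid)

* collapse replaces the data in memory with one record per person;
* preserve / restore brings the original records back afterwards
preserve
collapse (mean) avinc2 = inc (count) nwaves = inc, by(pid)
list
restore
```

With egen there is no need to link the summary back to the micro-data, whereas the output of collapse would have to be saved and merged back on pid.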

27 File matching crib
Stata: _merge = indicator of cases present for:
  1 = Master file but not input file
  2 = Input file but not Master file
  3 = Master and input file
Remember to drop the auto-generated _merge before performing the next merge command
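The crib can be checked on two small made-up files in which some people appear in only one file. The variables age and income and all values are invented for illustration; the block uses the merge syntax from the slides and should be run as a whole do-file because of the temporary file.

```stata
* Input (using) file: pids 2, 3 and 4
clear
input pid income
2 900
3 400
4 700
end
sort pid
tempfile incomes
save `incomes'

* Master file: pids 1, 2 and 3
clear
input pid age
1 38
2 34
3 69
end
sort pid
merge pid using `incomes'

* 1 = master only (pid 1), 2 = input only (pid 4), 3 = both (pids 2, 3)
tabulate _merge
list, clean

* Keep matched cases only, then drop _merge before any further merge
keep if _merge == 3
drop _merge
```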

