: LSS1 Longitudinal Studies Seminars: Longitudinal Analyses Using STATA Stirling University, Data and Variable Management Paul Lambert
: LSS2 Data Management for Longitudinal Data
1. The Nature of ‘Large and Complex’ Data
2. Data management & STATA – getting started
3. Longitudinal Data Types
4. Merging Datasets
: LSS3 The nature of ‘large and complex’ longitudinal resources: complicating the variable by case matrix
[Figure: variable by case matrix, with axes labelled Cases and Variables]
: LSS4 Large and complex = Complexity in:
• Multiple hierarchies of measurement
• Array of variables / operationalisations
• Relations between / subgroups of cases
• Multiple points of measurement
  – Balanced or unbalanced repeated contacts
  – Censored duration data
• Sample collection and weighting
: LSS5 i) Multiple hierarchies (levels) of measurement
• Common examples:
  – Both individuals and households
  – Schools and pupils
  – People and local districts and regions
• Solutions:
  – Separate VxC matrix for each level, eg BHPS
  – Merged VxC matrix at lowest level
: LSS6 Illustration: Hierarchical dataset
[Table: columns Cluster | Person | Person-level Vars; n1 = 3, n2 = 8]
: LSS7 ii) Array of variables
• Vast number of variable responses, eg 1K+
  – Recoding multiplies these up, eg dummies
  – Multiple response vars (‘all that apply’)
  – Categorisations / indexes (eg occupations)
• Implication:
  – Either separate files for separate var. groups
  – Or very long and difficult files…
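To illustrate how recoding multiplies variables up, here is a minimal sketch of turning one categorical variable into a set of dummies. The variable names (pid, occgroup) and values are hypothetical toy data, not taken from the seminar files.

```stata
* Hypothetical toy data: a 3-category occupational grouping
clear
input pid occgroup
1 1
2 2
3 3
4 2
end

* One categorical variable becomes three 0/1 indicators: occ1 occ2 occ3
tabulate occgroup, generate(occ)
list
```

With a detailed occupational coding the same command would generate one indicator per category, which is how a file quickly becomes very wide.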
: LSS8 iii) Relations between cases
• All respondents in a household
• Husbands and wives both sampled
• Fellow school pupils sampled
• Longitudinal: differing relations with others at different times
• Outcomes: Link information between related cases
: LSS9 iv) Multiple measurement points
• Longitudinal: information on same cases for multiple time points
  – Panel or cohort: several records via repeated contact for each individual
    Problems of ‘unbalanced’ panels
  – Life history / retrospective: durations in spells: multistate / multiepisode, overlapping spells; time varying covariates
    Left or right censoring of durations in spells
: LSS10 v) Sample collection / weighting
• Multistage cluster particularly popular
• Sample may have been clustered, stratified
• Longitudinal: uneven inclusion of cases over time
• Sample weights designed to solve, but:
  – Complex in application
  – Not suited to all applications
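As a rough sketch of how a clustered, stratified and weighted design might be declared in STATA, the example below uses hypothetical variable names (psu, stratum, wt, income) and invented toy values. Real surveys document their own design variables, so this is an assumption-laden illustration only.

```stata
* Hypothetical toy data: 2 strata, 2 primary sampling units in each
clear
input psu stratum wt income
1 1 1.5 20000
1 1 1.5 24000
2 1 0.8 31000
2 1 0.8 28000
3 2 1.2 18000
3 2 1.2 22000
4 2 1.0 40000
4 2 1.0 36000
end

* Declare the design once: clusters, sampling weights, strata
svyset psu [pweight = wt], strata(stratum)

* Estimators prefixed with svy: then use the declared design
svy: mean income
```

The point estimate reflects the weights, while the standard error also reflects the clustering and stratification.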
: LSS12 STATA data management examples: see datmanag_part1.do
Claim: For data management, STATA is powerful, but not always well designed
• Batch files / interactive syntax / programs
• Data entry / browsing
• Variable labels
• Computing / recoding
• Missing values
• Weighting data
• Survey estimators (svy)
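The sketch below touches on a few of the tasks listed above (labels, missing values, recoding). It is not the content of datmanag_part1.do; the variable names and the missing-value code -9 are hypothetical assumptions chosen for illustration.

```stata
* Hypothetical toy data: -9 is assumed to be the survey's missing code
clear
input pid sex income
1 1 21000
2 2 -9
3 2 34000
4 1 -9
end

* Variable labels and value labels
label variable income "Annual income"
label define sexlbl 1 "Male" 2 "Female"
label values sex sexlbl

* Missing values: convert the code -9 to Stata's system missing (.)
mvdecode income, mv(-9)

* Computing / recoding: derive a banded version in a new variable
recode income (min/24999 = 1 "Low") (25000/max = 2 "High"), generate(incband)

tabulate incband sex, missing
```

Saving lines like these in a do-file (batch file) rather than typing them interactively keeps the data management steps reproducible.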
: LSS14 Typology of longitudinal data files
• 3 sets of contrasts:
  1. Repeated X-section / Panel / Cohort / Event History / Time Series
  2. Wide vs Long
  3. Discrete vs Continuous time
See datmanag_part2.do
: LSS15 Contrast 1 Type A: Repeated x-sect data
[Table: columns Survey | Person | Person-level Vars; N_s = 3, N_c = 8]
: LSS16 C1 Type B: Panel dataset (Unbalanced)
[Table: columns Cases | Year | Variables; n1 = 3, n2 = 8]
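A small unbalanced panel of this shape (3 cases, 8 records) can be sketched as follows. The variable names and values are hypothetical; the example only shows how STATA can be told about the panel structure and how the imbalance can be inspected.

```stata
* Hypothetical toy panel: 3 cases, 8 person-year records, unequal coverage
clear
input pid year income
1 1991 100
1 1992 110
1 1993 120
2 1991 90
2 1993 95
3 1992 130
3 1993 140
3 1994 150
end

* Declare case and time identifiers, then describe participation patterns
xtset pid year
xtdescribe

* Number of records contributed by each case
bysort pid: generate nwaves = _N
list, sepby(pid)
```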
: LSS17 C1 Type C: Event history data analysis
• Alternative data sources:
  – Panel / cohort (more reliable)
  – Retrospective (cheaper, but recall errors)
• Aka: ‘Survival data analysis’; ‘Failure time analysis’; ‘hazards’; ‘risks’; ...
• Focus shifts to length of time in a ‘state’: analyses determinants of time in state
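As a minimal sketch of the ‘time in a state’ idea, the example below declares a toy duration file as survival-time data. The names (pid, duration, exit) are hypothetical, and exit = 0 is assumed to mark a right-censored spell.

```stata
* Hypothetical toy spells: duration in months, exit = 1 if the spell ended,
* exit = 0 if it was still ongoing when observation stopped (right censored)
clear
input pid duration exit
1 12 1
2 30 0
3 7 1
4 18 1
5 24 0
end

* Declare the data as survival-time data
stset duration, failure(exit) id(pid)

* Summaries of time at risk and the Kaplan-Meier survivor function
stsum
sts list
```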
: LSS18 Key to event histories is ‘state space’
: LSS19 C1 Type D: Time series data
**Exact equivalence to panel data format
Statistical summary of one particular concept, collected at repeated time points from one or more subjects
Examples:
• Unemployment rates by year in UK
• University entrance rates by year by country
: LSS20 Contrast 2: ‘Wide’ versus ‘Long’ format
Relevant to all types of dataset:
• ‘Wide’ = 1 record per case (person), additional vars for time points:
  Person 1  Sex  YoB  Var1_92  Var1_93  Var1_94 …
  Person 2 …
• ‘Long’ = 1 record per time point within person (as panel data example)
• STATA: ‘reshape’ command allows transfer between the two formats
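The wide layout above can be sketched with toy values and converted in both directions with reshape. The data values are invented for illustration; the variable names follow the slide (in lower case), with pid assumed as the person identifier.

```stata
* Hypothetical toy data in wide format: one record per person
clear
input pid sex yob var1_92 var1_93 var1_94
1 1 1960 10 12 15
2 2 1971 8 . 9
end

* Wide to long: one record per person-year, year taken from the suffix
reshape long var1_, i(pid) j(year)
list, sepby(pid)

* Long back to wide
reshape wide var1_, i(pid) j(year)
list
```

Variables that are constant within person (sex, yob) are simply repeated on each record in the long format.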
: LSS21 Contrast 3: Continuous vs Discrete time
Primarily in terms of event history datasets
• Continuous time (‘spell files’, ‘event oriented’)
  – One episode per case, time in case is a variable
• Discrete time
  – One episode per time unit, type of event and event occurrence as variables
• Analyses: Most packages can handle either format comfortably
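One way to move from a continuous-time spell file to a discrete-time file is to expand each spell into one record per time unit. The sketch below assumes hypothetical names (pid, duration, event) and whole-number durations.

```stata
* Hypothetical spell file: one record per case, duration in whole time units
clear
input pid duration event
1 3 1
2 2 0
3 4 1
end

* Create one record per time unit at risk
expand duration
bysort pid: generate t = _n

* Event indicator: 1 only in the final time unit of a spell that ended
generate byte y = event == 1 & t == duration

list pid t y, sepby(pid)
```

The censored spell (pid 2) contributes records with y = 0 throughout.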
: LSS25 Matching files
• Complex data inevitably involves more than one related data file
• A vital data analysis skill!!
• Link data between files by connecting them according to key linking variable(s)
• Eg, ‘person identifier’ variable ‘pid’
• Eg: See datmanag_part3.do
: LSS26 Types of file matching
• Case-to-case matching
  – One-to-one link, eg two files with different sets of variables for same people
  – STATA: merge (or append, to add further cases rather than further variables)
• Table distribution
  – One-to-many link, eg one file has individuals, another has households, and match household info to the individuals
  – STATA: merge
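Table distribution can be sketched as below, using the merge syntax of Stata 11 and later (older versions use a different form of the command). The file contents and the names pid, hid, region and age are hypothetical.

```stata
* Hypothetical household-level file, saved as a temporary dataset
clear
input hid region
10 1
20 2
end
tempfile households
save `households'

* Hypothetical individual-level file: the master data
clear
input pid hid age
1 10 34
2 10 36
3 20 51
end

* Many individuals to one household: household info is copied to each member
merge m:1 hid using `households'
list
```

A case-to-case match on the person identifier would use merge 1:1 pid in the same way.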
: LSS27 Types of file matching ctd
• Aggregating
  – Summarise over multiple cases then link summaries back to cases
  – STATA: collapse
• Related cases matching
  – Link info from one related case to another case, eg info on spouse put on own case
  – STATA: merge or joinby
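The aggregating step can be sketched as a collapse to household summaries followed by a merge back to individuals. The variable names and values are hypothetical, and the merge syntax assumes Stata 11 or later.

```stata
* Hypothetical individual-level data
clear
input pid hid income
1 10 20000
2 10 30000
3 20 15000
end

* Summarise over the cases in each household, keeping the original data aside
preserve
collapse (mean) hhmeaninc = income (count) hhsize = income, by(hid)
tempfile hhsummary
save `hhsummary'
restore

* Link the household summaries back to each individual
merge m:1 hid using `hhsummary', nogenerate
list
```

Note that hhsize here counts household members with non-missing income, because (count) ignores missing values.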
: LSS28 STATA file matching crib: _merge = indicator of which file(s) each case is present in:
1 = Master file but not input (‘using’) file
2 = Input (‘using’) file but not Master file
3 = Both Master and input (‘using’) file
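The three _merge codes can be seen in a toy match between two partially overlapping files. The file contents and variable names are hypothetical, and the merge syntax assumes Stata 11 or later.

```stata
* Hypothetical input ('using') file: persons 2, 3 and 4
clear
input pid wave2inc
2 210
3 310
4 410
end
tempfile wave2
save `wave2'

* Hypothetical master file: persons 1, 2 and 3
clear
input pid wave1inc
1 100
2 200
3 300
end

merge 1:1 pid using `wave2'

* pid 1 gets _merge = 1, pid 4 gets _merge = 2, pids 2 and 3 get _merge = 3
tabulate _merge
list

* Keep only cases found in both files, then drop the indicator
keep if _merge == 3
drop _merge
```

Tabulating _merge before dropping it is a cheap check that the match behaved as expected.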