Published by Ashlynn Powers. Modified over 8 years ago.
1
DATA STRUCTURES AND LONGITUDINAL DATA ANALYSIS
Nidhi Kohli, Ph.D.
Quantitative Methods in Education (QME)
Department of Educational Psychology
2
Exploratory Data Analysis
Structure of longitudinal data refers to the format of the data set.
Exploration of longitudinal data refers to the initial steps in the data analytic process, using graphical and descriptive methods.
3
Format of Longitudinal Data
Wide Format:
It is the standard structure of longitudinal data.
It is also referred to as subjects-by-variables format or multivariate format.
Example:
Subject   Wave 1   Wave 2   Wave 3   Wave 4   Medication
1         29       22       33       11       0
2310160
3102025860
4
Format of Longitudinal Data
Wide Format:
Data in wide format are used in profile analysis with repeated measures ANOVA/MANOVA.
Descriptive statistics, such as means and correlations (and covariances) between measurement occasions, can be computed easily when data are structured this way.
5
Format of Longitudinal Data
Long Format:
It is also referred to as univariate format.
The time metric is explicit in long format, appearing as a time variable in its own column.
Linear Mixed Effects (LME) modeling requires the data to be in this format.
Graphs of individual curves and graphs of means also require the data to be in long format.
6
Format of Longitudinal Data
Long Format:
Example:
Subject   Wave   Pain   Medication
1         1      29     0
1         2      22     0
1         3      33     0
1         4      11     0
4         1      23     1
4         2      11     1
4         3      17     1
4         4      30     1
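The two layouts carry the same information, and base R's reshape() can convert one into the other. The following is a minimal sketch rather than code from the original slides. It uses the values for subjects 1 and 4 from the long format example above, and the wide column names (pain.1 to pain.4, medication) are assumptions made for illustration.

```r
# Wide layout: one row per subject, one column per wave (column names are hypothetical)
wide <- data.frame(
  subject    = c(1, 4),
  pain.1     = c(29, 23),
  pain.2     = c(22, 11),
  pain.3     = c(33, 17),
  pain.4     = c(11, 30),
  medication = c(0, 1)
)

# Stack the four pain columns into one column and make the wave explicit
long <- reshape(
  wide,
  direction = "long",
  varying   = c("pain.1", "pain.2", "pain.3", "pain.4"),
  v.names   = "pain",
  timevar   = "wave",
  times     = 1:4,
  idvar     = "subject"
)

# Sort by subject and wave, and put the columns in the order used on the slide
long <- long[order(long$subject, long$wave),
             c("subject", "wave", "pain", "medication")]
print(long)
```

The static predictor (medication) is repeated on every row of a subject in the long layout, which is why a missing static predictor affects the subject's entire record.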
7
Balanced And Complete
What is a balanced design?
A balanced design is one in which all participants are measured at the same time points, whereas an unbalanced design occurs when not all participants are measured at the same time points.
8
Balanced And Complete
What is a complete data set?
Data are complete when there are no missing data, that is, when every observation that was planned was realized. Incomplete data indicates missing data (i.e., observations that were planned but not realized).
9
Examples: Balanced And Complete
Unbalanced Design – Complete Data

Subject #1
Age   Score
10    9.2
12    10.5
13    9.8
16    12.6

Subject #2
Age   Score
9     10.1
11    11.6
14    10.8
16    13.9

Subject #3
Age   Score
9     11.1
10    12.8
15    15.3
16    12.2
10
Examples: Balanced And Complete
Balanced Design – Incomplete Data

          Age
Subject   9      10     11     12     13     14     15     16
1         .      9.2    .      10.5   9.8    .      .      12.6
2         10.1   .      11.6   .      .      10.8   .      13.9
3         11.1   12.8   .      .      .      .      15.3   12.2
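The same twelve scores can be read as complete data from an unbalanced design, or as incomplete data from a balanced design, depending on which measurement occasions are taken to be planned. The sketch below is an illustration added here, not part of the original slides. It places the observed scores on a common age grid (ages 9 to 16) in base R; the object names (obs, grid, full, wide) are arbitrary.

```r
# Observed scores, one row per subject and age (values from the example tables)
obs <- data.frame(
  subject = rep(1:3, each = 4),
  age     = c(10L, 12L, 13L, 16L,   9L, 11L, 14L, 16L,   9L, 10L, 15L, 16L),
  score   = c(9.2, 10.5, 9.8, 12.6,  10.1, 11.6, 10.8, 13.9,  11.1, 12.8, 15.3, 12.2)
)

# Treat every age from 9 to 16 as a planned occasion for every subject
grid <- expand.grid(subject = 1:3, age = 9:16)

# Left join: planned occasions without an observed score become NA
full <- merge(grid, obs, all.x = TRUE)
full <- full[order(full$subject, full$age), ]

# Subjects-by-ages table, with NA where the slide shows "."
wide <- tapply(full$score,
               list(subject = full$subject, age = full$age),
               function(s) s[1])
print(wide)

# Number of planned-but-unrealized observations under this reading
sum(is.na(full$score))
```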
11
Examples: Balanced And Complete
Unbalanced Design – Incomplete Data

          Age
Subject   9      10     11     12     13     14     15     16
1*        9.2    10.5   9.8    12.6   .      .      .      .
2         10.1   NA     11.6   NA     .             13.9
3         11.1   NA     12.8   NA     15.3   NA     12.2

*It is assumed that the researcher planned to measure subject #1 yearly but other subjects every two years.
12
Treatment of Imbalance and Incompleteness
Data that come from an unbalanced design with missing data are sometimes treated as complete and balanced if:
1. the number of waves is equal for all participants, or
2. the researcher deletes data to force an equal number of waves
13
Treatment of Imbalance and Incompleteness
Consider incomplete data from an unbalanced design

          Age
Subject   9      10     11     12     13     14     15
1         9.2    .      10.5   9.8    .      .      .
2         10.1   NA     11.6   NA     .             10.8
3         11.1   NA     12.8   NA     15.3   NA     .
14
Treatment of Imbalance and Incompleteness
Suppose that, in the analysis, the successive waves of measurement are of more substantive importance than the timing of the observations

          Wave
Subject   1      2      3
1         9.2    10.5   9.8
2         10.1   11.6   10.8
3         11.1   12.8   15.3
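The wave table can be produced by numbering each subject's observed scores in chronological order and discarding the ages. The following base R sketch is an added illustration, not code from the slides. The ages are read from the age-indexed table on the previous slide, and placing subject 2's score of 10.8 at age 15 is an assumption about that table's layout.

```r
# Observed (non-missing) scores with their ages
obs <- data.frame(
  subject = rep(1:3, each = 3),
  age     = c(9, 11, 12,   9, 11, 15,   9, 11, 13),  # subject 2, age 15: assumed
  score   = c(9.2, 10.5, 9.8,  10.1, 11.6, 10.8,  11.1, 12.8, 15.3)
)

# Number the observations within each subject in order of age
obs <- obs[order(obs$subject, obs$age), ]
obs$wave <- ave(obs$age, obs$subject, FUN = seq_along)

# Scores by wave: this looks complete and balanced
by_wave <- tapply(obs$score,
                  list(subject = obs$subject, wave = obs$wave),
                  function(s) s[1])
print(by_wave)

# Ages by wave: the same wave number corresponds to different ages
age_by_wave <- tapply(obs$age,
                      list(subject = obs$subject, wave = obs$wave),
                      function(a) a[1])
print(age_by_wave)
```

Printing the ages next to the scores makes visible what the wave metric hides: under the reading assumed here, wave 3 corresponds to a different age for each subject.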
15
Treatment of Imbalance and Incompleteness
In this example, the chronology metric (i.e., time scale) is ignored, and so is the variability in the timing of observations.
Ignoring the time scale (e.g., age) may be indefensible, especially if the scores reflect some type of developmental phenomenon that is naturally tied to the time scale.
16
Treatment of Imbalance and Incompleteness
The forced complete and balanced scenario is the only choice when either repeated measures ANOVA or MANOVA is used (a criticism of these methods).
Longitudinal methods such as LME modeling do not force data to be complete and balanced.
These methods allow the observations to be anchored to the chronology metric rather than to the order in which the observations were obtained.
17
Missing Data in LME Analysis
In the LME analysis, we will ignore missing data in the long format data structure.
That is, any row of the long format data frame that has an NA will be omitted.
When NAs occur only for the response variable, a subject is included in the LME analysis as long as they have at least one non-missing time point.
18
Missing Data in LME Analysis
When NAs occur for a static / time-invariant predictor, the entire record of the subject is deleted, meaning the subject is omitted from the analysis.
In R, the na.omit() function will omit any rows of a data frame that have at least one NA.
To illustrate na.omit(), we select the first three subjects of our long format MPLS data (MPLS.long).
19
Missing Data in LME Analysis
> MPLS.ls <- subset(MPLS.long, subid < 4, select = c(subid, read, grade, gen))
> MPLS.ls
    subid read grade gen
1.5     1  172     5   F
1.6     1  185     6   F
1.7     1  179     7   F
1.8     1  194     8   F
2.5     2  200     5   F
2.6     2  210     6   F
2.7     2  209     7   F
2.8     2   NA     8   F
3.5     3  191     5   M
3.6     3  199     6   M
3.7     3  203     7   M
3.8     3  215     8   M
20
Missing Data in LME Analysis
Suppose we apply the na.omit() function to the MPLS.ls data frame and save the result as omit1
> omit1 <- na.omit(MPLS.ls)
> omit1
    subid read grade gen
1.5     1  172     5   F
1.6     1  185     6   F
1.7     1  179     7   F
1.8     1  194     8   F
2.5     2  200     5   F
2.6     2  210     6   F
2.7     2  209     7   F
3.5     3  191     5   M
3.6     3  199     6   M
3.7     3  203     7   M
3.8     3  215     8   M
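Because MPLS.long itself is not reproduced in these slides, the example can be rerun by typing in the twelve printed rows. The sketch below does that in base R; the data frame is rebuilt from the printed output rather than taken from the original MPLS file, and the row names are recreated as subid.grade to match the printout.

```r
# Rebuild the printed subset by hand (values copied from the output above)
MPLS.ls <- data.frame(
  subid = rep(1:3, each = 4),
  read  = c(172, 185, 179, 194,  200, 210, 209, NA,  191, 199, 203, 215),
  grade = rep(5:8, times = 3),
  gen   = rep(c("F", "F", "M"), each = 4)
)
rownames(MPLS.ls) <- paste(MPLS.ls$subid, MPLS.ls$grade, sep = ".")

# Drop every row that contains at least one NA
omit1 <- na.omit(MPLS.ls)

# Which rows were dropped, and how many rows remain per subject
setdiff(rownames(MPLS.ls), rownames(omit1))
table(omit1$subid)
```

Only the grade 8 row of subject 2 is removed, so all three subjects remain in the data, with subject 2 contributing three time points instead of four.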
21
Missing Data in LME Analysis
As another example, suppose we induce missing values for a static predictor.
Let us assign NA to the predictor gender, labeled gen, for the first subject.
22
Missing Data in LME Analysis
> MPLS.ls1 <- MPLS.ls
> MPLS.ls1[1:4,4] <- NA
> MPLS.ls1
    subid read grade  gen
1.5     1  172     5 <NA>
1.6     1  185     6 <NA>
1.7     1  179     7 <NA>
1.8     1  194     8 <NA>
2.5     2  200     5    F
2.6     2  210     6    F
2.7     2  209     7    F
2.8     2   NA     8    F
3.5     3  191     5    M
3.6     3  199     6    M
3.7     3  203     7    M
3.8     3  215     8    M
23
Missing Data in LME Analysis
Now we apply na.omit() to the MPLS.ls1 data frame and save the result as omit2
> omit2 <- na.omit(MPLS.ls1)
> omit2
    subid read grade gen
2.5     2  200     5   F
2.6     2  210     6   F
2.7     2  209     7   F
3.5     3  191     5   M
3.6     3  199     6   M
3.7     3  203     7   M
3.8     3  215     8   M
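The practical consequence is easier to see at the subject level than at the row level. The sketch below, again rebuilt by hand from the printed values rather than from the original MPLS file, counts which subjects survive na.omit() when the static predictor is missing.

```r
# Rebuild the printed subset by hand, then blank out gen for subject 1
MPLS.ls1 <- data.frame(
  subid = rep(1:3, each = 4),
  read  = c(172, 185, 179, 194,  200, 210, 209, NA,  191, 199, 203, 215),
  grade = rep(5:8, times = 3),
  gen   = rep(c("F", "F", "M"), each = 4)
)
MPLS.ls1[1:4, "gen"] <- NA

omit2 <- na.omit(MPLS.ls1)

# Subjects still present after row-wise deletion
unique(omit2$subid)

# Number of rows lost per subject
table(MPLS.ls1$subid[!complete.cases(MPLS.ls1)])
```

Subject 1 loses all four rows because gen is repeated on every row of the long format, whereas subject 2 loses only the single row with a missing response.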
24
Retain or Omit Missing Data Rows?
In LME analysis there are two options we will consider for the long format data frame:
1. the NA values can be left in the data frame, or
2. na.omit() can be used to eliminate the rows containing NA
For the LME analysis, the results will be identical either way.
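The equivalence of the two options follows from how R model-fitting functions handle NA: with na.action = na.omit, incomplete rows are dropped before estimation. The sketch below checks this with lm(), which is used only as a base R stand-in; it is not an LME model, and an actual analysis would use a mixed-model function from an add-on package. The data are the twelve rows printed earlier, rebuilt by hand.

```r
# Rebuild the printed subset by hand
MPLS.ls <- data.frame(
  subid = rep(1:3, each = 4),
  read  = c(172, 185, 179, 194,  200, 210, 209, NA,  191, 199, 203, 215),
  grade = rep(5:8, times = 3),
  gen   = rep(c("F", "F", "M"), each = 4)
)

# Option 1: leave the NA in the data frame and let the fitting function drop it
fit_with_na <- lm(read ~ grade, data = MPLS.ls, na.action = na.omit)

# Option 2: remove incomplete rows first
fit_omitted <- lm(read ~ grade, data = na.omit(MPLS.ls))

# Same rows are used, so the estimates agree
all.equal(coef(fit_with_na), coef(fit_omitted))
nobs(fit_with_na)
nobs(fit_omitted)
```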
25
Retain or Omit Missing Data Rows?
1. If NA values occur only for the response variable, it is recommended that the missing value rows be omitted with na.omit() and the resulting data frame be used in all analyses.
2. If NA values occur for static predictors, it is unclear whether the missing data rows should be omitted. A subject might be retained or excluded depending on which static predictors are used.
26
Retain or Omit Missing Data Rows?
For the LME analysis, the results are most valid when there are no missing values for the static predictors.
Missing values are allowed on the response variable, with the validity of the analysis depending on certain assumptions about the missingness, such as Missing at Random (MAR) or Missing Completely at Random (MCAR).
27
Retain or Omit Missing Data Rows?
3. The data frame may contain more than one response variable, with a pattern of missingness that differs across them. If the response variables are to be analyzed separately, it might be tolerable to have the number of subjects vary based on the response variable analyzed.
28
Missing Data Mechanisms
A missing data mechanism can be thought of as a process that acts on the unobserved complete data set to produce the incomplete observed data set.
Consider the measurement of a response variable at two time points.
Suppose there is no missing data at the first time point but there is missing data at the second time point.
29
Missing Data Mechanisms
The missing data mechanisms are defined by whether the missing data at time 2 depend on:
1. The observed data at time 1
2. The observed data at time 2
3. The missing data at time 2
4. None of the above
30
Missing Data Mechanisms
MCAR: Characterized by number 4 in the list.
The process is completely random and unrelated to observed or missing data.
The incomplete observed sample is assumed to be a random sample of the unobserved complete data.
31
Missing Data Mechanisms
32
Missing Data Mechanisms
Example of MCAR Process:
Suppose that, in the MPLS study, we want to reduce the response burden by obtaining only four waves of data for each of two cohorts of individuals.
Subjects are randomly assigned to the cohorts; cohort 1 is measured over grades 5-8 and cohort 2 is measured over grades 6-9.

          Grade
Cohort    5     6     7     8     9
1         X     X     X     X     NA
2               X     X     X     X
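The planned-missingness design above can be sketched in a few lines of base R. This is an added illustration, not code from the slides; the number of subjects (10), the seed, and the object names are arbitrary assumptions. What matters is that the cohort is assigned by sample(), so which grade is unobserved cannot depend on any score.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

n_subjects <- 10   # hypothetical sample size
grades     <- 5:9

# Random assignment: half of the subjects to each cohort
cohort <- sample(rep(1:2, length.out = n_subjects))

# TRUE where a measurement is planned: cohort 1 at grades 5-8, cohort 2 at grades 6-9
planned <- outer(cohort, grades,
                 function(co, g) ifelse(co == 1, g <= 8, g >= 6))
dimnames(planned) <- list(paste0("s", seq_len(n_subjects)),
                          paste0("g", grades))

print(cbind(cohort, planned * 1))   # 1 = measured, 0 = missing by design

# Every subject contributes exactly four waves
rowSums(planned)
```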
33
Missing Data Mechanisms
MAR: Characterized by number 1 or number 2 in the list.
34
Missing Data Mechanisms
An example of a MAR process is when the researcher decides not to measure some subjects at time 2 after observing their scores at time 1.
This might occur if the researcher is studying methods of increasing reading scores and plans two repeated measurements.
Subjects who have perfect scores at time 1 cannot increase over time and thus are not invited back for the second measurement.
35
Missing Data Mechanisms
By knowing a subject's score at time 1, we can determine if they have a missing value at time 2.
Note: the prediction of missing values can be based on variables other than the response variable.
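The perfect-score example can be simulated to show what "determined by the observed time 1 score" means in practice. The sketch below is an added illustration with made-up numbers: the maximum score of 20, the sample size, and the score distributions are all hypothetical.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

n         <- 200   # hypothetical number of subjects
max_score <- 20    # hypothetical maximum (a "perfect" score)

# Complete data: scores at both time points, capped at the maximum
time1 <- pmin(max_score, round(rnorm(n, mean = 17, sd = 3)))
time2 <- pmin(max_score, time1 + round(rnorm(n, mean = 1, sd = 2)))

# MAR rule from the example: a perfect score at time 1 means no time 2 measurement
time2_obs <- ifelse(time1 == max_score, NA, time2)

# Missingness at time 2 is a function of the fully observed time 1 score
table(perfect_at_time1 = time1 == max_score,
      missing_at_time2 = is.na(time2_obs))
```

The cross-tabulation has no off-diagonal counts: knowing time1 is enough to know whether time2_obs is missing, and the unobserved time 2 value plays no role in the rule.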
36
Missing Data Mechanisms
NMAR: Characterized by number 3 in the list.
37
Missing Data Mechanisms
Example of NMAR Process:
Consider a computer-administered reading test. After the test, a subject is allowed to see their reading score and then decide to retain it or delete it. A retained score is ultimately observed by the researcher, whereas a deleted score is not. If a subject's score decreases from time 1 to time 2, they might be unhappy and delete their time 2 score.
38
Missing Data Mechanisms
In this case the missing data are dependent on the observed data at time 1 but also on the missing data at time 2, as the researcher never sees the deleted time 2 scores.
39
Missing Data Mechanisms
40
Missing Data Techniques
Listwise Deletion (LD): removes cases that contain missingness on any of the variables included in the study. LD can result in biased parameter estimates unless the data are MCAR (Little & Rubin, 2002).
Pairwise Deletion (PD): removes only cases that contain missingness on the variables used to obtain the target estimators. PD typically yields inconsistent statistics.
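In base R, the two deletion strategies correspond to the use argument of cor(): "complete.obs" for listwise deletion and "pairwise.complete.obs" for pairwise deletion. The sketch below is an added illustration on a small made-up data frame (the variables x, y, z and their values are hypothetical).

```r
# Small made-up data set with missing values on two of three variables
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, NA, 5, 4, 7, 8),
  z = c(1, 3, NA, 2, 6, 5)
)

# Listwise deletion: every correlation uses only the fully observed rows
ld   <- cor(d, use = "complete.obs")
n_ld <- sum(complete.cases(d))

# Pairwise deletion: each correlation uses the rows observed on that pair
pd      <- cor(d, use = "pairwise.complete.obs")
n_pairs <- crossprod(ifelse(is.na(as.matrix(d)), 0, 1))   # rows available per pair

print(ld)
print(pd)
n_ld      # 4 rows for every entry of ld
n_pairs   # differs by pair, e.g. 5 rows for (x, y) but 4 for (y, z)
```

The matrix of pairwise counts shows one source of the inconsistency mentioned above: different entries of the same correlation matrix are based on different subsets of cases.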
41
Missing Data Techniques
Last Observation Carried Forward (LOCF): missing values are replaced by the last observed value among the repeated measurements. LOCF is unbiased in point estimation under MCAR or MAR, but will perform poorly in SE estimation.
Multiple Imputation (MI): uses stochastic imputation models to impute (fill in) multiple “complete” datasets. The estimates of model parameters are then combined over datasets (Rubin, 1978).
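LOCF is simple enough to write directly in base R. The helper below (locf, a name introduced here for illustration) is a minimal sketch rather than a library function; it is applied within subject to the reading scores of the three MPLS subjects printed earlier, rebuilt by hand.

```r
# Carry the last observed value forward; leading NAs stay NA
locf <- function(x) {
  last_seen <- cummax(seq_along(x) * !is.na(x))   # index of the latest non-missing value
  last_seen[last_seen == 0] <- NA                 # nothing observed yet
  x[last_seen]
}

# Reading scores for the first three MPLS subjects, in long format (grades 5-8)
subid <- rep(1:3, each = 4)
read  <- c(172, 185, 179, 194,  200, 210, 209, NA,  191, 199, 203, 215)

# Apply LOCF separately within each subject so values never cross subjects
read_locf <- ave(read, subid, FUN = locf)

cbind(subid, read, read_locf)
```

Subject 2's missing grade 8 score is filled with the grade 7 value of 209. Every filled-in value is treated as if it had been observed, which is one reason standard errors from LOCF data tend to be too small.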
42
Missing Data Techniques
Full Information Maximum Likelihood (FIML): aims to find estimates of the parameters by maximizing the likelihood function of the parameters given the observed data. Unlike imputation methods, FIML does not create “complete” datasets.
Both MI and FIML yield unbiased parameter estimates under MCAR or MAR.