Data File Structure and Content Joe Larson 5 / 6 / 09.

1 Data File Structure and Content Joe Larson 5 / 6 / 09

2 Outline
- What’s in a Data Set?
  - File Setup
  - Key Variables
- Data Conventions
- Fun With Demographics

3 What’s in a Data Set?

4 File Setup
- Data on the web is broken up into the forms it was collected on.
- Different forms can have different collection time(s) and different participant subgroups.

5 Available Data is Broken up by Form
- All data on the web is arranged by form. Exceptions:
  - Outcomes file
  - Demographics file
- Variables within a data set are in the order of the questionnaire, with any computed variables at the end of the file

6 Available Data is Broken up by Form

7 Different Forms…Different Participants…Different Times
- Forms collected only once result in a file with one record per person
- Forms collected numerous times throughout follow-up result in a file with multiple records per person
- Some data are only available for specific groups of participants (e.g. DM Only, blood subsample)
- Specifics for an individual file can be found in its corresponding data dictionary
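
A quick way to see which of the two file shapes you have is to count records per ID. The following is a minimal Python sketch, not part of the original slides; the rows and ID values are made up, and only the `ID` and `F80VNUM` variable names come from the presentation.

```python
from collections import Counter


def records_per_person(rows, id_field="ID"):
    """Count how many records each participant ID has in one form file."""
    return Counter(row[id_field] for row in rows)


def is_one_record_per_person(rows, id_field="ID"):
    """Return True when no participant ID appears more than once."""
    return all(count == 1 for count in records_per_person(rows, id_field).values())


# Hypothetical rows: IDs and values are invented for illustration.
collected_once = [{"ID": 100001}, {"ID": 100002}]
collected_repeatedly = [
    {"ID": 100001, "F80VNUM": 1},
    {"ID": 100001, "F80VNUM": 3},
    {"ID": 100002, "F80VNUM": 1},
]

print(is_one_record_per_person(collected_once))        # True
print(is_one_record_per_person(collected_repeatedly))  # False
```

The data dictionary remains the authoritative source for a file's structure; a check like this only confirms what it says.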

8 Example from Form 80

9 Key Variables
- Some variables are found in every file (with the exceptions of the demographics and outcomes files):
  - ID
  - Days since randomization/enrollment
  - Visit type / Visit number
  - Form closest to visit
  - Expected for visit

10 Key Variables
- Let’s take a look at an actual Form 80 file

11 WHI Participant ID (ID)

12 Participant ID (ID)
- The ID variable is common to all of the web files.
- It is completely independent of the member ID that is used at the individual clinics.
- It is also independent of the Public and blood draw IDs.

13 Days Since Randomization / Enrollment (F80DAYS)

14 We do not give out actual dates for forms or events. Time is calculated between randomization (CT) or enrollment (OS) and the form date.

15 Visit Type (F80VTYP) & Visit Number (F80VNUM)

16 These variables combine to let you know when data was collected. For example, in the second line of the data on the previous slide we can see that the record is for “Annual Visit 3”. This matches up well with the 1189 days since randomization.
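
The consistency check described on this slide can be sketched in a few lines of Python. This is an illustration only: the 365.25 days-per-year constant and the function names are assumptions, while the 1189-day value comes from the slide.

```python
DAYS_PER_YEAR = 365.25  # assumed average year length, not a WHI constant


def years_since_start(days):
    """Convert days since randomization/enrollment to years."""
    return days / DAYS_PER_YEAR


def nearest_annual_visit(days):
    """Round elapsed time to the nearest whole follow-up year."""
    return round(years_since_start(days))


# 1189 days is a little over three years, which fits "Annual Visit 3".
print(nearest_annual_visit(1189))  # 3
```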

17 Closest to Visit Within Visit Type and Number (F80VCLO)

18 On rare occasions, multiple forms were filled out or entered for the same participant at the same follow-up visit. This variable identifies the form closest to the target date for that visit. For example, a year 1 annual visit with a value of “Yes” for VCLO will be the year 1 visit that is closest to 365 days from randomization/enrollment.
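
To illustrate the idea behind the flag, here is a small Python sketch that picks, from duplicate forms for one visit, the one closest to the target day. The flag is already provided in the files, so you would not normally compute it yourself; the target of 365 days per visit year follows the slide's example, and the rows are hypothetical.

```python
def closest_to_target(rows, visit_year, days_field="F80DAYS"):
    """Return the record whose day count is nearest to 365 * visit_year."""
    target = 365 * visit_year
    return min(rows, key=lambda row: abs(row[days_field] - target))


# Two hypothetical year 1 forms for the same participant.
year1_forms = [
    {"ID": 100001, "F80DAYS": 340},
    {"ID": 100001, "F80DAYS": 410},
]

# 340 is 25 days from the target of 365; 410 is 45 days away.
print(closest_to_target(year1_forms, visit_year=1)["F80DAYS"])  # 340
```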

19 Expected for Visit (F80EXPC)

20 Sometimes forms are filled out by participants who should not be filling them out. The expected for visit flag identifies data that were expected by protocol.
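
In practice the closest-to-visit and expected-for-visit flags are often used together to trim a file before analysis. The Python sketch below assumes both flags are coded 1 for “Yes” and 0 for “No”; that coding is an assumption, so check the data dictionary for the actual values. The rows are hypothetical.

```python
def analysis_records(rows, closest_field="F80VCLO", expected_field="F80EXPC", yes=1):
    """Keep records flagged as closest to the visit AND expected by protocol."""
    return [
        row for row in rows
        if row[closest_field] == yes and row[expected_field] == yes
    ]


# Hypothetical Form 80 rows; flag coding (1 = Yes, 0 = No) is assumed.
form80 = [
    {"ID": 100001, "F80VNUM": 1, "F80VCLO": 1, "F80EXPC": 1},
    {"ID": 100001, "F80VNUM": 1, "F80VCLO": 0, "F80EXPC": 1},  # duplicate form
    {"ID": 100002, "F80VNUM": 2, "F80VCLO": 1, "F80EXPC": 0},  # not expected
]

kept = analysis_records(form80)
print(len(kept))  # 1
```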

21 File Setup / Key Variables
- Files are arranged by form on the web at www.whiops.org
- File structure and participant group vary by form and are documented in the data dictionary
- ID, Visit Type, and other important variables can be found at the start of each file

22 Any Questions?

23 Data Conventions
- Skip patterns
- Mark all that apply
- Version differences

24 Skip Patterns
- Questions within a form are often set up with a hierarchical structure, with parent questions and sub-questions
- In most cases, the sub-questions are set to missing if the parent value indicates the sub-questions should not be answered. This is the application of a skip pattern
- In a few cases where the error percentage is high, the skip pattern is not applied
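
A skip pattern can be sketched as follows in Python. This is a hedged illustration, not the cleaning code actually used: the parent value 0 meaning “No” and the use of `None` for missing are assumptions, and the Pet/Dog/Cat/Bird/Fish/Other names are borrowed from the pet example in the slides.

```python
def apply_skip_pattern(record, parent, children, skip_value=0):
    """Set sub-questions to missing when the parent says they should be skipped."""
    cleaned = dict(record)  # copy so the raw record is left untouched
    if cleaned.get(parent) == skip_value:
        for child in children:
            cleaned[child] = None
    return cleaned


SUB_QUESTIONS = ["Dog", "Cat", "Bird", "Fish", "Other"]

# Hypothetical inconsistent record: no pet, yet "Bird" was marked.
raw = {"Pet": 0, "Dog": 0, "Cat": 0, "Bird": 1, "Fish": 0, "Other": 0}
cleaned = apply_skip_pattern(raw, parent="Pet", children=SUB_QUESTIONS)
print(cleaned["Bird"])  # None
```

Records whose parent value allows the sub-questions pass through unchanged.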

25 Example: Skip Pattern Applied
[Table: columns Pet, Dog, Cat, Bird, Fish, Other, shown before and after skip pattern QA is applied to the sub-questions]
Error Percentage < 1%

26 Example: Skip Pattern Not Applied Error Percentage ~ 6-12%

27 If the Skip Pattern is not Applied
- This will be noted in the data dictionary

28 Mark All That Apply
What kind of pet do you have? (mark all that apply)
Choice  Response  Indicator
1       Dog(s)    0
2       Cat(s)    1
3       Bird(s)   1
4       Fish      0
5       Other     1
One question with multiple choices is converted to separate indicator variables of 0’s and 1’s

29 Mark all conversion
Order  Question           Question Number  Value  Value Description
17     Do you have a pet  11               1      Yes
18     Dog                11.1
19     Cat                11.1             2      Marked
20     Bird               11.1             3      Marked
21     Fish               11.1
22     Other              11.1             5      Marked
Converted indicators:
O17  O18  O19  O20  O21  O22
1    0    1    1    0    1
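
The mark-all-that-apply conversion can be sketched in Python as below. The choice codes 1 to 5 and the pet labels follow the example in the slides; the function name and the input representation (a set of marked codes) are assumptions made for illustration.

```python
# Choice codes and labels from the pet example.
CHOICES = {1: "Dog", 2: "Cat", 3: "Bird", 4: "Fish", 5: "Other"}


def to_indicators(marked_codes):
    """Turn the set of marked choice codes into one 0/1 indicator per choice."""
    return {label: int(code in marked_codes) for code, label in CHOICES.items()}


# A participant who marked Cat (2), Bird (3), and Other (5).
indicators = to_indicators({2, 3, 5})
print(indicators)
# {'Dog': 0, 'Cat': 1, 'Bird': 1, 'Fish': 0, 'Other': 1}
```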

30 Version Issues
- Sometimes questions are not asked on all versions of a form, leading to higher percentages of missing data
- The data dictionary documents these version differences

31 Data Conventions
- Some cleaning was done to the data before it reached the web
- Skip patterns and mark-all-that-apply conversions were usually done
- Sometimes questions were not collected on all versions of a form
- In all cases, any issues are documented in the data dictionary

32 Any Questions?

33 Fun With Demographics

34 The Demographics File
- The demographics file is the glue that pulls most analyses together
- It contains important variables that are used in just about every analysis
- The file has one record per person
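
Because the demographics file has one record per person, it can be joined onto any form file by ID. The presentation's own walkthrough uses SAS; the following is only a plain-Python sketch of the same many-to-one merge followed by a basic frequency. The `REGION` field name, its values, and all rows are hypothetical.

```python
from collections import Counter


def merge_on_id(form_rows, demo_rows, id_field="ID"):
    """Attach each participant's demographics record to their form records."""
    demo_by_id = {demo[id_field]: demo for demo in demo_rows}
    merged = []
    for row in form_rows:
        demo = demo_by_id.get(row[id_field])
        if demo is not None:  # keep only participants found in both files
            merged.append({**demo, **row})
    return merged


# Hypothetical data: REGION is an invented stand-in for a demographics variable.
demographics = [
    {"ID": 100001, "REGION": "West"},
    {"ID": 100002, "REGION": "South"},
]
form80 = [
    {"ID": 100001, "F80VNUM": 1},
    {"ID": 100001, "F80VNUM": 3},
    {"ID": 100002, "F80VNUM": 1},
]

merged = merge_on_id(form80, demographics)
print(Counter(row["REGION"] for row in merged))
# Counter({'West': 2, 'South': 1})
```

Note that the frequency counts form records, not people, whenever the form file has multiple records per person.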

35 Trial Participation Flags

36 Trial flags distinguish what part of the WHI a participant is in. In addition to CT and OS indicators, there are indicator variables for each clinical trial component.

37 Basic Demographic Data

38 Basic demographic data, including age, ethnicity, education, and income, can be found here. Because clinical center data has not been released, the “U.S. Region” variable is the best variable to use for geographic location.

39 Trial Arms

40 These are the key variables for any analysis on the clinical trial. The hormone arm variable can also be used to separate out participants in the two hormone trials.

41 Days from CT to CaD Randomization

42 Key variable used to determine how far a follow-up visit is from CaD randomization. To determine days from CaD randomization:
- Start with the days from CT randomization
- Subtract the days from CT to CaD randomization
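
The two-step calculation above amounts to a single subtraction. Here is a minimal Python sketch; the function and parameter names are hypothetical, the example day counts are invented, and treating participants without a CaD randomization as `None` is an assumption.

```python
def days_from_cad_randomization(days_from_ct, days_ct_to_cad):
    """Days from CaD randomization = days from CT randomization
    minus days from CT to CaD randomization."""
    if days_ct_to_cad is None:  # assumed marker for "not in the CaD trial"
        return None
    return days_from_ct - days_ct_to_cad


# Hypothetical: form collected 1189 days after CT randomization,
# CaD randomization 370 days after CT randomization.
print(days_from_cad_randomization(1189, 370))  # 819
```

A negative result means the form was collected before CaD randomization.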

43 BMD Subsample Indicator

44 A ‘yes’ response indicates that the participant was at one of the three BMD clinics

45 Fun With Demographics
- The demographics file is a key file used in most analyses
- It includes trial participation and treatment status variables, as well as basic demographic data

46 Questions?

47 Stay Tuned
- Later I’ll be doing a beginning-to-end example:
  - Going to the web
  - Hunting down variables
  - Downloading the data
  - Loading it into SAS
  - Merging files together
  - Running some basic frequencies
- And taking questions while I do it!

48 Thanks and Good Night

