Stretching Your Data Management Skills Chuck Humphrey University of Alberta Atlantic DLI Workshop 2003
Outline Two Topics Aggregation: a review of the CTS Finding the ‘smoking gun’: a review of variables and CTUMS
Aggregation In the 2002 Atlantic DLI workshop, we spent some time examining the importance of the unit of analysis in defining statistical data structure.
Unit of Analysis The unit of analysis is the object(s) about which data have been collected and about which generalizations are being sought.
Each member of the unit of analysis is a separate row in the data structure. Statistical Data Structure
[Figure: Case 1, Case 2, Case 3, ..., Case n, one row per case] Statistical Data Structure
Each piece of information collected for a member of the unit of analysis is stored in a fixed location in the file. These fixed locations are called variables. Statistical Data Structure
[Figure: grid with Case 1 through Case n as rows and Variable 1 through Variable k as columns] Statistical Data Structure
Canadian Travel Survey Our exercise last year used two of four files from the Canadian Travel Survey: the person file and trip file.
Canadian Travel Survey Our assignment last year was to link information from the trip file about the respondents’ modes of transportation with information about the traveller in the person file.
[Figure: microdata files: Person (non-travellers and travellers) and Trip. Key: linkable via UNIQID]
Canadian Travel Survey The data management problem was finding a way to share the information from one person who took many trips with a single record in the person file for this individual.
Canadian Travel Survey For the person who took only one trip, the match between the trip and person files is one-to-one. Person File: Uniqid = 1, Tottrip = 1. Trip File: Uniqid = 1, Tripnum = 1.
Canadian Travel Survey For the person who took two or more trips, the match between the trip and person files is many-to-one. Person File: Uniqid = 19, Tottrip = 2. Trip File: Uniqid = 19, Tripnum = 1; Uniqid = 19, Tripnum = 2.
Canadian Travel Survey Our strategy was to summarize the mode information in the trip file for each traveller to create a one-to-one match between the summarized trip file and the person file.
Canadian Travel Survey That is, we needed to summarize the two trips for respondent 19 into one record for this person while capturing the mode of travel information. Person File: Uniqid = 19, Tottrip = 2. Trip File: Uniqid = 19, Tripnum = 1; Uniqid = 19, Tripnum = 2.
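The difference between the two kinds of match can be checked by counting trip records per UNIQID. The following is a minimal Python sketch, not the SPSS syntax used in the workshop; the lowercase field names and the `match_type` helper are illustrative assumptions, and the rows mirror the two slide examples (UNIQID 1 and 19).

```python
from collections import Counter

# Rows mirroring the slide examples; field names are assumed for illustration.
person_file = [
    {"uniqid": 1, "tottrip": 1},
    {"uniqid": 19, "tottrip": 2},
]
trip_file = [
    {"uniqid": 1, "tripnum": 1},
    {"uniqid": 19, "tripnum": 1},
    {"uniqid": 19, "tripnum": 2},
]

# Count how many trip records share each UNIQID.
trips_per_person = Counter(trip["uniqid"] for trip in trip_file)


def match_type(uniqid):
    """Describe how a person's trip records match the single person record."""
    n = trips_per_person[uniqid]  # a Counter returns 0 for a missing key
    if n == 0:
        return "no trip records"
    return "one-to-one" if n == 1 else "many-to-one"


matches = {p["uniqid"]: match_type(p["uniqid"]) for p in person_file}
print(matches)
```

Any UNIQID reported as many-to-one is a case where the trip records have to be summarized before merging.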
Canadian Travel Survey This summary strategy relied on aggregating over trips to produce one record per respondent in the trip file. More about this in a minute.
Canadian Travel Survey The first step was to read the raw data from the person file into SPSS, to sort the cases by UNIQID, and to write this to a .sav file. Person File: UNIQID
Canadian Travel Survey The next step was to read the raw data from the trip file, to sort the cases by UNIQID & TRIPNUM, and to save this file. Trip File UNIQID TRIPNUM
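Sorting on two keys means ordering by UNIQID first and by TRIPNUM within each UNIQID. A small Python sketch of that ordering, using made-up rows and assumed field names rather than the workshop's SPSS commands:

```python
# Unsorted trip records (illustrative values only).
trip_file = [
    {"uniqid": 19, "tripnum": 2},
    {"uniqid": 1, "tripnum": 1},
    {"uniqid": 19, "tripnum": 1},
]

# Sort by UNIQID, then by TRIPNUM within each UNIQID.
sorted_trips = sorted(trip_file, key=lambda t: (t["uniqid"], t["tripnum"]))

for trip in sorted_trips:
    print(trip["uniqid"], trip["tripnum"])
```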
Canadian Travel Survey How do we summarize the modes of transportation in the trip file?
Canadian Travel Survey The strategy was to convert the six categories of mode into six variables where each category of travel mode was represented by one of these new variables.
Canadian Travel Survey Mode (1. Car, 2. Air, 3. Bus, 4. Rail, 5. Boat, 6. Other) becomes six variables: Car, Air, Bus, Rail, Boat, Other.
Canadian Travel Survey For each trip, the value of the mode variable was used to assign a value of one to the variable representing this mode of travel.
Canadian Travel Survey [Table columns: Uniqid, Tripnum, Mode, Car, Air, Bus, Rail, Boat, Other]
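The recoding of mode into six 0/1 variables can be sketched in Python as follows. This is an illustration under assumptions, not the workshop's SPSS syntax: the `add_mode_indicators` helper and lowercase field names are hypothetical, and the three trips for UNIQID 1157 are invented so that they agree with the totals shown later for that case (two car trips, one air trip).

```python
# Mode codes as listed on the slide.
MODES = {1: "car", 2: "air", 3: "bus", 4: "rail", 5: "boat", 6: "other"}


def add_mode_indicators(trip):
    """Return a copy of a trip record with six 0/1 mode variables added."""
    coded = dict(trip)
    for code, name in MODES.items():
        coded[name] = 1 if trip["mode"] == code else 0
    return coded


# Hypothetical trip records for one traveller.
trips = [
    {"uniqid": 1157, "tripnum": 1, "mode": 1},
    {"uniqid": 1157, "tripnum": 2, "mode": 2},
    {"uniqid": 1157, "tripnum": 3, "mode": 1},
]
coded_trips = [add_mode_indicators(t) for t in trips]

for t in coded_trips:
    print(t["tripnum"], [t[name] for name in MODES.values()])
```

Exactly one of the six new variables equals one on each trip, which is what makes summing them within a person meaningful.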
Canadian Travel Survey After creating these six new variables for each trip in the trip file, the next step was to sum, within each UNIQID, the number of trips taken using each of the six modes of transportation.
Canadian Travel Survey [Table columns before aggregation: Uniqid, Tripnum, Mode, Car, Air, Bus, Rail, Boat, Other. Table columns after aggregation: Uniqid, Car, Air, Bus, Rail, Boat, Other]
Canadian Travel Survey The aggregate command produced a new file with one record for each UNIQID. This data structure then matched the one record for each UNIQID in the person file.
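Summing the six mode variables within each UNIQID can be sketched in Python as below. The workshop did this with the SPSS aggregate command; this sketch only imitates the result. The `trip` helper, the field names, and the sample trips are assumptions made for illustration.

```python
from collections import defaultdict

MODE_VARS = ["car", "air", "bus", "rail", "boat", "other"]


def trip(uniqid, tripnum, mode_name):
    """Build a trip record that already carries the six 0/1 mode variables."""
    record = {"uniqid": uniqid, "tripnum": tripnum}
    record.update({name: int(name == mode_name) for name in MODE_VARS})
    return record


# Hypothetical trip file: one trip for UNIQID 1, three for UNIQID 1157.
coded_trips = [
    trip(1, 1, "bus"),
    trip(1157, 1, "car"),
    trip(1157, 2, "air"),
    trip(1157, 3, "car"),
]

# Sum each mode variable within UNIQID (the break variable).
totals = defaultdict(lambda: dict.fromkeys(MODE_VARS, 0))
for t in coded_trips:
    for name in MODE_VARS:
        totals[t["uniqid"]][name] += t[name]

# One record per UNIQID, ordered by UNIQID.
aggregated = [{"uniqid": uid, **counts} for uid, counts in sorted(totals.items())]

for row in aggregated:
    print(row)
```

Four trip records go in and two records come out, one per traveller.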
Canadian Travel Survey This new aggregate trip file was then merged with the person file to pass the mode of transportation information to the person file.
Canadian Travel Survey Now case 1157 has only one record in the new trip file to match with its record in the person file. Person File: Uniqid = 1157, Tottrip = 3. New Trip File: Uniqid = 1157, Car = 2, Air = 1, Bus = 0, Rail = 0, Boat = 0, Other = 0.
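The one-to-one merge can be sketched in Python as a lookup on UNIQID. The record for case 1157 comes from the slide; the non-traveller with UNIQID 1156 is a hypothetical addition to show what happens when the person file has no matching trip record. Leaving those values as None is an assumption of this sketch, and how to treat them is an analysis decision.

```python
MODE_VARS = ["car", "air", "bus", "rail", "boat", "other"]

person_file = [
    {"uniqid": 1156, "tottrip": 0},  # hypothetical non-traveller
    {"uniqid": 1157, "tottrip": 3},  # from the slide
]
new_trip_file = [
    {"uniqid": 1157, "car": 2, "air": 1, "bus": 0, "rail": 0, "boat": 0, "other": 0},
]

# Index the aggregated trip file by UNIQID: one record per key.
trip_lookup = {rec["uniqid"]: rec for rec in new_trip_file}

merged = []
for person in person_file:
    counts = trip_lookup.get(person["uniqid"], {})
    row = dict(person)
    for name in MODE_VARS:
        row[name] = counts.get(name)  # None when no trip record matches
    merged.append(row)

for row in merged:
    print(row)
```

For case 1157 the six mode counts add up to Tottrip, which is a quick check that the aggregation and merge agree.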
Aggregate The Aggregate procedure sorts all of the cases by a grouping variable (called the break variable) and then creates a new data file containing a case for each unique value in this grouping variable.
Aggregate The variables in this new file are created by assigning summary functions to the variables in the original file.
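The general idea of a break variable plus summary functions can be sketched in Python. This is an imitation of the concept, not the SPSS procedure itself: the `aggregate` function, its `summaries` argument, and the `region` and `age` variables are all hypothetical names chosen for illustration.

```python
from statistics import mean


def aggregate(cases, break_var, summaries):
    """Group cases by break_var and apply summary functions.

    summaries maps a new variable name to a (source variable, function) pair,
    where the function takes the list of source values within one group.
    """
    groups = {}
    for case in sorted(cases, key=lambda c: c[break_var]):
        groups.setdefault(case[break_var], []).append(case)

    result = []
    for value, members in groups.items():
        row = {break_var: value}
        for new_name, (source, func) in summaries.items():
            row[new_name] = func([m[source] for m in members])
        result.append(row)
    return result


# Illustrative person-level cases.
cases = [
    {"region": "A", "age": 30},
    {"region": "B", "age": 50},
    {"region": "A", "age": 40},
]
by_region = aggregate(
    cases,
    "region",
    {"n_cases": ("age", len), "mean_age": ("age", mean)},
)
print(by_region)
```

The output has one case per unique value of the break variable, and each new variable is a summary of an original variable within that group.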
Why all this emphasis on Aggregate? We will be using the aggregate command in SPSS with the Canadian Community Health Survey tomorrow to summarize information at the person level to the level of health regions in Atlantic Canada.
Outline Two Topics Aggregation: a review of the CTS Finding the ‘smoking gun’: a review of variables and CTUMS
The Smoking Gun For the remainder of this session, we will explore a range of topics related to variables using content from the Canadian Tobacco Use Monitoring Survey (CTUMS).
Variables One might say that variables represent the ‘smoking gun’ of research data. Somewhere in a variable is the answer to a who-done-it mystery of a research project.
Variables Variables are the content vessels in research. They carry the information associated with the unit of analysis discussed earlier. As carriers of content, variables act as organizational instruments in research.
Instruments of Organization Variables help organize the content of research in two contexts: data and analysis.
Data and Analysis The use of variables differs somewhat in each of these contexts. As a result, variables serve different purposes and can be grouped into different classes.
Data and Analysis [Figure: Research, divided into Data and Analysis]
Data and Analysis The vocabularies of data and analysis use different labels for the various functions that variables perform. Let’s look at each category separately to understand these differences better.
Variables and Data In the building of data files, variables can be classified into three general categories: administrative, observed, and derived.
Administrative Variables Administrative variables are those that data producers include to describe characteristics of the administration of the survey, the survey design, and the record management used with the original questionnaires.
Administrative Variables The types of variables that are created as a record of administering the survey include the date and time when the interview was conducted, the identification of the interviewer, the number of call-backs before the interview was completed, etc.
Administrative Variables The types of variables that are created to reflect the survey design will include information about the strata in a stratified sample design, geographic identification in a cluster design, and weight variables for estimating populations.
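The role of a weight variable in estimating populations can be sketched in Python. Everything here is invented for illustration: the `smoker` and `weight` names and the four records are assumptions, not CTUMS variables or values, and the sketch ignores variance estimation and any other design information.

```python
# Hypothetical respondents: each weight is the number of people in the
# population that the respondent is taken to represent.
respondents = [
    {"smoker": 1, "weight": 1200.0},
    {"smoker": 0, "weight": 800.0},
    {"smoker": 0, "weight": 1000.0},
    {"smoker": 1, "weight": 1000.0},
]

# Estimated population size is the sum of the weights.
estimated_population = sum(r["weight"] for r in respondents)

# Estimated number of smokers is the sum of weights over smokers only.
estimated_smokers = sum(r["weight"] for r in respondents if r["smoker"] == 1)

weighted_share = estimated_smokers / estimated_population
unweighted_share = sum(r["smoker"] for r in respondents) / len(respondents)

print(estimated_population, estimated_smokers, weighted_share, unweighted_share)
```

The weighted and unweighted shares differ whenever respondents carry unequal weights, which is why the weight variable is needed for population estimates.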
Administrative Variables The types of variables that are created as part of the record management system include unique identification numbers for each respondent, project numbers for cycles, membership in panels, linkage identification with other files, etc.
Observed Variables Observed variables are those that are created from the answers given by respondents to the items in a survey’s questionnaire.
Derived Variables Derived variables are those that are created by the data producer from variables that were observed or from contextual information that was added.
Variables and Analysis For analysis purposes, variables tend to be grouped according to analytic technique. There are two general categories of analytic techniques: categorical and analytic.
Categorical Variables Categorical statistical techniques use variables employing a nominal level of measurement, that is, numbers are assigned to represent categories. These techniques focus on tables and methods that model frequencies (e.g., log-linear analysis).
Analytic Variables Analytic statistical techniques use variables employing an ordinal, interval or ratio level of measurement. These techniques focus on the means and standard deviations of variables or correlations and covariances among groups of variables.
Modeling Language Categorical and analytic variables can both be used with statistical modeling techniques. Modeling introduces new names for variables.
Modeling Language Dependent variables: these are variables that are seen to be caused or predicted by other variables in a model. They are said to depend on the values of other variables.
Modeling Language Independent variables: these are variables that are seen to be the causal agents in a model. They are the variables that determine the response in the dependent variable.
Modeling Language Dummy variables: these are variables that are used in analytic statistical techniques to represent categorical information. Each dummy variable represents one of the values from a categorical variable. The coding of modes of transportation employed a dummy variable coding scheme.
Modeling Language Latent variables: these are variables in a causal model that are not directly observed or measured. Instead, variables serving as indicators of the latent concept are included in the model.
Modeling Language Manifest variables: these are variables in a causal model that have been directly measured.
Combining Data & Analysis Comparing the data and analysis classifications of variables, some combinations are natural. Observed and derived variables are often used as categorical or analytic variables.
Combining Data & Analysis Administrative variables are not often used in analytic techniques but can be used to identify groups of cases to study subpopulations or to group cases for comparative techniques.