Chapter 16: Data Preparation and Description

This chapter presents the steps necessary to prepare data for data analysis.
Learning Objectives

Understand . . .
- the importance of editing the collected raw data to detect errors and omissions
- how coding is used to assign numbers and other symbols to answers and to categorize responses
- the use of content analysis to interpret and summarize open questions
- problems and solutions for “don’t know” responses and handling missing data
- options for data entry and manipulation
Exhibit 16-1 Data Preparation in the Research Process

Once the data begin to flow, a researcher’s attention turns to data analysis. This chapter focuses on the first phases of that process: data preparation and description. Data preparation includes editing, coding, and data entry, and is the activity that ensures the accuracy of the data and their conversion from raw form into the reduced and classified forms that are more appropriate for analysis. Preparing a descriptive statistical summary is another preliminary step that allows data entry errors to be identified and corrected. Exhibit 16-1 reflects the steps in this phase.
Editing

Edited data should be:
- Accurate
- Consistent
- Uniformly entered
- Complete
- Arranged for simplification

The customary first step in analysis is to edit the raw data. Editing detects errors and omissions, corrects them when possible, and certifies that maximum data quality standards are achieved. The purpose is to guarantee that data are accurate, consistent with the intent of the question and other information in the survey, uniformly entered, complete, and arranged to simplify coding and tabulation.
Field Editing
- Field editing review
- Entry gaps identified
- Callbacks made
- Validate results

In large projects, field editing review is a responsibility of the field supervisor. It should be done soon after the data have been collected. During the stress of data collection, data collectors often use ad hoc abbreviations and special symbols. If the forms are not reviewed soon after collection, the field interviewer may not recall what the respondent said. Therefore, reporting forms should be reviewed regularly. When entry gaps are present, a callback should be made rather than guessing what the respondent probably said. The field supervisor also validates field results by reinterviewing some percentage of the respondents on some questions to verify that they have participated; ten percent is the typical amount used in data validation. In this ad, Western Wats, a data collection specialist, reminds us that speed without accuracy won’t help a marketing decision maker choose the right direction.
Central Editing
- Be familiar with instructions given to interviewers and coders
- Do not destroy the original entry
- Make all editing entries identifiable and in standardized form
- Initial all answers changed or supplied
- Place initials and date of editing on each instrument completed

At this point, the data should get a thorough editing. For a small study, a single editor will produce maximum consistency. For large studies, editing tasks should be allocated by sections. Sometimes it is obvious that an entry is incorrect, and the editor may be able to determine the proper answer by reviewing other information in the data set. This should be done only when the correct answer is obvious. If an answer given is inappropriate, the editor can replace it with a “no answer” or “unknown” code. The editor can also detect instances of armchair interviewing (faked interviews) during this phase; this is easiest to spot with open-ended questions.
Exhibit 16-2 Sample Codebook

Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of categories. In coding, categories are the partitions of a data set for a given variable; for instance, if the variable is gender, the categories are male and female. Categorization is the process of using rules to partition a body of data. Both closed and open questions must be coded. Numeric coding simplifies the researcher’s task in converting a nominal variable like gender to a “dummy variable.” A codebook contains each variable in the study and specifies the application of coding rules to the variable. It is used by the researcher or research staff to promote more accurate and more efficient data entry, and it is the definitive source for locating the positions of variables in the data file during analysis.
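As an illustration (not from the text), here is a minimal Python sketch using pandas showing how a codebook rule might convert a nominal variable such as gender into numeric codes and a 0/1 dummy variable; the column names and code values are invented for this example:

import pandas as pd

# Hypothetical raw responses; names and codes are illustrative only.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Codebook rule for this variable: male = 1, female = 2.
codebook = {"gender": {"male": 1, "female": 2}}
df["gender_code"] = df["gender"].map(codebook["gender"])

# A "dummy variable" recodes the category as 0/1 for analysis.
df["female_dummy"] = (df["gender"] == "female").astype(int)
print(df)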
Exhibit 16-3 Precoding

Precoding means assigning codebook codes to variables in a study and recording them on the questionnaire. It is helpful for manual data entry because it makes completing a separate data entry coding sheet unnecessary. With a precoded instrument, the codes for variable categories are accessible directly from the questionnaire.
Exhibit 16-3 Coding Open-Ended Questions

One of the primary reasons for using open-ended questions is that insufficient information or lack of a hypothesis may prohibit preparing response categories in advance. Researchers are forced to categorize responses after the data are collected. In Exhibit 16-3, question 6 illustrates the use of an open-ended question. After preliminary evaluation, response categories were created for that item; they can be seen in the codebook.
Coding Rules

Categories should be:
- Appropriate to the research problem
- Exhaustive
- Mutually exclusive
- Derived from one classification principle

Appropriateness is determined at two levels: (1) the best partitioning of the data for testing hypotheses and showing relationships, and (2) the availability of comparison data. Researchers often add an “other” option to a measurement question because they know they cannot anticipate all possible answers. The need for a category set to follow a single classification principle means that every option in the category set is defined in terms of one concept or construct.
QSR’s XSight Software for Content Analysis

Content analysis measures the semantic content, or the “what” aspect, of a message. It is used for open-ended questions. QSR’s XSight software allows the researcher to develop different categories for analysis without losing the verbatims that may be crucial to an advertising, PR, packaging, or product development effort. QSR, the company that provided N6 (the latest version of NUD*IST) and NVivo, introduced a commercial version of its content analysis software, XSight, in 2004. XSight was developed for, and with the input of, researchers. www.qsrinternational.com
Types of Content Analysis
- Syntactical
- Referential
- Propositional
- Thematic

Content analysis follows a systematic process for coding and drawing inferences from texts. It starts by determining which units of data will be analyzed. In written or verbal texts, data units are of four types, and each unit type is the basis for coding texts into mutually exclusive categories. Syntactical units can be words, phrases, sentences, or paragraphs. Referential units are described by words, phrases, and sentences and may be objects, events, persons, etc. Propositional units are assertions about an object, event, or person. Thematic units are topics contained within and across texts. Georgia-Pacific launched the “Do you know a Brawny Man?” essay contest and used content analysis to define the traits of the icon. As a result, the company replaced the old “Brawny Man” with a dark-haired, clean-shaven, sensitive male.
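To make the coding step concrete, here is a minimal Python sketch (not from the text) that tallies referential categories in open-ended answers by keyword matching on syntactical units (words); the answers, category names, and keywords are invented for illustration:

import re
from collections import Counter

# Illustrative open-ended answers; categories and keywords are invented.
answers = [
    "The sales manager never followed up on my order.",
    "Training for the salesperson seemed inadequate.",
]
categories = {"management": ["manager", "management"],
              "salesperson": ["salesperson", "training"]}

counts = Counter()
for text in answers:
    words = re.findall(r"[a-z]+", text.lower())   # syntactical units: words
    for cat, keywords in categories.items():
        if any(w in keywords for w in words):
            counts[cat] += 1                      # referential unit: who is mentioned
print(counts)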
Exhibit 16-4 & 16-5 Open-Question Coding Locus of Responsibility Mentioned Not Mentioned A. Company ________________________ B. Customer C. Joint Company-Customer F. Other Locus of Responsibility Frequency (n = 100) A. Management 1. Sales manager 2. Sales process 3. Other 4. No action area identified B. Management 1. Training C. Customer 1. Buying processes 2. Other 3. No action area identified D. Environmental conditions E. Technology F. Other 10 20 7 3 15 12 8 5
Exhibit 16-7 Handling “Don’t Know” Responses

Question: Do you have a productive relationship with your present salesperson?

Years of Purchasing    Yes        No         Don’t Know
Less than 1 year       10%        40%        38%
1–3 years              30         32
4 years or more        60         28
Total                  100%       100%       100%
                       n = 650    n = 150    n = 200

When the number of “don’t know” (DK) responses is low, it is not a problem. However, if many DK answers are given, it may mean that the question was poorly designed, too sensitive, or too challenging for the respondent. The best way to deal with undesired DK answers is to design better questions at the beginning. If a DK response is legitimate, it should be kept as a separate reply category.
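A table like Exhibit 16-7 can be built from raw responses; the following Python sketch (with invented column names and data) uses pandas’ crosstab with column percentages, so that each response column sums to 100 percent:

import pandas as pd

# Hypothetical survey records: tenure category and response per participant.
df = pd.DataFrame({
    "years":    ["<1", "1-3", "4+", "<1", "4+", "1-3"],
    "response": ["Yes", "No", "Yes", "Don't know", "Yes", "Don't know"],
})

# Column percentages, as in Exhibit 16-7: each response column sums to 100.
table = pd.crosstab(df["years"], df["response"], normalize="columns") * 100
print(table.round(1))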
Data Entry
- Keyboarding
- Database programs
- Digital/barcodes
- Optical recognition
- Voice recognition

Data entry converts information gathered by secondary or primary methods to a medium for viewing and manipulation. Keyboarding remains the primary method. However, new methods are making data entry more efficient.
Missing Data
- Listwise deletion
- Pairwise deletion
- Replacement

Missing data are information from a participant or case that is not available for one or more variables of interest. Missing data typically occur in surveys when respondents accidentally skip, refuse to answer, or do not know the answer to an item on the questionnaire; in these situations, missing data are also referred to as item nonresponse. Missing data can also be caused by researcher error and corrupted data files. In the spreadsheet screenshot in the slide, missing data are noted by a 9 in the cell. There are three basic types of techniques for dealing with missing data: (1) listwise deletion, (2) pairwise deletion, and (3) replacement of missing values with estimated scores. Listwise deletion, or complete case analysis, is the simplest approach: cases are deleted from the sample if they have missing values on any of the variables in the analysis. If values are missing completely at random, no bias is introduced, because the subsample of complete cases is essentially a random sample of the original sample. However, if the missing data are due to the influence of another variable, listwise deletion will introduce bias. Pairwise deletion, also called available case analysis, likewise assumes that data are missing completely at random; each statistic is estimated using all cases that have data for each variable or pair of variables in the analysis. The replacement method includes a variety of techniques; one of the simplest is to replace the missing value with a central tendency score of the sample.
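The three techniques can be illustrated with a short Python sketch using pandas (data values invented); note that pandas’ corr() computes each correlation from the available pairs of cases, which is the pairwise approach:

import numpy as np
import pandas as pd

# Toy data with missing values (NaN); values are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [10.0, np.nan, 30.0, 40.0]})

# 1) Listwise deletion: drop any case with a missing value on any variable.
complete_cases = df.dropna()

# 2) Pairwise deletion: corr() uses all available pairs per variable pair.
pairwise_corr = df.corr()

# 3) Replacement: substitute a central-tendency score, here the column mean.
imputed = df.fillna(df.mean())
print(complete_cases, pairwise_corr, imputed, sep="\n\n")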
Key Terms
- Bar code
- Codebook
- Coding
- Content analysis
- Data entry
- Data field
- Data file
- Data preparation
- Database
- Don’t know response
- Editing
- Missing data
- Optical character recognition
- Optical mark recognition
- Precoding
- Record
- Spreadsheet
- Voice recognition
Appendix 16a: Describing Data Statistically

This appendix reviews the descriptive statistical concepts used to summarize data.
Cumulative Percentage

Exhibit 16a-1 Frequencies:

A. Unit Sales Increase (%)   Frequency   Percentage   Cumulative Percentage
   5                             1          11.1             11.1
   6                             2          22.2             33.3
   7                             3          33.3             66.7
   8                             2          22.2             88.9
   9                             1          11.1            100.0
   Total                         9         100.0

B. The same distribution arrayed by origin of manufacture (coded 1 and 2), with percentages and cumulative percentages recomputed for that ordering.

This appendix begins with a review of critical concepts from statistics courses. Exhibit 16a-1 provides an example of frequencies and distributions based on sales of LCD TVs. A frequency table arrays category codes from lowest value to highest value, with columns for count, percent, percent adjusted for missing values, and cumulative percent. A frequency distribution is an ordered array of all values for a variable. The table arrays data by assigned numerical value, in this case the actual percentage unit sales increase recorded. To discover how many manufacturers were in each unit sales increase category, read the frequency column. The cumulative percentage for a row is the percentage of manufacturers that gave that response or any response preceding it in the table; this column is helpful when the data have an underlying order. The proportion is the percentage of elements in the distribution that meet a criterion; in the example, the criterion is the origin of manufacture.
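A frequency table with percentages and cumulative percentages can be computed as in this Python sketch (illustrative data matching the nine-manufacturer example above):

import pandas as pd

# Unit sales increases for nine illustrative manufacturers (as in Exhibit 16a-1).
increases = pd.Series([5, 6, 6, 7, 7, 7, 8, 8, 9])

freq = increases.value_counts().sort_index()   # count per category, low to high
pct = 100 * freq / freq.sum()                  # percentage of all cases
table = pd.DataFrame({"Frequency": freq,
                      "Percentage": pct.round(1),
                      "Cumulative %": pct.cumsum().round(1)})
print(table)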
Distributions

In Exhibit 16a-2, shown in the slide, the bell-shaped curve superimposed on the distribution of annual unit sales increases for LCD TV manufacturers is the normal distribution. The distribution of values for any variable that has a normal distribution is governed by a mathematical equation. This distribution is a symmetrical curve and reflects the frequency distribution of many natural phenomena, such as the height of people of a certain gender and age. Many variables that researchers measure have distributions that approximate a standard normal distribution. The standard normal distribution is a special case of the normal distribution in which all values are given standard scores; it has a mean of 0 and a standard deviation of 1. A standard score (Z score) conveys how many standard deviation units a case is above or below the mean: z = (value − mean) / standard deviation. Because it is standardized, the Z score allows comparison of results across different normal distributions.
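A short Python sketch (illustrative data) computes standard scores from the sample mean and standard deviation:

from statistics import mean, stdev

# Illustrative sample; a Z score expresses distance from the mean in SD units.
scores = [5, 6, 6, 7, 7, 7, 8, 8, 9]
m, s = mean(scores), stdev(scores)
z = [(x - m) / s for x in scores]
print([round(v, 2) for v in z])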
Characteristics of Distributions

The standard normal distribution shown in Exhibit 16a-3 is a standard of comparison for describing distributions of sample data. It is used with inferential statistics that assume normally distributed variables.
Measures of Central Tendency
- Mean
- Median
- Mode

Central tendency is a measure of location. The common measures of central tendency are the mean, median, and mode. The mean is the arithmetic average of a data distribution. The median is the midpoint of a data distribution. The mode is the most frequently occurring value in a distribution; there may be more than one mode. When two or more values tie for the highest frequency, the distribution is bimodal or multimodal.
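For example, using Python’s standard library (illustrative data):

from statistics import mean, median, multimode

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]            # illustrative distribution
print(mean(data))                              # arithmetic average: 7.0
print(median(data))                            # midpoint of the ordered values: 7
print(multimode(data))                         # most frequent value(s): [7]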
Measures of Variability
- Variance
- Standard deviation
- Range
- Interquartile range
- Quartile deviation

This slide lists the common measures of variability, also referred to as dispersion or spread. The variance is a measure of score dispersion about the mean. If all the scores are identical, the variance is 0; the greater the dispersion of scores, the greater the variance. Variance is used with interval and ratio data. It is computed by summing the squared distances from the mean for all cases and dividing the sum by the total number of cases minus 1. The standard deviation summarizes how far away from the average the data values typically are. It is the most frequently used measure of spread because it expresses deviations in the original units of measurement rather than in squared units, which improves interpretability. The standard deviation is calculated by taking the square root of the variance. The range is the difference between the largest and smallest scores in the distribution. The interquartile range (IQR), also called the midspread, is the difference between the first and third quartiles of the distribution. The quartile deviation is always used with the median for ordinal data and is helpful for interval or ratio data when the distribution is stretched by extreme values.
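These measures can be computed with Python’s standard library (illustrative data; quantiles uses one common quartile convention among several):

from statistics import quantiles, stdev, variance

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]            # illustrative distribution
print(variance(data))                          # sample variance: divides by n - 1
print(stdev(data))                             # standard deviation = sqrt(variance)
print(max(data) - min(data))                   # range: largest minus smallest score
q1, q2, q3 = quantiles(data, n=4)              # quartile cut points
print(q3 - q1)                                 # interquartile range (midspread)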
Summarizing Distributions with Shape

The measures of shape, skewness and kurtosis, describe a distribution’s departures from symmetry and its relative peakedness or flatness. They use deviation scores, which show how far any observation lies from the mean. Skewness is a measure of a distribution’s deviation from symmetry. In a symmetrical distribution, the mean, median, and mode are in the same location; a distribution with cases stretching toward one tail or the other is called skewed. Kurtosis is a measure of a distribution’s peakedness; its symbol is ku. Intermediate, or mesokurtic, distributions approach normal, and the value of ku for a normal distribution is close to 0. Distributions whose scores cluster heavily, or pile up, in the center are peaked, or leptokurtic, and have a positive ku value. Flat distributions are called platykurtic and have a negative ku value.
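Skewness and kurtosis can be computed with scipy.stats (illustrative data); fisher=True reports kurtosis relative to the normal distribution, so a value of 0 indicates a mesokurtic shape, matching the convention in the notes above:

from scipy.stats import kurtosis, skew

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]    # illustrative, roughly symmetric sample
print(skew(data))                      # near 0 for a symmetric distribution
print(kurtosis(data, fisher=True))     # excess kurtosis: 0 = mesokurtic (normal)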
Key Terms
- Central tendency
- Descriptive statistics
- Deviation scores
- Frequency distribution
- Interquartile range (IQR)
- Kurtosis
- Median
- Mode
- Normal distribution
- Quartile deviation (Q)
- Skewness
- Standard deviation
- Standard normal distribution
- Standard score (Z score)
- Variability
- Variance