Data Preparation and Description


Data Preparation and Description (Chapter 16) This chapter presents the steps necessary to prepare data for analysis.

Learning Objectives Understand . . . importance of editing the collected raw data to detect errors and omissions how coding is used to assign numbers and other symbols to answers and to categorize responses use of content analysis to interpret and summarize open questions

Learning Objectives Understand . . . problems and solutions for “don’t know” responses and handling missing data options for data entry and manipulation

Exhibit 16-1 Data Preparation in the Research Process Once the data begin to flow, a researcher’s attention turns to data analysis. This chapter focuses on the first phases of that process, data preparation and description. Data preparation includes editing, coding, and data entry and is the activity that ensures the accuracy of the data and their conversion from raw form to reduced and classified forms that are more appropriate for analysis. Preparing a descriptive statistical summary is another preliminary step that allows data entry errors to be identified and corrected. Exhibit 16-1 reflects the steps in this phase.

Editing Criteria: Accurate Consistent Complete Uniformly entered Arranged for simplification The customary first step in analysis is to edit the raw data. Editing detects errors and omissions, corrects them when possible, and certifies that maximum data quality standards are achieved. The purpose is to guarantee that data are accurate, consistent with the intent of the question and other information in the survey, uniformly entered, complete, and arranged to simplify coding and tabulation.
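To make these editing criteria concrete, here is a minimal Python sketch (not from the text) of automated range and completeness checks; the survey records, column names, and valid ranges are all hypothetical.

```python
import pandas as pd

# Hypothetical raw survey records; the columns and valid ranges are illustrative only.
raw = pd.DataFrame({
    "resp_id": [1, 2, 3, 4],
    "age": [34, 199, 27, None],        # 199 is out of range, None is an omission
    "satisfaction": [4, 3, 7, 2],      # the scale runs 1-5, so 7 is an error
})

# Flag records that violate completeness or the questionnaire's valid ranges.
problems = raw[
    raw["age"].isna()
    | ~raw["age"].between(18, 99)
    | ~raw["satisfaction"].between(1, 5)
]
print(problems)  # records 2, 3, and 4 would be routed back for review or a callback
```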

Field Editing Field editing review Entry gaps identified Callbacks made Validate results In large projects, field editing review is a responsibility of the field supervisor. It should be done soon after the data have been collected. During the stress of data collection, data collectors often use ad hoc abbreviations and special symbols. If the forms are not completed soon, the field interviewer may not recall what the respondent said. Therefore, reporting forms should be reviewed regularly. When entry gaps are present, a callback should be made rather than guessing what the respondent probably said. The field supervisor also validates field results by reinterviewing some percentage of the respondents on some questions to verify that they have participated. Ten percent is the typical amount used in data validation. In this ad, Western Wats, a data collection specialist, reminds us that speed without accuracy won’t help a marketing decision maker choose the right direction.

Central Editing Be familiar with instructions given to interviewers and coders Do not destroy the original entry Make all editing entries identifiable and in standardized form Initial all answers changed or supplied Place initials and date of editing on each instrument completed At this point, the data should get a thorough editing. For a small study, a single editor will produce maximum consistency. For large studies, editing tasks should be allocated by sections. Sometimes it is obvious that an entry is incorrect, and the editor may be able to detect the proper answer by reviewing other information in the data set. This should be done only when the correct answer is obvious. If an answer given is inappropriate, the editor can replace it with a no answer or unknown. The editor can also detect instances of armchair interviewing (faked interviews) during this phase. This is easiest to spot with open-ended questions.

Exhibit 16-2 Sample Codebook Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of categories. In coding, categories are the partitions of a data set of a given variable. For instance, if the variable is gender, the categories are male and female. Categorization is the process of using rules to partition a body of data. Both closed and open questions must be coded. Numeric coding simplifies the researcher’s task in converting a nominal variable like gender to a “dummy variable.” A codebook contains each variable in the study and specifies the application of coding rules to the variable. It is used by the researcher or research staff to promote more accurate and more efficient data entry. It is the definitive source for locating the positions of variables in the data file during analysis.
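As an illustration of numeric coding (not from the text), the following Python sketch applies a hypothetical codebook fragment and creates a 0/1 dummy variable from the nominal gender variable; the variable names and codes are made up for the example.

```python
import pandas as pd

# Hypothetical codebook fragment: variable -> {numeric code: label}.
codebook = {
    "gender": {1: "male", 2: "female"},
    "region": {1: "north", 2: "south", 3: "east", 4: "west"},
}

responses = pd.DataFrame({"gender": [1, 2, 2, 1], "region": [3, 1, 4, 2]})

# Apply the codebook to produce readable labels for reporting.
labeled = responses.replace(codebook)

# Numeric coding also makes it trivial to build a 0/1 "dummy" variable from gender.
responses["female"] = (responses["gender"] == 2).astype(int)

print(labeled)
print(responses)
```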

Exhibit 16-3 Precoding Precoding means assigning codebook codes to variables in a study and recording them on the questionnaire. It is helpful for manual data entry because it makes the step of completing a data entry coding sheet unnecessary. With a precoded instrument, the codes for variable categories are accessible directly from the questionnaire.

Exhibit 16-3 Coding Open-Ended Questions One of the primary reasons for using open-ended questions is that insufficient information or lack of a hypothesis may prohibit preparing response categories in advance. Researchers are forced to categorize responses after the data are collected. In Exhibit 16-3, question 6 illustrates the use of an open-ended question. After preliminary evaluation, response categories were created for that item. They can be seen in the codebook.

Coding Rules Categories should be: Appropriate to the research problem Exhaustive Mutually exclusive Derived from one classification principle Appropriateness is determined at two levels: (1) the best partitioning of the data for testing hypotheses and showing relationships and (2) the availability of comparison data. Researchers often add an “other” option to a measurement question because they know they cannot anticipate all possible answers. The need for a category set to follow a single classification principle means that every option in the category set is defined in terms of one concept or construct.

QSR’s XSight software for content analysis Content analysis measures the semantic content, or the “what” aspect, of a message. It is used for open-ended questions. QSR’s XSight software allows the researcher to develop different categories for analysis without losing the verbatims that may be crucial to an advertising, PR, packaging, or product development effort. QSR, the company that provided us with N6, the latest version of NUD*IST, and N-VIVO, introduced a commercial version of its content analysis software, XSight, in 2004. XSight was developed for and with the input of researchers. www.qsrinternational.com

Types of Content Analysis Syntactical Referential Propositional Thematic Content analysis follows a systematic process for coding and drawing inferences from texts. It starts by determining which units of data will be analyzed. In written or verbal texts, data units are of four types. Each unit type is the basis for coding texts into mutually exclusive categories. Syntactical units can be words, phrases, sentences, or paragraphs. Referential units are described by words, phrases, and sentences and may be objects, events, persons, etc. Propositional units are assertions about an object, event, or person. Thematic units are topics contained within and across texts. Georgia-Pacific launched the “Do you know a Brawny Man?” essay contest and used content analysis to define the traits of the icon. As a result, the company replaced the old “Brawny Man” with a dark-haired, clean-shaven, sensitive male.
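A highly simplified sketch (not from the text) of how open-ended responses might be coded into thematic categories by keyword matching; the answers, theme names, and keyword lists are hypothetical, and real content analysis would use a far richer coding scheme or software such as XSight.

```python
from collections import Counter

# Hypothetical open-ended answers and a simple keyword scheme for thematic units.
answers = [
    "The salesperson never returned my calls.",
    "Delivery was late and the packaging was damaged.",
    "Great price, but the ordering process was confusing.",
]
themes = {
    "service": ["salesperson", "calls", "support"],
    "logistics": ["delivery", "late", "packaging"],
    "price": ["price", "cost"],
    "process": ["ordering", "process", "confusing"],
}

counts = Counter()
for text in answers:
    lowered = text.lower()
    for theme, keywords in themes.items():
        if any(word in lowered for word in keywords):
            counts[theme] += 1

print(counts)  # frequency of each thematic category across the responses
```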

Exhibit 16-4 & 16-5 Open-Question Coding Exhibit 16-4 (checklist): Locus of Responsibility — Mentioned / Not Mentioned: A. Company; B. Customer; C. Joint Company-Customer; F. Other. Exhibit 16-5 (frequencies, n = 100): Locus of Responsibility — A. Management (1. Sales manager, 2. Sales process, 3. Other, 4. No action area identified); B. Management (1. Training); C. Customer (1. Buying processes, 2. Other, 3. No action area identified); D. Environmental conditions; E. Technology; F. Other. (Reported category frequencies, as listed in the exhibit: 10, 20, 7, 3, 15, 12, 8, 5.)

Exhibit 16-7 Handling “Don’t Know” Responses Question: Do you have a productive relationship with your present salesperson? Years of purchasing (columns: Yes, n = 650; No, n = 150; Don’t Know, n = 200): Less than 1 year — 10%, 40%, 38%; 1–3 years — 30%, 32%, …; 4 years or more — 60%, …, …; each column totals 100%. When the number of “don’t know” (DK) responses is low, it is not a problem. However, if many are given, it may mean that the question was poorly designed, too sensitive, or too challenging for the respondent. The best way to deal with undesired DK answers is to design better questions at the beginning. If a DK response is legitimate, it should be kept as a separate reply category.

Data Entry Keyboarding Database Programs Digital/Barcodes Optical Recognition Voice Recognition Data entry converts information gathered by secondary or primary methods to a medium for viewing and manipulation. Keyboarding remains the primary method. However, new methods are making data entry more efficient.

Missing Data Listwise Deletion Pairwise Deletion Replacement Missing data are information from a participant or case that is not available for one or more variables of interest. Missing data typically occur in surveys when respondents accidentally skip, refuse to answer, or do not know the answer to an item on the questionnaire. In these situations, it is also referred to as item non-response. Missing data can also be caused by researcher error and corrupted data files. In the spreadsheet screen shot in the slide, missing data are noted by a 9 in the cell. There are three basic techniques for dealing with missing data: (1) listwise deletion, (2) pairwise deletion, and (3) replacement of missing values with estimated scores. Listwise deletion, or complete case analysis, is the simplest approach. With this method, cases are deleted from the sample if they have missing values on any of the variables in the analysis. If the data are missing completely at random, no bias is introduced because the subsample of complete cases is essentially a random sample of the original sample. However, if the missing data are related to another variable, listwise deletion will introduce bias. Pairwise deletion, also called available case analysis, assumes that data are missing completely at random. Each statistic is computed using all cases that have data for the variable or pair of variables in the analysis. The replacement method includes a variety of techniques. One of the simplest is to replace the missing value with the central tendency score of the sample.
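The three approaches can be sketched in a few lines of Python using pandas; the data and column names below are hypothetical, and this is only an illustration of the logic, not the text's procedure.

```python
import pandas as pd

# Hypothetical survey data with missing values (None); column names are illustrative.
df = pd.DataFrame({
    "satisfaction": [4, 3, None, 5, 2],
    "loyalty":      [5, None, 2, 4, 3],
})

# 1) Listwise deletion: drop any case with a missing value on any analysis variable.
listwise = df.dropna()

# 2) Pairwise deletion: each statistic uses all cases available for that variable or pair.
#    pandas does this by default; the correlation below skips incomplete pairs.
pairwise_corr = df["satisfaction"].corr(df["loyalty"])

# 3) Replacement: substitute a central-tendency value (here the mean) for missing scores.
replaced = df.fillna(df.mean())

print(listwise, pairwise_corr, replaced, sep="\n")
```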

Key Terms Bar code Codebook Coding Content analysis Data entry Data field Data file Data preparation Database Don’t know response Editing Missing data Optical character recognition Optical mark recognition Precoding Record Spreadsheet Voice recognition

Describing Data Statistically (Appendix 16a) This appendix reviews the descriptive statistical concepts used to summarize data once they have been prepared.

Exhibit 16a-1 Frequencies [Exhibit 16a-1 shows two frequency tables of annual unit sales increases (5 to 9 percent), with columns for frequency, percentage, and cumulative percentage; panel B partitions the same data by origin of manufacture.] This appendix begins with a review of critical concepts from statistics courses. Exhibit 16a-1 provides an example of frequencies and distributions based on sales of LCD TVs. A frequency table arrays category codes from lowest value to highest value, with columns for count, percent, percent adjusted for missing values, and cumulative percent. A frequency distribution is an ordered array of all values for a variable. The table arrays data by assigned numerical value, in this case the actual percentage unit sales increase recorded. To discover how many manufacturers were in each unit sales increase category, read the frequency column. The cumulative percentage reveals the percentage of manufacturers that gave a particular response plus all responses that preceded it in the table. This column is helpful when the data have an underlying order. The proportion is the percentage of elements in the distribution that meet a criterion. In the example, the criterion is the origin of manufacture.
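A short Python sketch (not from the text) of how a frequency table with percentages and cumulative percentages can be built; the unit-sales-increase values are illustrative, not the figures from Exhibit 16a-1.

```python
import pandas as pd

# Illustrative values only: annual unit sales increases (%) for nine hypothetical manufacturers.
increases = pd.Series([5, 6, 6, 7, 7, 7, 8, 8, 9])

freq = increases.value_counts().sort_index().rename("frequency").to_frame()
freq["percent"] = 100 * freq["frequency"] / freq["frequency"].sum()
freq["cumulative_percent"] = freq["percent"].cumsum()

print(freq)  # ordered array of values with frequency, percent, and cumulative percent
```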

Distributions In Exhibit 16a-2, shown in the slide, the bell-shaped curve that is superimposed on the distribution of annual unit sales increases for LCD TV manufacturers is called the normal distribution. The distribution of values for any variable that has a normal distribution is governed by a mathematical equation. This distribution is a symmetrical curve and reflects a frequency distribution of many natural phenomena, such as the height of people of a certain gender and age. Many variables of interest that researchers measure have distributions that approximate the normal distribution. A standard normal distribution is a special case of the normal distribution in which all values are given standard scores. The distribution has a mean of 0 and a standard deviation of 1. A standard score (z score) conveys how many standard deviation units a case is above or below the mean. Because it is standardized, the z score allows the comparison of results from different normal distributions.
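A minimal sketch (not from the text) of computing z scores; the sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample of unit sales increases (%); the values are illustrative.
x = np.array([5, 6, 6, 7, 7, 7, 8, 8, 9], dtype=float)

# A z score expresses each observation in standard-deviation units from the mean.
z = (x - x.mean()) / x.std(ddof=1)   # ddof=1 gives the sample standard deviation
print(z.round(2))                    # standardized scores have mean 0 and std dev 1
```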

Characteristics of Distributions The standard normal distribution shown in Exhibit 16a-3 is a standard of comparison for describing distributions of sample data. It is used with inferential statistics that assume normally distributed variables.

Measures of Central Tendency Mean Median Mode Central tendency is a measure of location. The common measures of central tendency include the mean, median, and mode. The mean is the arithmetic average of a data distribution. The median is the midpoint of a data distribution. The mode is the most frequently occurring value in a distribution. There may be more than one mode in a distribution. When two or more values share the highest frequency, the distribution is bimodal or multimodal.
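For illustration (not from the text), the three measures can be computed with Python's statistics module on a hypothetical data set.

```python
import statistics

scores = [5, 6, 6, 7, 7, 7, 8, 8, 9]   # hypothetical data

print(statistics.mean(scores))          # arithmetic average
print(statistics.median(scores))        # midpoint of the ordered values
print(statistics.multimode(scores))     # every most frequent value; more than one means multimodal
```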

Measures of Variability (Dispersion) Variance Standard deviation Range Interquartile range Quartile deviation This slide lists the common measures of variability, also referred to as dispersion or spread. The variance is a measure of score dispersion about the mean. If all the scores are identical, the variance is 0. The greater the dispersion of scores, the greater the variance. Variance is used with interval and ratio data. It is computed by summing the squared distance from the mean for all cases and dividing the sum by the total number of cases minus 1. The standard deviation summarizes how far away from the average the data values typically are. It is the most frequently used measure of spread because it improves interpretability by removing the variance’s square and expressing deviations in their original units. It reveals the amount of variability within the data set. The standard deviation is calculated by taking the square root of the variance. The range is the difference between the largest and smallest scores in the distribution. The interquartile range (IQR) is the difference between the first and third quartiles of the distribution. It is also called the midspread. The quartile deviation is always used with the median for ordinal data. It is helpful for interval or ratio data when the distribution is stretched by extreme values.
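A brief Python sketch (not from the text) of the dispersion measures just listed, using the same hypothetical scores as the earlier examples.

```python
import numpy as np

scores = np.array([5, 6, 6, 7, 7, 7, 8, 8, 9], dtype=float)  # hypothetical data

variance = scores.var(ddof=1)              # sum of squared deviations / (n - 1)
std_dev = scores.std(ddof=1)               # square root of the variance, in original units
value_range = scores.max() - scores.min()  # largest score minus smallest score
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                              # interquartile range (midspread)
quartile_deviation = iqr / 2               # Q, half the interquartile range

print(variance, std_dev, value_range, iqr, quartile_deviation)
```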

Summarizing Distributions with Shape The measures of shape, skewness and kurtosis, describe departures from the symmetry of a distribution and its relative flatness. They use deviation scores. Deviation scores show us how far any observation is from the mean. Skewness is a measure of a distribution’s deviation from symmetry. In a symmetrical distribution, the mean, mode, and median are in the same location. A distribution that has cases stretching toward one tail or the other is called skewed. Kurtosis is a measure of a distribution’s peakedness. The symbol for kurtosis is ku. Intermediate or mesokurtic distributions approach normal, and the value of ku for a normal distribution is close to 0. Distributions that have scores which cluster heavily or pile up in the center are peaked or leptokurtic; a leptokurtic distribution has a positive ku value. Flat distributions are called platykurtic; a platykurtic distribution has a negative ku value.
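For illustration (not from the text), skewness and kurtosis can be computed with SciPy on a hypothetical sample; with fisher=True the normal distribution's kurtosis is reported as 0.

```python
from scipy import stats

scores = [5, 6, 6, 7, 7, 7, 8, 8, 9]   # hypothetical data

# Skewness: departure from symmetry (0 for a perfectly symmetrical distribution).
print(stats.skew(scores))

# Kurtosis: peakedness; positive = leptokurtic, negative = platykurtic, ~0 = mesokurtic.
print(stats.kurtosis(scores, fisher=True))
```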

Symbols Sample mean: X̄ = (ΣXᵢ)/n. Sample variance: s² = Σ(Xᵢ − X̄)²/(n − 1). Sample standard deviation: s = √s². Standard score: z = (Xᵢ − X̄)/s.

Key Terms Central tendency Descriptive statistics Deviation scores Frequency distribution Interquartile range (IQR) Kurtosis Median Mode Normal distribution Quartile deviation (Q) Skewness Standard deviation Standard normal distribution Standard score (Z score) Variability Variance