Business Research Methods Data Preparation
Data Preparation
Data preparation is the process of ensuring the accuracy of the data and converting them from raw form into classified forms appropriate for analysis.
Data Preparation Process
Validation -> Editing -> Coding -> Data Entry -> Data Cleaning -> Tabulation & Analysis
Questionnaire Checking
A questionnaire returned from the field may be unacceptable for several reasons:
- Parts of the questionnaire are incomplete.
- The pattern of responses indicates that the respondent did not understand or follow the instructions.
- The responses show little variance.
- One or more pages are missing.
- The questionnaire is received after the pre-established cutoff date.
- The questionnaire is answered by someone who does not qualify for participation.
Validation and editing help prepare the data for data entry.
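Several of the checks listed on this slide can be automated once the returned questionnaires are logged. The following is a minimal sketch, not a standard procedure: the record layout (`answers`, `received`, `qualified`), the cutoff date and the rule that identical answers to every item count as "little variance" are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical cutoff date for accepting returned questionnaires.
CUTOFF = date(2024, 3, 31)

def screen(questionnaire):
    """Return the reasons a returned questionnaire looks unacceptable."""
    reasons = []
    answers = questionnaire["answers"]
    # Incomplete: at least one item was left blank (None).
    if any(a is None for a in answers):
        reasons.append("incomplete")
    # Little variance (assumed rule): every answered item has the same value.
    given = [a for a in answers if a is not None]
    if len(given) > 1 and len(set(given)) == 1:
        reasons.append("little variance")
    # Received after the cutoff date.
    if questionnaire["received"] > CUTOFF:
        reasons.append("late")
    # Answered by someone who does not qualify for participation.
    if not questionnaire["qualified"]:
        reasons.append("unqualified respondent")
    return reasons

example = {"answers": [3, 3, 3, 3], "received": date(2024, 4, 2), "qualified": True}
problems = screen(example)
```

An empty list means the questionnaire passed these particular checks; missing pages and misunderstood instructions still need a human editor.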
Validating
- Validation is the process of ascertaining whether the interviews conducted complied with the specified norms.
- It helps detect fraud or failure by the interviewer to follow the specified instructions.
- The questionnaire has a separate place to record the respondent's name, address, telephone number and other demographic details, and the date of the interview.
- These details are the basis for validation: they are used to confirm that the interview was really conducted.
Editing
- Editing is the process of checking for mistakes made by the interviewer or the respondent in filling in the questionnaire.
- It is a manual process that is generally done twice: first by the service firm that conducted the interviews, and then by the market research firm.
- The first check is generally done by the field supervisors in the field itself.
- Problems to be checked in editing include:
  - whether the interviewer has followed the skip pattern
  - whether responses to open-ended questions have been properly obtained
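Editing is described here as a manual process, but a skip-pattern check is easy to express as a rule. The sketch below assumes a hypothetical questionnaire in which `q2` is asked only when `q1` was answered "yes"; the question names and the rule are illustrative, not taken from the slides.

```python
def skip_pattern_errors(record, filter_q="q1", follow_up="q2", ask_if="yes"):
    """Check one filter question and its follow-up against the skip rule."""
    errors = []
    asked = record.get(filter_q) == ask_if       # should the follow-up have been asked?
    answered = record.get(follow_up) is not None  # was the follow-up actually answered?
    if asked and not answered:
        errors.append(f"{follow_up} missing although {filter_q} == {ask_if!r}")
    if not asked and answered:
        errors.append(f"{follow_up} answered although it should have been skipped")
    return errors

# The interviewer recorded a brand even though the respondent said "no" at q1.
errors = skip_pattern_errors({"q1": "no", "q2": "Amul"})
```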
Editing
During editing, some responses are found to be illegible, incomplete, inconsistent or ambiguous. These are called unsatisfactory responses.
Treatment of Unsatisfactory Responses
- Returning to the Field: The questionnaires with unsatisfactory responses may be returned to the field, where the interviewers recontact the respondents.
- Assigning Missing Values: If returning the questionnaires to the field is not feasible, the editor may assign missing values to unsatisfactory responses.
- Discarding Unsatisfactory Respondents: In this approach, the respondents with unsatisfactory responses are simply discarded.
Coding
Coding is the process of assigning a symbol, usually a number, to each possible response to each question. Coding is necessary for efficient data analysis.
The categories into which responses are grouped for coding should be:
-- Appropriate: If income is an important variable, a wide income classification may not be appropriate.
-- Exhaustive: The categories should list all possible alternatives.
-- Mutually exclusive: A response should not fit into more than one category.
Coding
- Coding closed-ended questions is easy, since there is a definite number of predetermined responses.
- Closed-ended questions are generally pre-coded, so the intermediate step of framing the codes prior to data entry can be avoided.
- Coding the data from open-ended questions is much more difficult, as the possible responses are unlimited and varied.
Coding
Guidelines for coding unstructured questions:
- Category codes should be mutually exclusive and collectively exhaustive.
- Only a few (10% or less) of the responses should fall into the “other” category.
- Category codes should be assigned for critical issues even if no one has mentioned them.
- Data should be coded to retain as much detail as possible.
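The 10% guideline for the “other” category can be checked directly on the coded data. This is a small sketch under stated assumptions: the coded answers are made up, and the use of code 9 for “other” is a hypothetical convention.

```python
from collections import Counter

def other_share(codes, other_code):
    """Return the fraction of coded responses that fall in the 'other' category."""
    if not codes:
        raise ValueError("no coded responses")
    counts = Counter(codes)
    return counts[other_code] / len(codes)

# Hypothetical coded answers to one open-ended question; 9 stands for "other".
coded = [1, 2, 2, 3, 1, 9, 2, 1, 3, 2, 1, 2, 3, 1, 2, 1, 2, 3, 9, 9]
share = other_share(coded, other_code=9)

# Guideline from the slide: 10% or less should be "other".
needs_new_categories = share > 0.10
```

Here 3 of 20 responses are coded “other”, which exceeds the guideline and suggests the category scheme should be revisited.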
Content Analysis for Open-Ended Questions
- Content analysis is a qualitative technique used to analyze the text provided in response to open-ended questions.
- It systematically and objectively derives categories of responses that represent homogeneous thoughts and opinions.
- It identifies responses particularly relevant to the survey.
- It requires the researcher to name the categories through a detailed examination of the data (as against pre-coding).
- It is an iterative interpretation process: the responses are first read, then reread to establish meaningful categories.
- The number and meaning of the categories are further refined so that they are most representative of the respondents’ text.
- Each response is classified into as many categories as necessary to capture the full picture.
- Responses that are out of context of the question are not coded.
Codebook
A codebook contains the coding rules and the necessary information about each variable in the survey. A codebook generally contains the following information (example values in parentheses):
- question number (3)
- variable number (4)
- variable name (Brand)
- instructions for coding (1 = Amul, 2 = Cadbury, 3 = Nestle)
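A codebook entry maps naturally onto a small lookup structure. The sketch below stores the Brand example from this slide in a dictionary; the dictionary layout and the `encode`/`decode` helper names are assumptions made for illustration.

```python
# Codebook entry built from the example on this slide (layout is hypothetical).
CODEBOOK = {
    "Brand": {
        "question_number": 3,
        "variable_number": 4,
        "codes": {1: "Amul", 2: "Cadbury", 3: "Nestle"},
    }
}

def encode(variable, answer):
    """Return the numeric code for a verbatim answer, ignoring case and outer spaces."""
    for code, label in CODEBOOK[variable]["codes"].items():
        if label.lower() == answer.strip().lower():
            return code
    raise KeyError(f"{answer!r} has no code for variable {variable!r}")

def decode(variable, code):
    """Return the label that a numeric code stands for."""
    return CODEBOOK[variable]["codes"][code]

brand_code = encode("Brand", " cadbury ")
brand_label = decode("Brand", 3)
```

Raising an error for an answer without a code, rather than guessing, keeps uncoded responses visible to the editor.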
Coding Don’t Knows
- “Don’t know” (DK) is included among the possible answers.
- Respondents choose it either because they genuinely don’t know or because they don’t want to answer.
- A considerable number of DK responses may be generated for some questions.
- The researcher can either ignore them or allocate their frequency to the other responses in the ratio in which those responses occur.
Example: How many chocolates do you eat in a typical week?
Before allocation: 300 (<20) : 200 (>20) : 50 (DK)
After allocation: 330 (<20) : 220 (>20)
The 50 DK responses are split in the ratio 300:200, so 30 go to the first category and 20 to the second.
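The proportional allocation of DK responses in the chocolate example can be sketched as follows. The function name and the dictionary representation of the frequency table are assumptions; the numbers are the ones on the slide.

```python
def reallocate_dont_knows(counts, dk_key="DK"):
    """Spread the DK frequency over the other categories in proportion to their counts."""
    dk = counts.get(dk_key, 0)
    substantive = {k: v for k, v in counts.items() if k != dk_key}
    total = sum(substantive.values())
    if total == 0:
        raise ValueError("no substantive responses to allocate to")
    # Each category receives dk * (its share of the substantive responses).
    return {k: v + dk * v / total for k, v in substantive.items()}

weekly_chocolates = {"<20": 300, ">20": 200, "DK": 50}
adjusted = reallocate_dont_knows(weekly_chocolates)
```

The adjusted frequencies still sum to the original 550 respondents; only the DK column disappears.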
Data Entry or Transcribing
- Data entry involves transferring coded data from questionnaires or coding sheets into computers through keypunching.
- Data collected through CATI or CAPI are entered directly into the computer.
- Besides keypunching, data can be transferred using optical scanning, mark sense forms or computerised sensory analysis.
- Optical scanners can read responses on questionnaires. They read darkened small circles and process the marked answers. They are used for marking papers in competitive exams and for transcribing UPC data at checkout counters in supermarkets.
- Mark sense forms require responses to be recorded with a special pencil in a predesignated area coded for that response. The data can then be read by a machine.
- Computerised sensory analysis automates the data collection process. Questions appear on a computerised gridpad, and responses are recorded directly into the computer using a sensing device.
Data Cleaning
Data cleaning is undertaken after data entry and includes:
---- consistency checks
---- treatment of missing values
Compared to the preliminary consistency checks during editing, checking at this stage is more thorough and extensive because it is done by computer.
Data Cleaning
Consistency Checks
- Consistency checks identify data that are out of range, logically inconsistent, or have extreme values.
- Computer packages like SPSS, SAS, Excel and Minitab can be programmed to identify out-of-range values for each variable and print out the respondent code, variable code, variable name, record number, column number, and out-of-range value.
- Extreme values should be closely examined.
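The slide names statistical packages for this step; the same out-of-range check can be sketched in a few lines of plain Python. The variable names, valid ranges and records below are hypothetical, and the report is reduced to respondent code, variable name and offending value.

```python
# Hypothetical valid ranges: a 1-5 rating scale and age in years.
VALID_RANGES = {"satisfaction": (1, 5), "age": (18, 99)}

def out_of_range(records, valid_ranges):
    """Return (respondent_id, variable, value) for every value outside its allowed range."""
    problems = []
    for rec in records:
        for var, (low, high) in valid_ranges.items():
            value = rec.get(var)
            # Missing values (None) are left to the missing-response treatment.
            if value is not None and not (low <= value <= high):
                problems.append((rec["id"], var, value))
    return problems

records = [
    {"id": 101, "satisfaction": 4, "age": 34},
    {"id": 102, "satisfaction": 7, "age": 29},
    {"id": 103, "satisfaction": 2, "age": 140},
]
flagged = out_of_range(records, VALID_RANGES)
```

Flagged values are only candidates for correction; each one should be traced back to the original questionnaire before it is changed.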
Data Cleaning
Treatment of Missing Responses
- Substitute a Neutral Value: A neutral value, typically the mean response to the variable, is substituted for the missing responses.
- Substitute an Imputed Response: The respondents' pattern of responses to other questions is used to impute or calculate a suitable response to the missing questions.
- Casewise Deletion: Cases, or respondents, with any missing responses are discarded from the analysis.
- Pairwise Deletion: Instead of discarding all cases with any missing values, the researcher uses, for each calculation, only the cases or respondents with complete responses on the variables involved.
After data cleaning, the computer data file is deemed clean and ready for analysis.
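Three of these treatments are simple enough to compare side by side. The sketch below uses a made-up data set of four respondents and three questions, with `None` marking a missing response; the function names are assumptions. Imputation from other answers is omitted because it depends on a model the slides do not specify.

```python
from statistics import mean

# Hypothetical data: None marks a missing response.
data = [
    {"q1": 4, "q2": 3, "q3": 5},
    {"q1": 2, "q2": None, "q3": 4},
    {"q1": None, "q2": 5, "q3": 3},
    {"q1": 3, "q2": 4, "q3": 2},
]

def mean_substitute(rows, var):
    """Neutral value: replace missing responses to `var` with the mean of the observed ones."""
    observed = [r[var] for r in rows if r[var] is not None]
    fill = mean(observed)
    return [fill if r[var] is None else r[var] for r in rows]

def casewise(rows):
    """Casewise deletion: keep only respondents with no missing response at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, var_a, var_b):
    """Pairwise deletion: keep respondents complete on the two variables in this calculation."""
    return [(r[var_a], r[var_b]) for r in rows
            if r[var_a] is not None and r[var_b] is not None]

q2_filled = mean_substitute(data, "q2")
complete_cases = casewise(data)
q1_q3_pairs = pairwise(data, "q1", "q3")
```

With these data, casewise deletion keeps two respondents, while pairwise deletion keeps three for any calculation on `q1` and `q3`, which shows why pairwise deletion retains more information at the cost of different sample sizes per calculation.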