Fellows Training FINCA Client Assessment Tool (FCAT): Data Cleaning Slides Incorporate Important Information from ORC Macro. 2006. Demographic and Health.


1 Fellows Training: FINCA Client Assessment Tool (FCAT): Data Cleaning. Slides incorporate important information from ORC Macro. 2006. Demographic and Health Survey Interviewer’s Manual.

2 FCAT 2009 Data Cleaning: 1. Data Integrity 2. Data Formats in FCAT 3. Data Challenges in FCAT

3 Data Integrity If data is acceptable to use for statistical analysis, that means it has INTEGRITY. Test: Will researchers question the results of a study simply because of the data set that was used?

4 Data Integrity (continued) Data has integrity if it is valid and reliable.
Internal validity: The concept you are trying to capture should be accurately measured.
External validity (also known as “generalizability”): What populations do your findings apply to? Does your sample represent the population?
Statistical validity: Will statistical models yield valid results?
Reliability: Can the results be replicated or repeated?

5 Good Data Importance of good data: Accuracy in findings; helps direct policy and operations; contributes to development of products and services.

6 Examples of Integrity: Recall. High validity, low reliability (measurement error). Example: Expenditure recall over long periods. Solution: Shorten periods, verify responses, reframe questions (health is better or worse than average?).

7 Examples of Integrity: Self-Reporting. Low validity, high reliability (systematic bias). Example: 85% of motorists self-report that they are above-average drivers. Solution: Ask their friends to rate them.

8 FCAT 2009 Data Cleaning: 1. Data Integrity 2. Data Formats in FCAT 3. Data Challenges in FCAT

9 Data Formats in FCAT Data is recorded in 5 different formats:
1. Categorical: Non-overlapping, exclusive, and finite. Ex. Home Ownership: 1. Owned 2. Leased 3. Privately rented 4. Government rented 5. Rent free 6. Squatted 7. Other, please write in
2. Ordinal/Scaled: Rated according to a given scale. Ex. Rate the loan application process: 1. Very difficult 2. Difficult 3. Easy 4. Very easy
3. Binary: Yes or No

10 Data Formats in FCAT (Continued) Data is recorded in 5 different formats:
4. Write-ins: Text write-ins. Ex. Others, please write in response: ______ *Be aware of the type of response expected to avoid inconsistencies and outliers.
5. Open-Ended: Number write-in. Ex. Food expenditures for the week: __ (in local currency); Time to gather water: __ (minutes) *Note: Always record units of measure.

11 FCAT 2009 Data Cleaning: 1. Data Integrity 2. Data Formats in FCAT 3. Data Challenges in FCAT

12 Data Challenges in FCAT: Inconsistent values; Outliers; Missing values; Calculated values; Others; Cleaning data

13 Data Cleaning Data is only usable if it is properly cleaned. As the interviewer and the one familiar with the data, it is your job to ensure that the data is correct.

14 Inconsistent Values 1. Definition: When a second response is made invalid (either impossible or simply inaccurate) by an earlier given answer
2. Examples:
Columns: Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)
Row 1: 2 | Village Bank | Client defaulted | 1
Row 2: 2 | Client | N/A | 1
Row 3: 1 | Client defaulted | N/A
3. Treatment: a. Filter b. Annotate (shaded cells show inconsistencies)
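The filter step for inconsistent values could be sketched roughly as follows. This is a minimal illustration, not the FCAT procedure itself: the field names (continue_with_finca, who_decided, why_asked_to_leave, plan_to_return) are hypothetical, and the rule assumed here is that a client who is continuing with FINCA (code 1) should have no answers to the follow-up questions about leaving.

```python
# Hypothetical field names; the real FCAT column names may differ.
NOT_APPLICABLE = {"", "N/A", None}
FOLLOW_UP_FIELDS = ("who_decided", "why_asked_to_leave", "plan_to_return")


def find_inconsistencies(record):
    """Return follow-up fields that contradict a 'Yes, continuing' answer."""
    flagged = []
    if record.get("continue_with_finca") == 1:  # 1 = Yes, still a client
        for field in FOLLOW_UP_FIELDS:
            if record.get(field) not in NOT_APPLICABLE:
                flagged.append(field)
    return flagged


records = [
    {"continue_with_finca": 2, "who_decided": "Village Bank",
     "why_asked_to_leave": "Client defaulted", "plan_to_return": 1},
    {"continue_with_finca": 1, "who_decided": "",
     "why_asked_to_leave": "Client defaulted", "plan_to_return": "N/A"},
]

for number, rec in enumerate(records, start=1):
    print(number, find_inconsistencies(rec))
```

Flagged fields are only reported, not changed, so the interviewer can annotate them before deciding on a correction.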

15 Outliers 1. Definition: Response outside the range of values

16 Outliers (continued) 2. Examples:
1) In general, how is your health at this time? 1. Excellent 2. Good 3. Poor 4. Very Poor. Answer: 7 (Outlier: Response is out of answer range)
2) How much does your household spend per week for food? Answer in Ecuador: $10,000 (Outlier: Response amount is very unlikely)
3. Treatment: a. Filter b. Annotate c. Correct value, if possible (e.g. mean of positive values)
Special mention: Inliers. A question calls for integers but the recorded answer is a decimal, e.g. recording a child’s age as .5 if he is yet to complete a year.
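The three kinds of problem on this slide (a code outside the answer range, an implausible amount, and a decimal where a whole number is expected) might be checked along these lines. It is a sketch under assumptions: the function names are invented for illustration, and the upper limit for weekly food spending is a made-up threshold that would need to be set per country.

```python
HEALTH_CODES = {1, 2, 3, 4}   # 1. Excellent ... 4. Very Poor (from the slide)
WEEKLY_FOOD_LIMIT = 500       # assumed plausibility threshold, not from FCAT


def is_out_of_range(answer, valid_codes):
    """True if a coded answer is not one of the allowed codes."""
    return answer not in valid_codes


def is_unlikely(amount, upper_limit):
    """True if an amount is negative or above the plausibility limit."""
    return amount < 0 or amount > upper_limit


def is_inlier(value):
    """True if a decimal was recorded where a whole number is expected."""
    return not float(value).is_integer()


print(is_out_of_range(7, HEALTH_CODES))        # health answer of 7
print(is_unlikely(10000, WEEKLY_FOOD_LIMIT))   # $10,000 per week on food
print(is_inlier(0.5))                          # child's age recorded as .5
```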

17 Missing Values 1. Definition: a. Stated information that was not recorded (not legitimate skips) b. _____
2. Examples:
Columns: Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)
Row 1: 2 | Village Bank | Group dissolved | 1
Row 2: 2 | Client defaulted | 2
Row 3: 1 | N/A
3. Treatment: a. Filter b. Annotate c. Correct value, if possible (in shaded cells). Ex. If you can distinguish between missing values and legitimate skips, replace missing values with the mean over a defined sample (e.g. branch or region).
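Replacing missing values with the mean over a defined sample could look something like the sketch below. The conventions are assumptions made for illustration: a missing value is stored as None, a legitimate skip is stored as the text "N/A", and the field names (branch, food_expenditure) are hypothetical. Legitimate skips are left untouched and excluded from the mean.

```python
from collections import defaultdict
from statistics import mean

SKIPPED = "N/A"  # assumed marker for a legitimate skip


def fill_missing_with_group_mean(records, value_field, group_field):
    """Return copies of records with None replaced by the group's mean."""
    observed = defaultdict(list)
    for rec in records:
        value = rec[value_field]
        if value is not None and value != SKIPPED:
            observed[rec[group_field]].append(value)

    filled = []
    for rec in records:
        new_rec = dict(rec)  # never modify the raw record
        group_values = observed.get(rec[group_field])
        if rec[value_field] is None and group_values:
            new_rec[value_field] = mean(group_values)
            new_rec["imputed"] = True  # annotate the correction
        filled.append(new_rec)
    return filled


raw = [
    {"branch": "A", "food_expenditure": 20},
    {"branch": "A", "food_expenditure": 30},
    {"branch": "A", "food_expenditure": None},     # missing value
    {"branch": "A", "food_expenditure": SKIPPED},  # legitimate skip
]
cleaned = fill_missing_with_group_mean(raw, "food_expenditure", "branch")
print(cleaned[2])
```

The function returns new records rather than editing the originals, which keeps the raw data intact.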

18 Calculated Values and Other Challenges
Calculated Values 1. Definition: Data derived from sub-aggregated variables 2. Examples: DPCE, PPP converted from local currency unit 3. Treatment: Record units of measure; check formulas
Others: Text is text; numbers are numbers. Do not write in text responses for columns that accept only numbers. Please use the “Other” or “Notes” columns for this purpose.
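One way to catch text entered in a numbers-only column is to try reading every entry as a number and list the ones that fail. This is only a sketch; the column contents below are invented sample data and the function name is hypothetical.

```python
def non_numeric_entries(values):
    """Return (position, value) pairs that cannot be read as numbers."""
    problems = []
    for position, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            problems.append((position, value))
    return problems


# Invented sample column: weekly food expenditure as typed by the interviewer.
food_expenditure = ["25", "40.5", "about 30", "", "18"]
print(non_numeric_entries(food_expenditure))
```

Entries reported here would be moved to the “Other” or “Notes” column, or corrected against the paper form.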

19 Cleaning Data – Do’s
Frequent and periodic: End of the day. Much easier to clean 20 interviews than 80 or 320! Smaller samples are easier to manage. Avoids locality effects on false identification. Avoids contamination of derived variables (e.g. DPCE).
Keep two files: Raw data; Cleaned data. Always keep a back-up as well.
Record and annotate all data issues in a log or tracking document.
Techniques: Filtering; Histograms; Pivot tables.
In other words, do not let data problems snowball.
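A log of data issues can be as simple as a CSV file with one row per problem. The sketch below assumes a hypothetical set of log columns (survey_id, field, issue, action) and writes to an in-memory buffer for demonstration; in practice the handle would be an open file kept alongside the raw and cleaned data files.

```python
import csv
import io

# Hypothetical log layout; adapt the columns to the tracking document in use.
LOG_FIELDS = ["survey_id", "field", "issue", "action"]


def write_issue_log(issues, handle):
    """Write one row per data issue to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(issues)


issues = [
    {"survey_id": "MX20092014", "field": "food_expenditure",
     "issue": "outlier", "action": "flagged for review"},
]

buffer = io.StringIO()
write_issue_log(issues, buffer)
print(buffer.getvalue())
```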

20 Client ID Collection Please collect Client ID information from EACH client interviewed. It is not a violation of privacy. You can assure the client that their personal information will not harm them in any way, and that their responses will be used to help make decisions that improve loan products and services.

21 SurveyID For entry into the Data Warehouse, we need to create a PRIMARY KEY for the Main Form to link to the cleaned Subform. The code appears like this when finished: DC20083101 (2-letter country code, the year collected, and an overall interview number from one fellow). Fellows should give each other a number (1, 2, or 3), and then should add a column in BOTH the main form AND the Household Subform.

22 SurveyID (cont’d) Fellow 1 should take his/her overall individual interview number and add 1000 to it, fellow 2 should add 2000, and fellow 3 should add 3000. Country codes: Ecuador=EC, Zambia=ZM. Therefore, the 14th interview performed by Mexico Fellow #2 would be MX20092014. The same ID should appear in the main form AND the HHSubform. Please maintain this convention throughout the Fellowship.
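The SurveyID convention can be expressed as a small helper. This is an illustrative sketch: the function name and argument names are hypothetical, and the limit of 999 interviews per fellow is an assumption implied by adding 1000, 2000, or 3000 rather than something the slides state.

```python
def make_survey_id(country_code, year, fellow_number, interview_number):
    """Build a SurveyID: country code + year + (fellow * 1000 + interview)."""
    if fellow_number not in (1, 2, 3):
        raise ValueError("fellow_number must be 1, 2, or 3")
    if not 1 <= interview_number <= 999:  # assumed limit, see note above
        raise ValueError("interview_number must be between 1 and 999")
    sequence = fellow_number * 1000 + interview_number
    return f"{country_code.upper()}{year}{sequence}"


# The 14th interview by Mexico Fellow #2 in 2009:
print(make_survey_id("MX", 2009, 2, 14))   # MX20092014
```

Generating the ID once and copying it into both the main form and the household subform keeps the two files linked by the same key.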

23 Clean Data Data is “clean” if:
All categorical codes match those in the survey design sheet (*Ex.: Match drinking water sources with codes 1-15)
All ordinal data are represented as whole numbers (*Ex.: Do not have 3.4 years of education)
Outliers have been justified
Missing data have been correctly annotated

24 Questions?

