Checking data quality.


Objectives
- Duplicates, missing data and exact values
- Checking ranges and legal values and cleaning our data
- Digit preference
- Age heaping

Data quality
The tests in this session can be applied to all types of data.

Completeness
- Missing data can introduce bias
- Structural integrity of: clusters, households, individuals and children, dates of birth

Missing data are often not missing at random. In a survey, the most distant houses or the children who are hardest to measure are often the ones that are missing. This can lead to bias and an unrepresentative sample. Assessing the completeness of reporting provides confidence in the survey and its implementation, so it is necessary to check that the data collected are complete. In anthropometric surveys this means more than ensuring that all eligible children are accounted for: it includes structural integrity checks on all aspects of the data.
- Clusters: all selected clusters are visited.
- Households: all selected households in the clusters are interviewed or recorded as not interviewed (with the reason).
- Household members: all household rosters are complete, with all household members listed and their key characteristics, such as age, sex and residency, provided.
- Children: all eligible children are interviewed and measured, or recorded as not interviewed/measured (with the reason) in the dataset, with no duplicate cases.
- Dates of birth: the date of birth is complete for all eligible children.
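The structural checks above (no duplicate cases, all selected clusters visited) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `cluster`, `household` and `child`, the sample records and the list of selected clusters are all hypothetical and do not come from the survey files used in this session.

```python
from collections import Counter

# Hypothetical records: (cluster, household, child) is assumed to identify one child.
records = [
    {"cluster": 1, "household": 1, "child": 1},
    {"cluster": 1, "household": 1, "child": 2},
    {"cluster": 1, "household": 2, "child": 1},
    {"cluster": 1, "household": 1, "child": 2},  # same child entered twice
    {"cluster": 2, "household": 1, "child": 1},
]


def find_duplicates(rows, key_fields=("cluster", "household", "child")):
    """Return the identifier tuples that appear more than once, with their counts."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def clusters_not_visited(rows, selected_clusters):
    """Return the selected clusters that have no record in the data, sorted."""
    visited = {row["cluster"] for row in rows}
    return sorted(set(selected_clusters) - visited)


duplicates = find_duplicates(records)
missing_clusters = clusters_not_visited(records, selected_clusters=[1, 2, 3])
print("Duplicate cases:", duplicates)
print("Clusters not visited:", missing_clusters)
```

A duplicate flagged this way still needs a manual look: it may be a double entry of one child or two children who were given the same identifier.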

Completeness
- Reported as a proportion: numerators and denominators, as well as the resulting ratios, should be presented
- Easy to measure, and the results are easily presented as percentages
- Can be done for each unit: clusters, households, children and dates of birth

Typically, all clusters can be visited, but in surveys in which some clusters are not visited, the number of clusters not visited in each stratum should be provided in the report.

However, it is unfair to compare SMART, DHS and MICS against this control: SMART is a narrow-topic survey and its teams are dedicated to anthropometric indicators. Also, in the DHS and MICS all children are recorded, making it possible to see who is measured and who is not. In the NNS, however, there is no household roster, and information on children who were not part of the measurement sample is not captured. It is therefore not clear whether the complete denominator of eligible children for whom anthropometric data could have been collected was recorded.

$$\text{Children with complete DoB} = \frac{\text{Number of children with day, month and year recorded}}{\text{Number of children interviewed}}$$

Completeness

$$\text{Children with complete DoB} = \frac{\text{Number of children with day, month and year recorded}}{\text{Number of children interviewed}}$$

$$\text{Missing sex data} = \frac{\text{Number of children with sex not recorded}}{\text{Number of children interviewed}}$$

$$\text{Missing weight data} = \frac{\text{Number of children with weight not recorded}}{\text{Number of children interviewed}}$$

The proportion of missing data for all other variables, including those used in the calculation of anthropometric z-scores, should also be presented.
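As a rough sketch of how these proportions could be computed, the Python below reads a small inline CSV and reports numerator, denominator and percentage for each indicator. The column names (`sex`, `weight`, `muac`, `dob_day`, `dob_month`, `dob_year`) and the four data rows are assumptions made up for illustration, and an empty cell is assumed to mean "not recorded".

```python
import csv
import io

# Hypothetical extract; an empty cell is assumed to mean the value was not recorded.
raw = """id,sex,weight,muac,dob_day,dob_month,dob_year
1,M,10.2,135,12,3,2017
2,,9.8,128,,7,2016
3,F,,141,5,11,2015
4,F,12.1,,21,1,2018
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def proportion_missing(rows, field):
    """Return (numerator, denominator, percentage) of rows where `field` is empty."""
    denominator = len(rows)
    numerator = sum(1 for row in rows if row[field].strip() == "")
    return numerator, denominator, 100.0 * numerator / denominator


def proportion_complete_dob(rows, fields=("dob_day", "dob_month", "dob_year")):
    """Return (numerator, denominator, percentage) of rows with day, month and year recorded."""
    denominator = len(rows)
    numerator = sum(1 for row in rows if all(row[f].strip() for f in fields))
    return numerator, denominator, 100.0 * numerator / denominator


for field in ("sex", "weight", "muac"):
    n, d, pct = proportion_missing(rows, field)
    print(f"Missing {field}: {n}/{d} = {pct:.1f}%")

n, d, pct = proportion_complete_dob(rows)
print(f"Complete DoB: {n}/{d} = {pct:.1f}%")
```

Returning the numerator and denominator alongside the percentage follows the advice to present all three in the report.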

Checking ranges and legal values
- Sometimes data is clearly wrong:
  - Decimal points and commas are missing
  - Missing data is coded with a number
  - Sex is coded with letters and numbers
  - Etc.
- We must edit when it is clearly wrong

Checking that data are within an acceptable or plausible range is an important basic test to apply to quantitative data. See the examples in the next slides.

Checking ranges and legal values
This is data from a SMART survey in Angola. Looking at the MUAC data, some values are extreme but possible (underlined and in black), while others are clearly wrong. Three of the wrong values are probably due to data being recorded in cm rather than mm; it is probably safe to change these three values to 111, 124 and 132. The three 999.0 values are missing values coded as 999.0; it is safe to set them to missing using the special missing-value marker, NA (the marker differs between software packages). Extreme values like these are easy to spot by sorting the data in ascending and descending order.
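The two corrections described above can be sketched as follows. The raw values are illustrative only (the survey table is not reproduced here): 11.1, 12.4 and 13.2 are assumed cm entries consistent with the corrected values 111, 124 and 132, and the cut-off of 30 used to recognise a cm entry is an assumption chosen for this example, not a rule from SMART.

```python
MISSING_CODE = 999.0  # missing-value code used in the example data
CM_THRESHOLD = 30.0   # assumption: a MUAC below this was recorded in cm, not mm


def clean_muac(value):
    """Return MUAC in mm, or None when the value is the missing-value code."""
    if value == MISSING_CODE:
        return None
    if value < CM_THRESHOLD:
        return float(round(value * 10))  # convert cm to mm
    return value


# Illustrative raw values, not the actual survey data.
muac_raw = [135.0, 11.1, 999.0, 12.4, 142.0, 13.2, 999.0, 96.0, 999.0]

# Sorting puts implausible values at the two ends of the list.
print("Sorted raw values:", sorted(muac_raw))

muac_clean = [clean_muac(v) for v in muac_raw]
print("Cleaned values:", muac_clean)
```

Here `None` plays the role of NA. A value such as 96.0 is extreme but possible, so it is left unchanged.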

Exercise 1
- Prepare the file ex01.csv
- Calculate the percentage of missing sex data
- Calculate the percentage of missing weight data
- Calculate the percentage of missing MUAC data

Note that surveys that have already been validated should have had legal values checked and dealt with, but raw data should always be provided.

Digit preference
- Analysis of the rounding of weight, MUAC and height measurements
- Usually, we may notice an excessive number of values ending in 0 or 5 for height, weight and MUAC measurements
- Overall analysis and analysis by team
- Routine data or other data may also be examined for digit preference

Digit preference is the observation that a particular final digit in a measurement occurs with a greater frequency than is expected by chance. MUAC is a good example: 119 recorded as 120. The same applies to weight, height, etc.

Digit preference
Measurements in nutritional anthropometry surveys are usually taken and recorded to one decimal place.
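Because measurements are recorded to one decimal place, the digit of interest is the one after the decimal point (or the last digit of a whole-number MUAC in mm). A minimal sketch of how that terminal digit could be extracted and tallied is shown below. The function name and the height values are hypothetical.

```python
from collections import Counter


def terminal_digit(value, decimals=1):
    """Return the last recorded digit of a measurement kept to `decimals` places."""
    # round() protects against small floating-point errors after scaling.
    scaled = round(value * 10 ** decimals)
    return scaled % 10


# Hypothetical heights in cm, recorded to one decimal place.
heights_cm = [70.3, 85.0, 92.5, 101.0, 66.7, 88.5, 79.0]
digit_counts = Counter(terminal_digit(h) for h in heights_cm)

for digit in range(10):
    print(digit, "#" * digit_counts.get(digit, 0))
```

For MUAC recorded in whole millimetres, `terminal_digit(119, decimals=0)` gives the final digit directly. The printed rows form a crude text histogram, which is usually enough to see peaks at 0 and 5.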

Digit preference
- Can occur because of rounding or fake data
- Graphs are a good way to see it, but to quantify it a more complex analysis is done: the digit preference score (DPS)

From the WHO report: "Approximately uniform distributions are not expected for these graphs, but extreme peaks should not be visible."

Digit preference score (DPS)
- Numerical calculation of digit preference
- Index of dissimilarity

The index is calculated automatically, so don't worry about the formula. In the formula:
- Actual percentage_i: the percentages for each terminal digit in the survey (e.g. number of height measurements with a terminal digit of zero / all height measurements)
- Expected percentage_i: the percentages of the expected distribution (i.e. 10% on each terminal digit)
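The slide's formula is not reproduced in this transcript. As a hedged illustration, the sketch below computes an index of dissimilarity in its usual form: half the sum, over the ten terminal digits, of the absolute difference between the actual and the expected (10%) percentage. Treat this as an assumption about the intended calculation; survey software may compute its digit preference score differently, and the function name is hypothetical.

```python
from collections import Counter


def index_of_dissimilarity(digits):
    """Half the sum of |actual % - expected %| over terminal digits 0 to 9."""
    if not digits:
        raise ValueError("at least one terminal digit is required")
    counts = Counter(digits)
    n = len(digits)
    expected_pct = 10.0  # uniform expectation: 10% on each terminal digit
    total = sum(
        abs(100.0 * counts.get(d, 0) / n - expected_pct) for d in range(10)
    )
    return 0.5 * total


# Every digit equally frequent: no digit preference.
uniform = list(range(10)) * 5
# Every measurement ends in 0: the most extreme preference possible.
all_zero = [0] * 50

print(index_of_dissimilarity(uniform))
print(index_of_dissimilarity(all_zero))
```

With this definition the index is 0 when all ten digits are equally frequent, and it rises as measurements concentrate on a few digits.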

DPS interpretation
- WHO does not provide recommendations
- From SMART:

DPS value        Interpretation
0 ≤ DPS < 8      Excellent
8 ≤ DPS < 12     Good
12 ≤ DPS < 20    Acceptable
DPS ≥ 20         Problematic

Check histograms first. Digit preference can be an indication of data fabrication or of inadequate care and attention during data collection and recording. Identifying which digits are overrepresented can provide some insight into the type of error. For example, if the frequency distribution shows significant digit preference for 0 and/or 5, it may indicate that measurers were rounding. If the preference is for digits other than 0 and 5, it is possible that the data have been fabricated.

Then check the DPS. The digit preference score ranges from 0 to 100, and the ideal value of the index is 0. WHO does not provide any guidance beyond saying that the ideal value is 0. The SMART thresholds are not widely accepted but can be used as an indication.
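The SMART thresholds in the table can be applied with a small helper like the one below. The function name is hypothetical; only the cut-offs come from the table, and the result should be read as an indication rather than a verdict.

```python
def interpret_dps(dps):
    """Map a digit preference score to the SMART interpretation in the table."""
    if dps < 0:
        raise ValueError("DPS cannot be negative")
    if dps < 8:
        return "Excellent"
    if dps < 12:
        return "Good"
    if dps < 20:
        return "Acceptable"
    return "Problematic"


for score in (3.5, 8, 15.2, 20):
    print(score, interpret_dps(score))
```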

Age heaping
- Tendency to report children's ages to the nearest year
- Or selection bias towards older/younger children
- Poor records, maternal recall, event calendar
- Often more heaping is observed for older children
- Mortality and fertility rates can modify it

Age heaping is the tendency to report children's ages to the nearest year, or adults' ages to the nearest multiple of five or ten years. Age heaping is very common, and it is a major reason why data from nutritional anthropometry surveys are often analysed and reported using broad age groups. The effect matters when rounding is systematically up or systematically down, because systematic rounding can lead to bias. If rounding is systematically down, then indices will be biased upwards and prevalence biased downwards. If rounding is systematically up, then indices will be biased downwards and prevalence biased upwards.

Age heaping
- Histograms are a good way to check it
- Very common and can affect many nutrition indicators

We expect all ages to be present with roughly equal frequency, or with frequency reducing slowly with age due to mortality. In this example there is marked age heaping at 12, 18, 24, 30, 36 and 48 months. This is very common when age is reported by mothers, because mothers and other carers tend to round ages to whole years or half years. Histograms are a good way to present and interpret age heaping. Chi-square tests can also be done.

January 2019, Addis Ababa
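One way to turn the histogram check into a number is a chi-square statistic on the remainder of age in months divided by 12: heaping at whole years piles up cases on remainder 0. The sketch below is an illustration under explicit assumptions: ages are taken to be 6 to 59 months, the age distribution is assumed flat across that range (ignoring the slow decline due to mortality), and the function name and example data are hypothetical. Using `divisor=6` instead would also capture heaping at half years.

```python
from collections import Counter


def age_heaping_chi_square(ages_months, low=6, high=59, divisor=12):
    """Chi-square statistic for the remainders of age (months) divided by `divisor`.

    Expected counts assume every month of age between `low` and `high`
    is equally likely (an assumption, not a survey rule).
    """
    ages = [a for a in ages_months if low <= a <= high]
    if not ages:
        raise ValueError("no ages within the requested range")
    observed = Counter(a % divisor for a in ages)
    # How many months in the range fall on each remainder.
    months_per_remainder = Counter(m % divisor for m in range(low, high + 1))
    n = len(ages)
    span = high - low + 1
    statistic = 0.0
    for remainder, n_months in months_per_remainder.items():
        expected = n * n_months / span
        statistic += (observed.get(remainder, 0) - expected) ** 2 / expected
    return statistic


flat_ages = list(range(6, 60))          # one child at every month of age
heaped_ages = [12, 24, 36, 48] * 10     # every age reported as a whole year

print(age_heaping_chi_square(flat_ages))
print(age_heaping_chi_square(heaped_ages))
```

The statistic has (number of remainders minus 1) degrees of freedom, so 11 for `divisor=12`. A value far above the usual critical value for that number of degrees of freedom suggests heaping, but the histogram should still be inspected to see where the peaks are.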

Conclusions
- The percentage of missing data is a useful indicator of data quality
- Rounding can create poor data quality; we can check the digit preference of MUAC, weight and height using the DPS
- Age heaping is a common occurrence; it can be checked with histograms or with a chi-square test

Don't worry if you don't know how to do a chi-square test. We will do that later!

Exercise 2
Divide into 4 groups.
- The files ex02a.csv and ex02c.csv are CSV files containing anthropometric data for children in a single state of a West African country in a Demographic and Health Survey (DHS).
- The file ex02b.csv is a CSV file containing routine anthropometric data from CMAM programs during 2018 in Dadaab Camp.
- The file ex01d.csv is a CSV file containing anthropometric data from a Rapid Assessment Method for Older People (RAM-OP) survey in the Dadaab refugee camp in Garissa, Kenya. This is a survey of people aged sixty years and older.

Exercise 2
- Team A: calculate digit preference for height and weight
- Team B: calculate digit preference for MUAC
- Team C: age heaping in children
- Team D: age heaping in adults