Checking data quality.

Presentation transcript:

1 Checking data quality

2 Objectives
Duplicates, missing data and exact values
Checking ranges and legal values, and cleaning our data
Digit preference
Age heaping

3 Data quality
The tests in this session can be applied to all types of data.

4 Completeness
Missing data can introduce bias
Structural integrity of:
Clusters
Households
Individuals and children
Dates of birth

Missing data are often not missing at random. In a survey, it is often the most distant houses or the children who are hardest to measure that are missing. This can lead to bias and a non-representative sample. Assessing the completeness of reporting provides confidence in the survey and its implementation, so it is necessary to check the completeness of the data collected. In anthropometric surveys this means more than ensuring that all eligible children are accounted for: it includes structural integrity checks on all aspects of the data.

Clusters: all selected clusters are visited.
Households: all selected households in the clusters are interviewed or recorded as not interviewed (with the reason).
Household members: all household rosters are complete, with all household members listed and their key characteristics (age, sex and residency) provided.
Children: all eligible children are interviewed and measured, or recorded as not interviewed/measured (with the reason) in the dataset, with no duplicate cases.
Dates of birth: the date of birth is complete for all eligible children.

5 Completeness
Reported as a proportion: numerators and denominators, as well as the resulting ratios, should be presented
Easy to measure, and the results are easily presented as %
Can be done for each unit: clusters, households, children and dates of birth

Typically all clusters can be visited, but in surveys in which some clusters are not visited, the number of clusters not visited in each stratum should be provided in the report.

However, it is unfair to compare SMART, DHS and MICS against this control: SMART is a narrow-topic survey and its teams are dedicated to anthropometric indicators. Also, in the DHS and MICS all children are recorded, making it possible to see who is measured and who is not. In the NNS, however, there is no household roster. Information on children who were not part of the measurement sample is not captured, so it is not clear whether the complete denominator of eligible children for whom anthropometric data could have been collected was recorded.

\[
\text{Children with complete DoB} = \frac{\text{Number of children with day, month and year recorded}}{\text{Number of children interviewed}}
\]

6 Completeness

\[
\text{Children with complete DoB} = \frac{\text{Number of children with day, month and year recorded}}{\text{Number of children interviewed}}
\]

\[
\text{Missing sex data} = \frac{\text{Number of children with sex not recorded}}{\text{Number of children interviewed}}
\]

The proportion of missing data for all other variables, including those used in the calculation of anthropometric z-scores, should also be presented.

\[
\text{Missing weight data} = \frac{\text{Number of children with weight not recorded}}{\text{Number of children interviewed}}
\]
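
The proportions above can be sketched in Python as follows. This is a minimal illustration, not the course's own tooling: the column names (`sex`, `weight`, `muac`), the set of strings treated as "not recorded", and the sample records are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical markers treated as "not recorded"; adjust to the dataset's own coding.
MISSING_MARKERS = {"", "NA"}


def percent_missing(rows, column):
    """Return (numerator, denominator, percentage) of records where `column` is not recorded."""
    denominator = len(rows)
    numerator = sum(
        1 for row in rows if (row.get(column) or "").strip() in MISSING_MARKERS
    )
    percentage = 100.0 * numerator / denominator if denominator else float("nan")
    return numerator, denominator, percentage


# Made-up example records, not survey data.
SAMPLE = """sex,weight,muac
1,10.2,135
2,,128
,9.8,NA
1,11.4,140
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column in ("sex", "weight", "muac"):
    n, d, pct = percent_missing(rows, column)
    # Report numerator and denominator alongside the resulting percentage.
    print(f"{column}: {n}/{d} missing ({pct:.1f}%)")
```

Returning the numerator and denominator together with the percentage follows the advice on the previous slide that all three should be presented.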

7 Checking ranges and legal values
Sometimes data are clearly wrong:
Decimal points and commas are missing
Missing data are coded with a number
Sex is coded with letters and numbers
Etc.
We must edit data when they are clearly wrong

Checking that data are within an acceptable or plausible range is an important basic test to apply to quantitative data. See the examples in the next slides.

8 Checking ranges and legal values
These are data from a SMART survey in Angola. Let's look at the MUAC data. Some values are extreme but possible (underlined and in black), but the others are clearly wrong. Three of them are probably due to data being recorded in cm rather than mm; it is probably safe to change these three values to 111, 124, and 132. Three others are missing values that were coded with a number; it is safe to set these to missing using the special missing-value marker NA (the marker differs between software packages). These extreme values can be spotted easily by sorting the data in ascending and descending order.
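
The kind of cleaning described here could look like the sketch below. Everything specific in it is an assumption for illustration: the plausible MUAC range, the numeric missing-value code (`999`), and the raw values, which are made up to match the corrected values named above. Python's `nan` stands in for NA.

```python
import math

# Assumptions for illustration only: a plausible MUAC range in mm and a
# numeric code used for "missing". Neither value comes from the slides.
PLAUSIBLE_MM = (60.0, 250.0)
MISSING_CODE = 999.0  # hypothetical


def clean_muac(value):
    """Return MUAC in mm, or NaN when the value is a missing code or cannot be repaired."""
    if value is None or value == MISSING_CODE:
        return math.nan
    low, high = PLAUSIBLE_MM
    if low <= value <= high:
        return value
    # A value that only becomes plausible after multiplying by 10
    # was probably recorded in cm rather than mm.
    if low <= value * 10 <= high:
        return round(value * 10, 1)
    return math.nan


# Made-up raw values: three recorded in cm and one missing-value code.
raw = [135.0, 11.1, 12.4, 13.2, 999.0, 98.0]
cleaned = [clean_muac(v) for v in raw]

# Sorting puts suspicious values at either end of the list.
print(sorted(raw))
print(cleaned)
```

In practice each flagged value should be reviewed before it is changed; the automatic conversion is only reasonable when, as in the slide, the error is clearly a unit mistake.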

10 Exercise 1
Prepare the file ex01.csv
Calculate the percentage of missing sex data
Calculate the percentage of missing weight data
Calculate the percentage of missing MUAC data

Note that surveys that have already been validated should have had their legal values checked and dealt with, but the raw data should always be provided.

11 Digit Preference
Analysis of the rounding of weight, MUAC and height measurements.
Usually we notice an excessive number of values ending in 0 or 5 for height, weight and MUAC measurements.
Overall analysis and analysis by team.
Routine data or other data may also be examined for digit preference.

Digit preference is the observation that the final digit of a measurement occurs with a greater frequency than is expected by chance. MUAC is a good example: 119 recorded as 120. The same happens with weight, height, etc.

12 Digit Preference
Measurements in nutritional anthropometry surveys are usually taken and recorded to one decimal place.

13 Digit Preference
Can occur because of rounding or fake data
Graphs are a good way to see it, but to quantify it a more complex analysis is done: the digit preference score (DPS)

Note added by Julien, from the WHO report: "Approximately uniform distributions are not expected for these graphs, but extreme peaks should not be visible."

DPS value        Interpretation
0 ≤ DPS < 8      Excellent
8 ≤ DPS < 12     Good
12 ≤ DPS < 20    Acceptable
DPS ≥ 20         Problematic

14 Digit Preference score (DPS)
Numerical calculation of digit preference
Index of dissimilarity
The index is calculated automatically. Don't worry about the formula.

Where:
Actual percentage_i = the percentages for the terminal digits in the survey (e.g. number of height measurements with a terminal digit of zero / all height measurements)
Expected percentage_i = the percentages of the expected distribution (i.e. 10% for each terminal digit)
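
The formula itself is not preserved in this transcript. The sketch below assumes the usual definition of an index of dissimilarity, half the sum over the ten terminal digits of the absolute difference between the actual and expected percentages, using the two terms defined above. The function names and the sample values are made up for illustration. Note that a score computed this way may not be on the same scale as the score the SMART thresholds were defined for, so treat the comparison with caution.

```python
from collections import Counter


def terminal_digits(values, decimals=1):
    """Last recorded digit of each measurement, assuming `decimals` decimal places were recorded."""
    return [round(v * 10 ** decimals) % 10 for v in values]


def index_of_dissimilarity(values, decimals=1):
    """Half the sum of |actual % - expected %| over the ten terminal digits (assumed definition)."""
    digits = terminal_digits(values, decimals)
    if not digits:
        raise ValueError("no measurements supplied")
    counts = Counter(digits)
    n = len(digits)
    expected = 10.0  # 10% expected on each terminal digit
    total = sum(abs(100.0 * counts.get(d, 0) / n - expected) for d in range(10))
    return total / 2


# Made-up weights (kg): every value ends in .0 or .5, i.e. strong rounding.
heaped = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]
# Made-up weights with each terminal digit appearing exactly once.
even = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9]

print(index_of_dissimilarity(heaped))
print(index_of_dissimilarity(even))
```

For MUAC recorded in whole millimetres, `decimals=0` takes the last digit of the integer value instead.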

15 DPS interpretation
WHO does not provide recommendations
From SMART:

DPS value        Interpretation
0 ≤ DPS < 8      Excellent
8 ≤ DPS < 12     Good
12 ≤ DPS < 20    Acceptable
DPS ≥ 20         Problematic

Check histograms first. Digit preference can be an indication of data fabrication or of inadequate care and attention during data collection and recording. Identifying which digits are over-represented can provide some insight into the type of error. For example, if the frequency distribution shows marked preference for the digits 0 and/or 5, it may indicate that measurers were rounding. If the preference is for digits other than 0 and 5, it is possible that the data have been fabricated.

After the histograms, check the DPS. The lowest possible Digit Preference Score is 0, which is also the ideal value of the index. WHO does not provide any guidance beyond saying that the ideal value of the index is 0. The SMART thresholds are not widely accepted but can be used as an indication.
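
The SMART cut-offs listed on this slide can be applied with a small helper such as the one below. The function name is hypothetical; the thresholds and labels are taken directly from the slide.

```python
def interpret_dps(dps):
    """Map a digit preference score to the SMART interpretation given on the slide."""
    if dps < 0:
        raise ValueError("DPS cannot be negative")
    if dps < 8:
        return "Excellent"
    if dps < 12:
        return "Good"
    if dps < 20:
        return "Acceptable"
    return "Problematic"


# Made-up scores, one per band.
for score in (3.5, 9.0, 15.2, 27.0):
    print(score, interpret_dps(score))
```

Since these cut-offs are only indicative, the label is best reported next to the score and the histogram rather than instead of them.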

16 Age Heaping
Tendency to report children's ages to the nearest year
Or selection bias towards older/younger children
Poor records, maternal recall, event calendar
Often more heaping is observed for older children
Mortality and fertility rates can modify it

Age heaping is the tendency to report children's ages to the nearest year, or adults' ages to the nearest multiple of five or ten years. Age heaping is very common. This is a major reason why data from nutritional anthropometry surveys are often analysed and reported using broad age groups. The effect matters most when rounding is systematically up or systematically down, because systematic rounding can lead to bias. If rounding is systematically down, indices will be biased upwards and prevalence biased downwards. If rounding is systematically up, indices will be biased downwards and prevalence biased upwards.

17 Age Heaping
Histograms are a good way to check it

Age heaping is very common and can affect many nutrition indicators. We expect all ages to be present with roughly equal frequency, or with frequency reducing slowly with age due to mortality. In this example there is marked age heaping at 12, 18, 24, 30, 36, and 48 months. This is very common when age is reported by mothers, because mothers and other carers tend to round ages to whole years or half years. Histograms are a good way to present and interpret age heaping. Chi-square tests can also be done.

January 2019 Addis Ababa
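
One simple way to set up such a chi-square test is sketched below. It is an illustration under stated assumptions, not the method used in the course: ages in whole months are grouped by their remainder after division by 12, and every remainder is assumed to be equally likely, which ignores the slow decline in frequency with age mentioned above. The function name and the sample ages are made up.

```python
from collections import Counter


def age_heaping_chi_square(ages_in_months, period=12):
    """Pearson chi-square statistic for remainders of age modulo `period`.

    Assumes every remainder is equally likely when there is no heaping.
    Returns (statistic, degrees_of_freedom).
    """
    if not ages_in_months:
        raise ValueError("no ages supplied")
    observed = Counter(age % period for age in ages_in_months)
    expected = len(ages_in_months) / period
    statistic = sum(
        (observed.get(r, 0) - expected) ** 2 / expected for r in range(period)
    )
    return statistic, period - 1


# Made-up ages: one child at every month from 0 to 59 (no heaping).
no_heaping = list(range(60))
# Made-up ages: every child reported at a whole year (extreme heaping).
whole_years = [12, 24, 36, 48] * 3

print(age_heaping_chi_square(no_heaping))
print(age_heaping_chi_square(whole_years))
```

A large statistic relative to the chi-square distribution with the returned degrees of freedom suggests heaping; the histogram then shows at which ages it occurs.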

19 Conclusions
% of missing data is a useful indicator of data quality
Rounding can create poor data quality; we can check the digit preference of MUAC, weight and height using the DPS
Age heaping is a common occurrence: it can be checked with histograms or with a chi-square test

Don't worry if you don't know how to do a chi-square test. We will do that later!

20 Exercise 2
Divide into 4 groups

The files ex02a.csv and ex02c.csv are CSV files containing anthropometric data for children in a single state of a West African country from a Demographic and Health Survey (DHS).
The file ex02b.csv is a CSV file containing routine anthropometric data from CMAM programs during 2018 in Dadaab Camp.
The file ex01d.csv is a CSV file containing anthropometric data from a Rapid Assessment Method for Older People (RAM-OP) survey in the Dadaab refugee camp in Garissa, Kenya. This is a survey of people aged sixty years and older.

21 Exercise 2
Team A: calculate DP for height and weight
Team B: calculate DP for MUAC
Team C: age heaping in children
Team D: age heaping in adults

