Extra Anthropometric data quality checks



Objectives: Dispersion Index, Normality, Skewness, Kurtosis, SAM/MAM ratio, Mean Z-score.

The tests in this session are often controversial and should be used with caution, but they may give you ideas for future research.

Dispersion. Use in surveys with clusters. Examines the heterogeneity of the population: the number of malnourished children by cluster should follow a distribution statistically known as Poisson. Measures of dispersion summarise how cases (e.g. children classified as wasted, stunted, or underweight) are distributed across a survey's primary sampling units (e.g. clusters). We compare the data distribution to the Poisson distribution and test for a significant difference. If the data follow a Poisson distribution, cases are randomly distributed among clusters; therefore, some clusters will have no malnutrition, some will have 1 case, others 2, and so on. However, after reaching a certain threshold, the number of clusters that have more cases will start decreasing (see next). If the data do not follow a Poisson distribution, the sample is heterogeneous, with "malnutrition pockets".

This graph shows the number of cases by cluster for the survey population. If malnutrition is random in a population (Poisson distribution), we expect to find some clusters that have more cases than others.

Poisson distribution. No significant difference: homogeneous clusters. p < 0.05 (probably significant difference): heterogeneous clusters. Things change if the distribution of the cases is not Poisson. This can indicate that the population from which the sample was drawn is heterogeneous, with "pockets of malnutrition" and areas that are spared. These problems might be caused by the design of the survey, non-random selection of the villages that contain the clusters, a biased selection of households in some areas, or excessive heterogeneity in the surveyed population. If the data do not follow a Poisson distribution, there will also be a larger than usual design effect. These two statistics give complementary information.

Index of Dispersion. Tests for random distribution or aggregation of cases over the clusters (pockets of malnutrition). There are 3 options. Uniform distribution: ID < 1. Random distribution: ID = 1. Aggregated distribution: ID > 1. As with other tests, we can summarise the dispersion through an index. The simplest and most often used index of dispersion is the variance-to-mean ratio. Its value can range between zero (maximum uniformity) and the total number of cases in the data (maximum clumping). Maximum uniformity is found when the same number of cases is found in every primary sampling unit. Maximum clumping is found when all cases are found in one primary sampling unit. Another measure is Green's Index of Dispersion.
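
The variance-to-mean ratio described above can be sketched in a few lines. This is only an illustration: it assumes the sample variance (n - 1 denominator), the function name and the cluster counts are hypothetical, and survey software may compute the index differently.

```python
from statistics import mean, variance


def index_of_dispersion(cases_per_cluster):
    """Variance-to-mean ratio of the number of cases per cluster."""
    average = mean(cases_per_cluster)
    if average == 0:
        raise ValueError("no cases in any cluster: the ratio is undefined")
    # statistics.variance is the sample variance (n - 1 denominator).
    return variance(cases_per_cluster) / average


# Hypothetical counts of wasted children in four clusters (8 cases in total).
uniform = [2, 2, 2, 2]  # the same number of cases in every cluster
clumped = [8, 0, 0, 0]  # all cases in a single cluster

print(index_of_dispersion(uniform))  # 0.0, maximum uniformity
print(index_of_dispersion(clumped))  # 8.0, the total number of cases
```

One common way to obtain a p-value is to multiply the index by the number of clusters minus one and compare the result with a chi-square distribution with that many degrees of freedom; whether a given software package does exactly this is an assumption to check.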

Index of Dispersion. [Figure: example patterns of random, uniform and aggregated distributions.] Notes for trainers: If ID < 1 and p < 0.05, cases are uniformly distributed among clusters. If ID > 1 and p < 0.05, cases are aggregated in certain clusters. If we notice that the ID for edema is higher than 1 with p < 0.05, but this is not the case for WHZ, we can think that the aggregation of GAM and SAM cases is due to the inclusion of edema in GAM and SAM estimates (Michael Golden, 2008). When cases are aggregated, it is important to look at the analysis by teams in more detail in order to find out whether the same team was over-reporting cases of malnutrition.

Quick exercise: the ID for WHZ < -2 is 1.33 and p > 0.05. What can we assume?

Quick exercise: the ID for WHZ < -2 is 1.33 and p > 0.05. Therefore, we can assume that the distribution of the cases of wasting in this survey was random.
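
The trainers' decision rule can be written as a small function, which makes the role of the p-value explicit: the index only counts as evidence of uniformity or aggregation when the test is significant. The function name is hypothetical, and the p-value of 0.20 is an assumed stand-in for the exercise's "p > 0.05".

```python
def interpret_dispersion(index, p_value, alpha=0.05):
    """Classify the distribution of cases across clusters.

    Follows the rule in the notes: ID < 1 with p < 0.05 means uniform,
    ID > 1 with p < 0.05 means aggregated, otherwise treat as random.
    """
    if p_value >= alpha:
        return "random"
    if index > 1:
        return "aggregated"
    if index < 1:
        return "uniform"
    return "random"


# Exercise values: ID = 1.33 with a non-significant p-value (0.20 assumed).
print(interpret_dispersion(1.33, 0.20))  # random
```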

Normality. A normal distribution is an ideal symmetric bell-shaped curve. Anthropometric variables (e.g. weight, height, and MUAC) and anthropometric indices (e.g. WHZ, HAZ, and WAZ) tend to be normally distributed. Understanding the shape of the frequency distribution can provide insights into the survey population and into the quality of the data. It is generally assumed that survey populations will have a normal distribution and that the distribution will shift depending on the level of malnutrition of the population. However, the distribution of malnourished populations can depart from normality, especially when many inequities exist or when severe forms of malnutrition are prevalent, without necessarily indicating data quality issues.

Graphical and numerical summaries. The first way of assessing whether a variable is normally distributed is a simple "by-eye" assessment using histograms. Graphical methods are often more informative than numerical summaries. A key graphical method for examining the distribution of a variable is the histogram. The shape of the distributions for HAZ, WHZ and WAZ should be visualized using histograms. [Figure: histograms of WAZ and WHZ from a SMART survey in Kabul, Afghanistan. These show nearly symmetrical "bell-shaped" distributions.]

Graphical summaries. [Figure: histograms of WAZ and WHZ from a SMART survey in Kabul, Afghanistan.]

Numerical summaries. Shapiro-Wilk test for WHZ, WAZ and HAZ: assesses whether the data distribution differs significantly from a normal distribution. p < 0.05: data are not normally distributed; find out the reason. p > 0.05: no significant departure from normality, so the data can be treated as normally distributed and skewness and kurtosis can be ignored.

Another way of assessing normality is to use a formal statistical significance test. The preferred test is the Shapiro-Wilk test of normality. We need to be careful when using significance tests such as the Shapiro-Wilk test because the results can be strongly influenced by the sample size. Small sample sizes can lead to tests missing large effects, and large sample sizes can lead to tests identifying small effects as highly significant. If a distribution appears to be normal (i.e. has a symmetrical, or nearly symmetrical, "bell-shaped" distribution), then it is usually safe to assume normality and to use statistical procedures that assume normality. Formal tests for normality can be misleading when sample sizes of more than a few hundred cases are used. Graphical methods are not very useful when sample sizes are small. Formal tests are not very useful when sample sizes are large. The sample sizes of most anthropometry surveys will be large enough to cause formal tests for normality to identify small deviations from normality as highly significant.

Quick exercise: the result of the Shapiro-Wilk test for normally (Gaussian) distributed data for W/H, when excluding SMART flags, was p = 0.075.

Quick exercise: since p is higher than 0.05, we can assume that the data for weight-for-height were normally distributed.

Skewness. Measures asymmetry. If the distribution is symmetrical, the value of skewness is 0. The value of skewness should lie between -1 and +1. A coefficient too far from the -1 to +1 range suggests a problem with the heterogeneity of the population. A normal distribution, which is perfectly symmetrical, has a skewness value of zero, with an equal distribution in the right and left tails. We can usually see skew in histograms. We can also calculate and test a skewness statistic. January 2019, Addis Ababa.

Skewness. Measures asymmetry. A normal distribution, which is perfectly symmetrical, has a skewness value of zero, with an equal distribution in the right and left tails. While there is no defined cut-off, a general rule of thumb is that a coefficient of skew < -0.5 or > +0.5 is indicative of skewness. Skewed data are not necessarily due to poor quality of data collection. If the data are greatly skewed, then great care needs to be taken with interpretation. It is likely that there are distinct subgroups within the population that should have been identified and surveyed separately during the planning phase of the survey.
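
A minimal sketch of a skewness coefficient, assuming the simple moment definition (third central moment divided by the second central moment to the power 1.5). Statistical packages often apply a small-sample correction, so their values can differ slightly; the function names and data are hypothetical.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    centre = fmean(values)
    deviations = [v - centre for v in values]
    m2 = fmean(d ** 2 for d in deviations)  # second central moment
    m3 = fmean(d ** 3 for d in deviations)  # third central moment
    return m3 / m2 ** 1.5  # undefined if all values are identical


def is_skewed(values, cutoff=0.5):
    """Rule of thumb from the notes: skew < -0.5 or > +0.5."""
    return abs(skewness(values)) > cutoff


symmetric = [-2, -1, 0, 1, 2]
right_tailed = [0, 0, 0, 0, 10]
print(skewness(symmetric), is_skewed(symmetric))
print(skewness(right_tailed), is_skewed(right_tailed))
```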

Kurtosis. Measures the "peakedness" of the distribution. Normal distribution: kurtosis = 3. The value of kurtosis should lie between 2 and 4. A coefficient too far from the 2 to 4 range suggests a problem with the quality of the data. Kurtosis is a measure of how much a distribution is concentrated about the mean. Relative to the normal distribution (that is, after subtracting 3), kurtosis can be zero, positive, or negative. Zero is found when a variable is normally distributed. Positive kurtosis is found when the mass of the distribution is concentrated about the mean and there are very few values far from the mean. Negative kurtosis is found when the mass of the distribution is concentrated in the tails of the distribution. We can usually see kurtosis in histograms. We can also calculate and test a kurtosis statistic.

Kurtosis. Positive kurtosis: relatively sharp distribution. Positive kurtosis is often generated by large numbers of outlying values; this can occur from errors during data collection or data entry into ENA. Negative kurtosis: relatively flattened distribution. Less common. It can indicate that data have been "over-cleaned" or that the teams have not taken values that they themselves think might be extreme, so that there are far too many values clustered around the mean value. While there is no defined cut-off, in general a kurtosis < 2 or > 4 is indicative of kurtosis. When kurtosis is greater than 4, the distribution is sharply peaked and there are more extreme values in the tails than expected in a normal distribution. When kurtosis is less than 2, the distribution is flattened and the tails are relatively short.

How to present. Always provide histograms. Research the reason for non-normality. Check the tails of the HAZ, WHZ, and WAZ distributions: did they end smoothly or abruptly? Report skew if it is < -0.5 or > +0.5, and kurtosis if it is < 2 or > 4.

Given that it is unclear what a departure from normality represents for HAZ, WHZ or WAZ (it may represent malnourished populations with high levels of inequity and/or high levels of severe forms of malnutrition, issues related to data quality, or a combination of the two), it is not possible to give advice on what the shape of the distribution means in any given survey until research is undertaken in this area. Check that the tails of the HAZ, WHZ, and WAZ distributions end smoothly and not abruptly. If a distribution ends abruptly, this may be indicative of data quality issues. In addition, as the kurtosis of a standard normal distribution is 3, some formulas subtract 3 from the kurtosis value so that the standard normal distribution has a kurtosis of 0; these formulas give the "excess kurtosis". When such formulas are used, values < -1 or > +1 are indicative of kurtosis. If skew or kurtosis values fall outside these ranges, it could be useful to calculate the coefficients of skew and kurtosis by other disaggregations. Most software packages calculate these statistics automatically.
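
The two kurtosis conventions differ only by the constant 3, which the sketch below makes explicit. It assumes the simple moment definition (fourth central moment divided by the squared second central moment); packages may apply small-sample corrections, and the function names and data are hypothetical.

```python
from statistics import fmean


def kurtosis(values):
    """Moment coefficient of kurtosis, m4 / m2 ** 2 (normal distribution: 3)."""
    centre = fmean(values)
    deviations = [v - centre for v in values]
    m2 = fmean(d ** 2 for d in deviations)  # second central moment
    m4 = fmean(d ** 4 for d in deviations)  # fourth central moment
    return m4 / m2 ** 2  # undefined if all values are identical


def excess_kurtosis(values):
    """Kurtosis minus 3, so that a normal distribution scores 0."""
    return kurtosis(values) - 3.0


def kurtosis_flag(values):
    """Rule of thumb from the notes: kurtosis < 2 or > 4."""
    k = kurtosis(values)
    return k < 2 or k > 4


flat = [-2, -1, 0, 1, 2]  # evenly spread values, no peak
print(kurtosis(flat), excess_kurtosis(flat), kurtosis_flag(flat))
```

Checking which convention a package reports matters: a value of 0.4 is unremarkable as excess kurtosis but would be far outside the 2 to 4 range if read as plain kurtosis.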

Understanding the shape of the frequency distribution can provide insights into the survey population and into the quality of the data. The WHO Child Growth Standards, which were based on a sample of healthy children living in an environment that did not constrain growth, had a normal distribution for each of the anthropometry z-scores. It is generally assumed that survey populations will have a normal distribution and that the distribution will shift depending on the level of malnutrition of the population. However, the distribution of malnourished populations can depart from normality, especially when many inequities exist or when severe forms of malnutrition are prevalent (e.g. severe stunting is high, or overweight is a larger problem in certain subpopulations), without necessarily indicating data quality issues. As such, conclusions cannot be drawn about the quality of the data based on values of skewness or kurtosis alone. Conversely, deviations from normality in the context of other problematic data quality checks should flag concern. Further research is required to understand distribution patterns for populations with different patterns of malnutrition, and also to understand the extent to which values of skewness and kurtosis that deviate from normality represent data quality issues.

Analysis by teams. Number of children. Proportion of flags. Age ratio. Sex ratio. Digit preference (weight, height and MUAC). Standard deviation. Not often possible in old surveys.

Problems with measurements usually do not involve all the teams. Often one poorly trained team or team member affects the overall results of the survey. If any particular team has obtained data that are statistically different from those of the other teams (digit preference, standard deviation), it is likely that this team's technique has created a systematic bias. If this happens, and if there is time, the aberrant team's clusters should be re-sampled using a different team and the new data substituted for the aberrant data. If the second team gets data that are similar to the original team's data, then there is probably a real difference between the particular clusters assigned to that team and the remainder of the clusters. In that case the original data should be retained, and the design effect will be unusually high. If the second team's data are very different from the original data, this confirms that there was a systematic bias in the work of the first team. If re-sampling is not feasible within a reasonable time, then the data should be analyzed with and without the aberrant clusters, and both results reported with a recommendation from the survey manager indicating which result is likely to be more reliable. There has to be a full report of such occurrences and how they were resolved (e.g. perhaps the team's equipment was faulty or their training was inadequate). When examining the SD for the teams, we should first look at the number of children measured by each team, since one team might have had just a few clusters, and this could have influenced its SD. Since each team has a small number of clusters, we cannot expect a ratio of 1 for the sex or age distribution.
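
Looking at the number of children before the SD, as the notes recommend, is easy to build into a per-team summary. This is a minimal sketch; the function name, the record layout (team identifier, WHZ value) and the values are hypothetical.

```python
from collections import defaultdict
from statistics import stdev


def sd_by_team(records):
    """Map each team to (number of children, SD of WHZ).

    The SD is None when a team measured fewer than two children.
    """
    whz_by_team = defaultdict(list)
    for team, whz in records:
        whz_by_team[team].append(whz)
    return {
        team: (len(values), stdev(values) if len(values) > 1 else None)
        for team, values in whz_by_team.items()
    }


# Hypothetical records: (team number, WHZ).
records = [(1, -1.0), (1, 0.0), (1, 1.0), (2, -0.5)]
for team, (n_children, sd) in sorted(sd_by_team(records).items()):
    print(team, n_children, sd)
```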

SAM/MAM ratio. Fixed relation between MAM and SAM, which depends on the Z-score mean and the Z-score SD. It is another way of assessing the quality of the survey data.

Normally there is a fixed relationship between moderate and severe wasting, depending upon the degree of malnutrition within the community. If there is an excess of severe wasting (SAM without oedema) relative to moderate plus severe wasting (GAM) or to MAM, this is an indication that the measurements have been taken poorly. It is often assumed that this ratio is constant, but this is not the case. In a normal, non-malnourished population one would expect about 16 moderately wasted cases for every severe case. If the population mean WHZ is -0.5 Z-scores, then for every severe case there will be ten moderate cases; this is what is normally observed under field conditions. As the population deteriorates to a mean of -1.0 Z, there will be six MAM cases for each SAM case. A population mean of -2.0 Z is very rare, as it means that the prevalence of GAM will be 50% of the children; in these dire situations, there will be only two MAM cases for each SAM case. The ratio has important implications for the effort that is directed to treating each degree of wasting, and hence for the design of implementation programmes. If the ratio differs markedly from those shown in the table on the next slide, then the data reported in the survey are suspect.

WHZ     GAM/SAM                     MAM/SAM
mean    SD 0.8   SD 1.0   SD 1.2    SD 0.8   SD 1.0   SD 1.2
 0.0    70.2     16.9     7.7       69.2     15.9     6.7
-0.1    60.7     15.4     7.2       59.7     14.4     6.2
-0.2    52.5     14.1     6.8       51.5     13.1     5.8
-0.3    45.5     12.9     6.4       44.5     11.9     5.4
-0.4    39.4     11.8     6.0       38.4     10.8     5.0
-0.5    34.2     10.8     5.7       33.2      9.8     4.7
-0.6    29.7      9.9     5.3       28.7      8.9     4.3
-0.7    25.8      9.0     5.0       24.8      8.0     4.0
-0.8    22.4      8.3     4.8       21.4      7.3     3.8
-0.9    19.5      7.6     4.5       18.5      6.6     3.5
-1.0    17.0      7.0     4.2       16.0      6.0     3.2
-1.1    14.8      6.4     4.0       13.8      5.4     3.0
-1.2    13.0      5.9     3.8       12.0      4.9     2.8
-1.3    11.4      5.4     3.6       10.4      4.4     2.6
-1.4    10.0      5.0     3.4        9.0      4.0     2.4
-1.5     8.8      4.6     3.2        7.8      3.6     2.2
-1.6     7.7      4.3     3.0        6.7      3.3     2.0
-1.7     6.8      3.9     2.9        5.8      2.9     1.9
-1.8     6.0      3.7     2.7        5.0      2.7     1.7
-1.9     5.3      3.4     2.6        4.3      2.4     1.6
-2.0     4.7      3.2     2.5        3.7      2.2     1.5
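
The dependence of the ratio on the mean and SD can be illustrated by assuming that WHZ is normally distributed, with SAM defined as WHZ < -3 and GAM as WHZ < -2 (oedema is ignored). This is a sketch under that assumption, not necessarily the method used to produce the table; the function name is hypothetical.

```python
from statistics import NormalDist


def wasting_ratios(mean_whz, sd_whz):
    """Return (GAM/SAM, MAM/SAM) for a normally distributed WHZ."""
    whz = NormalDist(mu=mean_whz, sigma=sd_whz)
    sam = whz.cdf(-3.0)  # proportion with WHZ < -3
    gam = whz.cdf(-2.0)  # proportion with WHZ < -2
    mam = gam - sam      # proportion with -3 <= WHZ < -2
    return gam / sam, mam / sam


# Ratios for an SD of 1.0 at the mean WHZ values discussed in the notes.
for mean_whz in (0.0, -0.5, -1.0, -2.0):
    gam_sam, mam_sam = wasting_ratios(mean_whz, 1.0)
    print(f"{mean_whz:+.1f}  {gam_sam:5.1f}  {mam_sam:5.1f}")
```

Because MAM is GAM minus SAM, the MAM/SAM ratio is always the GAM/SAM ratio minus one, so a survey can be checked against either.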

Exercise 5. Divide into 4 groups. The file ex05.csv is a comma-separated-value (CSV) file containing anthropometric data from a SMART survey in Kabul, Afghanistan. Provide histograms for WHZ, HAZ and WAZ. Calculate the Shapiro-Wilk test. Calculate skewness and kurtosis. Use this online calculator: http://www.statskingdom.com/320ShapiroWilk.html