Data Summarization
Data are summarized using either:
1-Measurements of central tendency (average measurements)
2-Measurements of variability (dispersion measurements)
Measures of Central Tendency
What is central tendency? The “middle” or “center” of a variable’s distribution: a single score that best describes the entire distribution.
How is it calculated?
1. Mode
2. Median
3. Mean
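The three measures can be illustrated with a minimal sketch using Python's standard `statistics` module. The scores below are hypothetical and are not taken from any study; they are chosen only to show how the three measures can differ.

```python
from statistics import mean, median, multimode

# Hypothetical DMFT scores for nine patients (illustrative data only).
scores = [2, 3, 3, 4, 5, 3, 6, 2, 8]

mode_values = multimode(scores)  # most frequent value(s), returned as a list
median_value = median(scores)    # middle value once the data are sorted
mean_value = mean(scores)        # sum of the values divided by their count

print("Mode:", mode_values)      # [3]
print("Median:", median_value)   # 3
print("Mean:", mean_value)       # 4
```

Here the mean (4) is higher than the median (3) because the single large score of 8 pulls the mean upwards, while the median and mode are unaffected by it.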
Measurements of variability: The degree to which numerical (quantitative) data tend to spread about an average value is called the variation or dispersion of the data. Variability is inherent in data: observations always vary rather than all taking a single value. There are various measures of variation or dispersion, but the most commonly used are:
1-Range:
The uses of the range:
2-Interquartile range (IQR): The interquartile range is a measure of where the “middle fifty” percent of a data set lies. Whereas the range measures where the beginning and end of a data set are, the interquartile range measures where the bulk of the values lie. Because it ignores the extreme values at either end, it is often preferred over the range as a measure of spread.
The IQR formula is: IQR = Q3 - Q1, where Q3 is the upper quartile and Q1 is the lower quartile.
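A minimal sketch of both measures, assuming the usual definition of the range as the largest value minus the smallest. The five haemoglobin values are illustrative, and the choice of `method="inclusive"` is an assumption: different quartile conventions can give slightly different values of Q1 and Q3.

```python
from statistics import quantiles

# Five illustrative haemoglobin levels (g/dL).
values = [8, 9, 10, 11, 12]

# Range: largest value minus smallest value (assumed definition).
data_range = max(values) - min(values)

# Quartiles: n=4 splits the data into four parts and returns Q1, Q2, Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1

print("Range:", data_range)  # 4
print("Q1:", q1, "Q3:", q3)  # 9.0 11.0
print("IQR:", iqr)           # 2.0
```

Replacing the largest value (12) with a much larger one would increase the range but leave the IQR unchanged, which illustrates why the IQR is less sensitive to extreme values.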
3-Variance: The variance is defined as the average of the squared deviations of the observations from their mean. Because the deviations are squared, the variance is expressed in the square of the original units (for example, m² for a length measured in metres). Squared units are awkward to interpret, so the variance is usually quoted without units.
Haemoglobin level (g/dL), X    Deviation, d = (X - X̄)    d² = (X - X̄)²          X²
8                              8 - 10 = -2               4                      64
9                              9 - 10 = -1               1                      81
10                             10 - 10 = 0               0                      100
11                             11 - 10 = +1              1                      121
12                             12 - 10 = +2              4                      144
ΣX = 50                        Σd = Σ(X - X̄) = 0         Σd² = Σ(X - X̄)² = 10   ΣX² = 510
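The totals in the table can be checked with a short sketch. Two points are assumptions rather than statements from the notes. First, the text says “average”, which suggests dividing by n, but many texts divide by n - 1 for a sample, so both are shown. Second, the X² column is assumed to be there for the shortcut formula Σd² = ΣX² - (ΣX)²/n.

```python
from statistics import pvariance, variance

hb = [8, 9, 10, 11, 12]  # haemoglobin levels (g/dL) from the table
n = len(hb)
mean_hb = sum(hb) / n    # 50 / 5 = 10.0

deviations = [x - mean_hb for x in hb]
sum_d = sum(deviations)                    # 0.0, as in the table
sum_d2 = sum(d ** 2 for d in deviations)   # 10.0, sum of squared deviations

# Shortcut using the X² column (assumed purpose of that column).
sum_x2 = sum(x ** 2 for x in hb)               # 510
sum_d2_shortcut = sum_x2 - sum(hb) ** 2 / n    # 510 - 2500/5 = 10.0

population_variance = sum_d2 / n       # divide by n      -> 2.0
sample_variance = sum_d2 / (n - 1)     # divide by n - 1  -> 2.5

# Cross-check against the standard library.
print(population_variance, pvariance(hb))
print(sample_variance, variance(hb))
```

Which denominator applies depends on whether the five values are treated as the whole population or as a sample from it.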
4-Standard deviation: The SD is defined as the square root of the variance; it can be thought of as the typical distance of the observations from their mean. It is expressed in the same units as the original data, and it is the most widely used measure of variability in biostatistics. A high SD means that the data vary widely around the mean; a low SD means that the observations lie close to the mean.
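As a rough illustration of how the SD reflects spread, the sketch below compares two data sets with the same mean. The second data set is hypothetical, and the n - 1 (sample) denominator is an assumption; the helper `sample_sd` is written here for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

narrow = [8, 9, 10, 11, 12]  # five haemoglobin levels (g/dL), closely grouped
wide = [6, 8, 10, 12, 14]    # hypothetical values: same mean, more spread

def sample_sd(values):
    """Square root of the sample variance (n - 1 denominator, an assumption)."""
    m = mean(values)
    squared_deviations = sum((x - m) ** 2 for x in values)
    return sqrt(squared_deviations / (len(values) - 1))

sd_narrow = sample_sd(narrow)
sd_wide = sample_sd(wide)

print(f"Mean {mean(narrow)}, SD {sd_narrow:.2f}")  # Mean 10, SD 1.58
print(f"Mean {mean(wide)}, SD {sd_wide:.2f}")      # Mean 10, SD 3.16
```

Both sets have a mean of 10 g/dL, but the SD of the second is twice that of the first, reflecting its wider scatter around the mean.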
Presentation of Data
Data collected and compiled from different types of epidemiological studies are raw data. They are unsorted and are of little help in understanding the underlying trends or their meaning.
So, the next step after data collection is to sort and classify the data into characteristic groups or classes, for example according to age, sex, social class, number of DMFT, etc. The objective of classifying data is to make the data simple, concise, meaningful, interesting and helpful in further analysis.
There are two main methods of presenting data:
1. Tabulation
2. Diagrams
1. Tabulation: The benefits of presenting data in tables are:
The basic rules to be followed when constructing a frequency distribution table are:
Example: Distribution of study group according to gender and age

Study group categories    Number    Percentage (%)
Gender
  Male                    87        50.9
  Female                  84        49.1
Age category
  30-39 years             17        9.9
  40-49 years             29        17.0
  50-59 years             64        37.4
  60-69 years             61        35.7
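The percentage column in a table like this is each count divided by the group total, multiplied by 100. A minimal sketch using the counts from the example; the helper name `percentage_table` is hypothetical:

```python
def percentage_table(counts):
    """Return {category: (count, percentage)} with percentages to one decimal."""
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# Counts taken from the example table.
gender = {"Male": 87, "Female": 84}
age = {"30-39 years": 17, "40-49 years": 29, "50-59 years": 64, "60-69 years": 61}

for title, counts in (("Gender", gender), ("Age category", age)):
    print(title)
    for category, (n, pct) in percentage_table(counts).items():
        print(f"  {category:<12} {n:>4} {pct:>6}")
```

Both groupings sum to the same total of 171 participants, which is a quick consistency check worth making on any frequency table.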
2. Diagrams: Arranging the data into tables simplifies the entire mass of data, but it is sometimes difficult to understand and compare two or more tables. Diagrams and graphs are among the most convincing and appealing ways of depicting statistical results. They are extremely useful because they are attractive to the eye, give a bird's-eye view of the entire data, leave a lasting impression on the mind of the layman, and facilitate comparison of data relating to different time periods and regions.
The basic rules in the construction of diagrams and graphs are:
Types of diagrams: Depending on the nature of the data, whether it is qualitative or quantitative, the following diagrams may be chosen:
Bar diagram: This diagram is used to represent qualitative data.
Pie diagram:
Line diagram: This diagram is useful for studying changes in the value of a variable over time. Time (hours, days, weeks, months or years) is represented on the X-axis, and the value of the quantity being measured is represented on the Y-axis.
Histogram: This diagram is used to depict quantitative data of the continuous type. A histogram is a bar diagram without gaps between the bars, and it represents the frequency distribution. It is constructed as follows: the class intervals are marked on the X-axis and the frequencies on the Y-axis. A rectangle is drawn above each class interval with a height proportional to the frequency of that interval (assuming class intervals of equal width).
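The construction can be sketched in a few lines: count the values falling in each class interval, then draw one bar per interval. The ages below are hypothetical, the function name `class_frequencies` is illustrative, and the bars are printed as text rather than drawn.

```python
def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each class interval [lower, lower + width)."""
    freqs = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_classes:  # values outside the classes are ignored
            freqs[idx] += 1
    return freqs

# Hypothetical ages (years) of 12 patients; illustrative data only.
ages = [31, 36, 42, 45, 47, 51, 53, 55, 58, 59, 62, 68]
freqs = class_frequencies(ages, start=30, width=10, n_classes=4)

# One bar per class interval; bar length is proportional to the frequency.
for i, f in enumerate(freqs):
    lower = 30 + 10 * i
    print(f"{lower}-{lower + 9} years | {'#' * f} ({f})")
```

With these ages the frequencies are 2, 3, 5 and 2, so the tallest bar sits over the 50-59 years class.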
Cartograms or spot maps: These maps are used to show the geographical distribution of the frequency of a characteristic.
Figure 6: (a) Dental caries levels (DMFT) of 12-year-olds worldwide. (b) Dental caries levels (DMFT) of 35–44-year-olds worldwide (2003).