Descriptive Statistics


1 Descriptive Statistics

2 Major Points
Measures of Central Tendency: Mean, Median, Mode
Measures of Position: Quartiles, Deciles, Percentiles
Measures of Dispersion: Range, Variance, Standard Deviation, Coefficient of Variation, Interquartile Range
Other descriptive measures: Geometric mean, Weighted mean

3 Measures of Central Tendency
Commonly known as averages: an average is a numerical value that indicates the middle point or central region of the raw data. The three most frequently used measures of central tendency are the mode, the median and the mean.

4 Why use measures of Central Tendency?
Mathematically summarize data in order to make appropriate comparisons. For example, you want to describe the age of students attending King Saud University, so you randomly ask 1000 students for their age.
Age:        19   20   21   22   23   25   27   29   32   40
Frequency:  159  219  172  146  123  83   48   16   20   14

5 Mean Example Ten PHL 541 students received a score out of 20 on their quiz: 9, 10, 12, 13, 15, 15, 15, 16, 18, 19. $\sum X / n = (9 + 10 + 12 + 13 + 15 + 15 + 15 + 16 + 18 + 19)/10 = 142/10 = 14.2$. Therefore, the mean of this sample is 14.2.
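
The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the lecture.

```python
# Quiz scores from the slide (each out of 20).
scores = [9, 10, 12, 13, 15, 15, 15, 16, 18, 19]

def arithmetic_mean(values):
    """Sum of the scores divided by the number of scores."""
    return sum(values) / len(values)

total = sum(scores)          # 142
n = len(scores)              # 10
mean_score = arithmetic_mean(scores)
print(total, n, mean_score)  # 142 10 14.2
```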

6 Arithmetic Mean Arithmetic Mean: the sum of the scores divided by the number of scores (generally thought of as the average). The mean of a sample of X scores is symbolized as $\bar{X}$, which is said as “X bar”. The mean of a population of X scores is symbolized by the Greek letter mu (µ).

7 Sample Mean The algebraic definition of the sample mean is as follows: $\bar{X} = \frac{\sum X}{n}$
n is used to refer to the number of scores in the data set (termed sample size).

8 Population Mean The algebraic definition of the population mean is as follows: $\mu = \frac{\sum X}{N}$
N is used to refer to the number of scores in the data set (termed population size).

9 Properties of the Mean Uniqueness: For a given set of data there is one and only one mean. Simplicity: Easily understood and computed. The mean uses all the information available: every value in the given set of data is used in the computation; it is therefore affected by every value. Extreme values have an influence on the mean and, in some cases, can so distort it that it becomes undesirable as a measure of location. It may not be "typical" when there are extreme values present.

10 Median Median: The middle point of the distribution, or the score which divides the set of scores into two equal parts. Median = value of the (n + 1)/2-th observation. NOTE! When determining the median, you must arrange the scores in ascending or descending order first!

11 Median If there is an ODD number of scores, the median is the middle score: 1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23. Median = value of the (n + 1)/2-th observation; (11 + 1)/2 = 6, so look at the value of the 6th observation: Median = 13. There are 5 scores above the median, and 5 below.

12 Median If there is an EVEN number of scores, the median is the midpoint between the two middle scores: 1, 3, 6, 7, 8, 13, 15, 17, 18, 23. Median = value of the (n + 1)/2-th observation; (10 + 1)/2 = 5.5, so take the mean of the 5th and 6th observations: Median = (8 + 13)/2 = 10.5.

13 Steps to Finding the Median
Arrange data in ascending or descending order. Count the number of scores (n). If there is an odd number of scores, find the middle point: this is the median. If there is an even number of scores, find the 2 middle scores, add them, and divide by 2: this is the median.
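
The steps above can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices; the two data sets are the odd and even examples from the median slides.

```python
def median(values):
    """Median following the slide's steps: sort, count, take the middle."""
    ordered = sorted(values)          # step 1: arrange in ascending order
    n = len(ordered)                  # step 2: count the scores
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle score
        return ordered[mid]
    # even count: midpoint between the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_scores = [1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23]
even_scores = [1, 3, 6, 7, 8, 13, 15, 17, 18, 23]
print(median(odd_scores))   # 13
print(median(even_scores))  # 10.5
```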

14 Properties of the Median
Uniqueness: As is true with the mean, there is only one median for a given set of data. Simplicity: The median is easy to calculate. It is not drastically affected by extreme values, as the mean is, since it uses only the middle value of the data set. The median is not a very reliable measure. The median does not use all the information available; this will be discussed to a greater extent in the graphs lecture.

15 Mode Mode: The most frequently occurring score in a set of data.
Age:        19   20   21   22   23   25   27   29   32   40
Frequency:  159  219  172  146  123  83   48   16   20   14
Our example: 20 is the most frequently occurring age in our sample. Therefore the mode of this distribution is 20. This is a unimodal distribution.

16 Mode Mode: The most frequently occurring score in a set of data.
Age:        19   20   21   22   23   25   27   29   32   40
Frequency:  159  219  172  146  123  219  48   16   20   14
Our example: 20 and 25 are the most frequently occurring ages in our sample. Therefore the modes of this distribution are 20 and 25. This is a bimodal distribution.

17 Mode Mode: The most frequently occurring score in a set of data.
Age:        19   20   21   22   23   25   27   29   30   40
Frequency:  159  219  172  146  123  219  48   16   219  14
Our example: 20, 25 and 30 are the most frequently occurring ages in our sample. Therefore the modes of this distribution are 20, 25 and 30. This is a multimodal distribution.

18 Mode Mode: The most frequently occurring score in a set of data.
Age:        19   20   21   22   23   25   27   29   32   40
Frequency:  120  120  120  120  120  120  120  120  120  120
Our example: All the scores have the same frequency. Therefore the data has no mode.

19 Mode Example
Age:        19   20   21   22   23   25   27   29   32   40
Frequency:  159  219  219  146  123  83   48   16   20   14
Our example: 20 and 21 are the most frequently occurring ages in our sample. Therefore the modes of this distribution are 20 and 21. This is a bimodal distribution. This is an example of a frequency distribution, which will be discussed during Thursday’s class on graphs.
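
As a rough sketch, the mode(s) can be read off a frequency table in Python as shown below. The function name `modes` is illustrative, and the three small tables are shortened versions of the slide tables (ages 19 to 23 only), used here as an assumption to keep the example compact.

```python
def modes(frequency_table):
    """Return the value(s) with the highest frequency, or an empty list
    when every value occurs equally often (no mode)."""
    counts = list(frequency_table.values())
    if not counts:
        return []
    if len(counts) > 1 and len(set(counts)) == 1:
        return []                     # all frequencies equal: no mode
    highest = max(counts)
    return sorted(v for v, c in frequency_table.items() if c == highest)

# Shortened tables (age: frequency), based on the slide examples.
unimodal = {19: 159, 20: 219, 21: 172, 22: 146, 23: 123}
bimodal = {19: 159, 20: 219, 21: 219, 22: 146, 23: 123}
no_mode = {19: 120, 20: 120, 21: 120, 22: 120, 23: 120}

print(modes(unimodal))  # [20]
print(modes(bimodal))   # [20, 21]
print(modes(no_mode))   # []
```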

20 Ordered Data Age of patients: Modes = 11, 13, 17, 20, 21, 22

21 Properties of the Mode Like the median, it does not take into account all of the data, only the most frequently occurring score. It may appear in a distribution in places other than the centre. It is the score with the highest bar in a histogram, or the highest point in a frequency polygon (these graphs will be covered in a subsequent lecture). It is the only valid measure of central tendency for nominal data. It is the least frequently used measure of central tendency, as it does not lend itself to mathematical operations.

22 Effect of outliers on mean & median With outliers
Patients age Mean = 18.3 Median = 18

23 Effect of outliers on mean & median Without outliers
Patients age Mean = 16.9 Median
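
The patient ages behind these two slides are not in the transcript, so the sketch below uses invented ages purely to illustrate the effect: one extreme value moves the mean from 17 to 23, while the median only moves from 17 to 17.5. It relies on Python's standard `statistics` module.

```python
import statistics

# Hypothetical ages, invented for illustration only.
ages = [14, 15, 16, 17, 18, 19, 20]
ages_with_outlier = ages + [65]       # add one extreme value

mean_without = statistics.mean(ages)
median_without = statistics.median(ages)
mean_with = statistics.mean(ages_with_outlier)
median_with = statistics.median(ages_with_outlier)

print("without outlier:", mean_without, median_without)
print("with outlier:   ", mean_with, median_with)
```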

24 Measures of Position Quartiles, Deciles and Percentiles:
They locate special points that break a distribution into a given number of equal parts. If a set of data is arranged in order of magnitude, the middle value, which divides the set into two equal parts, is the median. By extending this idea we can think of the values which divide the set into four, ten and one hundred parts.

25 Quartiles One of the most frequently used quantiles is the quartile.
Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations. To find the first, second, and third quartiles:
1. Arrange the N data values into an ordered array.
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4

26 Quartiles Example Weight of patients:
175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270
Step 1: Rank the data and divide into 4 parts: 150, 165, 170 | 175, 180, 190 | 210, 210, 235 | 240, 260, 270, with Q1, Q2 and Q3 at the three dividing points.
Step 2: Q1 = (170 + 175)/2 = 172.5, Q2 = (190 + 210)/2 = 200.0, Q3 = (235 + 240)/2 = 237.5

27 Quartiles Example Q1, the 1st quartile, is the value such that 1/4 of the observations are less than or equal to it. E.g.: 2, 2, 4, 6, 7, 7, 8, 9, 10, 10, 10, 12. To find which value, use the position formula:
Q1: (n + 1)/4 = 13/4 = 3.25, between the 3rd and 4th observations, so Q1 = (4 + 6)/2 = 5
Q2 (2nd quartile = median): 2(n + 1)/4 = 26/4 = 6.5, so Q2 = (7 + 8)/2 = 7.5
Q3, the 3rd quartile, is the value such that 3/4 of the observations are less than or equal to it: 3(n + 1)/4 = 39/4 = 9.75, so Q3 = (10 + 10)/2 = 10
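
The two quartile examples can be reproduced with the sketch below. It assumes the convention the worked answers appear to follow: locate position k(n + 1)/4 in the sorted data and, when that position falls between two observations, take their midpoint. The function name `quartile` is illustrative.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2 or 3) at 1-based position k(n + 1)/4.
    A fractional position is resolved by averaging the two neighbouring
    observations (midpoint convention, assumed from the worked examples)."""
    ordered = sorted(values)
    n = len(ordered)
    if k not in (1, 2, 3) or n < 3:
        raise ValueError("need k in 1..3 and at least 3 observations")
    position = k * (n + 1) / 4
    lower = int(position)                 # observation just below the position
    if position == lower:                 # position lands exactly on a value
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2

weights = [175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270]
scores = [2, 2, 4, 6, 7, 7, 8, 9, 10, 10, 10, 12]

print([quartile(weights, k) for k in (1, 2, 3)])  # [172.5, 200.0, 237.5]
print([quartile(scores, k) for k in (1, 2, 3)])   # [5.0, 7.5, 10.0]
```

Library routines such as `statistics.quantiles` interpolate linearly between neighbouring observations instead of taking the midpoint, so they may give slightly different values for the same data.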

28 Deciles & Percentiles Similarly, the values which divide the data into ten equal parts are called deciles and are denoted by D1, D2, ..., D9, while the values dividing the data into one hundred parts are called percentiles and are denoted by P1, P2, ..., P99. E.g.: the 90th percentile is the value such that 90% of the observations are less than or equal to it.

29 Percentile
The value below (or above) which a particular percentage of values fall; the median is the 50th percentile. E.g. for the 5th percentile, 5% of values fall below it and 95% of values fall above it. A series of percentiles (1st, 5th, 25th, 50th, 75th, 95th, 99th) gives a good general idea of the scatter and shape of the data.
[Figure: height scale from 5’6” to 6’4” with the 1st, 5th, 25th, 50th, 75th, 95th and 99th percentiles and the range marked]
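
The slides do not say which rule is used to locate a percentile, so the sketch below assumes the simple "nearest rank" rule: the p-th percentile is the smallest observation such that at least p% of the data are less than or equal to it. The heights are invented for illustration.

```python
import math

def percentile_nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) by the nearest-rank rule (an assumption;
    other rules interpolate between observations)."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]

# Hypothetical heights in cm: 160, 162, ..., 198 (20 values).
heights_cm = list(range(160, 200, 2))

print(percentile_nearest_rank(heights_cm, 5))   # 160
print(percentile_nearest_rank(heights_cm, 50))  # 178
print(percentile_nearest_rank(heights_cm, 90))  # 194
```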

30 Measures of Dispersion or Variability
Consider the following two data sets on the ages of all patients suffering from bladder cancer (BC) and prostatic cancer (PC). The mean age of each of the two groups is 40 years. If we do not know the ages of individual patients and are told only that the mean age of the patients in the two groups is the same, we may deduce that the patients in the two groups have a similar age distribution. However, the variation in the patients’ ages in these two groups is very different: the ages of the prostatic cancer patients have a much larger variation than the ages of the bladder cancer patients.
BC: 39, 45, 36, 40, 35, 38, 47
PC: 27, 52, 18, 33, 70
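
A short check of this example, using the ages exactly as they appear in the transcript (the PC list may be incomplete there). Spread is measured here simply as the largest minus the smallest age; the helper names are illustrative.

```python
# Ages as listed in the transcript.
bc_ages = [39, 45, 36, 40, 35, 38, 47]   # bladder cancer
pc_ages = [27, 52, 18, 33, 70]           # prostatic cancer

def mean(values):
    return sum(values) / len(values)

def spread(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(mean(bc_ages), spread(bc_ages))  # 40.0 12
print(mean(pc_ages), spread(pc_ages))  # 40.0 52
```

Both groups have the same mean of 40 years, yet the PC ages span 52 years against 12 years for the BC ages.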

31 MEASURES OF DISPERSION
In order to describe a frequency distribution adequately, it is necessary not only to determine the centre of the distribution but also to have an idea about the "variation" or "dispersion" or "scatter" of the measurements. Two groups of data could have the same mean, but different variations from that mean. The mean is never used as a measure of dispersion.

32 e.g. Blood urea level (mg/dl) of 2 groups of 5 individuals each:

33 Measures of Variability
Measure the “spread” in the data. Some important measures: Range, Mean deviation, Variance, Standard Deviation (and Standard Error), Coefficient of variation, Interquartile Range.

34 Variability The purpose of the majority of medical, behavioural and social science research is to explain or account for variance or differences among individuals or groups. Examples: What factors account for the variance (or difference) in IQ among individuals? What factors account for the variance in treatment compliance among different groups of patients?

35 1- Range The range tells us the span over which the data are distributed, and is only a very rough measure of variability. Range: the difference between the maximum and minimum scores (X max - X min). Example: the largest amount of tips made in a night is $270 and the smallest is $150. Therefore, the range of tips made that night is 270 – 150 = $120. Range is the simplest measure of dispersion. It is not the best measure of dispersion, as it depends entirely on the extreme scores and tells us nothing about the middle values. Also, it does not take into consideration all values in a series of scores.

36 Variation This is an example of data with NO variability.
[Table: scores X and their deviations from the mean; X = 5 with deviation 0.00; mean = 5]

37 Variation This is an example of data with low variability.
[Table: scores X and their deviations from the mean, e.g. X = 6 with deviation +1.00; mean = 5]

38 Variation This is an example of data with higher variability.
[Table: scores X and their deviations from the mean, e.g. X = 8 with deviation +3.00; mean = 5]

39 2- Mean deviation The best measures of dispersion should:
take into account all the scores in the distribution and should describe the average deviation of the scores around the mean. Normally, to find the average we would want to sum all deviations from the mean and then divide by n, i.e. $\sum (X - \bar{X}) / n$. BUT we have a problem: $\sum (X - \bar{X})$ will always add up to zero.

40 Mean Deviation The mean (average) deviation is the average of the absolute deviations (i.e. regardless of sign) of the individual observations from their mean: $MD = \sum |X - \bar{X}| / n$. E.g. blood urea level (mg/dl) for 5 individuals: mean deviation = 11.2 mg/dl. This indicates that, on average, the values of X (blood urea level) deviate 11.2 mg/dl from the mean of the distribution.
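
The five blood urea values from this slide are not in the transcript, so the sketch below uses invented values only to show the mechanics of the calculation; it does not reproduce the slide's 11.2 mg/dl. The function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical blood urea levels (mg/dl), invented for illustration.
urea = [20, 30, 40, 50, 60]

# Mean is 40; absolute deviations are 20, 10, 0, 10, 20; their average is 12.
print(mean_absolute_deviation(urea))  # 12.0
```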

41 Deviations from the mean
In any group of scores, the sum of the deviations from the mean equals zero.
[Table: scores X and their deviations X - µ, e.g. a deviation of +2.50]
n = 6, ΣX = 33, µ = ΣX/n = 33/6 = 5.50, Σ(X - µ) = 0.00

42 Variance & Standard Deviation
However, if we square each of the deviations from the mean, we obtain a sum that is not equal to zero. This is the basis for the measures of variance and standard deviation, the two most common measures of variability (or dispersion) of data.
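
Both points can be checked numerically. The six scores below are an assumption: they were chosen to match the totals quoted for the deviations example (n = 6, ΣX = 33, mean 5.50, one deviation of +2.50), since the individual scores are not in the transcript.

```python
# Assumed scores, chosen to match n = 6, sum = 33 and mean = 5.50.
scores = [3, 4, 5, 6, 7, 8]

n = len(scores)
mean = sum(scores) / n                        # 5.5
deviations = [x - mean for x in scores]       # -2.5, -1.5, -0.5, 0.5, 1.5, 2.5
sum_of_deviations = sum(deviations)           # cancels out to zero
sum_of_squares = sum(d ** 2 for d in deviations)

print(mean, sum_of_deviations, sum_of_squares)  # 5.5 0.0 17.5
```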

43 3- Variance The sum of squared deviations from the mean divided by the number of degrees of freedom (n - 1 when a sample is used to estimate the population variance): $s^2 = \sum (X - \bar{X})^2 / (n - 1)$

44 Disadvantages of Variance
1- The original observations are measured in a certain unit, BUT the variance is in the square of this unit. 2- It cannot be added to or subtracted from the mean.

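45 4- Standard Deviation Formulas
Population: $\sigma = \sqrt{\sum (X - \mu)^2 / N}$
Sample: $s = \sqrt{\sum (X - \bar{X})^2 / (n - 1)}$
The sample formula divides by n - 1 rather than n because the deviations are taken from the sample mean rather than the unknown population mean; dividing by n would tend to underestimate the population variance.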

46 Steps to calculate standard deviation
Compute the mean. Subtract the mean from each observation. Square each of the deviations. Sum them. Divide by one less than the number of observations (this gives almost the mean of the squared deviations). Take the square root.
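
The steps above translate almost line by line into Python. In this sketch the function name `sample_sd` is illustrative, the data are the ten quiz scores from the mean example earlier in the lecture, and the result is compared with `statistics.stdev` from the standard library, which also divides by n - 1.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation, following the listed steps."""
    n = len(values)
    mean = sum(values) / n                       # 1. compute the mean
    deviations = [x - mean for x in values]      # 2. subtract the mean
    squared = [d ** 2 for d in deviations]       # 3. square each deviation
    sum_of_squares = sum(squared)                # 4. sum them
    variance = sum_of_squares / (n - 1)          # 5. divide by n - 1
    return math.sqrt(variance)                   # 6. take the square root

quiz_scores = [9, 10, 12, 13, 15, 15, 15, 16, 18, 19]
sd = sample_sd(quiz_scores)
print(sd, statistics.stdev(quiz_scores))  # the two values should agree
```

For these scores the sum of squares is 93.6, so the variance is 93.6/9 = 10.4 and the standard deviation is its square root.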

47 Standard Deviation (SD)
The standard deviation is defined as the square root of the average of the squared deviations of the measurements from their mean, or it is the square root of the "variance".

48 Variance & Standard Deviation
[Table: scores X, deviations from the mean and squared deviations; surviving total = 50.00]
Note: $\sum (X - \bar{X})^2$ is called the Sum of Squares.

49 Why use Standard Deviation and not Variance!??!
Normally, you will only calculate variance in order to calculate standard deviation, as standard deviation is what we typically want. Why? Because standard deviation expresses variability in the same units as the mean. SD can be added to or subtracted from the mean. SD takes into consideration all the values in the series of observations. Example: the standard deviation of ages in a class is 3.7 years (and the variance would be (3.7)² = 13.69 years²).

50 Standard Error (SE)
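
The formula for this slide is not in the transcript. The sketch below assumes the usual definition of the standard error of the mean, SE = SD / √n, with SD the sample standard deviation; the readings are invented for illustration.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample SD divided by the square root of n
    (the usual definition, assumed here)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical systolic blood pressure readings (mmHg), invented for illustration.
readings = [120, 130, 140, 150]

se = standard_error(readings)
print(statistics.mean(readings), se)
```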

51 The results are then expressed as:
"mean ± SD" or "mean ± SE" (136 ± … or 136 ± 5.35 mmHg). N.B. Variation of the data is accepted if the mean > 2.5 SD or > 10 SE.

52 May 2003 Exam:

53 Coefficient of variation (CV)
This measure is used to compare the variability or dispersion within 2 groups of data, since it is invalid to compare 2 standard deviations directly when the groups are measured in different units or have very different means. CV = (SD / mean) × 100. Therefore, the CV is a measure of the relative but not the absolute variability.

54 e.g. In a group of individuals, compare the variation in serum cholesterol with that of body weight.
For serum cholesterol: CV = 50 / 180 x 100 = 27.78
For body weight: CV = 30 / 85 x 100 = 35.29
Body weight therefore shows the higher variability.
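
The comparison can be reproduced in a few lines. The function name `coefficient_of_variation` is illustrative; the SD and mean values are the ones used in the example above.

```python
def coefficient_of_variation(sd, mean):
    """Relative variability: SD as a percentage of the mean."""
    return sd / mean * 100

cv_cholesterol = coefficient_of_variation(50, 180)
cv_weight = coefficient_of_variation(30, 85)

print(round(cv_cholesterol, 2))  # 27.78
print(round(cv_weight, 2))       # 35.29
print("higher variability:", "body weight" if cv_weight > cv_cholesterol else "serum cholesterol")
```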

