Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析.


1 Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

2 Review: Types of data
• Numerical data: continuous, discrete
• Categorical data: nominal, ordinal

3 Review: Statistics is a branch of applied mathematics concerned with the collection and interpretation of data, and with the evaluation of the reliability of the conclusions based on the data.

4 Types of statistical analysis
• Descriptive analysis: data collection, data interpretation
• Inferential analysis: evaluating the reliability of the conclusions

5 Contents
• Frequency distribution ★
• Central tendency ★
• Dispersion (measures of variability) ★
• Tables and graphs

6 New words: Frequency 频数; Proportion 比例; Percentage 百分数; Histogram 直方图; Polygon 折线图; Distribution 分布; Frequency distribution 频数分布

7 New words (continued): Cumulative frequency 累积频数; Cumulative proportion 累积比例; Central tendency 集中趋势; Dispersion 离散程度; Mean 均数; Arithmetic mean 算术均数; Geometric mean 几何均数

8 New words (continued): Median 中位数; Mode 众数; Skewness 偏度; Kurtosis 峰度; Descriptive analysis 描述分析; Inferential analysis 推断分析

9 1. Frequency distribution
Frequency (频数): for a given variable, the number of times a value occurs is called its frequency.

Data:
Id  sex  age
1   m    6
2   m    8
3   f    13
4   m    16
5   f    16
6   f    15
7   f    23
8   m    19
9   f    25
10  f    21
11  m    13
12  f    19
13  f    9
14  f    10
15  f    14

Frequency table of sex
Sex  Label   Frequency
m    Male    5
f    Female  10

10 Proportion or percentage (比例或百分数): the ratio of a frequency to the total frequency.

Frequency table of sex
Sex    Label   Frequency  Percentage (%)
--------------------------------------------------
m      Male    5          33.33
f      Female  10         66.67
--------------------------------------------------
Total  m+f     15         100.00
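The frequency table of sex can be reproduced in a few lines. This is a minimal sketch: the list `sex` is copied from the slide's data in id order, and the variable names are illustrative only.

```python
from collections import Counter

# Sex of the 15 subjects, in id order (1..15), copied from the data listing.
sex = ["m", "m", "f", "m", "f", "f", "f", "m",
       "f", "f", "m", "f", "f", "f", "f"]

freq = Counter(sex)            # frequency of each distinct value
total = sum(freq.values())     # total frequency (15)

# Percentage = frequency / total frequency * 100, rounded to 2 decimals.
percent = {value: round(100 * count / total, 2) for value, count in freq.items()}

for value in ("m", "f"):
    print(value, freq[value], percent[value])
print("Total", total, 100.00)
```

`Counter` counts each distinct value once, which matches the definition of frequency given above.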

11 Frequency distribution: a table or a graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.

Freq distribution of sex
Sex  Frequency  Percentage
m    5          33.33
f    10         66.67

12 Method of displaying frequency distribution of categorical data 1.Nominal data 2.Ordinal data

13 Freq distribution of nominal data

Data:
Id  sex  eyesight  age
1   m    1         6
2   m    2         8
3   f    3         13
4   m    3         16
5   f    4         16
6   f    4         15
7   f    5         23
8   m    6         19
9   f    6         25
10  f    6         21
11  m    7         13
12  f    7         19
13  f    8         9
14  f    9         10
15  f    9         14

Freq distribution of sex
Sex  Frequency  Percentage
m    5          33.33
f    10         66.67

14 Freq distribution of ordinal data

Data: the 15 subjects listed on slide 13 (Id, sex, eyesight, age).

Freq distribution of eyesight
Eyesight  Frequency  Percentage
1-3       4          26.67
4-6       6          40.00
7-9       5          33.33

15 Method of displaying the frequency distribution of numerical data: first divide the whole range into several non-overlapping subintervals; count how many observations lie in each subinterval to make a frequency table; take the midpoint of each subinterval as the x-axis label; then draw a histogram (直方图) or a polygon (折线图).

16 Freq distribution of numerical data

Data: the ages of the 15 subjects listed on slide 13.

Freq distribution of age
Age      Interval  Midpoint  Frequency
0 ~      [0-10)    5         3
10 ~     [10-20)   15        9
20 ~ 30  [20-30]   25        3
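The binning step can be sketched as follows. The interval convention (half-open classes, with the last class closed on the right) is taken from the slide; the function name `class_frequencies` is hypothetical.

```python
# Ages of the 15 subjects, in id order, copied from the data listing.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

# Class boundaries for [0,10), [10,20), [20,30].
edges = [0, 10, 20, 30]


def class_frequencies(values, edges):
    """Count how many values fall into each class defined by consecutive edges.

    Every class is half-open [lo, hi) except the last, which is closed [lo, hi].
    Values outside the overall range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


counts = class_frequencies(ages, edges)
# The midpoint of each class is used as the x-axis label of the histogram.
midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(len(edges) - 1)]

for mid, count in zip(midpoints, counts):
    print(mid, count)
```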

17 Histogram and polygon
[Figure: a histogram and a frequency polygon]

18 Nominal data Ordinal data Numerical data

19 Cumulative frequency and cumulative proportion

Frequency table of age
Age    Midpoint  Frequency  Proportion (%)  Cumulative frequency  Cumulative proportion (%)
0-10   5         3          20.0            3                     20.0
10-20  15        9          60.0            12                    80.0
20-30  25        3          20.0            15                    100.0

Cumulative frequency (累计频数): the sum of the frequencies from the lowest category up to and including a given category.
Cumulative proportion (累计比例): the sum of the proportions from the lowest category up to and including a given category.
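Cumulative frequencies are running sums of the class frequencies. A minimal sketch using the age table's frequencies (variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies for the age classes 0-10, 10-20, 20-30.
freq = [3, 9, 3]
n = sum(freq)

# Proportion of each class, in percent.
proportion = [100 * f / n for f in freq]

# Running sum of the frequencies, from the lowest class upward.
cum_freq = list(accumulate(freq))

# Cumulative proportion, in percent.
cum_prop = [100 * c / n for c in cum_freq]

for row in zip(freq, proportion, cum_freq, cum_prop):
    print(*row)
```

The last cumulative frequency always equals the total frequency, and the last cumulative proportion is 100%.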

20 The plot of cumulative frequency and cumulative proportion

21 The major measures of the characteristics of the observations of a numerical variable:
• Central tendency (集中趋势)
• Dispersion (离散程度)

22 2. Central tendency Central tendency( 集中趋势 ) : The description of the concentration near the middle of the range of all values in a variable. The major measures of central tendency are: mean, median, mode.

23 The mean
The mean (均数), also called the arithmetic mean (算术均数), measures the average level of all observations of a variable. It is defined as follows:
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

24 Eg1a: Estimate the mean
The data listed below are haemoglobin (血色素) contents (g/L). Estimate the mean.

Data (n=12):
id  x      id  x
1   121    7   116
2   118    8   124
3   130    9   127
4   120    10  129
5   122    11  125
6   118    12  132

Solution: $\bar{x}$ = (121+118+…+125+132)/12 = 123.5
So the estimated mean haemoglobin content is 123.5 g/L.

25 Another formula for the mean

Data:
x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n

Formula: if x takes k different values and fi is the frequency with which the i-th value xi occurs in the sample, then the sample mean can be computed as
$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$, where $n = \sum_{i=1}^{k} f_i$.

26 Eg1b: Estimate the mean
The following data are serum cholesterol (血清胆固醇) measurements from 101 men aged 30-49. Estimate the mean.

Data:
Serum cholesterol  Midpoint  Freq.
2.5 ~              3.0       9
3.5 ~              4.0       32
4.5 ~              5.0       42
5.5 ~              6.0       15
6.5 ~              7.0       3
Total                        101

Solution: n=101, $\bar{x}$ = (3×9+4×32+5×42+6×15+7×3)/101 = 4.71 (mmol/L)
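Both mean calculations (raw data in Eg1a, frequency table in Eg1b) can be checked with a short sketch. The data are copied from the two examples; the variable names are illustrative.

```python
# Eg1a: haemoglobin (g/L) for ids 1..12.
hb = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]
mean_hb = sum(hb) / len(hb)          # sum of observations / number of observations

# Eg1b: serum cholesterol (mmol/L), class midpoints and frequencies.
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]
freq = [9, 32, 42, 15, 3]
n = sum(freq)                        # total number of observations

# Frequency-weighted mean: each midpoint stands in for every value in its class.
mean_chol = sum(f * x for f, x in zip(freq, midpoints)) / n

print(mean_hb)
print(round(mean_chol, 2))
```

Using the midpoints is an approximation: the individual cholesterol values inside each class are not known, so the grouped mean is an estimate of the mean of the raw data.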

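27 The median
The median (中位数) is the middle value among the ordered values of all observations of a variable. It is defined as follows:
Population median: $M = X_{\left(\frac{N+1}{2}\right)}$
Sample median: $m = x_{\left(\frac{n+1}{2}\right)}$
Here $X_{(1)} \le \dots \le X_{(N)}$ are the ordered values in the population and $x_{(1)} \le \dots \le x_{(n)}$ are the ordered values in the sample; a half-integer subscript denotes the average of the two neighbouring ordered values.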

28 The method of estimating the median:
1) Order all observed values of the variable from smallest to largest;
2) If n is odd, the middle observation is the median;
3) If n is even, the median is the average of the two middle observations.
eg, if n=9, then m=x((9+1)/2)=x(5); if n=10, then m=x((10+1)/2)=x(5.5)=(x(5)+x(6))/2.

29 Eg2a: Estimate the median
The data listed below are haemoglobin contents (g/L). Estimate the median.

Data (ids 1-12): 121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132.

Solution: the ordered values are 116, 118, 118, 120, 121, 122, 124, 125, 127, 129, 130, 132. Since n=12 is even, med = (122+124)/2 = 123.
So the median haemoglobin content is 123 g/L.

30 Eg2b: Estimate the median
The following data are serum cholesterol measurements (mmol/L) from 101 men aged 30-49. Estimate the median.

Data:
Serum cholesterol  Midpoint  Freq.
2.5 ~              3.0       9
3.5 ~              4.0       32
4.5 ~              5.0       42
5.5 ~              6.0       15
6.5 ~              7.0       3

Solution: since n=101 is odd, the median is the middle value, i.e. the 51st ordered value. The first two classes contain 9+32=41 observations, so the 51st value falls in the class 4.5 ~ with midpoint 5.0, ie, the median M=5.0.
A more accurate value, interpolating within that class, is M = 4.5+(5.5-4.5)/42×10 = 4.74, where 10 = 51-41.
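The two median examples can be sketched in code. `median` follows the odd/even rule given earlier; `grouped_median` mirrors the slide's interpolation, using rank (n+1)/2, which is the slide's choice (other texts use n/2). Both function names are hypothetical.

```python
def median(values):
    """Middle ordered value, or the average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def grouped_median(lower_limits, width, freq):
    """Interpolated median of a frequency table with equal class widths.

    Assumes the observations are spread evenly inside the median class.
    """
    n = sum(freq)
    rank = (n + 1) / 2                 # 51 when n = 101
    below = 0                          # observations in the classes already passed
    for lo, f in zip(lower_limits, freq):
        if below + f >= rank:          # the median class has been reached
            return lo + width * (rank - below) / f
        below += f
    raise ValueError("frequency table is empty")


hb = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]
med_hb = median(hb)

med_chol = grouped_median([2.5, 3.5, 4.5, 5.5, 6.5], 1.0, [9, 32, 42, 15, 3])

print(med_hb)
print(round(med_chol, 2))
```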

31 Frequency distribution about mean and median: Mean=4.71, Median=5.0

32 Skewed distribution
[Figure: two skewed curves with the mean and median marked, one positive (right) skewed and one negative (left) skewed]

33 Comparing mean and median
                   mean                            median
information        more (actual values)            less (ranks)
data available     not available for ordinal data  available for any data
size in magnitude:
  symmetric        Mean=median
  + skewed         Mean>median
  - skewed         Mean<median

34 The definition of median
• The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.
• eg, 12, 14, 14, 15, 16, 16, 16, 17, 18. M=16, for which four values are smaller than M and two values are larger than M.

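35 The geometric mean
When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean. For positive values it is $G = \sqrt[n]{x_1 x_2 \cdots x_n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right)$.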

36 Eg3. The following data are 10 patients’ white blood cell counts(×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and geometric mean.
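A minimal sketch of Eg3, computing the geometric mean through logarithms (the standard definition for positive values). The slide's own solution is not shown in the transcript, so the numbers printed here are this sketch's results rather than the author's.

```python
import math

# White blood cell counts (×1000) of the 10 patients.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]
n = len(wbc)

# Arithmetic mean: sum of the values divided by n.
arithmetic_mean = sum(wbc) / n

# Geometric mean: exponential of the mean of the natural logs
# (equivalent to the n-th root of the product; requires all values > 0).
geometric_mean = math.exp(sum(math.log(x) for x in wbc) / n)

print(arithmetic_mean)
print(round(geometric_mean, 2))
```

The single large value 35 pulls the arithmetic mean upward more than it pulls the geometric mean, which is why the geometric mean comes out smaller here.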

37 The mode
The mode (众数) is defined as the most frequently occurring value or values in a set of data; it marks a point of relatively great concentration.
If a data set consists of the values 6,7,7,8,8,8,8,9,10,11,11,12,12,12,12,13, then the modes are 8 and 12.
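Because a data set can have more than one mode, a sketch that returns every most-frequent value is shown below (variable names are illustrative):

```python
from collections import Counter

data = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]

counts = Counter(data)              # frequency of each distinct value
highest = max(counts.values())      # the largest frequency

# All values that reach the largest frequency, in ascending order.
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```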

38 Summary Frequency distribution Histogram & polygon Measures of central tendency Measures of dispersion

39 Frequency distribution of birthweight
Note: when the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram.

40 New words: Dispersion 离散程度; Range 全距; Deviation 离均差; Variance 方差; Standard deviation 标准差; Coefficient of variation 变异系数

41 New words: Quartile 四分位数; Percentile 百分位数; Inter-quartile interval 四分位间距

42 §3. Dispersion
Dispersion (离散程度): an indication of the spread of the measurements around the center of a variable's distribution.
The major measures of dispersion are: range, variance, standard deviation, inter-quartile range, coefficient of variation, etc.

43 The range
The range (全距) measures the length of the interval over which the data are distributed.
Population range: Range = max - min
Sample range: Range = max - min
* It is a simple measure and has the same unit as the original data.
# It uses little information (only max and min);
# The sample range underestimates the population range: it is biased and inefficient;
# It conveys no information about the middle of the distribution.

44 The quartiles
The first quartile (第一四分位数) Q1: a value for which no more than 25% of the observed values are less than it and no more than 75% of the observed values are greater than it.
[Figure: number line from X1 to Xn with M marked; ≤25% of the values lie below Q1 and ≤75% above it]

45 The second quartile (第二四分位数) Q2=M: a value for which no more than 50% of the observed values are less than it and no more than 50% of the observed values are greater than it.
[Figure: number line from X1 to Xn with M marked; ≤50% of the values lie on each side]

46 The third quartile (第三四分位数) Q3: a value for which no more than 75% of the observed values are less than it and no more than 25% of the observed values are greater than it.
[Figure: number line from X1 to Xn with M marked; ≤75% of the values lie below Q3 and ≤25% above it]

47 Location of quartiles
[Figure: number line from X1 to Xn with Q1, Q2 (=M) and Q3 marked]

48 The method of estimating the quartiles
If the subscript is not an integer or a half-integer, it is rounded up to the nearest integer or half-integer.

49 Eg1: Estimate the quartiles
Data:
A (n=9):  34 36 37 39 40 41 42 43 79
B (n=10): 34 36 37 39 40 41 42 43 44 45

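50 The inter-quartile range (四分位数间距): the difference between Q3 and Q1, that is, Q3-Q1. It spans the middle 50% of the data.
[Figure: number line from X1 to Xn with Q1, M and Q3 marked; the middle 50% lies between Q1 and Q3]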

51 Eg2: Estimate the interquartile range
Data:
A (n=9):  34 36 37 39 40 41 42 43 79
B (n=10): 34 36 37 39 40 41 42 43 44 45
Interquartile range of A = 42.5-36.5 = 6.0
Interquartile range of B = 43.5-37.0 = 6.5
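The quartile values used in Eg2 can be reproduced with the sketch below. It assumes the k-th quartile sits at ordered position k(n+1)/4, rounded up to the nearest integer or half-integer; that position formula is not stated in the transcript and is inferred here because it yields the slide's values (36.5, 42.5, 37.0, 43.5). The function names are hypothetical.

```python
import math


def order_value(sorted_values, pos):
    """Value at 1-based position pos; a half-integer position averages its two neighbours."""
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return (sorted_values[lo - 1] + sorted_values[hi - 1]) / 2


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) under the assumed position rule k(n+1)/4."""
    s = sorted(values)
    n = len(s)
    pos = k * (n + 1) / 4
    pos = math.ceil(2 * pos) / 2       # round up to the nearest integer or half-integer
    pos = min(max(pos, 1), n)          # keep the position inside the data for tiny samples
    return order_value(s, pos)


A = [34, 36, 37, 39, 40, 41, 42, 43, 79]
B = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]

iqr_a = quartile(A, 3) - quartile(A, 1)
iqr_b = quartile(B, 3) - quartile(B, 1)

print(quartile(A, 1), quartile(A, 3), iqr_a)
print(quartile(B, 1), quartile(B, 3), iqr_b)
```

Note that the extreme value 79 in data A has no effect on its inter-quartile range, whereas it would dominate the range (max - min). Other software uses different quartile conventions and may give slightly different values.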

52 The percentiles
The αth percentile (α百分位数) Pα: a value for which no more than α% of the data are less than it and no more than (100-α)% of the data are larger than it, where 0 ≤ α ≤ 100.
P0=min, P100=max, P25=Q1, P50=Q2=M, P75=Q3.

53 The method of estimating the percentiles
If the subscript is not an integer or a half-integer, it is rounded up to the nearest integer or half-integer.

54 Eg3: Estimate the percentiles
Data:
A (n=9):  34 36 37 39 40 41 42 43 79
B (n=10): 34 36 37 39 40 41 42 43 44 45
• For data A: P0=34, P10=34, P20=36, P30=37, …, P90=79, P100=79.
• For data B: P0=34, P10=34, P20=36, P30=37, …, P90=44, P100=45.
• Note: there are many ways to estimate percentiles; the results are not unique.

55 The variance
The variance (Var, 方差) measures the average dispersion of the data about the mean.
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^2$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
Note: the degrees of freedom are not the same: N and n-1.
* It conveys information about the middle of the distribution.
* $s^2$ is an unbiased estimate of $\sigma^2$; both are non-negative values;
# The unit is not the same as that of the original data.

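56 Simplified formulas for the variance
Population variance: $\sigma^2 = \frac{1}{N}\left(\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}\right)$
Sample variance: $s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)$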

57 Proof of the simplified formula
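One standard way to obtain the simplified sample formula, with $\bar{x}$ the sample mean and $n$ the sample size, is to expand the square. This is a sketch of the usual argument and not necessarily the derivation shown on the slide.

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
     \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\end{align*}
```

Dividing both sides by $n-1$ gives the simplified formula for $s^2$; the population case is identical with $N$, $X_i$ and $\mu$ in place of $n$, $x_i$ and $\bar{x}$, dividing by $N$.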

58 Eg4a: Estimate the variance
Data:
id  x   x*x
1   1   1
2   2   4
3   3   9
4   4   16
5   5   25
-------------------
∑   15  55
Solution: $s^2$ = (55 - 15²/5)/(5-1) = 10/4 = 2.5

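59 Another formula for the variance

Data:
x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n

Formula: $s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}\right)$, where $n = \sum_{i=1}^{k} f_i$.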

60 Eg4b: Estimate the variance
Data:
id  x   f   f*x  f*x*x
1   1   3   3    3
2   2   3   6    12
3   3   2   6    18
4   4   1   4    16
5   5   2   10   50
-----------------------------
∑   15  11  29   99
Solution: $s^2$ = (99 - 29²/11)/(11-1) = 2.25
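The two variance examples (Eg4a on raw data, Eg4b on a frequency table) can be checked with the sum-of-squares form of the sample variance. This is a minimal sketch; the function names are hypothetical.

```python
def sample_variance(xs):
    """Sample variance from the sum and the sum of squares, with n - 1 degrees of freedom."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def grouped_sample_variance(xs, fs):
    """Sample variance when value xs[i] occurs fs[i] times."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Eg4a: five observations 1..5 (sum 15, sum of squares 55).
var_a = sample_variance([1, 2, 3, 4, 5])

# Eg4b: values 1..5 with frequencies 3, 3, 2, 1, 2 (n = 11, sums 29 and 99).
var_b = grouped_sample_variance([1, 2, 3, 4, 5], [3, 3, 2, 1, 2])

print(var_a)
print(round(var_b, 2))
```

The grouped formula gives the same result as expanding the table into its 11 individual observations and applying the raw-data formula.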

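61 The standard deviation
The standard deviation (sd, SD, 标准差) measures the average dispersion of the data about the mean; it is the square root of the variance.
Population sd: $\sigma = \sqrt{\sigma^2}$
Sample sd: $s = \sqrt{s^2}$
* It conveys information about the mean of the distribution.
* s is the usual estimate of σ (strictly, $s^2$ is unbiased for $\sigma^2$, while s slightly underestimates σ on average); both are non-negative values;
* The unit is the same as that of the original data.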

62 Eg5: Estimate the SD
Data:
id  x   x*x
1   1   1
2   2   4
3   3   9
4   4   16
5   5   25
-------------------
∑   15  55
Solution: $s = \sqrt{(55 - 15^2/5)/(5-1)} = \sqrt{2.5} = 1.58$

63 The coefficient of variation
The coefficient of variation (cv, CV, 变异系数) measures the relative variation about the mean.
Population cv: $CV = \frac{\sigma}{\mu}\times 100\%$
Sample cv: $CV = \frac{s}{\bar{x}}\times 100\%$
* It measures relative variability or relative dispersion.
* Its value does not depend on the unit of the variable, unlike the variance or the standard deviation, which carry units.
* It can be used to compare the variation of variables measured in different units.

64 Eg6: Estimate the CV
Data: age (x), ids 1-5: 1, 2, 3, 4, 5
sum: 15   mean: 3    var: 2.5   sd: 1.58   cv: 52.70
Data: weight (y), ids 1-5: 11, 12, 13, 14, 15
sum: 65   mean: 13   var: 2.5   sd: 1.58   cv: 12.16
Data: weight (y), ids 1-5: 110, 120, 130, 140, 150
sum: 650  mean: 130  var: 250   sd: 15.8   cv: 12.16
Coding effects:
(1) +-: adding or subtracting a constant leaves S unchanged;
(2) ×÷: multiplying or dividing by a constant leaves CV unchanged.
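The three small data sets in Eg6, and the two coding effects, can be verified with Python's standard `statistics` module. The helper `cv_percent` is a hypothetical name for the sample CV expressed in percent.

```python
import statistics


def cv_percent(xs):
    """Sample coefficient of variation in percent: 100 * s / mean."""
    return 100 * statistics.stdev(xs) / statistics.mean(xs)


age = [1, 2, 3, 4, 5]
weight = [11, 12, 13, 14, 15]             # age + 10: a shift
weight_x10 = [110, 120, 130, 140, 150]    # weight * 10: a rescaling

for name, data in (("age", age), ("weight", weight), ("weight x10", weight_x10)):
    print(
        name,
        sum(data),
        statistics.mean(data),
        statistics.variance(data),          # sample variance, n - 1 in the denominator
        round(statistics.stdev(data), 2),   # sample standard deviation
        round(cv_percent(data), 2),
    )
```

Shifting the data (age to weight) changes the mean but not the standard deviation, so the CV changes; rescaling (weight to weight ×10) multiplies the mean and the standard deviation by the same factor, so the CV stays the same.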

65 Summary
1. Measures of central tendency: mean, median, mode.
2. Measures of dispersion: variance, standard deviation, range, inter-quartile range, CV.

