CTRC Core Curriculum Seminar Series


1 CTRC Core Curriculum Seminar Series
Descriptive Statistics: Data Types and Measures, Central Tendency, Variability Chang-Xing Ma, PhD Associate Professor Department of Biostatistics, UB January 4, 2012

2 Disclosure Statement Chang-Xing Ma, PhD Nothing to disclose

3 Goals and Objectives Goals: Gain the knowledge of basic statistics and how to describe the data Objectives: Describe the data type Summarize data Understand Measure of Central Tendency Understand Measure of Dispersion

4 Outline Basic concepts of biostatistics Data type Summarize data
Measure of Central Tendency Measure of Dispersion

5 Some terminology Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data Biostatistics—the theory and techniques for collecting, describing, analyzing, and interpreting health data.

6 Some terminology Population refers to all measurements or observations of interest. Sample is simply a part of the population, but the sample MUST represent the population. A random sample is such a representative sample: the sample must be large enough, and the sample should be selected randomly.

7 Some terminology Parameter is some numerical or nominal characteristic of a population. A parameter is constant, e.g. the mean of a population, and is usually unknown. Statistic is some numerical or nominal characteristic of a sample. We use a statistic as an estimate of a parameter of the population; it tends to differ from one sample to another. We also use statistics to test hypotheses.

8 Population: all U.S. persons ~ Normal
Parameters: height (µh, σh²), weight (µw, σw²). A random sample (sample size = ): Gender, Height, Weight. Statistics from the sample: mean height, std height, mean weight, std weight, % of male (=1).

9 Sources of data Records Surveys Experiments Comprehensive Sample

10 Types of variables
Quantitative variables: continuous, discrete. Qualitative variables: nominal, ordinal.

11 Data Types
Numerical (Quantitative): a numerical measurement, e.g. height, weight. Categorical (Qualitative): with no natural sense of ordering, e.g. gender, hair color, blood type.

12 Numerical Variable
Continuous: range of values, e.g. height in inches. Discrete: limited possible values, e.g. # of cigarettes smoked per day, # of children in a family. Age -

13 Determining Data Types
• Ordinal (Categorical) vs. Discrete (Numerical) • Ordinal – Cancer Stage I, II, III, IV – Stage II ≠ 2 times Stage I – Categories could also be A, B, C, D • Discrete – # of children: 0, 1, 2, … – 4 children = 2 times 2 children

14 Descriptive Statistics – reducing a complex mass of data to a manageable set of information
Descriptive Statistics: the summary and presentation of data to simplify the data, enable meaningful interpretation, and support decision making. Tools: numerical descriptive measures (a few numbers) and graphical presentations.

15 Inferential statistics
From a sample: estimate population parameters, test hypotheses, and build a model that reflects the population.

16 The student test score (FCAT)
Problem 1 Among the 6 variables, which ones are qualitative and which ones are quantitative? Is Race nominal or ordinal? Code: Race: W – White B – Black H – Hispanic A – Asian Sex: F – Female M – Male Poverty: 0 – not poor 1 – poor Student ID Race Sex Reading Math Poverty

17 Descriptive Statistics
Categorical variables: Frequency distribution Bar chart, pie chart Contingency tables Continuous variables: Grouped frequency table Central Tendency Variability

18 Simple Frequency Distribution
An ordered arrangement that shows the frequency of each level of a variable. race Frequency Percent A B H W sex Frequency Percent F M
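A simple frequency distribution of this kind can be sketched in a few lines of Python. The race codes below are invented for illustration (they reuse the W/B/H/A coding from Problem 1 but are not the slide's data), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical race codes for eight students (W, B, H, A as in Problem 1);
# these are NOT the values from the slide's data set.
race = ["W", "B", "W", "H", "A", "W", "B", "H"]

def frequency_table(values):
    """Return {level: (frequency, percent)} with levels in sorted order."""
    counts = Counter(values)
    total = len(values)
    return {level: (counts[level], 100 * counts[level] / total)
            for level in sorted(counts)}

table = frequency_table(race)
print(f"{'race':<5}{'Frequency':>10}{'Percent':>9}")
for level, (freq, pct) in table.items():
    print(f"{level:<5}{freq:>10}{pct:>9.1f}")
```

The percent column is each level's frequency divided by the total number of observations, so the percents always sum to 100.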

19 Simple Frequency Distribution
It is useful for categorical variables. For continuous variables, it allows you to pick up at a glance some valuable information, such as the highest and lowest values; ascertain the general shape or form of the distribution; and make an informed guess about central tendency values.

20 Bar Chart A bar chart summarizes a set of categorical data (nominal or ordinal). It displays the data using a number of rectangles, each of which represents a particular category. The length of each rectangle is proportional to the number of cases in the category it represents. Bars can be displayed horizontally or vertically, and they are usually drawn with a gap between them. Bars for multiple (usually two) variables can be drawn together to show the relationship. The important point to note about bar graphs is their bar length or height: the greater the length or height, the greater the value. Bar graphs are one of the many techniques used to present data in a visual form so that the reader may readily recognize patterns or trends. They usually present categorical variables, or numeric variables grouped in class intervals. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the x-axis of a bar graph or the y-axis of a column graph are called the scale. When developing bar graphs, draw a vertical or horizontal bar for each category or value. The height or length of the bar represents the number of units or observations in that category (frequency) or simply the value of the variable. Select an arbitrary but consistent width for each bar as well.

21 Pie Chart A pie chart summarizes a set of categorical data (nominal or ordinal) or displays the different values of a given variable (e.g., percentage distribution). It is a circle divided into a series of segments, each representing a particular category. The area of each segment is the same proportion of the circle as the category is of the total data set. Pie charts usually show the component parts of a whole. Often you will see a segment of the drawing separated from the rest of the pie in order to emphasize an important piece of information.

22 Complex frequency distribution Table
Distribution of 20 lung cancer patients at the chest department of Alexandria hospital and 40 controls in May 2008 according to smoking
Smoking      Lung cancer cases   Controls        Total
             No.    %            No.    %        No.    %
Smoker       15     75           8      20       23     38.33
Non smoker   5      25           32     80       37     61.67
Total        20     100          40     100      60     100
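The percentages in this table are column percentages: each count is divided by its column total (20 cases, 40 controls, 60 subjects overall). A minimal sketch that recomputes them from the four counts in the table; the variable names are illustrative.

```python
# Counts taken from the table above: rows are smoking status,
# columns are (lung cancer cases, controls).
counts = {
    "Smoker": (15, 8),
    "Non smoker": (5, 32),
}

n_cases = sum(cases for cases, _ in counts.values())      # 20
n_controls = sum(ctrl for _, ctrl in counts.values())     # 40
n_total = n_cases + n_controls                            # 60

percents = {}
for status, (cases, ctrl) in counts.items():
    percents[status] = (
        round(100 * cases / n_cases, 2),            # % of cases
        round(100 * ctrl / n_controls, 2),          # % of controls
        round(100 * (cases + ctrl) / n_total, 2),   # % of all subjects
    )
    print(status, cases, ctrl, cases + ctrl, percents[status])
```

Comparing the two column percentages (75% of cases smoke versus 20% of controls) is what makes the table informative; the raw counts alone are hard to compare because the groups differ in size.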

23 How about continuous variables?
How data is distributed? Measure of Central Tendency Measure of Variability

24 Grouped Frequency Distribution – for continuous variable
Frequency Table DATA: Interval Size: N: µ: σ:

25 Grouped Frequency Distribution
BUT the problem is that so much information is presented that it is difficult to discern what the data is really like, or to "cognitively digest" the data. The simple frequency distribution usually needs to be condensed even more. It is possible to give up some information (precision) about the data in order to gain understanding about the distribution. This is the function of grouping data into equal-sized intervals called class intervals. The grouped frequency distribution is further presented as frequency polygons, histograms, bar charts, or pie charts.
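Grouping into class intervals can be sketched as follows. The ages and the interval width of 5 years are assumptions made up for illustration, and `grouped_frequency` is a hypothetical helper; the text-mode bars give a rough histogram.

```python
from collections import Counter

# Hypothetical patient ages in whole years (invented for illustration).
ages = [2, 4, 5, 6, 7, 8, 8, 9, 11, 13, 14, 17]
interval_size = 5  # assumed class-interval width

def grouped_frequency(data, width):
    """Count whole-number values in equal-sized class intervals.

    Returns (lowest value, highest value, frequency) per interval,
    e.g. (0, 4, ...) then (5, 9, ...) for width 5.
    """
    counts = Counter(x // width for x in data)
    return [(k * width, (k + 1) * width - 1, counts.get(k, 0))
            for k in range(max(counts) + 1)]

grouped = grouped_frequency(ages, interval_size)
for low, high, freq in grouped:
    print(f"{low:>2}-{high:<2} {freq:>3} {'*' * freq}")
```

Twelve individual values collapse to four rows: precision is lost (the exact ages are gone), but the shape of the distribution becomes visible.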

26 Describing Distributions
Bell-Shaped Distribution Normal distribution N (µ=0, σ2 =1) t-distribution µ, σ2

27 Describing Distributions
Skewed Distribution – positively skewed distribution µ, σ2

28 Describing Distributions
Skewed Distribution – negatively skewed distribution µ, σ2

29 Describing Distributions
Other Shapes Rectangular Bimodal µ, σ2

30 Describing Distributions
Other Shapes J-curve µ, σ2

31 Probability density function - Normal
z-transform: $z = (x - \mu)/\sigma$. The green curve is the standard normal distribution.
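The z-transform rescales a value to the number of standard deviations it lies from the mean, which maps any normal distribution onto the standard normal. A small sketch; the mean of 100 and SD of 15 are illustrative values, not taken from the slides.

```python
def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Illustrative values only: a variable with mean 100 and SD 15.
z = z_score(130, mu=100, sigma=15)
print(z)  # 2.0, i.e. two standard deviations above the mean
```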

32 Measure of Central Tendency Mean, Median, Mode
The Mean: the average value; not robust to outlying values. Length of hospital stays: 6, 4, 5, 9, 10, 7, 1, 4, 3, 4. Mean = (6 + 4 + 5 + 9 + 10 + 7 + 1 + 4 + 3 + 4)/10 = 53/10 = 5.3

33 Measure of Central Tendency Mean, Median, Mode
The Median is the point that divides a distribution of data into two equal parts; robust to outlying values. Length of hospital stays, sorted: 1, 3, 4, 4, 4, 5, 6, 7, 9, 10. Splitting the data into two halves of five values gives median = (4 + 5)/2 = 4.5

34 Measure of Central Tendency Mean, Median, Mode
The Mode is the most frequently occurring value (for grouped data, the midpoint of the interval that has the highest frequency); robust to outlying values, but sometimes misleading. Length of hospital stays, sorted: 1, 3, 4, 4, 4, 5, 6, 7, 9, 10. Mode = 4, which occurred 3 times.
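The three measures can be checked on the hospital-stay data with Python's standard `statistics` module. The second half of the sketch illustrates robustness by replacing the longest stay with an invented extreme value of 100 days; that value is an assumption for illustration only.

```python
import statistics

# Length of hospital stays from the slides.
stays = [6, 4, 5, 9, 10, 7, 1, 4, 3, 4]

mean_stay = statistics.mean(stays)      # 53 / 10
median_stay = statistics.median(stays)  # average of the 5th and 6th sorted values
mode_stay = statistics.mode(stays)      # most frequent value
print(mean_stay, median_stay, mode_stay)

# Robustness check: swap the longest stay (10) for a hypothetical 100-day stay.
with_outlier = [100 if x == 10 else x for x in stays]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)
print(mean_outlier, median_outlier)
```

The single extreme value pulls the mean well above most of the data, while the median stays where it was.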

35 Comparison between mean and median

36 Comparison between mean and median

37 Comparison between mean and median

38 Summary Frequency distribution Histogram, Polygon graph
Bar Chart, Pie Chart Describing Distributions Mean, Median, Mode DATASET:

39 Problem 2 In a study, we collected a medical measurement X for 4 patients. Data of X: 2, 3, 5, 6. Mean of X? Median of X? Mode of X?

40 Descriptive Statistics Variability
The sample range Interquartile range The sample standard deviation (SD), variance Standard error of mean (SEM)

41 Measures of Dispersion - Range
Range – the difference between the lowest and highest values. For example, Age of Patients (years): lowest 2, highest 17; Range = 2 to 17 years, a spread of 17 − 2 = 15 years. When the sample size increases, the range tends to increase as well (not robust).

42 Measures of Dispersion - Range
All of the curves have the same range. Mean? Median?

43 Measures of Dispersion Percentiles, Deciles, Quartiles
Percentiles are based on dividing a sample or population into 100 equal parts. Deciles divide the distribution into 10 equal parts. Quartiles divide the distribution into 4 equal parts: the 1st quartile includes the lowest 25% of the values (Q1); the 2nd quartile includes the values from the 26th percentile through the 50th percentile (Q2, the median); the 3rd quartile includes the values from the 51st percentile through the 75th percentile (Q3).

44 Measures of Dispersion Interquartile Range
Interquartile Range – the 25th percentile (1st quartile) to the 75th percentile (3rd quartile). Age of Patients (years): 1st quartile 6, 2nd quartile 8.5, 3rd quartile 13. Interquartile Range = 6 to 13 years (13 − 6 = 7 years). The Interquartile Range is a robust estimate of data variability.
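A sketch of the range and interquartile range in Python, using the hospital-stay data from the central tendency slides because the patient ages are not listed. Note that several quartile conventions exist; `statistics.quantiles` defaults to its "exclusive" method, so other software may report slightly different Q1 and Q3 for the same data.

```python
import statistics

# Length of hospital stays from the central tendency slides.
stays = [6, 4, 5, 9, 10, 7, 1, 4, 3, 4]

data_range = max(stays) - min(stays)            # highest minus lowest
q1, q2, q3 = statistics.quantiles(stays, n=4)   # three cut points: Q1, Q2, Q3
iqr = q3 - q1                                   # width of the middle 50%

print("range:", data_range)
print("Q1, Q2 (median), Q3:", q1, q2, q3)
print("interquartile range:", iqr)
```

Q2 always agrees with the median; only Q1 and Q3 depend on the convention chosen.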

45 Measures of Dispersion Interquartile Range
Robust estimate, less efficient

46 Deviations from the mean Variance and Standard Deviation
Deviation: observation − mean. The “sum” of deviations might seem a natural measure of spread, BUT $\sum (X - \bar{X}) = 0$ for any data set.

47 Deviations from the mean Variance and Standard Deviation
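Measure of how different the values in a set of numbers are from each other. Sample formulas, with the $n - 1$ denominator: Variance: $s^2 = \sum (X - \bar{X})^2 / (n - 1)$. Standard Deviation: $s = \sqrt{s^2}$.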

48 Deviations from the mean Variance and Standard Deviation
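Data set: 2, 3, 5, 6 (mean $\bar{X} = 4$). Calculation:
Value of X   $(X - \bar{X})$   $(X - \bar{X})^2$
2            −2                4
3            −1                1
5            1                 1
6            2                 4
             ∑=0               ∑=10
Variance (dividing by $n - 1 = 3$): $s^2 = 10/3 \approx 3.33$. Standard Deviation: $s = \sqrt{10/3} \approx 1.83$.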
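The same calculation can be traced step by step in Python. This sketch assumes the sample formulas (dividing by n − 1), which is also what `statistics.variance` and `statistics.stdev` use.

```python
import math
import statistics

x = [2, 3, 5, 6]
n = len(x)
mean_x = sum(x) / n                          # 4.0

deviations = [xi - mean_x for xi in x]       # observation minus mean
squared = [d ** 2 for d in deviations]       # squared deviations

sum_dev = sum(deviations)                    # always 0, so useless as a spread measure
sum_sq = sum(squared)                        # sum of squared deviations

variance = sum_sq / (n - 1)                  # sample variance
sd = math.sqrt(variance)                     # sample standard deviation

print(sum_dev, sum_sq, round(variance, 2), round(sd, 2))
```

Squaring removes the signs, which is why the squared deviations do not cancel the way the raw deviations do.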

49 Three normal distributions: mean=0, s²=1, s²=2, s²=0.5
[Figure: three curves with the same central tendency (mean=0). Leptokurtic: homogeneous, narrow scatter. Mesokurtic. Platykurtic: heterogeneous, wide scatter.]

50 Example 2: FEV1 (litres) of 57 male medical students
Table: FEV1 (litres) of 57 male medical students

51 Example 2: FEV1 (litres) of 57 male medical students
Mean: 4.06 Variance: 0.45 SD: 0.67 Q1: 3.54 Q2 (Median): 4.10 Q3: 4.52 Percentile 5.16 Range: 2.85 to 5.43

52 The Meaning of Standard Deviation
How the data are dispersed around the mean. For normally distributed data: Mean ± 1 SD represents 68.3% of the population; Mean ± 2 SD represents 95.5% of the population; Mean ± 3 SD represents 99.7% of the population.

53 The Meaning of Standard Deviation
±SD    % of Pop
1      68.3
1.96   95
2      95.5
2.58   99
3      99.7
[Figure: normal curve with 34% of the area on each side of the mean within 1 SD and 48% on each side within 2 SD]
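These percentages follow from the normal distribution: the share of a normal population within k standard deviations of the mean is erf(k/√2). A short sketch using only the standard library; `coverage` is an illustrative helper name.

```python
import math

def coverage(k):
    """Percent of a normal population within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

coverage_table = {k: coverage(k) for k in (1, 1.96, 2, 2.58, 3)}
for k, pct in coverage_table.items():
    print(f"{k:<5} {pct:.2f}")
```

The multipliers 1.96 and 2.58 are the ones chosen to give round coverage figures of 95% and 99%. The percentages apply only when the data are approximately normal.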

54 Standard Error of Mean (SEM)
How confident can we be that the sample mean represents the population mean µ? $\mathrm{SEM} = \mathrm{SD}/\sqrt{n}$, so for any sizeable sample the SEM is much smaller than the SD. mean ± 1.96*SD covers about 95% of the data; mean ± 1.96*SEM is an approximate 95% confidence interval for the population mean. SEM and SD are different!
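To make the difference concrete, here is a sketch using the FEV1 summary from Example 2 (mean 4.06, SD 0.67, n = 57). It assumes the normal-approximation multiplier 1.96 quoted on the slide.

```python
import math

# Summary statistics from Example 2 (FEV1 of 57 male medical students).
mean, sd, n = 4.06, 0.67, 57

sem = sd / math.sqrt(n)                              # standard error of the mean

data_interval = (mean - 1.96 * sd, mean + 1.96 * sd)     # spread of individual values
mean_interval = (mean - 1.96 * sem, mean + 1.96 * sem)   # uncertainty about the mean

print(f"SEM = {sem:.3f}")
print(f"about 95% of the data: {data_interval[0]:.2f} to {data_interval[1]:.2f}")
print(f"approx. 95% CI for the mean: {mean_interval[0]:.2f} to {mean_interval[1]:.2f}")
```

The SD interval describes where individual students' values fall; the much narrower SEM interval describes how precisely the sample pins down the population mean.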

55 Standard Error of Mean (SEM)
To describe the scatter or spread of data, use SD. To estimate population parameters, use SEM. Epidemiologic study: SEM. Clinical or laboratory research: SD.

56 Summarizing Data - Calculator
Put DATA below: Mean: 4.06 Variance: 0.45 SD: 0.67 Q1: 3.54 Q2 (Median): 4.10 Q3: 4.52 Percentile 5.16 Range: 2.85 to 5.43 Interval Size: N: µ: σ: Ylim:

57 Box-Plot The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range of the middle two quartiles is known as the inter-quartile range. The line in the box indicates the median value of the data. The + indicates the mean value. The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case the whiskers extend to a maximum of 1.5 times the inter-quartile range beyond the hinges. The points outside the ends of the whiskers are outliers or suspected outliers.
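The quantities a box plot draws can be computed directly. The data below are hypothetical, with one deliberately extreme value, and the quartiles use the default convention of `statistics.quantiles`, which may differ slightly from a given plotting package.

```python
import statistics

# Hypothetical measurements with one unusually large value.
values = [1, 3, 4, 4, 4, 5, 6, 7, 9, 30]

q1, median, q3 = statistics.quantiles(values, n=4)   # lower hinge, median, upper hinge
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the hinges.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

inside = [v for v in values if low_fence <= v <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)  # most extreme non-outlying values
outliers = [v for v in values if v < low_fence or v > high_fence]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Each whisker ends at the most extreme observation that still lies within the fence, and anything beyond is plotted as an individual point.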

58 Box Plot – Example 2 Serum triglyceride measurements in cord blood from 282 babies FEV1 of 57 students

59 What you can get from a box-plot?
Graphically display a variable's location and spread at a glance [Q1, Q2 (median), Q3, interquartile range]. Provide some indication of the data's symmetry and skewness. Unlike many other methods of data display, boxplots show outliers. By drawing a boxplot for each level of a categorical variable side by side on the same graph, one can quickly compare data sets. One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution; displaying a histogram in conjunction with the boxplot helps.

60 Transformations triglyceride LOG (triglyceride)

61 Summarizing data Univariate – categorical variable
Frequency distributions Bar Chart, Pie Chart

62 Summarizing data Univariate – continuous variable
Grouped frequency distributions Polygon or histogram Mean, Median, Mode, Percentile, Q1, Q2, Q3, extreme values Standard deviation, variance, range, interquartile range Box-Plot Normality test statistics

63 Next lecture ( Lecture 2)
Bivariate – one is categorical and the other is continuous variable t-test ANOVA

64 Lecture 3 – categorical data analysis
Bivariate – both are categorical: contingency tables, chi-square test. Response is categorical, predictors could be of both types: logistic regression.

65 Lecture 4 – Continuous response
Correlation Multiple linear regression

66 Thanks. Question?

