Descriptive and inferential statistics. Confidence interval

Descriptive and inferential statistics. Confidence interval Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine

Outline: Descriptive statistics; Measures of central tendency; Measures of spread; Normal distribution; Central limit theorem; Outliers; Inferential statistics; Confidence interval

Population vs Sample A population includes all objects of interest, whereas a sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (μ, σ), while statistics are usually denoted using Roman letters (x̄, s).

Descriptive vs Inferential statistics We compute statistics and use them to estimate parameters. The computation is the first part of the statistical analysis (Descriptive Statistics) and the estimation is the second part (Inferential Statistics). Descriptive Statistics The procedure used to organise and summarise masses of data. Inferential Statistics The methods used to find out something about a population, based on a sample.

Inferential statistics [Diagram: Population (parameters) -> sampling, from population to sample -> Sample (statistics) -> inferential statistics, from sample to population]

Descriptive statistics Organising data: tables, graphs. Summarising data: central tendency (location), variation (spread).

Descriptive statistics Organising data. Tables: frequency distributions, relative frequency distributions. Graphs: bar chart, histogram, box plot.

Frequency distribution
Experimental group (10 patients), individual survival in months: 23 27 17 34 41 28 22 33 29 14
Control group (10 patients), individual survival in months: 24 31 39 35 34 24 14 21 25 22
Frequency distribution of survival for both groups (classes of values):
Survival  Frequency
14        2
17        1
21        1
22        2
23        1
24        2
25        1
27        1
28        1
29        1
31        1
33        1
34        2
35        1
39        1
41        1
Total     20

Relative frequency distribution
Relative frequency distribution of survival for both groups:
Survival  Frequency  Percent  Cumulative percent
14        2          10%      10%
17        1          5%       15%
21        1          5%       20%
22        2          10%      30%
23        1          5%       35%
24        2          10%      45%
25        1          5%       50%
27        1          5%       55%
28        1          5%       60%
29        1          5%       65%
31        1          5%       70%
33        1          5%       75%
34        2          10%      85%
35        1          5%       90%
39        1          5%       95%
41        1          5%       100%
Total     20         100%

Grouped relative frequency distribution
Relative frequency distribution of survival for both groups (classes of intervals):
Survival  Frequency  Percent  Cumulative percent
10 – 14   2          10%      10%
15 – 19   1          5%       15%
20 – 24   6          30%      45%
25 – 29   4          20%      65%
30 – 34   4          20%      85%
35 – 39   2          10%      95%
40 – 44   1          5%       100%
Total     20         100%
What rules should be followed when grouping data?
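The three tables above can be rebuilt from the raw survival values. The following is a minimal Python sketch, not part of the original slides; the function names `frequency_table` and `grouped_counts` are illustrative, and the class width of 5 starting at 10 is taken from the grouped table.

```python
from collections import Counter

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]
survival = experimental + control  # both groups, 20 patients


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * cumulative / n))
    return rows


def grouped_counts(values, start=10, width=5):
    """Count values per class interval (start..start+width-1, and so on)."""
    counts = Counter((v - start) // width for v in values)
    return {(start + k * width, start + (k + 1) * width - 1): counts[k]
            for k in sorted(counts)}


table = frequency_table(survival)
grouped = grouped_counts(survival)

for value, freq, pct, cum in table:
    print(f"{value:>3} {freq:>2} {pct:>5.0f}% {cum:>5.0f}%")
for (low, high), freq in grouped.items():
    print(f"{low} - {high}: {freq}")
```

Because the class boundaries are parameters, the same sketch can be used to try other groupings of the data.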

Descriptive statistics Summarising data. Central tendency (or sample’s middle value): mean, median, mode. Spread (or summary of differences within groups): range, interquartile range, variance, standard deviation.

Mean Most commonly called the average: the sum of all values divided by the number of values. Experimental group (10 patients) Individual survival in months: 23 27 17 34 41 28 22 33 29 14 Control group (10 patients) Individual survival in months: 24 31 39 35 34 24 14 21 25 22

Mean Mean is the balance point. Means can be heavily affected by outliers (data points with extreme values unlike the rest). Outliers can make the mean a bad measure of central tendency or common experience.
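As a small illustration of how an outlier pulls the mean, here is a Python sketch using the control group's survival data. The extra value of 120 months is a hypothetical outlier added for the example; it does not appear in the slides.

```python
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


mean_control = mean(control)           # 269 / 10

# Hypothetical extreme value, added only to show the effect on the mean.
with_outlier = control + [120]
mean_with_outlier = mean(with_outlier)  # 389 / 11

print(f"mean without outlier: {mean_control:.1f} months")
print(f"mean with outlier:    {mean_with_outlier:.1f} months")
```

A single extreme case moves the mean from about 27 to about 35 months, although ten of the eleven patients are unchanged.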

Median The middle value when a variable’s values are ranked in order. The point that divides a distribution into two equal halves. When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it. The 50th percentile.

Median Control group (10 patients) Individual survival in months, ranked in order: 14 21 22 24 24 25 31 34 35 39 Median = 24.5 (five cases above, five below)

Median The median is largely unaffected by outliers, so when data are skewed it describes the “typical person” better than the mean. If the recorded values for a variable form a symmetric distribution, the median and mean are identical. In skewed data, the mean lies further toward the skew than the median.
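The robustness of the median can be checked with Python's standard `statistics` module. This is a sketch only; the value of 120 months is a hypothetical outlier that is not in the original data.

```python
from statistics import mean, median

control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

median_control = median(control)  # average of the 5th and 6th ranked values

# Hypothetical extreme value, added only for illustration.
with_outlier = control + [120]
median_with_outlier = median(with_outlier)

print(f"median: {median_control} -> {median_with_outlier}")
print(f"mean:   {mean(control):.1f} -> {mean(with_outlier):.1f}")
```

The median moves by half a month, while the mean moves by more than eight months.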

Mode The most common data point is called the mode. Individual survival data for the control group are: 14, 21, 22, 24, 24, 25, 31, 34, 35, 39, so the mode is 24. It is possible to have more than one mode. If all values are unique, there is no mode. The mode may not be at the center of a distribution.
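A short Python sketch of these three cases (one mode, several modes, no mode). The helper `modes` is an illustrative name; it follows the convention stated above that a data set with all unique values has no mode, which differs from `statistics.multimode`, where every value would be returned.

```python
from collections import Counter

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]


def modes(values):
    """Return the most frequent value(s), or [] if every value is unique."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # convention used here: all values unique means no mode
    return sorted(v for v, c in counts.items() if c == highest)


modes_control = modes(control)                  # one mode
modes_experimental = modes(experimental)        # all values unique
modes_combined = modes(experimental + control)  # several modes

print(modes_control, modes_experimental, modes_combined)
```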

Mode It may give you the most likely experience rather than the typical or central experience. In symmetric distributions with a single peak, the mean, median and mode are the same. In skewed data, the mean and median lie further toward the skew than the mode. [Figure: a skewed and a symmetric distribution with the positions of the mean, median and mode marked]

Spread Variation of the recorded values on a variable. The larger the spread, the further the individual cases are from the mean. The smaller the spread, the closer the individual scores are to the mean. [Figure: two distributions with the same mean and different spread]

Range The spread, or the distance, between the lowest and highest values of a variable. To get the range for a variable, you subtract its lowest value from its highest value. Experimental group (10 patients) Individual survival in months: 23 27 17 34 41 28 22 33 29 14 Range = 41 – 14 = 27 Control group (10 patients) Individual survival in months: 24 31 39 35 34 24 14 21 25 22 Range = 39 – 14 = 25
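The same subtraction as a minimal Python sketch (the function name `value_range` is illustrative):

```python
experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


range_experimental = value_range(experimental)
range_control = value_range(control)
print(range_experimental, range_control)
```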

Standard deviation Standard deviation takes into account all individual deviations. A deviation is the distance of a case’s score from the mean. Experimental group’s SD = 8.13 months. Control group’s SD = 7.64 months.

Standard deviation The larger the standard deviation, the greater the variation around the mean. The standard deviation is equal to 0 only when all values are the same. Like the mean, the standard deviation will be inflated by an outlier.
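The reported SDs can be reproduced with the sample standard deviation (n - 1 in the denominator). The sketch below assumes that this is the formula behind the slide's figures, which the numbers support; `sample_sd` is an illustrative name, and `statistics.stdev` from the standard library gives the same result.

```python
from math import sqrt
from statistics import stdev

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]


def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = [(x - m) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))


sd_experimental = sample_sd(experimental)
sd_control = sample_sd(control)
sd_constant = sample_sd([5, 5, 5, 5])  # all values the same

print(f"experimental SD = {sd_experimental:.2f} months")
print(f"control SD      = {sd_control:.2f} months")
print(f"library check   = {stdev(experimental):.2f}, {stdev(control):.2f}")
```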

Interquartile range The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively. IQR is equal to Q3 minus Q1.
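Quartiles can be computed with `statistics.quantiles` from the Python standard library. This is a sketch under one assumption worth stating: several conventions for locating quartiles exist, and the library's default ("exclusive") method is used here, so other software may report slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

# n=4 splits the ranked data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(control, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Note that Q2 equals the median of 24.5 months found earlier for the control group.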

Central tendency and spread Central tendency: mean, mode and median. Spread: range, interquartile range, standard deviation. Mistakes: focusing only on the mean and ignoring the variability; confusing the standard deviation with the standard error of the mean; confusing variation with variance. What is best to use in different scenarios? Symmetrical data: mean and standard deviation. Skewed data: median and interquartile range.

Important rules When a constant is added to every observation, the new sample mean is equal to original mean plus the constant. When a constant is added to every observation, the standard deviation is unaffected. When every observation is multiplied by the same constant, the new sample mean is equal to original mean multiplied by the constant. When every observation is multiplied by the same constant, the new sample standard deviation is equal to original standard deviation multiplied by the magnitude of the constant.
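These four rules can be checked numerically. The Python sketch below uses the experimental group's data; the constants 5 and -2 are arbitrary choices for illustration.

```python
from math import isclose
from statistics import mean, stdev

data = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
shift, factor = 5, -2  # arbitrary constants chosen for the example

shifted = [x + shift for x in data]
scaled = [x * factor for x in data]

# Adding a constant: the mean shifts, the SD does not change.
add_rule_mean = isclose(mean(shifted), mean(data) + shift)
add_rule_sd = isclose(stdev(shifted), stdev(data))

# Multiplying by a constant: the mean is multiplied by the constant,
# the SD by its magnitude (absolute value).
mul_rule_mean = isclose(mean(scaled), mean(data) * factor)
mul_rule_sd = isclose(stdev(scaled), stdev(data) * abs(factor))

print(add_rule_mean, add_rule_sd, mul_rule_mean, mul_rule_sd)
```

A negative factor is used on purpose: the mean changes sign, but the SD stays positive.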

Normal (Gaussian) distribution The mean and standard deviation are a particularly appropriate summary for data whose histogram approximates a normal distribution (the bell-shaped curve). If you say that a set of data has a mean survival of 29 months, the typical listener will picture a bell-shaped curve centered with its peak at 29 months.

Rule of 3-sigma When data are approximately normally distributed: approximately 68% of the data lie within one SD of the mean; approximately 95% of the data lie within two SDs of the mean; approximately 99.7% of the data lie within three SDs of the mean.
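The percentages in this rule follow from the normal cumulative distribution function. A minimal Python sketch using `statistics.NormalDist` (the helper name `share_within` is illustrative):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.2f}%")
```

The exact shares are slightly above 68% and 95% for one and two SDs; exactly 95% corresponds to 1.96 SDs, the value used for 95% confidence intervals.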

Normal (Gaussian) distribution Central limit theorem: Create a population with a known distribution that is not normal; Randomly select many samples of equal size from that population; Tabulate the means of these samples and graph the frequency distribution. The central limit theorem states that if your samples are large enough, the distribution of the means will approximate a normal distribution even if the population is not Gaussian. Mistakes: confusing “normal” in the statistical sense with “common” (or “disease-free”); forgetting that few biological distributions are exactly normal.
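The three steps above can be simulated. The following Python sketch makes its own assumptions: the non-normal population is exponential with mean 1 (strongly right-skewed), each sample has 50 values, 2000 samples are drawn, and the random seed is fixed only to make the run repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed, for repeatability only
SAMPLE_SIZE = 50          # assumed sample size
N_SAMPLES = 2000          # assumed number of samples


def draw_sample_mean():
    """Draw one sample from an exponential population and return its mean."""
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return sum(sample) / len(sample)


sample_means = [draw_sample_mean() for _ in range(N_SAMPLES)]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
expected_sd = 1.0 / sqrt(SAMPLE_SIZE)  # population SD / sqrt(n)

# For a normal distribution, about 68% of values lie within one SD of the mean.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / N_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.0)")
print(f"SD of sample means:   {sd_of_means:.3f} (expected {expected_sd:.3f})")
print(f"share within 1 SD:    {within_one_sd:.2f}")
```

Plotting a histogram of `sample_means` would show the bell shape directly; the share within one SD is used here as a check that needs no plotting library.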

Outliers Values that lie very far away from the other values in the data set.

Outliers Outliers can occur for several reasons: invalid data entry, biological diversity, random chance, experimental error, skewed distribution. Mistakes: not realizing that outliers are common in data sampled from a skewed distribution; eliminating outliers only when you do not get the results you want.

Outliers Outlier test: If values are sampled from a normal distribution, what is the chance one value will be as far from the others as the extreme value observed? Examples: Chauvenet criterion, Grubbs test, Peirce criterion Nevertheless, deletion of outlier data is generally a controversial practice!
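As an illustration of one of the named tests, here is a minimal Python sketch of the Chauvenet criterion in its common textbook form: a value is flagged when the expected number of observations at least as far from the mean, n times the two-sided normal probability, is below 0.5. The function name, the use of the sample SD and the added value of 120 months are assumptions made for the example, and flagging a value is not by itself a reason to delete it.

```python
from statistics import NormalDist, mean, stdev


def chauvenet_outliers(values):
    """Return the values flagged by the Chauvenet criterion (assumes normality)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    if s == 0:
        return []  # all values identical: nothing can be flagged
    flagged = []
    for x in values:
        z = abs(x - m) / s
        # Two-sided probability of a deviation at least this large.
        p = 2 * (1 - NormalDist().cdf(z))
        if n * p < 0.5:
            flagged.append(x)
    return flagged


experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
flagged_original = chauvenet_outliers(experimental)

# Hypothetical extreme value, added only for illustration.
flagged_with_extreme = chauvenet_outliers(experimental + [120])

print(flagged_original, flagged_with_extreme)
```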

Confidence interval for the population mean Population mean: point estimate vs interval estimate. Standard error of the mean: how close the sample mean is likely to be to the population mean. Assumptions: a random representative sample, independent observations, the population is normally distributed (at least approximately). The confidence interval depends on: sample mean, standard deviation, sample size, degree of confidence. Mistakes: believing that 95% of the values lie within the 95% CI; believing that a 95% CI covers the mean ± 2 SD.

Standard error of mean The sample mean estimates individual values. The uncertainty with which this mean estimates individual values is given by the standard deviation. The sample mean estimates the population mean. The uncertainty with which this mean estimates the population mean is given by the standard error of the mean.
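The standard error of the mean is the standard deviation divided by the square root of the sample size. A minimal Python sketch, using the experimental group's survival data from the earlier slides as the example (the function name is illustrative):

```python
from math import sqrt
from statistics import stdev

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]


def standard_error_of_mean(values):
    """SEM = sample standard deviation / sqrt(sample size)."""
    return stdev(values) / sqrt(len(values))


sd = stdev(experimental)
sem = standard_error_of_mean(experimental)

print(f"SD  = {sd:.2f} months (spread of individual values)")
print(f"SEM = {sem:.2f} months (uncertainty of the sample mean)")
```

The SEM is always smaller than the SD for samples of more than one value, and it shrinks as the sample size grows, whereas the SD does not.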

Confidence interval for the population mean The confidence interval for the mean gives us a range of values around the sample mean where we expect the “true” population mean to be located. The 95% confidence interval for the population mean is: x̄ ± 1.96 × SEM, where SEM = SD / √n.

Confidence interval for the population mean The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods (in years) of a random sample of 30 HIV infected individuals are: 12.0, 10.5, 9.5, 6.3, 13.5, 12.5, 7.2, 12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9, 12.0, 6.3, 7.8, 6.3, 12.5, 5.2, 13.1, 10.7, 7.2. Calculate the 95% CI for the population mean incubation period in HIV. x̄ = 9.5 years; SD = 2.8 years; SEM = 0.5 years. 95% level of confidence => Z = 1.96. µ = 9.5 ± (1.96 × 0.5) = 9.5 ± 1 years. The 95% CI for µ is (8.5; 10.5) years.

Confidence interval for the population mean x̄ = 9.5 years; SD = 2.8 years; SEM = 0.5 years. 95% level of confidence => Z = 1.96. µ = 9.5 ± (1.96 × 0.5) = 9.5 ± 1 years. The 95% CI for µ is (8.5; 10.5) years. 99% level of confidence => Z = 2.58. µ = 9.5 ± (2.58 × 0.5) = 9.5 ± 1.3 years. The 99% CI for µ is (8.2; 10.8) years.
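The two intervals can be reproduced from the summary statistics stated on the slide (mean 9.5 years, SD 2.8 years, n = 30). The Python sketch below is an illustration, not the author's own calculation: it takes the Z value from the normal distribution, as the slide does, and the function name is illustrative. For a sample of this size a t-based multiplier would give a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Normal-approximation CI for a population mean from summary statistics."""
    sem = sd / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sem
    return sample_mean - margin, sample_mean + margin


# Summary statistics as stated on the slide.
ci95 = mean_confidence_interval(9.5, 2.8, 30, confidence=0.95)
ci99 = mean_confidence_interval(9.5, 2.8, 30, confidence=0.99)

print(f"95% CI: ({ci95[0]:.1f}; {ci95[1]:.1f}) years")
print(f"99% CI: ({ci99[0]:.1f}; {ci99[1]:.1f}) years")
```

A higher level of confidence requires a wider interval, which is why the 99% CI contains the 95% CI.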

Describing qualitative data
                     Improvement  No improvement  Total
Gluten-free diet     54           46              100
No gluten-free diet  47           53              100
Total                101          99              200

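Describing qualitative data Standard error of proportion: SE(p) = √(p × (1 − p) / n), where p is the sample proportion and n is the sample size. The 95% confidence interval for a population proportion is: p ± 1.96 × SE(p).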
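Applied to the gluten-free diet group from the table on the previous slide (54 of 100 patients improved), the calculation can be sketched in Python as follows. The function name is illustrative, and the normal approximation used here is only reasonable when both the number of improved and not improved patients are fairly large.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (p, SE, lower, upper) using the normal approximation."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se


# Gluten-free diet group: 54 of 100 patients improved.
p, se, lower, upper = proportion_confidence_interval(54, 100)

print(f"p = {p:.2f}, SE = {se:.3f}")
print(f"95% CI: ({lower:.2f}; {upper:.2f})")
```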