Day 2 Session 1 Basic statistics Gabriele Price Senior Public Health Intelligence Analyst South.

Presentation transcript:

Day 2 Session 1 Basic statistics Gabriele Price Senior Public Health Intelligence Analyst South East Public Health Observatory

Overview  What is/are statistics?  Summarising data  Types of data  Normal distribution  Confidence intervals  Hypothesis testing  Significance and p-values

What is ‘statistics’? Statistics is the science of:  collecting  summarising  analysing  and interpreting sets of data. “Statistics is a way to get information from data. That’s it!” Gerald Keller

What are statistics? Statistics are:  the numerical facts or data themselves  numbers derived from a sample of data that describe some characteristic of the sample Some key concepts:  population  sample  unit  data  parameter vs. statistic.

Key statistical concepts  Population  is the group of all items of interest to a statistical practitioner  is frequently very large and sometimes infinite.

Key statistical concepts  Population  Sample  is a part or subset of the population used to gain information about the population  is still potentially large, but less than the population.

Key statistical concepts  Population  Sample  Unit  any individual member of the population/sample.

Key statistical concepts  Population  Sample  Unit  Data  are measurements that have been collected.

Key statistical concepts  Population  Sample  Unit  Data  Parameter or statistic  is a descriptive measure of a population (parameter) or of a sample (statistic).  Inference is the process of making decisions about a population based on information contained in a sample from that population.  The sample should be selected through an appropriate process, e.g. using a random procedure.  Sample selection in health intelligence?  all hospital admissions in England  number of COPD sufferers on GP register in England

Types of data
Numerical (quantitative): counted or measured. Examples: weight: 70 kg; height: 168 cm; age: 45 years; blood pressure: 120 mmHg. Numbers can be added, multiplied, averaged etc.
Categorical (qualitative): characterises a quality. Examples: gender: male/female; smoking status: smoker = 1, ex-smoker = 2, non-smoker = 3. Any numbers used as labels cannot be added, multiplied, averaged etc.

Numerical data
Discrete: integers (whole numbers); usually counts. Examples: number of people, number of admissions, number of prescriptions.
Continuous: any value on the scale; usually measurements. Examples: height, weight.
The distinction between discrete and continuous data is often not necessary for the purpose of analysis.

Categorical data
Nominal: no natural order. Examples: gender, blood group.
Ordinal: have a natural order. Examples: social class, cancer staging.
Numerical data can be made categorical, e.g. blood pressure > 90 mmHg → hypertensive; ≤ 90 mmHg → normal.
Categorical data are often coded numerically for computer data entry, e.g. male = 1, female = 2. This does not make them numerical data.
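The idea of turning a numerical reading into a category can be sketched in a few lines of Python. This is only an illustration: the function name `classify_diastolic`, the default threshold, and the readings are assumptions made for the example, not part of the slides.

```python
def classify_diastolic(pressure_mmhg: float, threshold: float = 90.0) -> str:
    """Turn a numerical reading into a two-level category.

    Readings strictly above the threshold are labelled hypertensive,
    everything else normal (assumed rule for this sketch).
    """
    return "hypertensive" if pressure_mmhg > threshold else "normal"


# Hypothetical readings, not taken from the slides.
readings = [72.0, 90.0, 95.5, 101.7]
categories = [classify_diastolic(r) for r in readings]
print(categories)

# Numeric codes for a nominal variable are labels only:
# averaging them would give a meaningless number.
sex_codes = {"male": 1, "female": 2}
```

Once the readings are categorised, only counts and percentages of each category are meaningful summaries.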

Some notes on type of data…  Type of data applies to individual measurements, not to summary group statistics, e.g. in a sample of patients, 30/100 (30%) are male.  In health intelligence, the summary statistic is often the individual measurement of interest.  The individual unit of analysis is more likely to be PCT, local authority, region, or SHA than an individual person. Examples:  Ethnic origin of an individual  % BME for a PCT  Individual death under 75 years from CHD  Under 75 rate for CHD for PCT

Which types of data are the following?  Birthweight  Marital status  Pain scale  Age at last birthday  Exact age  Number of visits to GP last year  Cancer staging  Cholesterol level  Number of colds last year  Length of hospital stay  Mortality

Which types of data are the following?
Birthweight: 3.050 kg → numerical continuous
Marital status: married → categorical nominal
Pain scale (mild, moderate, severe): mild → categorical ordinal
Age at last birthday: 21 years → numerical discrete
Exact age: 21 years and 6 months (21.5) → numerical continuous
Number of visits to GP last year: 5 → numerical discrete
Cancer staging: II → categorical ordinal
Cholesterol level: 4.6 mmol/l → numerical continuous
Number of colds last year: 2 → numerical discrete
Mortality (dead/alive): alive → categorical nominal
Length of hospital stay: 7 days → numerical discrete

Summarising data  Often referred to as Descriptive Statistics or Summary Statistics: methods of organising, summarising and presenting data in a convenient and informative way.  Graphical techniques, e.g. frequency distribution.  Numerical techniques:  measure(s) of central location  measure(s) of variation (dispersion).

Some data Diastolic pressure readings in 120 patients

Graphical techniques  Frequency distribution: lists data values by groups of intervals along with their corresponding frequencies (or counts). A histogram is a graphical display of tabulated frequencies. [Table: Diastolic pressure | Frequency]

Graphical techniques  Frequency distribution  Relative frequency: determined in the same way as the frequency distribution, except that it consists of the proportions of occurrences instead of the numbers of occurrences for each group. [Table: Diastolic pressure | Relative frequency (%)]

Graphical techniques  Frequency distribution  Relative frequency  Cumulative frequency: the running total of the frequencies. On a graph, it can be represented by a cumulative frequency curve. [Table: Diastolic pressure ≤ | Cumulative frequency]

Graphical techniques  Frequency distribution  Relative frequency  Cumulative frequency  Cumulative relative frequency: determined in the same way as the cumulative frequency distribution, except that it consists of the proportions instead of the numbers. [Table: Diastolic pressure ≤ | Cumulative relative frequency; first rows: ≤ 59: 2.5%, ≤ 64: 4.2%]
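The four tabulations described above (frequency, relative frequency, cumulative frequency, cumulative relative frequency) can be sketched as follows. The readings, the interval width, and the helper name `frequency_table` are assumptions made for illustration; the 120 readings from the slides are not reproduced here.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values, width, start):
    """Group values into intervals [lower, lower + width) and tabulate them."""
    # Map every value to the lower bound of the interval it falls in.
    counts = Counter(start + width * int((v - start) // width) for v in values)
    lowers = sorted(counts)
    freqs = [counts[lo] for lo in lowers]
    cumulative = list(accumulate(freqs))  # running total of the frequencies
    total = len(values)
    return [
        {
            "interval": (lo, lo + width),
            "frequency": f,
            "relative": f / total,
            "cumulative": c,
            "cumulative_relative": c / total,
        }
        for lo, f, c in zip(lowers, freqs, cumulative)
    ]


# Hypothetical diastolic readings (mmHg), not the data from the slides.
readings = [58, 62, 67, 71, 74, 78, 79, 81, 83, 84, 86, 88, 91, 95, 102]
table = frequency_table(readings, width=10, start=50)

for row in table:
    lo, hi = row["interval"]
    print(
        f"[{lo}, {hi})  n={row['frequency']:2d}  "
        f"{row['relative']:6.1%}  cum n={row['cumulative']:2d}  "
        f"{row['cumulative_relative']:6.1%}"
    )
```

In this sketch, intervals that contain no observations are simply left out of the table; a histogram would normally show them as empty bars.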

Numerical techniques Measure(s) of central location  Mean: the measure of centre found by adding the values and dividing the total by the number of values. mean = 82

Numerical techniques Measure(s) of central location  Mean = 82  Median: the measure of centre that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If the number of values is even, the median is found by calculating the mean of the two middle values. median = 82.5

Numerical techniques Measure(s) of central location  Mean = 82  Median = 82.5  Mode: the value that occurs most frequently. mode = 89.1 and 98 and 85.8 and 85.9 → multimodal
One mode. Example: the number of visits made to the GP in 1 year by 21 patients: 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55 (mode = 4)
No mode. Example: the age of the 21 patients who visited the GP in 1 year: 2, 5, 6, 7, 12, 15, 21, 30, 55, 59, 60, 61, 69, 70, 71, 73, 80, 85, 87, 89, 90
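As a quick check of the three measures, the two GP examples above can be run through Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for the example only.

```python
import statistics

# Number of GP visits in 1 year by 21 patients (from the slide).
gp_visits = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55]
# Ages of the same 21 patients (from the slide).
ages = [2, 5, 6, 7, 12, 15, 21, 30, 55, 59, 60, 61, 69, 70, 71, 73, 80, 85, 87, 89, 90]

mean_visits = statistics.mean(gp_visits)      # sum of values / number of values
median_visits = statistics.median(gp_visits)  # middle (11th) of the 21 sorted values
modes_visits = statistics.multimode(gp_visits)  # most frequent value(s)

# Every age occurs exactly once, so multimode returns all 21 values:
# no single value stands out as a mode.
modes_ages = statistics.multimode(ages)

print(mean_visits, median_visits, modes_visits, len(modes_ages))
```

The mean number of visits is pulled well above the median by the few patients with 18, 41 and 55 visits, which is the effect of extreme values discussed in the following slides.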

Mean and median? Mean Median a) 3, 4, 5, 6, 7 b) 9, 10, 20, 21 c) 1, 2, 3, 4, 990

Mean and median?
a) 3, 4, 5, 6, 7: mean = 5, median = 5
b) 9, 10, 20, 21: mean = 15, median = 15
c) 1, 2, 3, 4, 990: mean = 200, median = 3

What is the best measure of central tendency? No single best answer!

What is the best measure of central tendency?
Measure of centre | How common? | Existence | Takes every value into account? | Affected by extreme values? | Advantages and disadvantages
Mean | Most familiar ‘average’ | Always exists | Yes | Yes | Works well with many statistical methods
Median | Commonly used | Always exists | No | No | Often a good choice if there are some extreme values
Mode | Sometimes used | Might not exist; may be more than one mode | No | No | Rarely used in health related and medical statistics

Numerical techniques Measure(s) of variation  Range: difference between highest and lowest values.  poor measure of variation  sensitive to extreme values  however, often reported, e.g. 82 (54.1, 101.7)

Numerical techniques Measure(s) of variation  Range  Inter-quartile range: difference between upper quartile and lower quartile  lower quartile has ¼ of the values smaller than it  upper quartile has ¼ of the values larger than it Box and whisker plots

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles: value below which a given proportion of the data lies  divide the ranked data into 100 groups  1st percentile: 1% of data below, 99% above  5th percentile: 5% of data below, 95% above  10th percentile: 10% of data below, 90% above etc.  could describe the spread by the difference between the 10th and 90th percentiles  or the ratio of the 90th percentile to the 10th percentile
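The range, inter-quartile range and percentile-based measures of spread can be sketched with `statistics.quantiles`. The readings below are hypothetical, and the `"inclusive"` interpolation method is one of several conventions; other software may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical diastolic readings (mmHg), not the data from the slides.
data = [54, 61, 68, 72, 77, 82, 85, 88, 91, 96, 102]

# Range: difference between highest and lowest values.
value_range = max(data) - min(data)

# Quartiles: three cut points that split the ranked data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # inter-quartile range

# Percentiles: 99 cut points that split the ranked data into 100 groups.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p10, p90 = percentiles[9], percentiles[89]

spread_difference = p90 - p10  # difference between 90th and 10th percentile
spread_ratio = p90 / p10       # ratio of 90th to 10th percentile

print(value_range, (q1, q2, q3), iqr, p10, p90)
```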

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles  Variance: a measure of the variation of a set of data points around their mean value.  step 1: calculate deviations (the difference between each observation and the mean of the data)  step 2: square these deviations  step 3: average the squared deviations (strictly divide by n-1, not n)  used in a variety of statistical tests, but on its own it is of limited practical use since it is expressed in squared units

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles  Variance  Standard deviation: a measure of the dispersion of a data set from its mean. Standard deviation is calculated as the square root of variance.  more useful measure of variation as it returns the statistic to the same units as the data
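The three variance steps and the square root that gives the standard deviation can be written out directly. The sketch below uses the small data set 3, 4, 5, 6, 7 from the mean and median exercise; the function name `sample_variance` is an assumption for the example, and the result is compared against the standard library.

```python
import math
import statistics


def sample_variance(values):
    """Variance computed step by step, dividing by n - 1."""
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]   # step 1: deviations from the mean
    squared = [d ** 2 for d in deviations]    # step 2: square the deviations
    return sum(squared) / (len(values) - 1)   # step 3: average, using n - 1


data = [3, 4, 5, 6, 7]
variance = sample_variance(data)  # in squared units of the data
sd = math.sqrt(variance)          # back in the same units as the data

print(variance, round(sd, 3))
print(statistics.variance(data), round(statistics.stdev(data), 3))
```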

What is the best measure of variation?
Measure of variation | Takes every value into account? | Affected by extreme values? | Advantages and disadvantages
Range | Yes | Yes | Poor measure of variation
Inter-quartile range | No | No | Often a good choice if there are some extreme values
Variance | Yes | Yes | Reported as a squared value
Standard deviation | Yes | Yes | More useful than variance as it is in the same units as the data
Symmetric data: mean and standard deviation
Skewed data: median and inter-quartile range

Summarising Categorical data  Percentages and rates  Covered in Day 3 - Basic Analytical Techniques

Normal (N) distribution Mean = Median = Mode  Symmetrical  Bell shaped  Standard normal distribution: mean = 0, SD = 1  Represents the distribution of values if the whole population were studied

Normal distribution Changes in mean

Normal distribution Changes in standard deviation

Normal distribution  defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores

Normal distribution  Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores  Published data tables listing the area under the Standard Normal Curve

Normal distribution  Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores  Published data tables listing the area under the Standard Normal Curve  Used to calculate area between 2 points
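In practice the published tables are often replaced by a few lines of code. The sketch below writes out the density formula from the slide, converts values to Z scores, and computes the area between two points with `statistics.NormalDist`. The population mean of 82 and SD of 12 are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Density f(x), written out term by term from the formula on the slide."""
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(
        -0.5 * ((x - mu) / sigma) ** 2
    )


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def area_between(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between two points."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Hypothetical population: mean 82, SD 12.
mu, sigma = 82.0, 12.0
z_low, z_high = z_score(70.0, mu, sigma), z_score(94.0, mu, sigma)

area_raw = area_between(70.0, 94.0, mu, sigma)  # on the original scale
area_std = area_between(z_low, z_high)          # same area via Z scores
central = area_between(-1.96, 1.96)             # the familiar central 95%

print(z_low, z_high, round(area_raw, 4), round(central, 4))
```

Working on the Z scale gives the same area as working on the original scale, which is why a single table of the Standard Normal Curve is enough.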

Importance of N distribution  Many biological variables are N distributed or can be made N distributed by transformation  For some health related or medical data normal distributions are rare  Samples from a population that is normally distributed will not necessarily look normal themselves, especially if the sample is small  Normality can be assessed visually, but it is better to use significance tests and normal plots

Skewed data

Transforming skewed data

Populations and samples  Samples used to provide estimates of population values  Will the sample give the right answer?  Bias and random error  Bias (systematic bias): the sample is selected in such a way that even a very large sample will not represent the true answer  select sample using appropriate process e.g. random sampling  measure the variable accurately  Random error: caused by any factors that randomly affect measurement of the variable across the sample.  different samples will give different answers  Good sample: large and randomly selected

How good is the sample?  Two measures of precision  Standard error: measures the amount of variability in the sample estimates. It indicates how close the population mean or proportion is likely to be to the sample estimate. Mean: SE = s/√n  Proportion: SE = √(p(1−p)/n), where s is the sample standard deviation, p the sample proportion and n the sample size

How good is the sample?  Two measures of precision  Standard error  Confidence intervals  based on the Normal distribution, 95% of sample estimates will be within 1.96 SEs of the true value  provides a range of values within which the true (population) value is likely to lie  for 95% of samples this interval will contain the true population value  for any one sample there is a 95% chance that the interval contains the true value  5% risk (or 1 in 20 chance) that the true value lies outside the 95% interval  Narrow 95% CI → precise estimate  Wide 95% CI → imprecise estimate
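A standard error and a 95% confidence interval based on the Normal distribution can be sketched as below. The proportion example reuses the 30/100 male patients mentioned earlier; the mean example (mean 82, SD 12, n = 120) uses hypothetical values. The helper names are assumptions for illustration, and the Normal approximation is assumed to be adequate for these sample sizes.

```python
import math

Z_95 = 1.96  # 95% of sample estimates lie within 1.96 SEs of the true value


def se_mean(sd, n):
    """Standard error of a sample mean."""
    return sd / math.sqrt(n)


def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


def ci_95(estimate, se):
    """95% confidence interval: estimate plus or minus 1.96 standard errors."""
    return estimate - Z_95 * se, estimate + Z_95 * se


# Proportion: 30 of 100 patients are male.
p_hat = 30 / 100
low_p, high_p = ci_95(p_hat, se_proportion(p_hat, 100))

# Mean: hypothetical sample with mean 82, SD 12 and 120 patients.
low_m, high_m = ci_95(82.0, se_mean(12.0, 120))

print(f"proportion: {p_hat:.2f} ({low_p:.3f}, {high_p:.3f})")
print(f"mean: 82.0 ({low_m:.2f}, {high_m:.2f})")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the sample halves the width of the interval.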

Some notes on confidence intervals  95% reference range  a measure of the spread of the data  contains 95% of the observations  95% confidence intervals  a measure of precision of a sample estimate  95% probability that the interval contains the true population value

Self-reported smoking status in women (%), by ethnic group with 95% confidence intervals (England, 2004)

Interpretation of confidence intervals  Non-overlapping intervals are indicative of real differences.  Overlapping intervals need to be considered with caution.  Need to be careful about using confidence intervals as a means of testing.  The smaller the sample size, the wider the confidence interval.

Interpretation of confidence intervals  What can we say about the true smoking prevalence for the general population?  For which ethnic groups is the prevalence of smoking significantly different from 25%?  Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations?  Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations?

Interpretation of confidence intervals
What can we say about the true smoking prevalence for the general population? We can be 95% confident that the true smoking prevalence for the general population is between 22.5% and 24.5%.
For which ethnic groups is the prevalence of smoking significantly different from 25%? For Black African, Indian, Pakistani, Bangladeshi and Chinese groups the prevalence of smoking is significantly different from 25%.
Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations? The prevalence of smoking is significantly different between the Black Caribbean and Black African groups.
Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations? We cannot be sure that the prevalence of smoking is significantly different between the Pakistani and Bangladeshi populations.

Hypothesis testing  Inferences about a population are often based upon a sample.  Descriptive statistics describe the data set, but do not allow us to draw conclusions.  Inferential statistics are used to draw conclusions about characteristics of a population based on data from a sample.  Hypothesis testing is one of the methods used in inferential statistics.  Hypothesis testing provides some criteria for reaching conclusions.

Hypothesis testing  Null hypothesis (H0)  hypothesis which the researcher tries to disprove, reject or nullify  “there is no difference (association) between groups (variables)”  H0: There is no difference in cholesterol level between patients taking statins and patients not taking statins. H0: There is no association between daily calorie intake and weight.  Alternative hypothesis (H1)  the hypothesis we accept if the null hypothesis is not true  “there is a difference (an association) between groups (variables)”  H1: There is a difference in cholesterol level between patients taking statins and patients not taking statins. H1: There is an association between daily calorie intake and weight.

When to reject/accept H0/H1?

Significance levels and p-values  Used as criteria to accept or reject H0  The p-value is the probability of obtaining a difference as large as (or larger than) that observed, if there is really no difference in the population from which the samples came, i.e. if the null hypothesis is true.  For small p-values (p<0.05) it is unlikely that the sample arose from a population where H0 is true. Evidence for a real difference.  For large p-values (p>0.05) the sample could plausibly have arisen from a population where H0 is true. No evidence for a real difference.
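One simple way to obtain a p-value is a z-test based on the Normal distribution, sketched below for the statin example. This is an illustration under stated assumptions rather than the method used in the course: the cholesterol means and standard errors are hypothetical, the helper names are invented for the example, and the Normal approximation assumes reasonably large samples.

```python
import math
from statistics import NormalDist


def z_for_difference(mean1, se1, mean2, se2):
    """Z statistic for the difference between two independent sample means."""
    return (mean1 - mean2) / math.sqrt(se1 ** 2 + se2 ** 2)


def two_sided_p_value(z):
    """Probability of a difference at least this large if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical mean cholesterol (mmol/l) and standard errors for two groups.
z = z_for_difference(mean1=4.6, se1=0.10, mean2=5.0, se2=0.12)
p = two_sided_p_value(z)

print(f"z = {z:.2f}, p = {p:.3f}")
if p < 0.05:
    print("Evidence against H0 at the 5% significance level")
else:
    print("No evidence against H0 at the 5% significance level")
```

A z of about 1.96 in either direction corresponds to p = 0.05, which ties the significance level to the 1.96 standard errors used for 95% confidence intervals.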

Interpretation of p-values. Source: Essential Medical Statistics by Betty R. Kirkwood and Jonathan A. C. Sterne

Quiz: A person was defined as hypertensive if their diastolic blood pressure was > 90 mmHg and their systolic was > 140 mmHg. The variable ‘hypertensive’ is: a) Paired continuous b) Nominal categorical c) Skewed d) Continuous

What conclusion can be drawn from this figure? a) The mean is less than the standard deviation b) The mean is higher than the median c) There are fewer observations below the mean than above it d) The mean is approximately equal to the median

Based on a sample of 153 newborns, the 95% CI for the population mean birth weight was between 3181 and 3319 grams: a) 95% of the individual birth weights are between 3181 and 3319 grams b) The true mean for the 153 newborns is probably between 3181 and 3319 grams c) The mean of the population from which the 153 newborns came is between 3181 and 3319 grams d) There is a 95% chance that the true mean of the population from which the 153 newborns came is included in the range 3181 to 3319 grams

Useful resource

Conclusion  Cover basic statistical concepts  Gain insight into what statistics mean  Gain confidence in understanding basic statistics Any questions?