Day 2 Session 1 Basic statistics Gabriele Price Senior Public Health Intelligence Analyst South.

Day 2 Session 1 Basic statistics Gabriele Price (Gabriele.Price@sepho.nhs.uk) Senior Public Health Intelligence Analyst South East Public Health Observatory

Overview  What is/are statistics?  Summarising data  Types of data  Normal distribution  Confidence intervals  Hypothesis testing  Significance and p-values

What is ‘statistics’? Statistics is the science of:  collecting  summarising  analysing  and interpreting sets of data. “Statistics is a way to get information from data. That’s it!” Gerald Keller

What are statistics? Statistics are:  the numerical facts or data themselves  numbers derived from a sample of data that describe some characteristic of the sample Some key concepts:  population  sample  unit  data  parameter vs. statistic.

Key statistical concepts  Population  is the group of all items of interest to a statistical practitioner  is frequently very large and sometimes infinite.

Key statistical concepts  Population  Sample  is a part or subset of the population used to gain information about the population  is still potentially large, but smaller than the population.

Key statistical concepts  Population  Sample  Unit  any individual member of the population/sample.

Key statistical concepts  Population  Sample  Unit  Data  is measurements that have been collected.

Key statistical concepts  Population  Sample  Unit  Data  Parameter or statistic  is a descriptive measure of a population or sample.  Inference is the process of making decisions about a population based on information contained in a sample from that population  The sample should be selected through an appropriate process, e.g. using a random procedure.  Sample selection in health intelligence?  all hospital admissions in England  number of COPD sufferers on GP register in England

Types of data
Numerical (quantitative): counted or measured. Examples: Weight: 70 kg; Height: 168 cm; Age: 45 years; Blood pressure: 120 mmHg. Numbers can be added, multiplied, averaged etc.
Categorical (qualitative): characterises a quality. Examples: Gender: male/female; Smoking status: smoker = 1, ex-smoker = 2, non-smoker = 3. Any numbers used as labels cannot be added, multiplied, averaged etc.

Numerical data
Discrete: integers (whole numbers); usually counts. Examples: number of people, number of admissions, number of prescriptions.
Continuous: any value on the scale; usually measurements. Examples: height, weight.
The distinction between discrete and continuous data is often not necessary for the purpose of analysis.

Categorical data
Nominal: no natural order. Examples: gender, blood group.
Ordinal: have a natural order. Examples: social class, cancer staging.
Numerical data can be made categorical, e.g. blood pressure > 90 mmHg: hypertensive; ≤ 90 mmHg: normal.
Categorical data are often coded numerically for computer data entry, e.g. male = 1, female = 2. This does not make them numerical data.
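The recoding of a numerical measurement into a category can be sketched in Python. This is a minimal illustration that is not part of the original slides: the function name `categorise_diastolic` and the `SEX_CODES` dictionary are hypothetical, and the 90 mmHg cut-off is simply the one used in the slide's example.

```python
# Hypothetical helper: turn a numerical reading into a categorical label.
def categorise_diastolic(pressure_mmhg: float) -> str:
    """Return 'Hypertensive' above 90 mmHg, otherwise 'Normal'."""
    return "Hypertensive" if pressure_mmhg > 90 else "Normal"


# Numeric codes used for data entry are labels only, not quantities,
# so averaging them would be meaningless.
SEX_CODES = {1: "male", 2: "female"}

readings = [90.9, 70.8, 82.6, 100.2]  # a few readings from the diastolic data
labels = [categorise_diastolic(r) for r in readings]
print(labels)
```

Once recoded, the variable should be summarised with counts and percentages rather than with means.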

Some notes on types of data…  Type of data applies to individual measurements, not to summary group statistics e.g. In a sample of patients, 30/100 (30%) are male.  In health intelligence, the summary statistic is often the individual measurement of interest.  The individual unit of analysis is more likely to be PCT, local authority, region, or SHA than an individual person. Examples:  Ethnic origin of an individual  % BME for a PCT  Individual death under 75 years from CHD  Under 75 rate for CHD for PCT

Which types of data are the following?  Birthweight  Marital status  Pain scale  Age at last birthday  Exact age  Number of visits to GP last year  Cancer staging  Cholesterol level  Number of colds last year  Length of hospital stay  Mortality

Which types of data are the following?
Birthweight: 3.050 kg: numerical continuous
Marital status: married: categorical nominal
Pain scale (mild, moderate, severe): mild: categorical ordinal
Age at last birthday: 21 years: numerical discrete
Exact age: 21 years and 6 months (21.5): numerical continuous
Number of visits to GP last year: 5: numerical discrete
Cancer staging: II: categorical ordinal
Cholesterol level: 4.6 mmol/l: numerical continuous
Number of colds last year: 2: numerical discrete
Mortality (dead/alive): alive: categorical nominal
Length of hospital stay: 7 days: numerical discrete

Summarising data  Often referred to as Descriptive Statistics or Summary Statistics  Graphical techniques i.e. frequency distribution.  Numerical techniques  measure(s) of central location  measure(s) of variation (dispersion).  … methods of organising, summarising and presenting data in a convenient and informative way.

Some data
Diastolic pressure readings in 120 patients
90.9 70.8 82.6 87.4 85.8 78.6 79.0 81.4 82.4 84.3
100.2 71.1 98.0 86.7 69.9 90.3 83.6 76.7 91.6 85.3
69.0 101.7 87.6 79.0 89.5 74.5 71.5 69.3 86.7 69.0
81.3 64.3 88.9 74.6 75.7 79.1 82.3 75.8 79.2 89.1
94.6 70.7 87.7 65.3 101.1 85.9 86.8 73.5 95.6 67.6
100.2 71.0 76.6 83.9 85.9 74.3 82.8 80.2 66.2 85.5
76.0 80.4 91.2 72.2 98.0 73.5 84.5 86.2 64.3 89.1
71.7 78.1 88.2 83.9 91.8 98.0 92.0 85.8 72.8 90.4
89.1 85.8 84.7 80.2 54.1 85.9 79.8 77.6 86.4 74.5
80.1 90.2 81.8 56.5 65.4 90.5 80.7 77.3 69.6 93.7
82.3 84.3 78.8 96.7 100.0 86.8 92.0 81.9 87.7 59.0
81.1 77.7 69.9 76.5 101.1 91.8 86.4 93.0 66.4 96.0

Graphical techniques  Frequency distribution: lists data values by groups of intervals along with their corresponding frequencies (or counts). A histogram is a graphical display of tabulated frequencies.
Diastolic pressure   Frequency
55-59      3
60-64      2
65-69      11
70-74      14
75-79      17
80-84      22
85-89      25
90-94      14
95-99      6
100-104    6

Graphical techniques  Frequency distribution  Relative frequency: determined in the same way as the frequency distribution except that it consists of the proportions of occurrences instead of the numbers of occurrences for each group.
Diastolic pressure   Relative frequency
55-59      2.5%
60-64      1.7%
65-69      9.2%
70-74      11.7%
75-79      14.2%
80-84      18.3%
85-89      20.8%
90-94      11.7%
95-99      5.0%
100-104    5.0%

Graphical techniques  Frequency distribution  Relative frequency  Cumulative frequency: the running total of the frequencies. On a graph, it can be represented by a cumulative frequency curve.
Diastolic pressure <=   Cumulative frequency
59     3
64     5
69     16
74     30
79     47
84     69
89     94
94     108
99     114
104    120

Graphical techniques  Frequency distribution  Relative frequency  Cumulative frequency  Cumulative relative frequency: determined in the same way as the cumulative frequency distribution except that it consists of the proportions instead of the numbers.
Diastolic pressure <=   Cumulative relative frequency
59     2.5%
64     4.2%
69     13.3%
74     25.0%
79     39.2%
84     57.5%
89     78.3%
94     90.0%
99     95.0%
104    100.0%
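The relative and cumulative columns follow directly from the frequency counts. The sketch below is not part of the original slides; it recomputes them in Python from the counts in the frequency table, and the variable names are illustrative only.

```python
from itertools import accumulate

# Frequency counts copied from the slide's table (group label -> count).
frequency = {
    "55-59": 3, "60-64": 2, "65-69": 11, "70-74": 14, "75-79": 17,
    "80-84": 22, "85-89": 25, "90-94": 14, "95-99": 6, "100-104": 6,
}

counts = list(frequency.values())
total = sum(counts)  # number of patients

# Relative frequency: each count as a percentage of the total.
relative = [round(100 * c / total, 1) for c in counts]
# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(counts))
# Cumulative relative frequency: running total as a percentage.
cumulative_relative = [round(100 * c / total, 1) for c in cumulative]

for group, f, r, cf, cr in zip(frequency, counts, relative,
                               cumulative, cumulative_relative):
    print(f"{group:>8} {f:>4} {r:>6.1f}% {cf:>5} {cr:>6.1f}%")
```

The last cumulative value always equals the sample size, and the last cumulative relative value is always 100%, which is a quick check on any hand-built table.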

Numerical techniques Measure(s) of central location  Mean: the measure of centre found by adding the values and dividing the total by the number of values. mean = (sum of all values) / (number of values)

Numerical techniques Measure(s) of central location  Mean = 82  Median: the measure of centre that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If the number of values is even, the median is found by calculating the mean of the two middle values.

Numerical techniques Measure(s) of central location  Mean = 82  Median = 82.5  Mode: the value that occurs most frequently.
Multimodal: in the diastolic pressure data, mode = 89.1, 98.0, 85.8 and 85.9.
One mode. Example: the number of visits made to the GP in 1 year by 21 patients: 0,0,0,1,1,2,2,2,3,3,4,4,4,4,5,5,6,8,18,41,55
No mode. Example: the age of the 21 patients who visited the GP in 1 year: 2,5,6,7,12,15,21,30,55,59,60,61,69,70,71,73,80,85,87,89,90
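The three situations (several modes, one mode, no mode) can be checked with a few lines of Python. This is a sketch rather than anything shown in the original slides: the helper `modes` is hypothetical, expects a non-empty list, and treats a data set in which no value repeats as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


# GP visits and ages are the two examples given on the slide.
gp_visits = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55]
ages = [2, 5, 6, 7, 12, 15, 21, 30, 55, 59, 60, 61,
        69, 70, 71, 73, 80, 85, 87, 89, 90]

print(modes(gp_visits))  # a single mode
print(modes(ages))       # no repeated value, so no mode
```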

Mean and median?
a) 3, 4, 5, 6, 7
b) 9, 10, 20, 21
c) 1, 2, 3, 4, 990

Mean and median?
a) 3, 4, 5, 6, 7: mean = 5, median = 5
b) 9, 10, 20, 21: mean = 15, median = 15
c) 1, 2, 3, 4, 990: mean = 200, median = 3
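These answers can be reproduced with Python's standard `statistics` module. The snippet is an illustrative sketch, not part of the original slides; the dictionary name `examples` is arbitrary.

```python
from statistics import mean, median

# The three data sets from the exercise.
examples = {
    "a": [3, 4, 5, 6, 7],
    "b": [9, 10, 20, 21],
    "c": [1, 2, 3, 4, 990],
}

# Map each label to a (mean, median) pair.
summary = {label: (mean(values), median(values))
           for label, values in examples.items()}

for label, (m, med) in summary.items():
    print(f"{label}) mean = {m}, median = {med}")
```

Data set c) shows why the median is preferred for skewed data: a single extreme value (990) pulls the mean far from the bulk of the observations while the median is unchanged.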

What is the best measure of central tendency? No single best answer!

What is the best measure of central tendency?
Columns: Measure of centre; How common?; Existence; Takes every value into account? / Affected by extreme values?; Advantages and disadvantages
Mean: most familiar ‘average’; always exists; yes; works well with many statistical methods
Median: commonly used; always exists; no; often a good choice if there are some extreme values
Mode: sometimes used; might not exist, may be more than one mode; no; rarely used in health-related and medical statistics

Numerical techniques Measure(s) of variation  Range: difference between highest and lowest values.  poor measure of variation  sensitive to extreme values  however, often reported, e.g. 82 (54.1, 101.7)

Numerical techniques Measure(s) of variation  Range  Inter-quartile range: difference between upper quartile and lower quartile  lower quartile has ¼ of the values smaller than it  upper quartile has ¼ of the values larger than it  Box and whisker plots

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles: value below which a given proportion of the data lies  divide the ranked data into 100 groups  1st percentile: 1% of data below, 99% above  5th percentile: 5% of data below, 95% above  10th percentile: 10% of data below, 90% above etc.  could describe the spread by the difference between the 10th and 90th percentiles  or the ratio of the 90th percentile to the 10th percentile
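The range, inter-quartile range and percentiles can be sketched with the standard library. This example is not from the original slides: it reuses the GP visits data from the mode example and relies on the default ("exclusive") method of `statistics.quantiles`. Other software may use slightly different quartile conventions, so small differences between packages are to be expected.

```python
from statistics import quantiles

# Number of GP visits in 1 year by 21 patients (from the mode example).
gp_visits = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55]

# Range: highest minus lowest value; driven entirely by the extremes.
value_range = max(gp_visits) - min(gp_visits)

# Quartiles: three cut points dividing the ranked data into four groups.
lower_q, middle, upper_q = quantiles(gp_visits, n=4)
iqr = upper_q - lower_q

# Deciles: nine cut points; the first and last are the 10th and 90th percentiles.
deciles = quantiles(gp_visits, n=10)
p10, p90 = deciles[0], deciles[-1]

print("range:", value_range)
print("quartiles:", lower_q, middle, upper_q, "IQR:", iqr)
print("10th to 90th percentile spread:", p90 - p10)
```

With three very frequent attenders in the data, the range is much larger than the inter-quartile range, which illustrates why the range is described as sensitive to extreme values.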

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles  Variance: a measure of the variation of a set of data points around their mean value.  step 1: calculate deviations (the difference between each observation and the mean of the data)  step 2: square these deviations  step 3: average the squared deviations (strictly divide by n-1, not n)  used in a variety of statistical tests, but on its own it is of limited practical use since it is expressed in squared units

Numerical techniques Measure(s) of variation  Range  Inter-quartile range  Percentiles  Variance  Standard deviation: a measure of the dispersion of a data set from its mean. Standard deviation is calculated as the square root of variance.  more useful measure of variation as it returns the statistic to the same units as the data
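The three steps for the variance, followed by the square root for the standard deviation, can be written out directly. The sketch below is illustrative and not part of the original slides; it uses the small data set 3, 4, 5, 6, 7 from the mean and median exercise and compares the hand calculation with the standard library's `variance` and `stdev`, which also divide by n-1.

```python
from math import sqrt
from statistics import stdev, variance

data = [3, 4, 5, 6, 7]
n = len(data)
mean_value = sum(data) / n

# Step 1: deviations of each observation from the mean.
deviations = [x - mean_value for x in data]
# Step 2: square the deviations.
squared = [d ** 2 for d in deviations]
# Step 3: average the squared deviations, dividing by n-1.
sample_variance = sum(squared) / (n - 1)
# Standard deviation: square root of the variance, in the units of the data.
sample_sd = sqrt(sample_variance)

print("variance:", sample_variance, "library:", variance(data))
print("standard deviation:", round(sample_sd, 3), "library:", round(stdev(data), 3))
```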

What is the best measure of variation?
Columns: Measure of variation; Takes every value into account? / Affected by extreme values?; Advantages and disadvantages
Range: yes; poor measure of variation
Inter-quartile range: no; often a good choice if there are some extreme values
Variance: yes; reported as a squared value
Standard deviation: yes; more useful than variance as it is in the same units as the data
Symmetric data: mean and standard deviation
Skewed data: median and inter-quartile range

Summarising Categorical data  Percentages and rates  Covered in Day 3 - Basic Analytical Techniques

Normal (N) distribution Mean = Median = Mode  Symmetrical  Bell shaped  Standard normal distribution: mean = 0, SD = 1  Represents the distribution of values if the whole population were studied

Normal distribution Changes in mean

Normal distribution Changes in standard deviation

Normal distribution  defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores

Normal distribution  Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores  Published data tables listing the area under the Standard Normal Curve

Normal distribution  Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2]  Standard N scores – Z scores  Published data tables listing the area under the Standard Normal Curve  Used to calculate area between 2 points
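Instead of looking up published tables, the area under the Standard Normal curve can be computed directly. The sketch below is not part of the original slides: it uses `statistics.NormalDist` from the Python standard library, the helper names are hypothetical, and the standard deviation of 10 mmHg in the example is an assumed value chosen only for illustration.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Standardise x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


def area_between(z_low, z_high):
    """Area under the Standard Normal curve between two z-scores."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# Assumed example: mean 82 mmHg (as in the slides) and an illustrative SD of 10.
z = z_score(92, mu=82, sigma=10)
print("z-score:", z)
print("area within 1 SD of the mean:", round(area_between(-1, 1), 3))
print("area within 1.96 SD of the mean:", round(area_between(-1.96, 1.96), 3))
```

The second area is the basis of the 1.96 multiplier used for 95% confidence intervals.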

Importance of N distribution  Many biological variables are N distributed or can be made N distributed by transformation  For some health related or medical data normal distributions are rare  Samples from a population that is normally distributed will not necessarily look normal themselves, especially if the sample is small  Normality can be assessed visually but it is better to use significance tests and normal plots

Skewed data

Transforming skewed data

Populations and samples  Samples used to provide estimates of population values  Will the sample give the right answer?  Bias and random error  Bias (systematic bias): the sample is selected in such a way that even a very large sample will not represent the true answer  select sample using appropriate process e.g. random sampling  measure the variable accurately  Random error: caused by any factors that randomly affect measurement of the variable across the sample.  different samples will give different answers  Good sample: large and randomly selected

How good is the sample?  Two measures of precision  Standard error: measures the amount of variability in the sample estimates. It indicates how close the population mean or proportion is likely to be to the sample estimate. Mean: SE = SD/√n. Proportion: SE = √(p(1-p)/n).

How good is the sample?  Two measures of precision  Standard error  Confidence intervals
Based on the Normal distribution, 95% of sample estimates will be within 1.96 SEs of the true value.
A confidence interval provides a range of values within which the true (population) value is likely to lie.
For 95% of samples this interval will contain the true population value.
For any one sample there is a 95% chance that the interval contains the true value.
There is a 5% risk (or 1 in 20 chance) that the true value lies outside the 95% interval.
Narrow 95% CI: precise estimate. Wide 95% CI: imprecise estimate.
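A standard error and a Normal-approximation 95% confidence interval can be sketched as follows. This is an illustration under stated assumptions, not the author's own calculation: the formulas are the usual textbook ones (SE of a mean = SD/√n, SE of a proportion = √(p(1-p)/n)), the function names are hypothetical, and the example reuses the earlier figure of 30 males in a sample of 100 patients.

```python
from math import sqrt

Z_95 = 1.96  # Normal multiplier for a 95% interval


def se_mean(sd, n):
    """Standard error of a sample mean."""
    return sd / sqrt(n)


def se_proportion(p, n):
    """Standard error of a sample proportion (Normal approximation)."""
    return sqrt(p * (1 - p) / n)


def ci_95(estimate, se):
    """95% confidence interval: estimate plus or minus 1.96 standard errors."""
    return estimate - Z_95 * se, estimate + Z_95 * se


p = 30 / 100  # 30 of 100 sampled patients are male
low, high = ci_95(p, se_proportion(p, 100))
print(f"proportion = {p:.1%}, 95% CI {low:.1%} to {high:.1%}")

# Quadrupling the sample size halves the standard error.
print(se_proportion(p, 100) / se_proportion(p, 400))
```

The Normal approximation is a simplification; for small samples or proportions close to 0 or 1, other interval methods are generally preferred.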

Some notes on confidence intervals  95% reference range  a measure of the spread of the data  contains 95% of the observations  95% confidence intervals  a measure of precision of a sample estimate  95% probability that the interval contains the true population value

Self-reported smoking status in women (%), by ethnic group with 95% confidence intervals (England, 2004)

Interpretation of confidence intervals  Non-overlapping intervals are indicative of real differences.  Overlapping intervals need to be considered with caution.  Need to be careful about using confidence intervals as a means of testing.  The smaller the sample size, the wider the confidence interval.

Interpretation of confidence intervals  What can we say about the true smoking prevalence for the general population?  For which ethnic groups is the prevalence of smoking significantly different from 25%?  Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations?  Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations?

Interpretation of confidence intervals
What can we say about the true smoking prevalence for the general population?
We can be 95% confident that the true smoking prevalence for the general population is between 22.5% and 24.5%.
For which ethnic groups is the prevalence of smoking significantly different from 25%?
For the Black African, Indian, Pakistani, Bangladeshi and Chinese groups the prevalence of smoking is significantly different from 25%.
Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations?
The prevalence of smoking is significantly different between the Black Caribbean and Black African groups.
Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations?
We cannot be sure that the prevalence of smoking is significantly different between the Pakistani and Bangladeshi populations.
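The two checks used in these answers (does an interval exclude a reference value, and do two intervals overlap) can be sketched in Python. This is an illustration only: the function names are hypothetical, and apart from the general population interval of 22.5% to 24.5% quoted above, the intervals are made-up values, because the chart's figures are not reproduced in this transcript.

```python
def excludes(interval, value):
    """True if the reference value lies outside the confidence interval."""
    low, high = interval
    return not (low <= value <= high)


def overlaps(a, b):
    """True if two confidence intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


general_population = (22.5, 24.5)  # quoted in the answer above
group_a = (20.0, 28.0)             # hypothetical interval
group_b = (1.0, 4.0)               # hypothetical interval
group_c = (3.0, 7.0)               # hypothetical interval

print(excludes(general_population, 25.0))  # interval lies wholly below 25%
print(excludes(group_a, 25.0))             # 25% is inside the interval
print(overlaps(group_a, group_b))          # no overlap: indicative of a real difference
print(overlaps(group_b, group_c))          # overlap: interpret with caution
```

Overlapping intervals do not prove that there is no difference, so a formal test is needed before drawing a conclusion in that case.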

Hypothesis testing  Inferences about a population are often based upon a sample.  Descriptive statistics describe the data set, but do not allow us to draw conclusions.  Inferential statistics are used to draw conclusions about characteristics of a population based on data from a sample.  Hypothesis testing is one of the methods used in inferential statistics.  Hypothesis testing provides some criteria for reaching conclusions.

Hypothesis testing
Null hypothesis (H0): the hypothesis which the researcher tries to disprove, reject or nullify; “there is no difference (association) between groups (variables)”.
H0: There is no difference in cholesterol level between patients taking statins and patients not taking statins.
H0: There is no association between daily calorie intake and weight.
Alternative hypothesis (H1): the hypothesis we accept if the null hypothesis is not true; “there is a difference (an association) between groups (variables)”.
H1: There is a difference in cholesterol level between patients taking statins and patients not taking statins.
H1: There is an association between daily calorie intake and weight.

When to reject/accept H0/H1?

Significance levels and p-values  Used as criteria to accept or reject H0  The p-value is the probability of obtaining a difference as large as (or larger than) that observed if there is really no difference in the population from which the samples came, i.e. if the null hypothesis is true.  For a small p-value (p<0.05) it is unlikely that the sample arose from a population where H0 is true. Evidence for a real difference.  For a large p-value (p>0.05) it is likely that the sample arose from a population where H0 is true. No evidence of a real difference.
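How a p-value is obtained from a test statistic can be sketched for the simplest case. The example below is an assumption-laden illustration, not part of the original slides: it supposes the test statistic is a z-score that follows the Standard Normal distribution when H0 is true, and the function names and the 0.05 threshold default are illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a z-score at least this extreme (either direction) under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, alpha=0.05):
    """Compare the p-value with the chosen significance level."""
    return p_value < alpha


for z in (1.0, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f}, significant at 5%: {is_significant(p)}")
```

A z-score of 1.96 sits almost exactly on the 5% boundary, which is why the same multiplier appears in the 95% confidence interval.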

Interpretation of p-values Source: Essential Medical Statistics by Betty R. Kirkwood and Jonathan A. C. Sterne

Quiz
A person was defined as hypertensive if their diastolic blood pressure was > 90 mmHg & their systolic was > 140 mmHg. The variable ‘hypertensive’ is:
a) Paired continuous
b) Nominal categorical
c) Skewed
d) Continuous

What conclusion can be drawn from this figure?
a) The mean is less than the standard deviation
b) The mean is higher than the median
c) There are fewer observations below the mean than above it
d) The mean is approximately equal to the median

Based on a sample of 153 newborns, the 95% CI for the population mean birth weight was between 3181 and 3319 grams:
a) 95% of the individual birth weights are between 3181 & 3319 grams
b) The true mean for the 153 newborns is probably between 3181 & 3319 grams
c) The mean of the population from which the 153 newborns came is between 3181 & 3319 grams
d) There is a 95% chance that the true mean of the population from which the 153 newborns came is included in the range 3181 - 3319 grams

Useful resource http://www.apho.org.uk/apho/techbrief.htm

Conclusion  Cover basic statistical concepts  Gain insight into what statistics mean  Gain confidence in understanding basic statistics Any questions?
