Published by Malcolm Atkins. Modified over 8 years ago.
1
Day 2 Session 1: Basic statistics. Gabriele Price (Gabriele.Price@sepho.nhs.uk), Senior Public Health Intelligence Analyst, South East Public Health Observatory
2
Overview What is/are statistics? Summarising data Types of data Normal distribution Confidence intervals Hypothesis testing Significance and p-values
3
What is ‘statistics’? Statistics is the science of collecting, summarising, analysing and interpreting sets of data. “Statistics is a way to get information from data. That’s it!” (Gerald Keller)
4
What are statistics? Statistics are: the numerical facts or data themselves; numbers derived from a sample of data that describe some characteristic of the sample. Some key concepts: population, sample, unit, data, parameter vs. statistic.
5
Key statistical concepts Population: the group of all items of interest to a statistical practitioner; frequently very large and sometimes infinite.
6
Key statistical concepts Population Sample: a part or subset of the population used to gain information about the population; still potentially large, but smaller than the population.
7
Key statistical concepts Population Sample Unit any individual member of the population/sample.
8
Key statistical concepts Population Sample Unit Data: the measurements that have been collected.
9
Key statistical concepts Population Sample Unit Data Parameter or statistic: a parameter is a descriptive measure of a population; a statistic is a descriptive measure of a sample. Inference is the process of making decisions about a population based on information contained in a sample from that population. The sample should be selected through an appropriate process, e.g. using a random procedure. Sample selection in health intelligence? Examples: all hospital admissions in England; number of COPD sufferers on GP registers in England.
10
Types of data
Numerical (quantitative): counted or measured. Examples: weight 70 kg; height 168 cm; age 45 years; blood pressure 120 mmHg. Numbers can be added, multiplied, averaged etc.
Categorical (qualitative): characterises a quality. Examples: gender (male/female); smoking status (smoker = 1, ex-smoker = 2, non-smoker = 3). Any numbers used as labels cannot be added, multiplied, averaged etc.
11
Numerical data
Discrete: integers (whole numbers). Examples: number of people, number of admissions, number of prescriptions. Usually counts.
Continuous: any value on the scale. Examples: height, weight. Usually measurements.
The distinction between discrete and continuous data is often not necessary for the purpose of analysis.
12
Categorical data
Nominal: no natural order. Examples: gender, blood group.
Ordinal: have a natural order. Examples: social class, cancer staging.
Numerical data can be made categorical, e.g. blood pressure > 90 mmHg = hypertensive, ≤ 90 mmHg = normal.
Categorical data are often coded numerically for computer data entry, e.g. male = 1, female = 2. This does not make them numerical data.
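As a minimal sketch of turning a numerical measurement into a categorical one, the function below applies the 90 mmHg cut-off from the slide. The function name, the sample readings and the label strings are illustrative assumptions, not part of the original presentation.

```python
def classify_blood_pressure(reading_mmhg: float, threshold_mmhg: float = 90.0) -> str:
    """Turn a numerical reading into a categorical label.

    Readings strictly above the threshold are labelled hypertensive,
    everything else is labelled normal (cut-off taken from the slide).
    """
    return "hypertensive" if reading_mmhg > threshold_mmhg else "normal"


# Hypothetical readings chosen to show both sides of the cut-off.
readings = [90.9, 70.8, 82.6, 101.7, 90.0]
labels = [classify_blood_pressure(r) for r in readings]
print(labels)
```

Note that a reading of exactly 90.0 falls in the normal group, because the rule uses a strict "greater than".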
13
Some notes on type of data… Type of data applies to individual measurements, not to summary group statistics, e.g. in a sample of patients, 30/100 (30%) are male. In health intelligence, the summary statistic is often the individual measurement of interest. The individual unit of analysis is more likely to be a PCT, local authority, region or SHA than an individual person. Examples: ethnic origin of an individual vs. % BME for a PCT; individual death under 75 years from CHD vs. under-75 CHD rate for a PCT.
14
Which types of data are the following? Birthweight Marital status Pain scale Age at last birthday Exact age Number of visits to GP last year Cancer staging Cholesterol level Number of colds last year Length of hospital stay Mortality
15
Which types of data are the following?
Birthweight (3.050 kg): numerical continuous
Marital status (married): categorical nominal
Pain scale (mild, moderate, severe; e.g. mild): categorical ordinal
Age at last birthday (21 years): numerical discrete
Exact age (21 years and 6 months, i.e. 21.5): numerical continuous
Number of visits to GP last year (5): numerical discrete
Cancer staging (II): categorical ordinal
Cholesterol level (4.6 mmol/l): numerical continuous
Number of colds last year (2): numerical discrete
Mortality (dead/alive; e.g. alive): categorical nominal
Length of hospital stay (7 days): numerical discrete
16
Summarising data Often referred to as Descriptive Statistics or Summary Statistics: methods of organising, summarising and presenting data in a convenient and informative way. Graphical techniques, e.g. frequency distribution. Numerical techniques: measure(s) of central location; measure(s) of variation (dispersion).
17
Some data Diastolic pressure readings in 120 patients
 90.9  70.8  82.6  87.4  85.8  78.6  79.0  81.4  82.4  84.3
100.2  71.1  98.0  86.7  69.9  90.3  83.6  76.7  91.6  85.3
 69.0 101.7  87.6  79.0  89.5  74.5  71.5  69.3  86.7  69.0
 81.3  64.3  88.9  74.6  75.7  79.1  82.3  75.8  79.2  89.1
 94.6  70.7  87.7  65.3 101.1  85.9  86.8  73.5  95.6  67.6
100.2  71.0  76.6  83.9  85.9  74.3  82.8  80.2  66.2  85.5
 76.0  80.4  91.2  72.2  98.0  73.5  84.5  86.2  64.3  89.1
 71.7  78.1  88.2  83.9  91.8  98.0  92.0  85.8  72.8  90.4
 89.1  85.8  84.7  80.2  54.1  85.9  79.8  77.6  86.4  74.5
 80.1  90.2  81.8  56.5  65.4  90.5  80.7  77.3  69.6  93.7
 82.3  84.3  78.8  96.7 100.0  86.8  92.0  81.9  87.7  59.0
 81.1  77.7  69.9  76.5 101.1  91.8  86.4  93.0  66.4  96.0
18
Graphical techniques Frequency distribution: lists data values by groups of intervals along with their corresponding frequencies (or counts). A histogram is a graphical display of tabulated frequencies.
Diastolic pressure   Frequency
55-59                3
60-64                2
65-69                11
70-74                14
75-79                17
80-84                22
85-89                25
90-94                14
95-99                6
100-104              6
19
Graphical techniques Frequency distribution Relative frequency: determined in the same way as the frequency distribution, except that it consists of the proportions of occurrences instead of the numbers of occurrences for each group.
Diastolic pressure   Relative frequency
55-59                2.5%
60-64                1.7%
65-69                9.2%
70-74                11.7%
75-79                14.2%
80-84                18.3%
85-89                20.8%
90-94                11.7%
95-99                5.0%
100-104              5.0%
20
Graphical techniques Frequency distribution Relative frequency Cumulative frequency: the running total of the frequencies. On a graph, it can be represented by a cumulative frequency curve.
Diastolic pressure <=   Cumulative frequency
59                      3
64                      5
69                      16
74                      30
79                      47
84                      69
89                      94
94                      108
99                      114
104                     120
21
Graphical techniques Frequency distribution Relative frequency Cumulative frequency Cumulative relative frequency: determined in the same way as the cumulative frequency distribution, except that it consists of the proportions instead of the numbers.
Diastolic pressure <=   Cumulative relative frequency
59                      2.5%
64                      4.2%
69                      13.3%
74                      25.0%
79                      39.2%
84                      57.5%
89                      78.3%
94                      90.0%
99                      95.0%
104                     100.0%
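The relative, cumulative and cumulative relative frequencies can all be derived from the frequency counts alone. The sketch below starts from the ten interval counts in the diastolic pressure frequency table; the variable names are illustrative.

```python
from itertools import accumulate

# Interval labels and counts copied from the diastolic pressure frequency table.
intervals = ["55-59", "60-64", "65-69", "70-74", "75-79",
             "80-84", "85-89", "90-94", "95-99", "100-104"]
frequencies = [3, 2, 11, 14, 17, 22, 25, 14, 6, 6]

total = sum(frequencies)

# Relative frequency: each count as a percentage of the total.
relative = [100 * f / total for f in frequencies]

# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: running total as a percentage of the total.
cumulative_relative = [100 * c / total for c in cumulative]

for row in zip(intervals, frequencies, relative, cumulative, cumulative_relative):
    print("{:>8} {:>4} {:>6.1f}% {:>4} {:>6.1f}%".format(*row))
```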
22
Numerical techniques Measure(s) of central location Mean: the measure of centre found by adding the values and dividing the total by the number of values. mean = (sum of the values) / (number of values)
23
Numerical techniques Measure(s) of central location Mean = 82 Median: the measure of centre that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If the number of values is even, the median is found by calculating the mean of the two middle values.
24
Numerical techniques Measure(s) of central location Mean = 82 Median = 82.5 Mode: the value that occurs most frequently. mode = 89.1, 98.0, 85.8 and 85.9 (multimodal).
One mode. Example: the number of visits made to the GP in 1 year by 21 patients: 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55
No mode. Example: the age of the 21 patients who visited the GP in 1 year: 2, 5, 6, 7, 12, 15, 21, 30, 55, 59, 60, 61, 69, 70, 71, 73, 80, 85, 87, 89, 90
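The two GP examples can be checked with Python's standard `statistics.multimode`. The helper `modes` below is an illustrative wrapper, not part of the original slides; it assumes the convention that a data set in which every value occurs exactly once has no mode.

```python
from statistics import multimode


def modes(values):
    """Return the list of modes, or an empty list when no value repeats."""
    if len(set(values)) == len(values):
        return []  # every value occurs once: treated here as "no mode"
    return multimode(values)


# Number of GP visits in 1 year by 21 patients (from the slide).
visits = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55]

# Ages of the same 21 patients (from the slide).
ages = [2, 5, 6, 7, 12, 15, 21, 30, 55, 59, 60, 61, 69, 70, 71, 73, 80, 85, 87, 89, 90]

print(modes(visits))  # a single mode: 4 visits occurs four times
print(modes(ages))    # all ages are distinct, so the list is empty
```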
25
Mean and median? What are the mean and median of: a) 3, 4, 5, 6, 7; b) 9, 10, 20, 21; c) 1, 2, 3, 4, 990
26
Mean and median?
Data                  Mean   Median
a) 3, 4, 5, 6, 7      5      5
b) 9, 10, 20, 21      15     15
c) 1, 2, 3, 4, 990    200    3
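The three answers can be reproduced with the standard library. This short sketch also shows how the single extreme value in data set c) pulls the mean far away from the median; the dictionary and variable names are illustrative.

```python
from statistics import mean, median

# The three data sets from the exercise.
datasets = {
    "a": [3, 4, 5, 6, 7],
    "b": [9, 10, 20, 21],     # even count: median is the mean of 10 and 20
    "c": [1, 2, 3, 4, 990],   # one extreme value
}

summary = {label: (mean(values), median(values)) for label, values in datasets.items()}

for label, (m, med) in summary.items():
    print(f"{label}) mean = {m}, median = {med}")
```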
27
What is the best measure of central tendency? No single best answer!
28
What is the best measure of central tendency?
Measure of centre | How common? | Existence | Takes every value into account? / Affected by extreme values? | Advantages and disadvantages
Mean | Most familiar ‘average’ | Always exists | Yes | Works well with many statistical methods
Median | Commonly used | Always exists | No | Often a good choice if there are some extreme values
Mode | Sometimes used | Might not exist; may be more than one mode | No | Rarely used in health related and medical statistics
29
Numerical techniques Measure(s) of variation Range: difference between the highest and lowest values. A poor measure of variation because it is sensitive to extreme values; however, it is often reported, e.g. 82 (54.1, 101.7).
30
Numerical techniques Measure(s) of variation Range Inter-quartile range: difference between the upper quartile and the lower quartile. The lower quartile has ¼ of the values smaller than it; the upper quartile has ¼ of the values larger than it. Displayed in box and whisker plots.
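To see why the range is sensitive to extreme values while the inter-quartile range is not, the sketch below applies both to the GP visits example from the mode slide. It uses `statistics.quantiles` with its default "exclusive" method; other software may use a different quartile convention and give slightly different quartiles.

```python
from statistics import quantiles

# Number of GP visits in 1 year by 21 patients (from the mode example).
visits = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 8, 18, 41, 55]

# Range: highest value minus lowest value.
data_range = max(visits) - min(visits)

# Quartiles: n=4 returns the three cut points (lower quartile, median, upper quartile).
lower_quartile, middle, upper_quartile = quantiles(visits, n=4)
iqr = upper_quartile - lower_quartile

print(f"range = {data_range}")
print(f"quartiles = {lower_quartile}, {middle}, {upper_quartile}")
print(f"inter-quartile range = {iqr}")
```

The three patients with 18, 41 and 55 visits stretch the range to 55, while the inter-quartile range stays at 4.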
31
Numerical techniques Measure(s) of variation Range Inter-quartile range Percentiles: the value below which a given proportion of the data lies; divide the ranked data into 100 groups. 1st percentile: 1% of data below, 99% above. 5th percentile: 5% of data below, 95% above. 10th percentile: 10% of data below, 90% above, etc. The spread could be described by the difference between the 10th and 90th percentiles, or by the ratio of the 90th percentile to the 10th percentile.
32
Numerical techniques Measure(s) of variation Range Inter-quartile range Percentiles Variance: a measure of the variation of a set of data points around their mean value. Step 1: calculate the deviations (the difference between each observation and the mean of the data). Step 2: square these deviations. Step 3: average the squared deviations (strictly, divide by n-1, not n). Used in a variety of statistical tests, but on its own it is of limited practical use since it is in squared units.
33
Numerical techniques Measure(s) of variation Range Inter-quartile range Percentiles Variance Standard deviation: a measure of the dispersion of a data set from its mean. Standard deviation is calculated as the square root of the variance. A more useful measure of variation, as it returns the statistic to the same units as the data.
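The three variance steps and the square root for the standard deviation can be followed by hand on a small data set. The sketch below uses the values 3, 4, 5, 6, 7 from the mean and median exercise and cross-checks the result against the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

values = [3, 4, 5, 6, 7]
n = len(values)
mean_value = sum(values) / n

# Step 1: deviation of each observation from the mean.
deviations = [x - mean_value for x in values]

# Step 2: square the deviations.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: average the squared deviations, dividing by n - 1 (sample variance).
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance, back in the units of the data.
sample_sd = sqrt(sample_variance)

print(f"variance = {sample_variance}, standard deviation = {sample_sd:.4f}")

# The standard library functions also divide by n - 1.
print(variance(values), stdev(values))
```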
34
What is the best measure of variation?
Measure of variation | Takes every value into account? / Affected by extreme values? | Advantages and disadvantages
Range | Yes | Poor measure of variation
Inter-quartile range | No | Often a good choice if there are some extreme values
Variance | Yes | Reported as a squared value
Standard deviation | Yes | More useful than variance as it is in the same units as the data
Symmetric data: mean and standard deviation. Skewed data: median and inter-quartile range.
35
Summarising Categorical data Percentages and rates: covered in Day 3 - Basic Analytical Techniques
36
Normal (N) distribution Mean = Median = Mode. Symmetrical. Bell shaped. Standard normal distribution: mean = 0, SD = 1. Represents the distribution of values if the whole population were studied.
37
Normal distribution Changes in mean
38
Normal distribution Changes in standard deviation
39
Normal distribution defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2] Standard N scores – Z scores
40
Normal distribution Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2] Standard N scores – Z scores Published data tables listing the area under the Standard Normal Curve
41
Normal distribution Defined by complex formula f(x) = (1/(σ*√(2*π)))*exp[-(1/2)*((x-μ)/σ)^2] Standard N scores – Z scores Published data tables listing the area under the Standard Normal Curve Used to calculate area between 2 points
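As a rough sketch of how the density formula, Z scores and the area between two points fit together, the code below implements the formula directly and uses `statistics.NormalDist` in place of published tables. The mean of 82 echoes the diastolic pressure example, but the standard deviation of 10 is a hypothetical value chosen only for illustration, and the function names are assumptions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density f(x) of a Normal distribution, written as in the slide formula."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)


def z_score(x, mu, sigma):
    """Standardise x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


# Standard Normal distribution (mean 0, SD 1), standing in for printed tables.
standard_normal = NormalDist(mu=0, sigma=1)


def area_between(low, high, mu, sigma):
    """Area under the Normal curve between two points, via their Z scores."""
    return standard_normal.cdf(z_score(high, mu, sigma)) - standard_normal.cdf(z_score(low, mu, sigma))


mu, sigma = 82.0, 10.0  # sigma is a hypothetical value for illustration

central_area = area_between(mu - 1.96 * sigma, mu + 1.96 * sigma, mu, sigma)
print(f"Z score of 92: {z_score(92, mu, sigma)}")
print(f"area within 1.96 SD of the mean: {central_area:.4f}")
```

The area within 1.96 standard deviations of the mean comes out at about 0.95, which is where the 1.96 multiplier for 95% confidence intervals comes from.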
42
Importance of N distribution Many biological variables are N distributed or can be made N distributed by transformation. For some health related or medical data, normal distributions are rare. Samples from a population that is normally distributed will not necessarily look normal themselves, especially if the sample is small. Normality can be assessed visually, but it is better to use significance tests and normal plots.
43
Skewed data
44
Transforming skewed data
45
Populations and samples Samples are used to provide estimates of population values. Will the sample give the right answer? Bias and random error. Bias (systematic bias): the sample is selected in such a way that even a very large sample will not represent the true answer. To reduce it: select the sample using an appropriate process, e.g. random sampling; measure the variable accurately. Random error: caused by any factors that randomly affect measurement of the variable across the sample; different samples will give different answers. Good sample: large and randomly selected.
46
How good is the sample? Two measures of precision Standard error: measures the amount of variability in the sample estimates. It indicates how close the population mean or proportion is likely to be to the sample estimate. Mean: SE = SD / √n. Proportion: SE = √(p*(1-p)/n).
47
How good is the sample? Two measures of precision Standard error Confidence intervals: based on the Normal distribution, 95% of sample estimates will be within 1.96 SEs of the true value, so a 95% confidence interval (estimate ± 1.96 × SE) provides a range of values within which the true (population) value is likely to lie. For 95% of samples this interval will contain the true population value; for any one sample there is a 95% chance that the interval contains the true value, and a 5% risk (or 1 in 20 chance) that the true value lies outside the 95% interval. Narrow 95% CI: precise estimate. Wide 95% CI: imprecise estimate.
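A minimal sketch of a 95% confidence interval for a proportion, using the standard Normal-approximation formula SE = √(p(1-p)/n) and the 1.96 multiplier. The figures (30 males out of 100 patients) reuse the earlier illustration from the notes on types of data; the function name is an assumption, and the Normal approximation is assumed to be adequate for this sample size.

```python
from math import sqrt


def proportion_ci(successes, n, z=1.96):
    """Return (proportion, standard error, (lower, upper)) for a 95% CI.

    Uses the Normal approximation: estimate +/- z * SE,
    with SE = sqrt(p * (1 - p) / n).
    """
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)


# 30 of 100 patients are male (example figures from the earlier slide).
p, se, (lower, upper) = proportion_ci(30, 100)
print(f"proportion = {p:.1%}, SE = {se:.4f}")
print(f"95% CI: {lower:.1%} to {upper:.1%}")

# A sample ten times larger with the same proportion gives a narrower interval.
_, se_large, (lower_large, upper_large) = proportion_ci(300, 1000)
print(f"95% CI with n = 1000: {lower_large:.1%} to {upper_large:.1%}")
```

The second call illustrates the point that a larger sample gives a smaller standard error and therefore a narrower, more precise interval.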
48
Some notes on confidence intervals 95% reference range: a measure of the spread of the data; contains 95% of the observations. 95% confidence interval: a measure of the precision of a sample estimate; 95% probability that the interval contains the true population value.
49
Self-reported smoking status in women (%), by ethnic group with 95% confidence intervals (England, 2004)
50
Interpretation of confidence intervals Non overlapping intervals indicative of real differences. Overlapping intervals need to be considered with caution. Need to be careful about using confidence intervals as a means of testing. The smaller the sample size, the wider the confidence interval.
51
Interpretation of confidence intervals What can we say about the true smoking prevalence for the general population? For which ethnic groups is the prevalence of smoking significantly different from 25%? Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations? Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations?
52
Interpretation of confidence intervals
Q: What can we say about the true smoking prevalence for the general population?
A: We are 95% confident that the true smoking prevalence for the general population is between 22.5% and 24.5%.
Q: For which ethnic groups is the prevalence of smoking significantly different from 25%?
A: For the Black African, Indian, Pakistani, Bangladeshi and Chinese groups the prevalence of smoking is significantly different from 25%.
Q: Is the prevalence of smoking significantly different between the Black Caribbean and Black African populations?
A: The prevalence of smoking is significantly different between the Black Caribbean and Black African groups.
Q: Is the prevalence of smoking significantly different between the Pakistani and Bangladeshi populations?
A: We cannot be sure that the prevalence of smoking is significantly different between the Pakistani and Bangladeshi populations.
53
Hypothesis testing Inferences about a population are often based upon a sample. Descriptive statistics describe the data set, but don’t allow us to draw conclusions. Inferential statistics are used to draw conclusions about characteristics of a population based on data from a sample. Hypothesis testing is one of the methods used in inferential statistics. Hypothesis testing provides some criteria for reaching conclusions.
54
Hypothesis testing
Null hypothesis (H0): the hypothesis which the researcher tries to disprove, reject or nullify: “there is no difference (association) between groups (variables)”. H0: There is no difference in cholesterol level between patients taking statins and patients not taking statins. H0: There is no association between daily calorie intake and weight.
Alternative hypothesis (H1): the hypothesis we accept if the null hypothesis is not true: “there is a difference (an association) between groups (variables)”. H1: There is a difference in cholesterol level between patients taking statins and patients not taking statins. H1: There is an association between daily calorie intake and weight.
55
When to reject/accept H0/H1?
56
Significance levels and p-values Used as criteria to accept or reject H0. The p-value is the probability of obtaining a difference as large as (or larger than) that observed if there is really no difference in the population from which the samples came, i.e. if the null hypothesis is true. For small p-values (p<0.05) it is unlikely that the sample arose from a population where H0 is true: evidence for a real difference. For large p-values (p>0.05) it is plausible that the sample arose from a population where H0 is true: no evidence of a real difference.
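To make the definition of a p-value concrete, the sketch below runs a two-sided one-sample Z test for a proportion. This particular test is a swapped-in illustration, not a method named in the slides, and the figures are hypothetical: a sample in which 30 of 100 people have the characteristic, tested against a null hypothesis that the true proportion is 25%. The function name is also an assumption.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(successes, n, null_proportion):
    """Z test of H0: true proportion == null_proportion (Normal approximation).

    Returns the Z statistic and the probability, if H0 is true, of a
    difference at least as large as the one observed (in either direction).
    """
    observed = successes / n
    # Standard error calculated under the null hypothesis.
    se_null = sqrt(null_proportion * (1 - null_proportion) / n)
    z = (observed - null_proportion) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical figures: 30 of 100 observed, H0 says the true proportion is 25%.
z, p_value = two_sided_p_value(30, 100, 0.25)
print(f"Z = {z:.3f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Evidence against H0 at the 5% significance level")
else:
    print("No evidence against H0 at the 5% significance level")
```

With these figures the p-value is well above 0.05, so a sample proportion of 30% is quite compatible with a true proportion of 25% at this sample size.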
57
Interpretation of p-values Source: Essential Medical Statistics by Betty R. Kirkwood and Jonathan A. C. Sterne
58
Quiz A person was defined as hypertensive if their diastolic blood pressure was > 90 mmHg and their systolic blood pressure was > 140 mmHg. The variable ‘hypertensive’ is: a) Paired continuous b) Nominal categorical c) Skewed d) Continuous
60
What conclusion can be drawn from this figure? a) The mean is less than the standard deviation b) The mean is higher than the median c) There are fewer observations below the mean than above it d) The mean is approximately equal to the median
62
Based on a sample of 153 newborns, the 95% CI for the population mean birth weight was between 3181 and 3319 grams: a) 95% of the individual birth weights are between 3181 and 3319 grams b) The true mean for the 153 newborns is probably between 3181 and 3319 grams c) The mean of the population from which the 153 newborns came is between 3181 and 3319 grams d) There is a 95% chance that the true mean of the population from which the 153 newborns came is included in the range 3181 - 3319 grams
64
Useful resource http://www.apho.org.uk/apho/techbrief.htm
65
Conclusion Cover basic statistical concepts Gain insight into what statistics mean Gain confidence in understanding basic statistics Any questions?