Simple statistics for clinicians on respiratory research By Giovanni Sotgiu Hygiene and Preventive Medicine Institute University of Sassari Medical School Italy
What are your expectations?
Too difficult to explain medical statistics in 30 min…..
What is medical statistics?
“..Discipline concerned with the treatment of numerical data derived from groups of individuals..” P Armitage “..Art of dealing with variation in data through collection, classification and analysis in such a way as to obtain reliable results..” JM Last What is medical statistics?
Collection of statistical procedures well-suited to the analysis of healthcare-related data
Why we need to study statistics in the field of medicine……..
1)Basic requirement of medical research 2)Update your medical knowledge 3)Data management and treatment Why we need to study statistics…
1) Basic concepts 2) Sample and population 3)Probability 4) Data description 5) Measures of disease Road map
Basic concepts
All individuals have similar values or belong to the same category Ex.: all individuals are Chinese, ….women, ….middle age (30~40 years old), ….work in the same factory homogeneity in nationality, gender, age and occupation 1. Homogeneity
Basic concepts Differences in height, weight, treatment… 1. Variation
Toss a coin: the marked face may land up or down. Treat patients suffering from TB with the same antibiotics: some of them recover and others do not. 1. Variation
no variation, no statistics 1. Variation
What is the target of our studies?
Population
the whole collection of individuals that one intends to study 2. Population
Studying a whole population is often prevented by economic issues and short time 2. Population
2. Population and sample
a representative part of the population 2. Sample
Sampling By chance!
Random. Random event: the event may or may not occur in a given experiment; before the experiment, nobody can be sure whether the event will occur or not
Random Please, give some examples of random event…
The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under inferential statistics (generalization)
Probability
3. Probability. Measures the possibility of occurrence of a random event. P(A) = (number of ways event A can occur) / (total number of possible outcomes)
Estimation of probability: frequency. Number of observations: n (large enough). Number of occurrences of random event A: m. P(A) ≈ m/n (relative frequency theory)
3. Probability. A: a random event. P(A): probability of the random event A. 0 ≤ P(A) ≤ 1. P(A) = 1 if an event always occurs; P(A) = 0 if an event never occurs
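The relative frequency estimate m/n can be tried out with a short simulation. This is a minimal Python sketch, assuming a fair coin (true probability 0.5); the function name `relative_frequency` and the fixed seed are illustrative choices, not part of the slides.

```python
import random


def relative_frequency(n_trials, seed=0):
    """Estimate P(heads) as m/n from n_trials simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # m = number of occurrences of the random event A ("heads")
    m = sum(1 for _ in range(n_trials) if rng.random() < 0.5)
    return m / n_trials


# The estimate tends to settle near 0.5 as n grows
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With a small n the estimate can be far from 0.5, which is why the slide asks for n to be "large enough".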
Please, give some examples for probability of a random event and frequency of that random event
Parameters and statistics
4. Parameter A measurement describing some characteristic of a population or A measurement of the distribution of a characteristic of a population Greek letter (μ,π, etc.) Usually unknown
to know the parameter of a population we need a sample
A measurement describing some characteristic of a sample or A measurement of the distribution of a characteristic of a sample Latin letter (s, p, etc.) 4. Statistic
Please give an example for parameter and statistics Does a parameter vary? Does a statistic vary? 4. Statistic
Sampling Error
5. Sampling Error Difference between observed value and true value
5. Sampling Error 1) Systematic error (fixed) 2) Measurement error (random) 3) Sampling error (random)
Sampling error: the statistic differs from the parameter! Statistics from different samples of the same population differ from each other!
Sampling error: it exists in any sampling research. It cannot be avoided, but it may be estimated
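Sampling error can be made visible by drawing several random samples from one population. The sketch below assumes a hypothetical population of 10,000 systolic blood pressure values (mean 120 mmHg, SD 15 mmHg); all numbers and variable names are invented for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 systolic blood pressure values (mmHg)
population = [rng.gauss(120, 15) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter (normally unknown)

# Draw 5 random samples of 50 people and compute the statistic for each
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 50)
    sample_means.append(statistics.mean(sample))

# Sampling error: difference between each statistic and the parameter
errors = [m - mu for m in sample_means]
for m, e in zip(sample_means, errors):
    print(f"sample mean = {m:.1f}, sampling error = {e:+.1f}")
```

Each sample gives a slightly different mean, and none of them equals the population mean exactly.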
Nature of data
Variables and data. Variables are labels whose value can literally vary. Data are the values you get from observing, measuring, counting, assessing, etc.
Data: categorical data (nominal data, ordinal data) and metric data (discrete data, continuous data)
Nominal categorical data: values can be allocated to one of a number of categories. Examples: blood type, sex, linezolid treatment (y/n). The categories cannot be arranged in an ordering scheme
Ordinal categorical data: values can be allocated to one of a number of categories, and the categories have a meaningful order. Differences between categories cannot be determined or are meaningless. Example: very satisfied, satisfied, neutral, unsatisfied, very unsatisfied (new treatment)
Discrete metric data: countable variables; the number of possible values is finite. Examples: number of days of hospitalization, number of men treated with isoniazid
Continuous metric data: measurable variables; infinitely many possible values on a continuous scale covering a range of values without gaps. Units: kg, m, mmHg, years
Describing data….. with tables
Describing data with tables: 1) actual frequency 2) relative and cumulative frequency 3) grouped frequency 4) open-ended groups 5) cross-tabulation
1) Frequency table (frequency distribution). Table columns: TB mortality (%) (the variable), Tally, No. of wards (the frequency)
2) Relative frequency, cumulative frequency. Relative frequency: proportion of the total. Table columns: No. of resistances, No. of patients, Relative frequency (%), Cumulative frequency (%)
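The relative and cumulative frequency columns can be derived from the actual frequencies. A minimal sketch, assuming hypothetical data for 20 patients (the slide's own table values are not available here):

```python
from collections import Counter

# Hypothetical data: number of drug resistances recorded for each of 20 patients
resistances = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]

counts = Counter(resistances)  # actual frequencies
n = len(resistances)

rows = []
cumulative = 0.0
for value in sorted(counts):
    relative = 100 * counts[value] / n  # relative frequency (%)
    cumulative += relative              # cumulative frequency (%)
    rows.append((value, counts[value], relative, cumulative))

print("No. of resistances | No. of patients | Relative (%) | Cumulative (%)")
for value, freq, relative, cum in rows:
    print(f"{value:>18} | {freq:>15} | {relative:>12.1f} | {cum:>14.1f}")
```

The last cumulative value is 100%, a quick check that no patient was dropped.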
3) Grouped frequency. Grouped frequency works for continuous metric data. Table columns: Birth weight, No. of infants born from mothers with TB. Each group has a width of 300 g, a class lower limit and a class upper limit
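Grouping continuous values into classes of equal width can be sketched as follows. The birth weights and the starting limit of 2400 g are hypothetical; only the 300 g group width comes from the slide.

```python
from collections import Counter

# Hypothetical birth weights in grams
birth_weights = [2710, 2950, 3120, 3300, 3340, 3560, 3610, 3890, 4010, 2480]

WIDTH = 300   # group width from the slide
START = 2400  # hypothetical lower limit of the first class


def class_lower_limit(weight):
    """Return the lower limit of the 300 g class that contains this weight."""
    return START + ((weight - START) // WIDTH) * WIDTH


grouped = Counter(class_lower_limit(w) for w in birth_weights)

for lower in sorted(grouped):
    upper = lower + WIDTH - 1  # class upper limit
    print(f"{lower}-{upper} g: {grouped[lower]} infants")
```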
General rules. Frequency table: nominal, ordinal and discrete metric data. Grouped frequency table: continuous metric data
4) Open-ended group. Used when one or more values, called outliers, lie far away from the general mass of the data. Use ≤ or ≥
5) Cross-tabulation. Two variables within a single group of individuals. Table: pulmonary mass (rows: Benign, Malignant, Totals) by TB/HIV+ (columns: Yes, No, Totals)
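A cross-tabulation counts how many individuals fall into each combination of two variables. A minimal sketch with invented records (the counts are hypothetical and are not the slide's data):

```python
from collections import Counter

# Hypothetical records, one per patient: (pulmonary mass, TB/HIV+ status)
records = [
    ("benign", "yes"), ("benign", "no"), ("benign", "no"),
    ("malignant", "yes"), ("malignant", "yes"), ("malignant", "no"),
]

cells = Counter(records)                                # one count per cell
row_totals = Counter(mass for mass, _ in records)       # totals per row
col_totals = Counter(status for _, status in records)   # totals per column

print(f"{'':<10}{'yes':>5}{'no':>5}{'total':>7}")
for mass in ("benign", "malignant"):
    print(f"{mass:<10}{cells[(mass, 'yes')]:>5}{cells[(mass, 'no')]:>5}{row_totals[mass]:>7}")
print(f"{'total':<10}{col_totals['yes']:>5}{col_totals['no']:>5}{len(records):>7}")
```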
Describing data….. with charts
3. Describing data with charts
1) Charting nominal data: a) pie chart b) simple bar chart c) clustered bar chart d) stacked bar chart
2) Charting ordinal data: a) pie chart b) bar chart c) dotplot
3) Charting discrete metric data
4) Charting continuous metric data: histogram
5) Charting cumulative ordinal or discrete metric data: step chart
6) Charting cumulative continuous metric data: cumulative frequency curve or ogive
7) Charting time-based data: time-series chart
1-a) Pie chart: 4-5 categories; one variable; start at 0° and follow the same order as the table. Figure: adverse events of ethionamide
1-b) Simple bar chart: bars of the same width, with equal spaces between bars
1-c) Clustered bar chart
1-d) Stacked bar chart
2-3) Dot-plot Useful with ordinal variables if the number of categories is too large for a bar chart
4) Histogram. Figure: percentage age distribution of pregnant women with TB (x-axis: age group; y-axis: TB cases, %)
6) Cumulative frequency curve
Describing data from its distributional shape
Symmetric mound-shaped distributions. Figure: percentage age distribution of pregnant women with TB
Skewed distributions. Figure: age distribution for migrants who develop TB
Bimodal distributions A bimodal distribution is one with two distinct humps
Normal-ness Symmetric Same mean, median, mode
Describing data with numeric summary value
1. numbers, proportions (percentages) 2. summary measures of location 3. summary measures of spread
Numbers and proportions. Numbers: actual frequencies. Percentage: a proportion multiplied by 100. 1) Prevalence 2) Incidence
Prevalence. Nature: relative frequency. Based on the number of existing cases in some population at a given time t0. (Figure legend: disease, health)
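Prevalence = (No. of existing cases of a disease at t0) / (total population); it ranges from 0 to 1. Example: populations A (N = 6) and B (N = 4) each have an absolute frequency f_a = 1, which allows no comparison; the relative frequencies f_r = 0.17 (A) and f_r = 0.25 (B) allow comparison. (Figure legend: disease, health)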
Prevalence. Figure: three populations with P = 0, P = 0.25 and P = 1. (Figure legend: disease, health)
Prevalence data: always state the time of the evaluation. Example: P(2010) = 0.17, i.e. P(2010) = 17 per 100 individuals
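The prevalence calculation is a single division. The sketch below reuses the slides' populations A (N = 6) and B (N = 4), each with one existing case; the function name `prevalence` is an illustrative choice.

```python
def prevalence(existing_cases, total_population):
    """Existing cases at time t0 divided by the total population at t0."""
    if total_population <= 0:
        raise ValueError("total population must be positive")
    return existing_cases / total_population


p_a = prevalence(existing_cases=1, total_population=6)
p_b = prevalence(existing_cases=1, total_population=4)

# Same absolute frequency (1 case each), different relative frequency
print(f"A: {p_a:.2f} ({100 * p_a:.0f} per 100 individuals)")
print(f"B: {p_b:.2f} ({100 * p_b:.0f} per 100 individuals)")
```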
Incidence estimates the risk of developing disease. Figure: people at risk (healthy) followed from t0 to t1. (Figure legend: disease, health)
Incidence = (No. of new cases during the period t0 to t1) / (total population at risk). - Measures the probability or risk of developing disease during a given time period - Absolute risk: probability of developing an adverse event
Incidence - Assess the health status at baseline: exclude prevalent cases at t0 - Define a follow-up for the cohort: healthy people followed up for a given time period
Cohort. Closed population: adds no new members over time, and loses members only to disease/death. Open population: may gain members over time, through immigration or birth, or lose members through emigration
Cumulative incidence - Closed population - Individual time period at risk: same period for all the members. Figure: individuals A to E followed from t0 to t1 (x-axis: time, 0 to 3; y-axis: people)
Cumulative incidence = (No. of new cases during the period t0 to t1) / (total population at risk)
Cumulative incidence, example: at t0 = 24 individuals at risk; new cases = 3; follow-up = 3 years. CI in 3 years = 3/24 = 0.125 new cases per 1 individual at risk enrolled at t0 = 12.5 new cases in 100 individuals at risk enrolled at t0
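The cumulative incidence example can be checked with a few lines of Python. This sketch reads the slide's "t0 = 24" as 24 individuals at risk at t0, which is an assumption; the function name is illustrative.

```python
def cumulative_incidence(new_cases, at_risk_at_t0):
    """New cases during follow-up divided by people at risk at t0 (closed cohort)."""
    if at_risk_at_t0 <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases / at_risk_at_t0


# Assumed reading of the slide: 24 people at risk at t0, 3 new cases in 3 years
ci = cumulative_incidence(new_cases=3, at_risk_at_t0=24)
print(f"CI over 3 years: {ci:.3f} per individual, {100 * ci:.1f} per 100 individuals")
```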
Cumulative incidence: critical features - Closed populations are rare - It suits only a short follow-up and the enrollment of a few individuals - Most real cohorts are open populations
Open population -Non cases (drop-out) and cases during the follow-up - Enrollment of new individuals during the follow-up - Length of follow-up not uniform
Open population. Figure: individuals A to I followed between t0 and t1, with drop-outs and cases marked (x-axis: time; y-axis: people)
Dynamic cohort. Individual time period at risk not uniform. Estimate the population at risk: - Total person-time - Estimate of the total person-time
Dynamic cohort. Total person-time: sum of the individual time periods at risk. Person-time units: person-days, person-months, person-years
Density of incidence = (No. of new cases during the period t0 to t1) / (total person-time)
Density of incidence: person-years

N              Individual time at risk (years)   Calculation              Person-years
1 (A)          5                                 1 person x 5 years       5
3 (B, C, D)    2                                 3 persons x 2 years      6
2 (E, F)       2.5                               2 persons x 2.5 years    5
2 (G, H)       1.5                               2 persons x 1.5 years    3
1 (I)          3                                 1 person x 3 years       3
Total person-time: 22 person-years
Density of incidence = 1 new case / 22 person-years = 0.045 new cases per 1 person-year = 45 per 1000 person-years
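The person-time total and the density of incidence can be reproduced as a short calculation. The group sizes and times at risk below are those of individuals A to I in the example; the variable names are illustrative.

```python
# (number of people, individual time at risk in years) for A, B-D, E-F, G-H, I
groups = [(1, 5), (3, 2), (2, 2.5), (2, 1.5), (1, 3)]
new_cases = 1

# Total person-time: sum of people x years over all groups
total_person_years = sum(n * years for n, years in groups)

# Density of incidence: new cases per person-year
incidence_density = new_cases / total_person_years

print(f"Total person-time: {total_person_years:g} person-years")
print(f"Density of incidence: {incidence_density:.3f} per person-year "
      f"({1000 * incidence_density:.0f} per 1000 person-years)")
```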
Open population: estimate of the total person-time. Used when the individual time period at risk is not known for all members (migration). Assumption: movement of the cohort occurs in the middle of the follow-up
Estimate of the total person-time = (P0 + Pt)/2 x follow-up, where P0 is the population at the start of the follow-up and Pt is the population at the end
Estimate of the total person-time. At t0: 100 people. Follow-up: 3 years. New cases: 3. Drop-out: 17. Enrollment during the follow-up: 16. P0 = 100; Pt = (100 - 3 - 17 + 16) = 96. (P0 + Pt)/2 x follow-up = (100 + 96)/2 x 3 = 294 person-years
Estimate of the total person-time: test the estimate (at t0: 100 people; follow-up: 3 years; new cases: 3; drop-out: 17; enrollment during the follow-up: 16). People followed for the whole period: 100 - 17 - 3 = 80 people x 3 years = 240 person-years. Movement of the cohort, counted at mid follow-up: (17 x 1.5) + (3 x 1.5) + (16 x 1.5) = 54 person-years. Total: 240 + 54 = 294 person-years
Incidence rate = (No. of new cases during the period t0 to t1) / (estimate of total person-time) = 3 new cases / 294 person-years x 1000 = 10.2 per 1000 person-years
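The estimated person-time and the resulting incidence rate can be verified in code. This sketch uses the example's figures (100 people at t0, 3 new cases, 17 drop-outs, 16 enrolled, 3 years of follow-up); the function name `estimated_person_time` is an illustrative choice.

```python
def estimated_person_time(p0, pt, follow_up_years):
    """Mid-period approximation: average population size x length of follow-up."""
    return (p0 + pt) / 2 * follow_up_years


p0 = 100
new_cases, drop_outs, enrolled = 3, 17, 16

# Population at the end of follow-up: cases and drop-outs leave, new members join
pt = p0 - new_cases - drop_outs + enrolled

person_years = estimated_person_time(p0, pt, follow_up_years=3)
rate_per_1000 = new_cases / person_years * 1000

print(f"Pt = {pt}, person-years = {person_years:g}")
print(f"Incidence rate = {rate_per_1000:.1f} per 1000 person-years")
```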
Summary measures of location
1) mode: the category or value that occurs most often (typical-ness). For categorical and discrete metric data
2) median: the middle value when the values are in ascending order (central-ness). For ordinal and metric data
3) mean (average): the sum of the values divided by the number of values
4) percentile: the values that divide the ordered data into 100 equal-sized groups
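These measures of location are available in Python's standard `statistics` module. A minimal sketch on hypothetical data (days of hospitalization for 7 patients, with one unusually long stay):

```python
import statistics

# Hypothetical data: days of hospitalization for 7 patients
days = [3, 5, 5, 6, 7, 8, 30]

mode = statistics.mode(days)      # most frequent value
median = statistics.median(days)  # middle value of the sorted data
mean = statistics.mean(days)      # sum of values / number of values

# 99 cut points that split the ordered data into 100 groups (percentiles)
percentiles = statistics.quantiles(days, n=100)

print(f"mode = {mode}, median = {median}, mean = {mean:.2f}")
print(f"50th percentile = {percentiles[49]}")
```

The single long stay pulls the mean well above the median, which is one reason the median is often preferred for markedly skewed data.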
Choosing the most appropriate measure

                    Mode   Median                       Mean
Nominal             yes    no                           no
Ordinal             yes    yes                          no
Metric discrete     yes    yes, when markedly skewed    yes
Metric continuous   yes    yes, when markedly skewed    yes
Summary measures of spread. Range: distance from the smallest value to the largest. IQR (interquartile range): spread of the middle half of the values. Boxplot: graphical summary of the three quartile values, the minimum and maximum values, and outliers
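The range and IQR can be computed with the standard library. This sketch uses hypothetical data; note that `statistics.quantiles` uses its default "exclusive" method here, and other software may return slightly different quartiles.

```python
import statistics

# Hypothetical data: days of hospitalization for 7 patients
days = [3, 5, 5, 6, 7, 8, 30]

# Range: largest value minus smallest value
value_range = max(days) - min(days)

# Quartiles: the three cut points that split the ordered data into four groups
q1, q2, q3 = statistics.quantiles(days, n=4)

# IQR: spread of the middle half of the values
iqr = q3 - q1

print(f"range = {value_range}, Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The outlying value of 30 inflates the range but leaves the IQR unchanged, which is why the IQR is the more robust of the two.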
Standard deviation: the average distance of all the data values from the mean value. The smaller the average distance, the narrower the spread, and vice versa. Used with metric data only
1. Subtract the mean from each of the n values in the sample, to give the difference values
2. Square each of these differences
3. Add these squared values together (sum of squares)
4. Divide the sum of squares by 1 less than the sample size (n - 1)
5. Take the square root
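The five steps can be followed one by one in code and compared with the library function. A minimal sketch on a hypothetical sample of 8 values:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = sum(values) / len(values)

differences = [x - mean for x in values]        # step 1: subtract the mean
squared = [d ** 2 for d in differences]         # step 2: square each difference
sum_of_squares = sum(squared)                   # step 3: add the squares
variance = sum_of_squares / (len(values) - 1)   # step 4: divide by n - 1
sd = math.sqrt(variance)                        # step 5: take the square root

print(f"mean = {mean}, sum of squares = {sum_of_squares}, SD = {sd:.3f}")
print(f"statistics.stdev gives {statistics.stdev(values):.3f}")
```

`statistics.stdev` also divides by n - 1, so the two results should agree.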
Standard deviation and the normal distribution
The Basic Steps of Statistical Work 1. Design of study Professional design: Research aim Subjects, Measures, etc.
Statistical design: Sampling or allocation method, Sample size, Randomization, Data processing, etc.
2. Collection of data Source of data Government report system Registration system Routine records Ad hoc survey
Data collection: accurate, complete, timely. Protocol: place, subjects, timing; training; pilot; questionnaire; instruments; sampling method and sample size; budget. Procedure: observation, interview, form filling, letter, telephone, web
3. Data sorting. Checking: by hand, by computer software. Amending: missing data? Grouping: according to categorical variables (sex, occupation, disease…) or according to numerical variables (age, income, blood pressure…)
4. Data analysis. Descriptive statistics (show the sample): mean, incidence rate…; tables and plots. Inferential statistics (towards the population): estimation; hypothesis test (comparison)
Definition of Selection Bias Selection bias: Selection biases are distortions that result from procedures used to select subjects and from factors that influence study participation. The common element of such biases is that the association between exposure and disease is different for those who participate and those who should be theoretically eligible for study, including those who do not participate.
Definition of Selection Bias It is sometimes (but not always) possible to disentangle the effects of participation from those of disease determinants using standard methods for the control of confounding. One example is the bias introduced by matching in case-control studies.
Definition of Confounding Confounding: bias in estimating an epidemiologic measure of effect resulting from an imbalance of other causes of disease in the compared groups. (mixing of effects)
Characteristics of a confounder: associated with disease (in non-exposed); associated with exposure (in source population); not an intermediate cause