Introduction to Biostatistics

Georgi Iskrov, PhD
Department of Social Medicine

Before we start

Outline
Population vs sample
Descriptive vs inferential statistics
Sampling methods
Sample size calculation
Level of measurement
Graphical summaries

Outline
Descriptive statistics
Measures of central tendency
Measures of spread
Normal distribution
Central limit theorem
Outliers
Inferential statistics
Confidence interval

Definition of biostatistics
The science of collecting, organizing, analyzing, interpreting and presenting data in order to make more effective decisions in a clinical context.

Why do we need to use statistical methods?
To draw the strongest possible conclusions from limited amounts of data.
To generalize from a particular set of data to a more general conclusion.
What do we need to pay attention to?
Bias
Probability
Statistics means never having to say you are certain!

Population vs Sample
Population: parameters (μ, σ, σ²)
Sample: statistics (x̄, s, s²)

Population vs Sample
A population includes all objects of interest, whereas a sample is only a portion of the population.
Parameters are associated with populations and statistics with samples.
Parameters are usually denoted using Greek letters (μ, σ), while statistics are usually denoted using Roman letters (x̄, s).
There are several reasons why we do not work with populations:
They are usually large, and it is often impossible to get data for every object we are studying.
Sampling usually comes at a cost, and the more items surveyed, the larger the cost.

Descriptive vs Inferential statistics
We compute statistics and use them to estimate parameters. The computation is the first part of the statistical analysis (descriptive statistics) and the estimation is the second part (inferential statistics).
Descriptive statistics: the procedures used to organize and summarize masses of data.
Inferential statistics: the methods used to find out something about a population, based on a sample.

Descriptive vs Inferential statistics
Population (parameters)
Sampling: from population to sample
Sample (statistics)
Inferential statistics: from sample to population

Sampling
When a sample is drawn, there is no certainty that it will be representative of the population.
[Figure: Sample A and Sample B]

Error
Random error can be conceptualized as sampling variability.
Bias (systematic error) is a difference between an observed value and the true value due to all causes other than sampling variability.
Accuracy is a general term denoting the absence of error of all kinds.

Sampling
Sampling: a specific principle used to select the members of the population to be included in the study. Because the target population is large, researchers have no choice but to study a limited number of cases within it, use them to represent the population, and draw conclusions about the population from them.
Biased sample: a sample drawn by a method that yields samples systematically different from the population.
Random sample: in random sampling, each item or element of the population has an equal chance of being chosen at each draw.

Sampling
[Figure: Sample A and Sample B drawn from the population]

Sampling
Stages of sampling:
Defining the target population
Determining the sample size
Selecting a sampling method
Properties of a good sample:
Random selection
Representativeness by structure
Representativeness by number of cases

Sampling
Random sampling: sample group members are selected in a random manner
  Highly effective if all subjects participate in data collection
  High level of sampling error when the sample size is small
Systematic: every Nth member of the population is included in the study
  Time efficient
  Cost efficient
  High sampling bias if periodicity exists

Sampling
Judgement: sample group members are selected on the basis of the researcher's judgement
  Time efficient
  Samples are not highly representative
  Unscientific approach
  Personal bias
Convenience: participants are obtained conveniently, with no requirements whatsoever
  High levels of simplicity and ease
  Useful in pilot studies
  Highest level of sampling error
  Selection bias

Sampling
Snowball: sample group members nominate additional members to participate in the study
  Possibility to recruit hidden populations
  Over-representation of a particular network
  Reluctance of sample group members to nominate additional members

Sampling
Stratified: representation of specific subgroups or strata
  Effective representation of all subgroups
  Precise estimates in cases of homogeneity or heterogeneity within strata
  Knowledge of strata membership is required
  Complex to apply in practice
Cluster: clusters of participants representing the population are identified as sample group members
  Time and cost efficient
  Group-level information needs to be known
  Usually higher sampling errors compared to alternative sampling methods
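Three of the probability-based methods above (random, systematic and stratified) can be sketched in a few lines of Python. This is only an illustration: the patient IDs, the two age strata, the 10% sampling fraction and the function names are assumptions made for the example, not part of the lecture.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct members; every member is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Take every step-th member after a random start within the first interval."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    """Sample the same fraction from every stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)          # fixed seed so the sketch is reproducible
patients = list(range(1, 101))   # hypothetical patient IDs 1..100
strata = {                       # hypothetical strata, membership must be known
    "under_40": patients[:30],
    "40_and_over": patients[30:],
}

srs = simple_random_sample(patients, 10, rng)
systematic = systematic_sample(patients, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
```

Note that the systematic sample depends on the ordering of the list: if the list has a periodic pattern with the same period as the step, the sample will be biased, as the slide warns.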

Sample size calculation
Law of Large Numbers: as the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero.
Application in biostatistics: the bigger the sample size, the smaller the margin of error.
A properly designed study will include a justification for the number of experimental units (people/animals) being examined.
Sample size calculations are necessary to design experiments that are large enough to produce useful information and small enough to be practical.
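The Law of Large Numbers can be illustrated with a small simulation. The sketch below assumes a hypothetical binary outcome with true probability 0.5 (for example, a fair coin) and reports the observed proportion at increasing numbers of trials; the seed and checkpoints are arbitrary choices.

```python
import random

def running_proportion(n_trials, p=0.5, seed=1):
    """Simulate n_trials yes/no outcomes and record the observed proportion
    of successes at a few checkpoints."""
    rng = random.Random(seed)
    successes = 0
    checkpoints = {}
    for i in range(1, n_trials + 1):
        successes += rng.random() < p   # True counts as 1, False as 0
        if i in (10, 100, 1000, 10000, 100000):
            checkpoints[i] = successes / i
    return checkpoints

proportions = running_proportion(100_000)
for n, prop in proportions.items():
    print(f"n = {n:>6}: observed = {prop:.4f}, error = {abs(prop - 0.5):.4f}")
```

The error does not have to shrink at every single step, but it tends to become small as the number of trials grows.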

Generally, the sample size for any study depends on:
Acceptable level of confidence;
Expected effect size and absolute error of precision;
Underlying scatter in the population;
Power of the study.
High power: large sample size, large effect, little scatter
Low power: small sample size, small effect, lots of scatter

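For quantitative variables:
$$ n = \frac{Z^2 \cdot SD^2}{d^2} $$
n: required sample size; Z: standard normal value for the chosen confidence level (1.96 for 95%); SD: standard deviation; d: absolute error of precision.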

For quantitative variables:
A researcher is interested in knowing the average systolic blood pressure in the pediatric age group at a 95% level of confidence and a precision of 5 mmHg. The standard deviation, based on previous studies, is 25 mmHg.
n = (1.96² × 25²) / 5² = 96.04, rounded up => 97
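The blood pressure example can be checked with a short calculation. This is a minimal sketch; the function name `sample_size_mean` is made up for the illustration, and the result is rounded up because a fraction of a participant cannot be recruited.

```python
import math

def sample_size_mean(z, sd, d):
    """Sample size for estimating a mean: n = Z^2 * SD^2 / d^2, rounded up."""
    return math.ceil((z * sd / d) ** 2)

# 95% confidence (Z = 1.96), SD = 25 mmHg, precision d = 5 mmHg
n_blood_pressure = sample_size_mean(z=1.96, sd=25, d=5)
print(n_blood_pressure)
```

Halving the precision to 2.5 mmHg would roughly quadruple the required sample size, because d enters the formula squared.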

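For qualitative variables:
$$ n = \frac{Z^2 \cdot p(1-p)}{d^2} $$
n: required sample size; Z: standard normal value for the chosen confidence level (1.96 for 95%); p: expected proportion in the population; d: absolute error of precision.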

For qualitative variables:
A researcher is interested in knowing the proportion of diabetes patients who have hypertension. According to a previous study, the proportion is no more than 15%. The researcher wants to calculate the sample size with a 5% absolute precision error and a 95% confidence level.
n = (1.96² × 0.15 × 0.85) / 0.05² = 195.92, rounded up => 196
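The same check for the hypertension example, again as a sketch with a made-up function name (`sample_size_proportion`):

```python
import math

def sample_size_proportion(z, p, d):
    """Sample size for estimating a proportion: n = Z^2 * p * (1 - p) / d^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# 95% confidence (Z = 1.96), expected proportion 15%, absolute precision 5%
n_hypertension = sample_size_proportion(z=1.96, p=0.15, d=0.05)
print(n_hypertension)
```

The product p(1 - p) is largest at p = 0.5, so when no prior estimate of the proportion is available, using 0.5 gives the most conservative (largest) sample size.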

When do you need biostatistics?
BEFORE you start your study! After that, it will be too late…

Planning
Research programme:
Aim
Object
Units of observation
Indices of observation
Place
Time
Statistical analyses
Methodology

Planning
Aim: the aim of the investigation summarizes and clearly formulates the research hypothesis.
Object: the object of the investigation is the event that is going to be studied.
Units of observation:
  Logical unit: each studied case
  Technical unit: the environment in which the logical units are situated
Indices of observation: not too many, but important; measurable; additive and self-controlling.
  Factorial
  Resultative

Planning
Place
Time:
  Single: events are studied at a single moment in time, the so-called "critical moment".
  Continuous: used to characterize a long-term tendency of the events.
Statistical analyses
Methodology

Quantitative (metric) variables
Continuous (measured units)
Continuous values on a proper numeric line or scale.
Metric continuous variables can be properly measured and have units of measurement.
Data are real numbers (located on the number line).
Discrete (counted units)
Integer values on a proper numeric line or scale.
Metric discrete variables can be properly counted and have units of measurement: "numbers of things".

Qualitative (categorical) variables
Nominal: values in arbitrary categories
The ordering of the categories is completely arbitrary. In other words, the categories cannot be ordered in any meaningful way.
Ordinal: values in ordered categories
The ordering of the categories is not arbitrary. It is now possible to order the categories in a meaningful way.
No units! Qualitative data do not have any units of measurement.

Levels of measurement
There are four levels of measurement: nominal, ordinal, interval and ratio. These go from the lowest level to the highest level. Data are classified according to the highest level which they fit. Each additional level adds something the previous level did not have.
Nominal is the lowest level. Only names are meaningful here.
Ordinal adds an order to the names.
Interval adds meaningful differences.
Ratio adds a zero so that ratios are meaningful.

Levels of measurement
Nominal scale, e.g. genotype: you can code it with numbers, but the order is arbitrary and any calculations would be meaningless.
Ordinal scale, e.g. pain score from 1 to 10: the order matters but not the difference between values.
Interval scale, e.g. temperature in °C: the difference between two values is meaningful.
Ratio scale, e.g. height: it has a clear definition of 0. When the variable equals 0, there is none of that variable. When working with ratio variables, but not interval variables, you can look at the ratio of two measurements.

Descriptive statistics
Organising data
  Tables: frequency distributions, relative frequency distributions
  Graphs: bar chart, histogram, box plot
Summarising data
  Central tendency (location)
  Variation (spread)

Graphical summaries
One qualitative variable. Graph: bar chart, pie chart. Statistics: frequency table, relative frequency table, proportion.
Two qualitative variables. Graph: side-by-side bar chart, segmented bar chart. Statistics: two-way table, difference in proportions.
One quantitative variable. Graph: dotplot, histogram, boxplot. Statistics: measures of central tendency, measures of spread, other (five number summary, percentiles, distribution shape).
One quantitative by one qualitative variable. Graph: side-by-side boxplots, stacked dotplots. Statistics: statistics broken down by group, difference in means.
Two quantitative variables. Graph: scatterplot. Statistics: correlation.

Frequency distribution
Frequency distribution of survival for both groups
Experimental group (10 patients)
Individual survival in months: 23 27 17 34 41 28 22 33 29 14
Control group (10 patients)
Individual survival in months: 24 31 39 35 34 24 14 21 25 22
Classes of values:
Survival   Frequency
14         2
17         1
21         1
22         2
23         1
24         2
25         1
27         1
28         1
29         1
31         1
33         1
34         2
35         1
39         1
41         1
Total      20

Relative frequency distribution
Relative frequency distribution of survival for both groups
Survival   Frequency   Percent   Cumulative percent
14         2           10%       10%
17         1           5%        15%
21         1           5%        20%
22         2           10%       30%
23         1           5%        35%
24         2           10%       45%
25         1           5%        50%
27         1           5%        55%
28         1           5%        60%
29         1           5%        65%
31         1           5%        70%
33         1           5%        75%
34         2           10%       85%
35         1           5%        90%
39         1           5%        95%
41         1           5%        100%
Total      20          100%
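The frequency, percent and cumulative percent columns can be derived directly from the raw survival times of the two groups. A minimal sketch (the function name is an assumption for the illustration):

```python
from collections import Counter

# Survival times in months, as listed for the two groups of 10 patients.
experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

def relative_frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / total,
                     100 * running / total))
    return rows

table = relative_frequency_table(experimental + control)
for value, freq, pct, cum in table:
    print(f"{value:>3} {freq:>3} {pct:>5.0f}% {cum:>5.0f}%")
```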

Grouped relative frequency distribution
Relative frequency distribution of survival for both groups
Classes of intervals:
Survival   Frequency   Percent   Cumulative percent
10–14      2           10%       10%
15–19      1           5%        15%
20–24      6           30%       45%
25–29      4           20%       65%
30–34      4           20%       85%
35–39      2           10%       95%
40–44      1           5%        100%
Total      20          100%
What rules should be followed when grouping data?
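Grouping amounts to assigning each value to a class and counting. The sketch below assumes classes of equal width 5 starting at 10 months, which is consistent with the cumulative percentages on the slide; each class is labelled by its lower bound.

```python
from collections import Counter

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

def grouped_frequencies(values, width, start):
    """Count values per class of equal width; keys are class lower bounds."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))

grouped = grouped_frequencies(experimental + control, width=5, start=10)
total = sum(grouped.values())
running = 0
for lower, freq in grouped.items():
    running += freq
    print(f"{lower}-{lower + 4}: {freq:>2} {100 * freq / total:>4.0f}% {100 * running / total:>4.0f}%")
```

Classes that do not overlap, cover every observation and have equal width are the usual requirements when grouping.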

Descriptive statistics
Summarising data:
Central tendency (or the sample's middle value)
  Mean
  Median
  Mode
Spread (or summary of differences within groups)
  Range
  Interquartile range
  Variance
  Standard deviation

Mean
Most commonly called the average: the sum of all values divided by the number of values.
Experimental group (10 patients)
Individual survival in months: 23 27 17 34 41 28 22 33 29 14
Mean = 268 / 10 = 26.8 months
Control group (10 patients)
Individual survival in months: 24 31 39 35 34 24 14 21 25 22
Mean = 269 / 10 = 26.9 months
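As a quick check, the mean of each group can be computed from the survival times given in the frequency distribution slide. A minimal sketch:

```python
experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

mean_experimental = mean(experimental)  # 268 / 10
mean_control = mean(control)            # 269 / 10
print(mean_experimental, mean_control)
```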

Median
The middle value when a variable's values are ranked in order.
The point that divides a distribution into two equal halves.
When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it.
The 50th percentile.

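Median
Control group (10 patients)
Individual survival in months, ranked: 14 21 22 24 24 25 31 34 35 39
With an even number of cases, the median is the average of the two middle values: (24 + 25) / 2
Median = 24.5 (five cases above, five below)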
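The median of the control group can be reproduced by sorting the values and taking the middle one, or the average of the two middle ones when the count is even. A minimal sketch:

```python
def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]
median_control = median(control)
print(median_control)
```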

Median
If the recorded values for a variable form a symmetric distribution, the median and mean are identical.
In skewed data, the mean lies further toward the skew than the median.
[Figure: positions of the mean and median in a symmetric and in a skewed distribution]

Mode
The most common data point is called the mode.
Individual survival data for the control group are: 14, 21, 22, 24, 24, 25, 31, 34, 35, 39. The mode is 24.
It is possible to have more than one mode.
If all values are unique, there is no mode.
The mode may not be at the center of a distribution.
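The three cases described above (one mode, several modes, no mode) can be handled with a simple count. A minimal sketch, with a made-up function name `modes`:

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or an empty list if every value is unique."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values occur once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

control = [14, 21, 22, 24, 24, 25, 31, 34, 35, 39]
experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]

print(modes(control))       # one mode
print(modes(experimental))  # all values unique
```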

Mode
It may give you the most likely experience rather than the typical or central experience.
In symmetric distributions with a single peak, the mean, median and mode are the same.
In skewed data, the mean and median lie further toward the skew than the mode.
[Figure: positions of the mean, median and mode in a symmetric and in a skewed distribution]

Spread
Variation of the recorded values on a variable.
The larger the spread, the further the individual cases are from the mean.
The smaller the spread, the closer the individual cases are to the mean.
[Figure: two distributions with different spread around the mean]

Range
The spread, or the distance, between the lowest and highest values of a variable.
To get the range for a variable, you subtract its lowest value from its highest value.
Experimental group (10 patients)
Individual survival in months: 23 27 17 34 41 28 22 33 29 14
Range = 41 - 14 = 27
Control group (10 patients)
Individual survival in months: 24 31 39 35 34 24 14 21 25 22
Range = 39 - 14 = 25

Standard deviation
Standard deviation takes into account all individual deviations.
A deviation is the distance of a case's score from the mean.
Experimental group's SD = 8.13 months
Control group's SD = 7.64 months
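The range and standard deviation of both groups can be reproduced with Python's standard library. The sketch assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes and which matches the values on the slide.

```python
import statistics

experimental = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

# Range: highest value minus lowest value.
range_experimental = max(experimental) - min(experimental)
range_control = max(control) - min(control)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1.
sd_experimental = statistics.stdev(experimental)
sd_control = statistics.stdev(control)

print(range_experimental, round(sd_experimental, 2))
print(range_control, round(sd_control, 2))
```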

Standard deviation
The larger the standard deviation, the greater the amount of variation around the mean.
The standard deviation is equal to 0 only when all values are the same.
Like the mean, the standard deviation will be inflated by an outlying value.

Interquartile range
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are called the first, second and third quartiles, and they are denoted by Q1, Q2 and Q3, respectively.
IQR is equal to Q3 minus Q1.
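A short sketch of the IQR for the control group's survival times. Several conventions exist for interpolating quartiles in small samples, so other software may report slightly different Q1 and Q3; the values below follow the default ("exclusive") method of Python's `statistics.quantiles`, which is an assumption of this example rather than the lecture's method.

```python
import statistics

control = [24, 31, 39, 35, 34, 24, 14, 21, 25, 22]

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(control, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Q2 is the median, so it agrees with the 24.5 months found earlier for this group.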

Central tendency and spread
Central tendency: mean, mode and median
Spread: range, interquartile range, standard deviation
Mistakes:
Focusing only on the mean and ignoring the variability
Confusing the standard deviation with the standard error of the mean
Confusing variation with variance
What is best to use in different scenarios?
Symmetrical data: mean and standard deviation
Skewed data: median and interquartile range

Boxplot

Important rules
When a constant is added to every observation, the new sample mean is equal to the original mean plus the constant.
When a constant is added to every observation, the standard deviation is unaffected.
When every observation is multiplied by the same constant, the new sample mean is equal to the original mean multiplied by the constant.
When every observation is multiplied by the same constant, the new sample standard deviation is equal to the original standard deviation multiplied by the magnitude of the constant.
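These four rules can be verified numerically. The sketch below uses the experimental group's survival times with two arbitrary constants chosen for illustration (adding 5 and multiplying by -2); the negative multiplier shows why the rule speaks of the magnitude of the constant.

```python
import statistics

data = [23, 27, 17, 34, 41, 28, 22, 33, 29, 14]
mean, sd = statistics.mean(data), statistics.stdev(data)

shifted = [x + 5 for x in data]    # add a constant to every observation
scaled = [x * -2 for x in data]    # multiply every observation by a constant

print(statistics.mean(shifted), mean + 5)        # mean shifts by the constant
print(statistics.stdev(shifted), sd)             # SD is unchanged
print(statistics.mean(scaled), mean * -2)        # mean is multiplied by the constant
print(statistics.stdev(scaled), sd * abs(-2))    # SD is multiplied by |constant|
```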

Normal (Gaussian) distribution
The mean and standard deviation are a particularly appropriate summary for data whose histogram approximates a normal distribution (the bell-shaped curve). If you say that a set of data has a mean survival of 29 months, the typical listener will picture a bell-shaped curve with its peak centered at 29 months.

Rule of 3-sigma
When data are approximately normally distributed:
approximately 68% of the data lie within one SD of the mean;
approximately 95% of the data lie within two SDs of the mean;
approximately 99.7% of the data lie within three SDs of the mean.
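For an exactly normal distribution, the share of values within k standard deviations of the mean equals erf(k / √2). A minimal sketch using the standard library's error function:

```python
import math

def coverage_within(k):
    """Probability that a normally distributed value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: coverage_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} SD: {100 * share:.2f}%")
```

The exact two-SD figure is slightly above 95%, which is why 1.96 rather than 2 is used as the multiplier for a 95% interval.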

Normal (Gaussian) distribution
Central limit theorem:
Create a population with a known distribution that is not normal;
Randomly select many samples of equal size from that population;
Tabulate the means of these samples and graph the frequency distribution.
The central limit theorem states that if your samples are large enough, the distribution of the means will approximate a normal distribution even if the population is not Gaussian.
Mistakes:
Confusing "normal" in the statistical sense with "common" (or "disease free");
Forgetting that few biological distributions are exactly normal.
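The three steps above can be simulated directly. In this sketch the non-normal population is assumed to be exponential with mean 1 and SD 1 (a strongly right-skewed distribution); the sample size of 30, the 5000 samples and the seed are arbitrary choices for the illustration.

```python
import random
import statistics

def sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples samples of equal size from a skewed (exponential)
    population and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

means = sample_means(sample_size=30, n_samples=5000)

# The sample means cluster around the population mean (1.0), and their
# spread is close to the population SD divided by sqrt(sample size).
print(round(statistics.mean(means), 3))
print(round(statistics.stdev(means), 3), round(1 / 30 ** 0.5, 3))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve, even though the individual values come from a skewed distribution.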

Outliers Values that lie very far away from the other values in the data set.

Outliers
Outliers can occur for several reasons:
Invalid data entry
Biological diversity
Random chance
Experimental error
Skewed distribution
Mistakes:
Not realizing that outliers are common in data sampled from a skewed distribution
Eliminating outliers only when you do not get the results you want
Truly removing outliers from your records

Outliers
Outlier test: if values are sampled from a normal distribution, what is the chance that one value will be as far from the others as the extreme value observed?
Examples: Chauvenet criterion, Grubbs test, Peirce criterion
Nevertheless, deletion of outlier data is generally a controversial practice!

Inferential statistics
Population (parameters)
Sampling: from population to sample
Sample (statistics)
Inferential statistics: from sample to population

Confidence interval for the population mean
Population mean: point estimate vs interval estimate
Standard error of the mean: how close the sample mean is likely to be to the population mean.
Assumptions: a random representative sample, independent observations, a population that is normally distributed (at least approximately).
The confidence interval depends on: sample mean, standard deviation, sample size, degree of confidence.
Mistakes:
Believing that 95% of the values lie within the 95% CI;
Believing that a 95% CI covers the mean ± 2 SD.

Standard error of the mean
The sample mean estimates individual values. The uncertainty with which this mean estimates individual values is given by the standard deviation.
The sample mean estimates the population mean. The uncertainty with which this mean estimates the population mean is given by the standard error of the mean.

The confidence interval for the mean gives us a range of values around the sample mean where we expect the "true" population mean to be located. The 95% confidence interval for the population mean is:
$$ \bar{x} \pm 1.96 \times SEM, \qquad SEM = \frac{SD}{\sqrt{n}} $$

The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods (in years) of a random sample of 30 HIV infected individuals are: 12.0, , 9.5, 6.3, 13.5, 12.5, 7.2, 12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, , 10.7, 7.2, 14.9, 6.5, 8.1, 7.9, 12.0, 6.3, 7.8, 6.3, 12.5, 5.2, 13.1, , 7.2.
Calculate the 95% CI for the population mean incubation period in HIV.
x̄ = 9.5 years; SD = 2.8 years; SEM = 2.8 / √30 ≈ 0.5 years
95% level of confidence => Z = 1.96
µ = 9.5 ± (1.96 × 0.5) = 9.5 ± 1 years
95% CI for µ is (8.5; 10.5) years

x̄ = 9.5 years; SD = 2.8 years; SEM = 0.5 years
95% level of confidence => Z = 1.96; 95% CI for µ is (8.5; 10.5) years
99% level of confidence => Z = 2.58
µ = 9.5 ± (2.58 × 0.5) = 9.5 ± 1.3 years
99% CI for µ is (8.2; 10.8) years
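Both intervals can be reproduced from the summary statistics alone (mean 9.5 years, SD 2.8 years, n = 30). A minimal sketch; the function name `mean_ci` is made up, and the normal multipliers 1.96 and 2.58 are taken from the slide.

```python
import math

def mean_ci(mean, sd, n, z):
    """Confidence interval for a population mean: mean ± z * SD / sqrt(n)."""
    sem = sd / math.sqrt(n)   # standard error of the mean
    margin = z * sem
    return mean - margin, mean + margin

ci95 = mean_ci(mean=9.5, sd=2.8, n=30, z=1.96)
ci99 = mean_ci(mean=9.5, sd=2.8, n=30, z=2.58)

print([round(limit, 1) for limit in ci95])
print([round(limit, 1) for limit in ci99])
```

A higher level of confidence requires a wider interval: being more certain that the interval contains the population mean comes at the cost of precision.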

Describing qualitative data
                      Improvement   No improvement   Total
Gluten-free diet      54            46               100
No gluten-free diet   47            53               100
Total                 101           99               200

Describing qualitative data
Standard error of proportion:
$$ SE(p) = \sqrt{\frac{p(1-p)}{n}} $$
The 95% confidence interval for a population proportion is:
$$ p \pm 1.96 \times SE(p) $$
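As an illustration, the formulas can be applied to the gluten-free diet group from the table, where 54 of 100 patients improved. This is a sketch with a made-up function name; it uses the normal approximation, which is generally considered reasonable only when both the number of events and the number of non-events are not small.

```python
import math

def proportion_ci(events, n, z=1.96):
    """Return (p, SE, lower, upper) for a proportion using p ± z * sqrt(p(1-p)/n)."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se

# Gluten-free diet group: 54 of 100 patients improved.
p, se, lower, upper = proportion_ci(events=54, n=100)
print(f"p = {p:.2f}, SE = {se:.3f}, 95% CI = ({lower:.3f}; {upper:.3f})")
```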
