Published by Diane Laura Snow. Modified over 9 years ago.
1
DATA IDENTIFICATION AND ANALYSIS
2
Introduction: During the design phase of a study, the investigator must decide which type of data will be collected and analyzed. Large amounts of data will be collected, and not all of the data can or will be reported. The investigator must decide how to summarize and present the large volume of data collected to give readers an accurate understanding of the study results.
3
Introduction: The type of data collected determines how the data are presented, described, and analyzed. There are various methods of describing and summarizing data. Descriptive methods simply describe or summarize the data. Inferential methods analyze the data statistically and are used to make inferences about populations.
4
Types of data: There are two types of data, discrete and continuous. It is important to differentiate between the two because the type of data determines how the data are reported and analyzed.
5
Discrete data: Discrete data fit into a limited number of categories and are also referred to as count data. Each subject fits into only one discrete category.
6
Discrete data: 1. Dichotomous variables. The simplest form of discrete data; these have only two categories per variable. Examples: gender, yes-or-no questions.
7
Discrete data: 2. Multichotomous variables. Discrete data that have more than two possible values or categories. Race is an example.
8
Discrete data: 3. Ordinal variables. These have a limited number of values or categories, but the categories are in a logical order or progression. Examples: degree of inflammation, typically measured in +1, +2, or +3 categories; pain, rated on a scale of 1 to 10, with 1 being the least pain and 10 being the worst pain.
9
Continuous data: Continuous data can take on any number of values along a specified continuum and do not fit into a finite number of categories. They are often referred to as measurement data. An example is age, which can be measured in units of days, weeks, months, or years. Other examples are weight, height, blood pressure, and any measurements made from the blood or urine (e.g., serum drug levels, serum creatinine, serum electrolytes, urine drug levels).
10
Continuous data: Many variables can be treated as either discrete or continuous. Age can be treated as a continuous variable and measured in units of days or years; it can assume any value along a continuum that begins at day 1 and ends somewhere near 100 years.
11
Continuous data: Example: a data set of 10 subjects and the corresponding age of each subject, in which age is treated as a continuous variable.
12
Continuous data: Age displayed as a discrete variable. A finite number of age categories is created such that each subject fits into only one category.
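The conversion of a continuous variable into a discrete one can be sketched in a few lines of Python. The ages and the category boundaries below are hypothetical, chosen only for illustration; they are not the values from the original slide.

```python
# Hypothetical ages (years) for 10 subjects; illustrative values only.
ages = [23, 35, 41, 47, 52, 58, 63, 67, 72, 80]

def age_category(age):
    """Map a continuous age to exactly one discrete category."""
    if age < 40:
        return "<40"
    elif age < 60:
        return "40-59"
    return "60+"

# Count how many subjects fall into each category.
counts = {}
for age in ages:
    label = age_category(age)
    counts[label] = counts.get(label, 0) + 1

print(counts)
```

Because the categories do not overlap and cover every possible age, each subject is counted exactly once.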
13
Types of data: Most studies involve multiple measurements on each subject at various points, for example a blood pressure measurement made at baseline (before the intervention) and at various times after the treatment has begun (e.g., every 4 weeks for 6 months). The investigator must decide whether to analyze the results using paired or unpaired statistical methods.
14
Types of data: Paired methods analyze the difference between two measurements or observations on the same subject. They are most commonly used to analyze before-and-after results.
15
Types of data: Unpaired analyses involve only one measurement or observation per subject and most commonly analyze the end result. Both continuous and discrete data can be considered either paired or unpaired.
16
Distribution of data: Once data are collected, they are typically plotted on a distribution curve. The simplest distribution curve plots one variable along the x-axis and the number of subjects along the y-axis. Data can assume distribution curves that are either normally distributed or skewed.
17
Distribution of data: Normal distribution. Observations are clustered around the middle of the curve. The distribution is symmetrical, and the curve is bell-shaped. It is also referred to as a Gaussian distribution or a parametric distribution.
18
Distribution of data: Figure: a normal distribution of weight in a given population. Weight is reported on the x-axis, and the number of subjects is indicated on the y-axis.
19
Distribution of data: Skewed distribution. Not all data assume a normal distribution when plotted. Skewed distributions are also referred to as non-Gaussian or nonparametric distributions. The data are not clustered around the middle, and the distribution is not symmetrical; instead, the data are clustered toward one end of the distribution curve. Data can be positively or negatively skewed.
20
Distribution of data: Positively skewed data. The tail of the distribution extends to the right, and the majority of observations occur at the lower end of the x-axis of the distribution curve.
21
Distribution of data: Negatively skewed data. The tail of the distribution extends to the left, and the majority of observations occur at the higher end of the x-axis.
22
Measures of central tendency: Several measures of central tendency are used to describe data sets. They are often referred to as descriptive statistics. Descriptive statistics are an effective means of summarizing and reporting large amounts of data. It is important to consider both the type and the distribution of the data before applying descriptive statistics.
23
Measures of central tendency: The mean. The most commonly used descriptive measure. To calculate the mean, the values for all of the individuals in the data set are added, and the sum is divided by the total number of individuals in the sample.
24
Measures of central tendency: The median. The middle value when the data are arranged from the lowest value to the highest value. To determine the median, the data are arranged in ascending order. If the total sample size is an odd number, the median is the number in the middle. If the total sample size is an even number, the median is the average of the two middle numbers.
25
Measures of central tendency: The mode. The value that occurs most frequently in the data set. Data sets can be unimodal, with only one mode, or multimodal, with more than one mode.
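The three measures of central tendency can be computed with Python's standard `statistics` module. The data set below is hypothetical and chosen so that the three measures differ.

```python
import statistics

# Hypothetical data set with an even number of values.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)        # sum of values / number of values
median = statistics.median(values)    # average of the two middle values (3 and 5)
modes = statistics.multimode(values)  # list of the most frequent value(s)

print(mean, median, modes)
```

`multimode` returns a list, so a multimodal data set yields more than one entry.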
26
Measures of dispersion: Measures of dispersion describe the variability in a given sample. They give an overview of how data are spread, or dispersed, in a sample. 1. The extremes: the lowest and the highest values in the data set, commonly used to describe the dispersion of a sample. 2. The range: representative of the spread of the data set, calculated by subtracting the lowest value from the highest value.
27
Measures of dispersion: 3. Percentiles are also useful in describing data. They indicate the percentage of individuals who have values equal to or below a given value. Examples of variables consistently reported in percentiles are pediatric height and weight and standardized test scores.
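The definition above (the percentage of individuals with values equal to or below a given value) translates directly into code. This is a minimal sketch; the function name `percentile_rank` and the height values are hypothetical.

```python
def percentile_rank(values, x):
    """Percentage of individuals whose value is equal to or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

# Hypothetical heights (cm) for 10 children.
heights = [100, 105, 110, 112, 115, 118, 120, 125, 130, 140]

print(percentile_rank(heights, 118))  # 6 of the 10 values are <= 118
```

Published growth charts and test-score percentiles are derived from large reference populations rather than from the study sample itself.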
28
Measures of dispersion: 4. The variance provides information about how individuals differ within the sample. The larger the variance, the more variability in the sample. $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $S^2$ = variance, $\sum$ = sum of, $x_i$ = individual value, $\bar{x}$ = mean, and $n$ = sample size.
29
Measures of dispersion: 5. The standard deviation (SD) gives information about the variability of individuals around the mean. The larger the standard deviation, the more variability there is in the sample.
30
Measures of dispersion: In a normal distribution, about 68% of individuals fall between –1 and +1 SD of the mean, about 95% fall between –2 and +2 SD, and about 99.7% fall between –3 and +3 SD.
31
Measures of dispersion: The standard deviation is related to the variance: the standard deviation is the square root of the variance.
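The extremes, range, variance, and standard deviation can be worked through on a small example. The values are hypothetical, and the sketch assumes the sample form of the variance (divisor n − 1), which is what `statistics.variance` computes.

```python
import math
import statistics

# Hypothetical sample of five measurements.
values = [4, 8, 6, 5, 7]

lowest, highest = min(values), max(values)  # the extremes
data_range = highest - lowest               # the range

variance = statistics.variance(values)      # sample variance, divisor n - 1
sd = statistics.stdev(values)               # sample standard deviation

print(lowest, highest, data_range, variance, round(sd, 3))
```

Here the mean is 6, the squared deviations sum to 10, and dividing by n − 1 = 4 gives a variance of 2.5; the SD is its square root.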
32
Measures of dispersion: The normal distribution of large samples (>30 subjects) can be transformed into another distribution, called the z-distribution or standard normal distribution, which resembles the original normal distribution in that the area under both curves is equal to 1. The z-value on the x-axis represents how far an individual value is from the mean in units of standard deviation. A z-value of 1 is equivalent to 1 standard deviation above the mean, and a z-value of –1 is equivalent to 1 standard deviation below the mean.
33
Measures of dispersion: The z-distribution is a probability curve. The figure shows that the probability of a value falling in a z-range of 1 to 2 is about 13.6% (area under curve ≈ 0.1359).
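The areas quoted for the normal curve can be checked numerically. A minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_between` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area_between(lo, hi):
    """Area under the z-curve between two z-values."""
    return z.cdf(hi) - z.cdf(lo)

within_1 = area_between(-1, 1)  # about 0.68
within_2 = area_between(-2, 2)  # about 0.95
within_3 = area_between(-3, 3)  # about 0.997
z1_to_2 = area_between(1, 2)    # about 0.136

print(round(within_1, 4), round(within_2, 4), round(within_3, 4), round(z1_to_2, 4))
```

The total area under the curve is 1, so each of these areas can be read directly as a probability.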
34
Measures of dispersion: Small sample sizes (fewer than 30 subjects) are described using the t-distribution. The t-distribution looks like a bell-shaped curve, is symmetrical, and contains an area under the curve equal to 1.
35
Measures of dispersion: The difference between the z- and t-distributions is the spread of the curve. The t-distribution is broader and flatter because of the smaller sample size.
36
Measures of dispersion: The smaller the sample size, the flatter and broader the t-distribution. As the sample size increases, the t-distribution more closely resembles the z-distribution. Figure: effect of sample size on distribution.
37
Measures of dispersion: 6. The standard error (SE) is similar to the SD but provides information about the certainty of the mean itself. The distribution of the standard error is similar to the distribution of the standard deviation. The standard error is interpreted much like the SD: the larger the standard error, the more uncertain the sample mean.
38
Measures of dispersion: 7. Confidence intervals (CI) are used to express the degree of confidence in an estimate. They can be calculated for any estimate obtained from a sample and denote the range of possible values the estimate may assume, with a certain degree of assurance.
39
Measures of dispersion: 7. Confidence intervals (CI). The 95% confidence interval gives information about the reliability of an estimate. The narrower the confidence interval, the more reliable the estimate. A mean age of 50 with a 95% confidence interval of (45, 55) is a more reliable estimate than a mean age of 50 with a 95% confidence interval of (30, 70).
40
Measures of dispersion: 7. Confidence intervals (CI). If the average LDL for a group of subjects is 180 mg/dL and the 95% confidence interval is (160, 200), then the investigators are 95% confident that the true mean LDL for the population falls between 160 mg/dL and 200 mg/dL. Put another way, if the study were conducted 100 times, about 95 of the 100 confidence intervals calculated would contain the true mean LDL. Remember that the mean obtained in any sample or study is an estimate.
41
Measures of dispersion: 7. Confidence intervals (CI). The 95% confidence interval of a mean can be calculated as follows.
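A common large-sample form of this calculation is mean ± 1.96 × SE, where SE = SD / √n; that the slide used exactly this form is an assumption. The sketch below applies it to hypothetical summary statistics (mean LDL 180 mg/dL, SD 60 mg/dL, n = 36). For small samples, a critical value from the t-distribution would replace 1.96.

```python
import math

def ci95_mean(mean, sd, n):
    """Large-sample 95% CI of a mean: mean +/- 1.96 * SE, with SE = SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    return se, (mean - 1.96 * se, mean + 1.96 * se)

# Hypothetical summary statistics for LDL (mg/dL).
se, (low, high) = ci95_mean(mean=180, sd=60, n=36)

print(se, round(low, 1), round(high, 1))
```

Because n sits under a square root in the SE, quadrupling the sample size halves the width of the interval.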
42
Measures of dispersion: 7. Confidence intervals (CI). Confidence intervals also indicate the statistical significance of the results. The most commonly reported confidence intervals are 95%, but they can be calculated at various degrees of assurance; for example, it is possible to calculate 90% or 99% confidence intervals. A 99% confidence interval is interpreted to mean that there is 99% assurance that the estimate lies within the bounds of the interval.
43
Application of descriptive measures: The type, dispersion, and distribution of data should all be considered before applying descriptive statistics. When data assume a normal distribution, the majority of values are clustered around the mean, at the center of the bell-shaped curve, and the mean, median, and mode are equal. Any measure of central tendency will adequately describe a normally distributed data set. The mean and standard deviation are most commonly used to describe data sets that are normally distributed.
44
Application of descriptive measures: If the data assume a skewed distribution, the mean and the median are not equal. The mean is not located in the center of the curve and does not adequately describe the middle, or center, of a skewed distribution; the median indicates the center of the distribution. In a skewed distribution, the median and either the range or the extremes provide a much better description of the data set than the mean and the standard deviation. Percentiles can be used to describe both normally distributed and skewed data sets.
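A small positively skewed data set shows why the median describes the center better than the mean. The values below are hypothetical lengths of hospital stay in days, with one long stay forming the right-hand tail.

```python
import statistics

# Hypothetical lengths of stay (days); the value 30 creates a positive skew.
stays = [2, 3, 3, 4, 4, 5, 6, 30]

mean = statistics.mean(stays)      # pulled toward the long right tail
median = statistics.median(stays)  # stays near the bulk of the observations
extremes = (min(stays), max(stays))

print(mean, median, extremes)
```

Seven of the eight subjects stayed 6 days or fewer, yet the mean exceeds 7 days; the median with the extremes gives the more faithful summary.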
45
Inferential methods: Data are analyzed to determine any clinically relevant or statistically significant findings. Many statistical analyses can be used, and the most appropriate statistical test is determined by many factors. The data are statistically analyzed to test the null hypothesis. Null hypotheses are determined a priori. Inferential methods are used to make inferences about populations.
46
Null hypothesis: The null hypothesis (H0) is a mathematical statement of equality stated before collecting or analyzing data. When comparing two groups, the null hypothesis is a mathematical statement that the groups are equal. The data are statistically analyzed and a p-value is calculated.
47
Null hypothesis: Based on the calculated p-value, the null hypothesis will be either rejected or not rejected (often loosely described as accepted). If there is no statistical difference between the two groups, the null hypothesis is not rejected. If there is a statistical difference between the two groups, the null hypothesis is rejected.
48
The p-value: Statistical significance is determined by the p-value, which is the probability of observing a result at least as extreme as the one seen in the sample if the null hypothesis were true, that is, if chance alone were at work. In a two-sample test, the p-value is the probability of observing a difference between the two groups at least as large as the one observed if the groups were truly equal. A p-value of 0.05 indicates a 5% probability of seeing such a difference by chance alone.
49
The p-value: A p-value less than or equal to 0.05 is considered statistically significant and leads to rejection of the null hypothesis. A p-value greater than 0.05 means the null hypothesis is not rejected.
50
Limitations of the p-value: p-values are sensitive to sample size. For a given observed difference, the larger the sample size, the smaller the p-value, so it is easier to achieve a statistically significant result with large samples. p-values are also sensitive to the magnitude of the difference observed between two groups. Very small differences observed between two groups in large samples can be statistically significant, while large differences observed between two groups in small samples may not be.
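The effect of sample size on the p-value can be illustrated with a simplified two-sample z-test. The sketch assumes a known, common SD in both groups, and all numbers are hypothetical; it is meant only to show the direction of the effect, not to stand in for the tests described later.

```python
import math
from statistics import NormalDist

def two_sample_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (z-test, SD assumed known)."""
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = abs(diff) / se
    return 2 * (1 - NormalDist().cdf(z))

# Same observed difference (2 units, SD 10) at two sample sizes.
p_small = two_sample_p(diff=2, sd=10, n_per_group=20)
p_large = two_sample_p(diff=2, sd=10, n_per_group=2000)

print(round(p_small, 3), p_large)
```

The identical 2-unit difference is far from significant with 20 subjects per group and highly significant with 2000 per group.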
51
Limitations of the p-value: Confidence intervals may sometimes provide more information than p-values. A confidence interval provides information about both the statistical significance and the reliability, or stability, of the results.
52
Choosing the appropriate statistical analysis Numerous factors must be considered before determining which statistical test is most appropriate to analyze a particular data set.
53
Choosing the appropriate statistical analysis Appendix I displays the process for determining the most appropriate statistical test to use for a particular data set.
55
Statistical Analysis of Continuous Data: First identify whether the data are normally distributed or skewed. Null hypotheses tested in skewed distributions are different from null hypotheses tested in normal distributions. Parametric analyses test the null hypothesis that the means of two groups are equal.
56
Unpaired, Normally Distributed, Continuous Data: These data can be statistically analyzed using the t-test. When comparing two independent samples using the t-test, the null hypothesis states that the means of the two groups are equal.
57
Weight distributions for two groups in a study. These weights are normally distributed in both groups, but the mean weights for each group are slightly different. The null hypothesis is that the mean weights for groups 1 and 2 are equal. The statistical analysis will determine whether the mean weights for groups 1 and 2 are statistically different.
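The comparison of two group means can be sketched as a pooled-variance (Student's) t statistic. Equal variances in the two groups are assumed, the weights are hypothetical, and the function name `unpaired_t` is illustrative. The standard library has no t-distribution, so the sketch returns the statistic and its degrees of freedom, which can then be compared with a tabulated critical value.

```python
import math
import statistics

def unpaired_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, na + nb - 2

# Hypothetical weights (kg), one measurement per subject.
group1 = [70, 72, 68, 75, 71]
group2 = [78, 80, 76, 82, 79]

t, df = unpaired_t(group1, group2)
print(round(t, 2), df)
```

With 8 degrees of freedom, the two-sided critical value at the 0.05 level is 2.306, so a statistic this far from zero would lead to rejection of the null hypothesis of equal means.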
58
Paired, Normally Distributed, Continuous Data: These data can be statistically analyzed using the paired t-test. When comparing paired measurements in two groups using the paired t-test, the null hypothesis states that the mean differences in the two groups are equal.
59
Paired, Normally Distributed, Continuous Data: The t-test uses only one measurement per subject and thus compares the mean in each group; the paired t-test uses two measurements per subject and thus compares the mean differences in each group.
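The core of a paired analysis is that each subject contributes one difference (after minus before). The sketch below shows the simplest case, a single group tested against a mean difference of zero; the blood pressure values are hypothetical and the function name `paired_t` is illustrative.

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic and degrees of freedom, computed on per-subject differences."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1

# Hypothetical systolic blood pressure (mmHg) for the same five subjects.
before = [150, 160, 145, 155, 165]
after = [140, 152, 141, 145, 158]

t, df = paired_t(before, after)
print(round(t, 2), df)
```

With 4 degrees of freedom the two-sided critical value at the 0.05 level is 2.776. Working on differences removes between-subject variability, which is why paired designs can detect changes with few subjects.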
61
Statistical Analysis of Skewed Continuous Data: There are statistical tests designed specifically for continuous data that are skewed. Because the mean does not adequately describe a skewed distribution, nonparametric tests are used instead; these tests compare the medians of two groups. The Wilcoxon Rank Sum test or the Mann-Whitney test can be used for unpaired skewed data. The Wilcoxon Signed Rank test can be used for paired skewed data. Nonparametric analyses test the null hypothesis that the medians of two groups are equal.
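Nonparametric tests work on the ordering of the values rather than on the values themselves. As an illustration, the U statistic underlying the Mann-Whitney test can be obtained by counting, over all pairs, how often a value from one group exceeds a value from the other. The FEV values are hypothetical, and the sketch stops at the statistic without computing a p-value.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a's value exceeds b's (ties count 0.5)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u

# Hypothetical FEV values (litres) in two independent groups.
inhaled = [1.2, 1.5, 1.9, 2.4]
oral = [2.0, 2.6, 3.1, 3.5]

u_inhaled = mann_whitney_u(inhaled, oral)
u_oral = mann_whitney_u(oral, inhaled)
print(u_inhaled, u_oral)
```

The two U values always sum to the number of pairs (here 4 × 4 = 16); a value near 0 or near 16 indicates that one group's values sit almost entirely above the other's.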
62
Statistical Analysis of Skewed Continuous Data: Example: forced expiratory volume (FEV) is measured at baseline and at 4 months. The FEV distribution in both groups is skewed. An unpaired analysis would test H0: median FEV (inhaled steroid) = median FEV (oral steroid). A paired analysis would test the H0 that the median differences in FEV (FEV at 4 months vs. FEV at baseline) are equal for both groups.
63
Statistical Analysis of Discrete Data: The Chi-Square test and Fisher's Exact test are two statistical tests commonly used to analyze discrete data. The Chi-Square test is used to analyze large samples; Fisher's Exact test is used to analyze small samples. The null hypotheses for both tests are the same and are based on proportions. When comparing discrete data in two independent groups, the null hypothesis states that the proportions of a characteristic or event in the two groups are equal.
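For two groups and a dichotomous outcome, the data form a 2×2 table, and the Chi-Square statistic has a closed form. The sketch below uses hypothetical counts and no continuity correction; with 1 degree of freedom, the p-value can be obtained from the complementary error function because the statistic is the square of a standard normal value.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-Square statistic and p-value (1 df, no continuity correction).

    Table layout:  group 1: a events, b non-events
                   group 2: c events, d non-events
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: 30/100 events in group 1 versus 15/100 in group 2.
chi2, p = chi_square_2x2(30, 70, 15, 85)
print(round(chi2, 3), round(p, 4))
```

The statistic exceeds 3.841, the critical value for 1 degree of freedom at the 0.05 level, so the null hypothesis of equal proportions would be rejected for these counts.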
64
Interpretation of results: Both the sample size and the power of the study should be considered. The sample size, which is determined during the design phase of the study, influences the likelihood of achieving a statistically significant result. The preferred levels of both α (type I error rate) and β (type II error rate) are included in the sample size calculation. Investigators typically set α at 0.05 and β at 0.2.
65
Interpretation of results: The other determinant of sample size is the magnitude of the difference the investigator wishes to detect between the two groups. The magnitude of the difference is typically guided by prior study results on the same or similar drugs. The smaller the difference the investigator wishes to detect between two groups, the larger the sample size needed.
66
Interpretation of results: Power refers to the strength or ability of a study to detect a given difference between groups. The difference the investigator wishes to detect between two groups is decided in the design phase of the study and included in the sample size calculation. Sample size is related to power: as the sample size increases, so does the power of the study.
67
Interpretation of results: The power is also influenced by the magnitude of the difference observed between two groups. As the magnitude increases, the power increases. Power is calculated as 1 − β. Because β is usually set at 20%, investigators typically design studies with a goal of achieving at least 80% power.
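The relationship among α, β, the difference to be detected, and the sample size can be sketched with a standard normal-approximation formula for comparing two means: n per group = 2 × ((z for α/2 + z for β) × SD / difference)². The formula choice, the function name `n_per_group`, and the numbers are assumptions for illustration, not the calculation used in the source.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, beta=0.20):
    """Approximate subjects per group to detect a mean difference `delta`
    with a two-sided test at level alpha and power 1 - beta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(1 - beta)        # about 0.84 for beta = 0.20
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical: SD of 20 units, detecting a 10-unit versus a 5-unit difference.
n_large_diff = n_per_group(delta=10, sd=20)
n_small_diff = n_per_group(delta=5, sd=20)
print(n_large_diff, n_small_diff)
```

Halving the difference to be detected roughly quadruples the required sample size, which is the trade-off described above.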
68
Interpretation of results: Regardless of the statistical results of the study, it is imperative to determine the clinical relevance of the actual study results. Many studies show statistically significant results that are not clinically relevant. Clinical relevance is the utility of the results in actual practice. A finding that is clinically relevant, for example, will result in one drug being used over another.
© 2025 SlidePlayer.com. Inc.
All rights reserved.