Univariate analysis Önder Ergönül, MD, MPH 17-28 June 2019.


Statistics is used to:
- Describe data (descriptive statistics)
- Compare study groups
- Search for a correlation between variables
- Search for a regression between variables, in order to extrapolate the dependent variable when the independent variable(s) are known
- Analyze health outcomes

Overview of Presentation
- Which test to use?
- Parametric assumptions
- Non-parametric tests
- ANOVA (analysis of variance)
- Correlation

Dependent vs independent variables The dependent variable represents the output or the effect, or is tested to see if it is the effect.

Dependent vs independent variables The independent variable is the variable you have control over: what you can choose and manipulate, and usually what you expect to affect the dependent variable. In some cases you may not be able to manipulate the independent variable; it may be something that is already there and fixed, like color, kind, or time, that you want to evaluate with respect to how it affects the dependent variable.

Which statistical test(s)? 1. How many study groups are there? One, two, or more than two.

Which statistical test(s)? 2. Type of data

Which statistical test(s)? Are there consecutive measurements/assessments of the dependent variable? If the dependent variable is metric, is it normally distributed? And what are you looking for: a difference between groups, or a relationship between variables?

Univariate analysis: Comparison of 2 groups. Is the variable metric or categorical? Is the distribution normal? If not, non-parametric tests apply. Readers should have an idea about these statistical tests, although they do not need to be able to build a car in order to drive one. For a comparison of two groups, the first step is to see whether the data are categorical or continuous, and then whether the data are normally distributed.

Univariate analysis: Comparison of 2 groups. For a metric variable: symmetric (normal) distribution, Student t test; asymmetric distribution, Mann-Whitney U test or Wilcoxon test. That is, if the continuous data are normally distributed the Student t test is the choice; otherwise the Mann-Whitney U test is suggested for unpaired samples and the Wilcoxon test for paired samples.
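The decision rule on this slide can be sketched in Python with scipy: check each group for normality, then pick the Student t test or the Mann-Whitney U test. The data and the alpha = 0.05 cutoff here are illustrative assumptions, not from the lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=120, scale=15, size=30)  # simulated metric data
group_b = rng.normal(loc=128, scale=15, size=30)

# Step 1: check each group for normality (Shapiro-Wilk, alpha = 0.05).
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

# Step 2: pick the test accordingly.
if normal_a and normal_b:
    test_name = "Student t test"
    statistic, p = stats.ttest_ind(group_a, group_b)
else:
    test_name = "Mann-Whitney U test"
    statistic, p = stats.mannwhitneyu(group_a, group_b)
```

For paired samples the same logic would use `stats.ttest_rel` or `stats.wilcoxon` instead.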

Unpaired (parallel, independent) groups

Dependent variable   Independent variable      Test (P)              Test (NP)
categorical          categorical               Chi square*           Chi square*
metric               categorical (2 groups)    Student t test        Mann-Whitney U
metric               categorical (>2 groups)   One-way ANOVA         Kruskal-Wallis
metric               metric                    Pearson correlation   Spearman correlation

*Chi square tests are neither P nor NP.

Paired (dependent) groups

Dependent variable   Independent variable      Test
categorical          categorical               McNemar
metric               categorical (2 groups)    paired t test or Wilcoxon
metric               categorical (>2 groups)   Friedman
metric               metric                    Spearman correlation

Chi Squared Test A chi-squared test, also referred to as a chi-square or χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true, or any test in which this is asymptotically true, meaning that the sampling distribution under the null hypothesis can be made to approximate a chi-squared distribution as closely as desired by making the sample size large enough.
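As a minimal sketch of the chi-squared test on a contingency table (the counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (counts): rows = exposure, cols = outcome.
table = np.array([[20, 30],   # exposed:   diseased, healthy
                  [10, 40]])  # unexposed: diseased, healthy

# Tests H0 that exposure and outcome are independent.
chi2, p, dof, expected = stats.chi2_contingency(table)
```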

Assumptions Normality Homogeneity of variance

Centre of a frequency distribution ("central tendency"): mean, median, mode.

Median vs Mean: Mean = 95; Median = 95.5. Recalculating the mean without the highest score (234) gives 81.1.

Assumptions: Homogeneity of variance.
Variance: the average error between the mean and the observations, based on the sum of squared errors; it is a measure in squared units.
Standard deviation (SD): the square root of the variance; this measure of average error is in the same units as the original measurements.

Assumptions Homogeneity of variance
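One common way to check the homogeneity-of-variance assumption is Levene's test; a minimal sketch with simulated data (the groups and sizes are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_1 = rng.normal(loc=100, scale=10, size=40)  # simulated groups
group_2 = rng.normal(loc=100, scale=10, size=40)

# H0: the groups have equal variances.
statistic, p = stats.levene(group_1, group_2)
equal_variances = p > 0.05
```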

Why does the sample size matter? “Central limit theorem”
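The central limit theorem can be demonstrated with a small simulation: even when the population is clearly skewed, the means of repeated samples cluster around the population mean with spread close to SD/sqrt(n). The exponential population and sample size below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# A clearly skewed population: exponential with mean 1 and SD 1.
population = rng.exponential(scale=1.0, size=100_000)

# Means of repeated samples of size 50 pile up around the population
# mean in a roughly normal shape, with SD close to 1 / sqrt(50).
n = 50
sample_means = np.array(
    [rng.choice(population, size=n).mean() for _ in range(2000)]
)
```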

Parametric - Nonparametric Tests A parametric test is a test based on some parametric assumptions, most often an assumption that the distribution of the variable in the population is normal. A nonparametric test does not rely on parametric assumptions, like having a normal distribution and equal variances.

Parametric - Nonparametric Tests

Groups compared                  Parametric                Nonparametric
2 independent groups             Unpaired t-test           Mann-Whitney U test
2 dependent groups               Paired t-test             Wilcoxon signed rank test
More than 2 independent groups   ANOVA                     Kruskal-Wallis test
More than 2 dependent groups     Repeated measures ANOVA   Friedman's ANOVA

Nonparametric Tests
- Do not rely on population parameters such as the mean or SD.
- Make no assumptions about the distribution of the population.
- Are based on the ranks of the values rather than the values themselves: rank the values, take the sum of the ranks, and compare this sum to the frequency expected if all rankings were equally likely.
- Make inferences about the median rather than the mean (for skewed data the median is often the better measure).

Three-Step Process
1. Begin with the null hypothesis.
2. Test the hypothesis: look at the data and make sense of it, then obtain the P value.
3. Conclude: reject the null hypothesis if P < 0.05 (5%); otherwise fail to reject it.

Compare two independent groups for a continuous variable. Aim: to examine the relationship between total cholesterol levels and heart attack. A total of 28 male heart attack patients had their cholesterol levels measured 2 days after the heart attack. Cholesterol levels were also recorded for a control group of 30 male patients of similar age and weight who had not had a heart attack.

Non-normal distribution or very unequal dispersions: Mann-Whitney U test.

Three-step process
Null hypothesis: "The median total serum cholesterol level of cases at two days post attack is the same as the median total serum cholesterol level of controls."
Test the hypothesis: what is the p value? Mann-Whitney p value = 0.0001.
Conclude: since the p value is <0.05, we reject H0 and conclude that our result is statistically significant.

Compare two dependent groups for a continuous variable. Aim: to examine the relationship between total cholesterol levels and heart attack. A total of 28 male heart attack patients had their cholesterol levels measured at 2 days and at 4 days post attack.

Non-normal distribution or very unequal dispersions: Wilcoxon signed-ranks test.

Cases     2 days    4 days
N         28        28
Mean      253.93    235.32
Median    268       239
SD        47.71     60.3

Three-step process
Null hypothesis: "The median difference in total serum cholesterol level of cases at 2 days post attack and at 4 days post attack is zero."
Test the hypothesis: what is the p value? Wilcoxon signed ranks p value = 0.02.
Conclude: since the p value is <0.05, we reject H0 and conclude that our result is statistically significant.
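The paired comparison can be sketched the same way with `stats.wilcoxon`; again the data are simulated to resemble the slide's summary statistics, so the p value will not match the 0.02 reported.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated paired data (not the study's raw values): 28 patients.
day2 = rng.normal(loc=254, scale=48, size=28)
day4 = day2 - rng.normal(loc=19, scale=25, size=28)  # levels tend to drop

# Tests H0 that the median of the paired differences is zero.
w_statistic, p = stats.wilcoxon(day2, day4)
```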

Compare three independent groups for a continuous variable. Aim: to examine the relationship between total cholesterol levels and household income. A total of 150 participants had their cholesterol levels measured. Income was categorized as high, middle, and low.

ANOVA (Analysis of Variance): normal distribution and similar dispersions.

          High Income   Middle Income   Low Income
N         35            55              60
Mean      144.8         148.3           157.5
Median    150           147             160
SD        28.3          25.8            27.9

ANOVA (Analysis of Variance) ANOVA is based on comparing the variance (or variation) between the data samples to the variation within each particular sample. If the between-group variation is much larger than the within-group variation, the means of the different samples are unlikely to be equal. If the between and within variations are of approximately the same size, there is no significant difference between sample means.

ANOVA (Analysis of Variance) assumptions:
- All populations involved follow a normal distribution.
- All populations have the same variance (or standard deviation).
- The samples are randomly selected and independent of one another.

Three-step process
Null hypothesis: "The mean total serum cholesterol levels of the three groups of participants are the same."
Test the hypothesis: what is the p value? F test p value = 0.01.
Conclude: since the p value is <0.05, we reject H0 and conclude that our result is statistically significant.
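A one-way ANOVA on three groups can be sketched with `stats.f_oneway`; the data below are simulated from the slide's summary statistics, not the study's raw values, so the p value will differ from the 0.01 reported.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Simulated cholesterol values matching the slide's group sizes.
high = rng.normal(loc=144.8, scale=28.3, size=35)
middle = rng.normal(loc=148.3, scale=25.8, size=55)
low = rng.normal(loc=157.5, scale=27.9, size=60)

# Tests H0 that all three group means are equal.
f_statistic, p = stats.f_oneway(high, middle, low)
```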

Which means are statistically significantly different from each other? There are many methods for comparing means after rejecting the null hypothesis: planned comparisons vs post hoc tests. For pairwise comparisons, apply a Bonferroni correction (divide the p value threshold by the number of pairwise comparisons) (Kruskal-Wallis test).

Tukey post hoc test   p-value
High - Middle         0.053
High - Low            0.01
Middle - Low          0.02
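The Bonferroni idea can be sketched as pairwise t tests against a corrected threshold (0.05 divided by the number of comparisons). The groups are simulated from the slide's summary statistics; a real analysis might instead use a dedicated Tukey procedure such as `scipy.stats.tukey_hsd`.

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = {
    "high": rng.normal(144.8, 28.3, 35),
    "middle": rng.normal(148.3, 25.8, 55),
    "low": rng.normal(157.5, 27.9, 60),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 0.05 / 3 pairwise comparisons

results = {}
for name_a, name_b in pairs:
    t, p = stats.ttest_ind(groups[name_a], groups[name_b])
    results[(name_a, name_b)] = (p, p < alpha)  # significant after correction?
```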

Summary
- The type of study and the characteristics of the data determine the appropriate test.
- Form an analysis strategy before collecting the data.
- Do not forget to check the assumptions before analysing the data.
- When the normality and equality-of-variances assumptions are violated, use nonparametric tests; otherwise use parametric tests, which are more powerful.

Comparison of study groups at baseline in RCT According to the CONSORT statement, significance testing of baseline differences in randomized controlled trials should not be performed.

Baseline imbalance in RCTs Any baseline difference between the groups under study is by definition due to chance (as long as the randomization was performed correctly).

Baseline imbalance in RCTs Whether baseline differences are significant has no implications for the validity of the study's results. Even a covariate* that is balanced between treatment groups (according to a p-value) can affect the association between treatment and outcome. *A covariate is a variable that is possibly predictive of the outcome under study.

Baseline imbalance in RCTs Choice of baseline characteristics by which an analysis is adjusted should be determined by prior knowledge of an influence on outcome rather than evidence of imbalance between treatment groups in the trial. Such information should ideally be included in trial protocols and reported with details of the analysis.

What should we do? At the planning stage of a study, baseline variables of prognostic value should be identified on the basis of available evidence. These should be fitted in an analysis of covariance or equivalent technique for other data types. Other variables should not be added to the analysis unless information from other sources during the course of the trial suggests their inclusion.

Nazi Germany invasion of Norway and the cardiovascular disease correlation https://www.youtube.com/watch?v=HZpYkD_plPw

Correlation analysis The logic of correlation is straightforward: where there is a linear relationship between two variables, there is said to be a correlation between them. The strength of that relationship is given by the correlation coefficient, indicated by the letter "r".

Correlation analysis [Scatterplot: body weight vs systolic blood pressure, r = 0.70.] A positive correlation coefficient means that as one variable increases, the value of the other variable also increases.

Correlation analysis [Scatterplot: total cholesterol (mg/dL) vs coronary artery diameter (mm), r = -0.85.] A negative correlation coefficient means that as the value of one variable goes up, the value of the other variable goes down.

Correlation analysis If there is a perfect relationship between two variables, then r = 1, or r = -1 for a negative correlation.

Correlation analysis [Three scatterplots, each with r = 0.]

Correlation analysis The correlation coefficient is between -1 and +1:
- Exactly -1: a perfect downhill (negative) linear relationship
- -0.70: a strong downhill (negative) linear relationship
- -0.50: a moderate downhill (negative) relationship
- -0.30: a weak downhill (negative) linear relationship
- 0: no linear relationship
- +0.30: a weak uphill (positive) linear relationship
- +0.50: a moderate uphill (positive) relationship
- +0.70: a strong uphill (positive) linear relationship
- Exactly +1: a perfect uphill (positive) linear relationship

Correlation analysis Spearman (nonparametric) vs Pearson (parametric) correlation. The p value gives the statistical significance of the correlation coefficient (r). Beware of outliers.
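Both coefficients can be computed side by side with scipy; the weight/blood-pressure relationship below is simulated purely for illustration (the slope, means, and noise level are assumptions, not data from the lecture).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Simulated positive relationship between weight and systolic BP.
weight = rng.normal(loc=75, scale=12, size=50)
systolic = 80 + 0.6 * weight + rng.normal(0, 6, size=50)

pearson_r, pearson_p = stats.pearsonr(weight, systolic)    # linear, parametric
spearman_r, spearman_p = stats.spearmanr(weight, systolic) # rank-based
```

Spearman, being rank-based, is the one to reach for when the data are non-normal or contain outliers.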

How to present correlation?

Summary
- Your data determine which statistical test you need.
- Think about your specific hypothesis at the start of the study, and clearly define your hypotheses.
- Determine how to collect data and which data to collect.
- Parametric tests are typically more powerful than non-parametric tests.