
Data Analysis Dr. Andrea Benedetti

Plan
- General thoughts on data analysis
- Data analysis for RCTs
- Data analysis for Case Control studies
- Data analysis for Cohort studies

Basic steps – Descriptive Stats
- Start with univariable descriptive statistics for each variable
  - Understand the distribution of your variables (histograms)
  - Missing values?
  - Data quality checks (e.g. reasonable values for height, weight, etc.): mins, maxes, logical consistency
  - What units?
- Bivariable descriptive statistics: cross tabulations, scatterplots
- Descriptives should tell most of the story most of the time
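Below is a minimal R sketch of these descriptive steps. The data frame `dat` and its variables (age, weight_kg, sex, tb) are made up for illustration only.

```r
# Hypothetical data frame with a few deliberately missing values
set.seed(1)
dat <- data.frame(
  age       = c(rnorm(95, 45, 12), rep(NA, 5)),   # 5 missing ages on purpose
  weight_kg = rnorm(100, 70, 15),
  sex       = factor(sample(c("F", "M"), 100, replace = TRUE)),
  tb        = rbinom(100, 1, 0.3)
)

# Univariable descriptives: distributions, missing values, mins/maxes
summary(dat)                      # quick look at every variable
hist(dat$age, main = "Age")       # distribution of a continuous variable
colSums(is.na(dat))               # how many missing values per variable?
range(dat$weight_kg)              # plausible values? (data quality check)

# Bivariable descriptives: cross-tabulations and scatterplots
table(dat$sex, dat$tb)                      # cross-tabulation
plot(dat$age, dat$weight_kg,                # scatterplot
     xlab = "Age (years)", ylab = "Weight (kg)")
```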

Basic steps – Modelling
- Modelling (regression) should be a small part of total time spent
- Descriptives should tell you everything you need to know for models
- Minimize the total number of models run: decreases false positives, increases reproducibility

RCTS

RCTs  Randomization (usually) takes care of confounding  Two types of analyses: Pre-specified in the protocol  Findings form the basis for guidelines, etc.  Generally: intention to treat, unadjusted, in the whole population Secondary analyses  may or may not be prespecified  more exploratory in nature

RCTs – Which participants should be analyzed?
- Intention to treat (ITT): analyze everyone as they were randomized, even if...
  - they did not comply with treatment
  - they were ineligible
  - they were lost to follow up
- Maintains the benefits of randomization (most valid)
- May underestimate the treatment effect (most conservative)

Per protocol analysis
- Definitions vary, so make sure you specify!
- Subjects have now self-selected into treatment groups (especially in an 'as treated' analysis)
- You lose the benefits of randomization, so you must adjust for confounders!!

RCTs – Adjust for confounders?
- The main analysis usually does not adjust, BUT... you should consider adjusting:
  - when there is imbalance on important confounders
  - as a sensitivity analysis
  - when using a per protocol analysis
  - if there is variable follow up time
- When adjustment will be used, and for which variables, should be pre-specified

RCTs – Subgroups
- Define which subgroups a priori
- Use strict criteria
- Interpret sceptically!! Especially subgroup analyses that were not pre-specified

RCT… Primary analysis: what kind of outcome?
- Binary outcome: chi-square or Fisher exact test
- Continuous outcome: t-test
- Survival (time-to-event) outcome: log-rank test
- Count outcome: a test for counts (e.g. a Poisson-based comparison of rates)
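As an illustration only, here is how each of these simple tests is called in R, using a hypothetical two-arm trial data frame `trial` with made-up variable names.

```r
library(survival)
set.seed(2)
trial <- data.frame(
  arm   = rep(c("control", "treatment"), each = 60),
  cured = rbinom(120, 1, 0.45),          # binary outcome
  score = rnorm(120, 50, 10),            # continuous outcome
  time  = rexp(120, 0.1),                # survival time
  event = rbinom(120, 1, 0.7)            # event indicator
)

chisq.test(table(trial$arm, trial$cured))          # binary outcome
t.test(score ~ arm, data = trial)                  # continuous outcome
survdiff(Surv(time, event) ~ arm, data = trial)    # survival outcome: log-rank test
```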

The Logic is Always the Same:
1. Assume nothing is going on (H0)
2. Calculate a test statistic (chi-square, t)
3. How often would you get a test statistic this large when H0 is true? (p-value)
4. If p < .05, reject H0 and conclude that something is going on (HA)
5. If p > .05, do not conclude anything.

Example  Randomized 120 subjects with MDR-TB Standard care + homeopathy Standard care + placebo  Outcome: sputum conversion  ITT: SC+H: 29/60 SC+P: 23/60  Chisquare: 1.22  P=0.27

RCT… In secondary analysis
- Consider adjusting for confounders, especially if 'Table 1' shows imbalance
  - Use linear, logistic, Poisson or Cox regression depending on outcome type (as we will see as we move along)
- Subgroups
- Per protocol analyses

Example  After adjustment for baseline WHO status of HIV infection (stage 4 vs. stage 3), CD4+ cell count, age, sex, history of TB, extrapulmonary TB, and baseline HIV RNA level, the hazard ratio was 0.43 (95% CI, 0.25 to 0.77; P = 0.004).

OBSERVATIONAL STUDIES

Observational Studies
- Confounding!!
- Use a regression approach, which allows:
  - adjustment for confounders
  - investigation of effect modifiers

Linear regression: quick review
Yi = β0 + β1Xi + εi
- Attempts to fit a linear equation to observed data
- One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable
- Can include more than one explanatory variable
- Estimated via maximum likelihood

Assumptions  Y|X ~Normal  X-Y association is linear  homoscedasticity  independence

Interpreting parameters from simple linear regression
Yi = β0 + β1X1i + εi
- Continuous X1:
  - β1 is the expected change in Y for a 1 unit change in X1
  - β0 is the average Y when X1 = 0 (may or may not be logical!)
- Binary X1 (e.g. gender):
  - β1 is the average difference in Y for subjects with X1 = 1 vs. X1 = 0
  - β0 is the average Y when X1 = 0

Interpreting parameters from multiple linear regression
Yi = β0 + β1X1i + β2X2i + εi
- β1 is the expected change in Y for a 1 unit change in X1 for subjects with the same value of X2; we might say β1 is the expected change in Y adjusting for X2
- β0 is the average Y when X1 = 0 and X2 = 0 (may or may not be logical!)

Adding an interaction term
Suppose we are interested in seeing whether there is effect modification by gender:
Yi = β0 + β1 Exposurei + β2 Malei + β3 Malei × Exposurei + εi
- β1 is the expected change in Y for a 1 unit change in Exposure among Females
- β2 is the average difference in Y between Males and Females when Exposure = 0
- β1 + β3 is the expected change in Y for a 1 unit change in Exposure among Males
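A minimal sketch of fitting and reading this interaction model in R, with simulated (made-up) data and hypothetical variable names:

```r
set.seed(3)
n <- 200
dat <- data.frame(
  exposure = runif(n, 0, 10),
  male     = rbinom(n, 1, 0.5)
)
# Simulate an outcome in which the exposure effect differs by sex (for illustration)
dat$y <- 1 + 0.5 * dat$exposure + 2 * dat$male +
         0.3 * dat$male * dat$exposure + rnorm(n)

fit <- lm(y ~ exposure * male, data = dat)  # expands to exposure + male + exposure:male
summary(fit)

# beta1 ("exposure"): effect of exposure among females (male = 0)
# beta1 + beta3 ("exposure" + "exposure:male"): effect of exposure among males
coef(fit)["exposure"] + coef(fit)["exposure:male"]
```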

Example  69 TB patients Blood serum measured 2h after drug ingestion For every unit increase in drug dose, average blood serum increased by 0.33 units, after adjusting for acetyl INH/INH

ANALYZING DATA FROM CASE CONTROL STUDIES

Case Control Studies
- Sampling is based on disease status; then exposure is ascertained
- The usual parameter of interest is the odds ratio:
  OR = (odds of E+ in the D+) / (odds of E+ in the D-)
     = (odds of D+ in the E+) / (odds of D+ in the E-)
     = (A/B) / (C/D)
     = AD/BC

              Disease
              +    -
Exposure  +   A    B
          -   C    D

Case control studies  Typically use logistic regression Outcome variable is binary Y~Binomial(p) Adjust for important confounders Consider pre-specified interactions

Case control studies  Logistic regression estimates the probability of an event occurring  logit(p)=log e (p/(1-p))  logit(p)=    X 1 +  X 2 …

Interpreting parameters from a logistic regression
Consider: logit(Yexp) = β0 + β1X1, where X1 = 1 for exposed and X1 = 0 for unexposed
- Among the exposed: logit(Yexp) = β0 + β1 … (1)
- Among the unexposed: logit(Yunexp) = β0 … (2)
- Subtract (2) from (1): logit(Yexp) − logit(Yunexp) = β1

So… β1 is the log odds ratio comparing exposed to unexposed, and exp(β1) is the odds ratio (adjusted for any other covariates in the model).
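A minimal sketch of an (unmatched) case-control analysis by logistic regression in R; the data and variable names are made up, and exponentiating the exposure coefficient gives the adjusted odds ratio described above.

```r
set.seed(4)
n <- 500
cc <- data.frame(
  exposed = rbinom(n, 1, 0.3),
  age     = rnorm(n, 50, 10),
  sex     = rbinom(n, 1, 0.5)
)
lp <- -1 + 0.8 * cc$exposed + 0.02 * cc$age   # true log-odds (for simulation only)
cc$case <- rbinom(n, 1, plogis(lp))

fit <- glm(case ~ exposed + age + sex, data = cc, family = binomial)
exp(coef(fit)["exposed"])                # adjusted odds ratio for exposure
exp(confint.default(fit)["exposed", ])   # Wald 95% confidence interval
```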

Analyzing matched case-control studies
- Advantage of matching: helps to control confounding at the design stage
- Must account for the matching in the analysis
  - Matching itself induces a bias in the effect estimates that the analysis has to undo
  - Otherwise effect estimates will be biased towards the null

Conditional logistic regression
- Used when we have cases matched to controls
- We are interested in the comparison WITHIN strata (matched sets)
- Cannot estimate the effect of the matching factor itself, but can estimate the interaction between that factor and another variable

Example  All cases with hepatotoxicity  3 age and sex matched controls

General approach for case control studies
1. Compare cases and controls on important covariates
2. Estimate the exposure effect adjusted for confounders via logistic regression / conditional logistic regression
3. Investigate effect modification
4. Check assumptions

ANALYZING DATA FROM COHORT STUDIES

Analyzing data from cohort studies  Subjects are recruited and followed over time to see if the outcome develops  Several different approaches are possible depending on the type of outcome

Framework of cohort studies

                                    Disease status
                                    Yes      No       Total
Exposure   Yes (study cohort)        a        b        a+b
status     No (comparison cohort)    c        d        c+d
           Total                     a+c      b+d      N

Overview of statistical approaches for cohort studies

Poisson Regression
- Appropriate when subjects are followed for varying lengths of time
- Outcome is a count
- Similar to logistic regression, however now Y ~ Poisson
- The link function is log:
  log(Y) = β0 + β1X1 + β2X2
  exp(β1) = RR
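A minimal sketch in R, with a made-up cohort data frame. One common way to handle varying follow-up is to enter log person-time as an offset, so that exp(β) is a rate ratio (this offset detail is an assumption added here, not something stated on the slide).

```r
set.seed(6)
n <- 400
cohort <- data.frame(
  exposed = rbinom(n, 1, 0.4),
  age     = rnorm(n, 55, 8),
  pyears  = runif(n, 1, 10)                    # varying follow-up time
)
rate <- exp(-3 + 0.5 * cohort$exposed)          # true rates (simulation only)
cohort$events <- rpois(n, rate * cohort$pyears)

fit <- glm(events ~ exposed + age + offset(log(pyears)),
           family = poisson, data = cohort)
exp(coef(fit)["exposed"])    # adjusted rate ratio (RR) for exposure
```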

Cox Proportional Hazards
- The outcome of interest is time to event
- No need to choose a particular probability model to represent survival times
- Semi-parametric (Kaplan-Meier is non-parametric)
- Easy to incorporate time-dependent covariates: covariates that may change in value over the course of the observation period

Assumptions of Cox Regression  Proportional hazards assumption: the hazard for any individual is a fixed proportion of the hazard for any other individual  Multiplicative risk

Recall: The Hazard function
h(t) = lim (Δt → 0) Pr(t ≤ T < t + Δt | T ≥ t) / Δt
In words: the probability that, if you survive to t, you will succumb to the event in the next instant.

The model
h(t | X) = h0(t) × exp(β1X1 + … + βkXk)
Components:
- A baseline hazard function h0(t) that is left unspecified but must be positive (= the hazard when all covariates are 0); it can take on any form!
- A linear function of a set of k fixed covariates that is exponentiated (= the hazard ratio)

The model
- Proportional hazards: hazard functions for different individuals should be strictly parallel (proportional over time)
- Hazard ratio for person i (e.g. a smoker) vs. person j (e.g. a non-smoker):
  HR = h_i(t) / h_j(t) = exp( β1(X_i1 − X_j1) + … + βk(X_ik − X_jk) )
- Produces covariate-adjusted hazard ratios!
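A minimal sketch of a Cox model in R. The `lung` data set ships with the survival package, so this runs as-is; cox.zph() is one common check of the proportional hazards assumption.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)     # exp(coef) column gives covariate-adjusted hazard ratios

# Check the proportional hazards assumption for each covariate
cox.zph(fit)
```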

Time-dependent covariates
- Covariate values for an individual may change over time
  - E.g. if you are evaluating the effect of weight on diabetes risk over a long study period, subjects may gain and lose large amounts of weight, making their baseline weight a less than ideal predictor
- Cox regression can handle these time-dependent covariates!

Time-dependent covariates
Ways to look at drug use:
- Not time-dependent:
  - Ever/never use during the study
  - Yes/no use at baseline
  - Total months of use during the study
- Time-dependent:
  - Drug use at event time t (yes/no)
  - Months of drug use up to time t
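A minimal sketch of a time-dependent covariate in R: the Stanford heart transplant data (`heart`, shipped with the survival package) are already in (start, stop] counting-process format, with transplant status changing over time.

```r
library(survival)

head(heart)   # columns include start, stop, event, transplant, age, surgery

fit <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data = heart)
summary(fit)  # exp(coef) for transplant: hazard ratio for current transplant status
```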

General approach for cohort studies
1. Compare exposed and unexposed subjects on important covariates
2. Estimate the exposure effect adjusted for confounders via logistic regression, Poisson regression or a Cox PH model
3. Investigate effect modification
4. Check assumptions

NO MATTER WHAT METHOD YOU USE…

Your statistical methods should....  match your objectives  be appropriate for the type of outcome you have  be appropriate for the study design you have chosen

Functional form
- Think about what form the association between a continuous variable and the outcome has: linear? nonlinear?
- Categorize the variable?
  - Equal sized categories? Categories based on clinically relevant cutoffs? NOT data driven cut offs
  - Less powerful?
- Use a flexible modelling approach?
  - Splines? Fractional polynomials?
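As a sketch of the flexible-modelling option, here is a natural cubic spline (ns() from the splines package) compared with a purely linear term, using simulated data with a deliberately nonlinear association; the variable names are made up.

```r
library(splines)
set.seed(8)
n <- 300
x <- runif(n, 20, 80)
y <- rbinom(n, 1, plogis(-5 + 0.004 * (x - 50)^2))   # U-shaped risk (simulation only)

fit_linear <- glm(y ~ x, family = binomial)
fit_spline <- glm(y ~ ns(x, df = 3), family = binomial)

AIC(fit_linear, fit_spline)   # the spline should capture the U-shape the linear term misses
```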

Which variables are important?
- Subject matter drives this to a large extent: known confounders; eliminate intermediates (DAGs?)
- Statistical significance? Data driven selection (e.g. stepwise selection)?

Missing data  Think carefully about how to address it Drop individuals? Drop variables? Multiple imputation?  Usually the best option.
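A minimal sketch of multiple imputation in R using the mice package (assumed to be installed); `nhanes` is a small example data set that ships with mice, used here purely for illustration.

```r
library(mice)

imp  <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)  # 5 imputed data sets
fits <- with(imp, lm(chl ~ age + bmi))                      # fit the model in each one
summary(pool(fits))                                         # combine with Rubin's rules
```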

Outliers  May have a lot of influence on your statistical tests  May be mistakes, or could be true data Should be examined to ensure they are true data If erroneous, could be removed or corrected

Other complications  Clustered data longitudinal data families, households, etc  Observations are not independent  Must account for the correlation between observations on the same person/in the same family/etc.  Mixed models, or marginal models estimated via GEE
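A minimal sketch of the two approaches named above, with made-up clustered data: a random-intercept (mixed) logistic model via lme4 and a marginal model estimated via GEE using geepack (both packages assumed to be installed).

```r
library(lme4)
library(geepack)
set.seed(9)

clus <- data.frame(
  household = rep(1:50, each = 4),           # 4 observations per household
  exposed   = rbinom(200, 1, 0.5)
)
u <- rnorm(50, 0, 1)[clus$household]          # shared household effect (simulation only)
clus$y <- rbinom(200, 1, plogis(-0.5 + 0.7 * clus$exposed + u))

# Mixed (random-intercept) logistic model
fit_mixed <- glmer(y ~ exposed + (1 | household), data = clus, family = binomial)

# Marginal model estimated via GEE with an exchangeable working correlation
fit_gee <- geeglm(y ~ exposed, id = household, data = clus,
                  family = binomial, corstr = "exchangeable")

summary(fit_mixed)
summary(fit_gee)
```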

Checking assumptions  Important, but rarely presented  Must check that the assumptions of the model are met! Model fit (predicted vs. observed) Outliers Influential observations  Key – is any one observation “driving” the results?
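A minimal sketch of basic model checking in R for a linear regression, using the built-in mtcars data; the same ideas (residual plots, influential observations, predicted vs. observed) carry over to other models.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)                      # residuals vs. fitted, QQ plot, scale-location, leverage
par(mfrow = c(1, 1))

# Influential observations: is any one point "driving" the results?
sort(cooks.distance(fit), decreasing = TRUE)[1:5]

# Predicted vs. observed
plot(fitted(fit), mtcars$mpg, xlab = "Predicted mpg", ylab = "Observed mpg")
abline(0, 1)
```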

Wrapping up – Data analysis
- Start with descriptives
  - Should tell most of the story most of the time
  - Should inform the next step (modelling)
- For RCTs
  - Primary analysis is usually ITT, unadjusted, a simple test of the outcome of interest (what test depends on the type of outcome)
  - Secondary analyses may include adjusting for confounders, per protocol analyses, subgroups; use regression appropriate for the type of outcome

Wrapping up – Data analysis 2
- For observational studies, the primary analysis should adjust for confounders
- Use regression methods appropriate for the outcome and the study design:
  - logistic regression for case control studies
  - conditional logistic regression for matched case control studies
  - Cox, Poisson, or logistic regression for cohort studies

Wrapping up – Data analysis 3  In all cases, consider: Functional form of exposure and covariates in regression How to choose confounders to include in the regression  Missing data  Checking assumptions

Software  R is available on the web Free Flexible Lots of online training resources

Binary or categorical outcomes (proportions)

Outcome variable: binary or categorical (e.g. fracture, yes/no)

Are the observations correlated?
- Independent:
  - Chi-square test: compares proportions between two or more groups
  - Relative risks: odds ratios or risk ratios
  - Logistic regression: multivariable technique used when outcome is binary; gives multivariable-adjusted odds ratios
- Correlated:
  - McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
  - Conditional logistic regression: multivariable regression technique for a binary outcome when groups are correlated (e.g., matched data)
  - Mixed models/GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if there are sparse cells:
- Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5)
- McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5)

Continuous outcome (means)

Outcome variable: continuous (e.g. pain scale, cognitive function)

Are the observations independent or correlated?
- Independent:
  - t-test: compares means between two independent groups
  - ANOVA: compares means between more than two independent groups
  - Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  - Linear regression: multivariable regression technique used when the outcome is continuous; gives slopes
- Correlated:
  - Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  - Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  - Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size) – non-parametric statistics:
- Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
- Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

Extra slides – technical stuff

Continuous outcome: T-test for two means
- Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?
- H0: μ1 = μ2
- HA: μ1 ≠ μ2

T-test, independent samples, pooled variance
If you assume that the standard deviation of the characteristic is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (df = n1 + n2 − 2), and thus your power.

T-test, pooled variances
t = (x̄1 − x̄2) / ( sp √(1/n1 + 1/n2) ),  where  sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
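A minimal sketch in R with made-up data: var.equal = TRUE requests the pooled-variance t-test described above.

```r
set.seed(10)
grp1 <- rnorm(30, mean = 100, sd = 15)
grp2 <- rnorm(25, mean = 92,  sd = 15)

t.test(grp1, grp2, var.equal = TRUE)   # pooled t-test, df = 30 + 25 - 2 = 53
t.test(grp1, grp2)                     # default: Welch's t-test (unequal variances)
```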

χ² Test of Independence for Proportions
- Does a relationship exist between 2 categorical variables?
- Assumptions:
  - Multinomial experiment
  - All expected counts ≥ 5
- Uses a two-way contingency table

χ² Test of Independence: Hypotheses & Statistic
- Hypotheses:
  - H0: the variables are independent
  - HA: the variables are related (dependent)
- Test statistic: χ² = Σ (Oij − Eij)² / Eij, summed over all cells, where Oij is the observed count and Eij the expected count
- Degrees of freedom: (r − 1)(c − 1)

Expected Count Calculation
Eij = (row i total × column j total) / overall total
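A minimal sketch of the expected-count and chi-square calculations in R, using a made-up 2×2 table, with chisq.test() as a check.

```r
obs <- matrix(c(25, 75,
                40, 160), nrow = 2, byrow = TRUE)

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # (row total x col total) / N
chi_sq   <- sum((obs - expected)^2 / expected)
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value  <- pchisq(chi_sq, df, lower.tail = FALSE)

chisq.test(obs, correct = FALSE)   # should match chi_sq and p_value above
```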

χ² Test of Independence: Example on HIV
You randomly sample 286 sexually active individuals and collect information on their HIV status and history of STDs. At the .05 level, is there evidence of a relationship?

χ² Test of Independence: Solution
- H0: no relationship; HA: relationship
- α = .05; df = (2 − 1)(2 − 1) = 1; critical value: χ² = 3.84
- Test statistic: χ² = 54.29
- Decision: reject H0 at α = .05
- Conclusion: there is evidence of a relationship between HIV status and history of STDs