Unit 3, Session 1 Statistical Models.


Outline
Logistic regression
ROC curve and AUC
Linear regression
Kaplan-Meier plot and log-rank test
Cox proportional hazards model

Logistic Model
The logistic model is used for case/control studies.
Usage scenario: the response is binary, for example disease/healthy or recurrence/non-recurrence.
$$\log\frac{p(\text{status}=\text{recurrence})}{1-p(\text{status}=\text{recurrence})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
where $x_i$ are the predictors and $\beta_i$ are the parameters of interest.
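To make the log-odds form concrete, here is a minimal sketch on simulated data. The variables `x`, `status` and the coefficients -0.5 and 1.2 are made up for illustration and are not part of the toy example data. It checks that the fitted probabilities, pushed through the logit, reproduce the linear predictor.

```r
# Simulated data; every name and value here is hypothetical.
set.seed(1)
x <- rnorm(200)
p_true <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))   # inverse logit of beta0 + beta1 * x
status <- rbinom(200, size = 1, prob = p_true)

fit <- glm(status ~ x, family = binomial(link = "logit"))

# The logit of the fitted probability equals the fitted linear predictor.
p_hat <- predict(fit, type = "response")
log_odds <- log(p_hat / (1 - p_hat))
linear_part <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * x

max_gap <- max(abs(log_odds - linear_part))   # numerically zero
```

The coefficient of `x` is therefore read as the change in log-odds of the event per unit increase in the predictor.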

Linear Model
Response: continuous, for example weight or gene expression.
Predictors: any variables (for example gene expression).
Model: $$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$$
Assumption: the error terms are i.i.d., $\epsilon \sim N(0, \sigma^2)$.
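The linear model can be sketched the same way on simulated data. The intercept 1, slope 2 and error standard deviation 0.5 below are arbitrary choices for the illustration, not values from the course data.

```r
# Simulated data under the stated assumptions: y = 1 + 2 * x1 + eps, eps ~ N(0, 0.5^2).
set.seed(2)
n <- 100
x1 <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 0.5)
y <- 1 + 2 * x1 + eps

fit <- lm(y ~ x1)
beta_hat <- coef(fit)     # estimates of beta0 and beta1
sigma_hat <- sigma(fit)   # residual standard error, an estimate of sigma
```

With 100 observations the estimates should land close to the values used in the simulation, which is a quick sanity check that the model and the data-generating assumptions agree.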

Survival Methods
Kaplan-Meier plot: visually compare the survival curves between groups.
Log-rank test and Cox proportional hazards model: formal statistical tests.
Response: survival time (for example DFS) and the censor indicator.
Predictors: any variables (for example group or specific genes).
Coding: recurrence is censor = 1, non-recurrence is censor = 0.
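A minimal sketch of the response coding and the log-rank test, using simulated times. The group labels, rates and censoring window are assumptions for illustration only. Note that with this coding the `censor` variable is 1 when the event is observed, which is the convention `Surv()` expects for its second argument.

```r
library(survival)

# Simulated data; all values are hypothetical.
set.seed(3)
n <- 60
group <- rep(c("high", "low"), each = n / 2)
event_time <- rexp(n, rate = ifelse(group == "high", 0.2, 0.05))
censor_time <- runif(n, min = 5, max = 15)          # end of follow-up
time <- pmin(event_time, censor_time)               # what is actually observed
censor <- as.integer(event_time <= censor_time)     # 1 = event observed, 0 = censored

km <- survfit(Surv(time, censor) ~ group)           # Kaplan-Meier curves per group
lr <- survdiff(Surv(time, censor) ~ group)          # log-rank test
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
```

Calling `plot(km)` draws the two curves, and `p_value` is the log-rank P-value for the difference between them.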

Load data
Toy example data:
toy_data <- read.csv("toy_example_data.csv")

Logistic Model
Response: recurrence/non-recurrence status
Predictor: the expression of gene HOXB13
library(ROCR)  # provides prediction() and performance()
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
fit.logistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = 'logit'))
summary(fit.logistic)
# plot the ROC curve
p <- predict(fit.logistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
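To see what the AUC number measures, it can also be computed by hand in base R with the rank (Mann-Whitney) formula: the AUC is the fraction of recurrence/non-recurrence pairs in which the recurrence case gets the higher score. The helper `auc_rank` and the scores below are hypothetical and only serve as an illustration.

```r
# Hypothetical predicted probabilities and labels (1 = recurrence, 0 = non-recurrence).
labels <- c(1, 1, 1, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

# AUC from the rank-sum of the positive cases (ties get average ranks).
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_demo <- auc_rank(scores, labels)
auc_demo   # 0.9375: 15 of the 16 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation of the two groups.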

Logistic Regression Result

ROC Curve

Linear Model
Response: expression of HOXB13
Predictor: expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fit.lm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fit.lm)

Linear Regression Result

Kaplan-Meier Plot
We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (for example the high and low ratio groups).
library(survival)  # provides Surv() and survfit()
ratio.surv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
# autoplot() with the pVal/pX/pY arguments is assumed to come from a survfit plotting package such as survMisc
autoplot(ratio.surv, pVal = TRUE, pX = 0.25, pY = 0.25, title = "Kaplan-Meier plot of toy example", yLab = "Survival Probability")

Kaplan-Meier Plot

Cox Proportional Hazards Model
We use the high/low ratio group to predict survival. The response is the survival time together with the censor indicator.
fit.cox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fit.cox)
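The main quantity to read from a Cox fit is the hazard ratio, exp(coefficient). A minimal sketch on simulated data follows. The data frame `surv_df`, the rates and the group sizes are assumptions made for the illustration and do not come from the toy example.

```r
library(survival)

# Simulated data; the "high" group is given three times the hazard of the "low" group.
set.seed(4)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
censor_time <- runif(n, min = 5, max = 20)
surv_df <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),   # 1 = event observed
  group = group
)

fit <- coxph(Surv(time, censor) ~ group, data = surv_df)
hr <- exp(coef(fit))       # hazard ratio of "high" relative to "low"
ci <- exp(confint(fit))    # 95% confidence interval on the hazard-ratio scale
```

A hazard ratio above 1 means the "high" group experiences the event at a higher rate than the reference ("low") group; `summary(fit)` reports the same quantities together with the test statistics.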

Cox Model Result

Data Downloading, Processing and Analysis

Outline
Download data
Parse data
Normalization
Variance-based filtering (top 25%)
t-test-based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script.

Data Availability
Microdissected dataset, GSE1378: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
Whole tissue dataset, GSE1379: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
The easiest way to download the data is the "getGEO" function from the "GEOquery" package.

Use "getGEO" to Download Data
We have already downloaded the data. The "getGEO" function can load it from the local copy or fetch it online.
Local (loading_method = 'local'):
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0('geo_data/', geo_Name, '_series_matrix.txt.gz'), GSEMatrix = TRUE)
Online (loading_method = 'online'):
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose how the data is loaded.
Note that the downloaded geno matrix is in log2 scale.

Parsing Data
Extract the geno matrix, pheno table and feature table:
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), 'data.frame')
Parse the phenotype table to get the variables Age, Size, DFS and censor:
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)

Normalization
Gene wise normalization (subtract each sample's median log2 value, taken across genes):
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
Sample wise normalization (divide each gene by its mean across samples, in the original scale):
geno <- apply(geno, c(1, 2), function(x) { 2 ^ x })
geno <- t(apply(geno, 1, function(x) { x / (mean(x)) }))
geno <- apply(geno, c(1, 2), function(x) { log2(x) })
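The two normalization steps can be traced on a tiny matrix. This is a sketch that swaps in `sweep()` and `rowMeans()` for the matrix product and `apply()` calls used in the slides, which should give the same result for a genes-by-samples matrix. The matrix `geno_demo` is hypothetical.

```r
# Hypothetical log2 expression matrix: 3 genes (rows) x 4 samples (columns).
geno_demo <- matrix(c(1, 2, 3,
                      2, 4, 6,
                      0, 1, 2,
                      3, 3, 3),
                    nrow = 3,
                    dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# Step 1: subtract the median log2 value of each column (sample).
step1 <- sweep(geno_demo, 2, apply(geno_demo, 2, median), FUN = "-")

# Step 2: back to the original scale, divide each row (gene) by its mean, return to log2.
linear <- 2 ^ step1
step2 <- log2(linear / rowMeans(linear))

col_medians_after_step1 <- apply(step1, 2, median)   # all 0
row_means_after_step2 <- rowMeans(2 ^ step2)         # all 1 in the original scale
```

After step 1 every sample has median 0 in log2 scale, and after step 2 every gene has mean 1 in the original scale.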

Variance Based Filtering
Calculate the variance for each gene and keep the top 25%.
# variance based filtering (75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx,]
geno_var_filtered <- geno[var_filtered_idx,]
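A small self-contained check of the filter on simulated data. The 40 x 6 matrix is hypothetical. It confirms that the logical index keeps exactly the top quarter of genes by variance.

```r
# Simulated matrix: 40 genes (rows) x 6 samples (columns); values are hypothetical.
set.seed(5)
geno_demo <- matrix(rnorm(40 * 6), nrow = 40)

var_demo <- apply(geno_demo, 1, var)              # one variance per gene
keep <- var_demo > quantile(var_demo, 0.75)       # TRUE for genes above the 75th percentile
geno_kept <- geno_demo[keep, ]

n_kept <- sum(keep)   # 10 of the 40 genes are kept
```

Because `keep` is a logical vector indexed by gene, the same vector can subset both the expression matrix and the feature table, which keeps the two aligned.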

T test Based Filtering
For each gene, do a t test between the recurrence and non-recurrence groups. The status variable indicates the group.
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
Filter the genes by the P-value cutoff:
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx,]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx,]
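The slide shows only the body of the per-gene loop. One way the full loop might look is sketched below on simulated data. The matrix, the artificial shift in gene 1 and the cutoff value are assumptions for illustration, and the actual "get_DEG_table.R" script may be organised differently.

```r
# Simulated data: 5 genes x 20 samples, two groups of 10; all values are hypothetical.
set.seed(6)
n_gene <- 5
n_sample <- 20
status <- factor(rep(c("recur", "non-recur"), each = n_sample / 2))
geno_demo <- matrix(rnorm(n_gene * n_sample), nrow = n_gene)
# Shift gene 1 in the recurrence group so that it is truly differentially expressed.
geno_demo[1, status == "recur"] <- geno_demo[1, status == "recur"] + 5

pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno_demo[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.001                                   # hypothetical cutoff for this demo
ttest_filtered_idx <- which(pvalue_list < cutoff)
```

The index vector `ttest_filtered_idx` then subsets the variance-filtered tables, as on the slide. Since one test is run per gene, the cutoff is usually chosen much smaller than 0.05.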

Sample Results (GSE1378, microdissected dataset, cutoff 0.0011)

Sample Results (GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline
Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
Heatmap and dendrogram
Univariate logistic regression for selected genes and the two-gene ratio predictor
Multivariate logistic regression (size and the other two potential predictors)
Survival analysis part 1: Kaplan-Meier plot
Survival analysis part 2: Cox proportional hazards model

Overlapped Genes
In the preprocessing step, we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379.
We used the genes that overlap between these two DEG tables for the subsequent analysis.
GSE1378: micro-dissected breast cancer cells (LCM)
GSE1379: whole tissue sections
The overlapped genes are HOXB13 (identified twice as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST).
We will study the prognostic value of these markers.
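Selecting the overlap between two DEG tables can be sketched with `intersect()` on the accession column. The tables and the placeholder accessions below are hypothetical, and the column name `GB_ACC` is assumed to match the feature table used elsewhere in these slides.

```r
# Hypothetical DEG tables with placeholder accession numbers.
deg_1378 <- data.frame(GB_ACC = c("ACC001", "ACC002", "ACC003", "ACC004"))
deg_1379 <- data.frame(GB_ACC = c("ACC003", "ACC005", "ACC001"))

# Accessions that appear in both tables.
overlap_acc <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)

# Rows of the first table restricted to the overlap.
deg_overlap <- deg_1378[deg_1378$GB_ACC %in% overlap_acc, , drop = FALSE]
```

Matching on accession numbers rather than gene symbols avoids problems when the same gene is represented by several probes.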

Heatmap and Dendrogram
We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples.
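The dendrogram behind a heatmap comes from hierarchical clustering. A minimal sketch with base R follows, assuming a correlation-based distance and average linkage. These are choices made for the illustration and the slides do not state which settings were used. The four genes are simulated so that geneA and geneB are strongly correlated.

```r
# Simulated expression for 4 genes across 12 samples; all values are hypothetical.
set.seed(7)
base <- rnorm(12)
geno_demo <- rbind(geneA = base + rnorm(12, sd = 0.1),
                   geneB = base + rnorm(12, sd = 0.1),
                   geneC = rnorm(12),
                   geneD = rnorm(12))

# Distance between genes: 1 - Pearson correlation of their expression profiles.
gene_dist <- as.dist(1 - cor(t(geno_demo)))
gene_tree <- hclust(gene_dist, method = "average")

# Cut the tree into 3 clusters; the two correlated genes end up together.
clusters <- cutree(gene_tree, k = 3)
```

Calling `plot(gene_tree)` draws the dendrogram, and `heatmap(geno_demo)` draws a heatmap with row and column dendrograms using its own default distance and linkage.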

Heatmap (microdissected,GSE1378) consistent with the paper

Heatmap (whole section tissue, GSE1379)

Model Set 1
Univariate logistic regression for each gene
Response variable: recur/non-recur status
Predictors: one of the overlapped genes, HOXB13 / IL17BR (AF208111) / AI240933 (EST)

Model Set 2
Univariate logistic regression for the ratio of genes
Response variable: recur/non-recur status
Predictor: HOXB13:IL17BR

Model Set 3
Multivariate logistic regression
Response variable: recur/non-recur status
Predictors: tumor size, HOXB13:IL17BR, PGR and ERBB2

Model Set 4
Survival model
Response variable: DFS (disease-free survival time) and censor
Predictor: high ratio group versus low ratio group. The samples are divided at the cutoff "-intercept/beta" from the logistic regression, which is the ratio value at which the fitted recurrence probability equals 0.5.
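The cutoff follows from setting the fitted log-odds to zero: intercept + beta * ratio = 0 gives ratio = -intercept/beta, where the fitted probability is 0.5. A minimal sketch on simulated data, in which the variable names and coefficients are hypothetical:

```r
# Simulated log2 ratio and recurrence status; all values are hypothetical.
set.seed(8)
ratio <- rnorm(150)
status <- rbinom(150, size = 1, prob = plogis(1 + 2 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))

# Cutoff where the fitted log-odds are 0, i.e. the fitted probability is 0.5.
cutoff <- -coef(fit)[["(Intercept)"]] / coef(fit)[["ratio"]]
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff), type = "response")

# Split the samples into the two groups used by the survival models.
ratio_group <- ifelse(ratio > cutoff, "high", "low")
```

When beta is positive, samples above the cutoff are the ones the logistic model classifies as more likely to recur than not.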

Important Note
Remember that there are two datasets, GSE1378 and GSE1379.
The same sets of models can be fitted on both datasets.
Set the working dataset variable to choose one:
#working_dataset = "GSE1379" # whole tissue section, GSE1379
working_dataset = "GSE1378" # microdissected breast cancer cells, GSE1378
We use the working dataset GSE1378 as the example.

Univariate Logistic Regression for Each Gene
As an example, we check the gene HOXB13.
gb_acc = "BC007092" # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc),]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
# the formula refers to the column `gene` of logit_data
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = 'logit'))
# ROC curve and AUC; prediction() and performance() come from the ROCR package
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

Sample Output (gene HOXB13 )

ROC (auc 0.796, gene HOXB13 )

Univariate Logistic Regression (HOXB13:IL17BR)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model; the formula refers to the column `ratio` of logit_data
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = 'logit'))
summary(fit)
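The comment "in the log2 scale, the ratio is the difference" rests on log2(a/b) = log2(a) - log2(b). A short numeric check with made-up expression values:

```r
# Hypothetical expression values in the original (linear) scale.
hoxb13_linear <- c(8, 4, 16)
il17br_linear <- c(2, 4, 1)

ratio_log2 <- log2(hoxb13_linear / il17br_linear)        # log2 of the ratio
diff_log2 <- log2(hoxb13_linear) - log2(il17br_linear)   # difference of log2 values

ratio_log2   # 2 0 4
```

Since the geno matrix is already in log2 scale, subtracting the two rows gives the log2 HOXB13:IL17BR ratio directly, without converting back to the original scale.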

Sample Output (HOXB13:IL17BR)

ROC (auc=0.84, HOXB13:IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
gene_name3 = "PGR_3UTR1" # PGR
gene_name4 = "BF108852" # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
geno_selected3 = geno[which(feature$GeneName == gene_name3),]
geno_selected4 = geno[which(feature$GeneName == gene_name4),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression; the formula refers to columns of logit_data
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = 'logit'))
summary(fit)

Sample Output (Multivariate)

ROC (auc = 0.86, Multivariate )

Kaplan-Meier Plot (gene ratio high/low group, cutoff = -1.2)

Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -1
fit.cox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fit.cox)

Sample Output (Cox)

Validation: GSE6532
Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
Sample size: 87
Total number of markers: 54675
The genes HOXB13 and IL17RB and the ESTs are included in this dataset, so we use it for validation.
Result: the markers are not significant on this independent set.