Unit 3, Session 1 Statistical Models
Outline
- Logistic Regression
- ROC curve and AUC
- Linear Regression
- Kaplan-Meier plot and log-rank test
- Cox Proportional hazards model
Logistic Model
The logistic model is used for case/control studies.
Usage scenario: the response is binary, say disease/healthy or recurrence/non-recurrence.
$$\log \frac{p(\text{status}=\text{recurrence})}{1-p(\text{status}=\text{recurrence})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
where $x_i$ are the predictors and $\beta_i$ are the parameters of interest.
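To make the link between the linear predictor and the recurrence probability concrete, here is a minimal sketch on simulated data. The variable names (x, status) and the true coefficients -0.5 and 1.2 are assumptions chosen for illustration; they are not taken from the course data.

```r
# Simulated data (an assumption for illustration; not the course data)
set.seed(1)
x <- rnorm(200)                                    # one predictor
status <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))   # 1 = recurrence, 0 = non-recurrence

fit <- glm(status ~ x, family = binomial(link = "logit"))
beta <- coef(fit)

# The log-odds are linear in the predictor ...
log_odds <- beta[1] + beta[2] * x
# ... and the inverse logit maps them back to probabilities
p_manual <- 1 / (1 + exp(-log_odds))

# These should agree with the fitted probabilities reported by predict()
p_fitted <- predict(fit, type = "response")
max_diff <- max(abs(p_manual - p_fitted))
```

The agreement between p_manual and p_fitted shows that the fitted model is exactly the equation on the slide, solved for the probability.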
Linear Model
Response: continuous, say weight or gene expression.
Predictors: any variables (say gene expression).
Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$
Assumption: the error terms are $\epsilon \overset{iid}{\sim} N(0, \sigma^2)$.
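A small simulation can illustrate what the model and its error assumption mean in practice. This is only a sketch: the true values (intercept 1, slope 2, sigma 0.5) and the variable names are assumptions made up for the example.

```r
# Simulate from y = 1 + 2 * x + eps, with eps ~ iid N(0, 0.5^2) (assumed values)
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x)
beta_hat <- coef(fit)            # estimates of beta_0 and beta_1
sigma_hat <- summary(fit)$sigma  # estimate of the error standard deviation

# One possible check of the normality assumption on the residuals
normality_p <- shapiro.test(resid(fit))$p.value
```

With real data, a small normality_p or a clear pattern in the residuals would suggest that the error assumption is questionable.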
Survival Methods
Kaplan-Meier plot: visually compare the survival curves between groups.
Cox proportional hazards model and log-rank test: formal statistical tests.
Response: survival time (say DFS) and the censor indicator.
Predictors: any variables (say group or specific genes).
Coding of the censor indicator: recurrence, censor = 1; non-recurrence, censor = 0.
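The response in a survival model is a pair (time, censor). As a rough illustration, the sketch below builds such a response with Surv() from the survival package, using four made-up follow-up times (the numbers are assumptions, not course data).

```r
library(survival)

# Hypothetical follow-up times and censor indicators
# censor = 1: recurrence observed; censor = 0: no recurrence by the end of follow-up
time   <- c(12, 30, 45, 60)
censor <- c(1, 0, 1, 0)

s <- Surv(time, censor)
print(s)   # censored observations are printed with a trailing "+"

# Number of observed recurrences
n_events <- sum(s[, "status"])
```

The same Surv(time, censor) expression is what appears on the left-hand side of the survfit() and coxph() formulas.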
Load data
Toy example data:
    toy_data <- read.csv("toy_example_data.csv")
Logistic Model
Response: recurrence/non-recurrence status
Predictor: the expression of gene HOXB13
    library(ROCR)  # provides prediction() and performance()
    # logistic regression: use gene HOXB13 to predict the recur/non-recur status
    fit.logistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = 'logit'))
    summary(fit.logistic)
    # plot the ROC curve
    p <- predict(fit.logistic, type = "response")
    pr <- prediction(p, toy_data$status)
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf, main = "ROC plot of logistic regression")
    # calculate the AUC
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc
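To see what the AUC measures, it can also be computed by hand in base R. The sketch below uses the rank (Mann-Whitney) formula rather than ROCR; the helper name auc_rank and the four example scores are assumptions for illustration.

```r
# AUC via the rank formula: the probability that a randomly chosen positive
# case receives a higher score than a randomly chosen negative case.
# auc_rank is a hypothetical helper, not part of ROCR.
auc_rank <- function(score, label) {
  r <- rank(score)               # average ranks, so ties count as one half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny made-up example: 2 negatives (label 0) and 2 positives (label 1)
score <- c(0.1, 0.4, 0.35, 0.8)
label <- c(0, 0, 1, 1)
auc_example <- auc_rank(score, label)
auc_example
```

Here three of the four positive/negative pairs are ordered correctly, so the value is 0.75. An AUC of 0.5 corresponds to random guessing and 1 to perfect separation.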
Logistic Regression Result
ROC Curve
Linear Model
Response: expression of HOXB13
Predictor: expression of IL17BR
    # linear model: use gene IL17BR to predict another gene, HOXB13
    fit.lm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
    summary(fit.lm)
Linear Regression Result
Kaplan-Meier Plot
We use the Kaplan-Meier plot and the log-rank test to check whether survival differs significantly between groups (say the high and low ratio groups).
    library(survival)  # provides Surv() and survfit()
    ratio.surv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
    # autoplot() with the pVal/pX/pY/yLab arguments is assumed to come from the survMisc package
    autoplot(ratio.surv, pVal = TRUE, pX = 0.25, pY = 0.25, title = "Kaplan-Meier plot of toy example", yLab = "Survival Probability")
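The log-rank test itself can be run with survdiff() from the survival package. The following is a sketch on simulated data: the group labels, hazard rates and censoring window are assumptions invented for the example.

```r
library(survival)

# Simulated two-group survival data (assumed rates; not the course data)
set.seed(3)
n <- 60
ratio_group <- rep(c("high", "low"), each = n / 2)
event_time <- rexp(n, rate = ifelse(ratio_group == "high", 0.10, 0.03))
cens_time <- runif(n, 5, 40)                    # end of follow-up
time <- pmin(event_time, cens_time)             # observed time
censor <- as.integer(event_time <= cens_time)   # 1 = event observed

# Kaplan-Meier estimate per group
km <- survfit(Surv(time, censor) ~ ratio_group)

# Log-rank test: chi-square statistic with (number of groups - 1) degrees of freedom
lr <- survdiff(Surv(time, censor) ~ ratio_group)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_logrank
```

Calling plot(km) draws the two Kaplan-Meier curves with base graphics if no plotting package is available.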
Kaplan-Meier Plot
Cox Proportional Hazards Model
We use the high/low ratio group as the predictor. The response is the survival time together with the censor information.
    fit.cox <- coxph(Surv(time, censor) ~ group, data = toy_data)
    summary(fit.cox)
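The quantity usually read off a Cox fit is the hazard ratio between groups. Below is a minimal sketch on simulated data showing how it can be extracted; the data frame surv_data, the hazard rates and the choice of "low" as the reference level are all assumptions for illustration.

```r
library(survival)

# Simulated data: the "high" group has a larger hazard than "low" (assumed rates)
set.seed(4)
n <- 80
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.04))
cens_time <- runif(n, 5, 40)
surv_data <- data.frame(
  time   = pmin(event_time, cens_time),
  censor = as.integer(event_time <= cens_time),
  group  = group
)

fit.cox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# Hazard ratio of "high" relative to the reference level "low",
# with a 95% Wald confidence interval
hr <- exp(coef(fit.cox))
ci <- exp(confint(fit.cox))
hr
ci
```

A hazard ratio above 1 with a confidence interval that excludes 1 would indicate a higher recurrence hazard in the high ratio group.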
Cox Model Result
Data Downloading, Processing and Analysis
Outline
- Download data
- Parsing data
- Normalization
- Variance based filtering (top 25%)
- T test based filtering (based on the P-value cutoff)
The above steps are implemented in the “get_DEG_table.R” script.
Data Availability
Microdissected dataset, GSE1378: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
Whole tissue dataset, GSE1379: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
The easiest way to download the data is the “getGEO” function from the “GEOquery” package.
Use “getGEO” to Download Data
We have already downloaded the data. You can use the “getGEO” function to load it locally or fetch it online.
Local (loading_method = "local"):
    library(GEOquery)
    geo_Name <- 'GSE1378'
    geodata2 <- getGEO(filename = paste0('geo_data/', geo_Name, '_series_matrix.txt.gz'), GSEMatrix = TRUE)
Online (loading_method = "online"):
    geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose between the two.
Note that the downloaded geno matrix is in log2 scale.
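One way the local/online switch could be organised is sketched below. The helper names series_matrix_path and load_geo are hypothetical and are not taken from the “get_DEG_table.R” script; only the getGEO() arguments filename, GSEMatrix and destdir come from the slide.

```r
# Hypothetical helper: build the path of a locally stored series matrix file
series_matrix_path <- function(geo_name, dir = "geo_data") {
  file.path(dir, paste0(geo_name, "_series_matrix.txt.gz"))
}

# Hypothetical wrapper around GEOquery::getGEO() with a loading_method switch
load_geo <- function(geo_name, loading_method = c("local", "online"), dir = "geo_data") {
  loading_method <- match.arg(loading_method)
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("The GEOquery package is required to load GEO data")
  }
  if (loading_method == "local") {
    # read the series matrix file that was downloaded earlier
    GEOquery::getGEO(filename = series_matrix_path(geo_name, dir), GSEMatrix = TRUE)
  } else {
    # download from GEO and cache the file in dir
    GEOquery::getGEO(geo_name, GSEMatrix = TRUE, destdir = dir)
  }
}

local_path <- series_matrix_path("GSE1378")
local_path
```

A call such as load_geo("GSE1378", "local") would then read the cached file, provided GEOquery is installed and the file exists under geo_data.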
Parsing Data
Extract the geno matrix, pheno table and feature table:
    idx <- 1
    geno <- assayData(geodata[[idx]])$exprs
    pheno <- pData(phenoData(geodata[[idx]]))
    feature <- as(featureData(geodata[[idx]]), 'data.frame')
Parse the phenotype table to get the variables Age, Size, DFS and censor:
    infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
    infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
    infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
    infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
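The strsplit() indexing above relies on every entry having the form "key=value". The sketch below shows the idea on made-up strings, together with an alternative based on sub(); the example strings and the helper name parse_value are assumptions, since the exact text stored in the phenotype table is not shown here.

```r
# Hypothetical phenotype strings of the form "key=value"
raw_age <- c("Age=55", "Age=61", "Age=47")
n <- length(raw_age)

# Approach from the slide: split on "=" and keep every second piece
age_slide <- as.numeric(unlist(strsplit(raw_age, split = "="))[seq(2, 2 * n, 2)])

# Alternative sketch: drop everything up to and including the first "="
parse_value <- function(x) as.numeric(sub("^[^=]*=", "", x))
age <- parse_value(raw_age)
age
```

Both approaches give the same numbers on well-formed input. The sub() version works entry by entry, so it does not depend on each string containing exactly one "=".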
Normalization
Gene wise normalization: for each sample, subtract the median log2 value taken over the genes.
    tmp_gm <- apply(geno, 2, median)  # one median per sample (column)
    geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
Sample wise normalization: for each gene, divide by its mean over the samples in the original scale.
    geno <- apply(geno, c(1, 2), function(x) { 2 ^ x })  # back to the original scale
    geno <- t(apply(geno, 1, function(x) { x / (mean(x)) }))  # divide each gene (row) by its mean
    geno <- apply(geno, c(1, 2), function(x) { log2(x) })  # return to the log2 scale
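The two normalization steps can also be written with vectorized base R functions. The sketch below applies them to a small simulated matrix (5 genes by 4 samples, an assumption for illustration) and is meant as an equivalent formulation, not as the code of the course script.

```r
# Simulated log2 expression matrix: rows = genes, columns = samples (assumed sizes)
set.seed(1)
geno <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4)

# Step 1: subtract each sample's median log2 value (column-wise centering)
centered <- sweep(geno, 2, apply(geno, 2, median), FUN = "-")

# Step 2: in the original scale, divide each gene (row) by its mean,
# then go back to the log2 scale
linear <- 2^centered
normalized <- log2(linear / rowMeans(linear))

# Checks: every sample median is 0 after step 1,
# and every gene has mean 1 in the original scale after step 2
sample_medians <- apply(centered, 2, median)
gene_means <- rowMeans(2^normalized)
```

The division linear / rowMeans(linear) works row by row because R recycles the vector of row means down each column of the matrix.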
Variance Based Filtering
Calculate the variance of each gene and keep the top 25%:
    # variance based filtering (keep genes above the 75th percentile)
    var_geno <- apply(geno, 1, var)
    var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
    feature_var_filtered <- feature[var_filtered_idx,]
    geno_var_filtered <- geno[var_filtered_idx,]
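As a quick sanity check of the filter, the sketch below runs the same steps on a simulated matrix of 100 genes by 6 samples (sizes chosen only for illustration) and counts how many genes survive.

```r
# Simulated expression matrix: 100 genes (rows) x 6 samples (columns)
set.seed(5)
geno <- matrix(rnorm(100 * 6), nrow = 100)

# Per-gene variance, then keep the genes above the 75th percentile
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
geno_var_filtered <- geno[var_filtered_idx, ]

n_kept <- sum(var_filtered_idx)
n_kept
```

With 100 genes, the filter keeps the 25 most variable ones, which matches the "top 25%" description.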
T test Based Filtering
For each gene, do a T test between the recurrence and non-recurrence groups. The status variable holds the group information. Inside the loop over genes (index i):
    tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
    pvalue_list[i] <- tmp_test$p.value
Filter the genes by the P-value cutoff:
    ttest_filtered_idx <- which(pvalue_list < cutoff)
    feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx,]
    geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx,]
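The slide shows only the body of the loop. A complete loop might look like the sketch below, run here on simulated data in which gene 1 is shifted upward in the recurrence group. The matrix sizes, group labels and the planted shift are assumptions; the cutoff 0.0011 is the one used in the sample results.

```r
# Simulated data: 20 genes x 20 samples, 10 samples per group (assumed sizes)
set.seed(6)
n_genes <- 20
n_per_group <- 10
status <- factor(rep(c("non-recur", "recur"), each = n_per_group))
geno <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)

# Plant one truly different gene: raise gene 1 in the recurrence group
geno[1, status == "recur"] <- geno[1, status == "recur"] + 5

cutoff <- 0.0011
pvalue_list <- numeric(n_genes)
for (i in seq_len(n_genes)) {
  # one row of the matrix = expression of gene i across all samples
  sdata <- data.frame(gene_express = geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

# Keep the genes whose P-value is below the cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
ttest_filtered_idx
```

Note that t.test() performs the Welch (unequal variance) test by default; pass var.equal = TRUE for the classical two-sample T test.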
Sample Results (GSE1378,microdissected, 0.0011 cutoff)
Sample Results (GSE1379, whole tissue dataset, cutoff 0.0011)
Statistical Modeling (examples)
Outline
- Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
- Heatmap and Dendrogram
- Univariate logistic regression for selected genes and two-gene ratio predictor
- Multivariate logistic regression (size and the other two potential predictors)
- Survival analysis part 1: Kaplan-Meier plot
- Survival analysis part 2: Cox proportional hazards model
Overlapped Genes
In the preprocessing step, we obtained two differentially expressed gene (DEG) tables, one for each of the datasets GSE1378 and GSE1379.
We use the genes that appear in both DEG tables for the subsequent analysis.
GSE1378: micro-dissected breast cancer cells (LCM)
GSE1379: whole tissue section
The overlapped genes are HOXB13 (identified twice as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST).
We will study the prognostic value of these markers.
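Selecting the overlap amounts to intersecting the accession columns of the two DEG tables. The sketch below assumes each table has a GB_ACC column, as the feature table does in the later code. The two tables are hypothetical stand-ins, and the PLACEHOLDER entries are made up to represent genes found in only one dataset.

```r
# Hypothetical DEG tables; only the GB_ACC column matters here.
# PLACEHOLDER_A and PLACEHOLDER_B are invented dataset-specific entries.
deg_1378 <- data.frame(
  GB_ACC = c("AI208111", "BC007092", "AF208111", "AI240933", "PLACEHOLDER_A"),
  stringsAsFactors = FALSE
)
deg_1379 <- data.frame(
  GB_ACC = c("BC007092", "AI208111", "AI240933", "AF208111", "PLACEHOLDER_B"),
  stringsAsFactors = FALSE
)

# Accessions present in both tables
overlap <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)
overlap
```

The result keeps the order of the first table, so sorting or matching by accession may be needed before comparing rows across datasets.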
Heatmap and Dendrogram
We use a heatmap and a dendrogram to visually check the relationships (correlation) among genes or samples.
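The dendrogram drawn next to a heatmap comes from hierarchical clustering. As a rough sketch, the code below clusters the genes of a simulated matrix using a correlation-based distance. The matrix, the distance (1 minus correlation) and the average linkage are assumptions for illustration and may differ from what the course script uses.

```r
# Simulated expression matrix: 6 genes x 8 samples (assumed sizes and names)
set.seed(7)
expr <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:8)))

# Correlation-based distance between genes: highly correlated genes are close
gene_dist <- as.dist(1 - cor(t(expr)))

# Hierarchical clustering; the resulting tree is the dendrogram
hc <- hclust(gene_dist, method = "average")

# Order in which the genes would appear along the heatmap axis
gene_order <- hc$labels[hc$order]
gene_order
```

Passing the same matrix to the base R heatmap() function draws the colour image together with row and column dendrograms, and plot(hc) draws the gene dendrogram alone.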
Heatmap (microdissected,GSE1378) consistent with the paper
Heatmap (whole section tissue, GSE1379)
Model Set 1
Univariate logistic regression for each gene
Response variable: recur/non-recur status
Predictors: one of the overlapped genes, HOXB13 / IL17BR (AF208111) / AI240933 (EST)
Model Set 2
Univariate logistic regression for the ratio of genes
Response variable: recur/non-recur status
Predictor: HOXB13:IL17BR
Model Set 3
Multivariate logistic regression
Response variable: recur/non-recur status
Predictors: tumor size, HOXB13:IL17BR, PGR and ERBB2
Model Set 4
Survival model
Response variable: DFS (disease free survival time), censor
Predictor: high/low ratio group. The cutoff “-intercept/beta” from the logistic regression, which is the ratio at which the fitted recurrence probability equals 0.5, divides the samples into a high ratio group and a low ratio group.
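The cutoff “-intercept/beta” follows from setting the log-odds to zero: intercept + beta * ratio = 0 gives ratio = -intercept/beta, where the fitted probability is 0.5. A minimal sketch on simulated data follows; the variable names, the true coefficients and the rule that values above the cutoff form the high group are assumptions.

```r
# Simulated log2 ratio and recurrence status (assumed coefficients 0.6 and 1.5)
set.seed(8)
ratio <- rnorm(100)
status <- rbinom(100, 1, plogis(0.6 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
intercept <- coef(fit)[["(Intercept)"]]
beta <- coef(fit)[["ratio"]]

# Ratio at which the fitted recurrence probability equals 0.5
cutoff <- -intercept / beta

# Divide the samples into two groups at the cutoff
ratio_group <- ifelse(ratio > cutoff, "high", "low")
table(ratio_group)

# Check: the fitted probability at the cutoff
p_at_cutoff <- plogis(intercept + beta * cutoff)
```

The resulting ratio_group variable is what then enters the Kaplan-Meier plot and the Cox model as the grouping predictor.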
Important Note
Remember that there are two datasets, GSE1378 and GSE1379.
The same sets of models can be fitted on both datasets.
Set the working dataset variable to choose one:
    #working_dataset = "GSE1379" #whole tissue section, GSE1379
    working_dataset = "GSE1378" #microdissected breast cancer cells, GSE1378
We use the working dataset GSE1378 as the example.
Univariate Logistic Regression for Each Gene
As an example, we check the gene HOXB13.
    library(ROCR)  # provides prediction() and performance()
    gb_acc = "BC007092" # HOXB13
    geno_selected = geno[which(feature$GB_ACC == gb_acc),]
    logit_data = data.frame(status = infos_df$status, gene = geno_selected)
    # the formula refers to the column of logit_data (gene)
    fit <- glm(status ~ gene, data = logit_data, family = binomial(link = 'logit'))
    # ROC curve
    p <- predict(fit, type = "response")
    pr <- prediction(p, infos_df$status)
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf, main = paste0("ROC plot of gene ", gb_acc))
    # AUC
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc
Sample Output (gene HOXB13 )
ROC (auc 0.796, gene HOXB13 )
Univariate Logistic Regression (HOXB13:IL17BR)
    gb_acc1 = "BC007092" # HOXB13
    gb_acc2 = "AF208111" # IL17BR
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
    # in the log2 scale, the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
    # fit the model; the formula refers to the column of logit_data (ratio)
    fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = 'logit'))
    summary(fit)
Sample Output (HOXB13:IL17BR)
ROC (auc=0.84, HOXB13:IL17BR)
Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
    gb_acc1 = "BC007092" # HOXB13
    gb_acc2 = "AF208111" # IL17BR
    gene_name3 = "PGR_3UTR1" # PGR
    gene_name4 = "BF108852" # ERBB2
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
    geno_selected3 = geno[which(feature$GeneName == gene_name3),]
    geno_selected4 = geno[which(feature$GeneName == gene_name4),]
    # in the log2 scale, the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
    # fit the multivariate logistic regression; the formula refers to columns of logit_data
    fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = 'logit'))
    summary(fit)
Sample Output (Multivariate)
ROC (auc = 0.86, Multivariate )
Kaplan-Meier Plot (gene ratio high/low group, cutoff = -1.2)
Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -1
    fit.cox <- coxph(Surv(time, censor) ~ group, data = surv_data)
    summary(fit.cox)
Sample Output (Cox)
Validation: GSE6532
Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
Sample size: 87
Number of total markers: 54675
The gene HOXB13, the gene IL17RB and the ESTs are included in this dataset.
We use this dataset for validation.
Result: the markers are not significant on this independent set.