BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics

jmar@hsph.harvard.edu

Roadmap for Today
Some More Advanced Statistical Models
- Multiple Linear Regression
- Generalized linear models
  – Logistic Regression
  – Poisson Regression
  – Survival Analysis
Multivariate Data Analysis
Programming Tutorials
Bits & Pieces

Tutorial 4

Multiple Linear Regression
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions (in each pair, the first is in the stats package that loads with R, the second is in the MASS package):
drop1, dropterm
add1, addterm
step, stepAIC
Similarly, to test terms or compare fits: anova(modObj, test="Chisq")
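A minimal sketch of how these functions fit together. The built-in mtcars data and the choice of covariates are assumptions made for illustration; they are not the data used in the lecture.

```r
# Illustrative only: mtcars stands in for the lecture's data set (an assumption).
library(MASS)  # provides stepAIC, dropterm, addterm

full.model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# update() refits with a modified formula; "." means "whatever was there before".
reduced.model <- update(full.model, . ~ . - drat)

# drop1() shows the effect of removing each term in turn.
drop1(full.model, test = "F")

# stepAIC() searches over formulas, adding or dropping terms to lower the AIC.
best.model <- stepAIC(full.model, direction = "both", trace = FALSE)
formula(best.model)

# anova() compares the two nested fits.
anova(reduced.model, full.model)
```

For nested lm fits anova() reports an F test by default; test="Chisq" is the usual choice for glm objects.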

Generalized Linear Models
Linear regression models hinge on the assumption that the response variable, given the covariates, follows a Normal distribution.
Generalized linear models handle non-Normal response variables (e.g. binary outcomes or counts) by relating the mean of the response to a linear predictor through a link function.

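Logistic Regression
When faced with a binary response Y ∈ {0, 1}, we use logistic regression:
$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
where $p = P(Y = 1 \mid x_1, \dots, x_k)$.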

Problem 2 – Logistic Regression
Read in the anesthetic data set, data file: anaesthetic.txt.
Covariates:
move: binary numeric vector for patient movement (1 = movement, 0 = no movement)
conc: anesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.

Fit the Logistic Regression Model
> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)

Estimating Log Odds
To get back the log odds (the linear predictor):
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
The coefficient of conc is the log odds ratio for a one-unit increase in concentration.
Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
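The relationship between the linear predictor, the fitted probabilities and the odds ratio can be checked with a short sketch. The data below are simulated stand-ins with made-up parameter values, not the course file, so the estimates will differ from the summary shown earlier.

```r
# Simulated stand-in data (hypothetical values; not anaesthetic.txt).
set.seed(503)
conc <- rep(c(0.8, 1.0, 1.2, 1.4, 1.6, 2.5), each = 5)
nomove <- rbinom(length(conc), size = 1, prob = plogis(-6 + 5 * conc))
sim <- data.frame(conc, nomove)

fit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = sim)

log.odds <- fit$linear.predictors   # same as predict(fit, type = "link")
prob <- fitted.values(fit)          # estimated P(nomove = 1)

# The inverse logit (plogis) maps log odds back to probabilities.
all.equal(unname(plogis(log.odds)), unname(prob))

# Exponentiating the slope gives the odds ratio per unit increase in conc.
odds.ratio <- exp(coef(fit)[["conc"]])
odds.ratio
```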

Problem 3 – Multiple Logistic Regression
Read in the data set birthwt.txt.
We fit a logistic regression using the glm function with the binomial family.
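A possible version of this fit is sketched below. It assumes that birthwt.txt matches the birthwt data frame shipped with the MASS package, and the covariates chosen here are an illustration rather than the model used in the tutorial.

```r
library(MASS)  # ships the birthwt data frame (assumed to match birthwt.txt)

bw <- birthwt
# In MASS::birthwt, race is coded 1 = white, 2 = black, 3 = other.
bw$race <- factor(bw$race, labels = c("white", "black", "other"))

# low = 1 if birth weight is below 2.5 kg; covariate choice is illustrative.
bw.logit <- glm(low ~ age + lwt + race + smoke, family = binomial, data = bw)
summary(bw.logit)

# Odds ratios with approximate (Wald) 95% confidence intervals.
est <- coef(summary(bw.logit))
or.table <- exp(cbind(
  OR    = est[, "Estimate"],
  lower = est[, "Estimate"] - 1.96 * est[, "Std. Error"],
  upper = est[, "Estimate"] + 1.96 * est[, "Std. Error"]
))
round(or.table, 2)
```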

Problem 4 – Poisson Regression
Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.
Example: schooldata.csv.
We can fit the Poisson regression model using the glm function and the poisson family.
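The contents of schooldata.csv are not shown here, so the sketch below simulates a small data set with hypothetical column names (grade, enrolled.days, absences) to show the shape of a Poisson fit, including an offset for modelling rates.

```r
# Simulated data; column names and rates are hypothetical, not from schooldata.csv.
set.seed(503)
n <- 200
school <- data.frame(
  grade         = factor(sample(c("low", "high"), n, replace = TRUE)),
  enrolled.days = sample(150:200, n, replace = TRUE)
)
true.rate <- ifelse(school$grade == "high", 0.03, 0.05)  # absences per day
school$absences <- rpois(n, lambda = true.rate * school$enrolled.days)

# offset(log(exposure)) turns a model for counts into a model for rates.
pois.fit <- glm(absences ~ grade + offset(log(enrolled.days)),
                family = poisson, data = school)
summary(pois.fit)

# exp(coefficients): baseline rate ("high" is the reference level) and rate ratio.
rate.ratio <- exp(coef(pois.fit))
rate.ratio
```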

Survival Analysis
library(survival)
Example: aml leukemia data
Kaplan-Meier curve (first 11 patients):
fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11,])
summary(fit1)
plot(fit1)
Log-rank test:
survdiff(Surv(time, status)~x, data=aml)

Survival Analysis
Fit a Cox proportional hazards model:
coxfit1 <- coxph(Surv(time, status)~x, data=aml)
summary(coxfit1)
Cumulative baseline hazard estimator:
basehaz(coxfit1)
Survival function for one group (x is a factor with levels "Maintained" and "Nonmaintained"):
plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))
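The survival steps can be run end to end as follows. This is a sketch against the aml data that ships with the survival package; the object names are chosen for illustration.

```r
library(survival)

# Kaplan-Meier estimate for the maintained group only.
maintained <- subset(aml, x == "Maintained")
km.fit <- survfit(Surv(time, status) ~ 1, data = maintained)
summary(km.fit)

# Kaplan-Meier curves by group, then a log-rank test comparing them.
km.by.group <- survfit(Surv(time, status) ~ x, data = aml)
plot(km.by.group, lty = 1:2, xlab = "Time", ylab = "Survival probability")
logrank <- survdiff(Surv(time, status) ~ x, data = aml)
logrank

# Cox proportional hazards model; exp(coef) is the hazard ratio.
cox.fit <- coxph(Surv(time, status) ~ x, data = aml)
hazard.ratio <- exp(coef(cox.fit))
hazard.ratio

# Predicted survival curve for one group; newdata must use the factor's levels.
new.patient <- data.frame(x = factor("Maintained", levels = levels(aml$x)))
pred <- survfit(cox.fit, newdata = new.patient)
```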

Tutorial 5

Cluster Analysis
Hierarchical Methods: (Agglomerative, Divisive) + (Single, Average, Complete) Linkage…
Model-based Methods: Mixed models. Plaid models. Mixture models…
A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.
Clustering observations on the basis of experiments or across a time series.
Clustering experiments together on the basis of observations.

Examples of Clustering Algorithms Available in R
Hierarchical Methods: hclust, agnes
Partitioning Methods: som, kmeans, pam
Packages: cluster
[Figure labels: Different Samples, Observations]

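Hierarchical Clustering
Agglomerative: start with n genes in n clusters and join nodes until all n genes are in 1 cluster.
Divisive: start with n genes in 1 cluster and break nodes until there are n clusters.
We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’, measured for example by:
Euclidean distance
(Pearson) correlation
Source: J-Express Manual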

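Different Ways to Determine Distances Between Clusters
Single linkage: the smallest distance between a member of one cluster and a member of the other.
Complete linkage: the largest such distance.
Average linkage: the average of all such pairwise distances.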
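A short sketch comparing the three linkage rules with hclust. The built-in USArrests data and the choice of four clusters are assumptions made for illustration.

```r
# Illustrative data: USArrests (built in), standardized so no variable dominates.
x <- scale(USArrests)

# Euclidean distances between rows; a correlation-based alternative
# would be as.dist(1 - cor(t(x))).
d <- dist(x, method = "euclidean")

# Agglomerative clustering under the three linkage rules.
hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")

# Dendrogram for one of them.
plot(hc.average, cex = 0.6, main = "Average linkage")

# Cut the tree into a chosen number of clusters (4 is arbitrary here).
groups <- cutree(hc.average, k = 4)
table(groups)
```

Plotting the three trees side by side tends to show single linkage producing long chains and complete linkage producing more compact groups.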

Partitioning Methods
Examples of partitioning methods are k-means and partitioning around medoids (pam).
Gap statistic:
source("http://www.bioconductor.org/biocLite.R")
biocLite("SAGx")
?gap
The goal is to choose the number of clusters that maximizes the gap statistic.

K-means Clustering
W – within variance
B – between variance
Reference: J-Express manual
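The within and between sums of squares, and a gap statistic for choosing the number of clusters, can be sketched as below. This uses clusGap from the cluster package rather than the SAGx function mentioned in the slides, and the iris measurements stand in for real data; both are assumptions.

```r
library(cluster)  # pam, clusGap, maxSE

set.seed(503)
x <- scale(iris[, 1:4])  # illustrative data, standardized

# k-means with several random starts; W and B are reported on the fit.
km <- kmeans(x, centers = 3, nstart = 25)
W <- km$tot.withinss   # total within-cluster sum of squares
B <- km$betweenss      # between-cluster sum of squares

# Gap statistic for k = 1..6 (B = 50 reference data sets keeps this quick).
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)
gap$Tab

# One common rule for picking k from the gap curve and its standard errors.
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k.hat

# Partitioning around medoids with the same number of clusters as above.
pm <- pam(x, k = 3)
table(kmeans = km$cluster, pam = pm$clustering)
```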

241 genes from 19 cell samples into 6 clusters.

Classification (Machine Learning)
Machine learning algorithms predict the classes of new objects based on patterns discerned from existing data.
Classification algorithms are a form of supervised learning.
Clustering algorithms are a form of unsupervised learning.
R packages:
class – contains knn, SOM
nnet
MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.
Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).

Classification
Linear Discriminant Analysis: lda (MASS package)
Support Vector Machines: library(e1071), svm
K-nearest neighbors: knn (class package)
Tree-based methods: rpart, randomForest
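Two of these classifiers can be tried with packages that ship with R. The sketch below uses the iris data and a random 100/50 train/holdout split, both chosen purely for illustration.

```r
library(MASS)   # lda
library(class)  # knn

set.seed(503)
train.idx <- sample(nrow(iris), 100)
train   <- iris[train.idx, ]
holdout <- iris[-train.idx, ]

# Linear discriminant analysis: learn the rule on train, apply it to holdout.
lda.fit  <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = holdout)$class
lda.acc  <- mean(lda.pred == holdout$Species)

# k-nearest neighbours (k = 5 is an arbitrary choice).
knn.pred <- knn(train = train[, 1:4], test = holdout[, 1:4],
                cl = train$Species, k = 5)
knn.acc  <- mean(knn.pred == holdout$Species)

# Confusion matrix for the LDA predictions.
table(predicted = lda.pred, actual = holdout$Species)
c(lda = lda.acc, knn = knn.acc)
```

Accuracy is measured on data the classifier has not seen, which is what separates this from clustering: the class labels are known for the training set.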

Scaling Methods
Principal Component Analysis: prcomp
Multi-dimensional Scaling: MDS
Self Organizing Maps: SOM
Independent Component Analysis: fastICA
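A brief sketch of the first two methods using base R only. It assumes cmdscale (classical MDS in the stats package) as the MDS routine and uses the built-in USArrests data. With Euclidean distances, classical MDS recovers the principal component scores up to the sign of each axis, which the last line checks.

```r
# Illustrative data: USArrests (built in), standardized.
x <- scale(USArrests)

# Principal component analysis.
pca <- prcomp(x)
summary(pca)             # proportion of variance explained per component
scores <- pca$x[, 1:2]   # coordinates on the first two components

# Classical multi-dimensional scaling on Euclidean distances.
mds <- cmdscale(dist(x), k = 2)

# Largest absolute difference between the two configurations, ignoring sign.
max.diff <- max(abs(abs(scores) - abs(mds)))
max.diff
```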

R Shortcuts
Ctrl + A:
Ctrl + E:
Ctrl + K
Esc
{Up, Down} Arrow

Laundry List
.Rprofile file
Outline of R packages
Graphics – lattice, Rwiki
Homework
R/SAS/Stata Comparison Exercises
