
BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics


1 BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics jmar@hsph.harvard.edu

2 Roadmap for Today
Some More Advanced Statistical Models
- Multiple Linear Regression
- Generalized linear models
  – Logistic Regression
  – Poisson Regression
  – Survival Analysis
Multivariate Data Analysis
Programming Tutorials
Bits & Pieces

3 Tutorial 4

4 Multiple Linear Regression
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions:
drop1, add1, step (stats package, loaded by default)
dropterm, addterm, stepAIC (MASS package)
Similarly, anova(modObj, test="Chisq") tests the terms of a fitted model sequentially.
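
A minimal sketch of how these functions fit together. The built-in mtcars data and the choice of predictors are assumptions made purely for illustration; they are not part of the lecture material.

```r
library(MASS)  # provides dropterm, addterm, stepAIC

# Illustrative model on a built-in data set (assumption, not from the lecture)
full.model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update(): refit with a modified formula ("." means "keep what was there")
reduced.model <- update(full.model, . ~ . - qsec)

# Single-term deletions: stats::drop1 and its MASS counterpart dropterm
drop1(full.model, test = "F")
dropterm(full.model, test = "F")

# Compare the two nested fits directly
anova(reduced.model, full.model)

# Stepwise selection by AIC, starting from the full model
selected.model <- stepAIC(full.model, direction = "both", trace = FALSE)
formula(selected.model)
```

drop1 and dropterm report the same kind of table; stepAIC automates the repeated add/drop cycle.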

5 Generalized Linear Models Linear regression models hinge on the assumption that the response variable, given the covariates, follows a Normal distribution. Generalized linear models handle non-Normal response variables (e.g. binary outcomes or counts) by using a link function to relate the mean of the response to a linear predictor.

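6 Logistic Regression When faced with a binary response Y = (0,1), we use logistic regression:
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
where $p = P(Y = 1 \mid x_1, \dots, x_k)$.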
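
The logit link used by logistic regression maps a probability to its log odds, and the inverse logit maps back. A small sketch using base R's qlogis and plogis; the probabilities are arbitrary illustrative values.

```r
# Arbitrary probabilities chosen for illustration
p <- c(0.1, 0.5, 0.9)

# logit: probability -> log odds, identical to log(p / (1 - p))
log.odds <- qlogis(p)

# inverse logit: log odds -> probability, identical to 1 / (1 + exp(-log.odds))
back <- plogis(log.odds)

cbind(p, log.odds, back)
```

A probability of 0.5 corresponds to log odds of 0, which is why the sign of the linear predictor tells you which outcome is more likely.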

7 Problem 2 – Logistic Regression
Read in the anesthetic data set, data file: anaesthetic.txt.
Covariates:
move: binary numeric vector for patient movement (1 = movement, 0 = no movement)
conc: anesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.

8 Fit the Logistic Regression Model
> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)
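
The fitted probabilities can also be reproduced by hand from the two coefficient estimates printed in the summary. The sketch below assumes Y = 1 means "no movement" (the model's response is nomove), and the concentration grid is a hypothetical set of values, not the data file itself.

```r
# Coefficient estimates copied from the summary output
b0 <- -6.469
b1 <- 5.567

# Hypothetical concentrations for illustration (assumption)
conc <- c(0.8, 1.0, 1.2, 1.4, 1.6, 2.5)

# Linear predictor = log odds of no movement
log.odds <- b0 + b1 * conc

# Inverse logit gives the estimated P(no movement)
p.nomove <- plogis(log.odds)
round(cbind(conc, log.odds, p.nomove), 3)

# Concentration at which P(no movement) = 0.5 (log odds = 0)
conc50 <- -b0 / b1

# Odds ratio for a 0.1 unit increase in concentration
or.per.0.1 <- exp(b1 * 0.1)
```

Because the slope is positive, the estimated probability of no movement rises monotonically with concentration.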

9 Estimating Log Odds
To get back the log odds (the linear predictor; the coefficient of conc is the log odds ratio per unit increase in concentration):
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.

10 Problem 3 – Multiple Logistic Regression Read in data set birthwt.txt. We fit a logistic regression using the glm function and using the binomial family.
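
A minimal sketch of such a fit. It assumes that birthwt.txt matches the birthwt data frame shipped with the MASS package, and the choice of covariates (age, lwt, smoke) is illustrative rather than the lecture's own model.

```r
library(MASS)  # ships the birthwt data frame (assumed to match birthwt.txt)

# low = 1 if birth weight is below 2.5 kg; covariates chosen for illustration
bw.logit <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)
summary(bw.logit)

# Odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(bw.logit), confint.default(bw.logit)))

# Likelihood ratio test: does smoking status improve the fit?
bw.nosmoke <- update(bw.logit, . ~ . - smoke)
anova(bw.nosmoke, bw.logit, test = "Chisq")
```

Exponentiating the coefficients turns log odds ratios into odds ratios, which are usually easier to report.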

11 Problem 4 - Poisson Regression Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease. Example: schooldata.csv. We can fit the Poisson regression model using the glm function and the poisson family.
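
Since the columns of schooldata.csv are not shown, the sketch below uses simulated counts instead. All variable names (events, exposure, group) are hypothetical; the offset term is what turns a model for counts into a model for rates.

```r
set.seed(503)

# Simulated stand-in for a count data set (hypothetical columns)
n <- 200
dat <- data.frame(
  exposure = runif(n, 50, 500),  # e.g. person-years at risk
  group    = factor(sample(c("A", "B"), n, replace = TRUE))
)
true.rate  <- ifelse(dat$group == "B", 0.04, 0.02)
dat$events <- rpois(n, lambda = true.rate * dat$exposure)

# The log link is the default for the poisson family;
# offset(log(exposure)) models events per unit of exposure
pois.fit <- glm(events ~ group + offset(log(exposure)),
                family = poisson, data = dat)
summary(pois.fit)

# Estimated rate ratio, group B versus group A (simulated truth is 2)
rate.ratio <- exp(coef(pois.fit)["groupB"])
rate.ratio
```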

12 Survival Analysis
library(survival)
Example: aml leukemia data
Kaplan-Meier curve (current versions of survfit require a formula, hence the "~ 1"):
fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ])
summary(fit1)
plot(fit1)
Log-rank test:
survdiff(Surv(time, status)~x, data=aml)

13 Survival Analysis
Fit a Cox proportional hazards model:
coxfit1 <- coxph(Surv(time, status)~x, data=aml)
summary(coxfit1)
Cumulative baseline hazard estimator:
basehaz(coxph(Surv(time, status)~x, data=aml))
Survival function for one group (x is a factor, so newdata must use one of its levels):
plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))
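
A short sketch of what one might do next with the fitted Cox model: read off the hazard ratio, check the proportional hazards assumption, and draw the fitted curves for both groups. The plotting details (line types, labels) are arbitrary choices.

```r
library(survival)

coxfit1 <- coxph(Surv(time, status) ~ x, data = aml)

# Hazard ratio (exp(coef)) with its 95% confidence interval
summary(coxfit1)$conf.int

# Test of the proportional hazards assumption
cox.zph(coxfit1)

# Fitted survival curves, one per level of the factor x
groups <- data.frame(x = factor(levels(aml$x), levels = levels(aml$x)))
plot(survfit(coxfit1, newdata = groups), lty = 1:2,
     xlab = "Time", ylab = "Estimated survival probability")
legend("topright", legend = levels(aml$x), lty = 1:2)
```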

14 Tutorial 5

15 Cluster Analysis
Hierarchical Methods: (Agglomerative, Divisive) + (Single, Average, Complete) Linkage…
Model-based Methods: Mixed models. Plaid models. Mixture models…
A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.
Two common settings:
- Clustering observations on the basis of experiments or across a time series.
- Clustering experiments together on the basis of observations.

16 Examples of Clustering Algorithms Available in R
Hierarchical Methods: hclust, agnes
Partitioning Methods: som, kmeans, pam
Packages: cluster
[Figure: data matrix with samples along one dimension and observations along the other]

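17 Hierarchical Clustering
Agglomerative: start with n genes in n clusters and join until all n genes are in 1 cluster.
Divisive: start with n genes in 1 cluster and break until there are n clusters.
We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.
Similarity measures: Euclidean distance, (Pearson) correlation
Source: J-Express Manual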

18 Different Ways to Determine Distances Between Clusters: Single linkage, Complete linkage, Average linkage
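
The linkage and distance choices above can be compared directly with hclust. This sketch uses the built-in USArrests data as an assumed stand-in for an expression matrix; cutting the tree at k = 4 is an arbitrary choice.

```r
# Standardize so that no single variable dominates the Euclidean distance
x <- scale(USArrests)

# Euclidean distance between rows (observations)
d.euclid <- dist(x, method = "euclidean")

# Same distances, three different linkage rules
hc.single   <- hclust(d.euclid, method = "single")
hc.complete <- hclust(d.euclid, method = "complete")
hc.average  <- hclust(d.euclid, method = "average")

# Correlation-based dissimilarity between rows: 1 - Pearson correlation
d.cor  <- as.dist(1 - cor(t(x)))
hc.cor <- hclust(d.cor, method = "average")

# Side-by-side dendrograms for the three linkage rules
par(mfrow = c(1, 3))
plot(hc.single,   main = "Single linkage",   labels = FALSE)
plot(hc.complete, main = "Complete linkage", labels = FALSE)
plot(hc.average,  main = "Average linkage",  labels = FALSE)

# Cut the complete-linkage tree into 4 clusters
groups <- cutree(hc.complete, k = 4)
table(groups)
```

Single linkage tends to produce long chained clusters, while complete linkage favors compact ones, so it is worth plotting more than one.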

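19 Partitioning Methods
Examples of partitioning methods are k-means and partitioning around medoids (pam).
Gap statistic:
source("http://www.bioconductor.org/biocLite.R")
biocLite("SAGx")
?gap
The gap statistic is used to choose the number of clusters k. The goal is a large gap, not a small one: the usual rule picks the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1), where s(k+1) is the simulation standard error.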
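
As an alternative to the SAGx route, the cluster package also provides a gap statistic through clusGap. This is a swapped-in implementation, not the one named on the slide, and the three-cluster data set is simulated for illustration.

```r
library(cluster)  # clusGap, maxSE
set.seed(503)

# Simulated data: three groups of 50 points in two dimensions (assumption)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# Gap statistic for k = 1..6 using k-means; B reference data sets
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 20)
print(gap$Tab)

# Selection rule of Tibshirani et al. (2001):
# smallest k with Gap(k) >= Gap(k+1) - SE(k+1)
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k.hat

plot(gap, main = "Gap statistic by number of clusters")
```

The selected k depends on the simulated reference sets, so rerunning with a different seed or a larger B is a reasonable sanity check.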

20 K-means Clustering
W – within variance
B – between variance
Reference: J-Express manual
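
In R, kmeans reports W and B as sums of squares rather than variances. A small sketch on the built-in iris measurements (an assumed stand-in for expression data) showing how the two add up to the total and how W shrinks as k grows.

```r
set.seed(503)

# Standardized numeric columns of a built-in data set (assumption)
x <- scale(iris[, 1:4])

km <- kmeans(x, centers = 3, nstart = 25)

W <- km$tot.withinss  # within-cluster sum of squares
B <- km$betweenss     # between-cluster sum of squares

# Total sum of squares decomposes as W + B
all.equal(km$totss, W + B)

# W always decreases as k increases, so it cannot pick k on its own
W.by.k <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, W.by.k, type = "b",
     xlab = "Number of clusters k", ylab = "Within-cluster sum of squares")
```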

21 Example: 241 genes from 19 cell samples grouped into 6 clusters.

22 Classification (Machine Learning)
Machine learning algorithms predict new classes based on patterns discerned from existing data. Classification algorithms are a form of supervised learning. Clustering algorithms are a form of unsupervised learning.
R packages:
class: contains knn, SOM
nnet
MLInterfaces (Bioconductor): a simplified way to construct machine learning algorithms from microarray data.
Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).

23 Classification
Linear Discriminant Analysis: lda (MASS package)
Support Vector Machines: library(e1071), svm
K-nearest neighbors: knn (class package)
Tree-based methods: rpart, randomForest
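
A minimal sketch of the supervised workflow with two of the listed classifiers, lda and knn. The iris data, the 100/50 train/test split, and k = 5 are all illustrative assumptions.

```r
library(MASS)   # lda
library(class)  # knn
set.seed(503)

# Random split into training and test sets (sizes chosen for illustration)
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

# Linear discriminant analysis: fit on training data, predict test classes
lda.fit  <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)$class

# k-nearest neighbors: no separate fitting step, classifies test rows directly
knn.pred <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)

# Confusion matrix and test-set accuracy
table(predicted = lda.pred, actual = test$Species)
lda.acc <- mean(lda.pred == test$Species)
knn.acc <- mean(knn.pred == test$Species)
c(lda = lda.acc, knn = knn.acc)
```

Accuracy is measured on held-out rows because error on the training data understates how the rule performs on new objects.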

24 Scaling Methods
Principal Component Analysis: prcomp
Multi-dimensional Scaling: MDS
Self Organizing Maps: SOM
Independent Component Analysis: fastICA
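
A short sketch of the first two methods using base R. It uses cmdscale (classical MDS in the stats package) and the built-in USArrests data, both of which are assumptions for illustration.

```r
# Standardized built-in data (assumption)
x <- scale(USArrests)

# Principal component analysis
pca <- prcomp(x)
summary(pca)          # proportion of variance explained per component
head(pca$x[, 1:2])    # scores on the first two components

# Classical multi-dimensional scaling on Euclidean distances
mds <- cmdscale(dist(x), k = 2)

# With Euclidean distances, classical MDS recovers the PCA scores
# up to the sign of each axis
same.config <- all.equal(abs(unname(mds)), abs(unname(pca$x[, 1:2])),
                         tolerance = 1e-6)
same.config

plot(mds, xlab = "Coordinate 1", ylab = "Coordinate 2", type = "n")
text(mds, labels = rownames(x), cex = 0.6)
```

MDS becomes distinct from PCA once a non-Euclidean dissimilarity, such as 1 minus correlation, is passed in.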

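25 R Shortcuts
Ctrl + A: move to the start of the line
Ctrl + E: move to the end of the line
Ctrl + K: delete from the cursor to the end of the line
Esc: interrupt the current command (R GUI)
{Up, Down} Arrow: step through the command history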

26 Laundry List
.Rprofile file
Outline of R packages
Graphics – lattice, Rwiki
Homework
R/SAS/Stata Comparison
Exercises

