BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics


Roadmap for Today
Some More Advanced Statistical Models:
- Multiple Linear Regression
- Generalized Linear Models: Logistic Regression, Poisson Regression, Survival Analysis
Multivariate Data Analysis
Programming Tutorials
Bits & Pieces

Tutorial 4

Multiple Linear Regression
Some handy functions to know about:
> new.model <- update(old.model, new.formula)
Model selection functions available in the MASS package:
  drop1, dropterm
  add1, addterm
  step, stepAIC
Similarly: anova(modObj, test="Chisq")
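As a quick illustration (not from the lecture notes), here is a minimal sketch of these model-selection helpers using R's built-in mtcars data; the formula and variables are chosen purely for demonstration:

```r
library(MASS)  # provides dropterm, addterm, stepAIC

# Fit a full model on built-in data (illustrative only)
full.model <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

# Drop each term in turn and compare fits
dropterm(full.model, test = "F")

# Automatic stepwise search guided by AIC
best.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(best.model)

# update() refits a model with a modified formula:
# ". ~ . - disp" means "same response, same terms, minus disp"
smaller.model <- update(full.model, . ~ . - disp)
```

The `. ~ .` shorthand in update() is what makes it convenient for comparing nested models by hand.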

Generalized Linear Models
Linear regression models hinge on the assumption that the response variable follows a Normal distribution. Generalized linear models can handle non-Normal response variables, using a link function to transform the mean response to the linear scale.

Logistic Regression
When faced with a binary response Y ∈ {0, 1}, we use logistic regression:
  logit(p) = log(p / (1 - p)) = b0 + b1*x
where p = P(Y = 1).

Problem 2 – Logistic Regression
Read in the anaesthetic data set, data file: anaesthetic.txt.
Covariates:
  move: binary numeric vector for patient movement (1 = movement, 0 = no movement)
  conc: anaesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anaesthetic agent.

Fit the Logistic Regression Model
> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      ...         ...      ...       ... **
conc             ...         ...      ...       ... **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)

Estimating the Log Odds
To get back the log odds (the linear predictor):
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
It looks like the odds of not moving increase significantly when you increase the concentration of the anaesthetic agent beyond 0.8.

Problem 3 – Multiple Logistic Regression
Read in the data set birthwt.txt. We fit a logistic regression using the glm function with the binomial family.

Problem 4 - Poisson Regression
Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease. Example: schooldata.csv. We can fit the Poisson regression model using the glm function and the poisson family.
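Since schooldata.csv is not reproduced here, the following is a hedged sketch of a Poisson fit on simulated count data; the variable names and the exposure offset are illustrative assumptions, not part of the original exercise:

```r
# Simulate count data with a known rate structure (illustrative only)
set.seed(1)
n        <- 100
exposure <- runif(n, 1, 10)                      # e.g. person-years at risk
x        <- rnorm(n)                             # a covariate
counts   <- rpois(n, lambda = exposure * exp(0.3 * x))

# Poisson regression with a log(exposure) offset turns counts into rates
pois.fit <- glm(counts ~ x, family = poisson, offset = log(exposure))
summary(pois.fit)

# Exponentiated coefficients are interpretable as rate ratios
exp(coef(pois.fit))
```

The offset term is the standard way to model rates rather than raw counts when observation time or population size varies across units.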

Survival Analysis
> library(survival)
Example: aml leukemia data.
Kaplan-Meier curve:
> fit1 <- survfit(Surv(aml$time[1:11], aml$status[1:11]))
> summary(fit1)
> plot(fit1)
Log-rank test:
> survdiff(Surv(time, status) ~ x, data=aml)

Survival Analysis (continued)
Fit a Cox proportional hazards model:
> coxfit1 <- coxph(Surv(time, status) ~ x, data=aml)
> summary(coxfit1)
Cumulative baseline hazard estimator:
> basehaz(coxph(Surv(time, status) ~ x, data=aml))
Survival function for one group:
> plot(survfit(coxfit1, newdata=data.frame(x=1)))

Tutorial 5

Cluster Analysis
Hierarchical methods: (agglomerative or divisive) combined with (single, average, or complete) linkage...
Model-based methods: mixed models, plaid models, mixture models...
A clustering problem is generally much harder than a classification problem because we don't know the number of classes.
We can cluster observations on the basis of experiments or across a time series, or cluster experiments together on the basis of observations.

Examples of Clustering Algorithms Available in R
Hierarchical methods: hclust, agnes
Partitioning methods: som, kmeans, pam
Packages: cluster

Hierarchical Clustering
Divisive methods start with all n genes in 1 cluster and split; agglomerative methods start with n genes in n clusters and merge. We join (or break) nodes based on the notion of maximum (or minimum) 'similarity', e.g. Euclidean distance or (Pearson) correlation. (Source: J-Express manual)
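A minimal sketch of agglomerative clustering with hclust; the expression matrix here is simulated purely for illustration:

```r
# Simulated "expression" data: 50 genes x 6 samples (made up for illustration)
set.seed(1)
expr <- matrix(rnorm(50 * 6), nrow = 50)

# Distance between genes: Euclidean here;
# a correlation-based alternative would be as.dist(1 - cor(t(expr)))
d <- dist(expr, method = "euclidean")

# Agglomerative clustering with average linkage
hc <- hclust(d, method = "average")
plot(hc)                       # dendrogram

# Cut the tree at a chosen number of clusters
clusters <- cutree(hc, k = 4)
table(clusters)                # how many genes fall in each cluster
```

Swapping method = "single" or "complete" in hclust() gives the other linkage rules from the previous slide.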

Different Ways to Determine Distances Between Clusters
Single linkage, complete linkage, average linkage.

Partitioning Methods
Examples of partitioning methods are k-means and partitioning around medoids (pam).
Gap statistic (gap function in the SAGx Bioconductor package):
> source("http://bioconductor.org/biocLite.R")
> biocLite("SAGx")
> ?gap
The goal is to choose the number of clusters that maximizes the gap statistic.

K-means Clustering
W = within-cluster variance, B = between-cluster variance. For a fixed number of clusters K, k-means seeks to minimize W (equivalently, maximize B). (Reference: J-Express manual)
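A short kmeans sketch; the data are simulated with the same dimensions as the figure on the next slide (241 genes x 19 samples), so the numbers themselves are not the lecture's:

```r
# Simulated data: 241 genes x 19 samples (illustrative, not the real data)
set.seed(1)
expr <- matrix(rnorm(241 * 19), nrow = 241)

# Partition the genes into 6 clusters; nstart restarts guard against
# a poor random initialisation
km <- kmeans(expr, centers = 6, nstart = 25)

km$size          # number of genes in each cluster
km$tot.withinss  # total within-cluster sum of squares (W)
km$betweenss     # between-cluster sum of squares (B)
```

The returned tot.withinss and betweenss correspond directly to the W and B quantities above.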

(Figure: 241 genes from 19 cell samples clustered into 6 clusters.)

Classification (Machine Learning)
Machine learning algorithms predict classes for new observations based on patterns discerned from existing data. Classification algorithms are a form of supervised learning; clustering algorithms are a form of unsupervised learning.
R packages:
  class - contains knn, SOM
  nnet
  MLInterfaces (Bioconductor) - a simplified way to construct machine learning algorithms from microarray data.
Goal: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs. non-aggressive prostate cancer).

Classification
Linear discriminant analysis: lda
Support vector machines: svm (in library(e1071))
K-nearest neighbours: knn
Tree-based methods: rpart, randomForest
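For illustration, a small knn example on R's built-in iris data rather than the microarray setting of the lecture; the train/test split is an arbitrary choice:

```r
library(class)  # provides knn

# Split iris into a training set of 100 rows and a test set of 50
set.seed(1)
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, 1:4]    # the four numeric measurements
test  <- iris[-train.idx, 1:4]

# Classify each test flower by its 3 nearest training neighbours
pred <- knn(train, test, cl = iris$Species[train.idx], k = 3)

# Confusion matrix: predicted vs. true species
table(pred, iris$Species[-train.idx])
```

The same pattern (training matrix, test matrix, class labels) carries over to expression data, with genes as columns and samples as rows.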

Scaling Methods
Principal component analysis: prcomp
Multi-dimensional scaling: cmdscale (classical MDS)
Self-organizing maps: SOM
Independent component analysis: fastICA
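A brief prcomp and cmdscale sketch on the built-in USArrests data, for illustration only:

```r
# PCA: scaling the variables first is usually advisable when
# they are measured on different scales
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)             # proportion of variance explained per component
plot(pca$x[, 1:2])       # observations on the first two principal components

# Classical multi-dimensional scaling on a distance matrix
# gives a comparable 2-D map of the observations
mds <- cmdscale(dist(scale(USArrests)), k = 2)
plot(mds)
```

For Euclidean distances on the same scaled data, classical MDS and PCA recover essentially the same configuration.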

R Shortcuts
Ctrl + A: move the cursor to the beginning of the line
Ctrl + E: move the cursor to the end of the line
Ctrl + K: delete from the cursor to the end of the line
Esc: interrupt the current command
{Up, Down} arrows: scroll through the command history

Laundry List
.Rprofile file
Outline of R packages
Graphics: lattice, R wiki
Homework
R/SAS/Stata comparison
Exercises