Statistical Programming Using the R Language Lecture 5 Introducing Multivariate Data Analysis Darren J. Fitzpatrick, Ph.D April 2016.


Trinity College Dublin, The University of Dublin
Solutions I
1
install.packages('pwr')
library(pwr)
pwr.anova.test(k = 4, f = 0.5, sig.level = 0.05, power = 0.8)
Balanced one-way analysis of variance power calculation
k = 4
n =
f = 0.5
sig.level = 0.05
power = 0.8
NOTE: n is number in each group

Solutions II
2.2
anova(lm(Guanylin ~ TNP, data=df))
Analysis of Variance Table
Response: Guanylin
Df Sum Sq Mean Sq F value Pr(>F)
TNP e-09 ***
Residuals

Solutions III
2.2
pairwise.t.test(df$Guanylin, df$TNP, p.adjust.method='BH')
Pairwise comparisons using t tests with pooled SD
data: df$Guanylin and df$TNP
    N_M N_W T_M
N_W
T_M 3.5e
T_W 8.9e
P value adjustment method: BH

Solutions III
2.4
boxplot(df$Guanylin ~ df$TNP)

What is multivariate data?
Multivariate data are data in which several measures/variables are recorded for each sample. They are often called multidimensional data.
[Table: expression values for genes X53416, M83670, X90908, M97496, ... (columns) in the Normal_ samples (rows)]

Clustering
Clustering is a means of grouping data such that items in the same cluster are more similar to each other than to items in another cluster. It is a form of unsupervised learning, i.e., the data carry no category information. Using only the relationships between the data points, clustering, irrespective of the method, attempts to organise the data into groups. It is up to the researcher to decide whether the clusters have any biological or other meaning by doing downstream analysis of the clusters, e.g., GO term enrichment, pathway analysis, etc.

Hierarchical Clustering I
The Algorithm
For a set of N samples to be clustered and an N x N distance matrix:
1. Assign each item to its own cluster, so that you have N clusters each containing a single item.
2. Using the distance matrix, merge the two most similar clusters. After the first merge you have N-1 clusters, one of which contains two samples.
3. Compute the distance between the new cluster and each of the remaining clusters.
4. Repeat steps 2 and 3 until all items are merged into a single cluster of size N.
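The four steps above can be sketched directly in base R. This is only an illustration under stated assumptions: the function name naive_agglomerate and the simulated matrix toy are hypothetical and not part of the lecture, and average linkage (the mean of all pairwise distances between two clusters) is assumed as the rule for step 3. In practice hclust() does the same job far more efficiently.

```r
# Naive agglomerative clustering with average linkage (illustrative sketch).
# Returns the distance at which each successive merge happens.
naive_agglomerate <- function(d) {
  d <- as.matrix(d)
  clusters <- as.list(seq_len(nrow(d)))  # step 1: one cluster per item
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(1, 2)
    best_dist <- Inf
    # steps 2-3: find the closest pair of clusters, where the distance
    # between two clusters is the mean of all pairwise item distances
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        dij <- mean(d[clusters[[i]], clusters[[j]]])
        if (dij < best_dist) {
          best_dist <- dij
          best <- c(i, j)
        }
      }
    }
    # merge the closest pair and record the merge height
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best_dist)     # step 4: repeat until one cluster remains
  }
  heights
}

set.seed(1)
toy <- matrix(rnorm(30), nrow = 6)       # 6 made-up samples, 5 made-up variables
toy_dist <- dist(toy)
merge_heights <- naive_agglomerate(toy_dist)
reference_heights <- hclust(toy_dist, method = "average")$height
```

For N samples there are always N-1 merges, and on this toy data the recorded heights should agree with those reported by hclust() for average linkage.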

Hierarchical Clustering II
df <- read.table('colon_cancer_data_set.txt', header=TRUE)
unaffected <- df[which(df$Status=='U'), 1:7464]
for_cluster <- unaffected[, 1:5]  # Example for 5 genes
[Table: for_cluster, expression values of the first five genes (X53416, M83670, X90908, M97496, ...) for the Normal_ samples]

Hierarchical Clustering III
To perform clustering, we first need to compute a distance matrix.
dmat <- dist(for_cluster, method='euclidean')
[Figure: the original data (Normal_ samples by genes) alongside the resulting distance matrix (dmat), in which samples such as Normal_27, Normal_29 and Normal_34 label both rows and columns]

Hierarchical Clustering IV
A Note on Distance Measures
Distance metrics are a way of summarising the similarity between multiple observations. There are numerous formulae for computing such distances, but the most commonly used is the Euclidean distance. For the other methods, look up the help for the dist() function.
Euclidean distance for 2-dimensional data (x, y) is just the straight-line distance between two points in the plane.

Hierarchical Clustering V
A Note on Distance Measures
Euclidean distance in multivariate data is the generalised form of the 2D example. Distance metrics produce a single measure of similarity between samples based on multiple measurements. They produce a symmetrical (N x N) distance matrix.
[Figure: the original data (Normal_ samples by genes) and the corresponding distance matrix, with samples such as Normal_27, Normal_29 and Normal_34 as rows and columns]
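As a small worked example of the note on distance measures, the Euclidean distance can be computed by hand and compared with dist(). The points p and q and the simulated matrix expr (samples S1 to S3, genes gene1 to gene5) are made-up stand-ins, not the colon cancer data.

```r
# 2D case: straight-line distance between two points (a 3-4-5 triangle)
p <- c(1, 2)
q <- c(4, 6)
d2 <- sqrt(sum((p - q)^2))

# Multivariate case: the same formula, summed over all measured variables
set.seed(2)
expr <- matrix(rnorm(15), nrow = 3,
               dimnames = list(paste0("S", 1:3), paste0("gene", 1:5)))
by_hand <- sqrt(sum((expr["S1", ] - expr["S2", ])^2))

# dist() applies this to every pair of rows; as.matrix() gives the full N x N form
dm <- as.matrix(dist(expr, method = "euclidean"))
```

Here dm["S1", "S2"] should equal by_hand, the matrix is symmetric, and its diagonal is zero because every sample is at distance zero from itself.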

Hierarchical Clustering VI
Next, we cluster the data.
dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)
Normal_4 appears to be an outlier. At the very least, he/she is different.

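Hierarchical Clustering VII
A Note on Linkage
In determining clusters, linkage is a measure of one cluster's similarity to another. Single linkage uses the smallest distance between members of the two clusters, complete linkage the largest, and average linkage the mean of all pairwise distances.
hclust(dmat, method='average')  # method is one of 'average', 'single', 'complete', among others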
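To make the linkage options concrete, the following sketch computes the distance between two clusters by hand under single, complete and average linkage. The four one-dimensional points (a, b, c, d) are invented for illustration and chosen so that no two pairwise distances tie.

```r
# Four made-up points on a line; clusters {a, b} and {c, d}
pts <- c(a = 0, b = 1, c = 5, d = 7)
d <- as.matrix(dist(pts))

# All pairwise distances between members of the two clusters
between <- d[c("a", "b"), c("c", "d")]

single_link   <- min(between)   # closest pair: b to c
complete_link <- max(between)   # farthest pair: a to d
average_link  <- mean(between)  # mean over all four pairs

# The final merge height in hclust() corresponds to these values
top_single   <- max(hclust(dist(pts), method = "single")$height)
top_complete <- max(hclust(dist(pts), method = "complete")$height)
top_average  <- max(hclust(dist(pts), method = "average")$height)
```

The same two clusters are therefore 4, 7 or 5.5 units apart depending on the linkage, which is why the choice of method can change the shape of the dendrogram.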

Hierarchical Clustering VIII
In hierarchical clustering, you have to determine what constitutes a cluster yourself. R has functions to help extract clusters.
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

Hierarchical Clustering IX
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')
The cutree() function returns the cluster membership of each sample.
cluster_1 <- names(which(clusters==1))
cluster_2 <- names(which(clusters==2))
Note: Clusters are labeled numerically, in this case 1 and 2. cutree() numbers clusters in the order in which their members first appear in the data, so the first sample always belongs to cluster 1; in this case that coincides with order of size (largest to smallest).
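A minimal sketch of extracting clusters with cutree(), using made-up values (the named vector x with samples s1 to s4 is hypothetical). It is arranged so that the first sample is the odd one out, which shows that cluster labels follow the order of the samples in the data rather than cluster size.

```r
# s1 is far from the other three samples
x <- c(s1 = 100, s2 = 0, s3 = 1, s4 = 3)
tree <- hclust(dist(x), method = "average")

groups  <- cutree(tree, k = 2)           # named vector: sample -> cluster label
sizes   <- table(groups)                 # number of samples per cluster
members <- split(names(groups), groups)  # sample names grouped by cluster
```

Here cluster 1 contains only s1 and cluster 2 contains the remaining three samples, so it is worth checking table(groups) rather than assuming that cluster 1 is the largest.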

Hierarchical Clustering X
In Five Lines of Code!
dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')
[Figure: the original data, the distance matrix and the resulting dendrogram with the two clusters outlined in red]

Hierarchical Clustering XI
Cluster genes using correlation as a distance measure.
1. Compute distance matrix.
tdmat <- as.dist(1 - cor(for_cluster, method='spearman'))  # 1 - r, so that highly correlated genes are close together
[Table: gene-by-gene matrix for X53416, M83670, X90908, M97496, ...]

Hierarchical Clustering XII
Cluster genes using correlation as a distance measure.
2. Run clustering algorithm.
cclust <- hclust(tdmat, method='average')
plot(cclust)
3. Look at clusters.
cclusters <- cutree(cclust, k=2)
rect.hclust(cclust, k=2, border='red')
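The three steps for clustering genes by correlation can be tried on simulated data. In this sketch the matrix genes (columns g1 to g4) is invented: g1 and g2 follow one underlying pattern and g3 and g4 another. The correlation is converted to a dissimilarity as 1 - r, which is an assumption of this example; other conversions are possible.

```r
set.seed(3)
pattern1 <- rnorm(20)
pattern2 <- rnorm(20)

# 20 made-up samples (rows) by 4 made-up genes (columns)
genes <- cbind(g1 = pattern1 + rnorm(20, sd = 0.1),
               g2 = pattern1 + rnorm(20, sd = 0.1),
               g3 = pattern2 + rnorm(20, sd = 0.1),
               g4 = pattern2 + rnorm(20, sd = 0.1))

# 1. Distance matrix: cor() works column-wise, so this is gene by gene
r <- cor(genes, method = "spearman")
gene_dist <- as.dist(1 - r)

# 2. Run the clustering algorithm
gene_tree <- hclust(gene_dist, method = "average")

# 3. Look at the clusters
gene_groups <- cutree(gene_tree, k = 2)
```

With this construction g1 and g2 should share one cluster and g3 and g4 the other, because genes following the same pattern have a correlation near 1 and hence a dissimilarity near 0.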

Heatmaps I
Dendrograms are a way of visualising relationships in multivariate data. Heatmaps can also be used to visualise multivariate data. Heatmaps and dendrograms can be combined to create informative visualisations.

Heatmaps II
Bioconductor is a repository of R packages for analysing biological data. We are going to use the Heatplus package from Bioconductor to make heatmaps.
To Install Heatplus
source(" biocLite("Heatplus")
library(Heatplus)
Since Bioconductor 3.8, biocLite() has been replaced by the BiocManager package: run install.packages("BiocManager") and then BiocManager::install("Heatplus").

Heatmaps III
Documentation and examples for Bioconductor packages are available on each package's homepage.

Heatmaps IV
The Heatplus package has a function called regHeatmap() to make heatmaps. This function enables us to cluster genes and samples using any distance metric and any linkage method. The body of the heatmap consists of colour intensities which represent the original data.

Heatmaps V
Draw a heatmap of the first 50 genes from the unaffected gene expression data. The default approach uses Euclidean distance and complete linkage to make the dendrograms.
h1 <- regHeatmap(as.matrix(unaffected[,1:50]))
plot(h1)

Heatmaps VI
Explicitly program the heatmap function to make dendrograms using Euclidean distance and average linkage.
di <- function(x) dist(x, method='euclidean')
cl <- function(x) hclust(x, method='average')
h3 <- regHeatmap(as.matrix(unaffected[,1:50]), legend=2, dendrogram=list(clustfun=cl, distfun=di))
plot(h3)
Compared to the default (complete linkage), the dendrogram shape changes a little, but the clusters are similar in this average linkage example.

Heatmaps VII
Make a heatmap using 1 - |r| as a dissimilarity measure.
di <- function(x) as.dist(1-abs(cor(t(x), method='spearman')))
cl <- function(x) hclust(x, method='average')
h4 <- regHeatmap(as.matrix(unaffected[,1:50]), legend=2, dendrogram=list(clustfun=cl, distfun=di))
plot(h4)
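The effect of taking the absolute value in 1 - |r| is easiest to see on two profiles that mirror each other. The vectors up and down below are simulated for illustration only and are not taken from the lecture data.

```r
set.seed(4)
signal <- rnorm(30)
up   <-  signal + rnorm(30, sd = 0.1)   # follows the signal
down <- -signal + rnorm(30, sd = 0.1)   # mirrors the signal

r <- cor(up, down, method = "spearman") # strongly negative

signed_d   <- 1 - r       # treats anti-correlated profiles as far apart
unsigned_d <- 1 - abs(r)  # treats them as close together
```

With 1 - |r|, profiles that are strongly anti-correlated end up in the same cluster as strongly correlated ones, which suits questions about the strength of a relationship regardless of its direction. With 1 - r they are placed as far apart as possible.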

Lecture 5 Problem Sheet
A problem sheet entitled lecture_5_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture. Consult the help pages if unsure how to use a function. Please attempt the problems for the next mins. We will be on hand to help out. Solutions will be posted this afternoon.

Thank You