1 Statistical Programming Using the R Language
Lecture 5: Introducing Multivariate Data Analysis
Darren J. Fitzpatrick, Ph.D.
April 2016

2 Solutions I

1.
install.packages('pwr')
library(pwr)
pwr.anova.test(k = 4, f = 0.5, sig.level = 0.05, power = 0.8)

     Balanced one-way analysis of variance power calculation

              k = 4
              n = 11.92613
              f = 0.5
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

3 Solutions II

2.2
anova(lm(Guanylin~TNP, data=df))

Analysis of Variance Table

Response: Guanylin
          Df   Sum Sq Mean Sq F value    Pr(>F)
TNP        3 10618418 3539473   31.43 1.164e-09 ***
Residuals 32  3603655  112614

4 Solutions III

2.2
pairwise.t.test(df$Guanylin, df$TNP, p.adjust.method='BH')

        Pairwise comparisons using t tests with pooled SD

data: df$Guanylin and df$TNP

    N_M     N_W     T_M
N_W 0.00484 -       -
T_M 3.5e-09 0.00066 -
T_W 8.9e-08 0.00155 0.88366

P value adjustment method: BH

5 Solutions III

2.4
boxplot(df$Guanylin~df$TNP)

6 What is multivariate data?

Multivariate data is any data for which numerous measures/variables are obtained from a single sample. It is often called multidimensional data.

            X53416  M83670  X90908  M97496  X90908.1
Normal_27      ...     ...     ...     ...       ...
Normal_29      ...     ...     ...     ...       ...
Normal_34      ...     ...     ...     ...       ...
Normal_40      ...     ...     ...     ...       ...
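As a rough sketch of that layout in R (the values are generated at random, purely to illustrate the shape of the data):

expr <- matrix(sample(100:3000, 20), nrow=4,
               dimnames=list(c('Normal_27', 'Normal_29', 'Normal_34', 'Normal_40'),
                             c('X53416', 'M83670', 'X90908', 'M97496', 'X90908.1')))
expr    # 4 samples (rows) measured on 5 variables (columns)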

7 Clustering

Clustering is a means of grouping data such that variables in the same cluster are more similar to each other than to the variables in another cluster. It is a form of unsupervised learning, i.e., the data have no category information. Using only the relationships between the data points, clustering, irrespective of the method, attempts to organise the data into groups. It is up to the researcher to decide whether the clusters have any biological or other meaning by doing downstream analysis of the clusters, e.g., GO term enrichment, pathway analysis, etc.

8 Hierarchical Clustering I

The Algorithm

For a set of N samples to be clustered and an N x N distance matrix:

1. Assign each item to a cluster such that you have N clusters, each containing a single item.
2. Using the distance matrix, merge the two most similar samples such that you have N-1 clusters, one of which contains two samples.
3. Compute the distance between the new cluster and each of the remaining clusters and merge the most similar clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
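To make steps 1 and 2 concrete, here is a minimal sketch on a made-up 4-sample, 2-gene matrix (the object names and values are hypothetical, not from the course data); hclust() repeats steps 3 and 4 automatically:

toy <- matrix(c(1, 2,
                1.5, 1.8,
                8, 8,
                9, 11),
              ncol=2, byrow=TRUE,
              dimnames=list(paste0('S', 1:4), c('g1', 'g2')))
d <- as.matrix(dist(toy, method='euclidean'))   # step 1: N singleton clusters, N x N distances
diag(d) <- Inf                                  # ignore each sample's distance to itself
closest <- which(d == min(d), arr.ind=TRUE)[1, ]
rownames(d)[closest]                            # step 2: the two most similar samples are merged first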

9 Hierarchical Clustering II

df <- read.table('colon_cancer_data_set.txt', header=T)
unaffected <- df[which(df$Status=='U'), 1:7464]
for_cluster <- unaffected[, 1:5]   # Example for 5 genes

            X53416  M83670  X90908  M97496  X90908.1
Normal_27      ...     ...     ...     ...       ...
Normal_29      ...     ...     ...     ...       ...
Normal_34      ...     ...     ...     ...       ...
Normal_40      ...     ...     ...     ...       ...

10 Hierarchical Clustering III

To perform clustering, we first need to compute a distance matrix.

dmat <- dist(for_cluster, method='euclidean')

Original data (samples x genes):

            X53416  M83670  X90908  M97496  X90908.1
Normal_27      ...     ...     ...     ...       ...
Normal_29      ...     ...     ...     ...       ...
Normal_34      ...     ...     ...     ...       ...
Normal_40      ...     ...     ...     ...       ...

Distance matrix (dmat):

           Normal_27  Normal_29  Normal_34
Normal_29     744.98
Normal_34    1046.25    1222.16
Normal_28     860.54    1434.15     708.68
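dist() stores only the lower triangle; to print the full symmetric matrix yourself, it can be coerced back to a matrix (a small sketch):

round(as.matrix(dmat), 2)    # full N x N matrix of pairwise Euclidean distances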

11 Hierarchical Clustering IV

A Note on Distance Measures

Distance metrics are a way of summarising the similarity between multiple observations. There are numerous formulae for computing such differences, but the most commonly used is the Euclidean distance. For the other methods, look up the help for the dist() function.

Euclidean distance for 2-dimensional data (x, y) is just the straight-line distance between two points.
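As a quick sanity check (the two toy vectors below are made up for illustration), dist() agrees with the formula sqrt(sum((x - y)^2)):

x <- c(2, 4, 6)
y <- c(1, 7, 3)
sqrt(sum((x - y)^2))                     # manual Euclidean distance, 4.358899
dist(rbind(x, y), method='euclidean')    # the same value from dist()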

12 Hierarchical Clustering V

A Note on Distance Measures

Euclidean distance in multivariate data is the generalised form of the 2D case on the previous slide. Distance metrics produce a single measure of similarity between samples based on multiple measurements, giving a symmetrical N x N distance matrix:

           Normal_27  Normal_29  Normal_34
Normal_29     744.98
Normal_34    1046.25    1222.16
Normal_28     860.54    1434.15     708.68

(The original data table is as shown on the previous slides.)

13 Hierarchical Clustering VI

Next, we cluster the data.

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)

Normal_4 appears to be an outlier. At the very least, he/she is different.
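One rough way to quantify how separate that sample is (a sketch; the exact values depend on the data) is to look at the merge heights recorded by hclust(): an outlying sample only joins the tree in the final merge, which shows up as a jump in the last height.

round(dclust$height, 1)    # heights at which successive merges happened; note any jump at the end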

14 Hierarchical Clustering VII

A Note on Linkage

In determining clusters, linkage is a measure of one cluster's similarity to another. The method argument of hclust() selects the linkage, e.g.:

hclust(dmat, method='average')    # other options include 'single' and 'complete'
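To see how the linkage choice changes the tree, the sketch below (assuming dmat from the earlier slides is still in the workspace) fits and plots all three linkages side by side:

par(mfrow=c(1, 3))                        # three dendrograms in one row
for (m in c('average', 'single', 'complete')) {
  plot(hclust(dmat, method=m), main=m)    # same distances, different linkage
}
par(mfrow=c(1, 1))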

15 Hierarchical Clustering VIII

In hierarchical clustering, you have to determine what constitutes a cluster yourself. R has functions to help extract clusters.

clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

16 Hierarchical Clustering IX

clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

The cutree() function returns the clusters and their members.

cluster_1 <- names(which(clusters==1))
cluster_2 <- names(which(clusters==2))

Note: Clusters are labelled numerically, in this case 1 and 2, in order of size (largest to smallest).
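A quick way to inspect the result (a sketch, assuming the clusters vector from above) is to tabulate the labels and pull out the members of one group:

table(clusters)             # number of samples assigned to cluster 1 and cluster 2
clusters[clusters == 2]     # the named samples that fall in cluster 2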

17 Hierarchical Clustering X

In Four Lines of Code!

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

(The original data and distance matrix are as shown on the earlier slides.)

18 Hierarchical Clustering XI

Cluster genes using correlation as a distance measure.

1. Compute the distance matrix:

tdmat <- as.dist(cor(for_cluster, method='spearman'))

             X53416      M83670         ...
M83670   -0.3954569
X90908    0.1880166  -0.5638244
M97496   -0.2528380   0.7857513         ...
X90908.1  0.1714325  -0.7105128         ...
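Note that cor() returns a similarity, so here strongly correlated genes get a large value rather than a small distance. A common alternative, similar in spirit to the 1 - |r| measure used later on the Heatmaps VII slide, is to convert the correlation into a dissimilarity first; a minimal sketch, assuming for_cluster as above:

tdmat_alt <- as.dist(1 - cor(for_cluster, method='spearman'))   # 0 = identical ranking, 2 = perfectly anti-correlated
cclust_alt <- hclust(tdmat_alt, method='average')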

19 Hierarchical Clustering XII

Cluster genes using correlation as a distance measure.

2. Run the clustering algorithm:

cclust <- hclust(tdmat, method='average')
plot(cclust)

3. Look at the clusters:

cclusters <- cutree(cclust, k=2)
rect.hclust(cclust, k=2, border='red')

20 Heatmaps I

Dendrograms are a way of visualising relationships in multivariate data. Heatmaps can also be used to visualise multivariate data. Heatmaps and dendrograms can be combined to create informative visualisations.

21 Heatmaps II

Bioconductor is a repository of R packages for analysing biological data.
https://www.bioconductor.org

We are going to use the Heatplus package from Bioconductor to make heatmaps.
http://bioconductor.org/packages/release/bioc/html/Heatplus.html

To install Heatplus:

source("http://bioconductor.org/biocLite.R")
biocLite("Heatplus")
library(Heatplus)
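The biocLite() installer shown above was the standard Bioconductor mechanism at the time (2016); on current R/Bioconductor releases it has been replaced, and the equivalent install is:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("Heatplus")
library(Heatplus)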

22 Heatmaps III

Documentation and examples for Bioconductor packages are always on the package homepage.

23 Heatmaps IV

The Heatplus package has a function called regHeatmap() to make heatmaps. This function enables us to cluster genes and samples using any distance metric and any linkage method. The body of the heatmap is made up of colour intensities that represent the original data.

24 Heatmaps V

Draw a heatmap of the first 50 genes from the unaffected gene expression data. The default approach uses Euclidean distance and complete linkage to make the dendrograms.

h1 <- regHeatmap(as.matrix(unaffected[,1:50]))
plot(h1)

25 Heatmaps VI

Explicitly program the heatmap function to make dendrograms using Euclidean distance and average linkage.

di <- function(x) dist(x, method='euclidean')
cl <- function(x) hclust(x, method='average')
h3 <- regHeatmap(as.matrix(unaffected[,1:50]), legend=2,
                 dendrogram=list(clustfun=cl, distfun=di))
plot(h3)

Compared to the default (complete linkage), the dendrogram shape changes a little, but the clusters are similar in this average-linkage example.

26 Heatmaps VII

Make a heatmap using 1 - |r| as a dissimilarity measure.

di <- function(x) as.dist(1-abs(cor(t(x), method='spearman')))
cl <- function(x) hclust(x, method='average')
h4 <- regHeatmap(as.matrix(unaffected[,1:50]), legend=2,
                 dendrogram=list(clustfun=cl, distfun=di))
plot(h4)

27 Lecture 5 Problem Sheet

A problem sheet entitled lecture_5_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture. Consult the help pages if unsure how to use a function.

Please attempt the problems for the next 30-45 mins. We will be on hand to help out. Solutions will be posted this afternoon.

28 Thank You

