Statistical Programming Using the R Language
Lecture 5: Introducing Multivariate Data Analysis
Darren J. Fitzpatrick, Ph.D.
April 2016
Solutions I

1.
install.packages('pwr')
library(pwr)
pwr.anova.test(k = 4, f = 0.5, sig.level = 0.05, power = 0.8)

     Balanced one-way analysis of variance power calculation
              k = 4
              n =
              f = 0.5
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group
Solutions II

2.2
anova(lm(Guanylin ~ TNP, data = df))

Analysis of Variance Table
Response: Guanylin
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
TNP                                        e-09 ***
Residuals
Solutions III

2.3
pairwise.t.test(df$Guanylin, df$TNP, p.adjust.method='BH')

Pairwise comparisons using t tests with pooled SD

data: df$Guanylin and df$TNP

       N_M     N_W     T_M
N_W
T_M    3.5e-
T_W    8.9e-

P value adjustment method: BH
Solutions III

2.4
boxplot(df$Guanylin ~ df$TNP)
What is multivariate data?
Multivariate data is any data in which numerous measures/variables are recorded for each sample; it is often called multidimensional data. A gene expression matrix is a typical example: one row per sample (Normal_27, Normal_29, Normal_34, ..., Normal_40) and one column per gene (X53416, M83670, X90908, M97496, ...), with an expression value in each cell.
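A minimal sketch in R of what such a samples-by-genes matrix looks like (all values here are invented for illustration):

# Four samples (rows) measured on four genes (columns)
expr <- matrix(c(1314,  320,  2, 1231,
                 1101,  415, 10,  980,
                  950,  388,  5, 1175,
                 1230,  301,  8, 1042),
               nrow = 4, byrow = TRUE)
rownames(expr) <- c('Normal_27', 'Normal_29', 'Normal_34', 'Normal_40')
colnames(expr) <- c('X53416', 'M83670', 'X90908', 'M97496')
expr  # each row is one sample, each column one measured variable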
Clustering

Clustering is a means of grouping data such that items in the same cluster are more similar to each other than to items in other clusters. It is a form of unsupervised learning, i.e., the data carry no category information. Using only the relationships between the data points, clustering, irrespective of the method, attempts to organise the data into groups. It is up to the researcher to decide whether the clusters have any biological or other meaning by doing downstream analysis of the clusters, e.g., GO term enrichment, pathway analysis, etc.
Hierarchical Clustering I
The Algorithm
For a set of N samples to be clustered and an N x N distance matrix:
1. Assign each item to its own cluster, so that you have N clusters each containing a single item.
2. Using the distance matrix, merge the two most similar clusters, so that you have N-1 clusters, one of which contains two samples.
3. Compute the distance between the new cluster and each of the remaining clusters, and merge the most similar pair.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N (see the toy example below).
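A toy demonstration of the algorithm, using the hclust() function introduced on the following slides; the five one-dimensional points are made up so that the merge order is easy to follow:

# Five one-dimensional observations with obvious groupings
x <- matrix(c(1, 2, 8, 9, 25), ncol = 1,
            dimnames = list(paste0('item_', 1:5), 'value'))
d <- dist(x, method = 'euclidean')   # the N x N distance matrix (step 0)
hc <- hclust(d, method = 'average')  # agglomerative merging (steps 1-4)
hc$merge                             # row i shows the two clusters merged at step i
plot(hc)                             # items 1+2 and 3+4 fuse first; item 5 joins last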
Hierarchical Clustering II
df <- read.table('colon_cancer_data_set.txt', header=T)
unaffected <- df[which(df$Status=='U'), 1:7464]
for_cluster <- unaffected[, 1:5]  # Example for 5 genes

(The data are the samples-by-genes expression matrix from before: rows Normal_27, ..., Normal_40; columns X53416, M83670, X90908, M97496, ...)
Hierarchical Clustering III
To perform clustering, we first need to compute a distance matrix.

dmat <- dist(for_cluster, method='euclidean')

This turns the original samples-by-genes data into a symmetrical matrix of pairwise distances between samples (dmat), with entries such as 744.98, 860.54 and 708.68 between the Normal samples.
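A dist object prints as a lower triangle; to inspect the full symmetric form, a quick usage sketch:

round(as.matrix(dmat), 2)  # full N x N symmetric matrix of sample-to-sample distances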
Hierarchical Clustering IV
A Note on Distance Measures
Distance metrics are a way of summarising the similarity between multiple observations. There are numerous formulae for computing such differences, but the most commonly used is the Euclidean distance. For the other methods, see the help page for the dist() function. Euclidean distance for 2-dimensional data (x, y) is just the straight-line distance between two points.
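For two samples u = (u1, ..., un) and v = (v1, ..., vn), the Euclidean distance is sqrt((u1-v1)^2 + ... + (un-vn)^2). A quick check in R that dist() agrees with the formula, using made-up numbers:

u <- c(1314, 320, 2, 1231)
v <- c(1101, 415, 10, 980)
sqrt(sum((u - v)^2))                     # Euclidean distance from the formula
dist(rbind(u, v), method='euclidean')    # the same value from dist()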
Hierarchical Clustering V
A Note on Distance Measures
Euclidean distance in multivariate data is the generalised form of the 2D case. Distance metrics produce a single measure of similarity between samples based on multiple measurements, and they produce a symmetrical (N x N) distance matrix.
Hierarchical Clustering VI
Next, we cluster the data.

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)

Normal_4 appears to be an outlier. At the very least, he/she is different.
Hierarchical Clustering VII
A Note on Linkage
In determining clusters, linkage is a measure of one cluster's similarity to another: single linkage uses the closest pair of members, complete linkage the farthest pair, and average linkage the mean of all pairwise distances.

hclust(dmat, method='average')  # the method argument also accepts 'single' or 'complete'
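To see the effect of the choice, one can cluster the same distance matrix with each linkage and compare the trees; a sketch reusing the dmat computed above:

par(mfrow = c(1, 3))  # three dendrograms side by side
for (m in c('single', 'average', 'complete')) {
  plot(hclust(dmat, method = m), main = paste(m, 'linkage'))
}
par(mfrow = c(1, 1))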
Hierarchical Clustering VIII
In hierarchical clustering, you have to decide for yourself what constitutes a cluster. R has functions to help extract clusters.

clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')
Hierarchical Clustering IX
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

The cutree() function returns the clusters and their members.

cluster_1 <- names(which(clusters==1))
cluster_2 <- names(which(clusters==2))

Note: clusters are labelled numerically, in this case 1 and 2, in order of size (largest to smallest).
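Two quick ways to inspect the membership of every cluster at once, using the clusters vector from above:

table(clusters)                   # number of samples in each cluster
split(names(clusters), clusters)  # list of member names for each cluster label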
Hierarchical Clustering X
In Four Lines of Code!

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')
Hierarchical Clustering XI
Cluster genes using correlation as a distance measure.

1. Compute the distance matrix. Correlation is a similarity (highly correlated genes should end up close together), so it is converted to a dissimilarity before clustering:

tdmat <- as.dist(1 - cor(for_cluster, method='spearman'))
Hierarchical Clustering XII
Cluster genes using correlation as a distance measure.

2. Run the clustering algorithm.

cclust <- hclust(tdmat, method='average')
plot(cclust)

3. Look at the clusters.

cclusters <- cutree(cclust, k=2)
rect.hclust(cclust, k=2, border='red')
Writing Your Own Functions I
We have seen how to use functions in R; you can also write your own. Use the built-in function() keyword to define custom functions.

my_func <- function(x) print(x + 1)

> my_func(1)
[1] 2
> my_func(2)
[1] 3
Writing Your Own Functions II
my_func <- function(x){
  if (is.numeric(x)){
    sqr <- x^2
    sqroot <- sqrt(x)
    holder <- c(sqr, sqroot)
    return(holder)
  } else {
    print("x is not numeric")
  }
}

What do you think this function is doing?
Heatmaps I

Dendrograms are one way of visualising relationships in multivariate data. Heatmaps can also be used to visualise multivariate data, and heatmaps and dendrograms can be combined to create informative visualisations.
Heatmaps II

Bioconductor is a repository of R packages for analysing biological data. We are going to use the Heatplus package from Bioconductor to make heatmaps.

To install Heatplus:

source("https://bioconductor.org/biocLite.R")
biocLite("Heatplus")
library(Heatplus)
Heatmaps III

Documentation and examples for Bioconductor packages are always on the package homepage.
Heatmaps IV

The Heatplus package has a function called regHeatmap() for making heatmaps. This function enables us to cluster genes and samples using any distance metric and any linkage method. The body of the heatmap is a grid of colour intensities representing the original data.
Heatmaps V

Draw a heatmap of the first 50 genes from the unaffected gene expression data. The default approach uses Euclidean distance and complete linkage to make the dendrograms.

h1 <- regHeatmap(as.matrix(unaffected[,1:50]))
plot(h1)
Heatmaps VI

The default setting is to scale the data by row. This behaviour can be changed through the 'scale' parameter:

h1 <- regHeatmap(as.matrix(unaffected[,1:50]), scale='none')
plot(h1)
Heatmaps VII To save plots, select the Export option in the plotting window and save in your preferred format.
Heatmaps VIII

Explicitly program the heatmap function to make dendrograms using Euclidean distance and average linkage. Compared to the default (complete linkage), the dendrogram shape changes a little, but the clusters are similar in this average-linkage example.

di <- function(x) dist(x, method='euclidean')
cl <- function(x) hclust(x, method='average')
h3 <- regHeatmap(as.matrix(unaffected[,1:50]), dendrogram=list(clustfun=cl, distfun=di))
plot(h3)
Heatmaps IX

Make a heatmap using 1 - |r| as a dissimilarity measure. Note the t() function: cor() computes correlations between columns, so the transposition is needed to correlate along the right dimension.

di <- function(x) as.dist(1-abs(cor(t(x), method='spearman')))
cl <- function(x) hclust(x, method='average')
h4 <- regHeatmap(as.matrix(unaffected[,1:50]), dendrogram=list(clustfun=cl, distfun=di))
plot(h4)
Lecture 5 Problem Sheet

A problem sheet entitled lecture_5_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture; consult the help pages if you are unsure how to use a function. Please attempt the problems now. We will be on hand to help out, and solutions will be posted this afternoon.
Thank You