Statistical Programming Using the R Language


Statistical Programming Using the R Language Lecture 5 Introducing Multivariate Data Analysis Darren J. Fitzpatrick, Ph.D April 2016

Solutions I

1.

install.packages('pwr')
library(pwr)
pwr.anova.test(k = 4, f = 0.5, sig.level = 0.05, power = 0.8)

     Balanced one-way analysis of variance power calculation

              k = 4
              n = 11.92613
              f = 0.5
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Solutions II

2.2

anova(lm(Guanylin ~ TNP, data=df))

Analysis of Variance Table

Response: Guanylin
          Df   Sum Sq Mean Sq F value    Pr(>F)
TNP        3 10618418 3539473   31.43 1.164e-09 ***
Residuals 32  3603655  112614

Solutions III

2.3

pairwise.t.test(df$Guanylin, df$TNP, p.adjust.method='BH')

    Pairwise comparisons using t tests with pooled SD

data:  df$Guanylin and df$TNP

    N_M     N_W     T_M
N_W 0.00484 -       -
T_M 3.5e-09 0.00066 -
T_W 8.9e-08 0.00155 0.88366

P value adjustment method: BH

Solutions IV

2.4

boxplot(df$Guanylin ~ df$TNP)

What is multivariate data?

Multivariate data is any data in which several measures/variables are recorded for each sample, for example a gene expression matrix:

          X53416 M83670 X90908 M97496 X90908.1
Normal_27   1314    320      2   1231        3
Normal_29      .
Normal_34    ...
Normal_40

It is often called multidimensional data.

Clustering

Clustering is a means of grouping data such that members of the same cluster are more similar to each other than to members of other clusters. It is a form of unsupervised learning, i.e., the data carries no category information. Using only the relationships between the data points, clustering, irrespective of the method, attempts to organise the data into groups. It is up to the researcher to decide whether the clusters have any biological or other meaning by doing downstream analysis of the clusters, e.g., GO term enrichment, pathway analysis, etc.

Hierarchical Clustering I

The Algorithm

For a set of N samples to be clustered and an N x N distance matrix:

1. Assign each item to its own cluster, so that you have N clusters, each containing a single item.
2. Using the distance matrix, merge the two most similar clusters, so that you have N-1 clusters, one of which contains two samples.
3. Compute the distance between the new cluster and each of the remaining clusters and merge the two most similar clusters.
4. Repeat steps 2 and 3 until all items are merged into a single cluster of size N.

Hierarchical Clustering II

df <- read.table('colon_cancer_data_set.txt', header=T)
unaffected <- df[which(df$Status=='U'), 1:7464]
for_cluster <- unaffected[, 1:5]  # Example for 5 genes

          X53416 M83670 X90908 M97496 X90908.1
Normal_27   1314    320      2   1231        3
Normal_29      .
Normal_34    ...
Normal_40

Hierarchical Clustering III

To perform clustering, we first need to compute a distance matrix.

dmat <- dist(for_cluster, method='euclidean')

Original Data

          X53416 M83670 X90908 M97496 X90908.1
Normal_27   1314    320      2   1231        3
Normal_29      .
Normal_34    ...
Normal_40

Distance Matrix (dmat), a fragment:

          Normal_27 Normal_29 Normal_34
             744.98   1046.25   1222.16
Normal_28    860.54   1434.15    708.68

Hierarchical Clustering IV

A Note on Distance Measures

Distance metrics are a way of summarising the similarity between multiple observations. There are numerous formulae for computing such differences, but the most commonly used is the Euclidean distance. For the other methods, see the help for the dist() function. Euclidean distance for 2-dimensional data (x, y) is just the straight-line distance between two points.
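The 2D formula extends directly to any number of dimensions. A self-contained sketch with toy vectors (the values are hypothetical, not taken from the colon cancer data set) shows the manual formula agreeing with dist():

```r
# Two toy 5-dimensional expression profiles (hypothetical values).
x <- c(1314, 320, 2, 1231, 3)
y <- c(1500, 410, 15, 980, 7)

# Euclidean distance by the formula: square root of the sum of
# squared coordinate-wise differences.
manual <- sqrt(sum((x - y)^2))

# The same value from dist(), which the clustering functions consume.
via_dist <- as.numeric(dist(rbind(x, y), method='euclidean'))

manual
via_dist
```

Both calls produce the same number, which is why dist() can be treated as a drop-in for the textbook formula when building the N x N distance matrix.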

Hierarchical Clustering V

A Note on Distance Measures

Euclidean distance in multivariate data is the generalised form of the 2D example. Distance metrics produce a single measure of similarity between samples based on multiple measurements. They produce a symmetrical (N x N) distance matrix.

Hierarchical Clustering VI

Next, we cluster the data.

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)

Normal_4 appears to be an outlier. At the very least, he/she is different.

Hierarchical Clustering VII

A Note on Linkage

In determining clusters, linkage is a measure of one cluster's similarity to another.

hclust(dmat, method=c('average', 'single', 'complete'))
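The choice of linkage changes the shape of the tree. A self-contained sketch on toy random data (not the course data set) fits one tree per method and compares their final merge heights; single linkage tends to chain and merges at low heights, complete linkage forms compact clusters and merges higher:

```r
# Toy data: 6 observations in 4 dimensions (arbitrary random values).
set.seed(42)
m <- matrix(rnorm(24), nrow=6)
d <- dist(m, method='euclidean')

# Fit one tree per linkage method and report the height of the
# final merge, which differs between methods on the same distances.
for (link in c('average', 'single', 'complete')) {
  hc <- hclust(d, method=link)
  cat(link, ': final merge height =', round(max(hc$height), 2), '\n')
}
```

Plotting each tree with plot(hc) makes the difference in shape easy to see on your own data.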

Hierarchical Clustering VIII

In hierarchical clustering, you have to determine what constitutes a cluster yourself. R has functions to help extract clusters.

clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

Hierarchical Clustering IX

clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

The cutree() function returns the clusters and their members.

cluster_1 <- names(which(clusters==1))
cluster_2 <- names(which(clusters==2))

Note: Clusters are labelled numerically, in this case 1 and 2, in order of size (largest to smallest).
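A quick way to inspect a cutree() result is to tabulate the labels. A self-contained sketch on toy data (on the course data you would use the dclust object built from for_cluster instead of this toy tree):

```r
# Toy data: 5 named samples in 4 dimensions (arbitrary random values).
set.seed(1)
toy <- matrix(rnorm(20), nrow=5,
              dimnames=list(paste0('Sample_', 1:5), NULL))
dclust_toy <- hclust(dist(toy, method='euclidean'), method='average')

# cutree() returns a named vector: sample name -> cluster label.
clusters <- cutree(dclust_toy, k=2)

table(clusters)            # how many samples fall in each cluster
names(which(clusters==1))  # the members of cluster 1
```

table() is a convenient first check that the cut produced sensibly sized groups before any downstream analysis.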

Hierarchical Clustering X

In Four Lines of Code!

dmat <- dist(for_cluster, method='euclidean')
dclust <- hclust(dmat, method='average')
plot(dclust)
clusters <- cutree(dclust, k=2)
rect.hclust(dclust, k=2, border='red')

Hierarchical Clustering XI

Cluster genes using correlation as a distance measure.

1. Compute a distance matrix. Correlation is a similarity (high r means similar), so it is converted into a dissimilarity before clustering:

tdmat <- as.dist(1 - cor(for_cluster, method='spearman'))

The underlying correlation matrix (a fragment):

             X53416     M83670
M83670   -0.3954569
X90908    0.1880166 -0.5638244
M97496   -0.2528380  0.7857513
X90908.1  0.1714325 -0.7105128

Hierarchical Clustering XII

Cluster genes using correlation as a distance measure.

2. Run the clustering algorithm.

cclust <- hclust(tdmat, method='average')
plot(cclust)

3. Look at the clusters.

cclusters <- cutree(cclust, k=2)
rect.hclust(cclust, k=2, border='red')
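Because correlation measures similarity rather than distance, a transformation such as 1 - r is commonly applied before clustering. A self-contained toy check (hypothetical values, not the course data) shows why: perfectly correlated columns get dissimilarity 0, perfectly anti-correlated columns get 2:

```r
# Three toy variables: b is a perfect monotone transform of a,
# and d is the exact reverse of a.
a <- 1:10
b <- 2 * a + 3   # Spearman correlation with a is +1
d <- rev(a)      # Spearman correlation with a is -1
m <- cbind(a, b, d)

# 1 - r maps similarity onto a dissimilarity scale from 0 to 2.
as.dist(1 - cor(m, method='spearman'))
```

On this scale, hclust() correctly treats strongly correlated genes as close together rather than far apart.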

Writing Your Own Functions I

We have seen how to use functions in R – you can also write your own. Use the inbuilt function() command to define custom functions.

my_func <- function(x) print(x + 1)

> my_func(1)
[1] 2
> my_func(2)
[1] 3

Writing Your Own Functions II

my_func <- function(x){
  if (is.numeric(x)){
    sqr <- x^2
    sqroot <- sqrt(x)
    holder <- c(sqr, sqroot)
    return(holder)
  } else {
    print("x is not numeric")
  }
}

What do you think this function is doing?
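To check your guess, the function can be called on sample inputs (re-stated here so the snippet runs on its own):

```r
# The function from the slide above, restated verbatim.
my_func <- function(x){
  if (is.numeric(x)){
    sqr <- x^2
    sqroot <- sqrt(x)
    holder <- c(sqr, sqroot)
    return(holder)
  } else {
    print("x is not numeric")
  }
}

my_func(4)      # returns c(16, 2): the square and the square root of 4
my_func('four') # prints "x is not numeric"
```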

Heatmaps I

Dendrograms are one way of visualising relationships in multivariate data. Heatmaps can also be used to visualise multivariate data, and the two can be combined to create informative visualisations.

Heatmaps II

Bioconductor is a repository of R packages for analysing biological data: https://www.bioconductor.org

We are going to use the Heatplus package from Bioconductor to make heatmaps: http://bioconductor.org/packages/release/bioc/html/Heatplus.html

To install Heatplus:

source("http://bioconductor.org/biocLite.R")
biocLite("Heatplus")
library(Heatplus)

Heatmaps III Documentation and examples for bioconductor packages are always on the package homepage.

Heatmaps IV

The Heatplus package has a function called regHeatmap() for making heatmaps. This function enables us to cluster genes and samples using any distance metric and any linkage method. The body of the heatmap is a grid of colour intensities which represent the original data.

Heatmaps V

Draw a heatmap of the first 50 genes from the unaffected gene expression data. The default approach uses Euclidean distance and complete linkage to make the dendrograms.

h1 <- regHeatmap(as.matrix(unaffected[,1:50]))
plot(h1)

Heatmaps VI

The default setting is to scale the data by row. This behaviour can be changed through the 'scale' parameter:

h1 <- regHeatmap(as.matrix(unaffected[,1:50]), scale='none')
plot(h1)

Heatmaps VII To save plots, select the Export option in the plotting window and save in your preferred format.
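Plots can also be saved programmatically instead of through the Export menu, using a base R graphics device. A minimal sketch (the filename is illustrative; for the heatmap you would replace the toy plot with plot(h1) from the earlier slide):

```r
# Open a PDF graphics device, draw a plot into it, then close
# the device so the file is written to disk.
pdf('example_plot.pdf', width=7, height=7)  # illustrative filename
plot(1:10, 1:10)                            # stands in for plot(h1)
dev.off()
```

The same pattern works with png() or jpeg() if a bitmap format is preferred.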

Heatmaps VIII

Explicitly program the heatmap function to make dendrograms using Euclidean distance and average linkage. Compared to the default (complete linkage), the dendrogram shape changes a little, but the clusters are similar in this average linkage example.

di <- function(x) dist(x, method='euclidean')
cl <- function(x) hclust(x, method='average')
h3 <- regHeatmap(as.matrix(unaffected[,1:50]), dendrogram=list(clustfun=cl, distfun=di))
plot(h3)

Heatmaps IX

Make a heatmap using 1 - |r| as a dissimilarity measure. Note the t() function – this is a transposition.

di <- function(x) as.dist(1 - abs(cor(t(x), method='spearman')))
cl <- function(x) hclust(x, method='average')
h4 <- regHeatmap(as.matrix(unaffected[,1:50]), dendrogram=list(clustfun=cl, distfun=di))
plot(h4)

Lecture 5 Problem Sheet

A problem sheet entitled lecture_5_problems.pdf is located on the course website. Some of the code required for the problem sheet has been covered in this lecture. Consult the help pages if unsure how to use a function.

Please attempt the problems for the next 30-45 mins. We will be on hand to help out. Solutions will be posted this afternoon.

Thank You