Lecture 11. Microarray and RNA-seq II : Clustering Gene Expression Data Hyun Seok Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine MES7594-01 Genome Informatics I (2015 Spring)
RNA sequencing. Samples of interest: Condition 1 (normal colon) vs. Condition 2 (colon tumor). Isolate RNAs; generate cDNA, fragment, size-select, add linkers; sequence ends; map to genome, transcriptome, and predicted exon junctions (100s of millions of paired reads; 10s of billions of bases of sequence); downstream analysis. Adapted from Canadian Bioinformatics Workshop
RNA-Seq analysis pipeline: Bowtie2 (alignment) → HTSeq-count (counting) → edgeR (differential expression). Oshlack et al. 2010
TMM (trimmed mean of M values) normalization for RNA-seq data Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2. Robinson & Oshlack 2010
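The factor-of-2 adjustment in this thought experiment can be checked numerically. The sketch below uses made-up transcript counts (10 shared genes plus 10 A-only genes), not data from the paper:

```python
import numpy as np

# Hypothetical scenario: sample B expresses genes 0-9 at 100 transcripts
# each; sample A expresses the same 10 genes at the same level PLUS 10
# extra genes, doubling its total RNA output.
transcripts_B = np.array([100] * 10 + [0] * 10)
transcripts_A = np.array([100] * 10 + [100] * 10)

# Sequencing both samples to the same depth allocates reads in
# proportion to each gene's share of its sample's total RNA.
depth = 2000
reads_B = depth * transcripts_B / transcripts_B.sum()   # 200 reads per shared gene
reads_A = depth * transcripts_A / transcripts_A.sum()   # 100 reads per gene

# The shared genes get half the reads in A even though their true
# expression is identical, so counts from A must be scaled up by 2.
shared = slice(0, 10)
factor = reads_B[shared].mean() / reads_A[shared].mean()
print(factor)  # 2.0
```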
TMM (trimmed mean of M values) normalization for RNA-seq data
Normalization factor for sample k using reference sample r:
log2( TMM_k^(r) ) = Σ_{g∈G*} w_gk^r · M_gk^r / Σ_{g∈G*} w_gk^r
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
First, trim the genes by log-fold-change M (upper and lower 30%) and by absolute intensity A (upper and lower 5%) to remove biological outliers (DEGs); G* denotes the set of non-trimmed genes.
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k
N_k: total number of reads for library k
M_gk^r = log2( (Y_gk / N_k) / (Y_gr / N_r) ): log-fold-change of gene g in sample k relative to reference sample r
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k; N_k: total number of reads for library k; M_gk^r: log-fold-change of gene g in sample k relative to reference sample r.
If no gene changes expression, M_gk^r = 0 for every gene, and thus the TMM factor equals 1 (log2 TMM = 0).
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k; N_k: total number of reads for library k; M_gk^r: log-fold-change of gene g in sample k relative to reference sample r.
Y_gk and Y_gr must be greater than 0: genes with zero counts in either library are excluded.
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Numerator: weighted sum of log-fold-changes M_gk^r over all non-trimmed genes g ∈ G*. Denominator: sum of the weights over all non-trimmed genes.
The weight for M is the inverse of its approximate (delta-method) variance:
w_gk^r = [ (N_k − Y_gk)/(N_k · Y_gk) + (N_r − Y_gr)/(N_r · Y_gr) ]^(−1)
so low-count genes, whose log-fold-changes are noisiest, receive small weights and high-count genes receive large weights. This corrects for the strong dependence of the variance of M on the read count.
Robinson & Oshlack 2010
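Putting the definitions together, here is a minimal Python sketch of the TMM factor under the assumptions above (the `tmm_factor` name and the plain quantile trimming are simplifications of my own; edgeR's `calcNormFactors` is the reference implementation):

```python
import numpy as np

def tmm_factor(y_k, y_r, logratio_trim=0.3, abs_trim=0.05):
    """Minimal sketch of a TMM normalization factor (after Robinson &
    Oshlack 2010). edgeR's calcNormFactors adds further refinements."""
    y_k, y_r = np.asarray(y_k, float), np.asarray(y_r, float)
    n_k, n_r = y_k.sum(), y_r.sum()
    # Genes with zero counts in either library are excluded (Y must be > 0).
    keep = (y_k > 0) & (y_r > 0)
    y_k, y_r = y_k[keep], y_r[keep]
    # M: log-fold-change of k relative to r; A: average log intensity.
    m = np.log2((y_k / n_k) / (y_r / n_r))
    a = 0.5 * np.log2((y_k / n_k) * (y_r / n_r))
    # Trim the most extreme 30% of M and 5% of A to drop likely DEGs;
    # the surviving genes form G*.
    m_lo, m_hi = np.quantile(m, [logratio_trim, 1 - logratio_trim])
    a_lo, a_hi = np.quantile(a, [abs_trim, 1 - abs_trim])
    g = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
    # Precision weights: inverse of the delta-method variance of M.
    var = (n_k - y_k) / (n_k * y_k) + (n_r - y_r) / (n_r * y_r)
    w = 1.0 / var
    # Weighted trimmed mean of M, back-transformed from log2 scale.
    return 2 ** (np.sum(w[g] * m[g]) / np.sum(w[g]))
```

If the two libraries have identical (or exactly proportional) counts, every M is zero and the factor comes out as 1, as the slides note.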
edgeR: find DEGs from RNA-seq
RNA-seq count data fit a negative binomial distribution (the higher the mean count, the larger the variance).
edgeR estimates genewise dispersions and shrinks them towards a consensus value using an empirical Bayes procedure.
Differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data.
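To see the overdispersion that edgeR models, one can simulate negative binomial counts and compare the variance to the Poisson case; `mu` and `phi` below are arbitrary illustrative values, not edgeR defaults:

```python
import numpy as np

# Under a negative binomial with mean mu and dispersion phi, the
# variance is mu + phi * mu**2 -- it grows faster than the Poisson
# variance, which equals mu.
mu, phi = 100.0, 0.2          # expected variance: 100 + 0.2*100**2 = 2100

# NumPy parameterizes NB by (n, p); convert from (mu, phi).
n = 1.0 / phi
p = n / (n + mu)

rng = np.random.default_rng(0)
draws = rng.negative_binomial(n, p, size=200_000)
print(draws.mean(), draws.var())  # ~100, ~2100 (Poisson would give ~100)
```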
Why clustering? Displaying and analyzing a large transcriptome dataset (microarray, RNA-seq) as a whole is difficult. The data are easier to interpret if they are partitioned into clusters of similar data points. Michael Eisen & David Botstein first applied a clustering algorithm to gene expression microarray data in a 1998 PNAS paper.
Microarray Data. For clustering analysis, use normalized, log-transformed expression values.
Clustering Methods
Agglomerative (bottom-up): start with every element in its own cluster, and iteratively join the closest clusters together.
Divisive (top-down): start with one cluster and iteratively divide it into smaller clusters.
Hierarchical clustering, the most famous agglomerative method, organizes elements into a tree: leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees.
Agglomerative vs. Divisive: Illustrative Example
Agglomerative and divisive clustering on the data set {a, b, c, d, e}.
[figure: agglomerative clustering merges the objects bottom-up over steps 0-4; divisive clustering splits them top-down over the same steps]
Distance metric
To measure similarity or dissimilarity between two genes:
Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Pearson correlation coefficient (r): positive correlation (left), negative correlation (right)
[figure: two genes (black, red) measured under 4 conditions]
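Both metrics are easy to compute directly. The sketch below uses two made-up expression profiles that differ only by a constant offset, to show how Euclidean distance and Pearson correlation can disagree about "similarity":

```python
import math

# Hypothetical expression values for two genes across 4 conditions
# (illustrative numbers, not from the lecture's figure).
gene_black = [1.0, 2.0, 3.0, 4.0]
gene_red   = [2.0, 3.0, 4.0, 5.0]

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The genes differ by a constant offset: the Euclidean distance is
# nonzero, yet the two profiles are perfectly correlated.
print(euclidean(gene_black, gene_red))  # 2.0
print(pearson(gene_black, gene_red))    # ~1.0
```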
Cluster Distance Measures
Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
Average link: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
Define d(C, C) = 0: the distance between two identical clusters is zero. (Ke Chen, University of Manchester, COMP24111)
Cluster Distance Measures
Example: given a data set of five objects characterised by a single feature, assume there are two clusters: C1 = {a, b} and C2 = {c, d, e}.
1. Calculate the distance matrix.
2. Calculate the three cluster distances (single link, complete link, average) between C1 and C2.
Object:  a  b  c  d  e
Feature: 1  2  4  5  6
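A quick way to check your answers to this exercise (the feature values are taken from the slide; with a single feature the distance is just |x − y|):

```python
from itertools import product

# The five objects' single feature values from the slide.
feat = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

# All pairwise distances between the two clusters.
d = [abs(feat[p] - feat[q]) for p, q in product(C1, C2)]

print(min(d))            # single link:   d(b, c) = 2
print(max(d))            # complete link: d(a, e) = 5
print(sum(d) / len(d))   # average link:  21 / 6  = 3.5
```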
Agglomerative Algorithm
The agglomerative algorithm is carried out in three steps:
1. Convert all object features into a distance matrix.
2. Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one (or a known number of clusters): merge the two closest clusters, then update the distance matrix.
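The three steps above can be sketched as a naive implementation; single linkage on 1-D points is assumed purely for illustration:

```python
def agglomerate(points, target_clusters=2):
    """Naive agglomerative clustering of 1-D points with single linkage."""
    # Step 2: every object starts in its own cluster.
    clusters = [[p] for p in points]
    # Step 3: repeat until the desired number of clusters remains.
    while len(clusters) > target_clusters:
        # Find the two closest clusters
        # (single link: minimum pairwise distance between members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Merge them; the "distance matrix update" is implicit here because
        # distances are recomputed from cluster members on each pass.
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1, 2, 4, 5, 6]))  # [[1, 2], [4, 5, 6]]
```

This is O(n³) and only meant to make the steps concrete; real implementations maintain and update an explicit distance matrix instead of recomputing distances.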
Example: Hierarchical clustering Problem: clustering analysis with the agglomerative algorithm (data matrix → Euclidean distance → distance matrix)
Example: Hierarchical clustering Merge two closest clusters (iteration 1)
Example: Hierarchical clustering Update distance matrix (iteration 1)
Example: Hierarchical clustering Merge two closest clusters (iteration 2)
Example: Hierarchical clustering Update distance matrix (iteration 2)
Example: Hierarchical clustering Merge two closest clusters / update distance matrix (iteration 3)
Example: Hierarchical clustering
Example: Hierarchical clustering Final result (meeting the termination condition)
Example: Hierarchical clustering. Dendrogram tree representation.
In the beginning we have 6 clusters: A, B, C, D, E and F.
We merge clusters D and F into (D, F) at distance 0.50.
We merge clusters A and B into (A, B) at distance 0.71.
We merge clusters E and (D, F) into ((D, F), E) at distance 1.00.
We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
The last cluster contains all the objects, which concludes the computation.
In a dendrogram, the horizontal axis indexes all objects in the data set, while the vertical axis expresses the lifetime of every cluster formed. The lifetime of a cluster is the distance interval from the moment the cluster is created to the moment it disappears by merging with another cluster.
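In practice the merges and the dendrogram come from a library such as SciPy. The 2-D coordinates below are hypothetical, chosen only so that single linkage reproduces the merge distances listed above; they are not the lecture's actual data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical coordinates for objects A-F.
points = np.array([[0.0, 2.0], [0.5, 2.5],   # A, B
                   [4.0, 1.0],               # C
                   [3.0, 2.5], [3.0, 3.5],   # D, E
                   [3.0, 2.0]])              # F
labels = list("ABCDEF")

# Each row of Z records one merge; column 2 is the merge distance,
# i.e. the height at which that cluster appears in the dendrogram.
Z = linkage(pdist(points), method="single")
print(np.round(Z[:, 2], 2))  # merge heights: 0.5, 0.71, 1.0, 1.41, 2.5

# dendrogram(Z, labels=labels) draws the tree (requires matplotlib).
```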
Hierarchical clustering workflow
1. Prepare a normalized expression data matrix in log2 scale.
2. Select highly variable genes.
3. Calculate pairwise distances for the selected genes.
4. Perform clustering.
5. Display the results as a heatmap.
“Never try with the whole-genome dataset.”
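A sketch of this workflow in Python with SciPy. The random matrix stands in for a real log2-scale expression matrix, and `seaborn.clustermap` is one common choice for the heatmap step:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Step 1 (assumed done): a normalized, log2-scale expression matrix,
# genes x samples. Random numbers here stand in for real data.
rng = np.random.default_rng(1)
expr = rng.normal(size=(1000, 6))

# Step 2: select the most variable genes rather than the whole genome.
var = expr.var(axis=1)
top = expr[np.argsort(var)[-100:]]

# Step 3: Pearson-correlation distance (1 - r) between selected genes.
dist = pdist(top, metric="correlation")

# Step 4: hierarchical clustering with average linkage.
Z = linkage(dist, method="average")
print(Z.shape)  # (99, 4): one row per merge

# Step 5: seaborn.clustermap(top) would display the clustered heatmap.
```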