1
Lecture 11. Microarray and RNA-seq II: Clustering Gene Expression Data
Hyun Seok Kim, Ph.D., Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
MES Genome Informatics I (2015 Spring)
2
RNA sequencing workflow
Samples of interest (Condition 1: normal colon; Condition 2: colon tumor) → isolate RNA → generate cDNA, fragment, size select, add linkers → sequence ends (100s of millions of paired reads; 10s of billions of bases of sequence) → map to genome, transcriptome, and predicted exon junctions → downstream analysis.
Adapted from the Canadian Bioinformatics Workshop
3
RNA-seq analysis pipeline
Align reads (Bowtie2) → count reads per gene (HTSeq-count) → test for differential expression (edgeR). Oshlack et al. 2010
4
TMM (trimmed mean of M values) normalization for RNA-seq data
Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2. Robinson & Oshlack 2010
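A hypothetical numeric version of this thought experiment (the specific numbers are chosen here for illustration and do not appear on the slide): suppose B expresses 10,000 genes, A expresses those same 10,000 plus another 10,000 at the same level, and both libraries are sequenced to 20 million reads. Then a gene expressed in both samples receives, on average,

$$\frac{20 \times 10^{6}}{10{,}000} = 2{,}000 \text{ reads in B} \qquad \text{but} \qquad \frac{20 \times 10^{6}}{20{,}000} = 1{,}000 \text{ reads in A},$$

so multiplying sample A's counts by 2 (the normalization factor) restores the true one-to-one relationship for the shared genes.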
5
TMM (trimmed mean of M values) normalization for RNA-seq data
Normalization factor for sample k using reference sample r Robinson & Oshlack 2010
6
TMM (trimmed mean of M values) normalization for RNA-seq data
Robinson & Oshlack 2010
7
TMM (trimmed mean of M values) normalization for RNA-seq data
First, trim away the most extreme genes by log-fold-change (M, 30%) and by absolute intensity (A, 5%) to remove biological outliers such as strongly differentially expressed genes; G* denotes the set of non-trimmed genes. Robinson & Oshlack 2010
8
TMM (trimmed mean of M values) normalization for RNA-seq data
Observed count for gene g in library k; total number of reads for library k; log-fold-change of gene g in sample k relative to reference sample r. Robinson & Oshlack 2010
9
TMM (trimmed mean of M values) normalization for RNA-seq data
Observed count for gene g in library k; total number of reads for library k; log-fold-change of gene g in sample k relative to reference sample r. If no gene changes expression between the two samples, every M value is 0 and thus the TMM factor is 1. Robinson & Oshlack 2010
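In symbols (following Robinson & Oshlack 2010; Y_gk is the observed count for gene g in library k and N_k the library size), the log-fold-change of gene g in sample k relative to the reference sample r is

$$M_{gk}^{r} = \log_2\frac{Y_{gk}/N_k}{Y_{gr}/N_r},$$

so a gene with the same relative abundance in both libraries has M = 0, and if no gene changes at all the resulting TMM factor is 1 (no adjustment).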
10
TMM (trimmed mean of M values) normalization for RNA-seq data
Observed count for gene g in library k; total number of reads for library k; log-fold-change of gene g in sample k relative to reference sample r. Y must be greater than 0 in both libraries, since the log-fold-change is undefined otherwise. Robinson & Oshlack 2010
11
TMM (trimmed mean of M values) normalization for RNA-seq data
Numerator: weighted sum of the log-fold-changes M over all non-trimmed genes. The weight for each M is derived from the approximate variance of the log-fold-change (delta method), accounting for the fact that M values estimated from larger read counts are more precise than those estimated from small counts. Denominator: sum of the weights over all non-trimmed genes. Robinson & Oshlack 2010
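Putting the pieces together, the normalization factor is a weighted mean of the trimmed M values. The reconstruction below uses the notation of the preceding slides; the explicit form of the weights follows the delta-method variance in Robinson & Oshlack 2010 as implemented in edgeR, so treat it as a sketch rather than a transcription of the slide's equation:

$$\log_2\!\left(\mathrm{TMM}_k^{(r)}\right)=\frac{\sum_{g\in G^{*}} w_{gk}^{r}\,M_{gk}^{r}}{\sum_{g\in G^{*}} w_{gk}^{r}},\qquad w_{gk}^{r}\approx\left(\frac{N_k-Y_{gk}}{N_k\,Y_{gk}}+\frac{N_r-Y_{gr}}{N_r\,Y_{gr}}\right)^{-1},$$

where G* is the set of genes surviving the trimming and each weight is the inverse of the approximate variance of M, so log-fold-changes estimated from larger counts contribute more. The scaling factor applied to library k is then 2 raised to this value.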
12
edgeR: finding DEGs from RNA-seq data
RNA-seq count data are modeled with a negative binomial distribution (the higher the mean count, the larger the variance). edgeR estimates genewise dispersions and shrinks them towards a consensus value using an empirical Bayes procedure. Differential expression is then assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed count data.
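The mean-variance relationship referred to above is commonly written as follows (the standard negative binomial parameterization used by edgeR, with genewise dispersion φ_g):

$$Y_{gk}\sim\mathrm{NB}(\mu_{gk},\,\phi_g),\qquad \operatorname{Var}(Y_{gk})=\mu_{gk}+\phi_g\,\mu_{gk}^{2},$$

so the variance grows faster than the mean (overdispersion); φ_g = 0 recovers the Poisson case.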
13
Why clustering? Displaying and analyzing a large transcriptome dataset (microarray, RNA-seq) as a whole is difficult. The data are easier to interpret when they are partitioned into clusters of similar data points. Michael Eisen and David Botstein first applied a clustering algorithm to gene expression microarray data in a 1998 PNAS paper.
14
Microarray data: for clustering analysis, use normalized and log-transformed values.
15
Clustering Methods
- Agglomerative (bottom-up): start with every element in its own cluster, and iteratively join the closest clusters together.
- Divisive (top-down): start with one cluster containing all elements, and iteratively divide it into smaller clusters.
- Hierarchical clustering: the most famous agglomerative clustering method; it organizes elements into a tree in which the leaves represent genes and the lengths of the paths between leaves represent the distances between genes, so similar genes lie within the same subtrees.
16
Agglomerative vs. Divisive
Illustrative example: agglomerative and divisive clustering on the data set {a, b, c, d, e}. [Figure: over steps 0-4, agglomerative clustering merges a with b and d with e, then c with (d, e), and finally (a, b) with (c, d, e); divisive clustering runs the same steps in reverse, from the single cluster (a, b, c, d, e) back to the individual elements.]
17
Distance metrics: to measure similarity or dissimilarity between two genes.
- Euclidean distance
- Pearson correlation coefficient (r): positive correlation vs. negative correlation
[Figure: two genes (black, red) measured across 4 conditions, one panel showing positive correlation and one showing negative correlation.]
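Written out for two expression vectors x and y measured over n conditions (standard definitions; turning r into a distance, e.g. 1 − r, is a common convention rather than something stated on the slide):

$$d_{\mathrm{Euclidean}}(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2},\qquad r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$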
18
Cluster Distance Measures
Single link (min), complete link (max), average link.
Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}.
Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}.
Average link: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}.
Define d(C, C) = 0: the distance between two identical clusters is zero. (Ke Chen, University of Manchester, COMP24111)
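A minimal sketch of the three cluster distances using SciPy (the values below are invented purely for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two clusters of profiles (rows = elements, columns = features);
# the numbers are arbitrary illustration values.
Ci = np.array([[0.0, 1.0], [0.5, 1.5]])
Cj = np.array([[3.0, 4.0], [4.0, 5.0], [5.0, 5.5]])

# All pairwise Euclidean distances between elements of Ci and Cj.
D = cdist(Ci, Cj)

print("single link  :", D.min())   # smallest pairwise distance
print("complete link:", D.max())   # largest pairwise distance
print("average link :", D.mean())  # mean of all pairwise distances
```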
19
Cluster Distance Measures
Example: given a data set of five objects characterised by a single feature, assume that there are two clusters: C1 = {a, b} and C2 = {c, d, e}.

Object:  a  b  c  d  e
Feature: 1  2  4  5  6

1. Calculate the distance matrix (in one dimension, the Euclidean distance is the absolute difference of the feature values):

    a  b  c  d  e
a   0
b   1  0
c   3  2  0
d   4  3  1  0
e   5  4  2  1  0

2. Calculate the three cluster distances (single link, complete link, average) between C1 and C2; see the worked answers below.
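Working the exercise out (these values follow from the distance matrix above; they are not printed on the slide), the six between-cluster distances are d(a,c)=3, d(a,d)=4, d(a,e)=5, d(b,c)=2, d(b,d)=3, d(b,e)=4, so

$$\text{single link}=\min\{3,4,5,2,3,4\}=2,\qquad \text{complete link}=\max\{3,4,5,2,3,4\}=5,\qquad \text{average}=\frac{3+4+5+2+3+4}{6}=3.5.$$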
20
Agglomerative Algorithm
The agglomerative algorithm is carried out in three steps (a minimal implementation sketch follows):
1. Convert all object features into a distance matrix.
2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one (or a known number of clusters): merge the two closest clusters, then update the distance matrix.
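A from-scratch sketch of these steps in Python (single, complete, or average linkage over Euclidean distances; the example data points are invented, and a real analysis would typically use scipy.cluster.hierarchy instead):

```python
import numpy as np

def agglomerative(points, n_clusters=1, linkage="single"):
    """Naive agglomerative clustering following the three steps on the slide."""
    pts = np.asarray(points, dtype=float)          # step 1 input: object features
    clusters = [[i] for i in range(len(pts))]      # step 2: every object is its own cluster
    merges = []

    def cluster_distance(ci, cj):
        # All pairwise Euclidean distances between members of the two clusters.
        d = np.linalg.norm(pts[ci][:, None, :] - pts[cj][None, :, :], axis=-1)
        if linkage == "single":
            return d.min()
        if linkage == "complete":
            return d.max()
        return d.mean()                            # average linkage

    # Step 3: repeatedly merge the two closest clusters; the "distance matrix" is
    # recomputed on the fly here instead of being stored and updated explicitly.
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j], cluster_distance(clusters[i], clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Example with made-up 2-D points forming two obvious groups.
clusters, merges = agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]], n_clusters=2)
print(clusters)   # -> [[0, 1], [2, 3]]
```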
21
Example: Hierarchical clustering
Problem: clustering analysis with the agglomerative algorithm. [Figure: data matrix and the Euclidean distance matrix computed from it.]
22
Example: Hierarchical clustering
Merge two closest clusters (iteration 1)
23
Example: Hierarchical clustering
Update distance matrix (iteration 1)
24
Example: Hierarchical clustering
Merge two closest clusters (iteration 2)
25
Example: Hierarchical clustering
Update distance matrix (iteration 2)
26
Example: Hierarchical clustering
Merge two closest clusters/update distance matrix (iteration 3)
27
Example: Hierarchical clustering
28
Example: Hierarchical clustering
Final result (meeting termination condition)
29
Example: Hierarchical clustering
Dendrogram tree representation. [Figure: dendrogram with the objects on the horizontal axis and distance on the vertical axis.]
In the beginning we have 6 clusters: A, B, C, D, E and F.
We merge clusters D and F into (D, F) at distance 0.50.
We merge clusters A and B into (A, B) at distance 0.71.
We merge clusters E and (D, F) into ((D, F), E) at distance 1.00.
We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
The last cluster contains all the objects, which concludes the computation.
For a dendrogram tree, the horizontal axis indexes all objects in the data set, while the vertical axis expresses the lifetime of every possible cluster formation. The lifetime of an individual cluster is the distance interval from the moment the cluster is created to the moment it disappears by merging with other clusters.
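The merge sequence above can be replayed with SciPy by encoding it directly as a linkage matrix (a sketch built from the distances listed on the slide, not recomputed from the original data matrix):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Leaves 0..5 correspond to A..F. Each row is [cluster_i, cluster_j, distance, size];
# the cluster formed by row k receives index 6 + k (SciPy's linkage-matrix convention).
Z = np.array([
    [3, 5, 0.50, 2],   # D + F            -> cluster 6
    [0, 1, 0.71, 2],   # A + B            -> cluster 7
    [4, 6, 1.00, 3],   # E + (D, F)       -> cluster 8
    [2, 8, 1.41, 4],   # C + ((D, F), E)  -> cluster 9
    [7, 9, 2.50, 6],   # (A, B) + rest    -> cluster 10 (all objects)
])

dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("distance")
plt.show()
```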
30
Hierarchical clustering workflow
Prepare a normalized expression data matrix on the log2 scale. Select highly variable genes. Calculate pairwise distances for the selected genes. Perform clustering. Display the results as a heatmap. "Never try this with the whole-genome dataset." A sketch of the workflow follows.
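A minimal sketch of this workflow in Python (the file name, the 1,000-gene cutoff, and the choice of correlation distance with average linkage via seaborn.clustermap are illustrative assumptions, not part of the slide):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Normalized expression matrix on the log2 scale (rows = genes, columns = samples).
#    "expression_log2.tsv" is a hypothetical file name.
expr = pd.read_csv("expression_log2.tsv", sep="\t", index_col=0)

# 2. Select highly variable genes instead of the whole genome
#    (here: the 1,000 genes with the largest variance across samples).
top_genes = expr.var(axis=1).nlargest(1000).index
expr_top = expr.loc[top_genes]

# 3-5. Pairwise distances, agglomerative clustering, and heatmap display.
#      clustermap computes distances and linkage internally; correlation
#      distance (1 - r) with average linkage is a common choice.
sns.clustermap(expr_top, metric="correlation", method="average",
               z_score=0, cmap="RdBu_r", center=0)
plt.show()
```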