Lecture 11. Microarray and RNA-seq II : Clustering Gene Expression Data Hyun Seok Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine MES7594-01 Genome Informatics I (2015 Spring)
RNA sequencing. Samples of interest: Condition 1 (normal colon) vs. Condition 2 (colon tumor). Isolate RNAs; generate cDNA, fragment, size-select, add linkers; sequence ends; map to genome, transcriptome, and predicted exon junctions (100s of millions of paired reads; 10s of billions of bases of sequence); downstream analysis. Adapted from Canadian Bioinformatics Workshop
RNA-Seq analysis pipeline: Bowtie2 (alignment) → HTSeq-count (counting) → edgeR (differential expression). Oshlack et al. 2010
TMM (trimmed mean of M values) normalization for RNA-seq data Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2. Robinson & Oshlack 2010
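The factor-of-2 adjustment in this thought experiment can be checked numerically. The sketch below uses made-up transcript counts (10 shared genes plus 10 A-only genes), not data from the paper:

```python
import numpy as np

# Hypothetical scenario: sample B expresses genes 0-9 at 100 transcripts
# each; sample A expresses the same 10 genes at the same level PLUS 10
# extra genes, doubling its total RNA output.
transcripts_B = np.array([100] * 10 + [0] * 10)
transcripts_A = np.array([100] * 10 + [100] * 10)

# Sequencing both samples to the same depth allocates reads in
# proportion to each gene's share of its sample's total RNA.
depth = 2000
reads_B = depth * transcripts_B / transcripts_B.sum()   # 200 reads per shared gene
reads_A = depth * transcripts_A / transcripts_A.sum()   # 100 reads per gene

# The shared genes get half the reads in A even though their true
# expression is identical, so counts from A must be scaled up by 2.
shared = slice(0, 10)
factor = reads_B[shared].mean() / reads_A[shared].mean()
print(factor)  # 2.0
```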
TMM (trimmed mean of M values) normalization for RNA-seq data
Normalization factor for sample k using reference sample r:
log2( TMM_k^(r) ) = Σ_{g∈G*} w_gk^r · M_gk^r / Σ_{g∈G*} w_gk^r
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
First, trim the genes by log-fold-change M (upper and lower 30%) and by absolute intensity A (upper and lower 5%) to remove biological outliers (DEGs); G* denotes the set of non-trimmed genes.
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k
N_k: total number of reads for library k
M_gk^r = log2( (Y_gk / N_k) / (Y_gr / N_r) ): log-fold-change of gene g in sample k relative to reference sample r
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k; N_k: total number of reads for library k; M_gk^r: log-fold-change of gene g in sample k relative to reference sample r.
If no gene changes expression, M_gk^r = 0 for every gene, and thus the TMM factor equals 1 (log2 TMM = 0).
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Y_gk: observed count for gene g in library k; N_k: total number of reads for library k; M_gk^r: log-fold-change of gene g in sample k relative to reference sample r.
Y_gk and Y_gr must be greater than 0: genes with zero counts in either library are excluded.
Robinson & Oshlack 2010
TMM (trimmed mean of M values) normalization for RNA-seq data
Numerator: weighted sum of log-fold-changes M_gk^r over all non-trimmed genes g ∈ G*. Denominator: sum of the weights over all non-trimmed genes.
The weight for M is the inverse of its approximate (delta-method) variance:
w_gk^r = [ (N_k − Y_gk)/(N_k · Y_gk) + (N_r − Y_gr)/(N_r · Y_gr) ]^(−1)
so low-count genes, whose log-fold-changes are noisiest, receive small weights and high-count genes receive large weights. This corrects for the strong dependence of the variance of M on the read count.
Robinson & Oshlack 2010
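Putting the definitions together, here is a minimal Python sketch of the TMM factor under the assumptions above (the `tmm_factor` name and the plain quantile trimming are simplifications of my own; edgeR's `calcNormFactors` is the reference implementation):

```python
import numpy as np

def tmm_factor(y_k, y_r, logratio_trim=0.3, abs_trim=0.05):
    """Minimal sketch of a TMM normalization factor (after Robinson &
    Oshlack 2010). edgeR's calcNormFactors adds further refinements."""
    y_k, y_r = np.asarray(y_k, float), np.asarray(y_r, float)
    n_k, n_r = y_k.sum(), y_r.sum()
    # Genes with zero counts in either library are excluded (Y must be > 0).
    keep = (y_k > 0) & (y_r > 0)
    y_k, y_r = y_k[keep], y_r[keep]
    # M: log-fold-change of k relative to r; A: average log intensity.
    m = np.log2((y_k / n_k) / (y_r / n_r))
    a = 0.5 * np.log2((y_k / n_k) * (y_r / n_r))
    # Trim the most extreme 30% of M and 5% of A to drop likely DEGs;
    # the surviving genes form G*.
    m_lo, m_hi = np.quantile(m, [logratio_trim, 1 - logratio_trim])
    a_lo, a_hi = np.quantile(a, [abs_trim, 1 - abs_trim])
    g = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
    # Precision weights: inverse of the delta-method variance of M.
    var = (n_k - y_k) / (n_k * y_k) + (n_r - y_r) / (n_r * y_r)
    w = 1.0 / var
    # Weighted trimmed mean of M, back-transformed from log2 scale.
    return 2 ** (np.sum(w[g] * m[g]) / np.sum(w[g]))
```

If the two libraries have identical (or exactly proportional) counts, every M is zero and the factor comes out as 1, as the slides note.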
edgeR: find DEGs from RNA-seq
RNA-seq count data fit a negative binomial distribution (the higher the mean count, the larger the variance).
edgeR estimates genewise dispersions and shrinks them towards a consensus value using an empirical Bayes procedure.
Differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data.
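To see the overdispersion that edgeR models, one can simulate negative binomial counts and compare the variance to the Poisson case; `mu` and `phi` below are arbitrary illustrative values, not edgeR defaults:

```python
import numpy as np

# Under a negative binomial with mean mu and dispersion phi, the
# variance is mu + phi * mu**2 -- it grows faster than the Poisson
# variance, which equals mu.
mu, phi = 100.0, 0.2          # expected variance: 100 + 0.2*100**2 = 2100

# NumPy parameterizes NB by (n, p); convert from (mu, phi).
n = 1.0 / phi
p = n / (n + mu)

rng = np.random.default_rng(0)
draws = rng.negative_binomial(n, p, size=200_000)
print(draws.mean(), draws.var())  # ~100, ~2100 (Poisson would give ~100)
```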
Why clustering? Displaying and analyzing a large transcriptome dataset (microarray, RNA-seq) as a whole is difficult. The data are easier to interpret if they are partitioned into clusters of similar data points. Michael Eisen & David Botstein first applied a clustering algorithm to gene expression microarray data in a 1998 PNAS paper.
Microarray Data. For clustering analysis, use normalized, log-transformed expression values.
Clustering Methods
Agglomerative (bottom-up): start with every element in its own cluster, and iteratively join the closest clusters together.
Divisive (top-down): start with one cluster and iteratively divide it into smaller clusters.
Hierarchical clustering, the most famous agglomerative method, organizes elements into a tree: leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees.
Agglomerative vs. Divisive: Illustrative Example
Agglomerative and divisive clustering on the data set {a, b, c, d, e}.
[figure: agglomerative clustering merges the objects bottom-up over steps 0-4; divisive clustering splits them top-down over the same steps]
Distance metric
To measure similarity or dissimilarity between two genes:
Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Pearson correlation coefficient (r): positive correlation (left), negative correlation (right)
[figure: two genes (black, red) measured under 4 conditions]
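Both metrics are easy to compute directly. The sketch below uses two made-up expression profiles that differ only by a constant offset, to show how Euclidean distance and Pearson correlation can disagree about "similarity":

```python
import math

# Hypothetical expression values for two genes across 4 conditions
# (illustrative numbers, not from the lecture's figure).
gene_black = [1.0, 2.0, 3.0, 4.0]
gene_red   = [2.0, 3.0, 4.0, 5.0]

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The genes differ by a constant offset: the Euclidean distance is
# nonzero, yet the two profiles are perfectly correlated.
print(euclidean(gene_black, gene_red))  # 2.0
print(pearson(gene_black, gene_red))    # ~1.0
```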
Cluster Distance Measures
Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
Average link: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
Define d(C, C) = 0: the distance between two identical clusters is zero. (Ke Chen, University of Manchester, COMP24111)
Cluster Distance Measures
Example: given a data set of five objects characterised by a single feature, assume there are two clusters: C1 = {a, b} and C2 = {c, d, e}.
1. Calculate the distance matrix.
2. Calculate the three cluster distances (single link, complete link, average) between C1 and C2.
Object:  a  b  c  d  e
Feature: 1  2  4  5  6
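A quick way to check your answers to this exercise (the feature values are taken from the slide; with a single feature the distance is just |x − y|):

```python
from itertools import product

# The five objects' single feature values from the slide.
feat = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

# All pairwise distances between the two clusters.
d = [abs(feat[p] - feat[q]) for p, q in product(C1, C2)]

print(min(d))            # single link:   d(b, c) = 2
print(max(d))            # complete link: d(a, e) = 5
print(sum(d) / len(d))   # average link:  21 / 6  = 3.5
```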
Agglomerative Algorithm
The agglomerative algorithm is carried out in three steps:
1. Convert all object features into a distance matrix.
2. Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one (or a known number of clusters): merge the two closest clusters, then update the distance matrix.
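The three steps above can be sketched as a naive implementation; single linkage on 1-D points is assumed purely for illustration:

```python
def agglomerate(points, target_clusters=2):
    """Naive agglomerative clustering of 1-D points with single linkage."""
    # Step 2: every object starts in its own cluster.
    clusters = [[p] for p in points]
    # Step 3: repeat until the desired number of clusters remains.
    while len(clusters) > target_clusters:
        # Find the two closest clusters
        # (single link: minimum pairwise distance between members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Merge them; the "distance matrix update" is implicit here because
        # distances are recomputed from cluster members on each pass.
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1, 2, 4, 5, 6]))  # [[1, 2], [4, 5, 6]]
```

This is O(n³) and only meant to make the steps concrete; real implementations maintain and update an explicit distance matrix instead of recomputing distances.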
Example: Hierarchical clustering Problem: clustering analysis with the agglomerative algorithm (data matrix → Euclidean distance → distance matrix)
Example: Hierarchical clustering Merge two closest clusters (iteration 1)
Example: Hierarchical clustering Update distance matrix (iteration 1)
Example: Hierarchical clustering Merge two closest clusters (iteration 2)
Example: Hierarchical clustering Update distance matrix (iteration 2)
Example: Hierarchical clustering Merge two closest clusters / update distance matrix (iteration 3)
Example: Hierarchical clustering
Example: Hierarchical clustering Final result (meeting the termination condition)
Example: Hierarchical clustering. Dendrogram tree representation.
In the beginning we have 6 clusters: A, B, C, D, E and F.
We merge clusters D and F into (D, F) at distance 0.50.
We merge clusters A and B into (A, B) at distance 0.71.
We merge clusters E and (D, F) into ((D, F), E) at distance 1.00.
We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
The last cluster contains all the objects, which concludes the computation.
In a dendrogram, the horizontal axis indexes all objects in the data set, while the vertical axis expresses the lifetime of every cluster formed. The lifetime of a cluster is the distance interval from the moment the cluster is created to the moment it disappears by merging with another cluster.
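In practice the merges and the dendrogram come from a library such as SciPy. The 2-D coordinates below are hypothetical, chosen only so that single linkage reproduces the merge distances listed above; they are not the lecture's actual data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical coordinates for objects A-F.
points = np.array([[0.0, 2.0], [0.5, 2.5],   # A, B
                   [4.0, 1.0],               # C
                   [3.0, 2.5], [3.0, 3.5],   # D, E
                   [3.0, 2.0]])              # F
labels = list("ABCDEF")

# Each row of Z records one merge; column 2 is the merge distance,
# i.e. the height at which that cluster appears in the dendrogram.
Z = linkage(pdist(points), method="single")
print(np.round(Z[:, 2], 2))  # merge heights: 0.5, 0.71, 1.0, 1.41, 2.5

# dendrogram(Z, labels=labels) draws the tree (requires matplotlib).
```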
Hierarchical clustering workflow
1. Prepare a normalized expression data matrix in log2 scale.
2. Select highly variable genes.
3. Calculate pairwise distances for the selected genes.
4. Perform clustering.
5. Display the results as a heatmap.
“Never try with the whole-genome dataset.”
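A sketch of this workflow in Python with SciPy. The random matrix stands in for a real log2-scale expression matrix, and `seaborn.clustermap` is one common choice for the heatmap step:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Step 1 (assumed done): a normalized, log2-scale expression matrix,
# genes x samples. Random numbers here stand in for real data.
rng = np.random.default_rng(1)
expr = rng.normal(size=(1000, 6))

# Step 2: select the most variable genes rather than the whole genome.
var = expr.var(axis=1)
top = expr[np.argsort(var)[-100:]]

# Step 3: Pearson-correlation distance (1 - r) between selected genes.
dist = pdist(top, metric="correlation")

# Step 4: hierarchical clustering with average linkage.
Z = linkage(dist, method="average")
print(Z.shape)  # (99, 4): one row per merge

# Step 5: seaborn.clustermap(top) would display the clustered heatmap.
```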