Presentation transcript:

1 baySeq homework
HS analysis: Out of 7388 genes with data, 1995 genes were DE at FDR <1% and 3158 genes were DE at FDR <5%.
There were 3,582 genes with an average fold-change >2X (1.0 in log2 space) 2,669 (63%)
BUT HS + EtOH analysis (added 2 replicates of a new condition): Only 1618 genes were DE (under any of the models) at an FDR of 5%.
??? Why so few, when 3158 met this cutoff when HS was analyzed alone?
baySeq paper: harder to call DE with "more complex" models

2 How well did baySeq do on the HS only analysis?
3158 genes at FDR <0.05 (10K iterations on the prior calculation)
[Plot: HS log2 fold-change rep1 vs. HS log2 fold-change rep2]

3 How well did baySeq do on the HS only analysis?
[Plot: HS log2 fold-change rep1 vs. HS log2 fold-change rep2]
902 genes had FDR >5% but fold-change >1.5X in both replicates.
~50% of these: low counts.
Many of the remaining genes were missed due to day-to-day variation that is not accounted for without pairing the data.
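The fold-change filter on this slide can be sketched in a few lines. This is only an illustration: the gene names and log2 values are made up, and applying the 1.5X cutoff to changes in either direction is an assumption.

```python
import math

# 1.5X fold-change expressed in log2 space (about 0.585)
CUTOFF = math.log2(1.5)

def passes_in_all_replicates(log2_fcs, cutoff=CUTOFF):
    """True if every replicate changes by more than the cutoff, all in the same direction."""
    up = all(fc > cutoff for fc in log2_fcs)
    down = all(fc < -cutoff for fc in log2_fcs)
    return up or down

# Hypothetical log2 fold-changes (rep1, rep2) for three made-up genes
toy = {
    "geneA": (1.2, 0.9),    # up in both replicates
    "geneB": (0.8, 0.3),    # rep2 falls below the cutoff
    "geneC": (-1.1, -0.7),  # down in both replicates
}

selected = [gene for gene, fcs in toy.items() if passes_in_all_replicates(fcs)]
print(selected)
```

A gene that clears this filter in every replicate but still has a high FDR is the kind of case the slide is asking about.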

4 How well did baySeq do on the HS + EtOH analysis?
1618 genes had FDR <0.05 for at least one DE model.
Models:
NDE = 1,1,1,1,1,1
DEH = 1,1,2,2,1,1
DEE = 1,1,1,1,2,2
DEHE = 1,1,2,2,2,2
DEHE2 = 1,1,2,2,3,3
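One way to read the model vectors: samples that share a number are treated as equivalently expressed under that model. The sketch below just makes that grouping explicit. The sample names and their order (control, HS, EtOH) are assumptions inferred from the model names, not something the slide states.

```python
# Model vectors as listed on the slide
MODELS = {
    "NDE":   (1, 1, 1, 1, 1, 1),
    "DEH":   (1, 1, 2, 2, 1, 1),
    "DEE":   (1, 1, 1, 1, 2, 2),
    "DEHE":  (1, 1, 2, 2, 2, 2),
    "DEHE2": (1, 1, 2, 2, 3, 3),
}

# Assumed sample order (hypothetical labels)
SAMPLES = ("ctrl_1", "ctrl_2", "HS_1", "HS_2", "EtOH_1", "EtOH_2")

def groups(model):
    """Return the sets of samples that share a label under one model."""
    out = {}
    for sample, label in zip(SAMPLES, model):
        out.setdefault(label, []).append(sample)
    return list(out.values())

for name, model in MODELS.items():
    print(name, groups(model))
```

Under this reading, DEHE says HS and EtOH differ from control in the same way, while DEHE2 lets all three conditions differ from one another.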

5 How well did baySeq do on the HS + EtOH analysis?
But, 1391 genes had FDR >0.05 for all DE models but at least a 1.5X expression change in all 4 samples.
Why weren't these identified as DE?
218 of these genes were DE when HS was analyzed ALONE.

6 Assessing sensitivity (with VLOOKUP in Excel)
There were 64 known Hsf1 targets *with data* in the file.
My run identified 38 of those at the stricter FDR cutoff: 38/64 → 59.4% sensitivity
45 were identified at an FDR of 0.05: 45/64 → 70% sensitivity
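The same lookup can be done outside Excel. A minimal sketch, with made-up gene identifiers in the toy example; only the counts 38, 45 and 64 come from the slide.

```python
def sensitivity(called, known):
    """Percentage of known targets that appear among the called genes."""
    known = set(known)
    return 100.0 * len(known & set(called)) / len(known)

# Toy example with hypothetical gene IDs: 2 of 4 known targets are recovered
known_targets = {"t1", "t2", "t3", "t4"}
called_de = {"t1", "t3", "x9"}
toy_result = sensitivity(called_de, known_targets)
print(f"toy: {toy_result:.1f}%")

# The arithmetic behind the slide's two numbers
KNOWN = 64
for found in (38, 45):
    print(f"{found}/{KNOWN} -> {100.0 * found / KNOWN:.1f}% sensitivity")
```

The set intersection plays the role of VLOOKUP: it counts how many known targets show up in the list of DE calls.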

7 LAST TIME:
Gene X:   X1             X2             X3
          Array 1        Array 2        Array 3
          x coordinate   y coordinate   z coordinate

8 LAST TIME:
4. Centroid linkage clustering
'centroid' (average vector)

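9 Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
(Array 1, Array 2, Array 3, Array 4, Array 5)
Sometimes, want to use the weighted Pearson correlation.
For example: if these arrays are identical, the data are over-represented 3X

$$S_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right), \qquad \sigma_X = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}, \quad \sigma_Y = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N}}$$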
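A minimal sketch of the unweighted Pearson correlation between two genes, written out term by term rather than with a library call. The expression values are made up for illustration.

```python
import math

def pearson(x, y):
    """Unweighted Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    # Average product of the standardized values across arrays
    return sum(((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
               for xi, yi in zip(x, y)) / n

# Hypothetical log-ratios for two genes across five arrays
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.0, 4.0, 6.0, 8.0, 10.0]   # same shape as gene_x, larger amplitude

corr = pearson(gene_x, gene_y)
distance = 1.0 - corr                 # Pearson distance
print(corr, distance)
```

Because each gene is standardized first, two genes with the same profile shape but different amplitudes still correlate perfectly.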

10 Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
(Array 1, Array 2, Array 3, Array 4, Array 5)
Sometimes, want to use the weighted Pearson correlation.
For example: if these arrays are identical, the data are over-represented 3X -- can weight experiments i = 3,4,5 by w = 0.33

$$S_{x,y} = \frac{\sum_{i=1}^{N} w_i \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)}{\sum_{i=1}^{N} w_i}$$

where $\bar{X}, \sigma_X$ (and likewise $\bar{Y}, \sigma_Y$) are the mean and standard deviation of the gene across the arrays.

Where $w_i = 1 / L_i$, and:
k = array correlation cutoff
d = Pearson distance (= 1 - Pearson correlation)
n = exponent (usually 1)
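A sketch of how the weighting might work in practice. Two things here are assumptions: the means and standard deviations are computed with the same weights (the standard weighted form), and L_i is taken to be the local density score described in the Cluster 3.0 manual, L_i = sum over arrays j with d(i,j) <= k of ((k - d(i,j)) / k)^n. The slide text itself does not spell out L_i, and all numbers below are made up.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation with per-array weights (weighted means and variances)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y)) / total
    return cov / math.sqrt(var_x * var_y)

def array_weights(dist, k, n=1):
    """w_i = 1 / L_i, with L_i the assumed local density score (see lead-in)."""
    weights = []
    for row in dist:
        density = sum(((k - d) / k) ** n for d in row if d <= k)
        weights.append(1.0 / density)   # density >= 1 because d(i, i) = 0
    return weights

# Hypothetical Pearson distances between five arrays; arrays 3, 4, 5 are identical
dist = [
    [0.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
]
w = array_weights(dist, k=0.5)          # [1, 1, 1/3, 1/3, 1/3]

gene_x = [1.0, 3.0, 2.0, 2.0, 2.0]
gene_y = [2.0, 1.0, 4.0, 4.0, 4.0]
weighted = weighted_pearson(gene_x, gene_y, w)
collapsed = weighted_pearson(gene_x[:3], gene_y[:3], [1.0, 1.0, 1.0])
print(w, weighted, collapsed)
```

With weights of 1/3 on the three identical arrays, the weighted correlation matches what you would get after collapsing them into a single array, which is the point of the down-weighting.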

11 [Figure panels: Unweighted Pearson correlation | Weighted Pearson correlation]

12 [Figure panels: Unweighted Pearson correlation | Weighted Pearson correlation]

13 Alizadeh et al.
Can also cluster array experiments based on global similarity in expression

14 Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Tree figure, leaf order: A B C D F E]
-- Distance between genes is proportional to the total branch length between genes (not the distance on the y-axis)
-- Orientation of the nodes is irrelevant ... although some clustering programs try to organize nodes in some way.

15 Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Tree figures, leaf orders: A B C D F E and C F E D A B]
-- Distance between genes is proportional to the total branch length between genes (not the distance on the y-axis)
-- Orientation of the nodes is irrelevant ... although some clustering programs try to organize nodes in some way.
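Both points can be checked on a toy tree. In this sketch a tree is a nested list of (child, branch length) pairs with leaf names as strings. The representation, the tree and its branch lengths are all made up for illustration.

```python
def path_lengths(node):
    """Map each leaf under this node to its branch-length distance from the node."""
    if isinstance(node, str):
        return {node: 0.0}
    out = {}
    for child, length in node:
        for leaf, d in path_lengths(child).items():
            out[leaf] = d + length
    return out

def tree_distance(node, a, b):
    """Total branch length on the path between leaves a and b."""
    for child, _ in node:
        if not isinstance(child, str):
            leaves = path_lengths(child)
            if a in leaves and b in leaves:
                return tree_distance(child, a, b)   # both leaves sit below this child
    depths = path_lengths(node)                     # this node is their common ancestor
    return depths[a] + depths[b]

def flip(node):
    """Reverse the left/right orientation of every node."""
    if isinstance(node, str):
        return node
    return [(flip(child), length) for child, length in reversed(node)]

def leaf_order(node):
    """Leaves in the order they would be drawn."""
    if isinstance(node, str):
        return [node]
    return [leaf for child, _ in node for leaf in leaf_order(child)]

# Hypothetical tree ((A, B), (C, D)) with made-up branch lengths
tree = [([("A", 1.0), ("B", 1.0)], 2.0), ([("C", 0.5), ("D", 0.5)], 2.5)]
flipped = flip(tree)

print(leaf_order(tree), leaf_order(flipped))
print(tree_distance(tree, "A", "C"), tree_distance(flipped, "A", "C"))
```

The drawn leaf order changes completely after flipping, but every pairwise branch-length distance stays the same.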

16 Genes involved in the same cellular process are often coregulated
These genes may not have the same annotation, but still function together and are thus co-expressed

17 M choose i = # of possible groups of size i composed of the M objects

$$\binom{M}{i} = \frac{M!}{(M-i)! \, i!}$$
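A quick numeric check of the formula, using Python's built-in `math.comb` (Python 3.8+) alongside the factorial form. The gene count of 6000 is a made-up example, not a figure from the slides.

```python
import math

def n_groups(m, i):
    """Number of possible groups of size i drawn from m objects (factorial form)."""
    return math.factorial(m) // (math.factorial(m - i) * math.factorial(i))

small = n_groups(10, 3)
pairs = n_groups(6000, 2)      # hypothetical: all gene pairs among 6000 genes
print(small, pairs, math.comb(6000, 10))
```

The count grows very quickly with group size, which is why the possible groupings cannot simply be enumerated.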

18 Advantages and Disadvantages of Hierarchical clustering
Advantages:
1) Straightforward
2) Captures biological information relatively well
Disadvantages:
1) Doesn't give discrete clusters ... need to define clusters with cutoffs
2) Hierarchical arrangement does not always represent data appropriately -- sometimes a hierarchy is not appropriate: each gene can belong to only one cluster.
3) Get different clustering for different experiment sets
THERE IS NO ONE PERFECT CLUSTERING METHOD

19 Partitioning (or top-down) clustering method: k-means clustering
-- Randomly split the data into k groups with equal numbers of genes
-- Calculate the centroid of each group
-- Reassign each gene to the centroid to which it is most similar
-- Calculate a new centroid for each group, reassign genes, etc ... iterate until stable
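The steps on this slide can be sketched in plain Python. This is an illustration, not a reference implementation: similarity is taken to be squared Euclidean distance, an empty group is re-seeded with a random data point, and the data points are made up. All three are assumptions.

```python
import random

def centroid(vectors):
    """Average vector of a group."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly split the data into k groups of (nearly) equal size
    order = list(range(len(data)))
    rng.shuffle(order)
    assign = [0] * len(data)
    for pos, idx in enumerate(order):
        assign[idx] = pos % k
    cents = []
    for _ in range(max_iter):
        # Step 2: calculate the centroid of each group
        cents = []
        for g in range(k):
            members = [data[i] for i in range(len(data)) if assign[i] == g]
            cents.append(centroid(members) if members
                         else data[rng.randrange(len(data))])
        # Step 3: reassign each gene to the most similar centroid
        new_assign = [min(range(k), key=lambda g: sq_dist(v, cents[g]))
                      for v in data]
        # Step 4: iterate until the assignment is stable
        if new_assign == assign:
            break
        assign = new_assign
    return assign, cents

# Hypothetical expression vectors forming two well-separated groups
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(data, k=2, seed=1)
print(labels)
```

The `seed` argument fixes the random initial split. On less clearly separated data, different seeds can give different final clusters.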

22 Partitioning (or top-down) clustering method: k-means clustering
What are the disadvantages of k-means clustering?
- Need to know how many clusters to ask for (can define this empirically)
- Genes are not organized within each cluster (can hierarchically cluster genes afterwards or use SOM analysis)
- The random starting split makes this a non-deterministic method: repeated runs can give different clusters