Advanced Methods of Data Analysis
9:00 - 10:00 CTWC
10:00 - 11:00 CTWC exercise
11:00 - 11:30 Break
11:30 - 12:00 SPIN
12:00 - 13:00 SPIN exercise


Course on Microarray Data Acquisition and Analysis, Weizmann Institute of Science, 16 May 2007
Presented by Tal Shay & Yuval Tabach, Weizmann Institute of Science, Rehovot, Israel

Coupled Two-Way Clustering (CTWC)
Gad Getz, Erel Levine, and Eytan Domany. Coupled two-way clustering analysis of gene microarray data. PNAS 97 (2000).

Talk Aim
Show how to use the CTWC server to analyze microarray data properly.

Motivation
Microarray experiments generate millions of numbers that contain a lot of biological information.
The problem: the data are very complicated and contain a large amount of noise. How can we unravel the biological information that is masked by a mass of irrelevant information?
CTWC is a simple heuristic clustering procedure that was developed specifically to cope with microarray data.

Talk Outline
– Preprocessing and filtering
– Clustering of Genes and Conditions
– Super-Paramagnetic Clustering (SPC)
– Coupled Two-Way Clustering (CTWC)
– CTWC server
– Exercise

Gene Expression Matrix – CTWC format

DB_NAME   Name    Sample1   Sample2   Sample3
Acc1      Gene1   E11       E12       E13
Acc2      Gene2   E21       E22       E23
Acc3      Gene3   E31       E32       E33

The DB_NAME is used to link genes to a database.

Visualization of Expression Matrix
– Column = chip (= sample)
– Row = probeset
– Color = expression level
Axes: genes (rows), samples (columns).

Preprocessing
Start from the initial expression matrix (genes × samples).
1. Select variable genes: keep the 1000 probesets with the highest standard deviation.
2. Standardize the selected probesets.
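The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration, not the CTWC server's code. It assumes that "standardize" means shifting each probeset (row) to mean 0 and scaling it to standard deviation 1, and that the population standard deviation is used; the slides specify neither. The function names are hypothetical.

```python
from statistics import mean, pstdev

def select_variable_rows(matrix, k):
    """Indices of the k rows (probesets) with the highest standard deviation."""
    ranked = sorted(range(len(matrix)), key=lambda i: pstdev(matrix[i]), reverse=True)
    return sorted(ranked[:k])  # keep the original row order

def standardize_row(row):
    """Assumed standardization: mean 0 and standard deviation 1 per row."""
    m, s = mean(row), pstdev(row)
    if s == 0:
        return [0.0 for _ in row]  # a constant row carries no signal
    return [(x - m) / s for x in row]

def preprocess(matrix, k=1000):
    """Step 1: select the k most variable rows. Step 2: standardize them."""
    keep = select_variable_rows(matrix, k)
    return keep, [standardize_row(matrix[i]) for i in keep]

# Toy matrix: 3 probesets x 3 samples; the constant middle row is filtered out.
toy = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [0.0, 10.0, 20.0]]
kept, standardized = preprocess(toy, k=2)
print(kept)
```

On a real chip `k` would be 1000, as on the slide; the toy call uses `k=2` only to keep the example small.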


What questions can we ask?
Supervised methods: hypothesis testing (use predefined labels)
– Which genes are expressed differently in two known types of samples?
– What is the minimal set of genes needed to distinguish one type of sample from the others?
Unsupervised methods: exploratory analysis (use only the data)
– Which genes behave similarly in the experiments?
– How many different types of samples are there?

Clustering – unsupervised analysis
All genes are first filtered into low-variation and high-variation genes. Clustering the high-variation genes gives 3 clusters, each containing highly correlated genes.

Unsupervised Analysis: Clustering Methods
Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and might be co-regulated. Learn about the biology, infer function.
Goal B: Divide conditions into groups with similar gene expression profiles. Examples: find sub-types of a disease, or group drugs according to their effect.

DEFINITION OF THE CLUSTERING PROBLEM
(Figure: giraffe)

CLUSTER ANALYSIS YIELDS DENDROGRAM
Dendrogram axis: T (RESOLUTION)
How many clusters do we have? The answer depends on the resolution.

BUT WHAT ABOUT THE OKAPI?
(Figure: giraffe and okapi)

Clustering problem definition
Input: N data points X_i, i = 1, 2, …, N, in a D-dimensional space.
Goal: Find “natural” groups (clusters) of points. Points that belong to the same cluster are “more similar” to each other than to points in other clusters.

Clustering is not well defined
Similarity: which points should be considered close?
Clustering method:
– Resolution: specify/hierarchical results
– Shape of clusters: general, spherical

Agglomerative Hierarchical Clustering
Dendrogram, with the distance between joined clusters on its axis.

Need to define the distance between the new cluster and the other clusters.
– Single Linkage: distance between the closest pair.
– Complete Linkage: distance between the farthest pair.
– Average Linkage: average distance between all pairs, or distance between cluster centers.
(Figure panels: Single Linkage, Average Linkage)
Conclusion: the clustering result depends on the method we are using.
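The three linkage rules translate directly into code. The sketch below assumes Euclidean distance between points and implements the "all pairs" variant of average linkage (not the cluster-center variant); the function names are illustrative.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def pair_distances(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    """Distance between the closest pair."""
    return min(pair_distances(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair."""
    return max(pair_distances(a, b))

def average_linkage(a, b):
    """Average distance over all pairs."""
    d = pair_distances(a, b)
    return sum(d) / len(d)

# Two small clusters on a line: the three rules give three different distances.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
```

Even on this tiny example the three rules disagree (2.0, 5.0 and 3.5), which is why the choice of linkage changes the resulting tree.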

Agglomerative Hierarchical Clustering
Results depend on the distance update method:
– Single Linkage: elongated clusters
– Average Linkage: sphere-like clusters
Greedy iterative process.
NOT robust against noise.
Does not always find the “natural” clusters.
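The greedy process can be sketched as a loop that repeatedly joins the two closest clusters and records the distance at which they were joined. This is a deliberately naive illustration (it recomputes every distance at every step); the `agglomerate` function and its `linkage` argument are hypothetical names. Passing `min` gives single linkage and `max` gives complete linkage.

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerate(points, linkage=min):
    """Greedy agglomerative clustering.

    Returns the merges in order, each as (distance, sorted member indices).
    linkage reduces the pairwise distances between two clusters to one number.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((d, sorted(merged)))
        # The merge is never undone, which is what makes the process greedy.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

points = [(0.0,), (1.0,), (10.0,)]
print(agglomerate(points, linkage=min))
print(agglomerate(points, linkage=max))
```

The recorded distances are the heights at which branches join in the dendrogram; the last merge happens at 9.0 under single linkage and at 10.0 under complete linkage for the same three points.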

Stop … think
We want to identify the real (“natural”) clusters.
We should have a reliability parameter that will help us to distinguish between significant and non-significant clusters.


Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
The idea behind SPC is based on the physical properties of dilute magnets, which are made of small elements (spins).
The method calculates the correlation between magnet orientations at different temperatures (T).
(Figure panels: T = low, T = high, T = intermediate)

Phases of the Inhomogeneous Potts Ferromagnet
– T = low: ferromagnetic
– T = intermediate: super-paramagnetic
– T = high: paramagnetic

Super-Paramagnetic Clustering (SPC)
(Figure panels: T = low, T = intermediate, T = high)

Super-Paramagnetic Clustering (SPC)
The algorithm simulates the magnets' behavior over a range of temperatures and decides which interactions to break.
The temperature (T) controls the resolution.
Example: N = 4800 points in D = 2.
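For readers who want the model behind the simulation, the energy function is usually written as a Potts Hamiltonian on a neighbor graph of the data points. The form below is recalled from the Blatt, Wiseman and Domany paper rather than taken from these slides, so the symbols should be checked against that paper: s_i is the spin (one of q states) on data point i, d_ij is the distance between neighboring points i and j, a is the average nearest-neighbor distance, and K̂ is the average number of neighbors.

```latex
% Potts Hamiltonian used in SPC (form as recalled from Blatt, Wiseman and Domany)
\[
  \mathcal{H}(S) = \sum_{\langle i,j \rangle} J_{ij}\,\bigl(1 - \delta_{s_i, s_j}\bigr),
  \qquad s_i \in \{1, \dots, q\}
\]
% Couplings decay with distance, so nearby points prefer to share a spin state
\[
  J_{ij} = \frac{1}{\hat{K}} \exp\!\left(-\frac{d_{ij}^{2}}{2a^{2}}\right)
\]
```

Nearby points have strong couplings and stay aligned up to higher temperatures, while weak couplings between distant groups break first. This is why raising T splits the data into progressively smaller clusters.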

Identify the stable clusters T=16

Same data - Average Linkage

Advantages of SPC
– Scans all resolutions (T)
– Robust against noise and initialization: calculates collective correlations
– Identifies “natural” and stable clusters (ΔT)
– No need to pre-specify the number of clusters
– Clusters can be any shape

Inside SPC: dendrogram and stable clusters
Dendrogram axis: T
Parameters:
– Min Cluster Size: 3
– Stable Delta T: 14
– Ignore dropout: 1
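One plausible reading of the first two parameters is a filter on the dendrogram: a cluster is kept if it has enough members and survives over a wide enough temperature range. The sketch below encodes that reading only. The record layout (`members`, `t_born`, `t_split`), the units of T, and the function name are assumptions made for illustration, and the "Ignore dropout" parameter is not modeled because the slides do not define it.

```python
def stable_clusters(clusters, min_cluster_size=3, stable_delta_t=14):
    """Keep clusters that are large enough and live long enough in T.

    Each cluster is a dict with hypothetical keys:
      members - list of point indices
      t_born  - temperature step at which the cluster first appears
      t_split - temperature step at which it breaks up
    """
    selected = []
    for c in clusters:
        lifetime = c["t_split"] - c["t_born"]  # the cluster's delta T
        if len(c["members"]) >= min_cluster_size and lifetime >= stable_delta_t:
            selected.append(c)
    return selected

dendrogram = [
    {"name": "A", "members": list(range(40)), "t_born": 2, "t_split": 20},
    {"name": "B", "members": list(range(5)), "t_born": 10, "t_split": 12},
    {"name": "C", "members": [0, 1], "t_born": 3, "t_split": 30},
]
names = [c["name"] for c in stable_clusters(dendrogram)]
print(names)
```

In this toy dendrogram only cluster A passes: B breaks up too quickly and C is too small.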

CTWC server - Setting the SPC parameters
(Screenshot panels: Genes, Samples)


Back to gene expression data
2 goals: cluster genes and conditions.
2 independent clusterings:
– Genes are represented as vectors of expression in all conditions
– Conditions are represented as vectors of expression of all genes
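In matrix terms the two clusterings use the same data read in two directions: rows for genes, columns for conditions. A minimal sketch, assuming the expression matrix is stored as a list of rows with one row per gene (the function names are illustrative):

```python
def gene_vectors(matrix):
    """One vector per gene: its expression in all conditions (the rows)."""
    return [list(row) for row in matrix]

def condition_vectors(matrix):
    """One vector per condition: the expression of all genes (the columns)."""
    return [list(col) for col in zip(*matrix)]

# 2 genes x 3 conditions
matrix = [[1, 2, 3],
          [4, 5, 6]]
print(gene_vectors(matrix))       # 2 points in D = 3
print(condition_vectors(matrix))  # 3 points in D = 2
```

Either list can then be handed to the same clustering routine; only the dimension D of the points changes.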

1. Identify tissue classes (tumor/normal)
First clustering - Experiments: each experiment is a point in D = 2000 dimensions (one per gene).

2. Find Differentiating And Correlated Genes
Second clustering - Genes: each gene is a point in D = 62 dimensions (one per sample).

TWO-WAY CLUSTERING
S1(G1): all samples (S1) clustered using all genes (G1).
G1(S1): all genes (G1) clustered using all samples (S1).

TWO-WAY CLUSTERING, ordered: S1(G1), G1(S1)

(Figure labels: Football, Song A, Song B)

Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
Philosophy:
– Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players.
– Only a subset of the samples exhibits the expression patterns of interest.
New goal: use subsets of genes to study subsets of samples (and vice versa).
This is a non-trivial task because the number of subsets is exponential. CTWC is a heuristic to solve this problem.

Inside CTWC: Iterations

Depth | Genes: clustering                   | Genes: clusters found    | Samples: clustering             | Samples: clusters found
Init  |                                     | G1                       |                                 | S1
1     | G1(S1)                              | G2,G3,…G5                | S1(G1)                          | S2,S3
2     | G1(S2) G1(S3)                       | G6,G7,….G13 G14,…G21     | S1(G2) … S1(G5)                 | S4,S5,S6 S10,S11 None
3     | G2(S1)…G2(S3) … G5(S1)…G5(S3)       | G22… … …G97              | S2(G1)…S2(G5) S3(G1)…S3(G5)     | S12,… …S51
4     | G1(S4) … G1(S11)                    | G98,..G105 … G151,..G160 | S1(G6) … S1(G21)                | S52,... S67
5     | G2(S4)...G2(S11) … G5(S4)...G5(S11) | G161… … …G216            | S2(G6)...S2(G21) S3(G6)…S3(G21) | S68… …S113

Depth 1 is ordinary two-way clustering.
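The iteration in the table can be sketched as a loop over (gene set, sample set) pairs. This is a simplified illustration, not the server's implementation. It visits every new pair at each depth, which is a different order from the table. It also takes the clustering step as a plug-in function, `cluster_fn`, which stands in for SPC and returns only the clusters it considers stable. The names `ctwc`, `cluster_fn` and `split_at_mean` are hypothetical, and `split_at_mean` is a toy stand-in with no relation to SPC.

```python
def ctwc(matrix, cluster_fn, max_depth=5):
    """Coupled two-way clustering sketch on a genes x samples matrix.

    cluster_fn(vectors) returns a list of clusters, each a list of
    positions into vectors.
    """
    gene_sets = [tuple(range(len(matrix)))]       # G1: all genes
    sample_sets = [tuple(range(len(matrix[0])))]  # S1: all samples
    done = set()
    for _ in range(max_depth):
        new_genes, new_samples = [], []
        for g in gene_sets:
            for s in sample_sets:
                if (g, s) in done:
                    continue
                done.add((g, s))
                # G(S): cluster the genes of g using only the samples of s.
                for c in cluster_fn([[matrix[i][j] for j in s] for i in g]):
                    sub = tuple(g[k] for k in c)
                    if sub not in gene_sets and sub not in new_genes:
                        new_genes.append(sub)
                # S(G): cluster the samples of s using only the genes of g.
                for c in cluster_fn([[matrix[i][j] for i in g] for j in s]):
                    sub = tuple(s[k] for k in c)
                    if sub not in sample_sets and sub not in new_samples:
                        new_samples.append(sub)
        if not new_genes and not new_samples:
            break  # no new stable clusters: stop
        gene_sets += new_genes
        sample_sets += new_samples
    return gene_sets, sample_sets

def split_at_mean(vectors):
    """Toy stand-in for SPC: split vectors by their mean value."""
    means = [sum(v) / len(v) for v in vectors]
    cut = sum(means) / len(means)
    low = [i for i, m in enumerate(means) if m <= cut]
    high = [i for i, m in enumerate(means) if m > cut]
    return [c for c in (low, high) if 2 <= len(c) < len(vectors)]

matrix = [[0, 0, 10, 10],
          [0, 0, 10, 10],
          [5, 5, 5, 5],
          [5, 5, 5, 5]]
genes, samples = ctwc(matrix, split_at_mean)
print(genes)
print(samples)
```

In this toy matrix the gene clusters (0, 1) and (2, 3) are invisible when all four samples are used, because every gene has the same mean. They appear only once the genes are clustered on a sample subset, which is the effect the CTWC philosophy describes.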

CTWC server - Setting the coupled two-way clustering parameters
(Screenshot label: notification)

COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES
Tissues (S1) clustered using gene clusters G4 and G12: S1(G4), S1(G12)

COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES
CTWC colon cancer - tissues: S1(G4), S1(G12), S17

What kind of results do you wish to find?
(Figure labels: colon cancer: carcinoma + adenoma; type A / type B; distance matrix)


CTWC software
Web interface
– ctwc.weizmann.ac.il
– ctwc.bioz.unibas.ch
Standalone – Write to

CTWC standalone

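Sample Labels
Given as a binary file.
For a cluster Gx, written C1 below, and a label L with values L1 and L2:
Purity(C1, L1) – how much of C1 is composed of L1?
$\mathrm{Purity}(C_1, L_1) = \dfrac{\#\,L_1 \text{ in } C_1}{|C_1|}$
Efficiency(C1, L1) – how much of L1 is contained in C1?
$\mathrm{Efficiency}(C_1, L_1) = \dfrac{\#\,L_1 \text{ in } C_1}{|L_1|}$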
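Both scores are simple ratios and can be computed directly from a cluster and a table of sample labels. A minimal sketch, assuming labels are held in a dictionary from sample name to label value (the server itself reads them from a file); the sample names, label values and function names are made up for illustration.

```python
def purity(cluster, labels, value):
    """Fraction of the cluster's samples that carry the given label value."""
    hits = sum(1 for s in cluster if labels[s] == value)
    return hits / len(cluster)

def efficiency(cluster, labels, value):
    """Fraction of all samples with the given label value that fall in the cluster."""
    hits = sum(1 for s in cluster if labels[s] == value)
    total = sum(1 for v in labels.values() if v == value)
    return hits / total

labels = {"a": "tumor", "b": "tumor", "c": "normal", "d": "tumor"}
cluster = ["a", "c"]
print(purity(cluster, labels, "tumor"))      # 1 of the 2 cluster members is a tumor
print(efficiency(cluster, labels, "tumor"))  # 1 of the 3 tumors is in the cluster
```

A cluster can score high on one measure and low on the other, so the two are best read together: a pure but inefficient cluster captures only part of a class, while an efficient but impure one mixes classes.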

Biological Work
– Literature search for information on interesting genes.
– Annotation analysis: classify the genes according to their function. Find whether there is a common function or biological meaning for clusters of interest.
– Find what sets of experiments/conditions have in common.
– Genomics analysis: search for a common regulatory signal upstream of the genes.
– Design the next experiment: get more data to validate the result.
Remember: most of your work starts here, with understanding the biology behind your results.

Summary
Clustering methods are used to
– find genes from the same biological process
– group the experiments into similar conditions
Focusing on subsets of the genes and conditions can unravel structure that is masked when using all genes and conditions.
ctwc.weizmann.ac.il or

Exercise - Course Experiment
D8
– NT: D8_NT_s_1b, D8_NT_c_1a, D8_NT_c_2
– 48hr: D8_48h_s_1b, D8_48h_c_1a, D8_48h_c_2
– 72hr: D8_72h_s_1b, D8_72h_c_1a
– 96hr: D8_96h_s_1b, D8_96h_c_1a, D8_96h_c_2
D11
– NT: D11_NT_s_2, D11_NT_c_1a, D11_NT_c_1b
– 48hr: D11_48h_c_1a, D11_48h_c_1b
– 72hr: D11_72h_s_2, D11_72h_c_1a, D11_72h_c_1b
– 96hr: D11_96h_c_1a, D11_96h_c_1b
At time 0 a treatment is given. For D8, the treatment suppresses mutp53. For D11, it does not.

The Data
Save and back up the CEL files!

R Code – From CEL to EXCEL
> library(affy)
> A = ReadAffy()   # reads the CEL files in the working directory
> rma_data = rma(A)   # RMA expression values
> write.exprs(rma_data, file='rma_expression.txt')
> mas5_data = mas5(A)   # MAS5 expression values
> write.exprs(mas5_data, file = 'mas5_expression')
> mas5_calls = mas5calls(A)   # MAS5 detection calls
> write.exprs(mas5_calls, file = 'mas5_detection')

The EXCEL
Filter the genes: do not cluster all probesets on the chip!

Edit the EXCEL for CTWC
– Title #1: U133_AFFX
– Title #2: NAME
– Column #2: Probeset info
Make the chip names clear!

Samples distance matrix