APO-SYS workshop on data analysis and pathway charting
Igor Ulitsky, Ron Shamir's Computational Genomics Group
Part I: Presentations: EXPANDER, AMADEUS, SPIKE, MATISSE
Part II: Hands-on Session: EXPANDER, MATISSE, SPIKE
EXPression ANalyzer and DisplayER
Adi Maron-Katz, Chaim Linhart, Amos Tanay, Rani Elkon, Israel Steinfeld, Seagull Shavit, Igor Ulitsky, Roded Sharan, Yossi Shiloh, Ron Shamir
EXPANDER
– Low-level analysis: missing data estimation (KNN or manual); normalization (quantile, loess); filtering (fold change, variation, t-test); standardization (mean 0 std 1, take log, fixed norm)
– High-level gene partition analysis: clustering; biclustering
– Ascribing biological meaning to patterns: enriched functional categories (Gene Ontology); identifying transcriptional regulators by promoter analysis
– Built-in support for 9 organisms: human, mouse, rat, chicken, zebrafish, fly, worm, Arabidopsis, yeast
Input data → Normalization/Filtering → Clustering (CLICK, SOM, K-means, Hierarchical) or Biclustering (SAMBA) → Functional enrichment (TANGO) and Promoter signals (PRIMA); throughout: visualization utilities and links to public annotation databases
EXPANDER – Preprocessing
– Input data: expression matrix (one probe per row, one condition per column), either one-channel data (e.g., Affymetrix) or dual-channel data (cDNA microarrays; values are (log) ratios between the Red and Green channels); ‘.cel’ files; ID conversion file (maps probes to genes); gene sets data
– Data definitions: defining condition subsets; data type & scale (log)
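The dual-channel representation mentioned above can be illustrated with a minimal sketch. It assumes raw, positive Red and Green intensities and computes the log ratio M and the mean log intensity A used in "M vs. A" scatter plots; the function name and the toy intensities are hypothetical and not part of EXPANDER.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A) for dual-channel intensities; inputs must be positive."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    m = np.log2(red) - np.log2(green)          # log2 ratio Red/Green
    a = 0.5 * (np.log2(red) + np.log2(green))  # mean log2 intensity
    return m, a

# Toy intensities for three spots (illustrative values only)
m, a = ma_values([400.0, 100.0, 50.0], [100.0, 100.0, 200.0])
print(m)  # log ratios per spot
```

A spot with equal Red and Green signal gets M = 0, so up- and down-regulation are symmetric around zero on the log scale.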
EXPANDER – Preprocessing (II)
– Data adjustments: missing value estimation (KNN or arbitrary); merging conditions
– Normalization: removal of systematic biases from the analyzed chips. Implemented methods: quantile, lowess
– Visualization: box plots, scatter plots (simple, M vs. A)
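As an illustration of the quantile method, here is a minimal sketch of the textbook procedure: every chip (column) is forced to share the same distribution, namely the mean of the sorted columns. It assumes a complete matrix with no missing values and breaks ties arbitrarily; it is not EXPANDER's own implementation.

```python
import numpy as np

def quantile_normalize(matrix):
    """Quantile-normalize the columns (chips) of a probes x conditions matrix."""
    x = np.asarray(matrix, dtype=float)
    order = np.argsort(x, axis=0)                 # sort order within each column
    ranks = np.argsort(order, axis=0)             # rank of each value in its column
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of the k-th smallest values
    return reference[ranks]                       # replace each value by the reference at its rank

# Toy matrix: 4 probes, 3 chips (illustrative values only)
raw = [[5, 4, 3],
       [2, 1, 4],
       [3, 4, 6],
       [4, 2, 8]]
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization all columns contain the same set of values, so box plots of the chips line up.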
EXPANDER – Preprocessing (III)
– Filtering: focus downstream analysis on the set of “responding genes”. Criteria: fold change, variation, statistical tests (t-test)
– Standardization: create a common scale. For each probe: mean = 0, STD = 1; log data (base 2); fixed norm (divide by the norm of the probe vector)
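The two per-probe standardizations listed above can be sketched as follows. This is a minimal illustration, assuming the population standard deviation (ddof = 0) and the Euclidean norm; the tool's exact conventions are not stated in the slides, and the function names are hypothetical.

```python
import numpy as np

def standardize_rows(x):
    """Scale each probe (row) to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)   # population std (assumption)
    std[std == 0] = 1.0                  # constant rows become all zeros
    return (x - mean) / std

def fixed_norm_rows(x):
    """Divide each probe (row) by the Euclidean norm of its vector."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    norm[norm == 0] = 1.0                # all-zero rows are left unchanged
    return x / norm

# Two probes with the same trend but very different magnitudes
data = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
z = standardize_rows(data)
u = fixed_norm_rows(data)
```

After either step, probes are compared by the shape of their pattern and not by absolute expression level, which is what clustering needs.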
Cluster Analysis
– Partition the responding genes into distinct sets, each with a particular expression pattern
– Identify major patterns in the data: reduce the dimensionality of the problem
– co-expression → co-function; co-expression → co-regulation
– Partition the genes to achieve homogeneity (genes inside a cluster show highly similar expression patterns) and separation (genes from different clusters have different expression patterns)
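One possible way to quantify the two criteria is sketched below. It assumes Pearson correlation as the similarity measure, homogeneity as the average correlation of each gene with its cluster's mean pattern, and separation as the average correlation between mean patterns of different clusters. These definitions are illustrative assumptions, not necessarily the ones EXPANDER reports.

```python
import numpy as np

def homogeneity_and_separation(data, labels):
    """Return (homogeneity, separation) for a genes x conditions matrix.

    Requires at least two clusters and non-constant expression patterns.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)  # sorted cluster labels
    centroids = np.array([data[labels == c].mean(axis=0) for c in clusters])

    # Homogeneity: each gene versus the mean pattern of its own cluster
    gene_corr = [
        np.corrcoef(row, centroids[np.searchsorted(clusters, lab)])[0, 1]
        for row, lab in zip(data, labels)
    ]
    # Separation: mean patterns of different clusters versus each other
    pair_corr = [
        np.corrcoef(centroids[i], centroids[j])[0, 1]
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return float(np.mean(gene_corr)), float(np.mean(pair_corr))

# Two rising genes and two falling genes (toy values)
expr = [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [6, 4, 1.5]]
h, s = homogeneity_and_separation(expr, [0, 0, 1, 1])
```

Under these definitions a good solution has homogeneity close to 1 and a low (here negative) separation value, meaning the cluster patterns are dissimilar.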
Cluster Analysis (II)
– Implemented algorithms: CLICK, K-means, SOM, Hierarchical
– Visualization: mean expression patterns, heat maps
Example study: responses to ionizing radiation
Ionizing radiation → double strand breaks → sensors → ATM → effectors (p53, BRCA1, CHK2) → DNA repair; cell cycle arrest; stress responses; survival pathways; apoptosis; cell death pathways
Example study: experimental design
– Genotypes: Atm-/- and control w.t. mice
– Tissue: lymph node
– Treatment: ionizing radiation
– Time points: 0, 30 min, 120 min
– Microarrays: Affymetrix U74Av2 (12k probesets)
Test case – Data Analysis
– Dataset: six conditions (2 genotypes, 3 time points)
– Normalization
– Filtering step: define the ‘responding genes’ set as the genes whose expression level changed by at least 1.75-fold. Over 700 genes met this criterion
– The set contains genes with various response patterns; we applied CLICK to this set of genes
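The filtering step can be sketched as a simple fold-change test. The sketch assumes positive intensities on a linear (not log) scale and measures the change as the ratio between a probe's highest and lowest level across the conditions; the study may instead have compared each condition with a baseline, so treat this definition as an assumption.

```python
import numpy as np

def responding_genes(intensities, min_fold=1.75):
    """Boolean mask of probes whose max/min ratio across conditions reaches min_fold."""
    x = np.asarray(intensities, dtype=float)
    fold = x.max(axis=1) / x.min(axis=1)   # requires strictly positive values
    return fold >= min_fold

# Three probes, three conditions (illustrative values only)
levels = [[100, 110, 120],   # 1.2-fold: filtered out
          [100, 180, 90],    # 2.0-fold: kept
          [200, 100, 210]]   # 2.1-fold: kept
mask = responding_genes(levels)
print(mask)
```

Only the probes where the mask is true would be passed on to clustering.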
Major Gene Clusters – Irradiated Lymph node: Atm-dependent early responding genes
Major Gene Clusters – Irradiated Lymph node: Atm-dependent 2nd wave of responding genes
Ascribe Functional Meaning to the Clusters
– Gene Ontology (GO) annotations for human, mouse, rat, chicken, fly, worm, Arabidopsis, zebrafish and yeast
– TANGO: applies statistical tests that seek over-represented GO functional categories in the clusters
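A common choice for this kind of over-representation test is the hypergeometric tail probability, sketched below with the standard library only. It answers: if a cluster were drawn at random from the background, how likely is it to contain at least this many genes of a given category? This is a generic illustration; it does not reproduce TANGO's own procedure or its handling of dependent categories, and the function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(background_size, category_size, cluster_size, overlap):
    """P(X >= overlap) when cluster_size genes are drawn without replacement
    from background_size genes, category_size of which belong to the category."""
    total = comb(background_size, cluster_size)
    upper = min(category_size, cluster_size)
    tail = sum(
        comb(category_size, k) * comb(background_size - category_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 10 background genes, 4 in the category, cluster of 3 with 2 category genes
p = hypergeom_enrichment_p(10, 4, 3, 2)
print(p)
```

A small p-value suggests the category is over-represented in the cluster; with many categories tested, the p-values still need a multiple testing correction.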
Enriched GO Functional Categories (TANGO)
– Hierarchical structure → highly dependent categories
– Problems: high redundancy; multiple testing corrections assume independent tests
Functional Enrichment - Visualization
Functional Categories: cell cycle control (p < 1x10^-6)
Functional Categories: cell cycle control (p < 5x10^-6); apoptosis (p = 0.001)
Identify Transcriptional Regulators
[Figure: a hidden layer of regulators (ATM, p53, TF-A, TF-B, TF-C, plus nodes marked “?” and “NEW”) above an observed layer of genes g1–g13]
Clues are in the promoters
‘Reverse engineering’ of transcriptional networks
– Infers regulatory mechanisms from gene expression data
– Assumption: co-expression → transcriptional co-regulation → common cis-regulatory promoter elements
– Step 1: identification of co-expressed genes using microarray technology (clustering algorithms)
– Step 2: computational identification of cis-regulatory elements that are over-represented in the promoters of the co-expressed genes
PRIMA – general description
– Input: target set (e.g., co-expressed genes); background set (e.g., all genes on the chip)
– Analysis: identify transcription factors whose binding site signatures are enriched in the ‘Target set’ with respect to the ‘Background set’
– TF binding site models: TRANSFAC DB
– Default: From bp to 200 bp relative to the TSS
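The target-versus-background comparison can be illustrated with a deliberately simplified sketch. It assumes exact matching of a consensus string on one strand as a stand-in for TRANSFAC position weight matrices, and defines the enrichment factor as the fraction of target promoters with a hit divided by the fraction of background promoters with a hit. The gene names, sequences and function are hypothetical toy inputs, not PRIMA's implementation.

```python
def enrichment_factor(promoters, target_genes, motif):
    """Ratio of hit frequency in the target set to hit frequency in the background.

    promoters: dict mapping every background gene to its promoter sequence.
    """
    def hit_fraction(genes):
        genes = list(genes)
        hits = sum(1 for g in genes if motif in promoters[g])
        return hits / len(genes)

    background_fraction = hit_fraction(promoters)   # all genes on the chip
    if background_fraction == 0:
        return float("nan")                         # motif never occurs
    return hit_fraction(target_genes) / background_fraction

# Toy promoters (illustrative sequences only)
promoters = {
    "g1": "AATGACGTCATT",
    "g2": "CCTGACGTCAGG",
    "g3": "ACGTACGTACGT",
    "g4": "TTTTTTTTTTTT",
}
factor = enrichment_factor(promoters, ["g1", "g2"], "TGACGTCA")
print(factor)
```

An enrichment factor well above 1 flags a candidate regulator; its significance would still have to be assessed with a statistical test against the background.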
Promoter Analysis - Visualization
PRIMA - Results
PRIMA – Results
[Table: Transcription factor | Enrichment factor | P-value. Legible fragments: 6.0x; CREB; NF-κB x10^-8; p x10^-7; STAT x10^-6; Sp x10^-4]
Biclustering
– Clustering becomes too restrictive on large datasets: it seeks a global partition of genes according to similarity in their expression across ALL conditions
– Relevant knowledge can be revealed by identifying genes with a common pattern across a subset of the conditions
– Biclustering algorithmic approach
Biclustering: SAMBA (Statistical Algorithmic Method for Bicluster Analysis)
A. Tanay, R. Sharan, R. Shamir, RECOMB 02
* Bicluster (= module): subset of genes with similar behavior in a subset of conditions
* Computationally challenging: has to consider many combinations of sub-conditions
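To make the definition concrete, the sketch below represents a bicluster as a pair of index lists (genes, conditions) and extracts the corresponding submatrix; a second helper counts the non-empty condition subsets, which shows why exhaustive search is impractical. This only illustrates the data structure and the size of the search space; it is not the SAMBA algorithm, and the names are hypothetical.

```python
import numpy as np

def bicluster_submatrix(data, gene_idx, cond_idx):
    """Expression values of the chosen genes restricted to the chosen conditions."""
    data = np.asarray(data, dtype=float)
    return data[np.ix_(gene_idx, cond_idx)]

def count_condition_subsets(n_conditions):
    """Number of non-empty subsets of the conditions."""
    return 2 ** n_conditions - 1

# Toy matrix: genes 0 and 2 are both high in conditions 1 and 3 only
expr = [[1, 9, 2, 9],
        [5, 1, 7, 3],
        [3, 8, 1, 9],
        [4, 2, 6, 0]]
sub = bicluster_submatrix(expr, [0, 2], [1, 3])
print(sub)
print(count_condition_subsets(20))
```

Genes 0 and 2 look unrelated across all four conditions, yet agree closely on the selected two, which is the kind of pattern a global clustering would miss.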
Biclustering Visualization
Expression Data – Input File (rows: probes; columns: conditions)
ID Conversion File
Normalization: Box plots (y-axis: log intensity; each box marks the median intensity and the upper and lower quartiles)
Standardization of Expression Levels (panels: before standardization, after standardization)
Cluster Analysis: Visualization (I)
Cluster Analysis – Visualization (II) (panels: before and after; Clusters I, II, III)