Finding Transcription Modules from Large Gene-Expression Data Sets
Ned Wingreen – Molecular Biology
Morten Kloster, Chao Tang – NEC Laboratories America

Outline
Introduction: transcription, regulation, gene chips, and transcription modules.
Iterative Signature Algorithm (ISA).
Advantages of Progressive Iterative Signature Algorithm (PISA).
PISA applied to yeast data.

Transcription regulation

Gene chips (DNA microarrays)

Gene-expression profile
$E_{gc}$, with genes $g = 1, 2, \ldots, N_g$ and conditions $c = 1, 2, \ldots, N_c$.
But the data are very noisy…

Transcription module
[Figure: conditions C1 to C3 connected through transcription factors TF 1 to TF 4 to genes G1 to G7.]
A transcription module: a set of conditions and a set of genes connected by a transcription factor.

A gene can be in multiple transcription modules.
[Figure: signature of a transcription module in the expression matrix, with conditions $c_1, \ldots, c_{N_c}$ along one axis and genes $g_1, \ldots, g_{N_g}$ along the other.]

Iterative Signature Algorithm (ISA)
Barkai group (2002, 2003).
A transcription module (TM) is described by a gene vector and a condition vector.
[Figure: expression matrix with conditions $c_1, \ldots, c_{N_c}$ and genes $g_1, \ldots, g_{N_g}$.]
Thresholding on both genes and conditions reduces noise.
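The alternation between condition scores and gene scores can be sketched roughly as below. This is only an illustration under stated assumptions, not the Barkai group's implementation: the two normalizations (`by_cond`, `by_gene`), the z-score style threshold in `keep_high`, the default thresholds, and the planted-module test data are all choices made for this sketch, and only positively scoring genes and conditions are kept.

```python
import numpy as np

def standardize(a, axis):
    """Zero mean and unit variance along `axis` (constant slices stay at zero)."""
    mu = a.mean(axis=axis, keepdims=True)
    sd = a.std(axis=axis, keepdims=True)
    return (a - mu) / np.where(sd > 0, sd, 1.0)

def keep_high(scores, t):
    """Zero every score that is not more than t standard deviations above the mean."""
    sd = scores.std()
    z = (scores - scores.mean()) / (sd if sd > 0 else 1.0)
    return np.where(z > t, scores, 0.0)

def isa_sketch(E, seed_genes, t_g=2.2, t_c=1.5, max_iter=20):
    """Iterate gene set -> condition scores -> gene scores until the gene set is stable.

    E is a genes x conditions matrix. Normalization and thresholds are assumptions.
    """
    by_cond = standardize(E, axis=0)   # each condition standardized across genes
    by_gene = standardize(E, axis=1)   # each gene standardized across conditions
    g = np.zeros(E.shape[0])
    g[seed_genes] = 1.0
    c = np.zeros(E.shape[1])
    for _ in range(max_iter):
        s_c = by_cond.T @ g / max(np.count_nonzero(g), 1)   # condition scores
        c = keep_high(s_c, t_c)                             # threshold conditions
        s_g = by_gene @ c / max(np.count_nonzero(c), 1)     # gene scores
        g_new = keep_high(s_g, t_g)                         # threshold genes
        converged = np.array_equal(g_new != 0, g != 0)
        g = g_new
        if converged:
            break
    return np.flatnonzero(g), np.flatnonzero(c)

# Synthetic data (hypothetical): 20 genes raised in 10 conditions, plus noise.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.5, size=(300, 60))
E[:20, :10] += 3.0
genes, conds = isa_sketch(E, seed_genes=np.arange(8))
```

Starting from 8 of the 20 planted genes, the iteration is expected to recover the planted block; on real data the choice of thresholds matters far more than it does here.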

Limitations of ISA
Lots of spurious modules (millions…).
Weak modules may be absorbed by strong ones.
ISA does not make use of identified modules to find new ones.
[Figure: expression matrix with conditions $c_1, \ldots, c_{N_c}$ and genes $g_1, \ldots, g_{N_g}$.]

Progressive Iterative Signature Algorithm (PISA)
[Figure: expression matrix with conditions $c_1, \ldots, c_{N_c}$ and genes $g_1, \ldots, g_{N_g}$.]

Advantages of PISA over ISA
Removing found modules reveals “hidden” modules, and reduces noise for unrelated modules.
No positive feedback.
Improved thresholding for genes.
Combines coregulated and counter-regulated genes.

Example of PISA vs. ISA
[Figure: panels A and B, with labels TF 1, TF 2, G1, and G2.]

The gene-score threshold
Goal: less than one gene included in the module by mistake.
Require: a threshold that is insensitive to the (unknown) module size.
[Figure: gene scores along the condition vector for some module.]

Eliminating false modules
For scrambled data, preliminary modules either have few genes or few contributing conditions.
[Figure label: true positives.]

PISA applied to yeast data
Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, ~1000 conditions.
Found ~140 different modules, including all “good” modules found by ISA.
Found some unknown modules.
Found many “good” small modules that ISA could not find or separate from the spurious modules.
~2600 genes in at least one module, ~900 genes in more than one module.

Some modules found by PISA

Example: Zinc module
ZRT1, YNL254C, INO1, ZAP1, YOL154W, ADH4, ZRT3, ZRT2, YOR387C
ZRT1, ZAP1, ZRT2, YNL254C, YOL154W, ZRT3, ADH4, RAD27, ZRC1, …
Lyons et al., PNAS 97, (2000): ZAP1-regulated genes during zinc starvation.
Zinc module found by PISA.

Comparison with other databases
“Gold standard”: Gene Ontology (Genome Res. 11, (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, (2003))

Correlations
[Figure: correlated and anticorrelated modules; number of genes in each module in parentheses.]
Oxidative stress response (69); De novo purine biosyn (32); Lysine biosyn (11); Biotin syn & transport (6); Arg biosyn (6); aa biosyn (96); aryl alcohol dehydrogenase (6); proteolysis (27); trehalose & hexose metabolism/conversion (21); COS genes (11); heat shock (52); repair of disulfide bonds (26); Mating genes for type a (15); Mating type a signaling genes (6); Mating (110); Mating factors/receptors: a/α difference (26); rRNA processing (117); Ribosomal proteins (126); Histone (19); Fatty acid syn ++ (22); Cell cycle G2/M (31); Cell cycle M/G1 (35); Cell cycle G1/S (66)

Summary
Data from gene chips can be used to identify transcription modules (TMs).
Iterative approach (ISA) is promising.
PISA improves on ISA by taking out found TMs.
– PISA also improves gene thresholding, avoids positive feedback, and improves signal-to-noise by grouping coregulated and counter-regulated genes.
– PISA is very effective for finding “secondary modules”.

Future Directions
Input to experiment:
– new modules and new genes in old modules.
– what kinds of experiments give the most informative data?
Improve PISA:
– better pre/post-processing of data.
Apply PISA to other organisms.
Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.

De novo purine biosynthesis
Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at $t_G = 5.0$; frequency 16

Galactose induced genes
Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at $t_G = 3.2$; frequency 686

Hexose transporters
Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at $t_G = 3.8$; frequency 41

Peroxide shock
Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at $t_G = 3.4$; frequency (1)

Implementation of PISA
Normalization of gene-expression data
Iterative algorithm to find preliminary modules (modified ISA)
– avoiding positive feedback
– gene-score threshold
Orthogonalization
Finding consistent modules

Normalization of expression data
Two normalized matrices are used: the gene-score matrix $E_G$ and the condition-score matrix $E_C$. The normalization:
– removes reference-condition bias
– normalizes total RNA levels
– makes gene scores comparable
– makes condition scores comparable

Iterative algorithm: modified ISA (mISA)
Start with a random set of genes $G_I$.
Produce condition-score vector $s^C$.
Produce gene-score vector $s^G$, using “leave-one-out” scoring to avoid positive feedback.
From $s^G$, calculate gene vector $m^G$ for the next iteration.
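One way to read “leave-one-out” scoring is that each gene is scored against a condition-score vector computed without that gene's own contribution, so a gene already in the set cannot reinforce itself. The sketch below follows that reading; it is an assumption about the scoring rule rather than the authors' code, and the function name `gene_scores`, the unnormalized dot products, and the demo data are hypothetical.

```python
import numpy as np

def gene_scores(E_G, E_C, m_G, leave_one_out=True):
    """Gene scores from a weighted gene set m_G (genes x conditions matrices).

    s_C = E_G^T m_G is the condition-score vector; s_G = E_C s_C scores genes.
    With leave_one_out, gene i is scored against s_C - m_G[i] * E_G[i], i.e.
    the condition vector built from all genes except gene i itself.
    """
    s_C = E_G.T @ m_G
    s_G = E_C @ s_C
    if leave_one_out:
        # Subtract each gene's own contribution: m_G[i] * (E_C[i] . E_G[i]).
        s_G = s_G - m_G * np.einsum("gc,gc->g", E_C, E_G)
    return s_G

# Demo on random data; the same matrix stands in for E_G and E_C for brevity.
rng = np.random.default_rng(1)
E_demo = rng.normal(size=(6, 4))
m_demo = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
plain = gene_scores(E_demo, E_demo, m_demo, leave_one_out=False)
loo = gene_scores(E_demo, E_demo, m_demo)
```

Genes outside the input set get identical scores either way; only genes inside the set lose the self-term, which is exactly the positive-feedback contribution.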

Orthogonalization
After finding each converged preliminary module ($s^G$, $s^C$), remove the component along $s^C$ from all genes.
[Figure: vectors $s_1^C$, $s'$, and $s_2^C$.]
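Removing the component along $s^C$ amounts to projecting every gene's expression profile onto the subspace orthogonal to $s^C$. A minimal sketch, assuming the expression matrix is stored as genes by conditions (the function name `remove_component` and the toy numbers are illustrative only):

```python
import numpy as np

def remove_component(E, s_C):
    """Project each row (gene profile) of E onto the complement of s_C."""
    u = s_C / np.linalg.norm(s_C)          # unit vector along the condition scores
    return E - np.outer(E @ u, u)          # subtract each gene's component along u

# Toy example: 3 genes, 3 conditions, module direction along the first two conditions.
E_toy = np.array([[2.0, 2.0, 0.5],
                  [1.0, 1.0, -1.0],
                  [0.0, 0.0, 3.0]])
s_toy = np.array([1.0, 1.0, 0.0])
E_residual = remove_component(E_toy, s_toy)
```

After this step no gene has any remaining component along the found module's condition vector, so a later run cannot converge to the same module again.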

Why does scrambled data yield large modules?
Long tails of expression data lead to single-condition modules.

Finding consistent modules
Repeat PISA runs many times (~30).
Tabulate preliminary modules.
A preliminary module contributes to a module if:
– the preliminary module contains > 50% of the genes in the module,
– these genes constitute > 20% of the preliminary module.
A gene is included in a module if it appears in > 50% of the contributing modules, always with the same gene-score sign.
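The two contribution criteria and the gene-inclusion rule above can be written down directly. The sketch below applies them once to a candidate gene set; how candidate modules are initially formed from the tabulated preliminary modules is not stated in the slides, so that part is left out. The data layout (each preliminary module as a dict from gene name to gene-score sign) and the gene names are hypothetical.

```python
def contributes(prelim, module_genes):
    """True if the preliminary module holds > 50% of the module's genes
    and those shared genes make up > 20% of the preliminary module."""
    shared = module_genes & set(prelim)
    return (len(shared) > 0.5 * len(module_genes)
            and len(shared) > 0.2 * len(prelim))

def consensus_genes(prelims, module_genes):
    """Genes present in > 50% of contributing modules, always with one sign."""
    contributing = [p for p in prelims if contributes(p, module_genes)]
    counts, signs = {}, {}
    for p in contributing:
        for gene, sign in p.items():
            counts[gene] = counts.get(gene, 0) + 1
            signs.setdefault(gene, set()).add(sign)
    return {g for g, n in counts.items()
            if n > 0.5 * len(contributing) and len(signs[g]) == 1}

# Hypothetical preliminary modules from repeated runs: gene -> sign of gene score.
prelims = [
    {"g1": +1, "g2": +1, "g3": +1},
    {"g1": +1, "g2": +1, "g4": -1},
    {"g1": +1, "g3": +1, "g4": +1},
    {"g1": +1, "g5": +1, "g6": +1, "g7": +1},   # shares too few genes to contribute
]
module = consensus_genes(prelims, {"g1", "g2", "g3"})
```

In this toy case `g4` appears in two of the three contributing modules but with opposite signs, so it is excluded even though it passes the 50% frequency cut.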

Comparison with other databases
Gene Ontology (Genome Res. 11, (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, (2003))
$N_g$: number of genes in organism
$m$: number of genes in module
$c$: number of genes in GO category
$n$: number of genes in both module and GO category
The p value is computed from these four quantities.
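The p-value formula itself did not survive in this transcript. A common choice for exactly these four quantities is the hypergeometric tail, the probability that a random set of $m$ genes shares at least $n$ genes with a category of size $c$. The sketch below implements that standard test as an assumption; it may differ from the formula on the original slide, and the function name `enrichment_p_value` is made up for this example.

```python
from math import comb

def enrichment_p_value(N_g, m, c, n):
    """Hypergeometric tail: P(overlap >= n) when m genes are drawn at random
    from N_g genes, of which c belong to the category."""
    total = comb(N_g, m)
    upper = min(m, c)
    # comb returns 0 when m - k exceeds N_g - c, so impossible terms vanish.
    favorable = sum(comb(c, k) * comb(N_g - c, m - k) for k in range(n, upper + 1))
    return favorable / total

# Small worked case: 10 genes, module of 3, category of 4, overlap of at least 2.
p_small = enrichment_p_value(10, 3, 4, 2)   # (6*6 + 4*1) / 120 = 1/3
```

Exact integer arithmetic via `math.comb` avoids the underflow that floating-point factorials would hit at genome scale (thousands of genes).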

Correlation of modules
[Figure: expression matrix with conditions $c_1, \ldots, c_{N_c}$ and genes $g_1, \ldots, g_{N_g}$.]