Discovering Regulatory Networks from Gene Expression and Promoter Sequence Eran Segal Stanford University.

Presentation transcript:

Discovering Regulatory Networks from Gene Expression and Promoter Sequence Eran Segal Stanford University

From Parts to Systems: Parts, Modules, Interactions, Activity

Gene Regulation: DNA → RNA → Protein (figure: Gene 1 and Gene 2 on the DNA). DNA → RNA is a tightly regulated process.

Gene Regulation: each gene on the DNA has a coding region and a control region. A regulator (transcription factor), here Swi5, recognizes a motif (ACGTGC) in the control region.

Genome-wide Available Data. DNA sequence: coding and control regions of every gene (……ACTAGCGGCTATAATGACTGGACCTACGTACCGATATAATGTCAGCTAGCA……). Gene expression: mRNA level of all genes, measured in different conditions with DNA microarrays.

Gene Regulation. How are genes regulated? Who regulates whom? Under which conditions? Which genes are co-regulated? Many diagnostic, prognostic and therapeutic implications. (Figure: regulator Swi5 and motif ACGTGC at the control regions of Gene 1 and Gene 2.)

Example: Finding Motifs. Procedural: apply a different method to each type of data, and use the output of one method as input to the next. First cluster gene expression profiles (genes × experiments), then search for motifs in the control regions of the clustered genes. Control regions: Gene I AGCTAGCTGAGACTGCACAC; Gene II TTCGGACTGCGCTATATAGA; Gene III GACTGCAGCTAGTAGAGCTC; Gene IV CTAGAGCTCTATGACTGCCG; Gene V ATTGCGGGGCGTCTGAGCTC; Gene VI TTTGCTCTTGACTGCCGCTT. Motif: GACTGC.
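The motif-search step of this example can be sketched in a few lines. This is only an illustration, not the method used in the talk: it counts, for every 6-mer, how many of the six control regions shown on the slide contain it. The function name `count_kmer_genes` is hypothetical, and a real motif finder would restrict the search to one expression cluster and compare against a background model.

```python
from collections import Counter

def count_kmer_genes(control_regions, k):
    """For every k-mer, count how many control regions contain it at least once."""
    counts = Counter()
    for seq in control_regions.values():
        # A set, so a k-mer repeated inside one region is counted once for that gene.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Control regions copied from the slide.
control_regions = {
    "Gene I": "AGCTAGCTGAGACTGCACAC",
    "Gene II": "TTCGGACTGCGCTATATAGA",
    "Gene III": "GACTGCAGCTAGTAGAGCTC",
    "Gene IV": "CTAGAGCTCTATGACTGCCG",
    "Gene V": "ATTGCGGGGCGTCTGAGCTC",
    "Gene VI": "TTTGCTCTTGACTGCCGCTT",
}

counts = count_kmer_genes(control_regions, k=6)
best_kmer, n_genes = counts.most_common(1)[0]
print(best_kmer, n_genes)  # GACTGC 5
```

GACTGC occurs in every control region except that of Gene V, which matches the motif highlighted on the slide.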

Our Approach: Model Based. What is a model? A (stochastic, probabilistic) description of the biological process that could have generated the observed data.

Our Approach: Model Based. Statistical modeling language for biological domains, based on Bayesian networks. Classes of objects: Gene, Experiment, Expression. Properties: observed (gene sequence, experiment conditions) and hidden (gene module). Interactions: expression level as a function of gene and experiment properties. (Figure labels: Module, Condition, Tumor.) STGFK ’01 (ISMB)

Tumor Module Level Probabilistic Model Defines a joint distribution Condition Exper. Gene Expression Tumor 1 Module 1 Level 1,1 Condition 1 Level 1,2 Tumor 2 Condition 2 Module 2 Level 2,1 Level 2,2 Bayesian Network P(Level 2,1 | Module 2,Condition 2,Tumor 2 )

Probabilistic Model. Defines a joint distribution. Learned automatically from data: parameterization, structure, assignment to hidden variables. Find model M that maximizes P(M | D). Learn parameterization and structure of distributions. Learn network structure: thousands of variables; space of possible networks is super-exponential. Probabilistic inference in the Bayesian network: millions of hidden variables; variables are highly dependent; NP-hard. Techniques: convex optimization, graph theoretic algorithms, dynamic programming, heuristic search. Problem-specific structure: modularity in biological systems. STGFK ’01 (ISMB)

Scheme. Biological problem and data → Model design (classes of objects, properties, interactions) → Learn model (automatically from data: structure, parameterization) → Analyze results (visualization, literature, statistics). Derive biological insights from model. STGFK ’01 (ISMB)

Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments How are genes regulated? Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC

Ongoing Biological Debate Can we discover actual regulators from gene expression data alone?

Gene Regulation: Simple Example. Regulators: an activator and a repressor of a regulated gene, all measured by DNA microarray. Figure labels: State 1; State 2 (Activator); State 3 (Activator, Repressor).

Regulation Tree. Regulation program: a tree that tests “Activator?” and “Repressor?” (true/false branches), with leaves State 1, State 2 and State 3. Genes in the same module share the same regulation program. (Figure: module genes, activator expression, repressor expression.) SSRPBKF ’03 (Nature Genetics)
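A regulation tree of this kind can be written as a small decision function. The sketch below is an assumption about how the branches map to the leaves (activator off gives State 1, activator on with repressor off gives State 2, both on gives State 3); the flattened slide text does not preserve the exact wiring, and the function name is hypothetical.

```python
def regulation_state(activator_on: bool, repressor_on: bool) -> str:
    """Walk a two-level regulation tree and return the leaf (state) that applies."""
    # First test: "Activator?"
    if not activator_on:
        return "State 1"
    # Second test, reached only on the activator's true branch: "Repressor?"
    if not repressor_on:
        return "State 2"
    return "State 3"

# Every experiment falls into exactly one leaf, and all genes of a module
# share the leaf chosen for that experiment.
for activator_on in (False, True):
    for repressor_on in (False, True):
        print(activator_on, repressor_on, regulation_state(activator_on, repressor_on))
```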

Module Networks. Goal: discover regulatory modules and their regulators. Module genes: set of genes that are similarly controlled. Regulation program: expression as a function of regulators (figure: tree over HAP4 and CMK1 with true/false branches). SSRPBKF ’03 (Nature Genetics)

Module Network Probabilistic Model. Expression level in each module is a function of expression of regulators: P(Level | Module, Regulators). Gene: Module (what module does gene “g” belong to?). Experiment: Regulator 1, Regulator 2, Regulator 3 (expression level of Regulator 1 in experiment). Expression: Level. (Figure: regulation trees over BMH1 and GIC2, and over HAP4 and CMK1.) SSRPBKF ’03 (Nature Genetics)

Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments How are genes regulated? Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC

Learning Problem Experiment Gene Expression Module Regulator 1 Regulator 2 Regulator 3 Level HAP4  CMK1  Find gene module assignments and tree structures that maximize P(M|D) Goal: Gene module assignments Tree structures Hard Genes: Regulators: ~500 SSRPBKF ’03 (Nature Genetics)

Learning Algorithm Overview. Clustering gives an initial gene module assignment. Then alternate: learn regulation programs for the modules (figure: tree over HAP4 and CMK1), and relearn gene assignments to modules. Output: regulatory modules. SSRPBKF ’03 (Nature Genetics)

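Learning Regulation Programs. (Figure: module genes × experiments, with experiments in original order and then sorted by Hap4 expression.) No split: $\log P(M|D) \propto \log P(D \mid \mu, \sigma) + \log P(\mu, \sigma)$. Split on HAP4: $\log P(M|D) \propto \log P(D_{HAP4\uparrow} \mid \mu_{HAP4\uparrow}, \sigma_{HAP4\uparrow}) + \log P(D_{HAP4\downarrow} \mid \mu_{HAP4\downarrow}, \sigma_{HAP4\downarrow}) + \log P(\mu_{HAP4\uparrow}, \sigma_{HAP4\uparrow}, \mu_{HAP4\downarrow}, \sigma_{HAP4\downarrow})$. Split on SIP4: $\log P(M|D) \propto \log P(D_{SIP4\uparrow} \mid \mu_{SIP4\uparrow}, \sigma_{SIP4\uparrow}) + \log P(D_{SIP4\downarrow} \mid \mu_{SIP4\downarrow}, \sigma_{SIP4\downarrow}) + \log P(\mu_{SIP4\uparrow}, \sigma_{SIP4\uparrow}, \mu_{SIP4\downarrow}, \sigma_{SIP4\downarrow})$. Split on HAP4, then on CMK1: $\log P(M|D) \propto \log P(D_{HAP4\uparrow} \mid \mu_{HAP4\uparrow}, \sigma_{HAP4\uparrow}) + \log P(D_{CMK1\uparrow} \mid \mu_{CMK1\uparrow}, \sigma_{CMK1\uparrow}) + \log P(D_{CMK1\downarrow} \mid \mu_{CMK1\downarrow}, \sigma_{CMK1\downarrow}) + \dots$ (Figure: module genes, Hap4 expression, regulator.)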
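The idea of scoring candidate regulators by how well they split the experiments can be illustrated as follows. This is a simplified sketch: it uses a maximum-likelihood Gaussian fit for each side of the split and leaves out the prior term, so it is not the Bayesian score from the talk. The function names, the threshold of 0 and the toy numbers are all assumptions.

```python
import math

def gaussian_loglik(values, min_var=1e-6):
    """Log-likelihood of values under a Gaussian with ML mean and (floored) variance."""
    n = len(values)
    mean = sum(values) / n
    sq_err = sum((v - mean) ** 2 for v in values)
    var = max(sq_err / n, min_var)
    return -0.5 * n * math.log(2 * math.pi * var) - sq_err / (2 * var)

def split_score(module_expr, regulator_expr, threshold=0.0):
    """Score a split of the experiments into regulator-up and regulator-down groups.

    module_expr[e] is the list of module gene levels in experiment e;
    regulator_expr[e] is the candidate regulator's level in experiment e.
    """
    up = [v for levels, r in zip(module_expr, regulator_expr) if r > threshold for v in levels]
    down = [v for levels, r in zip(module_expr, regulator_expr) if r <= threshold for v in levels]
    if not up or not down:
        return float("-inf")  # the split does not separate any experiments
    return gaussian_loglik(up) + gaussian_loglik(down)

# Toy data (hypothetical): 4 experiments, 2 module genes each.
module_expr = [[2.0, 2.2], [1.9, 2.1], [-2.0, -1.8], [-2.1, -1.9]]
regulators = {
    "HAP4": [1.5, 1.2, -1.0, -1.3],  # tracks the module genes
    "SIP4": [1.0, -1.0, 1.0, -1.0],  # unrelated to the module genes
}
scores = {name: split_score(module_expr, expr) for name, expr in regulators.items()}
best = max(scores, key=scores.get)
print(best)
```

In this toy data the HAP4 split yields two tight groups of expression values, so it scores higher than the SIP4 split.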

Learning Algorithm Performance. (Plots: Bayesian score, avg. per gene, vs. algorithm iterations; gene module assignment changes, % from total, vs. algorithm iterations.) Significant improvements across learning iterations. Many genes (50%) change module assignment in learning. SPRKF ’03 (UAI)

Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments How are genes regulated? Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC

Yeast Stress Data. Genes: selected 2355 that showed activity. Experiments (173): diverse environmental stress conditions (heat shock, nitrogen depletion, …).

Comparison to Bayesian Networks. Bayesian network (Friedman et al. ’00, Hartemink et al. ’01): expression level of each gene is a function of expression of regulators. Fragment of learned Bayesian network: Cmk1, Hap4, Mig1, Ste12, Yap1, Gic1. 2355 variables (genes), 173 instances (experiments). Problems: robustness, interpretability.

Comparison to Bayesian Networks. Module network (SPRKF ’03 (UAI)): Regulator 1, Regulator 2, Regulator 3, Module, Level. Solutions: robustness → sharing parameters; interpretability → module-level model.

Comparison to Bayesian Networks. Learn which parameters are shared (by learning which genes are in the same module). (Plot: test data log-likelihood, gain per instance, vs. number of modules, against Bayesian network performance.) SPRKF ’03 (UAI)

Module From Model to Regulatory Modules Regulator 1 Regulator 2 Regulator 3 Level HAP4  CMK1  Biologically relevant? HAP4  CMK1  SSRPBKF ’03 (Nature Genetics)

Respiration Module Regulation program Module genes Energy production (oxid. phos. 26/55 P< ) Hap4+Msn4 known to regulate module genes Module genes functionally coherent? Module genes known targets of predicted regulators?   SSRPBKF ’03 (Nature Genetics) Predicted regulator

Energy, Osmolarity, & cAMP Signaling. Regulation by non-TFs (Tpk1: cAMP-dependent protein kinase). Module genes known targets of predicted regulators? (Figure: regulation program, module genes.)

Biological Evaluation Summary. Are the module genes functionally coherent? 46/50. Are some module genes known targets of the predicted regulators? 30/50. Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses). Known targets = direct biological experiments reported in the literature. SSRPBKF ’03 (Nature Genetics)
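The hypergeometric enrichment p-value used for the coherence test can be computed directly from binomial coefficients. The sketch below is a generic implementation of the upper tail and applies no multiple-hypothesis correction. The function name is hypothetical, and in the usage line only the population of 2355 genes and the 26-of-55 hit count come from the slides; the number of annotated genes (60) is made up for illustration.

```python
from math import comb

def hypergeom_pvalue(population, annotated, module_size, hits):
    """P(X >= hits) when drawing module_size genes without replacement from a
    population in which `annotated` genes carry the annotation."""
    total = comb(population, module_size)
    upper = min(annotated, module_size)
    tail = sum(
        comb(annotated, x) * comb(population - annotated, module_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total

# Illustrative call: 26 of 55 module genes annotated, 2355 genes overall,
# and a hypothetical 60 annotated genes in the whole population.
p = hypergeom_pvalue(population=2355, annotated=60, module_size=55, hits=26)
print(p < 0.01)
```

Testing many annotations per module calls for a correction, for example multiplying each p-value by the number of annotations tested (Bonferroni); the slide does not say which correction was used.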

Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments How are genes regulated? Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC

From Model to Detailed Predictions. Prediction: regulator ‘X’ regulates process ‘Y’. Experiment: knock out ‘X’ and repeat experiment. (Figure: Ypl230w as the regulator X.) SSRPBKF ’03 (Nature Genetics)

Does ‘X’ Regulate Predicted Genes? Experiment: knock out Ypl230w (stationary phase). 1334 regulated genes (312 expected by chance); figure: wild-type vs. mutant, >4x. Rank modules by regulated genes. Modules predicted to be regulated by Ypl230w (Module, Sig.): Protein folding, P< ; Cell differentiation, P<0.02; Glycolysis and folding, P<0.04; Mitochondrial and protein fate, P<0.04. Ypl230w regulates computationally predicted genes. SSRPBKF ’03 (Nature Genetics)

Does ‘X’ Regulate Predicted Genes? Regulated genes (1014); Ppt1 knockout (hypo-osmotic stress), wild-type vs. mutant. Regulated genes (1034), wild-type vs. mutant; Kin82 knockout (heat shock). Module, Sig.: Energy and osmotic stress, P< ; Energy, osmolarity & cAMP signaling, P<0.006; mRNA, rRNA and tRNA processing, P<0.02. Module, Sig.: Ribosomal and phosphate metabolism, P<0.009; Amino acid and purine metabolism, P<0.01; mRNA, rRNA and tRNA processing, P<0.02; Protein folding, P<0.02; Cell cycle, P<0.02. SSRPBKF ’03 (Nature Genetics)

Wet Lab Experiments Summary. 3/3 regulators regulate computationally predicted genes. New yeast biology suggested: Ypl230w activates protein-folding, cell wall and ATP-binding genes; Ppt1 represses phosphate metabolism and rRNA processing; Kin82 activates energy and osmotic stress genes. SSRPBKF ’03 (Nature Genetics)

Ongoing Biological Debate Can we discover actual regulators from gene expression data alone? Many regulatory relationships can be induced from gene expression data SSRPBKF ’03 (Nature Genetics)

Why Does it Work? Assumption: regulators are transcriptionally regulated. Feedforward, auto-regulatory “motifs” (Shen-Orr et al. 2002). TFs and SMs have detectable expression signature. Statistical methods can infer their regulatory relationships from gene expression data. Examples: regulator chain (respiration): Phd1 (TF), Hap4 (TF), targets Cox4, Cox6, Atp17; auto regulation (Snf kinase regulated processes): Yap6 (TF), targets Vid24, Tor1, Gut2; positive signaling loop (sporulation & cAMP): Sip2 (SM), Msn4 (TF), targets Vid24, Tor1, Gut2. Legend: undetected regulators, detected regulators, detected target. SSRPBKF ’03 (Nature Genetics)

Outline Who regulates whom and when? How are genes regulated? Model Evaluation Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC Reg. ACGTGC Motif

From Sequence to Expression. (Figure: DNA control sequences of Gene 1, Gene 2 and Gene 3 carrying the activator motif ACGTGC, the activator and repressor motifs ACGTGC + GATAG, or no motifs; expression measured by DNA microarray.)

From Sequence to Expression. Sequence (ACGTGC; ACGTGC + GATAG; no motifs) → Expression. Goal: explain how expression arises from sequence. Construct mechanistic model of gene regulation. Learn the model from sequence and expression data.

Two Phase Approach (I). Procedural: apply a different method to each type of data, and use the output of one method as input to the next. First cluster gene expression profiles (genes × experiments), then search for motifs in the control regions of the clustered genes. Control regions: Gene I AGCTAGCTGAGACTGCACAC; Gene II TTCGGACTGCGCTATATAGA; Gene III GACTGCAGCTAGTAGAGCTC; Gene IV CTAGAGCTCTATGACTGCCG; Gene V ATTGCGGGGCGTCTGAGCTC; Gene VI TTTGCTCTTGACTGCCGCTT. Motif: GACTGC.

Two Phase Approach: Problems. Expression clustering is not perfect. (Figure: Clustering A and Clustering B, each with Cluster I, Cluster II and a shared motif.)

Two Phase Approach (II). Iterate over all sequences of length k. Find all genes that have each k-mer in their promoter. Keep k-mers whose genes are coherent in expression. Example k-mers: GATACC, ACGACT, AAATGC, TCGACT, CGCTGA, ACGAGA, TTCGCA, CGATGG, AAATTA.
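The three steps of this k-mer based approach can be sketched as below. The slide does not say how coherence is measured, so the mean pairwise Pearson correlation, the thresholds, the function names and the toy promoters and expression profiles are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def coherence(genes, expression):
    """Mean pairwise correlation between the expression profiles of the genes."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)

def coherent_kmers(promoters, expression, k, min_genes=2, min_coherence=0.8):
    """Keep the k-mers whose carrier genes are coherent in expression."""
    kmers = {s[i:i + k] for s in promoters.values() for i in range(len(s) - k + 1)}
    kept = {}
    for kmer in sorted(kmers):
        genes = [g for g, s in promoters.items() if kmer in s]
        if len(genes) >= min_genes:
            score = coherence(genes, expression)
            if score >= min_coherence:
                kept[kmer] = score
    return kept

# Toy data (hypothetical).
promoters = {"g1": "AAGATACCTT", "g2": "CCGATACCAA", "g3": "TTTCGACTGG", "g4": "GGTCGACTAA"}
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5], "g3": [1, -1, 1, -1], "g4": [-1, 1, -1, 1]}
kept = coherent_kmers(promoters, expression, k=6)
print(sorted(kept))
```

In the toy data TCGACT is shared by two genes with opposite expression, so it is dropped. This is the situation the next slides raise as a weakness of scoring single motifs in isolation.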

Two Phase Approach: Problems. Single motifs may not have coherent expression. (Figure: activator motif TCGACTGC and repressor motif GATAC; genes carrying TCGACTGC alone and genes carrying TCGACTGC + GATAC.)

Two Phase Approach: Problems. Are we missing motifs? (Figure: TCGACTGC, CCAAT, +, OR, ?)

Two Phase Approach: Problems. A single motif cannot explain variation in expression. (Figure: activator motif TCGACTGC and repressor motif GATAC; TCGACTGC + GATAC vs. TCGACTGC alone.)

Unified Model of Gene Regulation. Sequence → Motifs → Motif Profiles → Expression Profiles. (Figure: raw promoter sequence of the genes; motifs TCGACTGC, GATAC, CCAAT, GCAGTT; motif profiles that combine motifs, e.g. TCGACTGC + GATAC.) SYK ’03 (ISMB)

Unified Model of Gene Regulation. Sequence → Motifs → Motif Profiles → Expression Profiles; cis-regulatory modules. (Figure: motifs TCGACTGC, GATAC, CCAAT, GCAGTT combined into motif profiles.)

Regulatory Module. Expression of module genes (modules × experiments); DNA control sequences of module genes; motif profile: TCGACTGC + GATAC. SYK ’03 (ISMB)

Our Approach. Unified model of gene regulation using sequence and expression (Sequence → Motifs → Motif Profiles → Expression Profiles). Input I: sequence data. Input II: expression data. Model trained as a whole: motif profiles are predictive of expression; expression clusters share motif profiles; motifs added to make profiles predictive. Model learned without prior knowledge. SYK ’03 (ISMB)

Problems and Solutions. Problem: expression clustering is not perfect. Solution: unified model for expression and motifs. Problem: a single motif cannot explain variation in expression. Solution: use combinatorial motif profiles. Problem: are we missing motifs? Solution: dynamically add motifs to explain expression. SYK ’03 (ISMB)

Probabilistic Model (Sequence → Motifs). Gene: sequence variables S1, S2, S3, S4 and motif variables R1, R2, R3 (is motif i “active” in gene g?). P(R2 | S) is given by a Position Specific Scoring Matrix (PSSM). SYK ’03 (ISMB)
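One common way to turn a PSSM into a probability that a motif is active in a gene is sketched below. The exact form of P(R | S) is not preserved in the transcript, so this is an assumption: each window of the control sequence is scored by summing per-position weights, and the best window score plus a bias is passed through a logistic function. The PSSM values, the bias and the function names are hypothetical.

```python
import math

def window_scores(seq, pssm):
    """Score every window of len(pssm) bases; pssm[j][base] is the weight at position j."""
    width = len(pssm)
    return [
        sum(pssm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def prob_motif_active(seq, pssm, bias=0.0):
    """Logistic function of the best window score plus a bias term."""
    scores = window_scores(seq, pssm)
    if not scores:  # sequence shorter than the motif
        return 0.0
    return 1.0 / (1.0 + math.exp(-(max(scores) + bias)))

# Hypothetical PSSM for the consensus ACGTGC: +1 for the consensus base, -1 otherwise.
consensus = "ACGTGC"
pssm = [{base: (1.0 if base == c else -1.0) for base in "ACGT"} for c in consensus]

p_with = prob_motif_active("TTGACGTGCAAT", pssm, bias=-3.0)
p_without = prob_motif_active("TTTTTTTTTTTT", pssm, bias=-3.0)
print(round(p_with, 3), round(p_without, 3))
```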

Probabilistic Model (Motifs → Motif Profiles). Gene: sequence variables S1, S2, S3, S4; motif variables R1, R2, R3; Module. P(Module | R) = softmax. Example: motif profile 1 consists of R1 and R2. SYK ’03 (ISMB)
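A softmax distribution over modules given the motif indicators can be sketched as follows. The weights and biases are hypothetical: module 0 is given positive weights on R1 and R2, echoing the motif profile named on the slide, and module 1 a positive weight on R3.

```python
import math

def module_posterior(motif_indicators, weights, biases):
    """Softmax over modules of (bias + sum of weights of the active motifs)."""
    scores = [
        b + sum(w * r for w, r in zip(ws, motif_indicators))
        for ws, b in zip(weights, biases)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

weights = [
    [2.0, 2.0, 0.0],  # module 0: motif profile R1, R2
    [0.0, 0.0, 2.0],  # module 1: motif profile R3
]
biases = [0.0, 0.0]

probs = module_posterior([1, 1, 0], weights, biases)  # gene with R1 and R2 active
print([round(p, 3) for p in probs])
```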

Probabilistic Model (Motif Profiles → Expression Profiles). Gene: sequence variables S1, S2, S3, S4; motif variables R1, R2, R3; Module. Experiment: ID. Expression: Level, with P(Level | Module, ID). Every module has a unique expression profile. SYK ’03 (ISMB)

Probabilistic Model Experiment Gene Expression Module Sequence S4S4 S1S1 S2S2 S3S3 R1R1 R2R2 R3R3 ID Level Sequence Motifs Motif Profiles Expression Profiles genes Motif profile Expression profile Regulatory Modules SYK ’03 (ISMB)

Learning Problem Experiment Gene Expression Module Sequence S4S4 S1S1 S2S2 S3S3 R1R1 R2R2 R3R3 ID Level Sequence Motifs Motif Profiles Expression Profiles Genes: Variables per gene Sequence: 1000 Expression: Motifs: (hidden) Module: 1 (hidden) Learn Module assignments “Active” motifs per gene Motif profiles That maximize P(M|D) Hard SYK ’03 (ISMB)

Learning Algorithm Overview. (Flow diagram: clustering → gene partition; motif search → motif set; E-step and M-step, with add/delete motifs; output: regulatory modules.)

Learning the Set of Active Motifs. Add all sequences of length k as motifs (e.g. ACGTAGT, TGATGCA, ACGTGC, GCTGGT, TTTTAC) to the motif set? Rejected (X): overfitting. Instead, use the expression data to guide the search for new motifs.

Dynamically Adding Motifs. Examine all regulatory modules. Compare genes with motif profile to module genes. Add motif initialized to common motif in missed genes. Regulatory Module 1 (motif profile, expression profile): all genes match motif profile. Regulatory Module 2: many genes do not match motif profile, so add a motif (CCAAT).

Outline Who regulates whom and when? How are genes regulated? Model Evaluation Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC Reg. ACGTGC

Application of Method to Data.
Yeast: 4 expression datasets; 500bp upstream seq.; 77 motif profiles; 65 motifs; 25 known (out of 37).
Human: 4 expression datasets; 1000bp upstream seq.; 62 motif profiles; 80 motifs; 10 known.
TRANSFAC (37 known motifs). Method found many known motifs in yeast. SYK ’03 (ISMB)

Comparison to Standard Approach (Recovery of known motifs). (Chart: yeast and human; our method vs. standard approach.) Our method found many more known motifs from the literature. SYK ’03 (ISMB)

Cell Division Module in Human. Module genes: Caspase 3 Cyclin A2 Cyclin F CDC 2 Centromere A Centromere E kinesin family karyopherin alpha 2 polo-like kinase RGS3 Serine kinase 6 topoisomerase II TTK protein kinase aurora kinase B Kinase family 23 extra spindle pole 1 ARHGAP11A HEC Ubiquitin-conjugating CDC8 DKFZp762E1312 NALP2 C20orf129 DDA3 UBF-fl. Figure panels: DNA control sequence of module genes (NFAT motif, novel motif); expression of module genes. Module genes functionally coherent? Module genes involved in mitosis (10/25, P<10^-9). Module genes known to be regulated by predicted motifs? NFAT regulates cytokine (cell division) genes. SYK ’03 (ISMB)

Biological Evaluation Summary. Are the module genes functionally coherent? Human: 40/62. Yeast: 65/77. Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses). SYK ’03 (ISMB)

Evaluating Human Motifs. Signal or overfitting? Hide sequence of gene i. Learn motif model for module. Assign gene i to module if gene is in module with Prob. ≥ 0.5. Repeat for all genes. Classification margin = True positives (%) - False positives (%). DNA control sequences, module genes: Gene 1: TTGACTGCACTCGGCAATTACTATACT; Gene 2: AGCACTGCACTGCACTCGACTATACTA; Gene 3: TTTTACTATCTCACGATGCACTCGGCC; Gene 4: ACACTTACTATACCCTTGCACTCGTAG. Non-module genes: Gene 5: TAGGCCAACCCGGTGGCTTACTATACT; Gene 6: ACAAACGTGAGTTTTCATCGAGTTCTT; Gene 7: ACGTGCACTCGAATATAGTCTTGATTT; Gene 8: CTGATCGTAGCGGGTAGCTCGCGAGG. Motifs: TGCACTCG, TTACTAT. (Figure labels: P<0.5 (false positive), P ≥ 0.5 (true positive).) SS ’04 (RECOMB)
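The classification margin defined on this slide is simple to compute once each held-out gene has a predicted probability of belonging to the module. The sketch below assumes that true positives are module genes assigned to the module (probability at least 0.5) and that false positives are non-module genes assigned to it. The function name and the probabilities are made up for illustration.

```python
def classification_margin(module_probs, non_module_probs, threshold=0.5):
    """True positives (%) minus false positives (%) at the given threshold."""
    tp_rate = sum(p >= threshold for p in module_probs) / len(module_probs)
    fp_rate = sum(p >= threshold for p in non_module_probs) / len(non_module_probs)
    return 100.0 * (tp_rate - fp_rate)

# Hypothetical held-out predictions for 4 module genes and 4 non-module genes.
module_probs = [0.9, 0.8, 0.7, 0.3]      # 3 of 4 assigned to the module: 75%
non_module_probs = [0.6, 0.1, 0.2, 0.4]  # 1 of 4 assigned to the module: 25%

margin = classification_margin(module_probs, non_module_probs)
print(margin)  # 50.0
```

A margin near zero would mean the motif model accepts non-module genes about as often as module genes, which is what overfitting looks like under this test.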

Evaluating Human Motifs. (Plot: classification margin per module, compared with the best classification margin from 100 random modules; modules are labeled by functional annotation and by motif.) Motif: HSF; genes: protein folding. HSF is known to regulate protein folding. Motif: GATA; genes: mitochondrial. GATA is known to activate mitochondrial membrane genes. SS ’04 (RECOMB)

Biological Evaluation Summary. Compendium of human cis-regulatory modules. Module genes are functionally coherent. Module genes similarly expressed in external datasets. Learned motifs characterize module genes.

Incorporating Protein-DNA binding. Protein-DNA binding: identifies all the genes that are bound by a regulator; noisy assay. (Figure: regulator at the control regions of Gene 1 and Gene 2.)

Incorporating Protein-DNA binding. Gene: sequence variables S1, S2, S3, S4; motif variables R1, R2, R3 (is the motif recognized by regulator 3 “active” in gene g?); binding variables P1, P2, P3 (does regulator 3 bind to gene g?); Module. Experiment: ID. Expression: Level. Protein-DNA data for regulator i is a noisy sensor for regulation by motif i. SBSFK ’02 (RECOMB)

Outline Who regulates whom and when? How are genes regulated? Regulation of multi-functional genes Evolution of gene regulation Reg. ACGTGC Reg. ACGTGC

Model Assumption. Assumption: every gene belongs to exactly one module. (Figure: Gene: Module; Experiment: Regulator 1, Regulator 2, Regulator 3; Expression: Level.)

Multi-Functional Genes Model. Every gene can belong to multiple modules (figure: Gene 1 and Gene 2 in Module 1; Gene 2 and Gene 3 in Module 2). The expression of a gene is the sum of its expression in each module in which it participates (figure: Gene 2 expression as the sum of two profiles).

Multi-Functional Genes Model. Gene: M1, M2, M3 (is gene “g” part of module i?). Experiment: A1, A2, A3 (activity level of module i in experiment). Expression: Level. Expression is a sum of activity level of all modules: $\mathrm{Level}_{g,e} \sim N\left(\sum_i g.M_i \cdot e.A_i,\ \sigma\right)$. SBK ’03 (PSB)

Connection to SVD. Singular Value Decomposition (Golub et al. ’96, Alter et al. ’00) factors the genes × experiments matrix into genes × modules and modules × experiments parts: $E = M A^{T}$, i.e. $\mathrm{Level}_{g,e} = \sum_i g.M_i \cdot e.A_i$. Our model: $\mathrm{Level}_{g,e} \sim N\left(\sum_i g.M_i \cdot e.A_i,\ \sigma\right)$. Difference to our model: discrete module assignments. Learning problem (hard): module assignments, module activity levels. SBK ’03 (PSB)
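The sum model can be made concrete with a tiny worked example. The sketch below computes the expected expression level of each gene in each experiment as the sum, over the modules the gene belongs to, of the module's activity in that experiment (the mean of the Gaussian, ignoring noise). The membership pattern follows the three-gene example from the multi-functional genes slide; the activity values are made up.

```python
def expected_expression(assignments, activities):
    """levels[g][e] = sum over modules i of assignments[g][i] * activities[e][i]."""
    return [
        [sum(m * a for m, a in zip(gene_row, exp_row)) for exp_row in activities]
        for gene_row in assignments
    ]

# Rows: Gene 1, Gene 2, Gene 3. Columns: Module 1, Module 2.
assignments = [
    [1, 0],  # Gene 1 is in Module 1 only
    [1, 1],  # Gene 2 is in both modules
    [0, 1],  # Gene 3 is in Module 2 only
]
# Rows: experiments. Columns: activity of Module 1, Module 2 (hypothetical values).
activities = [
    [1.0, 0.5],
    [-2.0, 3.0],
]

levels = expected_expression(assignments, activities)
print(levels)  # [[1.0, -2.0], [1.5, 1.0], [0.5, 3.0]]
```

Gene 2's profile is the element-wise sum of the profiles of Gene 1 and Gene 3, which is the additive picture drawn on the slide.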

A 11 A 12 A 13 Hidden M 12 M 11 M 13 Hidden Hard M 12 Level 11 A 11 Level 12 Level 21 Level 22 Bayesian Network A 12 A 13 M 11 M 13 A 21 A 22 A 23 M 12 M 11 M 13 (3 Modules, 2 genes, 2 experiments) Learning Assignments and Activities Every pair of hidden vars. are dependent Standard approximations Loopy belief propagation Variational methods Genes: Experiments: ~200 Modules: ,000,000 dependent hidden variables At best, local maximum of approximate energy function SBK ’03 (PSB)

Learning Assignments and Activities (Bayesian network for 3 modules, 2 genes, 2 experiments). With both the assignments M and the activities A hidden, learning is hard. Initialize the assignments, then alternate between two easy problems: optimize activities given assignments (A hidden, M observed), and optimize assignments given activities (A observed, M hidden). Standard approximations converge (at best) to local maximum of approximate energy function. Our algorithm converges to strong local maximum. SBK ’03 (PSB)

Learning Module Activity Levels. [Figure: Bayesian network for 3 modules, 2 genes, 2 experiments, with A hidden and M observed: easy.] The $A_{ij}$ variables are continuous. Standard least squares problem. Optimization problem: [equation image not preserved] SBK ’03 (PSB)
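A sketch of the least squares step for a single experiment, assuming the module assignments are fixed and the objective is the squared error implied by the Gaussian model (the slide's own equation is not preserved). It solves the normal equations in plain Python. The helper names `solve_linear` and `fit_activities` are illustrative.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: module activities not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_activities(M, levels):
    """Least-squares activities for one experiment.

    M[g][i] is 1 if gene g is in module i, else 0; levels[g] is the observed
    expression of gene g. Solves the normal equations (M^T M) a = M^T y.
    """
    m = len(M[0])
    MtM = [[sum(row[i] * row[j] for row in M) for j in range(m)] for i in range(m)]
    Mty = [sum(row[i] * y for row, y in zip(M, levels)) for i in range(m)]
    return solve_linear(MtM, Mty)

# Toy data: 3 genes, 2 modules; gene 3 belongs to both modules.
M = [[1, 0], [0, 1], [1, 1]]
levels = [2.0, -1.0, 1.0]
activities = fit_activities(M, levels)
```

Each experiment can be fitted independently in this way, because the activities of one experiment do not appear in the error terms of another.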

Learning Module Assignments. [Figure: Bayesian network for 3 modules, 2 genes, 2 experiments, with A observed and M hidden.] The $M_{ij}$ variables are discrete. For each gene, combinatorial search in time $2^m$. Optimization problem: [equation image not preserved]
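The per-gene combinatorial search can be sketched as follows, assuming fixed activities and a squared-error objective (which corresponds to maximizing the Gaussian likelihood; the slide's own objective is not preserved). The function name and toy data are illustrative.

```python
from itertools import product

def best_assignment(levels, activities):
    """Exhaustive search over all 2^m binary module memberships for one gene.

    levels[e] is the gene's observed expression in experiment e;
    activities[e][i] is the activity of module i in experiment e.
    """
    m = len(activities[0])

    def squared_error(membership):
        return sum(
            (y - sum(mi * ai for mi, ai in zip(membership, act))) ** 2
            for y, act in zip(levels, activities)
        )

    return min(product((0, 1), repeat=m), key=squared_error)

# Toy data: 2 experiments, 3 modules. The levels were generated by (1, 0, 1).
activities = [[1.0, 2.0, -1.0], [0.5, -1.0, 3.0]]
levels = [0.0, 3.5]
assignment = best_assignment(levels, activities)
```

The loop visits every one of the 2^m candidate memberships, so it is only practical for a small number of modules.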

Learning Module Assignments. [Figure: Bayesian network for 3 modules, 2 genes, 2 experiments, with A observed and M hidden.] Optimize for continuous $M_{ij}$. For each gene i, select the k largest variables from $\{M_{i1}, \dots, M_{im}\}$. Combinatorial search in time $2^k$. Optimization problem: [equation image not preserved]
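A sketch of the restricted search, assuming the continuous relaxation has already been solved and its per-module weights for one gene are passed in as `relaxed_weights` (a hypothetical input; how the talk computes them is not shown here). Only the k largest weights are searched, and every other module is fixed to 0.

```python
from itertools import product

def restricted_assignment(levels, activities, relaxed_weights, k):
    """Search the 2^k memberships over the k modules with the largest weights."""
    m = len(relaxed_weights)
    top = sorted(range(m), key=lambda i: relaxed_weights[i], reverse=True)[:k]

    def squared_error(membership):
        return sum(
            (y - sum(mi * ai for mi, ai in zip(membership, act))) ** 2
            for y, act in zip(levels, activities)
        )

    best = None
    for bits in product((0, 1), repeat=len(top)):
        membership = [0] * m
        for idx, bit in zip(top, bits):
            membership[idx] = bit
        err = squared_error(membership)
        if best is None or err < best[0]:
            best = (err, tuple(membership))
    return best[1]

# Toy data: 2 experiments, 3 modules; weights favor modules 1 and 3.
activities = [[1.0, 2.0, -1.0], [0.5, -1.0, 3.0]]
levels = [0.0, 3.5]
assignment = restricted_assignment(levels, activities, [0.9, 0.1, 0.8], k=2)
```

With k much smaller than m, the cost per gene drops from 2^m to 2^k evaluations. The price is that modules outside the top k can never be selected.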

Comparison to Plaid (Lazzeroni and Owen ’02). Compare P-value of enrichment for functional annotations (GO). (P-value of annotation enrichment = best hypergeometric p-value in any module.) [Figure: log(P-value), Plaid vs. our method.] 122 of 137 annotations are more significant in our model. SBK ’03 (PSB)
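The enrichment score described on this slide can be sketched with a one-sided hypergeometric tail. The function names and the toy gene sets below are illustrative, and the talk's exact counting conventions are not given, so this is an assumption-laden example rather than the authors' code.

```python
from math import comb

def hypergeom_pvalue(population, annotated, module_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, annotated, module_size)."""
    upper = min(annotated, module_size)
    total = comb(population, module_size)
    tail = sum(
        comb(annotated, x) * comb(population - annotated, module_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total

def best_enrichment(population, annotated_genes, modules):
    """Best (smallest) hypergeometric p-value of one annotation over all modules."""
    return min(
        hypergeom_pvalue(population, len(annotated_genes), len(mod),
                         len(mod & annotated_genes))
        for mod in modules
    )

# Toy example: 10 genes in total, 4 carry the annotation, two modules of size 3.
annotated_genes = {"g1", "g2", "g3", "g7"}
modules = [{"g1", "g2", "g3"}, {"g4", "g5", "g6"}]
p_best = best_enrichment(10, annotated_genes, modules)
```

Comparing two methods then amounts to computing this best p-value per annotation for each method and counting how often one is smaller.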

Comparison to Standard Clustering. Compare P-value of enrichment for functional annotations (GO). (P-value of annotation enrichment = best hypergeometric p-value in any module.) [Figure: log(P-value), hierarchical clustering vs. our method.] 120 of 137 annotations are more significant in our model. SBK ’03 (PSB)

Adding the Regulation Model. [Figure: the multi-functional genes model (Gene variables $M_1, M_2, M_3$, Experiment variables $A_1, A_2, A_3$, expression Level) extended with Regulator 1, Regulator 2, and Regulator 3. The activity level of module i in an array is linked to regulators, with HAP4 and CMK1 shown as examples.] BSK ’04 (RECOMB)

Outline. Who regulates whom and when? How are genes regulated? Regulation of multi-functional genes. Evolution of gene regulation. Robust prediction of gene function. Identifying conserved modules.

Single Species Gene Expression. Co-expression is not always functionally relevant: noise in DNA microarray technology; biological sloppiness. Use evolution as a filter.

Multiple Species Gene Expression. Different organisms share many of their genes. Can we learn something from observing the expression of the same gene in multiple species? [Figure: yeast and human orthologs.] ~30% of yeast genes are conserved in human. Irrelevant co-expression is uncorrelated in different species. Relevant co-expression confers selective advantage. Combining expression from multiple species can improve gene function and regulatory module discovery.

Conserved Co-Expression Network. Yeast (643), Worm (949), Fly (155), Human (1202). Connect genes that are co-expressed in at least two organisms. [Figure: 3D visualization of network.] SSKK ’03 (Science)
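A minimal sketch of the edge rule on this slide. It assumes genes have already been mapped to shared ortholog identifiers and that a per-organism list of co-expressed pairs is available; how co-expression is called within each organism is not specified here. All names and data are illustrative.

```python
from collections import Counter

def conserved_edges(coexpressed_by_organism, min_organisms=2):
    """Keep gene pairs that are co-expressed in at least `min_organisms` organisms."""
    counts = Counter()
    for pairs in coexpressed_by_organism.values():
        # Deduplicate within an organism and ignore pair orientation.
        for pair in {frozenset(p) for p in pairs}:
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_organisms}

# Toy input keyed by organism; gene names stand for ortholog groups.
coexpressed = {
    "yeast": [("A", "B"), ("B", "C")],
    "worm": [("B", "A")],
    "fly": [("C", "D")],
    "human": [("C", "D"), ("A", "B")],
}
edges = conserved_edges(coexpressed)
```

Pairs seen in only one organism, such as B and C above, are dropped, which is the "evolution as a filter" idea applied to the network.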

Conserved Co-Expression Network. [Figure: network regions labeled ribosome biogenesis, energy generation, cell cycle, secretion, neuronal, proteasome, general transcription, ribosomal subunits, signaling, translation initiation and elongation, lipid metabolism, unknown.] SSKK ’03 (Science)

Predicting Gene Function. Predict function using guilt-by-association scheme. [Figure: classification accuracy (%) per gene annotation (Gene Ontology); 40 annotations at 50% accuracy, 70 annotations at 30% accuracy; protein modification highlighted.] SSKK ’03 (Science)
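One simple way to realize a guilt-by-association scheme is a neighbor vote in the co-expression network, sketched below. The talk does not spell out its exact scoring, so the voting rule, the function name, and the toy data are all assumptions.

```python
from collections import Counter

def predict_function(gene, neighbors, annotations):
    """Predict the most common annotation among a gene's annotated neighbors.

    Returns (annotation, fraction of annotated neighbors carrying it),
    or None when no neighbor is annotated.
    """
    votes = Counter()
    annotated_neighbors = 0
    for nb in neighbors.get(gene, ()):
        labels = annotations.get(nb)
        if labels:
            annotated_neighbors += 1
            votes.update(labels)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label, count / annotated_neighbors

# Toy network: g0 has four neighbors, three of which are annotated.
neighbors = {"g0": ["g1", "g2", "g3", "g4"]}
annotations = {
    "g1": {"protein modification"},
    "g2": {"protein modification"},
    "g3": {"translation"},
}
prediction = predict_function("g0", neighbors, annotations)
```

The returned fraction can serve as a confidence score for ranking predictions, for example when selecting the most confident ones for evaluation.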

Predicting Protein Modification. [Figure: classification accuracy (%) for the 50 most confident predictions. Predictions using a single species (worm, fly, human, yeast) reach between 12% and 18% (bars: 12%, 18%, 15%, 13%); multiple species prediction reaches 76%.] Significant improvements over any single species network. SSKK ’03 (Science)

Biological Experiment. Prediction: ZK652.1 plays a role in cell proliferation. Experiment: knock out ZK652.1 and test the mutant. [Figure: excess nuclei in mutant.] Consistent with the cell proliferation prediction. SSKK ’03 (Science)

Outline. Who regulates whom and when? How are genes regulated? Regulation of multi-functional genes. Evolution of gene regulation. Robust prediction of gene function. Identifying conserved modules. [Figure: regulator and ACGTGC motif icons, including a mouse and human pair.]

Conserved Gene Regulation Model. [Figure: two module network models, one for Organism 1 and one for Organism 2, each with Regulator 1, Regulator 2, Regulator 3, Module, Gene, Experiment, and expression Level.] Compatibility potential over (Module, Module): orthologs are more likely to be in the same module. Regulation programs for the same module are more likely to share regulators.

Conserved Regulation. Human (138): normal brain (4); brain tumors: gliomas (57), medulloblastoma (60), miscellaneous (17). Mouse (42): brain development (39); brain tumors: medulloblastoma (3). Goal: discover regulators in brain that are shared between human and mouse.

Comparison to Single Species. [Figure: test data log-likelihood (gain per gene) for human and for mouse, single species vs. multiple species.] By combining expression data from mouse, we can learn a better model of gene regulation in human.

Neuron Differentiation Module (mouse and human). Regulator: NeuroD1. Module genes functionally coherent? Brain expressed genes (18/34 P< ). Module genes known targets of predicted regulators? NeuroD known to regulate module genes.

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Finding conserved regulators (mouse, human): SSKK ’03 (Science). Finding motifs: SS ’04 (RECOMB), SBSFK ’02 (RECOMB), SYK ’03 (ISMB). Finding regulators: SSRPBKF ’03 (Nature Gen.), SPRKF ’03 (UAI), BSK ’04 (RECOMB).

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes: gene regulation; two-sided clustering; learning abstraction hierarchies; discovering molecular pathways; learning with clinical data. SOK ’01 (NIPS), SK ’02 (RECOMB), STGFK ’01 (ISMB), SWK ’03 (ISMB), SSKK ’03 (Science), SSRPBKF ’03 (Nature Gen.), SPRKF ’03 (UAI), SS ’04 (RECOMB), SBSFK ’02 (RECOMB), SYK ’03 (ISMB), SBK ’03 (PSB), BSK ’04 (RECOMB)

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Unified Approach for Heterogeneous Data: gene expression; DNA sequence; protein-DNA binding data; multiple species data; protein-protein interaction data. SBSFK ’02 (RECOMB), SWK ’03 (ISMB), SSKK ’03 (Science), SSRPBKF ’03 (Nature Gen.), SYK ’03 (ISMB), SS ’04 (RECOMB), SBK ’03 (PSB)

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Unified Approach for Heterogeneous Data. Model Automatically Learned from Data: convex optimization; graph theoretic algorithms; dynamic programming; heuristic search; exploit modularity in biological system; exploit problem-specific structure. [Figure: workflow of model design, learn model, data, analyze results.]

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Unified Approach for Heterogeneous Data. Model Automatically Learned from Data. Model Evaluation Methods: comparison to existing methods; cross validation; enrichment for known biological function; relative to current knowledge in literature.

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Unified Approach for Heterogeneous Data. Model Automatically Learned from Data. Model Evaluation Methods. Testable Biological Hypotheses: generate novel hypotheses from model; wet-lab validation of predictions. SSKK ’03 (Science), SSRPBKF ’03 (Nature Gen.)

Summary: Probabilistic Framework. Rich Modeling Language for Biological Processes. Unified Approach for Heterogeneous Data. Model Automatically Learned from Data. Model Evaluation Methods. Testable Biological Hypotheses. Visualization Software.

The Challenge Ahead Organisms Data types Conditions Developmental Physiological Environmental Clinical Metabolic Experimental Protein expression Tissue specific expression Interaction data Location data … Biological information ?