Discovering cis-regulatory motifs using genome-wide sequences and expression Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky, Ron Shamir.


1 Discovering cis-regulatory motifs using genome-wide sequences and expression Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky, Ron Shamir

2 Gene expression regulation
Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
TFBSs are located mainly in the gene’s promoter, the DNA sequence upstream of the gene’s transcription start site (TSS)
TFs can promote or repress transcription
Other regulators: microRNAs (miRNAs)

3 TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
Degenerate string: GGWATB (W={A,T}, B={C,G,T})
PWM = Position weight matrix (rows as they appear on the slide, positions 6 to 1; some cells are missing)
A: 0 0.2 0.7 0 0.8 0.1
C: 0.6 0.4 0.1 0.5 0.1 0
G: 0.4 0.1 0.5 0 0
T: 0.3 0 0.1 0 0.9
Cutoff = 0.009
Example sequences (with the hit score where the slide shows one):
AGCTACACCCATTTAT 0.06
AGTAGAGCCTTCGTG 0.06
CGATTCTACAATATGA 0.01
ATCGGAATTCTGCAG
GGCAATTCGGGAATG
AGGTATTCTCAGATTA
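A minimal sketch of how a PWM with a cutoff can be used to call binding-site hits. The matrix below is a hypothetical 3-column PWM, not the one on the slide, and the scoring rule (product of per-position probabilities, forward strand only) is an assumption chosen for illustration; the function names are made up here.

```python
from typing import Dict, List, Tuple

# Hypothetical 3-column PWM (NOT the slide's matrix): one dict per motif
# position, mapping each nucleotide to its probability at that position.
PWM: List[Dict[str, float]] = [
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.8, "C": 0.1, "G": 0.1, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
]


def window_score(pwm: List[Dict[str, float]], window: str) -> float:
    """Score a window as the product of the per-position probabilities."""
    score = 1.0
    for column, base in zip(pwm, window):
        score *= column[base]
    return score


def find_hits(pwm: List[Dict[str, float]], sequence: str,
              cutoff: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every forward-strand window scoring >= cutoff."""
    k = len(pwm)
    hits = []
    for start in range(len(sequence) - k + 1):
        score = window_score(pwm, sequence[start:start + k])
        if score >= cutoff:
            hits.append((start, score))
    return hits
```

For example, find_hits(PWM, "GGTACTAG", 0.3) reports the windows TAC and TAG. A real tool would typically also scan the reverse strand.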

4 Motif discovery: The typical two-step pipeline
[Diagram labels: Gene expression microarrays, Clustering (Cluster I, Cluster II, Cluster III); Location analysis (ChIP-chip, …); Functional group (e.g., GO term); Co-regulated gene set; Promoter/3’UTR sequences; Motif discovery]

5 Motif discovery: Goals and challenges
Goal: Reverse-engineer the transcriptional regulatory network
Challenges:
- BSs are short and degenerate (non-specific)
- Promoters are long + complex (hard to model)
- Search space is huge (motif and sequence)
- Data is noisy
- What to look for? (enriched?, localized?, conserved?)
Problem is still considered very difficult despite extensive research [Tompa ’05]

6 Amadeus: A Motif Algorithm for Detecting Enrichment in mUltiple Species
Supports diverse motif discovery tasks:
1. Finding over-represented motifs in one or more given sets of genes.
2. Identifying motifs with global spatial features given only the genomic sequences.
3. Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets.
How?
- A general pipeline architecture for enumerating motifs.
- Different statistical scoring schemes of motifs for different motif discovery tasks.

7 Motif search algorithm
Pipeline of refinement phases
Each phase receives the best candidates of the previous phase, and refines them
First phases are simple and fast (e.g., try all k-mers); last phases are more complex (e.g., optimize PWM)
Phases: Preprocess, Mismatch, Merge, PWM Optimization
Motif model: k-mer, list of k-mers, PWM (Cutoff = 0.005)
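The idea of a pipeline of refinement phases can be sketched as follows. This is a toy illustration, not the Amadeus implementation: it has only two phases, and the scores are plain counts of sequences containing a hit (exact match, then up to one mismatch) rather than the statistical scores used in the talk. All function names are hypothetical.

```python
from typing import Callable, List, Sequence


def all_kmers(sequences: Sequence[str], k: int) -> List[str]:
    """Enumerate every distinct k-mer that occurs in the sequences."""
    return sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def genes_with_hit(kmer: str, sequences: Sequence[str], max_mismatches: int) -> int:
    """Count sequences with at least one window within max_mismatches of kmer."""
    k = len(kmer)
    return sum(
        any(hamming(kmer, s[i:i + k]) <= max_mismatches
            for i in range(len(s) - k + 1))
        for s in sequences
    )


def refine(candidates: List[str], score: Callable[[str], float], keep: int) -> List[str]:
    """One phase: rank the candidates by score and pass on the best `keep`."""
    return sorted(candidates, key=score, reverse=True)[:keep]


def toy_pipeline(sequences: Sequence[str], k: int, keep: int) -> List[str]:
    # Phase 1 (fast): exact k-mer occurrences.
    phase1 = refine(all_kmers(sequences, k),
                    lambda m: genes_with_hit(m, sequences, 0), keep)
    # Phase 2 (slower): allow one mismatch, only for the phase-1 survivors.
    return refine(phase1, lambda m: genes_with_hit(m, sequences, 1),
                  max(1, keep // 2))
```

The design point is that the expensive scoring is applied only to the few candidates that survive the cheap phase.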

8 PWM optimization phase


10 Task I: Over-represented motifs in given target set
Input: Target set (T) = co-regulated genes; Background (BG) set (B) = entire genome
No sequence model is assumed!
Motif scoring: Hypergeometric (HG) enrichment score, where b, t = BG/Target genes containing a hit
Note: the BG set should be of the same “nature” as the target set, and much larger (e.g., all genes on the microarray)
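A minimal sketch of a hypergeometric enrichment p-value for the quantities defined on this slide. The slide does not show the formula, so the exact form used here, the upper tail P(X >= t) for a target set drawn without replacement from the BG set, is the standard hypergeometric tail and is assumed rather than taken from the talk. The function name and argument names are hypothetical.

```python
from math import comb


def hg_enrichment_pvalue(bg_size: int, target_size: int, b: int, t: int) -> float:
    """P(X >= t), where X counts hit-containing genes in a random subset of
    `target_size` genes drawn without replacement from `bg_size` BG genes,
    `b` of which contain a motif hit."""
    total = comb(bg_size, target_size)
    upper = min(b, target_size)
    tail = sum(
        comb(b, i) * comb(bg_size - b, target_size - i)
        for i in range(t, upper + 1)
    )
    return tail / total
```

For instance, with 10 BG genes of which 4 contain a hit, a target set of 3 genes with 2 hits gives hg_enrichment_pvalue(10, 3, 4, 2) = 40/120.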


12 Drawback of the HG score
Length/GC-content distribution in the target set might significantly differ from the distribution in the BG set
- Very common in practice due to correlation between the expression/function of genes and the length/GC-content of their promoters and 3’ UTRs
- The HG score might fail to discover the correct motif or detect many spurious motifs
→ Use the binned enrichment score
- Slightly less sensitive than HG score…
- … but takes into account length/GC-content biases

13 Drawback of the HG score
Length/GC-content distribution in the target set might significantly differ from the distribution in the BG set
Very common in practice due to correlation between the expression/function of genes and the length/GC-content of their promoters and 3’ UTRs
H0 assumes uniform sampling
The HG score might fail to discover the correct motif or detect many spurious motifs

14 Binned enrichment score
Key idea: Binning sequences
- B_i, T_i = BG/Target genes in i-th bin
- b_i = motif hits in i-th bin. t = b∩T
- Bins sampling probability:
- Assume uniform sampling per bin: p_m = prob. of a target set gene to contain a hit
Assume that |T| target genes are sampled with replacement from B
[Figure: BG genes binned by length (0.4-0.7kbp, 0.7-1kbp) and GC-content (20-40%, 40-60%) into bins B_1…B_4, with target genes T_1…T_4 and motif hits b_1…b_4]
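The slide's formulas did not survive extraction, so the following is a sketch under stated assumptions rather than the talk's exact definition. It assumes that a target gene falls in bin i with probability |T_i|/|T|, that within a bin a gene is drawn uniformly (so it contains a hit with probability b_i/|B_i|), and that the p-value is the binomial tail for seeing at least t hits among |T| draws with replacement. Function and argument names are hypothetical.

```python
from math import comb
from typing import Sequence


def binomial_tail(n: int, p: float, t: int) -> float:
    """P(X >= t) for X ~ Binomial(n, p). Fine for small n; large n needs log-space."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(t, n + 1))


def binned_enrichment_pvalue(bg_sizes: Sequence[int], target_sizes: Sequence[int],
                             bg_hits: Sequence[int], t: int) -> float:
    """bg_sizes[i] = |B_i|, target_sizes[i] = |T_i|, bg_hits[i] = b_i,
    t = number of target genes that contain a hit."""
    target_total = sum(target_sizes)
    # p_m: probability that one sampled target gene contains a hit (assumed form).
    p_m = sum(
        (t_i / target_total) * (b_i / big_b_i)
        for big_b_i, t_i, b_i in zip(bg_sizes, target_sizes, bg_hits)
        if big_b_i > 0
    )
    return binomial_tail(target_total, p_m, t)
```

With this form, hits in a bin only count as surprising relative to the hit rate of BG sequences with similar length and GC-content.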

15 Test case: Human G2+M cell-cycle genes
Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. ’02]
Motifs: CHR, NF-Y (CCAAT-box)
These motifs form a module associated with G2+M [Elkon et al. ’03, Tabach et al. ’05, Linhart et al. ’05]
Pairs analysis

16 Results: Human G2+M cell-cycle genes
~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. ’02].
Motifs: CHR, CCAAT-box
Both motifs are associated with G2+M [Elkon et al. ’03, Tabach et al. ’05, Linhart et al. ’05].

17 Benchmark I: Yeast TF target sets
Source: ChIP-chip [Harbison et al., ’04]
Data: 173 target-sets of 83 TFs with known BS motifs
Average set size: 58 genes (=35 kbp)
Success rates: (for top 2 motifs of lengths 8 & 10)

18 Benchmark: Real-life metazoan datasets
We constructed the first motif discovery benchmark that is based on a large compendium of experimental studies
Source: Various (expression, ChIP-chip, Gene Ontology, …)
Data: 42 target-sets of 26 TFs and 8 miRNAs from 29 publications
Species: human, mouse, fly, worm
Average set size: 400 genes (=383 kbp)
Binned score improvement

19 Similarity between two motifs
- Euclidean distance
- Pearson correlation coefficient
- Kullback-Leibler divergence (relative entropy)
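The three measures named on this slide can be sketched for a pair of PWM columns as below. The slide's own formulas are not preserved, so these are the textbook definitions. Applying them column by column to two aligned, equal-length PWMs and averaging is an assumption made for illustration, as is the small pseudocount that keeps the KL divergence finite.

```python
from math import log, sqrt
from typing import Callable, List, Sequence

Column = Sequence[float]  # probabilities of A, C, G, T at one motif position


def euclidean(p: Column, q: Column) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson(p: Column, q: Column) -> float:
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    denom = sqrt(sum((a - mean_p) ** 2 for a in p) * sum((b - mean_q) ** 2 for b in q))
    return cov / denom if denom > 0 else 0.0


def kl_divergence(p: Column, q: Column, pseudocount: float = 1e-6) -> float:
    """Relative entropy D(p || q) after smoothing both columns (assumed smoothing)."""
    norm = 1.0 + pseudocount * len(p)
    ps = [(a + pseudocount) / norm for a in p]
    qs = [(b + pseudocount) / norm for b in q]
    return sum(a * log(a / b) for a, b in zip(ps, qs))


def motif_measure(m1: List[Column], m2: List[Column],
                  column_measure: Callable[[Column, Column], float]) -> float:
    """Average a per-column measure over two aligned PWMs of equal length."""
    return sum(column_measure(c1, c2) for c1, c2 in zip(m1, m2)) / len(m1)
```

Note that Euclidean distance and KL divergence are dissimilarities (0 for identical columns), while the Pearson coefficient is a similarity (1 for identical non-uniform columns).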

20 Metazoan benchmark: Other assessment methods for success rate

21 Metazoan benchmark: Detailed results

22 Binned score - examples
Mef2 fly target-set [Blais et al. ’05]
- Promoters are longer than average (972bp vs. 840bp)
- Promoters have higher GC-content (53% vs. 49%)
- None of the programs discovered the correct motif
- Binned score -> Mef2 is the top-scoring motif
hsa-miR-16 target-set [Linsley et al. ’07]
- 3’UTRs are longer than average (~1700bp vs. ~960bp)
- HG score: the hsa-miR-16 signature is the top-scoring motif, but 10 more motifs have p<1E-14
- Binned score: the correct motif has p=1.7E-33; no spurious motifs
198 mouse odorant receptor promoters [Michaloski et al. ’06]
- Highly AT-rich (35% vs. ~25% in the BG)
- HG score: Olf-1 was the third best motif, after AT-rich motifs
- Binned score: Olf-1 is the top-scoring motif


24 Amadeus – Global spatial analysis
[Diagram labels: Promoter sequences; Output: Motif(s); Gene expression microarrays; Location analysis (ChIP-chip, …); Functional group (e.g., GO term); Co-regulated gene set]

25 Task II: Global analyses
Input: Sequences (no target-set / expression data)
Motif scoring: Scores for spatial features of motif occurrences
- Localization w.r.t. the TSS
- Strand-bias
- Chromosomal preference

26 Global analysis I: Localized human + mouse motifs Input: All human & mouse promoters (2 x ~20,000) Score: localization

27 Global analysis II: Chromosomal preference in C. elegans Input: All worm promoters (~18,000) Score: chromosomal preference Results: Novel motif on chrom IV

28 Amadeus is available at: http://acgt.cs.tau.ac.il/amadeus
“Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets”, C. Linhart*, Y. Halperin*, R. Shamir, Genome Research 18:7, 2008 (*equal contribution)


30 PRIMA – GOAL: ‘Reverse engineering’ of transcriptional networks
Co-expression → Co-regulation → common cis-regulatory promoter elements
Identification of co-expressed genes using microarray technology (clustering)
Computational identification of cis-regulatory elements that are over-represented in promoters of the co-expressed genes

31 PRIMA – General description
PRIMA identifies transcription factors (TFs) whose binding sites (BSs) are enriched in a given ‘target set’ of promoters with respect to a ‘background set’ of promoters.
Required ‘databases’:
- Promoter sequences on a genomic scale
- ‘Models’ for binding sites recognized by TFs
Implemented in Expander.

32 Amadeus - Allegro
[Diagram labels: Gene expression microarrays; Clustering (Cluster I, Cluster II, Cluster III); Co-regulated gene set; Promoter sequences; Expression data; Allegro; Output: Motif(s)]

33 Task III: Simultaneous inference of motifs & their associated expression profiles
Input: Genome-wide expression profiles
Motif scoring algorithm: Allegro (A Log-Likelihood based mEthod for Gene expression Regulatory motifs Over-representation discovery)
- Generalization of single condition analysis
- Outline: Allegro learns an expression model that describes the expression pattern of the motif’s putative targets. The motif is scored for over-representation in the set of genes whose expression profiles match the expression model

34 Allegro: expression model
Discretization of expression values: e1 = Up (U): ≥ 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): ≤ -1.0
Example for a gene g over conditions c1, c2, …, cm:
Expression pattern: c1 = -2.3, c2 = -0.8, …, cm = 1.5
Discrete expression pattern (DEP): c1 = D, c2 = S, …, cm = U
Expression data should be (partially) pre-processed, e.g.:
- Time series → log ratio relative to time 0
- Several tissues/mutations/… → standardization
- Do NOT filter out non-responsive genes
Expression model: CWM = Condition Weight Matrix
- Non-parametric, log-likelihood based model, analogous to PWM for sequence motifs
- Sensitive, robust against extreme values, performs well in practice
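The discretization step can be sketched as follows, using the slide's thresholds of ±1.0. The function names are hypothetical, and the expression values are assumed to be already pre-processed (e.g., log ratios).

```python
from typing import Sequence


def discretize(value: float, threshold: float = 1.0) -> str:
    """Map one expression value to Up (U), Same (S) or Down (D)."""
    if value >= threshold:
        return "U"
    if value <= -threshold:
        return "D"
    return "S"


def to_dep(profile: Sequence[float], threshold: float = 1.0) -> str:
    """Turn an expression pattern into a discrete expression pattern (DEP)."""
    return "".join(discretize(v, threshold) for v in profile)
```

Applied to the three values shown on the slide, to_dep([-2.3, -0.8, 1.5]) gives "DSU".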

35 Allegro: expression model
Discretization of expression patterns: e1 = Up (U): ≥ 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): ≤ -1.0
Example for a gene g over conditions c1, c2, …, cm:
Expression pattern: c1 = -2.3, c2 = -0.8, …, cm = 1.5
Discrete expression pattern (DEP): c1 = D, c2 = S, …, cm = U
Condition frequency matrix (CFM):
     c1     c2     …    cm
U    0.05   0.1    …    0.78
S    0.9    0.2    …    0.14
D    0.05   0.7    …    0.08
Condition weight matrix (CWM) (R={r_ij} is the BG CFM)
- Log-likelihood ratio (LLR) score
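A sketch of how a CFM, a CWM and an LLR score could fit together. The slide's formulas are not preserved, so the form used here is an assumption based on the stated analogy with PWMs: each CWM entry is log(f_ij / r_ij), where f_ij is the frequency of state i in condition j among the motif's putative targets and r_ij is the same frequency in the BG, and a gene's LLR is the sum of the entries selected by its DEP. The pseudocount and all names are hypothetical.

```python
from math import log
from typing import Dict, List, Sequence

STATES = "USD"  # Up, Same, Down


def condition_frequency_matrix(deps: Sequence[str],
                               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One column per condition: smoothed frequency of U/S/D among the DEPs."""
    cfm = []
    for j in range(len(deps[0])):
        counts = {s: pseudocount for s in STATES}
        for dep in deps:
            counts[dep[j]] += 1
        total = sum(counts.values())
        cfm.append({s: counts[s] / total for s in STATES})
    return cfm


def condition_weight_matrix(target_cfm: List[Dict[str, float]],
                            background_cfm: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Assumed CWM entry: log(f_ij / r_ij), with R = {r_ij} the BG CFM."""
    return [{s: log(f[s] / r[s]) for s in STATES}
            for f, r in zip(target_cfm, background_cfm)]


def expression_llr(dep: str, cwm: List[Dict[str, float]]) -> float:
    """Log-likelihood ratio of one gene's DEP under the CWM versus the BG."""
    return sum(column[state] for column, state in zip(cwm, dep))
```

Genes whose DEP resembles the target pattern get a positive LLR, and genes that look like the BG get a score near or below zero, even when the difference is confined to a few conditions.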

36 Features of the CWM expression model
Analogous to PWM for sequence motifs
Non-parametric: Does not assume a specific type of distribution (e.g., Gaussian) for expression values
Robust against extreme values
Sensitivity:
- Can describe expression profiles that differ from the BG only in a small subset of the conditions
- Can describe the regulatory effect of TFs that act both as repressors and activators in the same condition
Performance: Describes known modules (GO, ChIP-chip targets) better than commonly used metrics: Pearson/Spearman correlation, Euclidean distance

37 Allegro overview

38 Learning a CWM of a motif
[Diagram labels: Microarrays; genes; Motif hits; Motif target genes; CWM training set; Motif CWM; Expression LLR (High, Low); Motif enrich. p-value: 2.0E-5, 3.5E-7, 1.5E-9]
Cross-validation-like procedure to avoid overfitting

39 Compute expression LLR of all genes
Input: (A) CWM F (w) (B) Discretized genome-wide expression profiles
[Figure, steps (1), (2), (3): a |G| x |C| matrix of gene profiles (g1: UUSD, g2: UDSU, g3: UDSU); a |P| x |C| matrix of patterns (p1: UUSD, p2: UDSU); a min. spanning tree over the patterns UUSD, UDSU, DDSS, UDSD with edge labels 1, 2, 2, 2, 3, 1]
Example (Mouse TLRs dataset): |G|=~10000 |P|=1442 |C|=38 =1.6

40 Human cell cycle [Whitfield et al., ’02]
Large dataset: ~15,000 genes, 111 conditions, promoter region: -1000…200 bps
Motifs: E2F, CHR, CCAAT box
p-values: 1.3E-19, 6.6E-18, 3.9E-15
Cell-cycle phases: G1/S+S, G2+G2/M
Allegro recovers the major regulators of the human cell cycle [Elkon et al. ’03; Tabach et al. ’05; Linhart et al. ’05].

41 Yeast HOG pathway [O’Rourke et al. ’04]
~6,000 genes, 133 conditions
Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions
Extant two-step techniques recovered only 4 of the above motifs:
- K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
- Iclust + FIRE: RRPE, PAC, Rap1, STRE


43 Yeast HOG pathway: Comparison with the two-step pipeline
Biological process                            Motif/TF   K-means/CLICK +    Iclust +   Allegro
                                                         Amadeus/Weeder     FIRE
General stress response                       RRPE       +                  +          +
                                              PAC        +                  +          +
                                              Rap1       -                  +          +
HOG and pheromone response pathways           Sko1       -                  -          +
                                              Ste12      -                  -          +
                                              MBF        +                  -          +
                                              Smp1       -                  -          -
                                              Skn7       -                  -          -
General stress response and HOG pathway       STRE       +                  +          +

44 Immune response induced by Toll-like receptors
~10000 genes, 38 conditions
Our findings from [Elkon et al. ’07] were recovered
Motifs: ISRE, NF-κB, E2F
p-values: 2.0E-22, 4.2E-17, 2.8E-17

45 3’ UTR analysis: Human stem cells [Mueller ’08]
~14,000 genes, 124 conditions (various types of proliferating cells)
Biases in length / GC-content of 3’ UTRs, e.g.:
100 highly-expressed genes in…    3’ UTR length    GC
Embryoid bodies                   584              47%
Undifferentiated ESCs             774              44%
ESC-derived fibroblasts           1240             39%
Fetal NSCs                        1422             43%
(ESCs = embryonic stem cells, NSCs = neural stem cells)
Extant methods / Allegro with HG score: report only false positives

46 Human stem cells: results using binned score
[Table columns: miRNA expression; targets expression; Current knowledge]
Current knowledge entries:
- Most highly expressed miRNAs in human/mouse ESCs
- Abundant & functional in neural cell lineage
- Expressed specifically in neural lineage; active role in neurogenesis
miRNA expression from [Laurent ’08]

47 C. elegans germline dataset [Reinke et al. ’03]
~12,000 genes in 20 different conditions
[Figure labels: Hermaphrodite development; Mutants; Germline, Hermaphrodite, Oogenesis, Adult hermaphrodite vs. Somatic, Male, Spermatogenesis, L2-L3 hermaphrodite]
Co-occurrence p=1.3E-54

48 Motif pair features (I)
Co-occur on the same strand (112 genes vs. 53, p=2.5E-6)
Order-bias (104 genes vs. 8, p=1E-22)
Distance-bias (p=1.12E-34)
Gap not conserved. Short flanking regions are conserved
Over-represented in chromosome I (p=1.6E-8) and under-represented in chromosome X (p=1E-4)
GO Enrichment: embryonic development (sensu Metazoa) (p=1.4E-11), reproduction (p=1.1E-8), hermaphrodite genitalia development (p=4.9E-5), etc.

49 Motif pair features (II) Motif pair is specific to the Caenorhabditis genus

50 Amadeus/Allegro - Additional features
Motif pairs analysis
Joint analysis of multiple datasets
Evaluation of motifs using several scores
Bootstrapping – get fixed p-value
Sequence redundancy elimination – ignore sequences with long identical subsequence
User-friendly and informative (most tools are textual and supply limited information!)

51 Co-occurrence of motif pairs
After postprocess phase
T - target set
t1, t2 - target genes that contain a hit of the first and second motif, respectively
t12 - target genes that contain hits for both motifs
[Elkon et al., ’03]
PWMs and their cutoffs are tuned to optimize the score

52 Combining p-values: the weighted Z-transform
Input: p-values from k independent tests
H0: all the p-values are uniformly distributed
Transform P_i into standard normal deviates Z_i
Combined p-value =
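The slide's formula is not preserved. The sketch below uses the usual statement of the weighted Z-transform: Z_i = Φ⁻¹(1 − P_i), Z = Σ w_i Z_i / sqrt(Σ w_i²), combined p-value = 1 − Φ(Z). Whether the talk uses exactly this form, and how the weights w_i are chosen, is an assumption here. The code relies on the symmetry of the normal distribution (−Φ⁻¹(p) instead of Φ⁻¹(1 − p)) so that very small p-values do not round to zero.

```python
from math import sqrt
from statistics import NormalDist
from typing import Optional, Sequence


def combine_pvalues(pvalues: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Weighted Z-transform of independent p-values (each strictly in (0, 1))."""
    std_normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)
    # Z_i = Phi^{-1}(1 - p_i), computed as -Phi^{-1}(p_i) for numerical stability.
    z_scores = [-std_normal.inv_cdf(p) for p in pvalues]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights))
    # Combined p-value = 1 - Phi(Z) = Phi(-Z).
    return std_normal.cdf(-combined_z)
```

Two independent tests that each give p = 0.05 combine to a p-value of about 0.01 with equal weights, while a single p-value is returned unchanged.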

53 Allegro is available at: http://acgt.cs.tau.ac.il/allegro
“Allegro: Analyzing expression and sequence in concert to discover regulatory programs”, Y. Halperin*, C. Linhart*, I. Ulitsky, R. Shamir, Nucleic Acids Research, 2009 (*equal contribution)

54 Protein binding microarray

55 And this is what the data looks like…

56 Amadeus-PBM
An extension of Amadeus to PBM data. The general scheme is:
1. Rank all 9-mers according to average binding intensity.
2. Provide the top 500 to Amadeus to find a motif of length 8.
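Step 1 of this scheme can be sketched as below. The input format (a list of probe sequence and intensity pairs), the choice to count each 9-mer once per probe, and the function name are assumptions for illustration; the sketch also ignores reverse complements, which a real PBM analysis may need to merge.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_kmers_by_intensity(probes: Sequence[Tuple[str, float]],
                            k: int = 9, top: int = 500) -> List[Tuple[str, float]]:
    """Average the binding intensity of the probes containing each k-mer and
    return the `top` k-mers with the highest average."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sequence, intensity in probes:
        # Use a set so a k-mer repeated within one probe is counted once.
        for kmer in {sequence[i:i + k] for i in range(len(sequence) - k + 1)}:
            sums[kmer] += intensity
            counts[kmer] += 1
    averages = [(kmer, sums[kmer] / counts[kmer]) for kmer in sums]
    # Highest average first; ties broken alphabetically for reproducibility.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:top]
```

The sequences of the returned k-mers would then serve as the target set passed to the motif finder in step 2.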

57 Benchmark Success rate = rate of PWMs under the threshold of Euclidean distance. Average running time in seconds.

58 Summary
Developed the Amadeus motif discovery platform:
- Broad range of applications: target gene set; spatial features (sequence only); expression analysis (Allegro)
- Sensitive & efficient
- Easy to use, feature-rich, informative
New over-representation score to handle biases in length/GC-content of sequences
Novel expression model: CWM
Constructed a large, real-life, heterogeneous benchmark for testing motif finding tools

59 Acknowledgements
Tel-Aviv University: Chaim Linhart, Yonit Halperin, Igor Ulitsky, Adi Maron-Katz, Ron Shamir
The Hebrew University of Jerusalem: Gidi Weber
Handout: Section 1 and 2
C:\Program Files\ Amadeus_May19_2013

