Discovering cis-regulatory motifs using genome-wide sequences and expression Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky, Ron Shamir.

Gene expression regulation
- Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences called binding sites (BSs).
- TFBSs are located mainly in the gene's promoter, the DNA sequence upstream of the gene's transcription start site (TSS).
- TFs can promote or repress transcription.
- Other regulators: microRNAs (miRNAs).

TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
- Degenerate string: GGWATB (W={A,T}, B={C,G,T})
- PWM (position weight matrix): a weight for each nucleotide (A, C, G, T) at each motif position, used together with a score cutoff
[Figure: example PWM and cutoff applied to sample promoter sequences; the matrix values and the cutoff are not recoverable.]
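
The degenerate-string model can be illustrated with a short sketch. It is not taken from the slides: the function names are hypothetical, only the forward strand is scanned, and the standard IUPAC nucleotide codes are assumed (the slide itself defines only W and B).

```python
import re

# Standard IUPAC degenerate nucleotide codes (assumed; the slide lists only W and B).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degenerate_to_regex(motif):
    """Translate a degenerate string such as GGWATB into a regex source string."""
    return "".join("[" + IUPAC[ch] + "]" for ch in motif)

def find_hits(motif, sequence):
    """Return 0-based start positions of all hits, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=" + degenerate_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

hits = find_hits("GGWATB", "ACGGAATTCGGTATGA")
```

A real scan would also check the reverse complement, since a TF can bind either strand.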

Motif discovery: The typical two-step pipeline
1. Obtain a co-regulated gene set, e.g. by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), from location analysis (ChIP-chip, …), or from a functional group (e.g., GO term).
2. Run motif discovery on the promoter/3'UTR sequences of the gene set.

Motif discovery: Goals and challenges
Goal: Reverse-engineer the transcriptional regulatory network
Challenges:
- BSs are short and degenerate (non-specific)
- Promoters are long and complex (hard to model)
- Search space is huge (motif and sequence)
- Data is noisy
- What to look for? (enriched? localized? conserved?)
The problem is still considered very difficult despite extensive research [Tompa '05]

Amadeus: A Motif Algorithm for Detecting Enrichment in mUltiple Species
Supports diverse motif discovery tasks:
1. Finding over-represented motifs in one or more given sets of genes.
2. Identifying motifs with global spatial features given only the genomic sequences.
3. Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets.
How?
- A general pipeline architecture for enumerating motifs.
- Different statistical scoring schemes of motifs for different motif discovery tasks.

Motif search algorithm
Pipeline of refinement phases:
- Each phase receives the best candidates of the previous phase and refines them
- First phases are simple and fast (e.g., try all k-mers)
- Last phases are more complex (e.g., optimize PWM)
Phases: Preprocess, Mismatch, Merge, PWM Optimization
Motif models used along the pipeline: k-mer, list of k-mers, PWM with cutoff

PWM optimization phase

Task I: Over-represented motifs in given target set
Input:
- Target set (T) = co-regulated genes
- Background (BG) set (B) = entire genome
No sequence model is assumed!
Motif scoring: Hypergeometric (HG) enrichment score, where b, t = BG/Target genes containing a hit
Note: the BG set should be of the same "nature" as the target set, and much larger (e.g., all genes on the microarray).
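
The slide's formula did not survive in the transcript. A common form of the HG enrichment score, assumed here rather than copied from the slides, is the upper tail of the hypergeometric distribution: P(X ≥ t) = Σ_{k=t}^{min(b,|T|)} C(b,k)·C(|B|-b, |T|-k) / C(|B|,|T|). The sketch below computes it with exact integer binomials; the function and parameter names are illustrative.

```python
from math import comb

def hg_enrichment_pvalue(bg_size, target_size, b, t):
    """Upper-tail hypergeometric p-value (assumed form of the HG score).

    bg_size     -- |B|, number of genes in the background set
    target_size -- |T|, number of genes in the target set (T is a subset of B)
    b           -- number of BG genes containing a motif hit
    t           -- number of target genes containing a motif hit
    """
    total = comb(bg_size, target_size)
    upper = min(b, target_size)
    # Probability of drawing at least t hit-containing genes when |T| genes
    # are sampled uniformly without replacement from B.
    tail = sum(comb(b, k) * comb(bg_size - b, target_size - k)
               for k in range(t, upper + 1))
    return tail / total

p = hg_enrichment_pvalue(bg_size=10, target_size=5, b=5, t=5)
```

A small p-value means the motif occurs in more target genes than uniform sampling from the BG set would explain.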

Drawback of the HG score
- The length/GC-content distribution in the target set might differ significantly from the distribution in the BG set.
- This is very common in practice, due to correlation between the expression/function of genes and the length/GC-content of their promoters and 3' UTRs.
- The null hypothesis (H0) of the HG score assumes uniform sampling, so under such biases the HG score might fail to discover the correct motif, or detect many spurious motifs.
→ Use the binned enrichment score:
- Slightly less sensitive than the HG score…
- … but takes into account length/GC-content biases

Binned enrichment score
Key idea: binning sequences by length and GC-content
- B_i, T_i = BG/Target genes in the i-th bin
- b_i = motif hits in the i-th bin; t = b∩T
- Bins sampling probability
- Assume uniform sampling per bin
- Assume that |T| target genes are sampled with replacement from B
- p_m = prob. of a target set gene to contain a hit
[Figure: the BG set split into four bins B1 to B4 along length and GC-content axes, each bin with its target genes T_i and motif hits b_i.]
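
The formulas on this slide were lost in extraction. One reading that is consistent with the surviving text, and that is an assumption rather than the authors' stated definition, is: a bin is sampled with probability |T_i|/|T|, a gene inside bin i contains a hit with probability |b_i|/|B_i|, so p_m = Σ_i (|T_i|/|T|)·(|b_i|/|B_i|), and the score is the binomial tail P(Bin(|T|, p_m) ≥ |t|). The sketch below implements that reading; all names are illustrative.

```python
from math import comb

def hit_probability(bins):
    """p_m under the assumed model; bins is a list of (B_i, T_i, b_i) counts."""
    target_size = sum(T_i for _, T_i, _ in bins)
    return sum((T_i / target_size) * (b_i / B_i)
               for B_i, T_i, b_i in bins if B_i > 0)

def binned_enrichment_pvalue(bins, t):
    """Binomial upper tail: probability of at least t hits among |T| draws."""
    target_size = sum(T_i for _, T_i, _ in bins)
    p_m = hit_probability(bins)
    return sum(comb(target_size, k) * p_m ** k * (1 - p_m) ** (target_size - k)
               for k in range(t, target_size + 1))

# Two bins: (BG genes, target genes, BG genes with a hit) per bin.
example_bins = [(100, 10, 10), (100, 10, 30)]
p = binned_enrichment_pvalue(example_bins, t=10)
```

Because each bin contributes in proportion to its share of the target set, a motif that is merely common in long or GC-rich sequences no longer looks enriched just because the target set is long or GC-rich.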

Test case: Human G2+M cell-cycle genes
Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. '02]
Results: CHR and NF-Y (CCAAT-box). Both motifs are associated with G2+M and form a module [Elkon et al. '03, Tabach et al. '05, Linhart et al. '05].
[Figure: motif logos and pairs analysis.]

Benchmark I: Yeast TF target sets
- Source: ChIP-chip [Harbison et al., '04]
- Data: 173 target-sets of 83 TFs with known BS motifs
- Average set size: 58 genes (=35 Kbp)
Success rates are reported for the top 2 motifs of lengths 8 & 10.

Benchmark: Real-life metazoan datasets
We constructed the first motif discovery benchmark that is based on a large compendium of experimental studies.
- Source: Various (expression, ChIP-chip, Gene Ontology, …)
- Data: 42 target-sets of 26 TFs and 8 miRNAs from 29 publications
- Species: human, mouse, fly, worm
- Average set size: 400 genes (=383 Kbp)
Binned score improvement

Similarity between two motifs
- Euclidean distance
- Pearson correlation coefficient
- Kullback-Leibler divergence (relative entropy)
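
The three measures are standard, but the slide's exact formulas (and how columns are aligned and combined into one motif-level score) are not in the transcript. The sketch below therefore shows only the textbook per-column definitions, applied to two PWM columns given as nucleotide probabilities in the order A, C, G, T; the function names are illustrative.

```python
from math import log2, sqrt

def euclidean(p, q):
    """Euclidean distance between two probability columns."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation coefficient (undefined if a column is uniform)."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    sd_p = sqrt(sum((a - mean_p) ** 2 for a in p))
    sd_q = sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (sd_p * sd_q)

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; q must be non-zero wherever p is."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

col_1 = [0.7, 0.1, 0.1, 0.1]  # strongly prefers A
col_2 = [0.1, 0.7, 0.1, 0.1]  # strongly prefers C
distance = euclidean(col_1, col_2)
```

Kullback-Leibler divergence is asymmetric, so comparisons often use a symmetrized version, and pseudocounts are usually added to avoid zero probabilities.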

Metazoan benchmark: Other assessment methods for success rate

Metazoan benchmark: Detailed results

Binned score - examples
Mef2 fly target-set [Blais et al. '05]
- Promoters are longer than average (972bp vs. 840bp)
- Promoters have higher GC-content (53% vs. 49%)
- None of the programs discovered the correct motif
- Binned score → Mef2 is the top-scoring motif
hsa-miR-16 target-set [Linsley et al. '07]
- 3'UTRs are longer than average (~1700bp vs. ~960bp)
- HG score: the hsa-miR-16 signature is the top-scoring motif, but 10 more motifs have p<1E-14
- Binned score: the correct motif has p=1.7E-33, with no spurious motifs
198 mouse odorant receptor promoters [Michaloski et al. '06]
- Highly AT-rich (35% vs. ~25% in the BG)
- HG score: Olf-1 was the third best motif, after AT-rich motifs
- Binned score: Olf-1 is the top-scoring motif

Amadeus – Global spatial analysis
[Figure: pipeline diagram with the elements gene expression microarrays, location analysis (ChIP-chip, …), functional group (e.g., GO term), co-regulated gene set, promoter sequences, and output motif(s).]

Task II: Global analyses
Input: Sequences (no target-set / expression data)
Motif scoring: scores for spatial features of motif occurrences:
- Localization w.r.t. the TSS
- Strand-bias
- Chromosomal preference

Global analysis I: Localized human + mouse motifs
Input: All human & mouse promoters (2 x ~20,000)
Score: localization

Global analysis II: Chromosomal preference in C. elegans
Input: All worm promoters (~18,000)
Score: chromosomal preference
Results: Novel motif on chromosome IV

Amadeus is available at:
"Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets", C. Linhart*, Y. Halperin*, R. Shamir, Genome Research 18:7, 2008 (*equal contribution)

PRIMA – GOAL: 'Reverse engineering' of transcriptional networks
Co-expression → Co-regulation → common cis-regulatory promoter elements
- Identification of co-expressed genes using microarray technology (clustering)
- Computational identification of cis-regulatory elements that are over-represented in promoters of the co-expressed genes

PRIMA – General description
PRIMA identifies transcription factors (TFs) whose binding sites (BSs) are enriched in a given 'target set' of promoters with respect to a 'background set' of promoters.
Required 'databases':
- Promoter sequences on a genomic scale
- 'Models' for binding sites recognized by TFs
Implemented in Expander.

Amadeus - Allegro
[Figure: pipeline diagram with the elements gene expression microarrays, clustering (Cluster I, Cluster II, Cluster III), co-regulated gene set, promoter sequences, expression data, Allegro, and output motif(s).]

Task III: Simultaneous inference of motifs & their associated expression profiles
Input: Genome-wide expression profiles
Motif scoring algorithm: Allegro (A Log-Likelihood based mEthod for Gene expression Regulatory motifs Over-representation discovery)
- Generalization of single condition analysis
- Outline:
  - Learns an expression model that describes the expression pattern of the motif's putative targets
  - The motif is scored for over-representation in the set of genes whose expression profiles match the expression model

Allegro: expression model
Discretization of expression patterns: the expression pattern of a gene g over the conditions c1, c2, …, cm is converted into a discrete expression pattern (DEP), e.g. D S … U, using three levels:
- e1 = Up (U): ≥ 1.0
- e2 = Same (S): (-1.0, 1.0)
- e3 = Down (D): ≤ -1.0
Expression data should be (partially) pre-processed, e.g.:
- Time series → log ratio relative to time 0
- Several tissues/mutations/… → standardization
- Do NOT filter out non-responsive genes
Expression model: CWM = Condition Weight Matrix
- Non-parametric, log-likelihood based model, analogous to PWM for sequence motifs
- Sensitive, robust against extreme values, performs well in practice
- Derived from a condition frequency matrix (CFM); R={r_ij} is the BG CFM
- Genes are scored with a log-likelihood ratio (LLR) score
Example CFM (cells lost in the transcript are shown as …):
      c1    c2    …    cm
U     …     …     …    0.78
S     0.9   0.2   …    0.14
D     …     …     …    0.08
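
The CWM and LLR formulas are missing from the transcript. By analogy with a PWM, the sketch below assumes w_ij = log2(f_ij / r_ij), where f_ij is the frequency of level e_i in condition c_j among the training genes and r_ij is the same frequency in the background, and scores a gene by summing the weights that its DEP selects. The pseudocount, the function names, and the toy data are illustrative assumptions.

```python
from math import log2

LEVELS = "USD"  # Up, Same, Down

def discretize(value, threshold=1.0):
    """Map one pre-processed expression value to U, S or D."""
    if value >= threshold:
        return "U"
    if value <= -threshold:
        return "D"
    return "S"

def frequency_matrix(deps, pseudocount=1.0):
    """CFM: per condition, the frequency of each level among the given DEPs."""
    cfm = []
    for j in range(len(deps[0])):
        counts = {level: pseudocount for level in LEVELS}
        for dep in deps:
            counts[dep[j]] += 1
        total = sum(counts.values())
        cfm.append({level: counts[level] / total for level in LEVELS})
    return cfm

def weight_matrix(target_cfm, bg_cfm):
    """CWM: log-ratio of training-set frequencies to background frequencies."""
    return [{level: log2(f[level] / r[level]) for level in LEVELS}
            for f, r in zip(target_cfm, bg_cfm)]

def llr_score(dep, cwm):
    """Expression LLR of one gene: sum of the weights its DEP selects."""
    return sum(column[level] for level, column in zip(dep, cwm))

background = ["SS", "SS", "UD", "DU"]   # toy DEPs over two conditions
training = ["UD", "UD"]                 # toy DEPs of putative motif targets
cwm = weight_matrix(frequency_matrix(training), frequency_matrix(background))
score = llr_score("UD", cwm)
```

Genes whose discrete pattern resembles the training set get a positive score, and genes that look like the background get a negative one.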

Features of the CWM expression model
- Analogous to PWM for sequence motifs
- Non-parametric: does not assume a specific type of distribution (e.g., Gaussian) for expression values
- Robust against extreme values
- Sensitivity:
  - Can describe expression profiles that differ from the BG only in a small subset of the conditions
  - Can describe the regulatory effect of TFs that act both as repressors and activators in the same condition
- Performance: describes known modules (GO, ChIP-chip targets) better than commonly used metrics (Pearson/Spearman correlation, Euclidean distance)

Allegro overview

Learning a CWM of a motif
[Figure: the motif hits among the microarray genes define the motif target genes, which form the CWM training set; the learned CWM ranks genes by expression LLR (high to low); motif enrichment p-values shown: 2.0E-5, 3.5E-7, 1.5E-9.]
Cross-validation-like procedure to avoid overfitting

Compute expression LLR of all genes
Input:
(A) CWM F (w)
(B) Discretized genome-wide expression profiles, a |G| x |C| matrix (e.g. g1: UUSD, g2: UDSU, g3: UDSU)
[Figure: steps (1) to (3). The gene profiles are reduced to a |P| x |C| matrix of patterns (p1: UUSD, p2: UDSU, …), and the patterns (e.g. UUSD, UDSU, DDSS, UDSD) are connected by a min. spanning tree.]
Example (Mouse TLRs dataset): |G|=~10000, |P|=1442, |C|=38, =1.6

Human cell cycle [Whitfield et al., '02]
Large dataset: ~15,000 genes, 111 conditions, promoter region: -1000…200 bps
[Table: motifs E2F, CHR and CCAAT box with their p-values and cell-cycle phases (G1/S+S, G2+G2/M); the p-values are truncated in the transcript (1.3E…, …, …E-15).]
Allegro recovers the major regulators of the human cell cycle [Elkon et al. '03; Tabach et al. '05; Linhart et al. '05].

Yeast HOG pathway [O'Rourke et al. '04]
~6,000 genes, 133 conditions
Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions.
Extant two-step techniques recovered only 4 of the above motifs:
- K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
- Iclust + FIRE: RRPE, PAC, Rap1, STRE

Yeast HOG pathway: Comparison with the two-step pipeline
(cells lost in the transcript are shown as …)
Biological process                         Motif/TF   K-means/CLICK +    Iclust +   Allegro
                                                      Amadeus/Weeder     FIRE
General stress response                    RRPE       +                  +          +
                                           PAC        +                  +          +
                                           Rap        …                  …          …
HOG and pheromone response pathways        Sko        …                  …          …
                                           Ste        …                  …          …
                                           MBF        +                  -          +
                                           Smp        …                  …          …
                                           Skn        …                  …          …
General stress response and HOG pathway    STRE       +                  +          +

Immune response induced by Toll-like receptors
~10000 genes, 38 conditions
[Table: motifs ISRE, NF-κB and E2F with their p-values; the p-values are truncated in the transcript (2.0E…, …, …E-17).]
Our findings from [Elkon et al. '07] were recovered.

3' UTR analysis: Human stem cells [Mueller '08]
~14,000 genes, 124 conditions (various types of proliferating cells)
Biases in length / GC-content of 3' UTRs, e.g.:
[Table: 3' UTR length and GC-content of the 100 most highly expressed genes in embryoid bodies, undifferentiated ESCs, ESC-derived fibroblasts and fetal NSCs; the values are not recoverable.]
(ESCs = embryonic stem cells, NSCs = neural stem cells)
Extant methods / Allegro with HG score: report only false positives

Human stem cells: results using binned score
[Table: columns are miRNA expression, targets expression, and current knowledge; the miRNA names are not recoverable. Current knowledge entries: most highly expressed miRNAs in human/mouse ESCs; abundant & functional in neural cell lineage; expressed specifically in neural lineage, with an active role in neurogenesis.]
miRNA expression from [Laurent '08]

C. elegans germline dataset [Reinke et al. '03]
~12,000 genes in 20 different conditions
[Figure: condition groups labeled hermaphrodite development, mutants, germline, somatic, hermaphrodite, male, oogenesis, spermatogenesis, adult hermaphrodite, L2-L3 hermaphrodite.]
Co-occurrence p=1.3E-54

Motif pair features (I)
- Co-occur on the same strand (112 genes vs. 53, p=2.5E-6)
- Order-bias (104 genes vs. 8, p=1E-22)
- Distance-bias (p=1.12E-34)
- Gap not conserved; short flanking regions are conserved
- Over-represented in chromosome I (p=1.6E-8) and under-represented in chromosome X (p=1E-4)
- GO enrichment: embryonic development (sensu Metazoa) (p=1.4E-11), reproduction (p=1.1E-8), hermaphrodite genitalia development (p=4.9E-5), etc.

Motif pair features (II) Motif pair is specific to the Caenorhabditis genus

Amadeus/Allegro - Additional features
- Motif pairs analysis
- Joint analysis of multiple datasets
- Evaluation of motifs using several scores
- Bootstrapping: get fixed p-value
- Sequence redundancy elimination: ignore sequences with long identical subsequence
- User-friendly and informative (most tools are textual and supply limited information!)

Co-occurrence of motif pairs
After the postprocess phase:
- T = target set
- t1, t2 = target genes that contain a hit of the first and second motif, respectively
- t12 = target genes that contain hits for both motifs
Score from Elkon et al., '03
PWMs and their cutoffs are tuned to optimize the score

Combining p-values: the weighted Z-transform
Input: p-values from k independent tests
H0: all the p-values are uniformly distributed
- Transform each p-value P_i into a standard normal deviate Z_i
- Combine the deviates into a single combined p-value
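
The slide's formula is missing from the transcript. The usual definition of the weighted Z-transform, given here as an assumption about what the slide showed, is Z_i = Φ⁻¹(1 - P_i), Z_w = Σ w_i·Z_i / sqrt(Σ w_i²), combined p-value = 1 - Φ(Z_w), where Φ is the standard normal CDF. The weights w_i and the function name below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def weighted_z_combine(p_values, weights):
    """Combine independent one-sided p-values with the weighted Z-transform.

    Each p-value must lie strictly between 0 and 1.
    """
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Small p-values map to large positive deviates.
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    z_combined = (sum(w * z for w, z in zip(weights, z_scores))
                  / sqrt(sum(w * w for w in weights)))
    return 1.0 - normal.cdf(z_combined)

combined = weighted_z_combine([0.01, 0.01], weights=[1.0, 1.0])
```

With equal weights this reduces to Stouffer's method; unequal weights let larger or more reliable datasets count for more.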

Allegro is available at:
"Allegro: Analyzing expression and sequence in concert to discover regulatory programs", Y. Halperin*, C. Linhart*, I. Ulitsky, R. Shamir, Nucleic Acids Research, 2009 (*equal contribution)

Protein binding microarray

And this is what the data looks like…

Amadeus-PBM
An extension of Amadeus to PBM data. The general scheme is:
1. Rank all 9-mers according to average binding intensity.
2. Provide the top 500 to Amadeus to find a motif of length 8.
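
Step 1 of the scheme can be sketched as follows. This is an illustration, not the Amadeus-PBM code: it assumes the PBM data is a list of (probe sequence, intensity) pairs, averages the intensity over the probes that contain each k-mer, ignores reverse complements, and breaks ties alphabetically. The function name and toy data are hypothetical.

```python
from collections import defaultdict

def rank_kmers(probes, k=9, top=500):
    """Return the top k-mers, ordered by average intensity of their probes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sequence, intensity in probes:
        # A set counts each k-mer at most once per probe.
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers:
            totals[kmer] += intensity
            counts[kmer] += 1
    averages = {kmer: totals[kmer] / counts[kmer] for kmer in totals}
    # Highest average first; ties are broken alphabetically.
    return sorted(averages, key=lambda kmer: (-averages[kmer], kmer))[:top]

toy_probes = [("ACGTA", 10.0), ("CGTAC", 2.0), ("TTTTT", 1.0)]
top_kmers = rank_kmers(toy_probes, k=3, top=3)
```

The returned list would then play the role of the target set handed to the motif search in step 2.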

Benchmark
- Success rate = fraction of discovered PWMs whose Euclidean distance from the known motif is under the threshold.
- Average running time is given in seconds.

Summary
Developed the Amadeus motif discovery platform:
- Broad range of applications:
  - Target gene set
  - Spatial features (sequence only)
  - Expression analysis (Allegro)
- Sensitive & efficient
- Easy to use, feature-rich, informative
New over-representation score to handle biases in length/GC-content of sequences
Novel expression model: CWM
Constructed a large, real-life, heterogeneous benchmark for testing motif finding tools

Acknowledgements
Tel-Aviv University: Chaim Linhart, Yonit Halperin, Igor Ulitsky, Adi Maron-Katz, Ron Shamir
The Hebrew University of Jerusalem: Gidi Weber
Handout: Section 1 and 2
C:\Program Files\ Amadeus_May19_2013