The AMADEUS Motif Discovery Platform
C. Linhart, Y. Halperin, R. Shamir, Tel-Aviv University
ApoSys workshop, May '08; Genome Research, 2008


Promoter Analysis: Extremely brief intro
- Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences called binding sites (BSs)
- TFBSs are located mainly (not always!) in the gene's promoter, the DNA sequence upstream of the gene's transcription start site (TSS)
- TFs can promote or repress transcription
[Figure: promoter schematic with labels TF, BS, TSS, Gene, 5', 3']

Promoter Analysis (cont.): TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
- Consensus string: TASDAC (S = {C,G}, D = {A,G,T})
- Position weight matrix (PWM / PSSM): [matrix with rows A, C, G, T; entries not preserved]
  With Threshold = 0.01, matches include: TACACC (0.06), TAGAGC (0.06), TACAAT (0.015), …
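The two motif models can be sketched in a few lines of Python. This is only an illustration: the `IUPAC` table covers just the symbols used on the slide, and `EXAMPLE_PWM` is a made-up matrix (the slide's own matrix entries are not preserved), so the probabilities it produces are not the ones quoted on the slide.

```python
# Illustrative sketch only; EXAMPLE_PWM is hypothetical, not the slide's matrix.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "D": "AGT"}

def matches_consensus(site, consensus):
    """True if every base of `site` is allowed by the consensus symbol at that position."""
    return len(site) == len(consensus) and all(
        base in IUPAC[sym] for base, sym in zip(site, consensus))

def find_consensus_hits(sequence, consensus):
    """Start positions (0-based) of all windows that match the consensus string."""
    k = len(consensus)
    return [i for i in range(len(sequence) - k + 1)
            if matches_consensus(sequence[i:i + k], consensus)]

def pwm_probability(site, pwm):
    """Product of per-position base probabilities; a site is a hit if this exceeds a threshold."""
    p = 1.0
    for pos, base in enumerate(site):
        p *= pwm[pos][base]
    return p

# Hypothetical 6-column PWM (one dict of base probabilities per position).
EXAMPLE_PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.4, "C": 0.0, "G": 0.3, "T": 0.3},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
```

A consensus string gives a yes/no answer per site, whereas the PWM gives a graded score, which is why a PWM needs a threshold such as the 0.01 used on the slide.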

Promoter Analysis (cont.): Typical pipeline
- Sources of a co-regulated gene set:
  - Gene expression microarrays, followed by clustering (Cluster I, Cluster II, Cluster III)
  - Location analysis (ChIP-chip, …)
  - Functional group (e.g., GO term)
- Co-regulated gene set → promoter sequences → motif discovery

Promoter Analysis (cont.): Goals
Reverse-engineer the transcriptional regulatory network = find the TFs (and their BSs) that regulate the studied biological process
Input: A set of co-expressed genes
Output: "Interesting" motif(s):
1. Known motifs: PRIMA, ROVER, …
2. Novel motifs: MEME, AlignACE, …
3. A group of co-occurring motifs = cis-regulatory module (CRM): MITRA, CREME, …
[Label on slide: AMADEUS]

Promoter Analysis (cont.): Challenges
Why is it so difficult?
- BSs are short and degenerate (non-specific)
- Promoters are long and complex (hard to model):
  - Multiple BSs of several TFs
  - Old (non-functional) BSs
  - Other genetic/structural signals (e.g., GC content)
- Search space is huge:
  - About 500 billion consensus strings of length 10
  - 1 Kbp promoter × 20K genes in human = 20 Mbp
- Which score to use? What makes a motif "interesting"?
  - Enrichment: over-representation w.r.t. a background (BG) model
  - Location and/or strand bias
  - Conservation across related species

Promoter Analysis (cont.): Challenges (II)
- Additional complications: alternative promoters, wrong TSS annotations, paralogs (→ dependencies), …
- Many TFs have BSs in distant upstream locations, as well as in introns, UTRs, …
- [Lin et al. '07]: Used ChIP-PET to identify BSs of ER-α in breast cancer cells.
  - Only 5% of BSs are within 5 kb upstream of the TSS!
  - Only 23% of the BSs are conserved among vertebrates, "which suggests limited conservation of functional binding sites".

Promoter Analysis (cont.): Challenges (III)
- [Odom et al. '07]: Used ChIP-chip to map BSs of 4 TFs in human and mouse liver.
  - Function and binding motifs are conserved
  - 41-89% of BSs are species-specific
  - When a pair of orthologous genes contain a BS of the same TF, the BSs are aligned in only 1/3 of the cases

Promoter Analysis: Status of motif discovery tools
- Extant tools perform reasonably well for:
  - Finding known/novel motifs in organisms with short, simple promoters, e.g., yeast
  - Identifying some of the known motifs in complex species, e.g., TFs whose BSs are usually close to the TSS
- … but often fail in other cases!
- Each tool is custom-built for a specific target score, often parametric (i.e., assumes a BG model), or uses a small part of the genome as BG reference
- The majority of tools can efficiently handle only dozens of genes
- Comparison of tools: [Tompa et al. '05]

AMADEUS: A Motif Algorithm for Detecting Enrichment in mUltiple Species
- Research platform:
  - Extensible: add new algorithms, scores, motif models
  - Flexible: control the parameters, algorithms, and scores of an execution
- Experimental tool:
  - Sensitive: find subtle signals
  - Efficient: analyze many long sequences
  - Informative: show lots of info on motifs
  - User-friendly: nice GUI

Main features: I/O
- Input:
  - Type: target set / expression data
  - Multiple species / target-sets
  - Sequence region (promoter, 1st intron, 3' UTR, …)
- Output:
  - Non-redundant set of motifs
  - Rich info per output motif:
    1. Graphical motif logo
    2. Multiple scores & combined p-value
    3. Similarity to known TFBS models
    4. List of target genes
    5. BS localization graph
    6. Targets mean expression graph

Main features: algorithm
- Multiple refinement phases: each phase receives the best candidates of the previous phase and refines them (e.g., uses a more complex motif model)
- First phases are simple and fast (e.g., try all k-mers)
- Last phases are more complex (e.g., optimize a PWM using EM)
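The phase structure can be illustrated with a toy two-phase funnel. This is a sketch of the general idea only, not the AMADEUS implementation: the function names, the "fraction of targets hit minus fraction of background hit" score, and the one-mismatch second phase are all assumptions chosen for brevity (the slides mention EM-based PWM optimization for the later phases, which is not shown here).

```python
# Toy two-phase funnel; all names and the scoring rule are hypothetical.

def kmers(seq, k):
    """Set of distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def enrichment(hit_fn, targets, background):
    """Fraction of target sequences hit minus fraction of background sequences hit."""
    t = sum(1 for s in targets if hit_fn(s)) / len(targets)
    b = sum(1 for s in background if hit_fn(s)) / len(background)
    return t - b

def top_candidates(scored, keep):
    """Sort by descending score (ties broken alphabetically) and keep the best."""
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [motif for _, motif in scored[:keep]]

def phase_exact(targets, background, k, keep):
    """Phase 1 (fast): score every k-mer seen in the targets by exact occurrence."""
    candidates = set().union(*(kmers(s, k) for s in targets))
    scored = [(enrichment(lambda s, m=m: m in s, targets, background), m)
              for m in candidates]
    return top_candidates(scored, keep)

def phase_mismatch(candidates, targets, background, keep):
    """Phase 2 (slower): re-score only the survivors, allowing one mismatch per site."""
    def hits(seq, motif):
        k = len(motif)
        return any(hamming(seq[i:i + k], motif) <= 1
                   for i in range(len(seq) - k + 1))
    scored = [(enrichment(lambda s, m=m: hits(s, m), targets, background), m)
              for m in candidates]
    return top_candidates(scored, keep)
```

The point of the funnel is that the expensive model is only evaluated on the handful of candidates that survive the cheap exhaustive pass.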

Main features: scores
Motif scores: the user selects which scores to use, a subset of:
- Target-set: over/under-representation:
  1. Hypergeometric
  2. GC-content + length binned binomial
- Expression:
  1. Enrichment of ranked expression (multiple conditions) (not yet in the public version)
- Global/spatial:
  1. Localization
  2. Strand-bias
  3. Chromosomal preference
Scores are combined into a single p-value
Doesn't assume specific models for the distribution of BSs and/or expression values
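As an illustration of the hypergeometric over-representation score, the sketch below computes the standard hypergeometric tail probability. The parameter names (`total_genes`, `genes_with_motif`, `target_size`, `targets_with_motif`) are chosen here for readability and are not taken from AMADEUS; how AMADEUS counts motif hits per gene is not specified on the slide.

```python
from math import comb

def hypergeom_enrichment_pvalue(total_genes, genes_with_motif,
                                target_size, targets_with_motif):
    """P(X >= targets_with_motif) when target_size genes are drawn without
    replacement from total_genes, of which genes_with_motif contain the motif."""
    upper = min(target_size, genes_with_motif)
    denominator = comb(total_genes, target_size)
    numerator = sum(
        comb(genes_with_motif, i) * comb(total_genes - genes_with_motif, target_size - i)
        for i in range(targets_with_motif, upper + 1))
    return numerator / denominator
```

A small p-value means the motif occurs in more target genes than expected if the target set were a random sample of the genome; because the background is the rest of the genome itself, no parametric sequence model is needed.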

Main features: misc.
- GUI:
  - Control all parameters
  - Save/load parameters from file
  - Save textual + graphical output to file
  - TFBS viewer
- Other:
  - Ignore redundant sequences (with identical subsequence)
  - Applicable to multiple genome-scale promoter sequences
  - Bootstrapping: empirical p-value estimation using random target sets / shuffled data
  - Execution modes: GUI, batch
  - Interoperability: Java application
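The bootstrapping idea (empirical p-values from random target sets) can be sketched as follows. This is a generic permutation-style estimate, not AMADEUS code: `score_fn` is a hypothetical callable that maps a gene set to a motif score where higher means more significant, and the `+1` correction is a common convention assumed here.

```python
import random

def empirical_pvalue(score_fn, target_genes, all_genes, n_random=1000, seed=0):
    """Estimate how often a random gene set of the same size scores at least
    as well as the real target set."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = score_fn(target_genes)
    at_least_as_good = sum(
        1 for _ in range(n_random)
        if score_fn(rng.sample(all_genes, len(target_genes))) >= observed)
    # +1 in numerator and denominator avoids reporting a p-value of exactly 0
    return (at_least_as_good + 1) / (n_random + 1)
```

The resolution of the estimate is limited by `n_random`: with 1000 random sets the smallest reportable p-value is about 0.001.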

Combining p-values
- Each motif receives p-values from various sources (several scores, multiple species): $p_1, p_2, \ldots, p_n$
- We combine them into a single p-value $p$:
  $p = \mathrm{Prob}\{\varphi_1 \cdot \varphi_2 \cdots \varphi_n \le p_1 \cdot p_2 \cdots p_n \mid \varphi_i \sim U[0,1]\}$
- Denote $\rho = p_1 \cdot p_2 \cdots p_n$. Then:
  $p = \rho \sum_{i=0}^{n-1} \frac{(\ln 1/\rho)^i}{i!}$
- Also developed a weighted version, for when each p-value has a different weight
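A minimal sketch of the unweighted combination, assuming the input p-values are independent and using the standard closed form for the probability that a product of n independent U[0,1] variables is at most ρ, namely ρ · Σ_{i=0}^{n-1} (ln 1/ρ)^i / i!. The function name is hypothetical and the weighted variant mentioned on the slide is not covered.

```python
from math import factorial, log, prod

def combine_pvalues(pvalues):
    """Probability that a product of len(pvalues) independent U[0,1] variables
    is at most the product of the given p-values."""
    n = len(pvalues)
    rho = prod(pvalues)
    if rho == 0.0:
        return 0.0                      # avoid log(1/0); the limit is 0
    x = log(1.0 / rho)
    return rho * sum(x ** i / factorial(i) for i in range(n))
```

For a single p-value the function returns it unchanged, and the combined value is always at least the raw product ρ, since the sum starts at 1 and every further term is non-negative.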

Results I: E2F targets [Ren et al. '02]
[Figure: output motifs labeled E2F and NF-Y]

Case study: G2 & G2/M phases of human cell cycle [Whitfield et al. '02]
- Motifs: CHR (not in TRANSFAC), NF-Y
- Module: CHR and NF-Y motifs co-occur (module was reported in [Linhart et al. '05], [Tabach et al. '05])

Benchmark I: Yeast TF target sets [Harbison et al. '04]
- Source: ChIP-chip [Harbison et al. '04]
- Data: target-sets of 83 TFs with known BS motifs
- Average set size: 58 genes (= 35 Kbp)
- Success rates (for top 2 motifs of lengths 8 & 10): [chart not preserved]

Performance on metazoan datasets
Results on 42 target-sets:
- Collected from 29 publications
- Based on high-throughput experiments
- Species: human, mouse, fly, worm
- Sets: 26 TFs, 8 microRNAs
- All have known motifs

Global Analysis I: Localized human+mouse motifs
- Input: All human & mouse promoters (2 × ~20,000)
  - Region: -500 … 100 (w.r.t. TSS)
  - Total sequence length: ~26 Mbp
  - [No target-set / expression data]
- Score: localization
- Results:
  - Recovered known TFs: Sp1, NF-Y, GABP, TATA, Nrf-1, ATF/CREB, Myc, RFX1
  - Recovered the splice donor site
  - Identified several novel motifs

Global Analysis II: Chromosomal preference
- Input: All fly promoters (~14,000)
  - Region: … 200 (w.r.t. TSS)
  - Total sequence length: ~11 Mbp
  - [No target-set / expression data]
- Score: chromosomal preference
- Results: DNA Replication Element Factor (DREF) on the X chromosome

Global Analysis II: Chromosomal preference (cont.)
- Input: All worm promoters (~18,000)
  - Region: -500 … 100 (w.r.t. TSS)
  - Total sequence length: 6.6 Mbp
  - [No target-set / expression data]
- Score: chromosomal preference
- Results: Novel motif on chromosome IV

Summary
- Developed the Amadeus motif discovery platform:
  - Easy to use
  - Feature-rich, informative
  - Sensitive & efficient
- Constructed a large, real-life, heterogeneous benchmark for testing motif finding tools
- Demonstrated various applications of motif discovery

Acknowledgements
- Tel-Aviv University: Chaim Linhart, Yonit Halperin, Ron Shamir
- The Hebrew University of Jerusalem: Gidi Weber