Motif instance identification using comparative genomics
Pouya Kheradpour
Joint work with: Alexander Stark, Sushmita Roy and Manolis Kellis


Background and goal
[Figure labels: TF1, microRNA1, TF2]
Regulators bind to short (5 to 20 bp), sequence-specific patterns (motifs)
Genes are largely controlled through the binding of regulators
  – Transcription factors (TFs) are proteins that bind near the transcription start site (TSS) of genes and either activate or repress transcription
  – miRNAs bind to the 3’ untranslated region (UTR) of mRNAs to repress translation
The goal of our work is to identify these binding sites (motif instances)

Motivation
[Figure credits: Network: Davidson and Erwin, Science (2006); Mouse: Pennacchio, et al., Nature (2006); Fly: Tomancak, et al., Genome Biology (2002)]
In all animals, genes are both temporally and spatially regulated to produce complex expression patterns
Identifying the targets of regulators is vital to understanding this expression
Conservation allows for identifying targets that are evolutionarily meaningful

Previous work
Single-genome approaches
  – Generally use positional clustering of motif matches to increase signal (e.g. Berman, et al. 2002; Schroeder, et al. 2004; Philippakis, et al. 2006); a single 5-mer match occurs on average 3 million times in a mammalian genome
  – Require a set of specific factors that act together
  – Miss instances of motifs that may occur alone
Multi-genome approaches (phylogenetic footprinting)
  – Blanchette and Tompa 2002 use an alignment-free phylogenetic approach to find k-mers that are unusually well conserved
  – Moses, et al use a strict phylogenetic model to find regions that evolve according to the motif and not the background
  – Etwiller, et al use both nearby species and distant species (fish) to identify motif instances
  – Lewis, et al find putative microRNA binding sites by requiring full conservation in five species
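The 3 million figure for a single 5-mer can be checked with back-of-the-envelope arithmetic. The sketch below assumes a genome of roughly 3 billion bases, independent and uniformly distributed bases, and matches on one strand only; none of these assumptions are stated on the slide.

```python
def expected_kmer_matches(genome_length: int, k: int) -> float:
    """Expected number of exact matches of one fixed k-mer on one strand,
    assuming independent, uniformly distributed bases (an idealization)."""
    positions = genome_length - k + 1  # number of possible start positions
    return positions / 4 ** k          # each position matches with probability 4^-k


GENOME_LENGTH = 3_000_000_000  # assumed approximate mammalian genome size
print(f"{expected_kmer_matches(GENOME_LENGTH, 5):,.0f} expected 5-mer matches")
```

Under these assumptions the estimate lands just under 3 million, which is why a lone short match carries little information without clustering or conservation.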

Approach outline
1. Produce a raw conservation score for each motif match (branch length score, or BLS)
2. For each motif and region, produce a mapping from BLS to confidence
Advantages
Now we have many complete, closely related genomes
  – Gives enough power to identify binding sites (Eddy, 2005)
  – Do not have to worry about dramatic divergence
Account for non-motif conservation using globally derived statistics
Robust against errors and evolutionary turnover
Computationally feasible to run genome-wide for all available motifs

Large phylogeny challenges in instance identification
Sequencing / assembly / alignment artifacts
  – Low-coverage sequencing, misalignments
Evolutionary variation
  – Individual binding sites can move / mutate
  – Some instances found only in a subset of species
Don’t require perfect conservation → Branch length score
Don’t require exact alignment → Search within a window
[Figure labels: motif instance movement, missing sequence]

Computing Branch Length Score (BLS)
[Figure: CTCF example, BLS = 2.23 sps (78%); labels: mutations, missing short branches, movement]
Does not over-count redundant branch length
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
3. Missing motif matches in dense species tree
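One way to read "does not over-count redundant branch length" is that the score is the total length of the minimal subtree connecting the species in which a match was found. The following is a minimal sketch under that assumption; the tree encoding (`parent` and `length` dictionaries), the function name and the toy tree are all hypothetical and not taken from the slides.

```python
from collections import defaultdict


def branch_length_score(parent, length, matched):
    """Total length of the branches connecting the species in `matched`.

    parent:  dict mapping each non-root node to its parent node
    length:  dict mapping each non-root node to the length of the branch above it
    matched: names of the leaves (species) where a motif match was found
    """
    matched = set(matched)
    if len(matched) < 2:
        return 0.0
    # Count how many matched leaves sit at or below each node.
    below = defaultdict(int)
    for leaf in matched:
        node = leaf
        below[node] += 1
        while node in parent:
            node = parent[node]
            below[node] += 1
    # A branch is part of the connecting subtree when matched leaves lie on
    # both sides of it, so every shared branch is counted exactly once.
    total = len(matched)
    return sum(length[n] for n, c in below.items() if n in length and 0 < c < total)


# Toy tree (made-up lengths): ((human:0.1, mouse:0.3)A:0.2, dog:0.4)root
parent = {"human": "A", "mouse": "A", "A": "root", "dog": "root"}
length = {"human": 0.1, "mouse": 0.3, "A": 0.2, "dog": 0.4}
score = branch_length_score(parent, length, {"human", "dog"})
print(score, score / sum(length.values()))  # score and fraction of total tree length
```

Dividing by the total tree length gives a percentage comparable in spirit to the "(78%)" shown in the figure, although whether the slide normalizes in exactly this way is an assumption.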

Branch Length Score → Confidence
1. Evaluate non-motif probability of a given score
   Sequence could also be conserved due to overlap with an un-annotated element (e.g. non-coding RNA)
2. Account for differences in motif composition and length
   For example, a short motif is more likely to be conserved by chance

Control motifs
Control motifs are the basis of our estimation of the background level of conservation and for evaluating enrichment
Each motif has its own set of controls
They are chosen to:
  – Have the same composition as the original motif
  – Match the target regions (e.g. promoters) with approximately the same frequency (+/- 20%)
  – Not be too similar to each other (to preserve diversity)
  – Not be similar to known motifs (including the one being shuffled)
Background level is estimated separately in each region type (e.g. promoters or 3’ UTRs)
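The selection criteria above can be illustrated with a small filter over shuffled consensus strings. This is only a sketch: the use of Hamming distance as the similarity measure, the `min_distance` threshold, the single-strand matching and all function names are assumptions; the defaults of 100 shuffles and up to 10 controls follow the implementation slide. Screening against a library of other known motifs is omitted.

```python
import random

# IUPAC nucleotide codes mapped to the bases they accept.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def count_matches(motif, sequence):
    """Number of positions where the IUPAC consensus matches (one strand)."""
    k = len(motif)
    return sum(all(sequence[i + j] in IUPAC[motif[j]] for j in range(k))
               for i in range(len(sequence) - k + 1))


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def select_controls(motif, sequence, n_shuffles=100, max_controls=10,
                    tolerance=0.20, min_distance=2, seed=0):
    """Pick shuffled versions of `motif` that satisfy the control criteria."""
    rng = random.Random(seed)
    target = count_matches(motif, sequence)
    controls = []
    for _ in range(n_shuffles):
        chars = list(motif)
        rng.shuffle(chars)  # same composition as the original motif
        candidate = "".join(chars)
        # Reject candidates too close to the original or to earlier controls.
        if any(hamming(candidate, other) < min_distance
               for other in [motif] + controls):
            continue
        # Keep only candidates matching with about the same frequency.
        if abs(count_matches(candidate, sequence) - target) > tolerance * target:
            continue
        controls.append(candidate)
        if len(controls) == max_controls:
            break
    return controls
```

Calling `select_controls("CACGTG", promoter_sequence)` with a concatenated region sequence would return up to ten shuffles; the list can be empty when no shuffle meets the frequency constraint.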

Branch Length Score → Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone or due to non-motif conservation
2. Compute the confidence score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Select the movement window that leads to the most instances at each confidence
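Step 2 can be written out as a short calculation. The sketch below assumes that "at a given BLS" means "at or above a BLS cutoff" and that the noise estimate is the average count per control motif; the function and argument names are hypothetical.

```python
def confidence_at_cutoffs(motif_bls, control_bls, n_controls, cutoffs):
    """Map each BLS cutoff to confidence = 1 - expected_noise / observed.

    motif_bls:   BLS of every match of the real motif
    control_bls: BLS of every match of its control motifs, pooled
    n_controls:  number of control motifs that were pooled
    """
    result = {}
    for cutoff in cutoffs:
        observed = sum(score >= cutoff for score in motif_bls)
        # Average over controls: instances expected from non-motif conservation.
        expected = sum(score >= cutoff for score in control_bls) / n_controls
        if observed == 0:
            result[cutoff] = None  # no instances left, confidence undefined
        else:
            result[cutoff] = max(0.0, 1.0 - expected / observed)
    return result


# Made-up scores: four real matches, four control matches pooled from 2 controls.
print(confidence_at_cutoffs([0.1, 0.5, 0.9, 0.9], [0.1, 0.2, 0.5, 0.9], 2,
                            [0.0, 0.5, 0.9]))
```

With these toy numbers the confidence rises from 0.5 at a cutoff of 0.0 to 0.75 at a cutoff of 0.9, showing how stricter conservation requirements trade instances for confidence.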

Confidence selects for functional instances
[Figure panels: Transcription factor motifs and MicroRNA motifs, each across Promoter, 5’UTR, CDS, Intron, 3’UTR]
1. Confidence selects for transcription factor motif instances in promoters and miRNA motifs in 3’ UTRs

Confidence selects for functional instances
1. Confidence selects for transcription factor motif instances in promoters and miRNA motifs in 3’ UTRs
2. miRNA motifs are found preferentially on the plus strand, whereas no such preference is found for TF motifs
[Figure: Strand bias]

Experimental identification of binding sites
Chromatin immunoprecipitation (ChIP), combined with either sequencing (ChIP-seq) or microarrays (ChIP-chip), is an experimental procedure used to identify binding sites
  – Not all binding is functional, so the false positive rate can be high
  – Only binding that is active in the surveyed conditions is found
[Figure: ChIP-seq, Mardis 2007]

Intersection with CTCF ChIP-Seq regions
Conserved CTCF motif instances are highly enriched in ChIP-Seq sites
High enrichment does not require low sensitivity
Many motif instances are verified
[Figure annotations, CTCF: ≥ 50% of regions with a motif; 50% motifs verified; 50% confidence]
ChIP data from Barski, et al., Cell (2007)

Enrichment found for other factors in mammals and flies
Mammals: Barski, et al., Cell (2007); Odom, et al., Nature Genetics (2007); Lim, et al., Molecular Cell (2007); Wei, et al., Cell (2006); Zeller, et al., PNAS (2006); Lin, et al., PLoS Genetics (2007); Robertson, et al., Nature Methods (2006)
Flies: Abrams and Andrew, Devel (2005) (not ChIP); Sandmann, et al., Devel Cell (2006); Zeitlinger, et al., Genes & Devel (2007); Sandmann, et al., Genes & Devel (2007)

Enrichment increases in conserved bound regions
[Data: Human: Barski, et al., Cell (2007); Mouse: Bernstein, unpublished]
1. ChIP bound regions may not be conserved (Odom, et al. 2007)
2. For CTCF we also have binding data in mouse
3. Enrichment in the intersection is dramatically higher

Enrichment increases in conserved bound regions
[Data: Human: Barski, et al., Cell (2007); Mouse: Bernstein, unpublished; Odom, et al., Nature Genetics (2007)]
1. ChIP bound regions may not be conserved (Odom, et al. 2007)
2. For CTCF we also have binding data in mouse
3. Enrichment in the intersection is dramatically higher
4. Trend persists for other factors where we have multi-species ChIP data

Enrichment of instances in fly muscle genes
1. Motifs at 60% confidence and ChIP have similar enrichments (depletion for the repressor Snail) in the functional promoters
2. Enrichments persist even when you look at non-overlapping subsets
3. Intersection of the two has the strongest signal
4. Evolutionary and experimental evidence is complementary
   ChIP includes species-specific regions and differentiates tissues
   Conserved instances include binding sites not seen in the tissues surveyed
ChIP data from: Zeitlinger, et al., G&D (2007); Sandmann, et al., G&D (2007); Sandmann, et al., Dev Cell (2006)

Fly regulatory network at 60% confidence
TFs: 67 of 83 (81%), 46k instances
miRNAs: 49 of 67 (86%), 4k instances
Several connections confirmed by literature (either directly or indirectly)
Global view of instances allows us to make network-level observations:
  – TFs were more targeted by TFs (P < ) and by miRNAs (P < 5 x )
  – TF in-degree associated with miRNA in-degree (high-high: P < ; low-low P < )

Contributions
A general methodology for regulatory motif instance identification using many closely related genomes
  – Robust against errors from sequencing, assembly and alignment
  – Allows limited functional turnover and motif movement
  – Provides a statistical measurement of confidence for each instance, correcting for length, composition and overlap with other functional elements
Validation and comparison to experimental data
  – High enrichment of binding sites in ChIP regions for a variety of factors
  – Functional enrichments suggest an ability to identify functional instances comparable to ChIP

Future directions
Our predicted network was static, but real regulatory networks are dynamic
  – They change throughout development and in different conditions
  – They can vary greatly in different species
We want to expand this work to learn about these network dynamics
  – ChIP data is becoming increasingly available in a variety of conditions; we can use this to learn what causes changes in binding
  – Multi-species data is also becoming more available; we can match motif binding to cross-species expression changes
  – We can train on this data to find motifs that act together or compensate for each other

Acknowledgments
Alexander Stark, Sushmita Roy, Manolis Kellis
Mouse CTCF ChIP-Seq: Tarjei Mikkelsen, Brad Bernstein
Funding: William C.H. Chao Fellowship, NSF Graduate Research Fellowship
MIT CSAIL: Matt Rasmussen, Mike Lin, Issao Fujiwara, Rogerio Candeias
Broad Institute: Or Zuk, Michele Clamp, Manuel Garber, Mitch Guttman, Eric Lander

The End

Implementation details
A table lookup on the next 8 bases of the genome is used to find potential motif matches in the target genome
  – Results in an order-of-magnitude increase in speed over scanning through all motifs
In a first run, 100 shuffles of each motif are evaluated and up to 10 that fulfill the requirements are selected
All motifs and their selected shuffles are matched to the target genome and their BLS scores are computed
The matches are evaluated at each branch length cutoff and a mapping is produced for each motif from branch length score to confidence
All code is designed to run on the Broad cluster (often with parallelization) and is written in C
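The table-lookup idea can be sketched as follows, in Python rather than the C of the original implementation. Here the table is filled lazily, mapping each 8-mer seen in the genome to the motifs whose leading positions accept it, so that only those candidates are checked in full. The lazy construction, the single-strand scan and all names are assumptions made for illustration.

```python
# IUPAC nucleotide codes mapped to the bases they accept.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def compatible(motif, window):
    """True if `window` fits the motif over the positions they share."""
    return all(base in IUPAC[sym] for sym, base in zip(motif, window))


def scan(genome, motifs, prefix_len=8):
    """Return (position, motif) pairs for every full match on one strand."""
    table = {}  # genome 8-mer -> motifs whose leading positions accept it
    hits = []
    for i in range(len(genome)):
        key = genome[i:i + prefix_len]
        if key not in table:
            # Computed once per distinct 8-mer, then reused for every occurrence.
            table[key] = [m for m in motifs if compatible(m, key)]
        for m in table[key]:
            window = genome[i:i + len(m)]
            # Motifs longer than the prefix still need a full-length check.
            if len(window) == len(m) and compatible(m, window):
                hits.append((i, m))
    return hits


print(scan("TTCACGTGAA", ["CACGTG", "ACGTR"]))
```

Because most 8-mers are compatible with few or no motifs, the inner loop is usually short, which is the source of the speed-up over testing every motif at every position.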

Performance on mammalian TRANSFAC motifs
Most motifs have instances reaching 90% confidence with 18 mammals
Substantial increase in the number of instances compared to using only human, mouse, rat and dog
[Figure annotations: 2.5x increase, 3.5x, 6.5x]

The promise of many genomes
Eddy showed that with many genomes, resolving binding sites using conservation is possible
The goal of our work is to make this practical
  – Integrate evidence from multiple informant species
  – Determine which of the thousands of motif matches are functional using conservation

Slides on motif discovery

Related problem: computational motif discovery
Discovery of the regulatory motifs (as opposed to their binding sites) has also been an active area of research for several years
Single-species work has generally required sequences thought to have similar regulation (for comparison, see Tompa, et al. 2005; Elemento, et al. 2007)
  – Looked for patterns that were enriched in target sequences
Use of conservation has been generally successful in re-identifying known binding affinities for TFs and miRNAs (e.g. Kellis, et al. 2003; Xie, et al. 2005; Etwiller, et al. 2005)
  – Requires fewer species (i.e. less branch length) than instance identification because signal can be integrated over thousands of instances found genome-wide

Motif discovery pipeline
1. Enumerate motif seeds
   Six non-degenerate characters with a variable-size gap in the middle
2. Score seed motifs
   Use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs
   Use the expanded nucleotide IUPAC alphabet to fill unspecified bases around the seed using hill climbing
4. Cluster to remove redundancy
   Using sequence similarity
[Figure labels: GTC AGT, gap, GTC AGT, R R Y S W]
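Step 1 of the pipeline can be illustrated by tallying gapped seeds directly from a sequence. The sketch assumes the six characters are split as three bases on each side of the gap (as the figure labels suggest) and that the gap ranges from 0 to `max_gap`; the gap range and the function name are assumptions, and the conservation scoring of step 2 is not shown.

```python
from collections import Counter


def count_gapped_seeds(sequence, max_gap=10, half=3):
    """Count occurrences of seeds of the form XXX-gap-XXX on one strand.

    Keys are (left, gap, right) tuples, e.g. ("GTC", 2, "AGT").
    """
    counts = Counter()
    for i in range(len(sequence)):
        left = sequence[i:i + half]
        if len(left) < half or set(left) - set("ACGT"):
            continue  # too close to the end, or contains an ambiguous base
        for gap in range(max_gap + 1):
            j = i + half + gap
            right = sequence[j:j + half]
            if len(right) < half:
                break  # larger gaps would run past the end as well
            if set(right) - set("ACGT"):
                continue
            counts[(left, gap, right)] += 1
    return counts


seeds = count_gapped_seeds("GTCAAAGT", max_gap=2)
print(seeds[("GTC", 2, "AGT")])
```

Counting only the seeds that actually occur avoids enumerating all 4^6 six-character combinations for every gap size, most of which would need to be scored anyway but can be skipped when absent.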

Top 30 discovered fly motifs
Rank  Consensus     MCS   Match to known motif          Expression enrichment (Promoters / Enhancers)
1     CTAATTAAA     65.6  engrailed (en)
2     TTKCAATTAA    57.3  reversed-polarity (repo)
3     WATTRATTK     54.9  araucan (ara)
4     AAATTTATGCK   54.4  paired (prd)
5     GCAATAAA      51    ventral veins lacking (vvl)
6     DTAATTTRYNR   46.7  Ultrabithorax (Ubx)
7     TGATTAAT      45.7  apterous (ap)
8     YMATTAAAA     43.1  abdominal A (abd-A)           72.2
9     AAACNNGTT
10    RATTKAATT
11    GCACGTGT      39.5  fushi tarazu (ftz)
12    AACASCTG      38.8  broad-Z3 (br-Z3)
13    AATTRMATTA
14    TATGCWAAT
15    TAATTATG      37.5  Antennapedia (Antp)
16    CATNAATCA
17    TTACATAA
18    RTAAATCAA
19    AATKNMATTT
20    ATGTCAAHT
21    ATAAAYAAA
22    YYAATCAAA
23    WTTTTATG      33.8  Abdominal B (Abd-B)
24    TTTYMATTA     33.6  extradenticle (exd)
25    TGTMAATA
26    TAAYGAG
27    AAAKTGA
28    AAANNAAA
29    RTAAWTTAT     32.9  gooseberry-neuro (gsb-n)
30    TTATTTAYR     32.9  Deformed (Dfd)                30.7
1. Many of the top discovered motifs match known motifs
2. Motifs are associated with genes that are preferentially expressed in tissues

Discovered motifs have functional enrichments
1. Most motifs are avoided in ubiquitously expressed genes
2. Functional clusters emerge
[Figure: enrichment or depletion of a motif in the promoters of genes expressed in a tissue; axes: Tissues, Motifs]