1 Regulatory Motif Finding Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002) Statistical.

Presentation transcript:

1 Regulatory Motif Finding Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002) Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence

2 “Regulatory Motif Finding”  What is being regulated?  What is a “Motif?”  Why do we want to find them?

3 Central Dogma of Genetics (pict by Andrew Hughes, Rice University)  It’s “TRUE,” right?!  Yes, but…

4 Every Protein in Every Cell?  Clearly, there are complicated mechanisms at work  Rhodopsin  But, we have the same DNA in all cells…

5 Transcriptional Regulation  It is transcription (DNA → RNA) that is being regulated.  RNA Polymerase II, aided by Transcription Factors (TFs)  Where do TFs bind?

6 Promoter Regions (pict by Andrew Hughes, Rice University)  TATA box – usually ~ 30 bp upstream of gene  But, there are others...Where? What Sequence?

7 Promoter Sequence  Many different possible locations, sometimes extremely far from the start of transcription!  What Sequence? THAT is the $64k (or $1B) Question…

8 Motifs  Many different promoter sequences found  Basal: TATA-box (around -30), CCAAT-box (-100)  Additional transcriptional regulatory domains  Activators and inhibitors use these domains

9 Motifs (2)  Not exact sequences – that would be too easy  Strength of binding affects the level of promotion/inhibition (C/G vs A/T)  Described either probabilistically with motif logos or with extended single-letter nucleotide codes  Are often palindromic (GATATC)

10 Extended Single-Letter Codes  Letters represent possible bases in each position:

Symbol   Meaning
A        Adenine
G        Guanine
C        Cytosine
T        Thymine
U        Uracil
Y        pYrimidine (C or T)
R        puRine (A or G)
W        "Weak" (A or T)
S        "Strong" (C or G)
K        "Keto" (T or G)
M        "aMino" (C or A)
B        not A (C or G or T)
D        not C (A or G or T)
H        not G (A or C or T)
V        not T (A or C or G)
X, N, ?  unknown (A or C or G or T)

 TGASTMA – Promoter Sequence for several oncogenes
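The single-letter codes above can be turned into a pattern matcher quite directly. The following is a minimal Python sketch, not part of the original slides: the function names `motif_to_regex` and `find_sites` and the test sequence are made up for illustration, and only the forward strand is scanned.

```python
import re

# Extended single-letter codes mapped to the bases they allow
# (taken from the table on this slide; U, X and ? are left out for brevity).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def motif_to_regex(motif):
    """Turn a degenerate motif such as TGASTMA into a compiled regex."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def find_sites(motif, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=" + motif_to_regex(motif).pattern + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical test sequence containing two instances of TGASTMA.
sites = find_sites("TGASTMA", "CCTGACTCAGGTGAGTAATT")
```

Here `TGASTMA` expands to `TGA[CG]T[AC]A`, so both `TGACTCA` and `TGAGTAA` count as instances of the same motif.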

11 Motif Logos  Height of letters represents probability of being found in that location in the motif
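As a rough illustration of how logo letter heights can be computed: in the widely used information-content style of sequence logo, the total stack height at a position is its information content in bits, and each letter takes a share of that stack proportional to its frequency. The sketch below assumes that convention (the slide itself only mentions probability), uses no small-sample correction, and the function name `logo_heights` is hypothetical.

```python
import math

def logo_heights(column_counts):
    """Letter heights in bits for one motif position, given base counts."""
    total = sum(column_counts.values())
    # Drop zero counts so that log2 is never evaluated at 0.
    freqs = {b: c / total for b, c in column_counts.items() if c > 0}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    info = 2.0 - entropy  # at most 2 bits for a 4-letter alphabet
    return {b: p * info for b, p in freqs.items()}

# A perfectly conserved position versus a completely uninformative one.
conserved = logo_heights({"A": 10, "C": 0, "G": 0, "T": 0})
uniform = logo_heights({"A": 5, "C": 5, "G": 5, "T": 5})
```

A fully conserved position yields a single letter of height 2 bits, while a uniform position yields letters of height 0.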

12 Why do we care?  Gene regulation  transcriptional regulation  Can teach us about our complex signaling pathways  Drugs and Money

13 So…Finding Regulatory Motifs  Statistical Models paper (Liu et al)  Assumes: We have located genes that we expect to be co-regulated (microarrays, co-expression)

14 So…Finding Regulatory Motifs  Experimental methods of determining TF binding sites (Gel Shift assay, DNA Protection Assay)  Statistical models

15 Single-Site Model  Assumes: - Each sequence contains 1 motif - Sequences are generated by random draws from {A,C,G,T} with given prior probabilities - Motif has a frequency matrix for each position  Use Gibbs site sampler: Missing Data Problem. Randomly choose motif locations. Then move the motif locations based on P(a_k)

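16 Gibbs Sampling Sampling: For every k-long word $x_j, \ldots, x_{j+k-1}$ in $x$: $Q_j = \mathrm{Prob}[\text{word} \mid \text{motif}] = M(1, x_j) \times \cdots \times M(k, x_{j+k-1})$ and $P_j = \mathrm{Prob}[\text{word} \mid \text{background}] = B(x_j) \times \cdots \times B(x_{j+k-1})$. Let $A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$. Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$. [Figure: probability plotted against position in $x$, from 0 to $|x|$]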
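The sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the Liu et al. paper: it assumes exactly one site per sequence, a fixed background distribution, a pseudocount of 1 per base, and a fixed number of sweeps. All function names are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(seqs, starts, k, skip, pseudo=1.0):
    """Frequency matrix M[pos][base] from all current sites except sequence `skip`."""
    counts = [{b: pseudo for b in BASES} for _ in range(k)]
    for i, (seq, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for pos in range(k):
            counts[pos][seq[a + pos]] += 1
    profile = []
    for col in counts:
        total = sum(col.values())
        profile.append({b: col[b] / total for b in BASES})
    return profile

def position_weights(seq, profile, background, k):
    """Normalised ratios Q_j / P_j for every k-long word in seq."""
    ratios = []
    for j in range(len(seq) - k + 1):
        r = 1.0
        for pos in range(k):
            base = seq[j + pos]
            r *= profile[pos][base] / background[base]
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def gibbs_sampler(seqs, k, background, iterations=200, seed=0):
    """Return one sampled motif start per sequence after `iterations` sweeps."""
    rng = random.Random(seed)
    # Start from random motif locations, as on the Single-Site Model slide.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            profile = build_profile(seqs, starts, k, skip=i)
            weights = position_weights(seq, profile, background, k)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, a practical run would typically use several random restarts and keep the highest-scoring set of locations; that scoring step is omitted here.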

17 Repetitive Block-Motif Model  View K sequences as one long sequence of length n. Model probability of a motif starting at each position ‘i’.  Problems: - Lose evolutionary relationship between sequences - Allows multiple copies of motif in each sequence - Total number of occurrences unknown

18 The Rest of the Statistical Models Paper…  Much math: – Scoring motif candidates – Using potential motif dictionaries – Bayesian Prior Probabilities – Finding motifs with insertions in them (“gapped” motifs)  On to: Phylogenetic Footprinting

19 Phylogenetic Footprinting  Most of paper spent describing background, results  Methods are brief, not too deep

20 Let Evolution Be Your Guide  Phylogenetic Footprinting – “Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”

21 Orthologs and Paralogs Gene duplicate within species: Paralog Same gene in species with common ancestor: Ortholog

22 Advantages  Doesn’t rely on reliably determining co-regulated genes (single-genome approach, non-trivial!)  Can be used to find regulatory elements specific to one single gene (caveat: conserved across species)

23 Standard Methods  Usually start with an MSA (ProbCons, ClustalW) – But, this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp) – Also, if species are evolutionarily close, nonfunctional regions may also be well conserved  Can start with general motif discovery algorithms (MEME, Consensus, AlignACE, DIALIGN …) – But, these don’t take into account the relative phylogenetic relationships of the sequences, and so will weight closely related sequences too highly

24 The PF Algorithm Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d Problem: Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

25 AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4 Small Example (merci, CS262)

26 Solution Parsimony score: 1 mutation AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACGG ACGT

27 An Exhaustive Algorithm $W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s. [Figure: the example phylogenetic tree with a table of $4^k$ entries at each node, listing $W_u[s]$ for k-mers such as ACGG and ACGT; leaf entries are 0 or $+\infty$.]

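28 Simple Recurrence $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$, where $h(s, t)$ is the number of positions at which s and t differ. In words: the score of a k-mer at a node is the sum, over the node's children, of each child's best parsimony score for that k-mer.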
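To make the recurrence concrete, here is a small Python sketch that fills in the tables for the five-sequence example from the earlier slide. It is an illustration under assumptions rather than FootPrinter itself: the tree shape ((Human, Chimp), Rabbit, (Mouse, Rat)) is assumed, leaves score 0 for k-mers they contain and infinity otherwise, every sequence is assumed to have length at least k, and only scores are computed (recovering the actual sets of k-mers with score at most d would need a traceback).

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of positions at which equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_table(node, k, all_kmers):
    """W_u[s] for every k-mer s at `node`.

    A leaf is a plain sequence string; an internal node is a tuple of children.
    """
    if isinstance(node, str):
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else INF) for s in all_kmers}
    child_tables = [parsimony_table(child, k, all_kmers) for child in node]
    # Only finite child entries can achieve the minimum, so skip the rest.
    finite = [[(t, c) for t, c in w.items() if c < INF] for w in child_tables]
    table = {}
    for s in all_kmers:
        table[s] = sum(min(c + hamming(s, t) for t, c in entries)
                       for entries in finite)
    return table

# Sequences from the small example; the tree topology is an assumption.
human = "AGTCGTACGTGAC"
chimp = "AGTAGACGTGCCG"
rabbit = "ACGTGAGATACGT"
mouse = "GAACGGAGTACGT"
rat = "TCGTGACGGTGAT"
tree = ((human, chimp), rabbit, (mouse, rat))

k = 4
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
root = parsimony_table(tree, k, all_kmers)
best_score = min(root.values())
```

With this tree, labeling the root ACGT gives a score of 1, in line with the one-mutation solution on the Solution slide (rat contributes ACGG, the other species ACGT).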

29 Running Time $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$ $O(k \cdot 4^{2k})$ time per node. With n = number of species, l = average sequence length, and k = motif length, the total time is $O(n k (4^{2k} + l))$.
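A short worked calculation may help show where the $4^{2k}$ term comes from and how quickly it grows. The reading below is an interpretation of the recurrence: it counts the work for one parent-child pair, with each Hamming distance computed naively in $O(k)$ steps.

```latex
% Each of the 4^k labels s at the parent is compared with
% each of the 4^k labels t at the child, at O(k) per comparison:
\[
  \underbrace{4^k}_{\text{choices of } s} \times
  \underbrace{4^k}_{\text{choices of } t} \times
  \underbrace{k}_{\text{cost of } h(s,t)}
  \;=\; k \cdot 4^{2k}.
\]
% Two sample motif lengths:
\[
  k = 4:\quad 4^{8} = 65{,}536 \text{ pairs } (s,t), \qquad
  k = 10:\quad 4^{20} = 2^{40} \approx 1.1 \times 10^{12} \text{ pairs } (s,t).
\]
```

The exponential dependence on k is why the exhaustive version is only practical for short motifs.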

30 FootPrinter  Avoids pitfalls of using MSA or general-purpose motif-finding algorithms  Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence  Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)

31 FootPrinter (2)  “Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”

32 Parameters  Can set minimum threshold on fraction of the phylogeny that must be spanned for motifs with each parsimony score ‘s’.

33 Results  Examined 9 sets of orthologous or paralogous sequences (the method also works for duplicated genes that have since diverged).  Found: many previously known motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)

34 One example: Metallothionein Gene Family  Good test family: – Large number of promoter sequences – Wide variety of species – Large number of regulatory elements experimentally verified in several species.  Most binding sites are within 300 bp of start codon (ATG)

35  Input sequences: 590 bp upstream of the start codon  Most motifs found were present in multiple isoform families – gained accuracy by considering the paralogs, not just the orthologs

36 But, FootPrinter isn’t Perfect  Some known regulatory binding sites were missed. Why?  Ultimately, must be because the motifs were not well-enough conserved to be detected (but we can discuss more…)

37 FootPrinter Error (1)  Some binding sites not well matched in other species. Example: Thyroid hormone receptor T3R is conserved within rodents, but not beyond. Would need many closely related species to detect this motif.

38 FootPrinter Error (2-5)  Some motifs are well conserved, but too short  Indels in the middle of a motif – could allow them, but would get many false positives  Some barely fail to meet statistical thresholds (close but no cigar)  Dimeric TFs tend to have two conserved regions separated by a variable internal sequence