Gibbs Sampling in Motif Finding

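Gibbs Sampling
Given:
  sequences $x^1, \ldots, x^N$,
  motif length $K$,
  background $B$
Find:
  model $M$
  locations $a_1, \ldots, a_N$ in $x^1, \ldots, x^N$
maximizing the log-odds likelihood ratio:
  $\sum_{i=1}^{N} \sum_{k=1}^{K} \log \dfrac{M(k, x^i_{a_i+k-1})}{B(x^i_{a_i+k-1})}$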

Gibbs Sampling
AlignACE: first statistical motif finder
BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences $x^1, \ldots, x^N$
   b. Compute an initial model $M$ from these locations
2. Sampling iterations:
   a. Remove one sequence $x^i$
   b. Recalculate the model
   c. Pick a new location of the motif in $x^i$, according to the probability that the location is a motif occurrence

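Gibbs Sampling
Initialization:
Select random locations $a_1, \ldots, a_N$ in $x^1, \ldots, x^N$
For these locations, compute $M$:
  $M_{kj} = \dfrac{1}{N} \sum_{i=1}^{N} \mathbf{1}(x^i_{a_i+k-1} = j)$
That is, $M_{kj}$ is the number of occurrences of letter $j$ in motif position $k$, divided by the total number of sequences, $N$.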

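Gibbs Sampling
Predictive Update:
Select a sequence $x = x^i$
Remove $x^i$, recompute the model:
  $M_{kj} = \dfrac{1}{N - 1 + B} \Big( \beta_j + \sum_{s=1,\ s \neq i}^{N} \mathbf{1}(x^s_{a_s+k-1} = j) \Big)$
where $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$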

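Gibbs Sampling
Sampling:
For every $K$-long word $x_j, \ldots, x_{j+K-1}$ in $x$:
  $Q_j = \mathrm{Prob}[\,\text{word} \mid \text{motif}\,] = M(1, x_j) \times \cdots \times M(K, x_{j+K-1})$
  $P_j = \mathrm{Prob}[\,\text{word} \mid \text{background}\,] = B(x_j) \times \cdots \times B(x_{j+K-1})$
Let
  $A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-K+1} Q_{j'} / P_{j'}}$
Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-K+1}$.
(Figure: probability plotted against position in $x$, from 0 to $|x|$.)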
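The initialization, predictive update, and sampling steps can be put together in a few lines. The following is a minimal Python sketch, not the AlignACE or BioProspector implementation: the function names (`build_model`, `position_weights`, `gibbs`), the uniform pseudocount `beta`, the fixed iteration count, and the choice of a random sequence at each step are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def build_model(seqs, locs, K, skip, beta=1.0):
    """Predictive update: estimate M from every sequence except seqs[skip].

    M[k][c] = (beta + count of letter c at motif position k) / (N - 1 + B),
    where B is the sum of the pseudocounts (here beta for every letter).
    """
    beta_total = beta * len(ALPHABET)
    n_used = len(seqs) - 1
    model = []
    for k in range(K):
        counts = {c: beta for c in ALPHABET}
        for s, (seq, a) in enumerate(zip(seqs, locs)):
            if s != skip:
                counts[seq[a + k]] += 1
        model.append({c: counts[c] / (n_used + beta_total) for c in ALPHABET})
    return model

def position_weights(x, model, background):
    """Return A_j for every start j: normalized ratios Q_j / P_j."""
    K = len(model)
    ratios = []
    for j in range(len(x) - K + 1):
        r = 1.0
        for k in range(K):
            letter = x[j + k]
            r *= model[k][letter] / background[letter]
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def gibbs(seqs, K, background, iters=200, seed=0):
    """Run a fixed number of sampling iterations and return the locations."""
    rng = random.Random(seed)
    # Initialization: random start position in every sequence.
    locs = [rng.randrange(len(s) - K + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))            # sequence to remove
        model = build_model(seqs, locs, K, skip=i)
        A = position_weights(seqs[i], model, background)
        locs[i] = rng.choices(range(len(A)), weights=A)[0]
    return locs
```

A call such as `gibbs(seqs, 4, {c: 0.25 for c in "ACGT"})` returns one start position per sequence. Because the procedure is stochastic, different seeds can give different answers, which is why the slides recommend several restarts.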

Gibbs Sampling
Running Gibbs Sampling:
1. Initialize
2. Run until convergence
3. Repeat 1, 2 several times, report common motifs

Advantages / Disadvantages
Very similar to EM
Advantages:
  Easier to implement
  Less dependent on initial parameters
  More versatile, easier to enhance with heuristics
Disadvantages:
  More dependent on all sequences to exhibit the motif
  Less systematic search of initial parameter space

Repeats, and a Better Background Model
Repeat DNA can be mistaken for a motif
  Especially low-complexity sequence: CACACA…, AAAAA, etc.
Solution: a more elaborate background model
  0th order: $B = \{ p_A, p_C, p_G, p_T \}$
  1st order: $B = \{ P(A|A), P(A|C), \ldots, P(T|T) \}$
  …
  Kth order: $B = \{ P(X \mid b_1 \ldots b_K);\ X, b_i \in \{A, C, G, T\} \}$
Has been applied to EM and Gibbs (up to 3rd order)
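As an illustration of the higher-order background, the sketch below estimates the conditional probabilities $P(X \mid b_1 \ldots b_K)$ by counting. The function name `markov_background`, the add-one pseudocount, and the dictionary layout are assumptions for this example. Input sequences are assumed to contain only the letters A, C, G, T.

```python
from itertools import product

def markov_background(seqs, order, alphabet="ACGT", pseudo=1.0):
    """Estimate P(next letter | previous `order` letters) from sequences.

    Returns a dict mapping each context string of length `order` to a
    dict of letter probabilities. Order 0 has the single context "".
    """
    counts = {"".join(ctx): {c: pseudo for c in alphabet}
              for ctx in product(alphabet, repeat=order)}
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {c: n / sum(row.values()) for c, n in row.items()}
            for ctx, row in counts.items()}
```

On a CA-repeat such as `"CACACACA"`, a 1st-order model learns that A is likely after C, so a word like CACA stops looking surprising relative to the background, which is the effect the slide describes.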

Example Application: Motifs in Yeast
Group: Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
  Microarrays on 6,220 mRNAs from yeast
  Affymetrix chips (Cho et al.)
  15 time points across two cell cycles

Processing of Data
1. Selection of 3,000 genes
   Genes with most variable expression were selected
2. Clustering according to common expression
   K-means clustering
   30 clusters, genes/cluster
   Clusters correlate well with known function
3. AlignACE motif finding
   600-long upstream regions
   50 regions/trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Phylogenetic Footprinting (Slides by Martin Tompa)

Phylogenetic Footprinting (Tagle et al. 1988)
Functional sequences evolve more slowly than nonfunctional ones
Consider a set of orthologous sequences from different species
Identify unusually well conserved regions

Substring Parsimony Problem
Given:
  phylogenetic tree $T$,
  set of orthologous sequences at the leaves of $T$,
  length $k$ of motif,
  threshold $d$
Problem: Find each set $S$ of $k$-mers, one $k$-mer from each leaf, such that the “parsimony” score of $S$ in $T$ is at most $d$.
This problem is NP-hard.

Small Example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4

Solution
Parsimony score: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
(Figure: the phylogenetic tree over these sequences; labels shown: ACGG, ACGT.)

CLUSTALW multiple sequence alignment (rbcS gene)
Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA A GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC
Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA
Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA
Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for the subtree rooted at node $u$, if $u$ is labeled with string $s$.
(Figure: the example tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG. Each node carries a table of $W_u[s]$ values with one entry per $k$-mer $s$, for example ACGG: 2, ACGT: 1 at one node and ACGG: $+\infty$, ACGT: 0 at another.)

Recurrence
  $W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \big( W_v[t] + d(s, t) \big)$

Running Time
  $O(k \cdot 4^{2k})$ time per node
  $n$ = number of species, $l$ = average sequence length, $k$ = motif length
  Total time $O(n\,k\,(4^{2k} + l))$
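The recurrence can be evaluated bottom-up on a small tree. The sketch below is a direct, unoptimized Python rendering under several assumptions that are not in the slides: the tree is given as nested tuples with leaf names as strings, $d(s, t)$ is taken to be Hamming distance, and a leaf table is 0 for every $k$-mer that occurs in the leaf sequence and infinite otherwise. The tree shape used in the usage note is also hypothetical. The function returns only the root table, so recovering the actual sets $S$ would need a traceback.

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def root_table(tree, leaf_seqs, k, alphabet="ACGT"):
    """Return W_root as a dict {k-mer: best parsimony score}.

    tree      -- nested tuples of children; a leaf is a name (str)
    leaf_seqs -- dict mapping leaf name to its sequence
    """
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    def table(node):
        if isinstance(node, str):                      # leaf
            seq = leaf_seqs[node]
            present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            return {s: (0 if s in present else INF) for s in all_kmers}
        result = {s: 0 for s in all_kmers}
        for child in node:
            child_table = table(child)
            # Only finite entries can achieve the minimum.
            finite = [(t, w) for t, w in child_table.items() if w != INF]
            for s in all_kmers:
                result[s] += min((w + hamming(s, t) for t, w in finite),
                                 default=INF)
        return result

    return table(tree)
```

With the five sequences of the small example and the assumed tree `((("Human", "Chimp"), "Rabbit"), ("Mouse", "Rat"))`, the minimum over the root table is 1, matching the parsimony score of 1 mutation stated on the solution slide. The loop over all pairs of $k$-mers at each node is what gives the $4^{2k}$ factor in the running time.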

Limits of Motif Finders
Given upstream regions of coregulated genes:
  Increasing length makes motif finding harder: random motifs clutter the true ones
  Decreasing length makes motif finding harder: the true motif is missing in some sequences

Limits of Motif Finders
A $(k, d)$-motif is a $k$-long motif with $d$ random differences per copy
Motif Challenge problem:
  Find a (15,4) motif in $N$ sequences of length $L$
CONSENSUS, MEME, AlignACE, and most other programs fail for $N = 20$, $L = 1000$
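To make the challenge concrete, the sketch below generates a test instance: random background sequences, each with one planted copy of a random motif that has been mutated at exactly $d$ positions. The function name `plant_motif`, the uniform background, and the choice of exactly $d$ substitutions at distinct positions are assumptions for illustration.

```python
import random

def plant_motif(n_seqs, length, k, d, seed=0, alphabet="ACGT"):
    """Return (motif, sequences, positions) for a planted (k, d)-motif."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(alphabet) for _ in range(k))
    seqs, positions = [], []
    for _ in range(n_seqs):
        copy = list(motif)
        # Mutate exactly d distinct positions to a different letter.
        for p in rng.sample(range(k), d):
            copy[p] = rng.choice([c for c in alphabet if c != copy[p]])
        seq = [rng.choice(alphabet) for _ in range(length)]
        start = rng.randrange(length - k + 1)
        seq[start:start + k] = copy
        seqs.append("".join(seq))
        positions.append(start)
    return motif, seqs, positions
```

Calling `plant_motif(20, 1000, 15, 4)` produces an instance of the size quoted above. Two planted copies can differ from each other in up to $2d = 8$ of their 15 positions, which helps explain why pairwise similarity between copies is a weak signal.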

RNA Secondary Structure

RNA and Translation

RNA and Splicing

RNA secondary structure elements (figure labels): hairpin loops, stems, bulge loop, interior loops, multi-branched loop

Modeling RNA Secondary Structure: Context-Free Grammars

A Context Free Grammar
S → AB
A → aAc | a
B → bBd | b
Nonterminals: S, A, B
Terminals: a, b, c, d
Derivation:
  S → AB → aAcB → … → aaaacccB → aaaacccbBd → … → aaaacccbbbbddd
Produces all strings $a^{i+1} c^{i} b^{j+1} d^{j}$, for $i, j \geq 0$
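The derivation can be replayed mechanically. In the sketch below, `derive` applies A → aAc exactly i times and B → bBd exactly j times before terminating each nonterminal, and `in_language` checks the closed form $a^{i+1} c^{i} b^{j+1} d^{j}$. Both function names are made up for this example.

```python
import re

def derive(i, j):
    """List the sentential forms of a derivation of a^(i+1) c^i b^(j+1) d^j."""
    steps = ["S", "AB"]
    a_part = "A"
    for _ in range(i):                      # A -> aAc
        a_part = a_part.replace("A", "aAc")
        steps.append(a_part + "B")
    a_part = a_part.replace("A", "a")       # A -> a
    steps.append(a_part + "B")
    b_part = "B"
    for _ in range(j):                      # B -> bBd
        b_part = b_part.replace("B", "bBd")
        steps.append(a_part + b_part)
    b_part = b_part.replace("B", "b")       # B -> b
    steps.append(a_part + b_part)
    return steps

def in_language(s):
    """True if s has the form a^(i+1) c^i b^(j+1) d^j with i, j >= 0."""
    m = re.fullmatch(r"(a+)(c*)(b+)(d*)", s)
    return (m is not None
            and len(m[1]) == len(m[2]) + 1
            and len(m[3]) == len(m[4]) + 1)
```

For example, `derive(3, 3)[-1]` is `"aaaacccbbbbddd"`, and `in_language` accepts it. A string with six b's and only three d's would be rejected, because the grammar always produces exactly one more b than d.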

Example: modeling a stem loop
S → a W1 u
W1 → c W2 g
W2 → g W3 c
W3 → g L c
L → agucg
What if the stem loop can have other letters in place of the ones shown?
(Figure: stem loop with stem ACGG paired to UGCC and loop AGUCG.)

Example: modeling a stem loop
S → a W1 u | g W1 u
W1 → c W2 g
W2 → g W3 c | g W3 u
W3 → g L c | a L u
L → agucg | agccg | cugugc
More general: any 4-long stem, 3-5-long loop:
S → aW1u | gW1u | gW1c | cW1g | uW1g | uW1a
W1 → aW2u | gW2u | gW2c | cW2g | uW2g | uW2a
W2 → aW3u | gW3u | gW3c | cW3g | uW3g | uW3a
W3 → aLu | gLu | gLc | cLg | uLg | uLa
L → aL1 | cL1 | gL1 | uL1
L1 → aL2 | cL2 | gL2 | uL2
L2 → a | c | g | u | aa | … | uu | aaa | … | uuu
(Figure: three example stem loops: stem ACGG/UGCC with loop AGUCG, stem GCGA/UGCU with loop AGCCG, stem GCGA/UGUU with loop CUGUCG.)
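Because the general grammar has a fixed stem length and a bounded loop, membership can be checked directly without a parser. The sketch below is one way to do that. The function name `is_stem_loop` and its parameters are assumptions, and the allowed pairs are read off the productions for S, W1, W2, W3 above.

```python
# (5' base, 3' base) pairs allowed by the rules X -> x W y in the grammar.
PAIRS = {("a", "u"), ("g", "u"), ("g", "c"),
         ("c", "g"), ("u", "g"), ("u", "a")}

def is_stem_loop(s, stem=4, min_loop=3, max_loop=5):
    """True if s is derivable from the general stem-loop grammar."""
    if any(ch not in "acgu" for ch in s):
        return False
    loop_len = len(s) - 2 * stem
    if not (min_loop <= loop_len <= max_loop):
        return False
    # Position i from the left must pair with position i from the right.
    return all((s[i], s[-1 - i]) in PAIRS for i in range(stem))
```

For instance, `is_stem_loop("acggagucgccgu")` is true: the stem acgg pairs with ccgu read from the 3' end, and agucg is a 5-long loop.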

A parse tree: alignment of CFG to sequence
S → a W1 u
W1 → c W2 g
W2 → g W3 c
W3 → g L c
L → agucg
(Figure: parse tree with nodes S, W1, W2, W3, L aligned to the stem loop with stem ACGG/UGCC and loop AGUCG.)

Alignment scores for parses!
We can define each rule X → s, where s is a string, to have a score.
Example:
  W → a W’ u: 3 (forms 3 hydrogen bonds)
  W → g W’ c: 2 (forms 2 hydrogen bonds)
  W → g W’ u: 1 (forms 1 hydrogen bond)
  W → x W’ z: -1, when (x, z) is not an a/u, g/c, or g/u pair
Questions:
  - How do we best align a CFG to a sequence? (DP)
  - How do we set the parameters? (Stochastic CFGs)

The Nussinov Algorithm
Let’s forget CFGs for a moment
Problem: Find the RNA structure with the maximum (weighted) number of nested pairings
(Figure: folded structure of the sequence ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG.)

The Nussinov Algorithm
Given sequence $X = x_1 \ldots x_N$,
Define DP matrix:
  $F(i, j)$ = maximum number of bonds if $x_i \ldots x_j$ folds optimally
Two cases, if $i < j$:
1. $x_i$ is paired with $x_j$:
   $F(i, j) = s(x_i, x_j) + F(i+1, j-1)$
2. $x_i$ is not paired with $x_j$:
   $F(i, j) = \max_{k:\ i \leq k < j} \big( F(i, k) + F(k+1, j) \big)$

The Nussinov Algorithm
Initialization:
  $F(i, i-1) = 0$; for $i = 2$ to $N$
  $F(i, i) = 0$; for $i = 1$ to $N$
Iteration:
  For $l = 2$ to $N$:
    For $i = 1$ to $N - l + 1$:
      $j = i + l - 1$
      $F(i, j) = \max \big\{\ F(i+1, j-1) + s(x_i, x_j),\ \ \max_{i \leq k < j} F(i, k) + F(k+1, j)\ \big\}$
Termination:
  Best structure is given by $F(1, N)$
  (Need to trace back; refer to the Durbin book)
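The fill step translates almost line for line into code. The sketch below uses 0-based indices and returns only the optimal score $F(1, N)$, without the traceback. The default scoring function, which gives 1 for an a/u, g/c, or g/u pair and 0 otherwise, is an assumption chosen for simplicity, and a weighted scheme can be passed in as `score`. As in the slides, no minimum hairpin loop length is enforced.

```python
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"),
         ("c", "g"), ("g", "u"), ("u", "g")}

def pair_score(x, y):
    """Assumed default s(x, y): 1 for an allowed base pair, else 0."""
    return 1 if (x, y) in PAIRS else 0

def nussinov(x, score=pair_score):
    """Return the maximum total score of nested pairings in sequence x."""
    n = len(x)
    if n == 0:
        return 0
    # F[i][j] is the best score for x[i..j]; shorter spans start at 0.
    F = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            # Case 1: x[i] pairs with x[j] (an empty interior scores 0).
            inner = F[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = inner + score(x[i], x[j])
            # Case 2: split the span into two independent parts.
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1]
```

For example, `nussinov("gggaaaccc")` is 3: the three g's pair with the three c's around the aaa loop. The triple loop over `length`, `i`, and `k` makes the running time cubic in the sequence length.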

The Nussinov Algorithm and CFGs
Define the following grammar, with scores:
S → a S u : 3 | u S a : 3
    g S c : 2 | c S g : 2
    g S u : 1 | u S g : 1
    S S : 0 |
    a S : 0 | c S : 0 | g S : 0 | u S : 0 | ε : 0
Note: ε is the empty string
Then, the Nussinov algorithm finds the optimal parse of a string with this grammar

The Nussinov Algorithm
Initialization:
  $F(i, i-1) = 0$; for $i = 2$ to $N$
  $F(i, i) = 0$; for $i = 1$ to $N$    [S → a | c | g | u]
Iteration:
  For $l = 2$ to $N$:
    For $i = 1$ to $N - l + 1$:
      $j = i + l - 1$
      $F(i, j)$ = max of:
        $F(i+1, j-1) + s(x_i, x_j)$    [S → a S u | …]
        $\max_{i \leq k < j} F(i, k) + F(k+1, j)$    [S → S S]
Termination:
  Best structure is given by $F(1, N)$

Stochastic Context Free Grammars
By analogy with HMMs, we can assign probabilities to transitions:
Given grammar
  $X_1 \to s_{11} \mid \ldots \mid s_{1n}$
  …
  $X_m \to s_{m1} \mid \ldots \mid s_{mn}$
we can assign a probability to each rule, s.t.
  $P(X_i \to s_{i1}) + \ldots + P(X_i \to s_{in}) = 1$

Example
S → a S b : ½
    a : ¼
    b : ¼
Probability distribution over all strings $x$:
  $x = a^{n} b^{n+1}$: then $P(x) = 2^{-n} \times \tfrac{1}{4} = 2^{-(n+2)}$
  $x = a^{n+1} b^{n}$: same
  Otherwise: $P(x) = 0$
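Since every string in this language has exactly one derivation, its probability is the product of the rule probabilities along that derivation. The sketch below follows the rules directly. The function name `scfg_prob` is made up for the example, and exact fractions are used so the results can be compared with the closed form.

```python
from fractions import Fraction

def scfg_prob(x):
    """Probability of string x under S -> aSb (1/2) | a (1/4) | b (1/4)."""
    if x in ("a", "b"):                       # S -> a  or  S -> b
        return Fraction(1, 4)
    if len(x) >= 3 and x[0] == "a" and x[-1] == "b":
        return Fraction(1, 2) * scfg_prob(x[1:-1])   # S -> a S b
    return Fraction(0)
```

For example, `scfg_prob("aab")` is 1/8, which agrees with $2^{-(n+2)}$ for $n = 1$. Summing over both families of strings for $n = 0, 1, 2, \ldots$ gives $\sum_n 2 \cdot 2^{-(n+2)} = 1$, so the rule probabilities define a proper distribution over strings.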