Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20.


Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences

An Introduction to Bioinformatics Algorithms (www.bioalgorithms.info)
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Why Is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Comparing the first two implanted instances position by position (| = match, . = mismatch):
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
Each instance is only 4 mutations away from the motif, yet the two instances differ from each other in 7 of 15 positions.

Challenge Problem
Find a motif in a sample of
- 20 “random” sequences (e.g. 600 nt long),
- each sequence containing an implanted pattern of length 15,
- each pattern appearing with 4 mismatches, i.e. as a (15,4)-motif.
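A challenge instance like the one above can be generated along the following lines. This is a minimal Python sketch and not part of the original slides: the names `mutate` and `implant_sample`, the fixed seed, and the choice to overwrite part of the random background (rather than insert into it) are assumptions made for illustration.

```python
import random

ALPHABET = "acgt"

def mutate(motif, d, rng):
    """Return a copy of `motif` with exactly d positions changed to another base."""
    chars = list(motif)
    for pos in rng.sample(range(len(chars)), d):
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)

def implant_sample(motif, d, t=20, n=600, seed=0):
    """Build t random sequences of length n, each hiding one mutated motif copy.

    Returns the sequences and the 0-based start of each implanted instance.
    """
    rng = random.Random(seed)
    motif = motif.lower()
    sample, starts = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - len(motif) + 1)
        # Overwrite a window of the background with a mutated motif instance.
        background[start:start + len(motif)] = mutate(motif, d, rng)
        sample.append("".join(background))
        starts.append(start)
    return sample, starts

sample, starts = implant_sample("AAAAAAAAGGGGGGG", d=4)
```

Keeping the true start positions makes it possible to check later whether a motif finder recovers the implanted instances.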

Combinatorial Gene Regulation
A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed.
How can one gene have such drastic effects?

Regulatory Proteins
Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF).
The 20 unexpressed genes rely on gene X’s TF to induce transcription.
A single TF may regulate multiple genes.

Regulatory Regions
Every gene contains a regulatory region (RR) typically stretching bp upstream of the transcriptional start site.
Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor.
TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region: the TFBS.

Transcription Factor Binding Sites
A TFBS can be located anywhere within the regulatory region.
A TFBS may vary slightly across different regulatory regions, since non-essential bases can mutate.

Motifs and Transcriptional Start Sites
[Figure: five genes, each paired with a motif instance: ATCCCG, TTCCGG, ATCCCG, ATGCCG, ATGCCC]

Transcription Factors and Motifs

Motif Logo
Motifs can mutate on non-important bases.
The five motifs in five different genes have mutations in positions 3 and 5.
Representations called motif logos illustrate the conserved and variable regions of a motif.
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

Motif Logos: An Example

Identifying Motifs
Genes are turned on or off by regulatory proteins.
These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase.
A regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS).
So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes.

Identifying Motifs: Complications
We do not know the motif sequence.
We do not know where it is located relative to the gene's start.
Motifs can differ slightly from one gene to the next.
How do we discern it from “random” motifs?

The Motif Finding Problem
Given a random sample of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Find the pattern that is implanted in each of the individual sequences, namely, the motif.

The Motif Finding Problem (cont’d)
Additional information:
The hidden sequence is of length 8.
The pattern is not exactly the same in each sequence because random point mutations may occur in the sequences.

The Motif Finding Problem (cont’d)
The patterns revealed with no mutations:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Consensus string: acgtacgt

The Motif Finding Problem (cont’d)
The patterns with 2 point mutations (mutated bases in upper case):
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
Can we still find the motif, now that we have 2 mutations?

Defining Motifs
To define a motif, let's say we know where the motif starts in each sequence.
The motif start positions in their sequences can be represented as $s = (s_1, s_2, s_3, \ldots, s_t)$.

Motifs: Profiles and Consensus
Line up the patterns by their start indexes $s = (s_1, s_2, \ldots, s_t)$.
Construct the profile matrix with the frequencies of each nucleotide in each column.
The consensus nucleotide in each position has the highest score in its column.

Alignment    a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
Profile   A  3 0 1 0 3 1 1 0
          C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    A C G T A C G T
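The profile and consensus construction can be sketched in a few lines of Python. The function names `profile` and `consensus` and the tie-breaking rule (first of a, c, g, t) are assumptions for illustration; the alignment rows are the five 8-mers from the slide.

```python
from collections import Counter

def profile(alignment):
    """Per-column nucleotide counts for equal-length, aligned l-mers."""
    return [Counter(column) for column in zip(*(row.lower() for row in alignment))]

def consensus(alignment):
    """Most frequent nucleotide in each column (ties go to the first of a, c, g, t)."""
    return "".join(
        max("acgt", key=lambda base: counts[base]) for counts in profile(alignment)
    )

alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
column_counts = profile(alignment)   # one Counter per column
motif = consensus(alignment)         # "acgtacgt"
```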

Consensus
Think of the consensus as an “ancestor” motif, from which mutated motifs emerged.
The distance between a real motif and the consensus sequence is generally less than that between two real motifs.

Consensus (cont’d)

Evaluating Motifs
We have a guess about the consensus sequence, but how “good” is this consensus?
We need to introduce a scoring function to compare different guesses and choose the “best” one.

Defining Some Terms
t - number of sample DNA sequences
n - length of each DNA sequence
DNA - sample of DNA sequences (t x n array)
l - length of the motif (l-mer)
$s_i$ - starting position of an l-mer in sequence i
$s = (s_1, s_2, \ldots, s_t)$ - array of the motif’s starting positions

Parameters
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
l = 8, t = 5, n = 69
$s = (s_1, s_2, s_3, s_4, s_5) = (26, 21, 3, 56, 60)$

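Scoring Motifs
Given $s = (s_1, \ldots, s_t)$ and DNA:
$\mathrm{Score}(s, \mathit{DNA}) = \sum_{i=1}^{l} \max_{k \in \{A, C, G, T\}} \mathrm{count}(k, i)$
where $\mathrm{count}(k, i)$ is the number of times nucleotide $k$ appears in column $i$ of the $t \times l$ alignment.

Alignment    a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
Profile   A  3 0 1 0 3 1 1 0
          C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    a c g t a c g t
Score        3+4+4+5+3+4+3+4 = 30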
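As a sketch of the scoring function in Python: `score` below is a hypothetical helper that takes 1-based starting positions, as on the Parameters slide, and sums the largest count in each column. The five sequences and starting positions are copied from the slides.

```python
from collections import Counter

def score(starts, dna, l):
    """Consensus score: for each column of the aligned l-mers, take the count
    of the most frequent nucleotide, then sum over columns.
    `starts` holds 1-based starting positions, one per sequence."""
    rows = [seq[s - 1:s - 1 + l].lower() for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*rows))

dna = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]
slide_score = score((26, 21, 3, 56, 60), dna, l=8)  # 30, as on the slide
```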

The Motif Finding Problem
If the starting positions $s = (s_1, s_2, \ldots, s_t)$ are given, finding the consensus is easy even with mutations in the sequences, because we can simply construct the profile to find the motif (consensus).
But the starting positions s are usually not given. How can we find the “best” profile matrix?

The Motif Finding Problem: Formulation
Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
Input: A t x n matrix of DNA, and l, the length of the pattern to find.
Output: An array of t starting positions $s = (s_1, s_2, \ldots, s_t)$ maximizing Score(s, DNA).

The Motif Finding Problem: Brute Force Solution
Compute the scores for each possible combination of starting positions s.
The best score will determine the best profile and the consensus pattern in DNA.
The goal is to maximize Score(s, DNA) by varying the starting positions $s_i$, where $s_i \in \{1, \ldots, n - l + 1\}$ and $i \in \{1, \ldots, t\}$.

Running Time of BruteForceMotifSearch
Varying $(n - l + 1)$ positions in each of $t$ sequences, we are looking at $(n - l + 1)^t$ sets of starting positions.
For each set of starting positions, the scoring function makes $l$ operations, so the complexity is $l\,(n - l + 1)^t = O(l\,n^t)$.
That means that for t = 8, n = 1000, l = 10 the number of computations is so large that the search would take billions of years.
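To get a feel for the size of this search space, the operation count can be evaluated directly. This is a small Python sketch; `brute_force_operations` is a hypothetical helper that simply evaluates the formula $l\,(n - l + 1)^t$ from the slide.

```python
def brute_force_operations(t, n, l):
    """Number of elementary scoring operations: l * (n - l + 1)^t."""
    return l * (n - l + 1) ** t

ops = brute_force_operations(t=8, n=1000, l=10)
print(f"{ops:.2e}")  # about 9.30e+24
```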

The Median String Problem
Given a set of t DNA sequences, find a pattern that appears in all t sequences with the minimum number of mutations.
This pattern will be the motif.

Hamming Distance
Hamming distance: $d_H(v, w)$ is the number of nucleotide pairs that do not match when v and w are aligned.
For example: $d_H(\text{AAAAAA}, \text{ACAAAC}) = 2$
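In Python the Hamming distance is a one-liner. The sketch below uses the hypothetical name `hamming_distance` and rejects strings of unequal length, which is an added safeguard rather than something the slide specifies.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))

example = hamming_distance("AAAAAA", "ACAAAC")  # 2, as on the slide
```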

Total Distance: An Example
Given v = “acgtacgt” and s, let x be the l-mer selected in each sequence (here every sequence contains an exact copy of v):
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat   d_H(v, x) = 0
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc   d_H(v, x) = 0
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt   d_H(v, x) = 0
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca   d_H(v, x) = 0
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc   d_H(v, x) = 0
TotalDistance(v, DNA) = 0

Total Distance: Example
Given v = “acgtacgt” and s, let x be the l-mer selected in each sequence (mutated bases in upper case):
cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat   d_H(v, x) = 1
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc   d_H(v, x) = 0
aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt   d_H(v, x) = 2
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca   d_H(v, x) = 0
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc   d_H(v, x) = 1
TotalDistance(v, DNA) = 1 + 0 + 2 + 0 + 1 = 4

Total Distance: Definition
For each DNA sequence i, compute all $d_H(v, x)$, where x is an l-mer with starting position $s_i$ ($1 \le s_i \le n - l + 1$).
Find the minimum of $d_H(v, x)$ among all l-mers in sequence i.
TotalDistance(v, DNA) is the sum of the minimum Hamming distances for each DNA sequence i.
$\mathrm{TotalDistance}(v, \mathit{DNA}) = \min_s d_H(v, s)$, where s is the set of starting positions $s_1, s_2, \ldots, s_t$ and $d_H(v, s)$ is the sum of the Hamming distances between v and the l-mers starting at those positions.
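The definition translates directly into code: slide v along every sequence, keep the smallest Hamming distance per sequence, and add these minima up. A minimal Python sketch, with hypothetical helper names and a small made-up input:

```python
def hamming_distance(v, w):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(v, w))

def min_distance(v, seq):
    """Smallest Hamming distance between v and any l-mer of seq (l = len(v))."""
    l = len(v)
    return min(hamming_distance(v, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_distance(v, dna):
    """Sum over all sequences of the per-sequence minimum distance to v."""
    return sum(min_distance(v, seq) for seq in dna)

# "acg" occurs exactly in the first sequence; the other two are 1 mismatch away.
toy_total = total_distance("acg", ["ttacgtt", "aagct", "accgg"])  # 0 + 1 + 1 = 2
```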

The Median String Problem: Formulation
Goal: Given a set of DNA sequences, find a median string.
Input: A t x n matrix DNA, and l, the length of the pattern to find.
Output: A string v of l nucleotides that minimizes TotalDistance(v, DNA) over all strings of that length.

Motif Finding Problem == Median String Problem
Motif Finding is a maximization problem, while Median String is a minimization problem.
However, the Motif Finding problem and the Median String problem are computationally equivalent.
We need to show that minimizing TotalDistance is equivalent to maximizing Score.

We Are Looking for the Same Thing

Alignment      a G g t a c T t
               C c A t a c g t
               a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
Profile     A  3 0 1 0 3 1 1 0
            C  2 4 0 0 1 4 0 0
            G  0 1 4 0 0 0 3 1
            T  0 0 0 5 1 0 1 4
               _______________
Consensus      a c g t a c g t
Score          3+4+4+5+3+4+3+4
TotalDistance  2+1+1+0+2+1+2+1
Sum            5 5 5 5 5 5 5 5

At any column i: $\mathrm{Score}_i + \mathrm{TotalDistance}_i = t$
Because there are l columns: $\mathrm{Score} + \mathrm{TotalDistance} = l \cdot t$
Rearranging: $\mathrm{Score} = l \cdot t - \mathrm{TotalDistance}$
Since $l \cdot t$ is constant, minimizing TotalDistance on the right side is equivalent to maximizing Score on the left side.
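The identity Score + TotalDistance = l * t can be checked numerically on the slide's alignment. The Python sketch below is only an illustration for this one fixed alignment; variable names are assumptions.

```python
from collections import Counter

alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
rows = [row.lower() for row in alignment]
t, l = len(rows), len(rows[0])
columns = list(zip(*rows))

# Consensus letter and its count for every column.
top = [Counter(column).most_common(1)[0] for column in columns]
consensus = "".join(letter for letter, _ in top)
score = sum(count for _, count in top)

# Distance of the aligned rows to the consensus string.
distance = sum(a != b for row in rows for a, b in zip(row, consensus))
```

Here `score` is 30 and `distance` is 10, which add up to l * t = 8 * 5 = 40.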

Motif Finding Problem vs. Median String Problem
Why bother reformulating the Motif Finding problem into the Median String problem?
The Motif Finding Problem needs to examine all the combinations for s. That is $(n - l + 1)^t$ combinations!
The Median String Problem needs to examine all $4^l$ combinations for v. This number is relatively small.

Motif Finding: Improving the Running Time
BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s_1, s_2, ..., s_t) from (1, 1, ..., 1) to (n-l+1, ..., n-l+1)
        if Score(s, DNA) > bestScore
            bestScore ← Score(s, DNA)
            bestMotif ← (s_1, s_2, ..., s_t)
    return bestMotif
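A runnable Python rendering of BruteForceMotifSearch might look as follows. It is a sketch under a few assumptions: starting positions are 1-based as on the slides, `itertools.product` stands in for the "for each s" loop, and the toy input is made up so that the search finishes instantly.

```python
from collections import Counter
from itertools import product

def score(starts, dna, l):
    """Consensus score of the l-mers at the given 1-based starting positions."""
    rows = [seq[s - 1:s - 1 + l].lower() for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*rows))

def brute_force_motif_search(dna, l):
    """Try every combination of starting positions and keep the best score."""
    best_score, best_motif = 0, None
    position_ranges = [range(1, len(seq) - l + 2) for seq in dna]
    for starts in product(*position_ranges):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score

dna = ["ttacgtt", "aacgtc", "gacgta"]
best_motif, best_score = brute_force_motif_search(dna, l=4)
```

On this input the only 4-mer shared by all three sequences is acgt, so the search returns the positions (3, 2, 2) with the maximum possible score 12.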

Structuring the Search
How can we perform the line "for each s = (s_1, s_2, ..., s_t) from (1, 1, ..., 1) to (n-l+1, ..., n-l+1)"?
We need a method for efficiently structuring and navigating the many possible motifs.
This is not very different from exploring all t-digit numbers.

Median String: Improving the Running Time
MedianStringSearch(DNA, t, n, l)
    bestWord ← AAA…A
    bestDistance ← ∞
    for each l-mer s from AAA…A to TTT…T
        if TotalDistance(s, DNA) < bestDistance
            bestDistance ← TotalDistance(s, DNA)
            bestWord ← s
    return bestWord
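The same pseudocode can be sketched in Python. The helper `total_distance` and the toy input are assumptions for illustration; `itertools.product` enumerates all $4^l$ candidate words in the order AAA…A to TTT…T.

```python
from itertools import product

def total_distance(word, dna):
    """Sum over sequences of the minimum Hamming distance from word to any l-mer."""
    l = len(word)
    return sum(
        min(sum(a != b for a, b in zip(word, seq[i:i + l]))
            for i in range(len(seq) - l + 1))
        for seq in dna
    )

def median_string_search(dna, l):
    """Exhaustively test every l-mer and keep the one with the smallest total distance."""
    best_word, best_distance = None, float("inf")
    for letters in product("acgt", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance

median, distance = median_string_search(["ttacgtt", "aacgtc", "gacgta"], l=4)
```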

Structuring the Search
For the Median String Problem we need to consider all $4^l$ possible l-mers:
aa…aa, aa…ac, aa…ag, aa…at, …, tt…tt (each of length l)
How to organize this search?

Alternative Representation of the Search Space
Let A = 1, C = 2, G = 3, T = 4.
Then the sequences from AA…A to TT…T become:
11…11, 11…12, 11…13, 11…14, …, 44…44 (each of length l)
Notice that the sequences above simply list all numbers as if we were counting in base 4 without using 0 as a digit.

Search Tree
root:     --
level 1:  a-  c-  g-  t-
level 2:  aa ac ag at   ca cc cg ct   ga gc gg gt   ta tc tg tt

Three Operations
NextLeaf, NextVertex, ByPass

NextLeaf: Example
Moving to the next leaf. [Figure: current location and next location in the search tree]
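The slides name NextLeaf but this transcript does not include its pseudocode. One common way to write it, sketched here in Python as an assumption, is an odometer-style increment over digits 1..k (k = 4 for DNA): bump the last digit, and on overflow reset it to 1 and carry to the left.

```python
def next_leaf(a, k):
    """Return the leaf after `a`, where `a` is a list of digits in 1..k.
    After the last leaf (k, k, ..., k) the count wraps around to (1, 1, ..., 1)."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1  # overflow: reset this digit and carry to the left
    return a

# Enumerate all 4^2 = 16 leaves of the two-level tree, starting from 11.
leaves, leaf = [], [1, 1]
while True:
    leaves.append(tuple(leaf))
    leaf = next_leaf(leaf, 4)
    if leaf == [1, 1]:
        break
```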

Depth First Search
So we can search leaves.
How about searching all vertices of the tree?
We can do this with a depth first search (preorder traversal).

NextVertex: Example
Moving to the next vertices. [Figure: location after 5 NextVertex moves]

ByPass: Example
Bypassing the descendants of “2-”. [Figure: next location]
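NextVertex and ByPass are likewise only named in this transcript. The Python sketch below follows one common formulation and should be read as an assumption: a vertex is a pair (a, i) where the first i digits of `a` are meaningful, L is the tree depth, k the number of children per vertex, and i = 0 signals that the traversal is finished.

```python
def next_vertex(a, i, L, k):
    """Preorder successor of the vertex given by the first i digits of a."""
    a = list(a)
    if i < L:                      # internal vertex: descend to its first child
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):      # leaf: move to the next sibling or climb up
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0                    # traversal finished

def bypass(a, i, L, k):
    """Skip the whole subtree below the current vertex."""
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

# Preorder walk over the two-level DNA tree (L = 2, k = 4), starting at the root.
visited, (a, i) = [], ([1, 1], 0)
while True:
    a, i = next_vertex(a, i, 2, 4)
    if i == 0:
        break
    visited.append(tuple(a[:i]))
```

The walk visits the 4 internal vertices and 16 leaves in preorder: 1-, 11, 12, 13, 14, 2-, and so on. Calling `bypass` at vertex 2- jumps straight to 3-.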

Revisiting Brute Force Search
Now that we have a method for navigating the tree (NextLeaf, NextVertex, ByPass), let's look again at BruteForceMotifSearch.

Can We Do Better?
Sets of $s = (s_1, s_2, \ldots, s_t)$ may have a weak profile for the first i positions $(s_1, s_2, \ldots, s_i)$.
Every row of the alignment may add at most l to Score.
Optimism: suppose all subsequent (t - i) positions $(s_{i+1}, \ldots, s_t)$ add $(t - i) \cdot l$ to Score(s, i, DNA).
If $\mathrm{Score}(s, i, \mathit{DNA}) + (t - i) \cdot l < \mathrm{BestScore}$, it makes no sense to search in vertices of the current subtree.
Use ByPass().

Branch and Bound Algorithm for Motif Search
Since each level of the tree goes deeper into the search, discarding a prefix discards all following branches.
This saves us from looking at $(n - l + 1)^{t-i}$ leaves.
Use NextVertex() and ByPass() to navigate the tree.

Branch and Bound Applied to Median String Search
Note that if the total distance for a prefix is greater than that for the best word so far,
TotalDistance(prefix, DNA) > BestDistance,
there is no use exploring the remaining part of the word.
We can eliminate that branch and use ByPass to skip exploring it further.
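The pruning rule for Median String can be sketched in Python. Note the swap: instead of navigating with NextVertex and ByPass, this sketch uses a recursive depth-first search, where returning early from a call plays the role of ByPass. Function names and the toy input are assumptions for illustration.

```python
def total_distance(word, dna):
    """Sum over sequences of the minimum Hamming distance from word to any window."""
    l = len(word)
    return sum(
        min(sum(a != b for a, b in zip(word, seq[i:i + l]))
            for i in range(len(seq) - l + 1))
        for seq in dna
    )

def branch_and_bound_median_string(dna, l):
    """Depth-first search over prefixes, pruning any prefix whose total
    distance already exceeds the best full word found so far."""
    best_word, best_distance = None, float("inf")

    def explore(prefix):
        nonlocal best_word, best_distance
        distance = total_distance(prefix, dna) if prefix else 0
        if distance > best_distance:
            return                      # prune: same effect as ByPass
        if len(prefix) == l:
            if distance < best_distance:
                best_word, best_distance = prefix, distance
            return
        for base in "acgt":
            explore(prefix + base)

    explore("")
    return best_word, best_distance

result = branch_and_bound_median_string(["ttacgtt", "aacgtc", "gacgta"], l=4)
```

The bound is safe because the distance of a prefix can never exceed the distance of any word that extends it.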

More on the Motif Problem
Exhaustive Search and Median String are both exact algorithms.
They always find the optimal solution, though they may be too slow to perform practical tasks.
Many algorithms sacrifice the optimal solution for speed.

CONSENSUS: Greedy Motif Search
Find the two closest l-mers in sequences 1 and 2 and form a 2 x l alignment matrix with Score(s, 2, DNA).
At each of the following t-2 iterations, CONSENSUS finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences.
In other words, it finds an l-mer in sequence i maximizing Score(s, i, DNA) under the assumption that the first (i-1) l-mers have already been chosen.
CONSENSUS sacrifices the optimal solution for speed: in fact, the bulk of the time is actually spent locating the first 2 l-mers.
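The greedy strategy described above can be sketched in Python as follows. This is a simplified illustration of the idea, not the CONSENSUS program itself; the helper names, the 1-based positions, and the toy input are assumptions.

```python
from collections import Counter
from itertools import product

def score(lmers):
    """Consensus score of a list of aligned l-mers."""
    return sum(max(Counter(column).values()) for column in zip(*lmers))

def lmers_of(seq, l):
    """All (1-based start, l-mer) pairs of a sequence."""
    return [(start + 1, seq[start:start + l]) for start in range(len(seq) - l + 1)]

def greedy_motif_search(dna, l):
    dna = [seq.lower() for seq in dna]
    # Step 1: the best-scoring pair of l-mers from the first two sequences.
    (s1, x1), (s2, x2) = max(
        product(lmers_of(dna[0], l), lmers_of(dna[1], l)),
        key=lambda pair: score([pair[0][1], pair[1][1]]),
    )
    starts, chosen = [s1, s2], [x1, x2]
    # Step 2: extend greedily, one sequence at a time, never revisiting choices.
    for seq in dna[2:]:
        s, x = max(lmers_of(seq, l), key=lambda cand: score(chosen + [cand[1]]))
        starts.append(s)
        chosen.append(x)
    return starts

greedy_starts = greedy_motif_search(["ttacgtt", "aacgtc", "gacgta"], l=4)
```

Step 1 inspects on the order of $(n - l + 1)^2$ pairs, while each later step inspects only $n - l + 1$ candidates, which is why most of the time goes into the first two l-mers.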

Introduction to Bioinformatics Algorithms Randomized Motif Finding

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Randomized Algorithms Randomized algorithms make random rather than deterministic decisions. The main advantage is that no input can reliably produce worst-case results because the algorithm runs differently each time. These algorithms are commonly used in situations where no exact and fast algorithm is known.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info A New Motif Finding Approach Motif Finding Problem: Given a list of t sequences each of length n, find the “best” pattern of length l that appears in each of the t sequences. Previously: we solved the Motif Finding Problem using a Branch and Bound or a Greedy technique. Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.

Profiles Revisited
Let s = (s_1, ..., s_t) be the set of starting positions for l-mers in our t sequences.
The substrings corresponding to these starting positions will form:
- a t x l alignment matrix and
- a 4 x l profile matrix* P.
*We make a special note that the profile matrix will be defined in terms of the frequency of letters, and not as the count of letters.

Scoring Strings with a Profile
Prob(a|P) is defined as the probability that an l-mer a was created by the profile P.
If a is very similar to the consensus string of P, then Prob(a|P) will be high. If a is very different, then Prob(a|P) will be low.
$$\mathrm{Prob}(a \mid P) = \prod_{i=1}^{n} p_{a_i,\, i}$$
Here n is the length of a, a_i is its i-th letter, and p_{a_i, i} is the profile entry for letter a_i in column i.

Scoring Strings with a Profile (cont'd)
Given a profile:
P =
    A   1/2   7/8   3/8    0    1/8    0
    C   1/8    0    1/2   5/8   3/8    0
    T   1/8   1/8    0     0    1/4   7/8
    G   1/4    0    1/8   3/8   1/4   1/8
The probability of the consensus string:
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .0336
Probability of a different string:
Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .0002
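The two products above can be checked with a few lines of Python. This is only a sketch: the dictionary layout of the profile and the name prob are illustrative choices, and exact fractions are used so that rounding does not hide mistakes.

```python
from fractions import Fraction as F

# Profile P from the slide: one list of column frequencies per letter.
PROFILE = {
    "a": [F(1, 2), F(7, 8), F(3, 8), F(0), F(1, 8), F(0)],
    "c": [F(1, 8), F(0), F(1, 2), F(5, 8), F(3, 8), F(0)],
    "t": [F(1, 8), F(1, 8), F(0), F(0), F(1, 4), F(7, 8)],
    "g": [F(1, 4), F(0), F(1, 8), F(3, 8), F(1, 4), F(1, 8)],
}


def prob(lmer, profile):
    """Prob(a|P): product of the profile entries picked out by each letter."""
    p = F(1)
    for i, letter in enumerate(lmer):
        p *= profile[letter][i]
    return p


consensus = prob("aaacct", PROFILE)  # 2205/65536
other = prob("atacag", PROFILE)      # 15/65536
print(round(float(consensus), 4), round(float(other), 4))
```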

P-Most Probable l-mer
Define the P-most probable l-mer from a sequence as an l-mer in that sequence which has the highest probability of being created from the profile P.
P =
    A   1/2   7/8   3/8    0    1/8    0
    C   1/8    0    1/2   5/8   3/8    0
    T   1/8   1/8    0     0    1/4   7/8
    G   1/4    0    1/8   3/8   1/4   1/8
Given a sequence = ctataaaccttacatc, find the P-most probable l-mer.

P-Most Probable l-mer (cont'd)
Find the Prob(a|P) of every possible 6-mer, using the same profile P:
First try: [ctataa]accttacatc
Second try: c[tataaa]ccttacatc
Third try: ct[ataaac]cttacatc
Continue this process to evaluate every possible 6-mer.

P-Most Probable l-mer (cont'd)
Compute Prob(a|P) for every possible 6-mer:
Start  6-mer   Calculations                           Prob(a|P)
1      ctataa  1/8 x 1/8 x 3/8 x 0 x 1/8 x 0          0
2      tataaa  1/8 x 7/8 x 0 x 0 x 1/8 x 0            0
3      ataaac  1/2 x 1/8 x 3/8 x 0 x 1/8 x 0          0
4      taaacc  1/8 x 7/8 x 3/8 x 0 x 3/8 x 0          0
5      aaacct  1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8      .0336
6      aacctt  1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8      .0299
7      acctta  1/2 x 0 x 1/2 x 0 x 1/4 x 0            0
8      ccttac  1/8 x 0 x 0 x 0 x 1/8 x 0              0
9      cttaca  1/8 x 1/8 x 0 x 0 x 3/8 x 0            0
10     ttacat  1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8      .0004
The P-most probable 6-mer in the sequence is aaacct.

P-Most Probable l-mer (cont'd)
aaacct is the P-most probable 6-mer in ctataaaccttacatc because Prob(aaacct|P) = .0336 is greater than the Prob(a|P) of any other 6-mer in the sequence.
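The search for the P-most probable l-mer is a scan over every window of the sequence. A minimal sketch, with hypothetical names and the slide's profile stored as exact fractions; ties are resolved in favor of the leftmost window, which is an assumption of this sketch.

```python
from fractions import Fraction as F

PROFILE = {
    "a": [F(1, 2), F(7, 8), F(3, 8), F(0), F(1, 8), F(0)],
    "c": [F(1, 8), F(0), F(1, 2), F(5, 8), F(3, 8), F(0)],
    "t": [F(1, 8), F(1, 8), F(0), F(0), F(1, 4), F(7, 8)],
    "g": [F(1, 4), F(0), F(1, 8), F(3, 8), F(1, 4), F(1, 8)],
}


def prob(lmer, profile):
    """Prob(a|P) as a product of per-column frequencies."""
    p = F(1)
    for i, letter in enumerate(lmer):
        p *= profile[letter][i]
    return p


def most_probable_lmer(seq, profile, l):
    """Return (1-based start, l-mer, probability) of the best window."""
    start = max(
        range(len(seq) - l + 1),
        key=lambda i: prob(seq[i:i + l], profile),
    )
    lmer = seq[start:start + l]
    return start + 1, lmer, prob(lmer, profile)


start, lmer, p = most_probable_lmer("ctataaaccttacatc", PROFILE, 6)
print(start, lmer, round(float(p), 4))
```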

Dealing with Zeroes
In our toy example prob(a|P)=0 in many cases. In practice, there will be enough sequences so that the number of elements in the profile with a frequency of zero is small.
To avoid many entries with prob(a|P)=0, there exist techniques to equate zero to a very small number so that one zero does not make the entire probability of a string zero (we will not address these techniques here).
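One common technique of this kind is to add a pseudocount to every letter count before converting counts to frequencies. The slides do not say which technique they have in mind, so the sketch below is only an illustration; the function name and the default pseudocount of 1 are assumptions.

```python
from fractions import Fraction


def profile_with_pseudocounts(lmers, pseudocount=1):
    """Build a frequency profile in which every count is increased by
    pseudocount, so that no entry is exactly zero."""
    t = len(lmers)
    length = len(lmers[0])
    total = t + 4 * pseudocount  # each column gains one pseudocount per letter
    return {
        letter: [
            Fraction(sum(lmer[i] == letter for lmer in lmers) + pseudocount, total)
            for i in range(length)
        ]
        for letter in "acgt"
    }


profile = profile_with_pseudocounts(["aa", "ac"])
```

With the two l-mers aa and ac, the letter g never occurs, yet its frequency in both columns becomes 1/6 rather than 0, and every column still sums to 1.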

P-Most Probable l-mers in Many Sequences
Find the P-most probable l-mer in each of the sequences.
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct
P =
    A   1/2   7/8   3/8    0    1/8    0
    C   1/8    0    1/2   5/8   3/8    0
    T   1/8   1/8    0     0    1/4   7/8
    G   1/4    0    1/8   3/8   1/4   1/8

P-Most Probable l-mers in Many Sequences (cont'd)
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
The P-most probable l-mers form a new profile:
1 aaacgt
2 atagcg
3 aaccct
4 gaacct
5 atagct
6 gacctg
7 atcctt
8 tacctt
    A   5/8   5/8   4/8    0     0     0
    C    0     0    4/8   6/8   4/8    0
    T   1/8   3/8    0     0    3/8   6/8
    G   2/8    0     0    2/8   1/8   2/8

Comparing New and Old Profiles
(On the original slide, red marks a frequency that increased and blue a frequency that decreased.)
1 aaacgt
2 atagcg
3 aaccct
4 gaacct
5 atagct
6 gacctg
7 atcctt
8 tacctt
New profile:
    A   5/8   5/8   4/8    0     0     0
    C    0     0    4/8   6/8   4/8    0
    T   1/8   3/8    0     0    3/8   6/8
    G   2/8    0     0    2/8   1/8   2/8
Old profile:
    A   1/2   7/8   3/8    0    1/8    0
    C   1/8    0    1/2   5/8   3/8    0
    T   1/8   1/8    0     0    1/4   7/8
    G   1/4    0    1/8   3/8   1/4   1/8

Greedy Profile Motif Search
Use P-most probable l-mers to adjust start positions until we reach a “best” profile; this is the motif.
1) Select random starting positions.
2) Create a profile P from the substrings at these starting positions.
3) Find the P-most probable l-mer a in each sequence and change the starting position to the starting position of a.
4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.
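The four steps can be sketched as below. This is a hedged illustration rather than the slides' own code: the helper names are hypothetical, the score is assumed to be the consensus score of the chosen l-mers, and the loop stops as soon as one round fails to increase it.

```python
import random


def build_profile(lmers):
    """Column-wise letter frequencies of the chosen l-mers."""
    t = len(lmers)
    return {
        c: [sum(m[i] == c for m in lmers) / t for i in range(len(lmers[0]))]
        for c in "ACGT"
    }


def prob(lmer, profile):
    """Prob(a|P) as a product of per-column frequencies."""
    p = 1.0
    for i, c in enumerate(lmer):
        p *= profile[c][i]
    return p


def most_probable_lmer(seq, profile, l):
    """Leftmost window of seq with the highest Prob(a|P)."""
    windows = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    return max(windows, key=lambda w: prob(w, profile))


def score(lmers):
    """Consensus score: most frequent letter count, summed over columns."""
    return sum(max(col.count(c) for c in "ACGT") for col in zip(*lmers))


def greedy_profile_motif_search(dna, l, rng=random):
    # 1) random starting positions
    lmers = []
    for seq in dna:
        start = rng.randrange(len(seq) - l + 1)
        lmers.append(seq[start:start + l])
    best = score(lmers)
    while True:
        profile = build_profile(lmers)                       # 2)
        candidate = [most_probable_lmer(s, profile, l) for s in dna]  # 3)
        if score(candidate) <= best:                         # 4) no improvement
            return lmers, best
        lmers, best = candidate, score(candidate)
```

Because the score must strictly increase on every round and cannot exceed t x l, the loop always terminates. Passing a seeded random.Random instance as rng makes a run reproducible.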

GreedyProfileMotifSearch Analysis
Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.
It is unlikely that the random starting positions will lead us to the correct solution at all.
In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.

Gibbs Sampling
GreedyProfileMotifSearch is probably not the best way to find motifs.
However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one.
Gibbs Sampling proceeds more slowly and chooses new l-mers at random, increasing the odds that it will converge to the correct solution.

How Gibbs Sampling Works
1) Randomly choose starting positions s = (s_1, ..., s_t) and form the set of l-mers associated with these starting positions.
2) Randomly choose one of the t sequences.
3) Create a profile P from the other t-1 sequences.
4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.
5) Choose a new starting position for the removed sequence at random based on the probabilities calculated in step 4.
6) Repeat steps 2-5 until there is no improvement.
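A minimal Python sketch of steps 1 to 6. Several details are assumptions of the sketch, not statements from the slides: the loop runs for a fixed number of iterations instead of testing for improvement, the current position is kept when every window has probability zero, and all function names are hypothetical.

```python
import random


def build_profile(lmers):
    """Column-wise letter frequencies of the given l-mers."""
    t = len(lmers)
    return {
        c: [sum(m[i] == c for m in lmers) / t for i in range(len(lmers[0]))]
        for c in "ACGT"
    }


def prob(lmer, profile):
    """Prob(a|P) as a product of per-column frequencies."""
    p = 1.0
    for i, c in enumerate(lmer):
        p *= profile[c][i]
    return p


def gibbs_sampler(dna, l, iterations, rng=random):
    """Return 0-based starting positions after the given number of steps."""
    t = len(dna)
    # 1) random starting positions
    starts = [rng.randrange(len(seq) - l + 1) for seq in dna]
    for _ in range(iterations):
        i = rng.randrange(t)                                  # 2)
        others = [dna[j][starts[j]:starts[j] + l]
                  for j in range(t) if j != i]
        profile = build_profile(others)                       # 3)
        seq = dna[i]
        weights = [prob(seq[p:p + l], profile)                # 4)
                   for p in range(len(seq) - l + 1)]
        if sum(weights) > 0:                                  # 5)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        # If every weight is zero, keep the current position (assumption).
    return starts
```

random.choices normalizes the weights itself, so the probabilities from step 4 can be passed in directly without dividing by their sum.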

Gibbs Sampling: an Example
Input: t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
1) Randomly choose starting positions, s = (s_1, s_2, s_3, s_4, s_5), in the 5 sequences:
s_1 = 7   GTAAACAATATTTATAGC
s_2 = 11  AAAATTTACCTTAGAAGG
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 4   TGAGTAAACGACGTCCCA
s_5 = 1   TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
2) Choose one of the sequences at random:
Sequence 2: AAAATTTACCTTAGAAGG
The remaining sequences and starting positions:
s_1 = 7   GTAAACAATATTTATAGC
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 4   TGAGTAAACGACGTCCCA
s_5 = 1   TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
3) Create profile P from l-mers in the remaining 4 sequences:
1 AATATTTA
3 TCAAGCGT
4 GTAAACGA
5 TACTTAAC
    A   1/4   2/4   2/4   3/4   1/4   1/4   1/4   2/4
    C    0    1/4   1/4    0     0    2/4    0    1/4
    T   2/4   1/4   1/4   1/4   2/4   1/4   1/4   1/4
    G   1/4    0     0     0    1/4    0    2/4    0
Consensus string: TAAATCGA

Gibbs Sampling: an Example
4) Calculate prob(a|P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG.
(Slide table: each 8-mer highlighted in red with its prob(a|P); most entries are 0.)

Gibbs Sampling: an Example
5) Create a distribution of probabilities of l-mers prob(a|P), and randomly select a new starting position based on this distribution.
a) To create this distribution, divide each probability prob(a|P) by the lowest probability:
Starting Position 1: prob(AAAATTTA | P) / lowest probability = 6
Starting Position 2: prob(AAATTTAC | P) / lowest probability = 1
Starting Position 8: prob(ACCTTAGA | P) / lowest probability = 1.5
Ratio = 6 : 1 : 1.5

Turning Ratios into Probabilities
b) Define probabilities of starting positions according to computed ratios:
Probability (Selecting Starting Position 1): 6/(6+1+1.5) = 0.706
Probability (Selecting Starting Position 2): 1/(6+1+1.5) = 0.118
Probability (Selecting Starting Position 8): 1.5/(6+1+1.5) = 0.176

Gibbs Sampling: an Example
c) Select the start position according to computed ratios:
P(selecting starting position 1): .706
P(selecting starting position 2): .118
P(selecting starting position 8): .176
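Steps a) to c) amount to normalizing the ratios and drawing one position from the resulting distribution. A small sketch using the ratios 6 : 1 : 1.5 from the example; the variable names and the fixed random seed are illustrative only.

```python
import random

# Ratios for starting positions 1, 2 and 8 from the example.
ratios = {1: 6.0, 2: 1.0, 8: 1.5}

# b) Turn the ratios into probabilities by dividing by their sum.
total = sum(ratios.values())
probabilities = {pos: r / total for pos, r in ratios.items()}

# c) Draw a starting position according to these probabilities.
rng = random.Random(0)  # seeded only to make the sketch reproducible
positions = list(probabilities)
new_start = rng.choices(positions, weights=[probabilities[p] for p in positions])[0]

print({pos: round(p, 3) for pos, p in probabilities.items()}, new_start)
```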

Gibbs Sampling: an Example
Assume we select the substring with the highest probability. Then we are left with the following new substrings and starting positions:
s_1 = 7   GTAAACAATATTTATAGC
s_2 = 1   AAAATTTACCTCGCAAGG
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 5   TGAGTAATCGACGTCCCA
s_5 = 1   TACTTCACACCCTGTCAA

Gibbs Sampling: an Example
6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.

Gibbs Sampler in Practice
Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach).
Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.
It needs to be run with many randomly chosen seeds to achieve good results.

Another Randomized Approach
The Random Projection Algorithm is a different way to solve the Motif Finding Problem.
Guiding principle: Some instances of a motif agree on a subset of positions. However, it is unclear how to find these “non-mutated” positions.
To bypass the effect of mutations within a motif, we randomly select a subset of positions in the pattern, creating a projection of the pattern.
Search for that projection in the hope that the selected positions are not affected by mutations in most instances of the motif.

Projections
Choose k positions in a string of length l.
Concatenate nucleotides at the chosen k positions to form a k-tuple.
This can be viewed as a projection of l-dimensional space onto a k-dimensional subspace.
Example: l = 15, k = 7, Projection = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC -> TGCTGAT
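The projection in the example can be reproduced with a one-line function. The name project is an illustrative choice, and positions are taken to be 1-based as on the slide.

```python
def project(lmer, positions):
    """Concatenate the letters of lmer found at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)


k_tuple = project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13))
print(k_tuple)
```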

Random Projections Algorithm
Select k out of l positions uniformly at random.
For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
Recover the motif from enriched buckets that contain many l-tuples.
Example: in the input sequence ...TCAATGCACCTAT..., the l-tuple TGCACCT is hashed into bucket TGCT.

Random Projections Algorithm (cont'd)
Some projections will fail to detect motifs, but if we try many of them, the probability that one of the buckets fills in increases.
In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good”.
(7,2) motif: ATGCGTC
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...

Example
l = 7 (motif size), k = 4 (projection size)
Choose projection (1,2,5,7)
Input: ...TAGACATCCGACTTGCCTTACTAC...
Buckets:
ATGC: ATCCGAC
GCTC: GCCTTAC

Hashing and Buckets
Hash function h(x) is obtained from the k positions of the projection.
Buckets are labeled by values of h(x).
Enriched buckets: contain more than s l-tuples, for some parameter s.
Example bucket labels: ATTC, CATC, GCTC, ATGC
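Hashing every l-tuple into a bucket keyed by its projection, and then keeping the enriched buckets, can be sketched as follows. The function names are hypothetical, and the four input strings are the uppercase versions of the fragments shown in the (7,2) motif example.

```python
from collections import defaultdict


def project(lmer, positions):
    """h(x): letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)


def hash_into_buckets(sequences, l, positions):
    """Map each bucket label h(x) to the list of l-tuples hashed into it."""
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            buckets[project(lmer, positions)].append(lmer)
    return buckets


def enriched_buckets(buckets, s):
    """Keep only the buckets holding more than s l-tuples."""
    return {label: lmers for label, lmers in buckets.items() if len(lmers) > s}


sequences = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
buckets = hash_into_buckets(sequences, l=7, positions=(1, 2, 5, 7))
enriched = enriched_buckets(buckets, s=3)
print(enriched)
```

With projection (1,2,5,7), all four motif instances land in bucket ATGC, while every other window of these fragments ends up alone in its own bucket.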

Motif Refinement
How do we recover the motif from the sequences in the enriched buckets?
k nucleotides are from the hash value of the bucket.
Use the information in the other l-k positions as a starting point for a local refinement scheme, e.g. a Gibbs sampler.
Example: bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGCGAC) -> local refinement algorithm -> candidate motif ATGCGAC

Synergy between Random Projection and Gibbs Sampler
Random Projection is a procedure for finding good starting points: every enriched bucket is a potential starting point.
Feeding these starting points into existing algorithms (like the Gibbs sampler) provides good local search in the vicinity of every starting point.
These algorithms work particularly well for “good” starting points.

Building Profiles from Buckets
Bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) -> profile P (rows A, C, G, T) -> Gibbs sampler -> refined profile P*

A Graph-Based Method for Motif Finding Graphs com from

Problems with Current Representations of DNA Motifs All current methods for representing DNA motifs involve either consensus sequences or probabilistic models (such as PSSMs) of the motif. Consensus sequences do not adequately represent the variability seen in promoters or transcription factor binding sites. Both consensus sequences and PSSM models assume positional independence. Neither method can accommodate correlations between positions.

Yeast Motifs
We analyzed yeast motifs for pairwise dependencies.
- We used a chi-square statistic to find whether two positions were correlated or not.
We found that 25% of motifs have significantly correlated positions.
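As an illustration of such a test, the Pearson chi-square statistic for two columns of a set of aligned sites can be computed as below. This is a generic sketch, not the authors' analysis: the function name is hypothetical, positions are 0-based, and only letters that actually occur in a column are included in the contingency table.

```python
from collections import Counter


def chi_square(sites, i, j):
    """Pearson chi-square statistic for independence of columns i and j."""
    n = len(sites)
    pair_counts = Counter((s[i], s[j]) for s in sites)
    row_counts = Counter(s[i] for s in sites)
    col_counts = Counter(s[j] for s in sites)
    statistic = 0.0
    for a in row_counts:
        for b in col_counts:
            expected = row_counts[a] * col_counts[b] / n
            observed = pair_counts.get((a, b), 0)
            statistic += (observed - expected) ** 2 / expected
    return statistic


independent = chi_square(["AA", "AC", "CA", "CC"], 0, 1)
dependent = chi_square(["AA", "AA", "CC", "CC"], 0, 1)
print(independent, dependent)
```

The statistic is then compared with a chi-square distribution with (r-1)(c-1) degrees of freedom, where r and c are the numbers of distinct letters seen in the two columns.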

Graph Representation

A Graph-Based Model of a Motif
The main idea is to formulate motif finding as the problem of finding the Maximum Density Subgraph.
Given a set of DNA sequences, we construct a graph G=(V,E):
- V: the set of the k-mer occurrences in the input
- E: the set of undirected edges, representing nucleotide similarities between those k-mers

Maximum Density Subgraph

Nucleotide Dependencies and MDS

Things to be Done
Graph construction
- Edge weight assignment
- Background normalization
Finding maximum density subgraph(s)
- Choice of density function
- Max-Flow/Min-Cut
- Neighborhood subgraph
- …

The Modeling of Regulatory Regions

More than a hundred methods have been proposed for motif discovery.
- They vary widely in both their algorithmic approaches and their underlying models of regulatory regions.

Key Points
- How the multiple binding sites for modules of regulatory elements can be modeled
- How additional data sources may be integrated into such models

An Integrated Framework to Model Regulatory Regions
A single motif, denoted by m_g, consists of two parts:
- m*: how well the sequence matches a consensus
- o_g: a prior on whether any regulatory element is to occur at that position
A set of single motifs, together with inter-motif distance restrictions (d), then forms a composite motif (c_g).
Multiple occurrences of a composite motif in the regulatory regions of a gene are represented by a gene score G_c.

A Schematic View of the Integrated Framework

Single Motif Models (Level 1)
Transcription factor binding sites: this is the most basic element of the regulatory system, and can be modeled using single motif models.
A single motif model is defined as a function m_g that maps a sequence position p to a real-numbered motif score m_g(p).
- It consists of a match score m*(p) and an occurrence prior o_g(p).

The Match Model m*
Probabilistic match model
- Position weight matrix (PWM), a.k.a. PSSM
- Incorporating dependencies within motifs: n-th order Markov chains, Bayesian networks
Deterministic match model
- oligos
- regular expressions
- mismatch expressions

Occurrence priors o_g
- Spatial distribution of binding sites
- Conservation in orthologous sequences
- DNA structure
- Nucleotide distribution

Composite Motif Models (Level 2)
Clusters of binding sites for cooperating TFs, often called modules, are believed to be essential building blocks of the regulatory machinery.
A composite motif model is defined as a function c_g that maps a set of single motif sequence positions p to a real-numbered composite motif score c_g(p).

A Composite Motif Score c_g
Given a set of positions, the score of a composite motif will typically be the sum or product of
- the individual single motif scores, and
- the distance scores

Distance Functions
Constraints
- fixed distances
- distances below thresholds
- distances within intervals
- all single motifs to be within a window of a certain length
Score functions
- uniform
- distribution: geometric, Poisson, etc.

Combining Single Motifs
For methods using deterministic match models and constraints on distances
- intersection of component scores
- M out of N single motif scores
For methods that use non-binary single motif scores
- the sum/product of single motif and distance scores
Specialized models
- hidden Markov model (HMM)
- self-organizing map (SOM)
- artificial neural network (ANN)

Gene Level Models (Level 3)
A gene score model is defined as a function G_c that maps a gene index g to a real-numbered gene score G_c(g).
The gene level score is calculated from composite motif scores, c_g, across the regulatory region of gene g, and is referred to as the gene score.

Multiple Binding Sites/Modules
One motif in the regulatory region
- The gene level score is often defined simply as the maximum motif score in the regulatory region of a gene.
Multiple binding sites/modules
- sum of log-scores of motifs
- logistic regression
- ANN
- HMM
- …

Overview of Methods