Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.

Slides:



Advertisements
Similar presentations
Algorithm Analysis Input size Time I1 T1 I2 T2 …
Advertisements

Chapter 5: Tree Constructions
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Analysis of Algorithms
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
Greedy Algorithms CS 466 Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of.
Greedy Algorithms Amihood Amir Bar-Ilan University.
Greedy Algorithms CS 6030 by Savitha Parur Venkitachalam.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Finding Regulatory Motifs in DNA Sequences
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20.
Finding Regulatory Motifs in DNA Sequences An Introduction to Bioinformatics Algorithms (Jones and Pevzner)
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
1 Chapter Elementary Notions and Notations.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Phylogenetics II.
Hash Algorithm and SSAHA Implementations Zemin Ning Production Software Group Informatics.
Outline More exhaustive search algorithms Today: Motif finding
Markov Chain Monte Carlo and Gibbs Sampling Vasileios Hatzivassiloglou University of Texas at Dallas.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
Greedy Algorithms CS 498 SS Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
MS Sequence Clustering
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.
Finding Regulatory Motifs in DNA Sequences
A Hardware Accelerator for the Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space Author: Azzedine Boukerche, Jan M. Correa, Alba.
Motif Finding [1]: Ch , , 5.5,
1 Alignment Matrix vs. Distance Matrix Sequence a gene of length m nucleotides in n species to generate an… n x m alignment matrix n x n distance matrix.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
HW4: sites that look like transcription start sites Nucleotide histogram Background frequency Count matrix for translation start sites (-10 to 10) Frequency.
1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.
1 Repeats!. 2 Introduction  A repeat family is a collection of repeats which appear multiple times in a genome.  Our objective is to identify all families.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
CSCI2950-C Genomes, Networks, and Cancer
Gibbs sampling.
Character-Based Phylogeny Reconstruction
CSE 5290: Algorithms for Bioinformatics Fall 2011
Probabilities and Probabilistic Models
Finding Regulatory Motifs in DNA Sequences
CS 581 Tandy Warnow.
CSE 589 Applied Algorithms Spring 1999
(Regulatory-) Motif Finding
3. Brute Force Selection sort Brute-Force string matching
Topic 5: Heap data structure heap sort Priority queue
CSE 5290: Algorithms for Bioinformatics Fall 2009
3. Brute Force Selection sort Brute-Force string matching
Phylogeny.
Dynamic Programming II DP over Intervals
Clustering.
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Finding Regulatory Motifs in DNA Sequences

Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC

Transcription Factors and Motifs

The Motif Finding Problem (cont’d) Additional information: The hidden sequence is of length 8 The pattern is not exactly the same in each array because random point mutations may occur in the sequences

The Motif Finding Problem (cont’d) The patterns revealed with no mutations: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc acgtacgt Consensus String

The Motif Finding Problem (cont’d) The patterns with 2 point mutations: cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc Can we still find the motif, now that we have 2 mutations?

Defining Motifs To define a motif, lets say we know where the motif starts in the sequence The motif start positions in their sequences can be represented as s = (s 1,s 2,s 3,…,s t )

Defining Motifs cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc l = 8 t=5 s 1 = 26 s 2 = 21 s 3 = 3 s 4 = 56 s 5 = 60 s DNA n = 69

Motifs: Profiles and Consensus a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A Profile C G T _________________ Consensus A C G T A C G T Line up the patterns by their start indexes s = (s 1, s 2, …, s t ) Construct matrix profile with frequencies of each nucleotide in columns Consensus nucleotide in each position has the highest score in column

Evaluating Motifs We have a guess about the consensus sequence, but how “good” is this consensus? Need to introduce a scoring function to compare different guesses and choose the “best” one.

Defining Some Terms t - number of sample DNA sequences n - length of each DNA sequence DNA - sample of DNA sequences (t x n array) l - length of the motif ( l -mer) s i - starting position of an l -mer in sequence i s=(s 1, s 2,… s t ) - array of motif’s starting positions

Scoring Motifs Given s = (s 1, … s t ) and DNA: Score(s,DNA) = a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A C G T _________________ Consensus a c g t a c g t Score = 30 l t

The Motif Finding Problem If starting positions s=(s 1, s 2,… s t ) are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus) But… the starting positions s are usually not given. How can we find the “best” profile matrix?

The Motif Finding Problem: Formulation Goal: Given a set of DNA sequences, find a set of l - mers, one from each sequence, that maximizes the consensus score Input: A t x n matrix of DNA, and l, the length of the pattern to find Output: An array of t starting positions s = (s 1, s 2, … s t ) maximizing Score(s,DNA)

The Motif Finding Problem: Brute Force Solution Compute the scores for each possible combination of starting positions s The best score will determine the best profile and the consensus pattern in DNA The goal is to maximize Score(s,DNA) by varying the starting positions s i, where: s i = [1, …, n- l +1] i = [1, …, t]

BruteForceMotifSearch 1.BruteForceMotifSearch(DNA, t, n, l ) 2.bestScore  0 3.for each s=(s 1,s 2,..., s t ) from (1,1... 1) to (n- l +1,..., n- l +1) 4.if (Score(s,DNA) > bestScore) 5.bestScore  score(s, DNA) 6.bestMotif  (s 1,s 2,..., s t ) 7.return bestMotif

Running Time of BruteForceMotifSearch Varying (n - l + 1) positions in each of t sequences, we’re looking at (n - l + 1) t sets of starting positions For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1) t = O( l n t ) That means that for t = 8, n = 1000, l = 10 we must perform approximately computations – it will take billions years

The Median String Problem Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of mutations This pattern will be the motif

Hamming Distance Hamming distance: d H (v,w) is the number of nucleotide pairs that do not match when v and w are aligned. For example: d H (AAAAAA,ACAAAC) = 2

Total Distance: Example Given v = “ acgtacgt ” and s acgtacgt cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat acgtacgt agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc acgtacgt aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt acgtacgt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca acgtacgt ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc v is the sequence in red, x is the sequence in blue TotalDistance(v,DNA) = = 4 d H (v, x) = 2 d H (v, x) = 1 d H (v, x) = 0 d H (v, x) = 1

Total Distance: Definition For each DNA sequence i, compute all d H (v, x), where x is an l -mer with starting position s i (1 < s i < n – l + 1) Find minimum of d H (v, x) among all l -mers in sequence i TotalDistance(v,DNA) is the sum of the minimum Hamming distances for each DNA sequence i TotalDistance(v,DNA) = min s d H (v, s), where s is the set of starting positions s 1, s 2,… s t

The Median String Problem: Formulation Goal: Given a set of DNA sequences, find a median string Input: A t x n matrix DNA, and l, the length of the pattern to find Output: A string v of l nucleotides that minimizes TotalDistance(v,DNA) over all strings of that length

Median String Search Algorithm 1.MedianStringSearch (DNA, t, n, l ) 2.bestWord  AAA…A 3.bestDistance  ∞ 4. for each l -mer s from AAA…A to TTT…T if TotalDistance(s,DNA) < bestDistance 5. bestDistance  TotalDistance(s,DNA) 6. bestWord  s 7. return bestWord

Motif Finding Problem == Median String Problem The Motif Finding is a maximization problem while Median String is a minimization problem However, the Motif Finding problem and Median String problem are computationally equivalent Need to show that minimizing TotalDistance is equivalent to maximizing Score

We are looking for the same thing a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A Profile C G T _________________ Consensus a c g t a c g t Score TotalDistance Sum At any column i Score i + TotalDistance i = t Because there are l columns Score + TotalDistance = l * t Rearranging: Score = l * t - TotalDistance l * t is constant the minimization of the right side is equivalent to the maximization of the left side l t

Motif Finding Problem vs. Median String Problem Why bother reformulating the Motif Finding problem into the Median String problem? The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1) t combinations!!! The Median String Problem needs to examine all 4 l combinations for v. This number is relatively smaller

Motif Finding: Improving the Running Time Recall the BruteForceMotifSearch: 1. BruteForceMotifSearch(DNA, t, n, l ) 2.bestScore  0 3.for each s=(s 1,s 2,..., s t ) from (1,1... 1) to (n- l +1,..., n- l +1) 4.if (Score(s,DNA) > bestScore) 5.bestScore  Score(s, DNA) 6.bestMotif  (s 1,s 2,..., s t ) 7.return bestMotif

Structuring the Search How can we perform the line for each s=(s 1,s 2,..., s t ) from (1,1... 1) to (n- l +1,..., n- l +1) ? We need a method for efficiently structuring and navigating the many possible motifs This is not very different than exploring all t- digit numbers

Median String: Improving the Running Time 1.MedianStringSearch (DNA, t, n, l ) 2.bestWord  AAA…A 3.bestDistance  ∞ 4. for each l -mer s from AAA…A to TTT…T if TotalDistance(s,DNA) < bestDistance 5. bestDistance  TotalDistance(s,DNA) 6. bestWord  s 7. return bestWord

Structuring the Search For the Median String Problem we need to consider all 4 l possible l -mers: aa… aa aa… ac aa… ag aa… at. tt… tt How to organize this search? l

Alternative Representation of the Search Space Let A = 1, C = 2, G = 3, T = 4 Then the sequences from AA…A to TT…T become: 11…11 11…12 11…13 11…14. 44…44 Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit l

Search Tree a- c- g- t- aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt -- root

Analyzing Search Trees Characteristics of the search trees: The sequences are contained in its leaves The parent of a node is the prefix of its children How can we move through the tree?

Moving through the Search Trees Four common moves in a search tree that we are about to explore: Move to the next leaf Visit all the leaves Visit the next node Bypass the children of a node

Visit the Next Leaf 1.NextLeaf( a,L, k ) // a : the array of digits 2.for i  L to 1 // L: length of the array 3. if a i < k // k : max digit value 4. a i  a i return a 6. a i  1 7.return a Given a current leaf a, we need to compute the “next” leaf:

NextLeaf (cont’d) The algorithm is common addition in radix k: Increment the least significant digit “Carry the one” to the next digit position when the digit is at maximal value

NextLeaf: Example Moving to the next leaf: Current Location

NextLeaf: Example (cont’d) Moving to the next leaf: Next Location

Visit All Leaves Printing all permutations in ascending order: 1.AllLeaves(L,k) // L: length of the sequence 2.a  (1,...,1) // k : max digit value 3. while forever // a : array of digits 4. output a 5. a  NextLeaf(a,L,k) 6. if a = (1,...,1) 7. return

Visit All Leaves: Example Moving through all the leaves in order: Order of steps

Depth First Search So we can search leaves How about searching all vertices of the tree? We can do this with a depth first search

Visit the Next Vertex 1.NextVertex(a,i,L,k) // a : the array of digits 2. if i < L // i : prefix length 3. a i+1  1 // L: max length 4. return ( a,i+1) // k : max digit value 5. else 6. for j  l to 1 7. if a j < k 8. a j  a j return( a,j ) 10. return(a,0)

Example Moving to the next vertex: Current Location

Example Moving to the next vertices: Location after 5 next vertex moves

Bypass Move Given a prefix (internal vertex), find next vertex after skipping all its children 1.Bypass(a,i,L,k) // a: array of digits 2.for j  i to 1 // i : prefix length 3. if a j < k // L: maximum length 4. a j  a j +1 // k : max digit value 5. return(a,j) 6. return(a,0)

Bypass Move: Example Bypassing the descendants of “2-”: Current Location

Example Bypassing the descendants of “2-”: Next Location