Published by Mitchell Harrison. Modified over 9 years ago.
1
JM - http://folding.chmcc.org
Introduction to Bioinformatics: Lecture IV
Sequence Similarity and Dynamic Programming
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
2
Outline of the lecture
Wrapping up the previous lecture: a quick look at the NCBI Map Viewer and suffix trees by way of an example
Inexact string matching: from generalizations of suffix trees to dynamic programming
The dynamic programming algorithm for sequence alignment: how it works
The dynamic programming algorithm for sequence alignment: why it works
Limitations and faster heuristic approaches
3
Web watch: NCBI Map Viewer
With the knowledge about STSs and physical maps (hopefully) acquired last week, we can have another look at the NCBI Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview
http://www.ncbi.nlm.nih.gov/genome/guide/human/
4
Computationally efficient and elegant solutions for the exact string matching problem:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Christian Charras and Thierry Lecroq
5
The idea of the suffix tree method
Phase I: Preprocessing of the “text”
A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T=cabca:

root
  ca
    bca$   leaf 1
    $      leaf 4
  a
    bca$   leaf 2
    $      leaf 5
  bca$     leaf 3

For simplicity one leaf, due to the terminal character $, is not included.
Problem What is the reason for adding the terminal character?
6
Suffix tree based matching: why does it work?
Phase II: Search
A substring of a string is a prefix of a suffix in that string. For example, a substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P occurs in T there is a leaf in the suffix tree that has a label starting with P.
[Figure: suffix tree for T=cabca with leaves 1 to 5]
Problem Does the size of the alphabet matter (and if so, how)? Hint: how many edges may originate in a node, given that the label of each edge out of a node has to start with a different character?
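The two phases can be illustrated with a minimal sketch. It assumes an uncompressed suffix trie built from nested dictionaries rather than a true (compressed, linear-time) suffix tree, so construction takes quadratic time and space. The function names `build_suffix_trie` and `find_occurrences` are hypothetical, chosen only for this illustration.

```python
def build_suffix_trie(text):
    """Phase I: insert every suffix of text + '$' into a trie of nested dicts.

    This is an uncompressed trie (an assumption made for brevity), not a
    compressed suffix tree. Each leaf stores the 1-based start position of
    its suffix under the key "leaf".
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Phase II: follow pattern from the root, then collect leaves below."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a prefix of any suffix
        node = node[ch]
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("cabca")
print(find_occurrences(trie, "ab"))  # start positions of P=ab in T=cabca
print(find_occurrences(trie, "ca"))
```

Every occurrence of P corresponds to one leaf below the node reached by spelling out P, which is the "prefix of a suffix" argument from the slide.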
7
Generalized suffix tree for a set of strings and the longest common substring problem
Consider for example two strings: T=cabca and U=bbcb.
[Figure: generalized suffix tree for T and U, with leaves labeled T1 to T5 and U1 to U4]
Remark By building the generalized suffix tree for a set of k strings of the total length m one can find the longest prefix-suffix match for all pairs of strings in $O(m + k^2)$ time (an additional trick is required for that).
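For small inputs the longest common substring can be checked without a suffix tree. The sketch below swaps in a plain quadratic-time table, which is not the generalized suffix tree method of the slide; it only illustrates what the problem asks for. The function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(s, t):
    """Return one longest common substring of s and t.

    Uses an O(len(s) * len(t)) table instead of a generalized suffix tree:
    curr[j] holds the length of the longest common suffix of s[:i] and t[:j].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]


print(longest_common_substring("cabca", "bbcb"))
```

For the strings on the slide, T=cabca and U=bbcb, the only common substring of length 2 is bc.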
8
Assembling DNA from fragments and the suffix-prefix matching problem
Hierarchical sequencing: physical maps, clone libraries and shotgun (see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)
Definition The algorithmic problem of shotgun sequence assembly is to deduce the sequence of the DNA string from a set of sequenced and partially overlapping short substrings derived from that string.
Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines whether two substrings would be assembled together.
In general, the shortest superstring problem (find the shortest string that contains each string from a given set of strings as its substring) is NP-hard, and heuristics are being developed to address the problem.
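A suffix-prefix match between two fragments can be sketched as follows. This is a naive quadratic check for a single pair, not the generalized suffix tree approach, and it assumes error-free reads. The names `suffix_prefix_overlap` and `merge_fragments`, the `min_len` parameter, and the example fragments are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


def merge_fragments(a, b, min_len=1):
    """Assemble a and b into one string using their longest overlap."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:]


print(suffix_prefix_overlap("ACGTTG", "TTGCA"))
print(merge_fragments("ACGTTG", "TTGCA"))
```

Greedy repeated merging of the best-overlapping pair is one common heuristic for the shortest superstring problem, but it is not guaranteed to find the shortest superstring.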
9
Inexact or approximate string matching
Two major reasons for the importance of approximate matching in computational molecular biology are:
i) Measurement (e.g. sequencing) errors and the fuzzy nature of underlying molecular processes (e.g. hybridization may occur despite some mismatches)
ii) Redundancy in biology, with evolutionary processes resulting in closely related yet different sequences that require approximate matching in order to detect their relatedness and identify variable as well as conserved features that may reveal fingerprints of structure and function
Either generalizations of exact string matching methods, such as suffix trees, or dynamic programming (or their heuristic combinations) are used to solve this problem.
10
Redundancy in biological systems
An example: two globin-like sequences:

--MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE
M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE

DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI--AIIHVLHSRH
DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI    I VL S+H
DLKKHGATVLTALGGILKKKGHHEAE-KPLAQSHATKHKIPVKYLEFISEC-IQVLQSKH

PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
PG+FGADAQGAMNKALELFRKD+A+ YKE
PGDFGADAQGAMNKALELFRKDMASNYKE-----

Note that there are two types of mismatches:
i) Due to point mutations
ii) Due to insertions and deletions (gaps)
11
Gap penalties: evolutionary and computational considerations
Linear gap penalties: $\gamma(g) = -g d$ for a gap of length $g$ and constant $d$
Affine gap penalties: $\gamma(g) = -[d + (g - 1) e]$, where $d$ is the gap opening penalty and $e$ is the gap extension penalty.
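The two penalty schemes can be compared directly in code. This is a minimal sketch; the function names and the numerical values d=8 and e=2 are illustrative assumptions, not values taken from the lecture.

```python
def linear_gap_penalty(g, d):
    """Linear scheme: every gap position costs d, so a gap of length g scores -g*d."""
    return -g * d


def affine_gap_penalty(g, d, e):
    """Affine scheme for g >= 1: opening costs d, each further position costs e."""
    return -(d + (g - 1) * e)


# Illustrative values only (assumed): d = 8, e = 2.
for g in (1, 2, 5):
    print(g, linear_gap_penalty(g, 8), affine_gap_penalty(g, 8, 2))
```

With e < d the affine scheme penalizes one long gap less than several short ones of the same total length; with e = d it reduces to the linear scheme.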
12
Dynamic programming algorithm for string alignment
Our goal is to find an optimal matching for two strings $S_1 = a_1 a_2 \ldots a_n$ and $S_2 = b_1 b_2 \ldots b_m$ over a certain alphabet $\Sigma$, given a scoring matrix $s(a,b)$ for each $a$ and $b$ in $\Sigma$ and (for simplicity) a linear gap penalty.
Relation to the minimal edit distance problem (the number of insertions, deletions and substitutions required to transform one string into the other)
The similarity measure (scoring matrix) should represent biological relatedness and separate true matches from random alignments (find more in Chapter 2 of “Biological Sequence Analysis” by Durbin et al.)
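A minimal sketch of global alignment by dynamic programming with a linear gap penalty is given below, in the style commonly known as the Needleman-Wunsch algorithm. The names `global_alignment` and `simple_score`, the match/mismatch scores of +1/-1 and the gap penalty d=1 are assumptions made for illustration; a real application would use a substitution matrix for $s(a,b)$.

```python
def simple_score(a, b):
    """Toy scoring function s(a, b), assumed here: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def global_alignment(s1, s2, score, d):
    """Optimal global alignment of s1 and s2 with linear gap penalty d.

    F[i][j] is the best score for aligning s1[:i] with s2[:j].
    Returns (optimal score, aligned s1, aligned s2).
    """
    n, m = len(s1), len(s2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # align a_i with b_j
                F[i - 1][j] - d,                                # a_i against a gap
                F[i][j - 1] - d,                                # b_j against a gap
            )
    # Traceback from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            top.append(s1[i - 1])
            bottom.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(s1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = global_alignment("ACGT", "AGT", simple_score, 1)
print(best)
print(row1)
print(row2)
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the running time is O(nm), in contrast to the exponential number of alignments.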
13
How many alignments are there?
All the possible alignments (with gaps) may be represented in the form of a DP graph (DP table). Consider an example with two strings of length 2, $a_1 a_2$ and $b_1 b_2$.
Node labels in the DP table (number of paths from the origin to each node):

         a1   a2
     0    1    1
b1   1    3    5
b2   1    5   13

[Figure: paths through the DP graph and the corresponding orderings of the symbols: a1 a2 b1 b2, b1 a1 b2 a2, b1 b2 a1 a2, a1 b1 a2 b2, a1 b1 b2 a2, b1 a1 a2 b2]
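The node labels of such a DP graph can be reproduced with a short recurrence: each node is reached from its upper, left and diagonal neighbors. This is a sketch under the assumption that the origin counts as one (empty) path, which is the usual convention for these numbers (known as Delannoy numbers); the function name `count_dp_paths` is hypothetical.

```python
def count_dp_paths(n, m):
    """Number of paths from the origin to node (n, m) of the DP graph.

    Allowed steps: down (gap in one string), right (gap in the other) and
    diagonal (a pair of aligned symbols). The origin is counted as 1 here,
    which is an assumption of this sketch.
    """
    N = [[1] * (m + 1) for _ in range(n + 1)]  # first row and column: one path each
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            N[i][j] = N[i - 1][j] + N[i][j - 1] + N[i - 1][j - 1]
    return N[n][m]


print(count_dp_paths(2, 2))
```

For two strings of length 2 this gives 13 paths, that is, 13 gapped alignments before redundant ones are merged.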
14
Computing the number of alignments with gaps
Definition A string of length n+m, obtained by intercalating two strings $S_1 = a_1 a_2 \ldots a_n$ and $S_2 = b_1 b_2 \ldots b_m$, while preserving the order of the symbols in $S_1$ and $S_2$, will be referred to as an intercalated string and denoted by $S_{1/2}$.
Note that $S_1$ and $S_2$ are subsequences of $S_{1/2}$ but in general they are not substrings of $S_{1/2}$.
Definition Two alignments are called redundant if their score is identical. The relationship of “having the same score” may be used to define equivalence classes of non-redundant alignments. For example, the class $a_1 b_1 b_2 a_2$:

a1 -  a2        a1 a2 -
b1 b2 -    ;    b1 -  b2
15
Computing the number of alignments with gaps
Lemma There is a one-to-one correspondence (bijection) between the set of the non-redundant gapped alignments of two strings $S_1$ and $S_2$ and the set of the intercalated strings $\{S_{1/2}\}$.
Corollary The number of non-redundant gapped alignments of two strings, of length n and m, respectively, is equal to $(n+m)!/[m!\,n!]$.
Proof Since the order of each of the sequences is preserved when intercalating them, we have in fact n+m positions in which to put the m elements of the second sequence (once this is done, the position of each of the elements of the first sequence is fixed unambiguously). Hence, the total number of intercalated sequences $S_{1/2}$ is given by the number of m-element combinations of n+m elements, and the corollary is a simple consequence of the one-to-one correspondence between alignments and intercalated sequences stated in the lemma. QED
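The counting argument in the proof can be checked by brute force for short strings: choose which m of the n+m positions receive the symbols of the second string, and fill the rest with the first. This is a minimal sketch; the function name `intercalated_strings` and the symbol labels are hypothetical.

```python
from itertools import combinations
from math import comb


def intercalated_strings(s1, s2):
    """Enumerate all intercalations of the symbol sequences s1 and s2.

    Each choice of m positions (out of n + m) for the symbols of s2
    fixes the positions of the symbols of s1, as in the proof.
    """
    n, m = len(s1), len(s2)
    results = []
    for positions in combinations(range(n + m), m):
        chosen = set(positions)
        it1, it2 = iter(s1), iter(s2)
        results.append([next(it2) if k in chosen else next(it1) for k in range(n + m)])
    return results


strings = intercalated_strings(["a1", "a2"], ["b1", "b2"])
for s in strings:
    print(" ".join(s))
print(len(strings), comb(4, 2))
```

For n = m = 2 the enumeration yields six intercalated strings, in agreement with (n+m)!/[m!n!] = 6.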
16
Computing the number of alignments with gaps
Problem Consider for simplicity two strings of the same length and, using the Stirling formula ($x! \sim (2\pi)^{1/2} x^{x+1/2} e^{-x}$), show that:
$(n+n)!/[n!\,n!] \sim 2^{2n} / (\pi n)^{1/2}$
Note that even for a sequence of length n=50, which is very short by biological standards, one needs to perform about $10^{30}$ basic operations for an exhaustive search, making the naïve approach infeasible.
Dynamic programming finds an optimal solution in polynomial time for a class of optimization problems whose search space grows exponentially, including approximate string matching.
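The asymptotic estimate can be checked numerically. The sketch below compares the exact count (2n)!/(n!)^2 with the Stirling-based approximation $2^{2n}/(\pi n)^{1/2}$; the function names are hypothetical, and the approximation is evaluated in floating point, so it is only suitable for moderate n.

```python
from math import comb, pi, sqrt


def alignments_exact(n):
    """Exact number of non-redundant alignments of two strings of length n."""
    return comb(2 * n, n)


def alignments_stirling(n):
    """Stirling-based approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


for n in (5, 10, 50):
    exact = alignments_exact(n)
    approx = alignments_stirling(n)
    print(n, exact, f"{approx:.4e}", f"ratio={exact / approx:.4f}")
```

The ratio of the exact value to the approximation approaches 1 as n grows, and for n=50 the count of non-redundant alignments alone is already of the order of $10^{29}$.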