1
Sequence Alignment
Algorithms in Computational Biology, Spring 2006
Edited by Itai Sharon
Most slides were created and edited by Nir Friedman, Dan Geiger, Shlomo Moran and Ydo Wexler
2
Sequence Comparison
Much of bioinformatics involves sequences:
- DNA sequences
- RNA sequences
- Protein sequences
We can think of these sequences as strings of letters:
- DNA and RNA: alphabet Σ of 4 letters
- Protein: alphabet Σ of 20 letters
3
Sequence Comparison: Motivation
Finding similarity between sequences is important for many biological questions:
- Find homologous proteins
  - Allows us to predict structure and function
- Locate similar subsequences in DNA
  - e.g., allows us to identify regulatory elements
- Locate DNA sequences that might overlap
  - Helps in sequence assembly
4
Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Two basic variants of sequence alignment:
- Global: all characters in both sequences participate (Needleman-Wunsch, 1970)
- Local: find related regions within the sequences (Smith-Waterman, 1981)
5
Sequence Alignment: Example
Input: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
Possible output:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
Three elements:
- Perfect matches
- Mismatches
- Insertions and deletions (indels)
6
Scoring Function
Score each position independently:
- Match: +1
- Mismatch: -1
- Indel: -2
The score of an alignment is the sum of its position scores.
Example:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
  Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

  ------GCGCATGGATTGAGCGA
  TGCGCC----ATTGATGACCA--
  Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
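The position-by-position scoring above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `score_alignment` and its keyword parameters are hypothetical, and the default values are the slide's match, mismatch and indel scores.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two equal-length gapped rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Because every column is scored independently, the total is just a sum over columns, which is what makes the dynamic programming approach on the later slides possible.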
7
Sequence vs. Structure Similarity
Sequence 1: lcl|1A6M:_ MYOGLOBIN, Length 151 (1..151)
Sequence 2: lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III, Length 147 (1..147)
Score = 31.6 bits (70), Expect = 10
Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%)

Query: 2    LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59
            LS + Q+V W + + AG G++ L + +HPE F +
Sbjct: 2    LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG--------FSGASDP 53

Query: 60   DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH 113
            + + G VL +G L +G AE+K + H KH I +Y E + +++
Sbjct: 54   GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112

Query: 114  VLHSRHPGDFGADAQGA 130
            + R G A A+ A
Sbjct: 113  AMEHRIGGKMNAAAKDA 129
8
Sequence vs. Structure Similarity
[Figure panels: 1A6M: Myoglobin; 1JL7: Hemoglobin]
9
Global Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences in which all characters in both sequences participate
The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences:
- Uses a scoring function
- Is a dynamic programming algorithm
10
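The Needleman-Wunsch (NW) Algorithm
Suppose we have two sequences: s = s_1…s_n and t = t_1…t_m
Construct an (n+1) x (m+1) matrix V in which V(i, j) contains the score of the best alignment between s_1…s_i and t_1…t_j.
The boundary cells align a prefix against the empty string: V(0, 0) = 0, V(i, 0) = V(i-1, 0) + score(s_i, -) and V(0, j) = V(0, j-1) + score(-, t_j).
The score for every other cell V(i, j) is:
\[
V(i, j) = \max \begin{cases}
V(i-1, j) + \mathrm{score}(s_i, -) \\
V(i, j-1) + \mathrm{score}(-, t_j) \\
V(i-1, j-1) + \mathrm{score}(s_i, t_j)
\end{cases}
\]
V(n, m) is the score of the best alignment between s and t.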
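The recurrence can be turned into a short matrix-filling routine. The sketch below is illustrative only: the name `nw_matrix` is hypothetical, the scores default to the +1 / -1 / -2 scheme used elsewhere in these slides, and the first row and column are filled with cumulative indel scores, which is the usual boundary condition for global alignment.

```python
def nw_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Fill the (n+1) x (m+1) Needleman-Wunsch score matrix V."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against the empty string is all indels.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j] + indel,    # s_i aligned to a gap
                V[i][j - 1] + indel,    # t_j aligned to a gap
                V[i - 1][j - 1] + sub,  # s_i aligned to t_j
            )
    return V


V = nw_matrix("AAAC", "AGC")
print(V[-1][-1])  # -1
```

Strings are 0-indexed in Python, so `s[i - 1]` plays the role of s_i in the recurrence.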
11
NW Algorithm: An Example
Alphabet: DNA, Σ = {A, C, G, T}
Input: s = AAAC, t = AGC
Scoring scheme:
- score(x, x) = 1
- score(x, -) = score(-, x) = -2
- score(x, y) = -1 for x ≠ y
12
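NW Algorithm: An Example
Filled matrix V (rows: s = AAAC, columns: t = AGC):

         -    A    G    C
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    A   -6   -3   -2   -1
    C   -8   -5   -4   -1

Optimal alignments (t above s), each with score V(4, 3) = -1:
  AG-C    -AGC    A-GC
  AAAC    AAAC    AAAC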
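One way to recover an alignment from the filled matrix is to walk back from V(n, m) to V(0, 0), at each cell following a predecessor that explains its score. The sketch below assumes the +1 / -1 / -2 scoring scheme of this example; the name `nw_align` and the tie-breaking order (diagonal, then up, then left) are arbitrary choices, so it returns only one of the possibly several optimal alignments.

```python
def nw_align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, gapped s, gapped t) for one optimal global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * indel
    for j in range(1, m + 1):
        V[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j] + indel,
                          V[i][j - 1] + indel,
                          V[i - 1][j - 1] + sub)

    # Trace back from the bottom-right corner, collecting columns in reverse.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            row_s.append(s[i - 1])   # s_i against a gap
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")        # t_j against a gap
            row_t.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


print(nw_align("AAAC", "AGC"))  # (-1, 'AAAC', '-AGC')
```

The traceback visits at most n + m cells, but it needs the whole matrix to be kept in memory.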
13
NW: Time and Space Complexity
Time:
- Filling the matrix: O(n·m)
- Backtracing: O(n+m)
- Overall: O(n·m)
Space:
- Holding the matrix: O(n·m)
[Figure: example DP matrix]
14
NW: Space Complexity
In real-life applications, n and m can be very large, and the O(n·m) space requirement can be too demanding:
- If n = m = 1000, the matrix has 10^6 cells (about 1 MB at one byte per cell)
- If n = m = 10000, the matrix has 10^8 cells (about 100 MB at one byte per cell)
We can afford to perform extra computation to save space:
- Looping over a million operations takes less than a second on modern workstations
Can we trade time for space?
15
Why Do We Need So Much Space?
We can do the same computation in O(min(n, m)) space:
- Compute V(i, j) column by column, storing only two columns in memory (or row by row if rows are shorter).
However, the traceback information still requires O(n·m) memory.
[Figure: example DP matrix]
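The two-line idea can be sketched as follows. This is an illustrative example that computes only the optimal score, not the alignment itself. The name `nw_score_linear_space` is hypothetical, and swapping the two sequences so that the stored line runs along the shorter one relies on the assumption that the scoring scheme is symmetric, as it is for the +1 / -1 / -2 scheme used in these slides.

```python
def nw_score_linear_space(s, t, match=1, mismatch=-1, indel=-2):
    """Optimal global alignment score using O(min(n, m)) memory."""
    # Keep the shorter sequence along the stored line.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * indel for j in range(len(t) + 1)]   # line for i = 0
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j] + indel,       # s_i against a gap
                          curr[j - 1] + indel,   # t_j against a gap
                          prev[j - 1] + sub)     # s_i against t_j
        prev = curr   # the older line is no longer needed
    return prev[-1]


print(nw_score_linear_space("AAAC", "AGC"))  # -1
```

Only the previous and current lines are alive at any time, which is why the traceback pointers cannot be followed afterwards.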
16
Space-Efficient Version
Input: sequences s = s_1…s_n and t = t_1…t_m to be aligned.
Idea: divide and conquer
- Find a position (n/2, i) at which some best alignment crosses the midpoint n/2 of s
- Construct alignments A of s_1…s_{n/2} vs. t_1…t_i and B of s_{n/2+1}…s_n vs. t_{i+1}…t_m
- Return the concatenation AB
17
Finding a Midpoint
The score of the best alignment that goes through (n/2, i) equals:
  score(s_1…s_{n/2}, t_1…t_i) + score(s_{n/2+1}…s_n, t_{i+1}…t_m)
Thus, we need to compute these two quantities for all values of i.
18
Finding a Midpoint
Define
  F(i, j) = score(s_1…s_i, t_1…t_j)
  B(i, j) = score(s_{i+1}…s_n, t_{j+1}…t_m)
Then F(i, j) + B(i, j) is the score of the best alignment through (i, j).
Compute F(i, j) and B(i, j) in linear space:
- We compute F(i, j) in O(min(i, j)) space
- We compute B(i, j) in exactly the same manner, going "backward" from B(n, m)
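Putting the midpoint search and the divide-and-conquer step together gives a linear-space aligner in the style of Hirschberg's algorithm. The sketch below is one possible arrangement under the +1 / -1 / -2 scoring scheme of these slides. All function names are hypothetical, B is computed by running the forward routine on the reversed suffixes, and the single-character base case assumes match >= mismatch.

```python
def last_row(s, t, match=1, mismatch=-1, indel=-2):
    """Scores of s against every prefix of t, keeping two rows in memory."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j] + indel, curr[j - 1] + indel, prev[j - 1] + sub)
        prev = curr
    return prev


def find_midpoint(s, t, match=1, mismatch=-1, indel=-2):
    """Return the j that maximises F(n/2, j) + B(n/2, j)."""
    half, m = len(s) // 2, len(t)
    forward = last_row(s[:half], t, match, mismatch, indel)
    # backward[k] scores s[half:] against the last k characters of t.
    backward = last_row(s[half:][::-1], t[::-1], match, mismatch, indel)
    totals = [forward[j] + backward[m - j] for j in range(m + 1)]
    return max(range(m + 1), key=totals.__getitem__)


def align_single(c, t, match=1, mismatch=-1, indel=-2):
    """Base case: align one character c against a non-empty string t."""
    k = t.find(c)
    sub = match if k >= 0 else mismatch
    k = max(k, 0)
    if sub >= 2 * indel:   # pairing c with t[k] beats leaving c unpaired
        return "-" * k + c + "-" * (len(t) - k - 1), t
    return c + "-" * len(t), "-" + t


def hirschberg(s, t, match=1, mismatch=-1, indel=-2):
    """Return (gapped s, gapped t) for one optimal global alignment."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        return align_single(s, t, match, mismatch, indel)
    half = len(s) // 2
    j = find_midpoint(s, t, match, mismatch, indel)
    left = hirschberg(s[:half], t[:j], match, mismatch, indel)
    right = hirschberg(s[half:], t[j:], match, mismatch, indel)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("AAAC", "AGC"))  # ('AAAC', '-AGC')
```

Each recursion level stores only two rows of the matrix plus the pieces of the alignment built so far. String slicing copies the subsequences, which keeps the sketch short at the cost of some extra temporary memory.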
19
Time Complexity
Time to find a midpoint: c·n·m (c is a constant)
The recursive subproblems have sizes (n/2, i) and (n/2, m-i), hence:
  T(n, m) = c·n·m + T(n/2, i) + T(n/2, m-i)
Lemma: T(n, m) ≤ 2c·n·m
Proof (by induction on n): T(n, m) ≤ c·n·m + 2c(n/2)i + 2c(n/2)(m-i) = c·n·m + c·n·m = 2c·n·m.