1
Centre for Integrative Bioinformatics VU [1] 06-11-2006 Sequence Analysis
Sequence Analysis, Lecture 3
Alignments 2: Local alignments
06-12-2007
2
What can sequence alignment tell us about structure?
HSSP (Sander & Schneider, 1991)
≥30% sequence identity
3
Pair-wise alignment
Complexity of the problem: combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
Number of alignments of two sequences of length n:
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
Example alignment:
T D W V T A L K
T D W L - - I K
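The size of this search space can be checked numerically. The sketch below evaluates the binomial count and its closed-form approximation on a log10 scale; the function names are illustrative and not part of the lecture.

```python
import math


def log10_alignment_count(n: int) -> float:
    """log10 of C(2n, n), using lgamma so no huge integers are built."""
    ln_count = math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)
    return ln_count / math.log(10)


def log10_closed_form(n: int) -> float:
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    for n in (300, 1000):
        print(n, round(log10_alignment_count(n), 1), round(log10_closed_form(n), 1))
```

Both functions should agree closely for sequence lengths in the hundreds, which is why the approximation is good enough for order-of-magnitude arguments.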
4
The algorithm for linear gap penalties
$$M[i,j] = \max \begin{cases} M[i-1,j-1] + \mathrm{score}(X[i],Y[j]) \\ M[i,j-1] - g \\ M[i-1,j] - g \end{cases}$$
The three cases correspond to:
X1…Xi-1 Xi aligned with Y1…Yj-1 Yj (Xi matched to Yj)
X1…Xi - aligned with Y1…Yj-1 Yj (gap opposite Yj)
X1…Xi-1 Xi aligned with Y1…Yj-1 - (gap opposite Xi)
score(X[i],Y[j]) is the value from the residue exchange matrix.
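A minimal sketch of the forward (matrix-filling) step for this recurrence is given below. The match/mismatch scorer stands in for a residue exchange matrix and, like the function names, is an assumption made for illustration.

```python
def match_mismatch(a, b):
    """Toy exchange score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def fill_global_matrix(x, y, score=match_mismatch, g=2):
    """Fill M for global alignment with a linear gap penalty g per gap position."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Pre-row and pre-column: leading gaps cost g each.
    for i in range(1, n + 1):
        M[i][0] = -g * i
    for j in range(1, m + 1):
        M[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # X[i] aligned to Y[j]
                M[i][j - 1] - g,                              # gap opposite Y[j]
                M[i - 1][j] - g,                              # gap opposite X[i]
            )
    return M


if __name__ == "__main__":
    M = fill_global_matrix("TGA", "GTGAG")
    print(M[-1][-1])  # global alignment score in the lower-right cell
```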
5
The algorithm for linear gap penalties (continued)
How can we implement the affine gap penalty regime in this algorithm?
6
Scoring alignments
Dynamic programming: the score for an alignment is the sum of the scores over all alignment columns
– Substitution (or match/mismatch): DNA, proteins
– Gap penalty
Linear: gp(k) = αk
Affine: gp(k) = β + αk
Concave, e.g.: gp(k) = log(k)
[Figure: gap penalty against gap length for linear, affine and concave penalties]
General alignment score: S a,b = [formula not shown]
7
Scoring and gap penalties
Scoring:
–DNA or protein?
–Which evolutionary distance do we cover?
Gap penalty: its contribution to the score is always less than 0 (i.e. gaps lower the alignment score)
–Linear: g(k) = αk
–Affine: g(k) = β + αk (harder to implement)
–Concave: g(k) = log(k) (no efficient solution)
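The three gap penalty regimes can be written as small functions of the gap length k. In this sketch the parameter names alpha (per-position cost) and beta (opening cost) and their default values are assumptions chosen for illustration; the returned value is the amount subtracted from the alignment score.

```python
import math


def linear_gap(k, alpha=2.0):
    """Linear: every gap position costs the same."""
    return alpha * k


def affine_gap(k, beta=10.0, alpha=2.0):
    """Affine: a one-off opening cost plus a per-position cost."""
    return beta + alpha * k


def concave_gap(k):
    """Concave example: each extra gap position costs less than the previous one."""
    return math.log(k)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With the illustrative defaults, the affine penalty for gaps of length 1, 2 and 3 is 12, 14 and 16.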
8
Global dynamic programming – general algorithm
$$M[i,j] = \mathrm{score}(X[i],Y[j]) + \max \begin{cases} M[i-1,j-1] \\ \max_{0<x<i-1} \{ M[x,j-1] - g_{open} - (i-x-1)\, g_{extension} \} \\ \max_{0<y<j-1} \{ M[i-1,y] - g_{open} - (j-y-1)\, g_{extension} \} \end{cases}$$
score(X[i],Y[j]) is the value from the residue exchange matrix; g_open is the gap open penalty, g_extension is the gap extension penalty, and (i-x-1) or (j-y-1) is the number of gap extensions.
This more general way of dynamic programming also allows for affine or other gap penalty regimes.
9
Easy DP recipe for using linear, affine and concave gap penalties
M[i,j] is the optimal alignment score (highest scoring alignment up to [i,j])
For each cell, check:
–Cell [i-1, j-1]: apply score for cell [i-1, j-1]
–Preceding row up to j-2: apply appropriate gap penalties
–Preceding column up to i-2: apply appropriate gap penalties
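One way to sketch a DP that accepts an arbitrary gap penalty function is shown below. Note that it uses the closely related prefix formulation (often attributed to Waterman, Smith and Beyer) rather than the exact cell layout of the slides: each cell holds the best score for the two prefixes, and gaps of every length k are tried explicitly. The scorer, the function names and the assumption that splitting one gap into two never lowers the penalty (true for linear and affine penalties) are illustrative choices.

```python
def match_mismatch(a, b):
    """Toy exchange score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def general_global_score(x, y, gap_penalty, score=match_mismatch):
    """Global alignment score for any penalty function gap_penalty(k), k >= 1.

    Each cell scans its whole row and column, so the run time is cubic
    rather than quadratic.
    """
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]))
            for k in range(1, i + 1):   # last k residues of x face a gap
                candidates.append(F[i - k][j] - gap_penalty(k))
            for k in range(1, j + 1):   # last k residues of y face a gap
                candidates.append(F[i][j - k] - gap_penalty(k))
            F[i][j] = max(candidates)
    return F[n][m]


if __name__ == "__main__":
    affine = lambda k: 10 + 2 * k
    print(general_global_score("AAAA", "AA", affine))
```

With a linear penalty function the result coincides with the simple three-neighbour recurrence, which gives a convenient sanity check.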
10
Note about gap penalties
Some affine schemes use gap_penalty = -g_open - g_extension*(l-1), while others use gap_penalty = -g_open - g_extension*l, where l is the length of the gap. One can be converted into the other by adapting the g_open penalty.
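The conversion between the two conventions amounts to shifting the opening penalty by one extension penalty. The sketch below makes that explicit; the function names are hypothetical, and penalties are returned as positive amounts to subtract from the score.

```python
def penalty_first_position_free(l, g_open, g_ext):
    """Convention 1: the opening penalty already pays for the first gap position."""
    return g_open + g_ext * (l - 1)


def penalty_every_position(l, g_open, g_ext):
    """Convention 2: every gap position pays an extension on top of the opening."""
    return g_open + g_ext * l


def open_for_convention_2(g_open_1, g_ext):
    """Opening penalty that makes convention 2 reproduce convention 1."""
    return g_open_1 - g_ext


if __name__ == "__main__":
    g_open_1, g_ext = 12, 2
    g_open_2 = open_for_convention_2(g_open_1, g_ext)
    for l in (1, 2, 5):
        print(l, penalty_first_position_free(l, g_open_1, g_ext),
              penalty_every_position(l, g_open_2, g_ext))
```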
11
Global dynamic programming
g_open = 10, g_extension = 2
[Table: search matrix for DWVTALK (columns) against TDWVLK (rows); the pre-row holds the end-gap penalties 0, -12, -14, -16, -18, -20, -22, -24]
[Table: residue exchange values for the same residues]
These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
The extra bottom row and rightmost column give the final global alignment scores.
12
Variation on global alignment
Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences.
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
Semi-global alignment: don’t penalize start/end gaps (omit the start/end of sequences).
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
– Applications of semi-global:
– Finding a gene in a genome
– Placing a marker onto a chromosome
– One sequence much longer than the other
– Danger! Really bad alignments for divergent sequences
13
Global alignments - review
Take two sequences: X[i] and Y[j]
The score of the best alignment for X[1…i] and Y[1…j] is called M[i, j]
Initiation: M[0,0] = 0
Apply the equation
Find the alignment with trace-back
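The review steps (initiate, apply the equation, trace back) can be sketched end to end as follows. The sketch assumes a linear gap penalty and a toy match/mismatch score; when several paths are optimal, it prefers the diagonal move, which is an arbitrary tie-breaking choice.

```python
def global_align(x, y, match=1, mismatch=-1, g=2):
    """Return (score, aligned_x, aligned_y) using a linear gap penalty g."""
    n, m = len(x), len(y)

    def s(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Forward step: fill the matrix, including the pre-row and pre-column.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -g * i
    for j in range(1, m + 1):
        M[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j), M[i - 1][j] - g, M[i][j - 1] - g)

    # Trace-back from the lower-right cell to the upper-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(global_align("TGA", "GTGAG"))
```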
14
Global and local alignment
[Figure: local versus (semi-)global alignment of sequences built from segments A, B and C, with the problem cases indicated]
15
Local alignment
What’s local?
–Allow only parts of the sequence to match
–Results in High Scoring Segments
–Locally maximal: cannot make it better by trimming/extending the alignment
16
Local alignment
Why local?
–Parts of a sequence diverge faster: evolutionary pressure does not put constraints on the whole sequence
–Proteins have a modular construction: domains are shared between sequences
17
Domains - example
Immunoglobulin domain
18
Global → local alignment
Take the old, good equation
Look at the result of the global alignment
CAGCACTTGGATTCTCG-
CA-C-----GATTCGT-G
a) global align
b) retrieve the result
c) sum score along the result
[Figure: cumulative score (sum) plotted against alignment position]
19
Local alignment – breaking the alignment
A recipe
–Just don’t let the score go below 0
–Start a new alignment when it happens
–Where is the result in the matrix?
[Figure: cumulative score (sum) against alignment position, before and after resetting the score at 0]
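The recipe can be illustrated in one dimension: given the per-column scores of an existing alignment, keep a running sum, reset it whenever it drops to 0 or below, and remember the highest value reached. The function name and the example scores below are made up for illustration.

```python
def best_segment(column_scores):
    """Return (best_sum, start, end) of the best-scoring run of columns.

    start is inclusive and end is exclusive; (0, 0, 0) means that no
    segment scores above zero.
    """
    best, best_start, best_end = 0, 0, 0
    running, start = 0, 0
    for pos, s in enumerate(column_scores):
        running += s
        if running <= 0:
            # Score fell to zero or below: drop everything so far, restart here.
            running = 0
            start = pos + 1
        elif running > best:
            best, best_start, best_end = running, start, pos + 1
    return best, best_start, best_end


if __name__ == "__main__":
    scores = [1, 1, -2, -2, -2, 1, 1, 1, -1, 1]
    print(best_segment(scores))
```

The local alignment recurrence applies the same reset inside every cell of the search matrix, which is why the result can end anywhere in the matrix rather than in the lower-right corner.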
20
Local alignment – the equation
Init the matrix with 0’s
Fill in the search matrix (forward step)
Read the maximal value from anywhere in the matrix
Find the result by performing trace-back
$$M[i,j] = \max \begin{cases} M[i,j-1] - g \\ M[i-1,j] - g \\ M[i-1,j-1] + \mathrm{score}(X[i],Y[j]) \\ 0 \end{cases}$$
The extra 0: great contribution to science!
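A compact sketch of these four steps is given below, assuming a linear gap penalty and a toy match/mismatch score in place of an exchange matrix. When several cells share the maximum, the first one found in row order is used, which is an arbitrary choice.

```python
def local_align(x, y, match=1, mismatch=-1, g=2):
    """Return (score, aligned_x, aligned_y) of one best local alignment."""
    n, m = len(x), len(y)

    def s(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Init with zeros, then fill; remember the highest cell seen.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(0, M[i - 1][j - 1] + s(i, j), M[i - 1][j] - g, M[i][j - 1] - g)
            if M[i][j] > best:
                best, bi, bj = M[i][j], i, j

    # Trace back from the highest cell until a zero-valued cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + s(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(local_align("TGA", "GTGAG"))
```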
21
Finding second best alignment
We can find the best local alignment in the sequence
But where is the second best one?
Scoring: 1 for match, -2 for a gap
[Figure: excerpt of the search matrix around …AGA…, marking the best alignment and a clump]
22
Clump of an alignment
Alignments sharing at least one aligned pair
[Figure: excerpt of the search matrix around …AGA…, showing a clump]
23
Example: repeats in proteins
Local alignment traces showing similarity between repeats
Note: repeats are similar motifs but need not be identical
24
Clumps
[Figure axes: gene X, gene Y]
The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)
‘Shadows’ of various alignments are visible – these all derive from parts of the same locally optimal alignment
25
Finding second best alignment
Don’t let any matched pair contribute to the next alignment
Recalculate the clump
“Clear” the best alignment
26
Clumps
[Figure axes: gene X, gene Y]
The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)
The highest scoring local alignment is indicated
27
Clumps
[Figure axes: gene X, gene Y]
28
Extraction of alignments – Waterman-Eggert algorithm
1. Repeat
a. Retrieve the highest scoring alignment
b. Set its trace to 0
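The loop can be sketched in a simplified form as below. This is not the published algorithm in full: the original recalculates only the part of the matrix affected by the cleared trace, whereas this sketch refills the whole matrix on every round, which is slower but easier to follow. Scoring is a toy match/mismatch scheme with a linear gap penalty, and all names are illustrative.

```python
def top_local_alignments(x, y, k, match=1, mismatch=-1, g=2):
    """Return up to k non-overlapping local alignments.

    Each result is (score, (x_start, x_end), (y_start, y_end)) with
    0-based, end-exclusive indices into x and y.
    """
    n, m = len(x), len(y)
    blocked = set()   # cells on the traces of alignments already reported
    results = []
    for _ in range(k):
        M = [[0] * (m + 1) for _ in range(n + 1)]
        best, bi, bj = 0, 0, 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if (i, j) in blocked:
                    continue  # a cleared trace cell stays at 0
                s = match if x[i - 1] == y[j - 1] else mismatch
                M[i][j] = max(0, M[i - 1][j - 1] + s, M[i - 1][j] - g, M[i][j - 1] - g)
                if M[i][j] > best:
                    best, bi, bj = M[i][j], i, j
        if best == 0:
            break  # nothing left that scores above zero
        # Trace back, collecting the cells to clear for the next round.
        i, j = bi, bj
        while i > 0 and j > 0 and M[i][j] > 0:
            blocked.add((i, j))
            s = match if x[i - 1] == y[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                i -= 1; j -= 1
            elif M[i][j] == M[i - 1][j] - g:
                i -= 1
            else:
                j -= 1
        results.append((best, (i, bi), (j, bj)))
    return results


if __name__ == "__main__":
    for hit in top_local_alignments("ACGT", "ACGTTTACGT", 2):
        print(hit)
```

In the usage example the second sequence contains the motif ACGT twice, so the second round should report the second copy once the first trace has been cleared.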
29
General local DP algorithm for using affine gap penalties
M[i,j] is the optimal alignment score (highest scoring alignment up to [i, j])
At each cell [i, j] in the search matrix, take the maximum coming from:
– any cell in the preceding row up to j-2: add score for cell [i, j] minus appropriate gap penalties;
– any cell in the preceding column up to i-2: add score for cell [i, j] minus appropriate gap penalties;
– or cell [i-1, j-1]: add score for cell [i, j]
Select the highest scoring cell anywhere in the matrix and do trace-back until a zero-valued cell or the start of the sequence(s)
Penalty = P_open + gap_length * P_extension
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_{open} - (i-x-1)\, P_{extension} \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_{open} - (j-y-1)\, P_{extension} \} \\ 0 \end{cases}$$
Here s_{i,j} is the residue exchange score for cell [i, j].
30
Example: general algorithm for local alignment
DP algorithm with affine gap penalties (PAM250, Po=10, Pe=2)
Extra start/end columns/rows are not necessary (no end-gaps).
Each negative scoring cell is set to zero.
The highest scoring cell may be found anywhere in the search matrix after calculating it.
Trace the highest scoring cell back to the first cell with zero value (or the beginning of one or both sequences)
[Tables: PAM250 exchange values and the resulting search matrix for DWVTALK (columns) against TDWVLK (rows)]
31
Pitfalls of alignments
Alignment is often not a reconstruction of evolution (the common ancestor is usually extinct by the time of alignment)
Repeats: matches to the same fragment
32
Summary
1. Global, e.g. Needleman-Wunsch algorithm
2. Semi-global
3. Local, e.g. Smith-Waterman algorithm
4. Many local alignments, aka Waterman-Eggert algorithm
What’s the number of steps in these algorithms?
How much memory is used?
These questions deal with the algorithmic complexity (time, memory) of alignment
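As a partial answer for the simple linear-gap recurrence: every cell is computed from three neighbours, so the number of steps grows with the product of the two sequence lengths, while the general recurrence additionally scans a row and a column per cell. Memory is also proportional to that product if the full matrix is kept for trace-back, but the score alone needs only two rows. The sketch below illustrates the two-row idea; the names and the toy scoring are assumptions.

```python
def global_score_two_rows(x, y, match=1, mismatch=-1, g=2):
    """Global alignment score (linear gaps) keeping only two matrix rows."""
    prev = [-g * j for j in range(len(y) + 1)]   # pre-row
    for i in range(1, len(x) + 1):
        curr = [-g * i] + [0] * len(y)           # pre-column entry first
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] - g, curr[j - 1] - g)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_score_two_rows("TGA", "GTGAG"))
```

The price of the memory saving is that the alignment itself can no longer be read off by a plain trace-back.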
33
Global alignment (“simple” recursive DP)
Fill pre-row and pre-column with gap penalties to account for initial end gaps
Perform trace-back from lower-right matrix cell, thereby accounting for terminal end gaps
Match/mismatch = 1/-1
Constant gap penalty = -2
[Table: search matrix for TGA against GTGAG]
34
Semi-global alignment
Ignore initial end gaps
–First row/column set to 0
Ignore terminal end gaps
–Read the result from last row/column
Match/mismatch = 1/-1
Constant gap penalty = -2
[Table: search matrix for TGA against GTGAG]
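The two changes relative to global alignment (zeros in the first row and column, result read from the last row or column) can be sketched as follows, using the same scoring as the slide. The function name is illustrative, and only the score is returned to keep the example short.

```python
def semi_global_score(x, y, match=1, mismatch=-1, g=2):
    """Best semi-global score: end gaps are free, internal gaps cost g each."""
    n, m = len(x), len(y)
    # First row and column stay 0: initial end gaps are not penalized.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] - g, M[i][j - 1] - g)
    # Terminal end gaps are free: take the best cell in the last row or column.
    best_last_row = max(M[n])
    best_last_col = max(row[m] for row in M)
    return max(best_last_row, best_last_col)


if __name__ == "__main__":
    print(semi_global_score("TGA", "GTGAG"))
```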
35
Local alignment
Ignore cells that score <= 0
Read the result (local alignment score) from the highest cell anywhere in the matrix
–Perform trace-back starting from this cell.
Match/mismatch = 1/-1
Constant gap penalty = -2
[Table: search matrix with rows G, G, A and columns GTGAG]
36
Low complexity regions
Local composition bias
–Replication slippage: e.g. triplet repeats
Results in spurious hits
–Breaks down statistical models
–Different proteins reported as hits due to similar composition
–Up to ¼ of the sequence can be biased
Widely used filtering program: SEG (Wootton and Federhen, 1996)
Huntington’s disease
–Huntingtin gene of unknown (!) function
–Number of repeats: 6-35: normal; 36-120: disease
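To give a feel for how composition bias can be detected, the sketch below flags windows whose Shannon entropy is low. It is a deliberately simplified stand-in and not the procedure of the filtering program named above; the window length, the threshold and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_positions(seq, window=12, threshold=2.2):
    """Sorted 0-based positions covered by at least one low-entropy window."""
    flagged = set()
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            flagged.update(range(start, start + window))
    return sorted(flagged)


if __name__ == "__main__":
    seq = "MKT" + "Q" * 20 + "LVA"   # made-up sequence with a glutamine run
    print(low_complexity_positions(seq, window=12, threshold=1.0))
```

Flagged positions would typically be masked before a database search so that a repeat run such as the glutamine stretch cannot produce spurious hits.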
37
Synteny
Synteny is preservation of gene order
–5 to 20 million years of divergence
Figure shows preservation of gene order in four yeast species
Arrows in figure represent genes; arrow direction indicates ‘+’ or ‘-’ strand of DNA
Source: Manolis Kellis et al., Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature 423, 241-254 (15 May 2003)
38
Genome vs. gene
*) b stands for bases or nucleotides
39
Take-home messages (I)
Know global, semi-global and local alignment (three types)
Know three types of gap penalties
‘Easy’ (fast) and general algorithm
Pitfalls of global, semi-global and local alignment
Low-complexity regions
40
Take-home messages (II)
Make sure you understand and can carry out
1. the ‘simple’ DP algorithm (for linear gap penalties, but with extension to affine penalties) for global, semi-global and local alignment
2. the general DP algorithm for global, semi-global and local alignment!