1 Pairwise Sequence Alignment
2 Biological motivation Main algorithms for pairwise sequence alignment ATTGCGTCGATCGCAC-GCACGCT ATTGCAGTG-TCGAGCGTCAGGCT CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT Global alignment
3 Biological motivation Main algorithms for pairwise sequence alignment ATTGCGTCGATCGCAC-GCACGCT ATTGCAGTG-TCGAGCGTCAGGCT CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT Local alignment
4 Discover function Sequences that are similar probably have the same function
5 Study evolution If two sequences from different organisms are similar, they may have had a common ancestor
6 Find crucial features –Regions that are strongly conserved between different sequences can indicate functional importance. Example: conservation of IGFALS (insulin-like growth factor binding protein, acid labile subunit) between human and mouse.
7 Identify cause of disease –Comparison of sequences between individuals can detect changes that are related to diseases
8 Sickle Cell Anemia Due to a single nucleotide substitution (an A swapped for a T), which causes valine to be inserted instead of glutamic acid in hemoglobin
9 Healthy Individual >gi| |ref|NM_ | Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GG A GAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi| |ref|NP_ | beta globin [Homo sapiens] MVHLTP E EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
10 Diseased Individual >gi| |ref|NM_ | Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GG T GAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi| |ref|NP_ | beta globin [Homo sapiens] MVHLTP V EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
11 Sequence Modifications Three types of mutation –Substitution (point mutation) –Insertion –Deletion TCAGTTCGAGT TCCGT TCGT TCAGT Indel (replication slippage)
12 How do we quantitate similarity?
13 Scoring Similarity Assume independent mutation model –Each site considered separately Score at each site –Positive if the same –Negative if different Sum to make final score –Can be positive or negative –Significance depends on sequence length GTAGTC CTAGCG
14 Substitutions Only Pretend there are no indels –Sequences compared base-by-base –Count the number of matches and mismatches –Matches score +2, Mismatches score -1
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
9 matches: +18
14 mismatches: -14
Total score: +4 (a weak match)
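The count on this slide can be reproduced in a few lines of Python. This is a minimal sketch: the function name `score_ungapped` and its parameters are illustrative and do not come from the slides; only the +2 / -1 scoring and the two sequences do.

```python
def score_ungapped(a, b, match=2, mismatch=-1):
    """Score two equal-length sequences base by base, with no indels."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs sequences of equal length")
    # Each site is considered separately, then the site scores are summed.
    matches = sum(x == y for x, y in zip(a, b))
    mismatches = len(a) - matches
    return matches, mismatches, matches * match + mismatches * mismatch


matches, mismatches, total = score_ungapped(
    "TTCGTCGTAGTCGGCTCGACCTG",
    "GTACGTCTAGCGAGCGTGATCCT",
)
print(matches, mismatches, total)  # 9 14 4
```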
15 Including Indels Create an ‘alignment’ –Count matches within alignment –Required if sequences are different length
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches: +34
2 mismatches: -2
8 indels: -8
Total score: +24 (a strong match)
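Scoring a given gapped alignment is a column-by-column count. The sketch below assumes an indel costs -1, which is what the slide's totals imply; `score_alignment` is a hypothetical helper name, not something defined in the slides.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, indel=-1):
    """Score an existing alignment; '-' marks an indel in either row."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = indels = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            indels += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + indels * indel
    return matches, mismatches, indels, total


print(score_alignment(
    "TT-CGTCGTAGTCG-GC-TCGACC-TG",
    "GTACGTC-TAG-CGAGCGT-GATCCT-",
))  # (17, 2, 8, 24)
```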
16 Choosing an Alignment Many different alignments are possible –Should consider all possible alignments –Take the best score found –There may be more than one best alignment
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
TTCGT-CGTAGTC-GGCTCG-ACCTG
GTAC-GTCTA-GCGAGCGT-GATCC-T
17 Why is it hard? Alignment without gaps requires a number of comparisons roughly proportional to the square of the average sequence length. If we allow gaps, the number of possible alignments to compare becomes astronomical.
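To get a feel for "astronomical", the sketch below counts gapped alignments of two sequences of lengths m and n. It assumes one particular counting convention that the slides do not state: every column is either a pairing of two letters, a gap in the first sequence, or a gap in the second, and alignments that differ only in the order of adjacent gaps are counted separately. Under that convention the counts are the Delannoy numbers. The function name `count_alignments` is illustrative.

```python
def count_alignments(m, n):
    """Number of gapped alignments of sequences with m and n letters."""
    # ways[i][j]: alignments of the first i letters with the first j letters.
    # An empty prefix can only be aligned against gaps, so the borders are 1.
    ways = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The last column is a pairing, a gap in one sequence,
            # or a gap in the other.
            ways[i][j] = ways[i - 1][j - 1] + ways[i - 1][j] + ways[i][j - 1]
    return ways[m][n]


for length in (1, 2, 3, 10, 23):
    print(length, count_alignments(length, length))
```

Already at length 3 there are 63 alignments, and the count grows exponentially with sequence length, which is why scoring every alignment separately is not practical.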
18 Algorithms for pairwise alignments Dot Plots – Gibbs and McIntyre 1970 Dynamic Programming: Local alignment: Smith-Waterman Global alignment: Needleman-Wunsch
19 Dot Plots Early method Sequences at top and left Dots indicate matched bases Diagonal series show matched regions
Example: GTAGTCGG across the top, TAGCGAGC down the left; the diagonal runs correspond to the alignment
TAGTCG
TAG-CG
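A dot plot is easy to sketch in code: one cell per pair of positions, marked when the two bases are equal. The helper names `dot_plot` and `render` below are hypothetical, and the two sequences are the ones that appear on the slide (assumed to be the top and left sequences of the figure).

```python
def dot_plot(top, left):
    """Return a grid where grid[r][c] is True if left[r] matches top[c]."""
    return [[l == t for t in top] for l in left]


def render(top, left):
    """Draw the grid as text: '*' for a matched base, '.' otherwise."""
    lines = ["  " + " ".join(top)]
    for base, row in zip(left, dot_plot(top, left)):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GTAGTCGG", "TAGCGAGC"))
```

Runs of `*` along a diagonal are the matched regions; a shift from one diagonal to a neighbouring one suggests an indel.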
20 Dynamic Programming A method for reducing a complex problem to a set of identical sub-problems The best solution to one sub-problem is independent from the best solution to the other sub-problem
22 What does it mean? If a path from X→Z passes through Y, the best path from X→Y is independent of the best path from Y→Z
23 Example Sequences: A = ACGCTG, B = CATGT
[Table: A (A C G C T G) across the top, B (C A T G T) down the left; score to be found: ?]
24 Example Sequences: A = ACGCTG, B = CATGT Match: +2, Other: -1
Score of best alignment between AC and CATG: -1
…between AC and CATGT: -2
…between ACG and CATG: 2
Calculate score between ACG and CATGT: ?
25 Needleman-Wunsch Example Each cell can be reached in three ways: –Insertion in the first sequence –Align the next letter in sequence 1 and 2 –Insertion in the second sequence
26 Sequences: A = ACGCTG, B = CATGT Needleman-Wunsch Example
-1 from before plus -1 for mismatch of G against T: -2
2 from before plus -1 for mismatch of – against T: 1
-2 from before plus -1 for mismatch of G against –: -3
Cell gets highest score of -2, 1, -3: 1
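The single-cell step above can be written as a small function. This is a sketch, assuming the table layout used in these slides (A across the top, B down the left) and the Match +2 / Other -1 scheme; the names `cell_score`, `diagonal`, `above` and `left` are illustrative. The three neighbouring scores passed in (-1, 2 and -2) are the scores of AC vs CATG, ACG vs CATG and AC vs CATGT under that scheme.

```python
def cell_score(diagonal, above, left, x, y, match=2, other=-1):
    """Best score for a cell, given its three already-computed neighbours.

    x is the letter of A for this column, y the letter of B for this row.
    """
    candidates = {
        "align x with y": diagonal + (match if x == y else other),
        "gap against y": above + other,   # y is paired with '-'
        "x against gap": left + other,    # x is paired with '-'
    }
    return max(candidates.values()), candidates


# Cell for ACG vs CATGT: x = 'G' (last letter of ACG), y = 'T' (last of CATGT).
best, candidates = cell_score(diagonal=-1, above=2, left=-2, x="G", y="T")
print(candidates)
print(best)  # 1
```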
27 Sequences: A = ACGCTG, B = CATGT Needleman-Wunsch Example
28 Scoring table: A across the top (A1 C2 G3 C4 T5 G6), B down the left (C1 A2 T3 G4 T5). The corner cell (column 0, row 0) scores 0.
29 Column 1, row 0: A against a gap (A / -) scores -1.
30 Row 0 filled in: ACGCTG against gaps scores -1, -2, -3, -4, -5, -6.
31 Column 0 filled in: CATGT against gaps scores -1, -2, -3, -4, -5.
32 Column 1, row 1: A against C (A / C) scores -1.
33 Column 2, row 1: AC / -C scores 1.
34 Column 3, row 1: ACG / -C- scores 0.
35 Column 4, row 1: ACGC / -C-- and ACGC / ---C both score -1.
36 Column 3, row 2: ACG / -CA scores 0.
37 The remaining cells are filled in the same way.
39 Completed scoring table (Match: +2, Other: -1):
       0  A1  C2  G3  C4  T5  G6
0      0  -1  -2  -3  -4  -5  -6
C1    -1  -1   1   0  -1  -2  -3
A2    -2   1   0   0  -1  -2  -3
T3    -3   0   0  -1  -1   1   0
G4    -4  -1  -1   2   1   0   3
T5    -5  -2  -2   1   1   3   2
40 Traceback from the bottom-right cell (score 2):
ACGCTG-
-C-ATGT
41 A second alignment with the same score:
ACGCTG-
-CA-TGT
42 A third alignment with the same score:
-ACGCTG
CATG-T-
43 Needleman-Wunsch Alignment: Summary
Global alignment between sequences
–Compare entire sequence against another
Create scoring table
–Sequence A across top, B down left
Cell at column i and row j contains the score of the best alignment between the first i elements of A and the first j elements of B
–Global alignment score is bottom right cell
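Putting the summary together, a compact Needleman-Wunsch sketch might look like this. It assumes the simple scheme used in the example (Match +2, everything else, including a gap, -1) and not an affine gap penalty or a substitution matrix. The names `nw_table` and `needleman_wunsch` are illustrative, and the tie-breaking order in the traceback (diagonal, then left, then up) is an arbitrary choice.

```python
def nw_table(a, b, match=2, other=-1):
    """table[j][i]: best score for the first i letters of a and first j of b."""
    n, m = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        table[0][i] = i * other          # prefix of a against gaps only
    for j in range(1, m + 1):
        table[j][0] = j * other          # prefix of b against gaps only
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else other
            table[j][i] = max(table[j - 1][i - 1] + s,   # align a[i-1], b[j-1]
                              table[j][i - 1] + other,   # a[i-1] against a gap
                              table[j - 1][i] + other)   # b[j-1] against a gap
    return table


def needleman_wunsch(a, b, match=2, other=-1):
    """Return (score, aligned_a, aligned_b) for one best global alignment."""
    table = nw_table(a, b, match, other)
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else other
            if table[j][i] == table[j - 1][i - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and table[j][i] == table[j][i - 1] + other:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return table[len(b)][len(a)], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = needleman_wunsch("ACGCTG", "CATGT")
print(score)
print(aligned_a)
print(aligned_b)
```

The traceback returns only one of the equally good alignments; which one depends on the tie-breaking order.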
44 Global vs. Local alignment DOROTHY HODGKIN Global alignment: DOROTHY HODGKIN DOROTHYCROWFOOTHODGKIN Local alignment:
45 Global Alignment versus Local Alignment
Global Alignment:
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Local Alignment:
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
46 Local Alignment Best score for aligning part of sequences –Often beats global alignment score Similar algorithm: Smith-Waterman –Table cells never score below zero
47 Local Alignment How do we do it?
1. We can start a new match instead of extending a previous alignment.
–This means that at each cell we can start calculating the score from 0 (even if this means ignoring the prefix).
–We do this only if it is better than the alternative (that is, only if the alternative is negative).
2. Instead of looking only at the far corner, we look anywhere in the table for the best score (even if this means ignoring the suffix).
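The two changes listed above translate directly into code. The sketch below reuses the example sequences and the Match +2 / Other -1 scheme from the global alignment example; those scoring values and the function name `smith_waterman` are assumptions for illustration, since the slides do not work through a local alignment example.

```python
def smith_waterman(a, b, match=2, other=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    # Borders stay 0: a local alignment may start anywhere (prefix ignored).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else other
            table[j][i] = max(0,                          # start a new match
                              table[j - 1][i - 1] + s,
                              table[j][i - 1] + other,
                              table[j - 1][i] + other)
            if table[j][i] > best:                        # look anywhere in the table
                best, best_pos = table[j][i], (j, i)
    # Trace back from the best cell until a cell scoring 0 is reached.
    j, i = best_pos
    top, bottom = [], []
    while table[j][i] > 0:
        s = match if a[i - 1] == b[j - 1] else other
        if table[j][i] == table[j - 1][i - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif table[j][i] == table[j][i - 1] + other:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ACGCTG", "CATGT"))  # (5, 'C-TG', 'CATG')
```

For these sequences the local score (5) is higher than the global score (2), because the poorly matching ends are left out of the alignment.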