Published by Cornelia Lucas. Modified over 6 years ago.
1
Sequence Alignment
Bioinformatics Algorithms. Based on © Pevzner and Jones. Revised 2015.
"The capacity to blunder slightly is the real marvel of DNA. Without this special attribute, we would still be anaerobic bacteria and there would be no music." - Lewis Thomas
2
Outline
Homework
Introduce Scoring Matrices: some mismatches are better than others
Solve Alignment with Affine Gap Penalties: continuing a deletion is more common than starting one
Linear Space Alignment
Multiple Alignment
3
Matching Sequences
Compare two sequences at 6 positions, where each position matches with probability p = 1/4 and fails to match with probability q = 3/4.
Odds that all match: p^6 = 1/4^6 = 1/4096 ~ 0.00024
Odds that none match: q^6 = 3^6/4^6 ~ 0.178
Let's break down the case where there is a single match. The odds that the first position matches, but none of the others do, is p q^5. But the match could be in any of the six spots, so the odds of a single match are 6 p q^5 = 6 x 0.25 x (0.75)^5 ~ 0.356
4
Matching Sequences
To match the first two, the odds are p^2 q^4. But to capture all possible matches with two correct, we need to see how many ways we can pick 2 out of 6. This is the combinatorial coefficient C(6, 2), pronounced "six choose two" and equal to 6 x 5 / 2 = 15. This fits the patterns above: we have C(6, 0) = 1, C(6, 1) = 6, and C(6, 6) = 1. In general, the odds of exactly k matches are C(6, k) p^k q^(6-k)
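The general formula can be checked numerically. This is a minimal sketch in Python 3 (the slides themselves use Python 2), assuming p = 1/4 is the chance that a single position matches; `match_odds` is a name introduced here for illustration only.

```python
from math import comb

def match_odds(k, n=6, p=0.25):
    """Probability that exactly k of n independent positions match."""
    q = 1.0 - p
    return comb(n, k) * p**k * q**(n - k)

# Print the whole distribution for the 6-position example.
for k in range(7):
    print(k, round(match_odds(k), 5))
```

The seven values sum to 1, which is a quick sanity check on the binomial coefficients.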
5
Repeats
What is the largest k for which 4^k + k <= 1,000,000?
k = 10 is a bit too large (4^10 = 1,048,576), so we must settle for 9: you will probably have a repeat of length 10, but you will always have a repeat of length 9
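The pigeonhole bound can be found by direct search. A small Python 3 sketch, assuming a four-letter alphabet; `largest_guaranteed_repeat` is a hypothetical helper name, not part of the course code.

```python
def largest_guaranteed_repeat(n_bases):
    """Largest k with 4**k + k <= n_bases, so some k-mer must repeat."""
    k = 0
    while 4 ** (k + 1) + (k + 1) <= n_bases:
        k += 1
    return k

print(largest_guaranteed_repeat(1000000))
```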
6
Repeats What is the longest repeat that has a 50% chance of appearing?
7
Repeats What is the longest repeat that has a 50% chance of appearing? 16 bins
8
Repeats What is the longest repeat that has a 50% chance of appearing? 64 bins
9
Repeats What length do we expect to see at least 50% of the time?
A string of length N has N – k + 1 substrings of length k. The odds that there are no repeats of length k is v = product over x = 0 .. N – k of (4^k – x) / 4^k. We want 1 – v > ½, or v < ½. This first happens when k = 19
10
A few words on real numbers
In the bad old days there were flaky floating point chips.
IEEE 754 is a standard that has been widely adopted in modern chips. It addresses many serious problems.
We still need to know three terms:
Overflow: 10 lbs of sugar in a 5 lb sack
Underflow: numbers too small to distinguish from 0
Machine Epsilon: distance between 1.0 and the next number we can represent
Numerical Analysis goes over this and much more.
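The three terms can be seen directly from the interpreter. A short Python 3 sketch, assuming the usual IEEE 754 double-precision floats that CPython uses on common platforms:

```python
import sys

eps = sys.float_info.epsilon      # gap between 1.0 and the next double
print(eps)
print(1.0 + eps > 1.0)            # eps is just large enough to register
print(1.0 + eps / 2 == 1.0)       # half of eps is rounded away
print(sys.float_info.max * 10)    # overflow: the product becomes inf
print(5e-324 / 2)                 # underflow: the quotient becomes 0.0
```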
11
So what about this… When I try to compute this I get overflow!
12
So what about this… So don’t do that. You can’t compute 65! But each of the terms in the bottom equation is easy to compute
13
Repeat Odds
def printRepeatOdds(k, limit, debug):
    print k,
    # Number of bins: 4^k possible strings of length k
    bins = 1.0
    for x in xrange(k):
        bins = bins*4.0
    if (bins < limit):
        print "\t Will always have a repeat"
        return
    print "\t", bins, "\t",
    # Multiply term by term rather than computing factorials
    odds = 1.0
    for x in xrange(limit):
        odds = odds * ((bins - x)/bins)
    print 1 - odds
14
Repeat Odds
def printRepeatOdds(k, limit, debug):
    ...
    print "\t", bins, "\t",
    odds = 1.0
    for x in xrange(limit):
        odds = odds * ((bins - x)/bins)
    print 1 - odds

print "\nK:", "\tBins:", "\tOdds of a repeat"
for k in xrange(1, 22):
    printRepeatOdds(k, k + 1, False)
15
Palindrome output
$ python findPalindrome.py ../../../data/EColi.fasta
...
3 103 AAATTT (AAA | TTT)
4 102 AAAATTTT (AAAA | TTTT)
5 101 TAAAATTTTA (TAAAA | TTTTA)
Longer palindromes, shown as their two halves:
CATGGTTATG CATAACCATG
GCATGGTTATG CATAACCATGC
TGCATGGTTATG CATAACCATGCA
CTGCATGGTTATG CATAACCATGCAG
TCTGCATGGTTATG CATAACCATGCAGA
TTCTGCATGGTTATG CATAACCATGCAGAA
16
How much work? > Fake Fasta ATGCATCCCCATATATATATATAT
Do we need to pass over initial data each pass?
17
Better version > Fake Fasta 3 103 AAATTT ATGCATCCCCATATATATATATAT
AAAATTTT ATGCATCCCCATATATATATATAT Can skip areas where we didn’t find long palindrome
18
Palindrome
# findPalindrome.py Find Palindromes
# Jeff Parker Jan 2015
# Usage: $ python findPalindrome.py <filename>
...
# Is this a palindrome?
# Compares the 2*patLen bases starting at pos with their reverse complement.
def isPalindrome(pos, text, patLen, reverse):
    for x in xrange(2*patLen):
        if (reverse[text[pos + x]] != text[pos+2*patLen - x - 1]):
            return False
    return True
19
Palindrome
# Look for palindrome of given length.
def findFirstPalindrome(start, text, patLen, reverse):
    # Return first palindrome of this length
    for pos in xrange(start, len(text) - 2 * patLen + 1):
        if (isPalindrome(pos, text, patLen, reverse)):
            return pos
    return -1    # Never found one of this length
20
Alternative
# Same search with the palindrome test written inline.
def findFirstPalindrome(start, text, patLen, reverse):
    for pos in xrange(start, len(text) - 2 * patLen + 1):
        match = True
        for x in xrange(2*patLen):
            if (reverse[text[pos + x]] != text[pos+2*patLen - x - 1]):
                match = False
                break    # Exit inner loop & try again
        if (match):
            return pos
    return -1    # Never found one of this length
21
Palindrome
Need to step back: each longer palindrome starts one position earlier than the last.
3 103 AAATTT
4 102 AAAATTTT
5 101 TAAAATTTTA

reverse = {}
reverse['A'] = 'T'
reverse['T'] = 'A'
reverse['G'] = 'C'
reverse['C'] = 'G'
patLen = 1
pos = findFirstPalindrome(1, text, patLen, reverse)
while (pos > -1):
    print patLen, pos+1, text[pos:pos+(2*patLen)]
    patLen = patLen + 1
    start = max(pos-1, 0)    # But don't go past 0!
    pos = findFirstPalindrome(start, text, patLen, reverse)
22
Find all ORFs
def findAllOrf(text, limit):
    lst = []
    # Look for start in each of three reading frames.
    for offSet in xrange(3):
        pos = offSet
        while (pos > -1):
            [pos, y, ln] = findORF(text, pos, limit)
            if (pos > -1):    # Found one (position 0 counts): go around again
                item = ["+", offSet+1, pos+1, y, ln]
                lst.append(item)
                pos = pos + ln    # Go past the ORF
    return lst
23
Find one ORF
# Look for a start codon followed by a stop codon
# Returns [start, end, len]
def findORF(text, pos, minOrf):
    while (pos < len(text)):
        if ("ATG" == text[pos:pos+3]):    # Start of ORF
            y = isORF(text, pos, minOrf)
            if (y > -1):
                return [pos, y, y-pos]
        pos = pos + 3
    # Didn't find anything
    return [-1, 0, 0]
24
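isORF
def isORF(text, pos, minOrf):
    y = pos + 3
    while (y < len(text)):
        if (isStop(text[y:y+3])):
            y = y + 3
            if ((y - pos) <= minOrf):    # Too short!
                return -1
            return y    # Found full length ORF
        y = y + 3    # Not a stop codon: move to the next codon
    return -1    # Ran off the end without finding a stop codon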
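`isStop` is called above but is not shown on the slides. A minimal sketch of what it might look like, assuming upper-case DNA text and the three standard stop codons (TAA, TAG, TGA); the body below is an illustration, not the course's own helper.

```python
# Assumed stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def isStop(codon):
    """True when the three-letter string is a stop codon."""
    return codon in STOP_CODONS
```

A slice that runs off the end of the text is shorter than three letters and so is never reported as a stop.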
25
Main Routine
...
print "ORF must be at least", limit, "Base pairs long"
text = cs58FileUtility.readFastaFile(fileName)
lst = findAllOrf(text, limit)
print("Direction ReadingFrame Low High Length")
for pos in range(len(lst)):
    print lst[pos]

Each entry was built in findAllOrf as:
item = ["+", offSet+1, pos+1, y, ln]
lst.append(item)
26
Amino Acid Attributes
Hydrophobic / Hydrophilic
Hydrophobic: repelled by water
Hydrophilic: water soluble
Aromatic / Aliphatic: classification of carbon and hydrogen molecules
Aromatic: contain rings, such as benzene rings
Aliphatic: do not contain such rings
Polar: separation of electric charge leading to an electric dipole, or separation of positive and negative charges
27
Venn Diagram Review Thanks to Snorg Tees
29
Amino Acid Properties
30
Surface Area vs Fractional Avail Area
31
pH of Isoelectric Point vs Hydrophobicity
32
Hydrophobicity Scales
33
Bulk vs Polarity
34
Patrick
35
Noel
36
Kyle
37
Hanna
38
Dario
39
?
40
Raisa
41
Jonathan
42
Brian
43
Kevin
44
Spreadsheets
45
What do sums and differences mean?
46
Spreadsheets What are the strong signals?
48
Summary
Principal Component Analysis is a technique that takes data of multiple dimensions and finds the "Principal Components".
It can make it easier to make sense of data.
Note that in this case, we have no evidence (yet) that the components we have identified have any biological significance.
We might have started by measuring the wrong things.
We will get a chance to evaluate this distance matrix when we return to sequence alignment.
49
Bayes Theorem
If the patient has tested positive, they are either Sick (1,000) and test positive (810 individuals), or they are Well and have a false positive (8217 individuals).
Thus the odds that a patient who tests positive is sick are 810/(810 + 8217) = 8.97%
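The arithmetic behind the 8.97% can be reproduced from the two counts on the slide. A small Python 3 sketch; the variable names are chosen here for illustration.

```python
true_positives = 810      # sick and test positive
false_positives = 8217    # well, but test positive anyway

# P(sick | positive) = true positives / all positives
p_sick_given_positive = true_positives / (true_positives + false_positives)
print("{:.2%}".format(p_sick_given_positive))
```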
50
Outline
Homework
Introduce Scoring Matrices: some mismatches are better than others
Solve Alignment with Affine Gap Penalties: continuing a deletion is more common than starting one
Linear Space Alignment
Multiple Alignment
51
Global Alignment: Key Ideas
1) Labeling of the upper and left edges
2) To compute the best match ending at location [i,j] we compute the three values below, pick the maximal value, and store it in d[i][j]
The costs for match, non-match and gap may be varied to match the problem
insertCost = d[i-1][j] - 1;
deleteCost = d[i][j-1] - 1;
if (str1[i] == str2[j])
    diagCost = d[i-1][j-1] + 1;
else
    diagCost = d[i-1][j-1] - 1;
d[i][j] = max(insertCost, deleteCost, diagCost)
[Figure: one cell of the table for the prefixes AT, ATG and ATC, with its three candidate alignments and their scores]
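The two key ideas fit in a few lines. A minimal Python 3 sketch, assuming match +1, mismatch -1 and gap -1 as in the fragment above; `global_alignment_score` and its parameter names are introduced here for illustration.

```python
def global_alignment_score(str1, str2, match=1, mismatch=-1, gap=-1):
    rows, cols = len(str1), len(str2)
    d = [[0] * (cols + 1) for _ in range(rows + 1)]
    # 1) Label the left and upper edges: 0, -1, -2, ...
    for i in range(1, rows + 1):
        d[i][0] = i * gap
    for j in range(1, cols + 1):
        d[0][j] = j * gap
    # 2) Fill each cell with the best of its three predecessors.
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            insert_cost = d[i - 1][j] + gap
            delete_cost = d[i][j - 1] + gap
            step = match if str1[i - 1] == str2[j - 1] else mismatch
            diag_cost = d[i - 1][j - 1] + step
            d[i][j] = max(insert_cost, delete_cost, diag_cost)
    return d[rows][cols]

print(global_alignment_score("ATG", "ATC"))
```

Note that Python strings are indexed from 0, so cell [i][j] compares str1[i-1] with str2[j-1].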
52
Alignment Scoring
Our algorithm permits the following scoring scheme:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty
The score for a particular alignment is #matches – μ(#mismatches) – σ(#indels)
53
Scoring Matrices
To generalize scoring for DNA sequences, consider a (4+1) x (4+1) scoring matrix δ. We include a spot for the gap character “-”. For amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). This will simplify the algorithm as follows:
s_{i,j} = max {
    s_{i-1,j-1} + δ(v_i, w_j)
    s_{i-1,j} + δ(v_i, -)
    s_{i,j-1} + δ(-, w_j)
}
54
Making a Scoring Matrix
Scoring matrices for DNA are not as widely used.
We introduce two scoring matrices for amino acids, created based on biological evidence.
The goal in alignment is to identify the underlying similarities hidden by mutations.
Some mutations have little effect on the protein's function, so some penalties, δ(v_i, w_j), should be less harsh than others.
Two types of amino acid substitution matrices are in wide use:
PAM
BLOSUM
55
Scoring Matrix: Small Example
     A    R    N    K
A    5   -2   -1   -1
R    -    7   -1    3
N    -    -    7    0
K    -    -    -    6
While R (Arginine) and K (Lysine) are different amino acids, they have a positive score. Why? They are both positively charged amino acids, and a substitution will not greatly change the function of the protein.
AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
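The total of 11 can be reproduced by looking up each column of the alignment. A Python 3 sketch; the dictionary values are an assumption, taken to be the BLOSUM50 entries for A, R, N and K, and `delta` and `score_ungapped` are names introduced here.

```python
# Assumed scores for four residues (upper triangle only).
SCORES = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}

def delta(a, b):
    """Symmetric lookup: delta(a, b) == delta(b, a)."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def score_ungapped(v, w):
    """Sum the column scores of two equal-length, gap-free rows."""
    return sum(delta(a, b) for a, b in zip(v, w))

print(score_ungapped("AKRANR", "KAAANK"))
```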
56
The Blosum50 Scoring Matrix
57
Attributes
Hydrophobic / Hydrophilic
Hydrophobic: repelled by water
Hydrophilic: water soluble
Aromatic / Aliphatic: classification of carbon and hydrogen molecules
Aromatic: contain rings, such as benzene rings
Aliphatic: do not contain such rings
Polar: separation of electric charge leading to an electric dipole, or separation of positive and negative charges
The Cell: A Molecular Approach, Geoffrey Cooper
58
Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue
Polar to polar: aspartate to glutamate
Nonpolar to nonpolar: alanine to valine
Similarly behaving residues: leucine to isoleucine
60
Assumptions See Apostolico and Giancarlo:
Sequence Alignment in Molecular Biology Today we will give a fixed amount for match, charge a fixed amount for mismatch, etc That is, all parts of the string are equally important Possible to construct alternative models: HMM to recognize and align sequences
61
Scoring Matrices
We wish to compare sequences
We have been matching single symbols
We have seen scoring matrices for Amino Acids
We could bin symbols into other groups: Hydrophobic/Hydrophilic, Pyrimidines/Purines
We could compare pairs of symbols
All are valid ways to score an alignment
62
Scoring Matrices Let's make some assumptions about scores
Additive – for ease of computation Positive is good Negative is bad Next slide: Must have at least one positive score Expected score must be negative
63
Scoring Matrices Assumptions
String S made of letters a_1 to a_n
Odds of seeing a_i are p_i
Score for alignment of a_i with a_j is s_ij
We assume some s_ij is positive, or best matches would have length 1
We assume expected value of score is negative, or matches would run on forever
64
Scoring Matrices Assumptions
String S made of letters a_1 to a_n
Odds of seeing a_i are p_i
Score for alignment of a_i with a_j is s_ij
We assume some s_ij is positive
We assume expected value of score is negative
How to compare scores from different matrices? Are matches from 2M twice those from M?
We will find a way to normalize a matrix
65
Altschul-Dembo-Karlin statistics
Threshold: identifies the smallest segment score that is unlikely to happen by chance
The number of matches scoring above θ has mean E(θ) = K m n e^(-λθ); K is a constant, m and n are the lengths of the two compared sequences
Parameter λ is the positive root of: sum over i, j of p_i p_j e^(λ s_ij) = 1
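The parameter can be found numerically. A Python 3 sketch, assuming the usual Karlin-Altschul condition sum_ij p_i p_j exp(lambda * s_ij) = 1 and a toy DNA model (uniform frequencies, match +1, mismatch -1) that is not taken from the slides; for that toy model the root is ln 3. `solve_lambda` is a hypothetical helper name.

```python
import math

def solve_lambda(p, s, lo=1e-6, hi=10.0, tol=1e-12):
    """Bisection for the positive root of sum_ij p[i]*p[j]*exp(lam*s[i][j]) = 1."""
    n = len(p)

    def f(lam):
        total = sum(p[i] * p[j] * math.exp(lam * s[i][j])
                    for i in range(n) for j in range(n))
        return total - 1.0

    # With a negative expected score, f is negative just above 0
    # and positive for large lam, so the root lies in between.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

p = [0.25] * 4
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam = solve_lambda(p, s)
print(lam, math.log(3))
```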
66
Background Frequency
Some residues appear more often than others
Notation: p_i is the probability of the ith residue
We may compute genome wide frequency
We could also look at local frequency
Let q_ij represent the odds of aligning the ith residue with the jth
Should we give a high score or a low score when aligning with a rare residue?
67
Assumptions Should have at least one positive entry, or best scores will have length 1 Expected value of a score (aka Average score) should be negative, or we will tend to have very long matches Entries in scoring matrix should look like this
68
Why log? Probabilities should be multiplied P(AB) = P(A)P(B)
Odds of match in first place times odds of match in second place… But we want to get a score by adding Log turns products into sums
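A quick check that logs turn products into sums, followed by a log-odds score. This Python 3 sketch assumes the common log-odds form log2(q_ij / (p_i p_j)) with the q_ij and p_i of the Background Frequency slide; the base-2 logarithm and the name `log_odds` are choices made here, not taken from the slides.

```python
import math

p_first, p_second = 0.25, 0.25
joint = p_first * p_second
# log of a product equals the sum of the logs
print(math.isclose(math.log2(joint),
                   math.log2(p_first) + math.log2(p_second)))

def log_odds(q_ij, p_i, p_j):
    """Positive when the pair is seen more often than chance predicts."""
    return math.log2(q_ij / (p_i * p_j))

print(log_odds(0.125, 0.25, 0.25))    # pair seen twice as often as chance
```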
69
PAM
PAM (Dayhoff et al., 1978): the original substitution matrix
Compare closely related species
Use Global Alignment
Point Accepted Mutation
1 PAM = PAM1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
Some residues may have mutated several times
Some residues may have returned to their original state
Some residues may not have changed at all
70
PAMX
PAM250 = PAM1^250
PAMx = PAM1^x
PAM250 is a widely used scoring matrix
One positive entry? Expected value negative?
71
PAM250 is a widely used scoring matrix:
72
BLOSUM Blocks Substitution Matrix (Henikoff & Henikoff, 1992)
Scores derived from observations of the frequencies of substitutions in Blocks of local alignments (no gaps) In distantly related proteins Weights contributions – don’t count similar species twice Matrix name indicates evolutionary distance BLOSUM62 was created using sequences sharing no more than 62% identity
73
Recap PAM vs BLOSUM
The PAM matrix was based on alignment of sequences from closely related species. It includes conserved and mutable regions.
BLOSUM is based on highly conserved regions without gaps (Blocks) from distantly related species.
In PAM, higher numbers are used to align more distant sequences. Thus PAM250 is PAM1^250.
Higher BLOSUM numbers are for closer matches. BLOSUM62 is used for closer sequences than BLOSUM50.
74
What does 62% mean?
Percent identity: the extent to which two nucleotide or amino acid sequences are invariant
A C C T G A G – A G
A C G T G – G C A G
One mismatch (column 3) and two indels (columns 6 and 8): 7 of 10 columns agree, so the sequences are 70% identical
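The 70% figure can be computed column by column. A Python 3 sketch, assuming identity is measured over all columns of the alignment (other conventions exist) and writing the gap as a plain hyphen; `percent_identity` is a name introduced here.

```python
def percent_identity(row1, row2, gap="-"):
    """Percentage of alignment columns holding the same residue."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    same = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return 100.0 * same / len(row1)

print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```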
75
The Blosum50 Scoring Matrix
One positive entry? Expected value negative?
76
How do we use these? Alignment programs will give you a choice:
77
Blast Similarity Search
I live on a street that is one long block. How many digits do I need for the house number? If I lived on Commonwealth Avenue How many digits would I need for my house number? Log(N) bits for street with N houses I am searching for a DNA string of length 6 in EColi I have found an exact match. Is it significant?
78
Some Scoring Issues in BLAST
Scoring Matrices – how to pick? How did I set the gap cost? What does the score mean? Is this score relevant? As DB grows, expected value of max match grows Is the Database current? Is the Database redundant? Am I matching a low complexity region?
79
Outline
Homework
Introduce Scoring Matrices: some mismatches are better than others
Solve Alignment with Affine Gap Penalties: continuing a deletion is more common than starting one
Linear Space Alignment
Multiple Alignment
80
Gap Penalties
Currently, a fixed penalty σ is given to every indel:
-σ for 1 indel
-2σ for 2 consecutive indels
-3σ for 3 consecutive indels, etc.
This is called a "linear gap penalty": linear in the size of the gap
This is too severe a penalty for a series of 100 consecutive indels
81
Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single nucleotide events
Our current scoring gives the same score to both alignments:
ATA__GC
ATATTGC
This is more likely.
ATAG_GC
AT_GTGC
This is less likely.
82
Accounting for Gaps
Gap: a contiguous sequence of spaces in one of the rows
Score for a gap of length x is -(ρ + σx), where ρ > 0 is the penalty for introducing a gap (gap opening penalty) and σ is the penalty for each position in the gap (gap extension penalty)
ρ will be large relative to σ: don't add as much of a penalty for extending the gap.
83
Affine Transformation
Affine Transformation Linear Transformation
84
Affine Gap Penalties
Gap penalties:
-ρ-σ when there is 1 indel
-ρ-2σ when there are 2 indels in a row
-ρ-3σ when there are 3 indels in a row, etc.
-ρ-x·σ in general (one gap opening and x gap extensions)
New penalties for runs of horizontal or vertical edges
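The difference between the two schemes shows up on the pair of alignments from the earlier slide. A Python 3 sketch, with ρ = 4 and σ = 1 chosen arbitrarily for illustration; `gap_runs`, `linear_gap_score` and `affine_gap_score` are names introduced here.

```python
def gap_runs(row, gap="_"):
    """Lengths of the maximal runs of gap characters in one row."""
    runs, run = [], 0
    for ch in row:
        if ch == gap:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

def linear_gap_score(rows, sigma):
    return sum(-sigma * x for row in rows for x in gap_runs(row))

def affine_gap_score(rows, rho, sigma):
    return sum(-(rho + sigma * x) for row in rows for x in gap_runs(row))

one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")
for rows in (one_long_gap, two_short_gaps):
    print(rows, linear_gap_score(rows, 1), affine_gap_score(rows, 4, 1))
```

Under the linear scheme both alignments pay the same gap penalty; under the affine scheme the single long gap is penalized less than two separate gaps.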
85
Affine Gap Penalties and Edit Graph
To reflect affine gap penalties we have to add “long” horizontal and vertical edges to the edit graph. Each such edge of length x should have weight -ρ - x·σ
86
Adding “Affine Penalty” Edges to the Edit Graph
There are many such edges! Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the number of vertices). So the complexity increases from O(n^2) to O(n^3). Can we do better?
88
The 3-leveled Manhattan Grid
Gaps in w Matches/Mismatches Gaps in v
89
Manhattan in 3 Layers ρ δ δ σ δ ρ δ δ σ
90
Switching between 3 Layers
Levels:
The main level is for diagonal edges
The lower level is for horizontal edges
The upper level is for vertical edges
A jumping penalty (-ρ) is assigned to moving from the main level to either the upper level or the lower level
There is a gap extension penalty (-σ) for each continuation on a level other than the main level
91
Affine Gap Penalties and 3 Layer Manhattan Grid
The three recurrences for the scoring algorithm create a 3-layered graph.
The top level creates/extends gaps in the sequence w.
The bottom level creates/extends gaps in sequence v.
The middle level extends matches and mismatches.
92
Affine Gap Penalties and 3 Layer Manhattan Grid
As stated, there is no way to get directly from an insertion to a deletion
93
Affine Gap Penalty Recurrences
s↓_{i,j} = max {
    s↓_{i-1,j} - σ            Continue gap in w (deletion)
    s_{i-1,j} - (ρ+σ)         Start gap in w (deletion): from middle
}
s→_{i,j} = max {
    s→_{i,j-1} - σ            Continue gap in v (insertion)
    s_{i,j-1} - (ρ+σ)         Start gap in v (insertion): from middle
}
s_{i,j} = max {
    s_{i-1,j-1} + δ(v_i, w_j)    Match or mismatch
    s↓_{i,j}                     End deletion: from top
    s→_{i,j}                     End insertion: from bottom
}
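The three recurrences translate into three tables filled in lockstep. A Python 3 sketch, assuming match +1, mismatch -1, ρ = 2 and σ = 1 (values picked for illustration); the table names `down`, `right` and `mid` stand for the two gap levels and the middle level.

```python
NEG_INF = float("-inf")

def affine_alignment_score(v, w, match=1, mismatch=-1, rho=2, sigma=1):
    n, m = len(v), len(w)
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # gap in w
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # gap in v
    mid = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # match/mismatch
    mid[0][0] = 0
    for i in range(1, n + 1):          # left edge: one gap of length i
        down[i][0] = -(rho + sigma * i)
        mid[i][0] = down[i][0]
    for j in range(1, m + 1):          # top edge: one gap of length j
        right[0][j] = -(rho + sigma * j)
        mid[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # continue a gap (-sigma) or open one from the middle level
            down[i][j] = max(down[i - 1][j] - sigma,
                             mid[i - 1][j] - (rho + sigma))
            right[i][j] = max(right[i][j - 1] - sigma,
                              mid[i][j - 1] - (rho + sigma))
            step = match if v[i - 1] == w[j - 1] else mismatch
            mid[i][j] = max(mid[i - 1][j - 1] + step,
                            down[i][j], right[i][j])
    return mid[n][m]

print(affine_alignment_score("ATAGC", "ATATTGC"))
```

Each cell does a constant amount of work, so the running time stays O(nm) rather than growing by a factor of n.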
94
Outline
Homework
Introduce Scoring Matrices: some mismatches are better than others
Solve Alignment with Affine Gap Penalties: continuing a deletion is more common than starting one
Linear Space Alignment
Multiple Alignment
95
Divide and Conquer
Divide and Conquer is a general strategy: take a difficult problem and decompose it into two parts
96
Search for 6 How long would it take to find 6 in the list below? 5 8 3 2 5 9 6 1 5 8
97
Binary Search
How long would it take to find 6 in the list below?
5 8 3 2 5 9 6 1 5 8
What if we knew that the list was sorted?
1 2 3 5 5 5 6 8 8 9
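On the sorted list, each comparison discards half of what remains. A minimal Python 3 sketch; `binary_search` is a name introduced here and returns -1 when the target is absent.

```python
def binary_search(sorted_items, target):
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1      # target can only be in the upper half
        else:
            hi = mid - 1      # target can only be in the lower half
    return -1

items = sorted([5, 8, 3, 2, 5, 9, 6, 1, 5, 8])
print(items, binary_search(items, 6))
```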
98
Finding a root
Find where f(x) = x^2 - 2 crosses the axis.
[0, 2] - midpoint is 1. f(1) = 1 - 2 = -1, so interval [0, 1] is below the axis, and [1, 2] must cross
[1, 2] - midpoint is 3/2. f(3/2) = 9/4 - 2 = 1/4, so the curve crosses in region [1, 3/2]
[1, 1.5] - midpoint is 5/4. f(5/4) = 25/16 - 32/16 = -7/16 < 0, so the curve crosses in region [1.25, 1.5]
[1.25, 1.5] ….
At each stage of this process, we halve the interval holding the solution: this needs about 3 iterations per digit.
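The interval-halving steps can be run to convergence. A Python 3 sketch, assuming f(x) = x^2 - 2 as implied by the worked values, and assuming f is negative at the left end and positive at the right end; `bisect_root` is a name introduced here.

```python
def bisect_root(f, lo, hi, tol=1e-10):
    """Assumes f(lo) < 0 < f(hi); halves [lo, hi] until it is shorter than tol."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid      # the crossing is to the right of mid
        else:
            hi = mid      # the crossing is at or to the left of mid
    return (lo + hi) / 2.0

root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)
print(root)
```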
99
Computing Alignment Path Requires Quadratic Memory
Space complexity for computing the alignment path for sequences of length n and m is O(nm)
We need to keep all backtracking references in memory to reconstruct the path (backtracking)
[Figure: n x m table of backtracking references]
100
Divide and Conquer Approach
Path(source, sink)
    if (source & sink are in consecutive columns)
        output the longest path from source to sink
    else
        middle ← middle vertex between source & sink
        Path(source, middle)
        Path(middle, sink)
101
Divide and Conquer Approach to LCS
The only problem left is how to find this “middle vertex”!
102
Computing Alignment Score with Linear Memory
Space complexity of computing just the score itself is O(n)
We only need the previous column to calculate the current column
We can then throw away that previous column once we’re done using it
[Figure: two columns of height n, so 2n cells in memory]
103
Computing Alignment Score: Recycling Columns
Only two columns of scores are saved at any given time memory for column 1 is used to calculate column 3 memory for column 2 is used to calculate column 4
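Column recycling can be sketched with two lists that swap roles. A Python 3 sketch, assuming the simple scoring used earlier in the deck (match +1, mismatch -1, gap -1); `linear_space_score` is a name introduced here, and it returns only the score, not the path.

```python
def linear_space_score(v, w, match=1, mismatch=-1, gap=-1):
    # prev holds column j-1, cur holds column j; nothing older is kept.
    prev = [i * gap for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        cur = [j * gap] + [0] * len(v)
        for i in range(1, len(v) + 1):
            step = match if v[i - 1] == w[j - 1] else mismatch
            cur[i] = max(prev[i - 1] + step,   # diagonal
                         prev[i] + gap,        # from the previous column
                         cur[i - 1] + gap)     # from the cell above
        prev = cur
    return prev[len(v)]

print(linear_space_score("ATG", "ATC"))
```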
104
Computing Alignment Score with Linear Memory
Space complexity of computing just the score itself is O(n): we only need the previous column to calculate the current column
This computes the global score
It does not remember how we got here
105
Crossing the Middle Line
We want to calculate the longest path from (0,0) to (n,m) that passes through (i,m/2), where i ranges from 0 to n and represents the i-th row
Define length(i) as the length of the longest path from (0,0) to (n,m) that passes through vertex (i, m/2)
[Figure: n x m grid with middle column m/2; Prefix(i) runs from (0,0) to (i, m/2) and Suffix(i) from (i, m/2) to (n,m)]
106
Crossing the Middle Line
Define (mid,m/2) as the vertex where the longest path crosses the middle column.
length(mid) = optimal length = max over 0 ≤ i ≤ n of length(i)
[Figure: n x m grid with middle column m/2, Prefix(i) and Suffix(i)]
107
Computing Prefix(i)
prefix(i) is the length of the longest path from (0,0) to (i,m/2)
Compute prefix(i) by dynamic programming in the left half of the matrix, and store the prefix(i) column (column m/2)
108
Computing Suffix(i)
suffix(i) is the length of the longest path from (i,m/2) to (n,m)
Equivalently, suffix(i) is the length of the longest path from (n,m) to (i,m/2) with all edges reversed
Compute suffix(i) by dynamic programming in the right half of the “reversed” matrix, and store the suffix(i) column (column m/2)
109
Length(i) = Prefix(i) + Suffix(i)
Add prefix(i) and suffix(i) to compute length(i): length(i) = prefix(i) + suffix(i)
You now have a middle vertex of the maximum path, (i,m/2), at the i that maximizes length(i)
[Figure: middle point found in column m/2]
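The prefix and suffix columns can each be computed in linear space and then added. A Python 3 sketch for the LCS version of the problem (path length = number of matches), with rows indexed by v and columns by w; `lcs_last_column` and `middle_vertex` are names introduced here.

```python
def lcs_last_column(v, w):
    """col[i] = LCS length of v[:i] and all of w, keeping one column at a time."""
    prev = [0] * (len(v) + 1)
    for ch in w:
        cur = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            if v[i - 1] == ch:
                cur[i] = prev[i - 1] + 1
            else:
                cur[i] = max(prev[i], cur[i - 1])
        prev = cur
    return prev

def middle_vertex(v, w):
    n, half = len(v), len(w) // 2
    prefix = lcs_last_column(v, w[:half])
    # suffix(i) uses the reversed strings: the last n-i letters of v
    # against the right half of w.
    rev = lcs_last_column(v[::-1], w[half:][::-1])
    suffix = [rev[n - i] for i in range(n + 1)]
    length = [prefix[i] + suffix[i] for i in range(n + 1)]
    best_i = max(range(n + 1), key=lambda i: length[i])
    return best_i, half, length[best_i]

print(middle_vertex("ATCTGAT", "TGCATA"))
```

The third value returned is the optimal path length, so it can be checked against a plain LCS computation.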
110
Finding the Middle Point
m/ m/ m/ m
111
Finding the Middle Point again
m/ m/ m/ m
112
And Again 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m
113
Time = Area: First Pass On first pass, the algorithm covers the entire area Area = nm
114
Time = Area: First Pass On first pass, the algorithm covers the entire area Area = nm Computing prefix(i) Computing suffix(i)
115
Time = Area: Second Pass
On second pass, the algorithm covers only 1/2 of the area Area/2
116
Time = Area: Third Pass On third pass, only 1/4th is covered. Area/4
117
Geometric Reduction At Each Iteration
1 + ½ + ¼ + ... + (½)^k ≤ 2
Runtime: O(Area) = O(nm)
first pass: 1, 2nd pass: 1/2, 3rd pass: 1/4, 4th pass: 1/8, 5th pass: 1/16
118
Outline
Homework
Introduce Scoring Matrices: some mismatches are better than others
Solve Alignment with Affine Gap Penalties: continuing a deletion is more common than starting one
Linear Space Alignment
Multiple Alignment
119
Multiple Alignment
Dynamic Programming in 3-D
Progressive Alignment
Profile Progressive Alignment (ClustalW)
Scoring Multiple Alignments: Entropy, Sum of Pairs Alignment
Partial Order Alignment (POA)
A-Bruijn (ABA) Approach to Multiple Alignment
120
Generalizing Pairwise Alignment
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal
Alignment of 2 sequences is represented as a 2-row matrix
Represent alignment of 3 sequences as a 3-row matrix:
A T _ G C G _
A _ C G T _ A
A T C A C _ A
Score: more conserved columns, better alignment
121
Aligning Three Sequences
Same strategy as aligning two sequences
Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
For global alignments, go from source to sink
[Figure: cube with source and sink at opposite corners]
122
2-D vs 3-D Alignment Grid V W 2-D edit graph 3-D edit graph
123
2-D cell versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square In 3-D, 7 edges in each unit cube
124
Alignment Paths
0 1 1 2 3 4   x coordinate
  A -- T G C
0 1 2 3 3 4   y coordinate
  A T G -- C
0 0 1 2 3 4   z coordinate
  -- A T G C
Resulting path in (x,y,z) space:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
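The path is obtained by advancing a coordinate whenever its row has a letter rather than a gap. A Python 3 sketch; the three rows used below (A-TGC, ATG-C, -ATGC) are an assumption chosen to be consistent with the path listed on the slide, and `alignment_path` is a name introduced here.

```python
def alignment_path(rows, gap="-"):
    """Turn the rows of an alignment into its path through the edit graph."""
    pos = [0] * len(rows)
    path = [tuple(pos)]
    for column in zip(*rows):
        for r, ch in enumerate(column):
            if ch != gap:
                pos[r] += 1      # this sequence consumed one letter
        path.append(tuple(pos))
    return path

print(alignment_path(["A-TGC", "ATG-C", "-ATGC"]))
```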
125
Multiple Alignment: Dynamic Programming
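s_{i,j,k} = max {
    s_{i-1,j-1,k-1} + δ(v_i, w_j, u_k)    cube diagonal: no indels
    s_{i-1,j-1,k} + δ(v_i, w_j, _)        face diagonal: one indel
    s_{i-1,j,k-1} + δ(v_i, _, u_k)        face diagonal: one indel
    s_{i,j-1,k-1} + δ(_, w_j, u_k)        face diagonal: one indel
    s_{i-1,j,k} + δ(v_i, _, _)            edge diagonal: two indels
    s_{i,j-1,k} + δ(_, w_j, _)            edge diagonal: two indels
    s_{i,j,k-1} + δ(_, _, u_k)            edge diagonal: two indels
}
δ(x, y, z) is an entry in the 3-D scoring matrix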
126
Architecture of 3-D Alignment Cell
(i-1,j,k-1) (i-1,j-1,k-1) (i-1,j-1,k) (i-1,j,k) (i,j,k-1) (i,j-1,k-1) (i,j,k) (i,j-1,k)
127
Multiple Alignment: Running Time
For 3 sequences of length n, the run time is 7n^3; O(n^3)
For k sequences, build a k-dimensional Manhattan, with run time (2^k - 1)(n^k); O(2^k n^k)
Conclusion: the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it is impractical due to exponential running time
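The factor 2^k - 1 is the number of predecessor cells each cell looks at: every sequence either advances or takes a gap, and at least one must advance. A Python 3 sketch; `predecessor_offsets` is a name introduced here.

```python
from itertools import product

def predecessor_offsets(k):
    """All 0/1 step vectors for k sequences, except the all-zero step."""
    return [step for step in product((0, 1), repeat=k) if any(step)]

for k in (2, 3, 4):
    print(k, len(predecessor_offsets(k)))
```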
128
Inferring Pairwise Alignments from Multiple Alignments
From a multiple alignment, we can infer pairwise alignments between all sequences, but they are not necessarily optimal We simply project a 3-D multiple alignment path on to a 2-D face of the cube
129
Multiple Alignment Projections
A 3-D alignment can be projected onto the 2-D plane to represent an alignment between a pair of sequences. All 3 Pairwise Projections of the Multiple Alignment
130
MA Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
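Projecting onto a pair of rows means keeping those two rows and dropping any column where both have a gap. A Python 3 sketch using the x, y, z rows above; `project` is a name introduced here.

```python
def project(row_a, row_b, gap="-"):
    """Pairwise alignment induced by two rows of a multiple alignment."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return ("".join(a for a, _ in kept), "".join(b for _, b in kept))

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for pair in (project(x, y), project(x, z), project(y, z)):
    print(pair)
```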
131
MA from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C    x: AC-GCTGG-C    y: AC-GC-GAG
y: ACGC--GAC    z: GCCGCA-GAG    z: GCCGCAGAG
Can we construct an MA that induces them?
132
MA from Pairwise Alignments
Can we construct an MA that induces three arbitrary pairwise alignments?
Not always: pairwise alignments may be inconsistent
133
Combining Optimal Pairwise Alignments into MA
Can combine pairwise alignments into multiple alignment Can not combine pairwise alignments into multiple alignment A < T < G < A
134
Inferring MA from Pairwise Alignments
From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal
It is difficult to infer a "good" multiple alignment from optimal pairwise alignments between all sequences
135
Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T
136
Profile Representation of Multiple Alignment
In the past we were aligning a sequence against a sequence
Can we align a sequence against a profile?
Can we align a profile against a profile?
137
Aligning alignments Given two alignments, can we align them?
x GGGCACTGCAT y GGTTACGTC-- Combined Alignment z GGGAACTGCAG w GGACGTACC-- v GGACCT-----
138
Aligning alignments Given two alignments, can we align them?
Hint: use alignment of corresponding profiles x GGGCACTGCAT y GGTTACGTC-- Combined Alignment z GGGAACTGCAG w GGACGTACC-- v GGACCT-----
139
Example
We have aligned three sequences x, y, z
Wish to align new sequence w
x AC-GT
y GC-AT
z ACCGT
w ACG
140
Example
We have aligned three sequences x, y, z
Wish to align new sequence w
Costs:
Exact match +2
Transitions +1 (A-G, C-T)
Transversion -1 (A or G to C or T)
Gap cost -3
Matching Gap +1
x AC-GT
y GC-AT
z ACCGT
w ACG
141
Example
Reduce alignment to a profile
x A C - G T
y G C - A T
z A C C G T
Column:  1    2    3    4    5
A       2/3   0    0   1/3   0
C        0    1   1/3   0    0
G       1/3   0    0   2/3   0
T        0    0    0    0    1
-        0    0   2/3   0    0
Costs: Exact match +2, Transitions +1 (A to G, C to T), Transversion -1 (A, G to C, T), Gap cost -3, Matching Gap +1
142
Example
Initialize the borders with the gap cost (-3 per step):
top row (profile columns 1 to 5): -3 -6 -9 -12 -15
left column (w = A, C, G): -3 -6 -9
143
Example
First cell (w = A against profile column 1): Up -6, Left -6, Diag ?
144
Example
First cell (w = A against profile column 1): Up -6, Left -6, Diag 0 + 2 x 2/3 + 1 x 1/3 = 5/3, so the entry is 5/3
145
Example
Second cell (A against column 2): Up -9, Left 5/3 - 3 = -4/3, Diag -3 - 1 = -4, so the entry is -4/3
Row A so far: -3 5/3 -4/3 ?
146
Example
Third cell (A against column 3, which is 2/3 gap and 1/3 C): Up -12, Left -4/3 + 1 x 2/3 - 3 x 1/3 = -5/3, Diag -6 - 1 x 1/3 - 3 x 2/3 = -25/3, so the entry is -5/3
Row A so far: -3 5/3 -4/3 -5/3
147
Example
Completed table (rows: w = A, C, G; columns: profile positions 1 to 5):
        -3     -6     -9    -12    -15
A  -3   5/3   -4/3   -5/3  -14/3  -23/3
C  -6  -4/3   11/3   10/3    1/3   -8/3
G  -9  -13/3   2/3    4/3     5      2
Resulting alignment:
x AC-GT
y GC-AT
z ACCGT
w AC-G-
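The whole table can be regenerated with exact fractions. A Python 3 sketch under stated assumptions: transversions cost -1, the borders are filled with a flat -3 per step as on the slides, a step to the right scores a gap in w against the profile column, and a step down costs the plain gap cost. All function names are introduced here for illustration.

```python
from fractions import Fraction

TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def pair_score(a, b):
    if a == "-" and b == "-":
        return 1                      # matching gap
    if a == "-" or b == "-":
        return -3                     # gap cost
    if a == b:
        return 2                      # exact match
    return 1 if frozenset((a, b)) in TRANSITIONS else -1

def build_profile(rows):
    """One dict of symbol frequencies per alignment column."""
    profile = []
    for column in zip(*rows):
        freqs = {}
        for ch in column:
            freqs[ch] = freqs.get(ch, 0) + Fraction(1, len(rows))
        profile.append(freqs)
    return profile

def column_score(ch, freqs):
    """Expected score of ch against one profile column."""
    return sum(f * pair_score(ch, c) for c, f in freqs.items())

def align_to_profile(w, profile):
    s = [[Fraction(0)] * (len(profile) + 1) for _ in range(len(w) + 1)]
    for j in range(1, len(profile) + 1):
        s[0][j] = Fraction(-3 * j)
    for i in range(1, len(w) + 1):
        s[i][0] = Fraction(-3 * i)
    for i in range(1, len(w) + 1):
        for j in range(1, len(profile) + 1):
            up = s[i - 1][j] - 3
            left = s[i][j - 1] + column_score("-", profile[j - 1])
            diag = s[i - 1][j - 1] + column_score(w[i - 1], profile[j - 1])
            s[i][j] = max(up, left, diag)
    return s

table = align_to_profile("ACG", build_profile(["AC-GT", "GC-AT", "ACCGT"]))
for row in table:
    print([str(x) for x in row])
```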
148
Multiple Alignment: Greedy Approach
Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
This is a heuristic greedy method
Before (k sequences):
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
After (k-1 sequences/profiles):
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
149
Greedy Approach: Example
Consider these 4 sequences
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
150
Greedy Approach: Example (cont’d)
There are C(4, 2) = 6 possible pairwise alignments
s2 GTCTGA
s4 GTCAGC (score = 2)
s1 GAT-TCA
s2 G-TCTGA (score = 1)
s1 GAT-TCA
s3 GATAT-T (score = 1)
s1 GATTCA--
s4 G-T-CAGC (score = 0)
s2 G-TCTGA
s3 GATAT-T (score = -1)
s3 GAT-ATT
s4 G-TCAGC (score = -1)
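The greedy choice can be reproduced by scoring all six pairs and taking the best. A Python 3 sketch, assuming match +1, mismatch -1 and indel -1 (a scheme consistent with the scores listed above); `global_score` is a name introduced here.

```python
from itertools import combinations

def global_score(v, w, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(w) + 1)]
    for i in range(1, len(v) + 1):
        cur = [i * gap] + [0] * len(w)
        for j in range(1, len(w) + 1):
            step = match if v[i - 1] == w[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[len(w)]

seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {(a, b): global_score(seqs[a], seqs[b])
          for a, b in combinations(sorted(seqs), 2)}
best_pair = max(scores, key=scores.get)
print(scores)
print(best_pair)
```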
151
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)
New set of 3 sequences:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
152
Iterative Alignment Allows you to modify the initial alignments as you add sequences Does not force you to live with your initial choices.
153
References
Apostolico A, Giancarlo R. Sequence alignment in molecular biology. J Comput Biol Summer; 5(2):
Dayhoff, M. O.; Schwartz, R. M.; Orcutt, B. C. (1978). "A model of evolutionary change in proteins". Atlas of Protein Sequence and Structure 5 (3): 345–352.
Henikoff, Steven; Henikoff, Jorja (1992). "Amino acid substitution matrices from protein blocks". Proceedings of the National Academy of Sciences of the United States of America 89 (22): 10915–9.
Altschul, SF (1991). "Amino acid substitution matrices from an information theoretic perspective". Journal of Molecular Biology 219 (3): 555–65.
154
Summary
We have been able to use the same Dynamic Programming Framework to address a number of problems:
Global Alignment (0, -1, -2, -3 on both axes)
Pattern Matching (0, -1, -2, -3 on one axis, 0 on the other)
Local Alignment (zeros everywhere: axes and interior)
We have been able to fold in the Affine Gap penalty as well.
We have seen ways to modify the matching score to provide more realistic matching scores: PAM and BLOSUM
We have taken a brief look at a hard problem: multiple alignment
We have seen a linear space algorithm for alignment