1
Sequence similarity
2
Motivation: Same gene, or similar gene? Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar substring of A, B. Longest similar substring of A, B..Z. For each: How big? How similar?
3
Define alignment Align these two sequences optimally GACGGATT GATCGGTT Define precisely what an alignment is
4
Dot plot Best path from UL to LR?
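A dot plot can be sketched in a few lines. This is a minimal illustration that marks every cell where the two letters are equal; the function name `dot_plot` and the `*` / `.` symbols are assumptions made for the example, not part of the slides.

```python
def dot_plot(s, t):
    """Return one text row per letter of s: '*' where s[i] == t[j], '.' elsewhere."""
    return ["".join("*" if a == b else "." for b in t) for a in s]


if __name__ == "__main__":
    # The two sequences from the "Define alignment" slide.
    for row in dot_plot("GACGGATT", "GATCGGTT"):
        print(row)
```

Runs of `*` along a diagonal correspond to stretches of matching letters; a path from the upper left to the lower right that follows such diagonals is a candidate alignment.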
5
Edit (Levenshtein) distance An alignment of sequences s,t can be created by a series of edit operations –Insert space in s opposite letter in t –Insert space in t opposite letter in s
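As a rough sketch, the Levenshtein distance can be computed row by row. The version below assumes unit cost for every operation, including direct substitution of one letter for another (the slide lists only the two space insertions); the name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs (assumed), keeping one row at a time."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # letter of s opposite a space
                curr[j - 1] + 1,         # letter of t opposite a space
                prev[j - 1] + (a != b),  # letter opposite letter
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("GACGGATT", "GATCGGTT"))
```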
6
Definition of alignment
Insert spaces so that the letters line up, or letters align with spaces:
GA-CGGATT
GATCGG-TT
Don’t allow spaces to line up. Allow spaces even at beginning and end:
GCAT-
-CATG
7
Define similarity Given an alignment, compute a similarity score. Three possibilities for each column: (i) letter-letter match, (ii) letter-letter mismatch, (iii) letter-space mismatch. (Can you transform ii into iii?)
8
Optimal alignment Create score function For example: +1 bonus for match -1 penalty for letter-space mismatch
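Scoring a given alignment column by column might look like the sketch below. The +1 match bonus and -1 letter-space penalty come from the slide; the -1 letter-letter mismatch penalty is an assumption, as are the function and parameter names.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, space=-1):
    """Sum column scores of an alignment given as two equal-length rows.

    '-' stands for a space. The mismatch value is an assumed default.
    """
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("two spaces may not line up")
        if a == "-" or b == "-":
            total += space      # letter-space column
        elif a == b:
            total += match      # letter-letter match
        else:
            total += mismatch   # letter-letter mismatch
    return total


if __name__ == "__main__":
    print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces
```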
9
Dynamic programming solution Given sequences s,t of length m,n Strategy: build up optimal alignment of prefixes Base case? Recurrence relation?
10
Recurrence Given opt alignment of prefixes of s,t shorter than i,j, find opt of s[1..i], t[1..j] Three possibilities: –extend s by a letter, t by a space –extend s by a letter, t by a letter –extend s by a space, t by a letter Choose the one with the best score
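The recurrence can be sketched as a table fill. Here `a[i][j]` holds the best score for aligning `s[:i]` with `t[:j]` (Python indexes from 0, the slide from 1); the scoring values are assumed defaults, with only the +1 match and -1 space taken from the earlier slide.

```python
def alignment_table(s, t, match=1, mismatch=-1, space=-1):
    """Fill the DP array of optimal prefix-alignment scores (assumed scoring)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base case: a prefix aligned against nothing but spaces.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,     # extend s by a letter, t by a space
                a[i - 1][j - 1] + pair,  # extend s by a letter, t by a letter
                a[i][j - 1] + space,     # extend s by a space, t by a letter
            )
    return a


if __name__ == "__main__":
    for row in alignment_table("AGC", "AAAC"):
        print(row)
```

The bottom-right cell is the score of an optimal alignment of the full sequences.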
11
Tiny instance -- AGC, AAAC [DP array figure; surviving entries: 0 -2 -3 -4 / -2 -3]
12
Some dp details What is a good order to fill the array? How do you recover the opt alignment? What do you do about ties? What is the space complexity of this algorithm? What is the time complexity of this algorithm?
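One possible answer to the recovery question is a traceback from the bottom-right cell, sketched below. It fills the array row by row, then walks back, breaking ties by preferring the letter-letter move; that tie rule, the scoring defaults, and the name `global_align` are all assumptions for illustration.

```python
def global_align(s, t, match=1, mismatch=-1, space=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):          # row-major fill: every cell's three
        for j in range(1, n + 1):      # neighbours are already computed
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + pair,
                          a[i][j - 1] + space)

    out_s, out_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + pair:
            out_s.append(s[i - 1]); out_t.append(t[j - 1])   # letter-letter
            i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            out_s.append(s[i - 1]); out_t.append("-")        # space in t
            i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1])        # space in s
            j -= 1
    return a[m][n], "".join(reversed(out_s)), "".join(reversed(out_t))


if __name__ == "__main__":
    score, top, bottom = global_align("GACGGATT", "GATCGGTT")
    print(score)
    print(top)
    print(bottom)
```

This version uses time and space proportional to m*n. A different tie rule would return a different alignment with the same score.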
13
The gap penalty Model above assumes two gaps of size 1 are equivalent to one gap of size 2 Is this realistic? Why or why not?
14
General gap penalties
Alignments can no longer be scored as the sum of their parts. They still are the sum of blocks with one matched letter or one gap each. Blocks are: matched letters, s-gap, t-gap.
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
15
DP for general gaps Requires three arrays, one for each block type Time complexity is cubic This is expensive at best, prohibitive for large problems
16
Affine gap penalty Charge h for each gap, plus g * (len(gap)) This still has quadratic complexity!
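A sketch of why the affine case stays quadratic: keep three arrays, one per kind of last column, so that opening a gap (cost h + g) and extending one (cost g) can be told apart. The array names `M`, `X`, `Y`, the default values of `h` and `g`, and the letter scores are assumptions; the slide fixes only the form h + g * len(gap).

```python
NEG_INF = float("-inf")


def affine_align_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Best global alignment score where a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    # M: last column is letter-letter; X: letter of s over a space;
    # Y: space over a letter of t.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(h + g * i)
    for j in range(1, n + 1):
        Y[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - (h + g),   # open a new gap
                          Y[i - 1][j] - (h + g),
                          X[i - 1][j] - g)         # extend the current gap
            Y[i][j] = max(M[i][j - 1] - (h + g),
                          X[i][j - 1] - (h + g),
                          Y[i][j - 1] - g)
    return max(M[m][n], X[m][n], Y[m][n])


if __name__ == "__main__":
    print(affine_align_score("ACGT", "AT"))
```

Each cell is computed from a constant number of neighbours, so the work is proportional to m*n. Setting `h=0` recovers the simple per-space penalty.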
17
Point accepted mutations Some mutations are more likely than others In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) A point accepted mutation matrix is a table with the probability of each transition in a fixed time
18
PAM matrices The entire matrix sums to 1 A ‘unit of evolution’ is time in which 1/100 amino acids is expected to change
19
Scoring matrix Consider aligned letters a,b Pr(b is a mutation of a) = M_ab Pr(b is a random occurrence) = p_b Score(a,b) = 10 log(M_ab / p_b)
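The score formula can be sketched directly. The sketch assumes the logarithm is base 10 (the slide does not say), and the probabilities in the example are made-up numbers chosen only to show the sign of the result.

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10(M_ab / p_b); base 10 is an assumption."""
    return 10 * math.log10(m_ab / p_b)


if __name__ == "__main__":
    # Hypothetical values, not taken from any real PAM table.
    print(log_odds_score(0.10, 0.01))  # mutation far likelier than chance: positive
    print(log_odds_score(0.01, 0.10))  # mutation rarer than chance: negative
```

A positive score means the pair appears more often through mutation than by chance, a negative score the opposite, and zero means no evidence either way.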
20
Blast Basic Local Alignment Search Tool Def: ‘segment’ is a subsequence (without gaps) Def: ‘segment pair’ is two segments of equal length Rem: the score of a segment pair is the sum of its aligned letters
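Under these definitions, the score of a segment pair is a plain sum over aligned letters. The sketch below takes the letter-scoring function as a parameter; `toy_score` is a hypothetical stand-in for a real scoring matrix lookup.

```python
def segment_pair_score(seg_a, seg_b, score):
    """Sum score(a, b) over the aligned letters of two equal-length segments."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments in a pair must have equal length")
    return sum(score(a, b) for a, b in zip(seg_a, seg_b))


def toy_score(a, b):
    """Hypothetical letter score: +1 for identical letters, -1 otherwise."""
    return 1 if a == b else -1


if __name__ == "__main__":
    print(segment_pair_score("GACG", "GATG", toy_score))
```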
21
What Blast does Input: –a PAM matrix –a database of sequences B –a query sequence A –a threshold S Output: –all segment pairs (A,B) with score > S
22
How Blast works Compile short, high-scoring strings (words) Search for hits -- each hit gives a seed Extend seeds
23
Blast on proteins Words are w-mers which score at least T against A Use hashing or dfa to search for hits Extend seed until heuristically determined limit is reached
24
Blast on nucleic acids Words are w-mers in query A Letters compressed, four to byte Filter database B for very common words to avoid false positives Extend seeds as in proteins
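The seed-and-extend idea for nucleic acids can be illustrated with a toy sketch. It is not how Blast is actually implemented: there is no letter compression or common-word filtering here, the +1 / -1 letter scores are assumed, and the stopping rule (give up once the running score falls `drop` below the best seen) is one assumed form of the "heuristically determined limit". All function names are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Return (i, j) pairs where query[i:i+w] == subject[j:j+w]."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)      # hash every w-mer of the query
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def extend_seed(query, subject, i, j, w, match=1, mismatch=-1, drop=2):
    """Extend a hit without gaps; return (score, q_start, q_end, s_start)."""
    best = w * match
    q_start, q_end = i, i + w                # best segment is query[q_start:q_end]
    run, qi, sj = best, i + w, j + w         # extend to the right
    while qi < len(query) and sj < len(subject) and best - run < drop:
        run += match if query[qi] == subject[sj] else mismatch
        qi += 1; sj += 1
        if run > best:
            best, q_end = run, qi
    run, qi, sj = best, i - 1, j - 1         # extend to the left
    while qi >= 0 and sj >= 0 and best - run < drop:
        run += match if query[qi] == subject[sj] else mismatch
        if run > best:
            best, q_start = run, qi
        qi -= 1; sj -= 1
    return best, q_start, q_end, q_start + (j - i)


if __name__ == "__main__":
    query, subject = "TTGACGTAAC", "CCGACGTAGG"
    for i, j in find_seeds(query, subject, 4):
        print((i, j), extend_seed(query, subject, i, j, 4))
```

In this example all three hits lie on the same diagonal and extend to the same segment pair, which a real tool would report once.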
25
What does Blast give you? Efficiency A rigorous statistical theory which gives the probability of a segment pair occurring by chance