1
Sequence similarity (II)
2
Schedule
Mar 23: midterm assigned; alignment
Mar 30: midterm due; prot struct/drugs
April 6: teams assigned; prot struct/drugs
April 13: RNA structure
April 20: RNA structure
April 27: team rpts
May 4: team rpts
May 11: final (in class)
3
General gap penalties
Alignments can no longer be scored as the sum of their parts (column by column)
They are still the sum of their blocks, where each block holds one matched pair of letters or one whole gap
Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
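The block decomposition above can be sketched in a few lines. This is only an illustration under stated assumptions: the top row is taken to be s, an "s-gap" is read as a run of gap characters in the s row, and the scoring values (`pair`, `w`) are hypothetical, not taken from the slides.

```python
from itertools import groupby

def blocks(s_row, t_row):
    """Split a gapped alignment into (kind, length) blocks."""
    if len(s_row) != len(t_row):
        raise ValueError("rows must have equal length")
    kinds = []
    for a, b in zip(s_row, t_row):
        if a == "-" and b == "-":
            raise ValueError("column with two gap characters")
        kinds.append("s-gap" if a == "-" else "t-gap" if b == "-" else "match")
    out = []
    for kind, run in groupby(kinds):
        n = len(list(run))
        if kind == "match":
            out.extend([("match", 1)] * n)  # one block per matched pair
        else:
            out.append((kind, n))           # a whole gap is a single block
    return out

def block_score(s_row, t_row, pair_score, w):
    """Sum the block scores: pair scores for matches, -w(k) for a gap of length k."""
    total, col = 0, 0
    for kind, n in blocks(s_row, t_row):
        if kind == "match":
            total += pair_score(s_row[col], t_row[col])
        else:
            total -= w(n)
        col += n
    return total

# Hypothetical scoring, chosen only for the example.
pair = lambda a, b: 1 if a == b else -1
w = lambda k: 2 + k

S_ROW = "AAC---AGATAAC"
T_ROW = "ACTCGGT---AAT"
print(blocks(S_ROW, T_ROW))
print(block_score(S_ROW, T_ROW, pair, w))
```

The gap penalty `w` can be any function of the gap length, which is exactly why the score is additive over blocks but not over columns.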
4
Smith-Waterman – local alignment
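As a reminder of how local alignment works, here is a minimal Smith-Waterman sketch. It assumes a linear gap penalty and simple match/mismatch scores; the parameter values are illustrative defaults, not values from the course.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The 0 option lets an alignment start anywhere (local, not global).
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("TTTACGTTT", "GGGACGTCC"))
```

Both the table fill and the traceback are bounded by the table size, so time and space are proportional to len(s) * len(t).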
5
DP for general gaps
Requires three arrays, one for each block type
Time complexity is cubic
This is expensive at best, prohibitive for large problems
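A possible sketch of the three-array recurrence for global alignment follows. The array names `a`, `b`, `c` and the function signature are assumptions made for this example; `w` may be any gap penalty function. The inner maximum over the gap length k is what makes the running time cubic.

```python
NEG = float("-inf")

def general_gap_align(s, t, pair_score, w):
    """Best global alignment score when a gap of length k costs w(k)."""
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: matched letters
    b = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: gap in s
    c = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: gap in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)
    for i in range(1, m + 1):
        c[i][0] = -w(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = pair_score(s[i - 1], t[j - 1]) + max(
                a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # A gap block may not follow a gap block of the same type,
            # so b looks only at a and c, and c only at a and b.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])

pair = lambda x, y: 1 if x == y else -1    # hypothetical pair scores
flat = lambda k: 3                         # hypothetical: any gap costs 3, whatever its length
print(general_gap_align("AAAAGGGGTTTT", "AAAATTTT", pair, flat))
```

With the length-independent penalty `flat`, the best alignment matches all eight letters of the shorter string and pays once for a single gap of length four.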
6
Affine gap penalty
Charge h for opening each gap, plus g for each gap character: a gap of length k costs h + g * k
The DP still has quadratic complexity!
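To see why affine penalties keep the DP quadratic, here is a minimal sketch. It again uses three arrays (the names `a`, `b`, `c` are assumptions for this example), but a gap is either opened at cost h + g or extended at cost g, so no loop over gap lengths is needed.

```python
NEG = float("-inf")

def affine_align(s, t, pair_score, h, g):
    """Best global alignment score when a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: matched letters
    b = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: gap in s
    c = [[NEG] * (n + 1) for _ in range(m + 1)]  # last block: gap in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = pair_score(s[i - 1], t[j - 1]) + max(
                a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Open a new gap (pay h + g) or extend the current one (pay g).
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])

pair = lambda x, y: 1 if x == y else -1    # hypothetical pair scores
print(affine_align("AAAAGGGGTTTT", "AAAATTTT", pair, h=2, g=1))
```

Each cell is filled in constant time, so the total work is proportional to len(s) * len(t).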
7
Point accepted mutations
Some mutations are more likely than others
In proteins, some amino acids are more similar than others (size, charge, hydrophobicity)
A point accepted mutation matrix is a table with the probability of each transition in a fixed time
8
PAM matrices
Each row sums to 1: row a holds the probabilities that a becomes each amino acid (or stays a)
A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change
9
Scoring matrix
Consider aligned letters a, b
Pr(b is a mutation of a) = M_ab
Pr(b is a random occurrence) = p_b
Score(a,b) = 10 log(M_ab / p_b)
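The log-odds score can be computed directly from the two probabilities. The sketch below assumes a base-10 logarithm (the slide does not state the base) and uses a made-up two-letter alphabet with invented probabilities, purely to show the arithmetic.

```python
import math

def log_odds(M, p):
    """Score(a,b) = 10 * log10(M[a][b] / p[b]) for every pair (a, b)."""
    return {a: {b: 10 * math.log10(M[a][b] / p[b]) for b in p} for a in M}

# Toy data (assumption): M[a][b] = Pr(a mutates to b), p[b] = background frequency of b.
M = {"X": {"X": 0.9, "Y": 0.1},
     "Y": {"X": 0.3, "Y": 0.7}}
p = {"X": 0.75, "Y": 0.25}

S = log_odds(M, p)
for a in S:
    print(a, {b: round(v, 2) for b, v in S[a].items()})
```

A pair scores above zero when b follows a more often than chance would predict, and below zero when it does so less often.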
10
Blast
Basic Local Alignment Search Tool
Def: ‘segment’ is a subsequence (without gaps)
Def: ‘segment pair’ is two segments of equal length
Rem: the score of a segment pair is the sum of the scores of its aligned letters
11
What Blast does
Input:
–a PAM matrix
–a database of sequences B
–a query sequence A
–a threshold S
Output:
–all segment pairs (A,B) with score > S
12
How Blast works
Compile a list of short, high-scoring strings (words)
Search for hits; each hit gives a seed
Extend seeds
13
Z-scores
Given an alignment of A, B, how significant is it?
Permute A many times
Align each permutation with B
Collect the scores
Z-score = (score - mean) / standard deviation
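The permutation procedure can be sketched as follows. The alignment scorer here is a plain global alignment with a linear gap penalty, used only as a stand-in; the function names, the number of trials, and the fixed random seed are all assumptions for the example.

```python
import random
import statistics

def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty (stand-in scorer)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def z_score(a, b, score_fn=global_score, trials=200, seed=0):
    """(observed score - mean of permuted scores) / their standard deviation."""
    rng = random.Random(seed)
    observed = score_fn(a, b)
    letters = list(a)
    samples = []
    for _ in range(trials):
        rng.shuffle(letters)               # permute A, keeping its composition
        samples.append(score_fn("".join(letters), b))
    sd = statistics.stdev(samples)
    if sd == 0:
        raise ValueError("permuted scores do not vary; Z-score undefined")
    return (observed - statistics.mean(samples)) / sd

seq = "ACGTTGCAAGCTTAGCCATG"
print(z_score(seq, seq))
```

Shuffling keeps the letter composition of A fixed, so the Z-score measures how much of the score is due to the order of the letters and not just their frequencies.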
14
Blast on proteins
Words are w-mers which score at least T against some w-mer of A
Use hashing or a DFA to search for hits
Extend each seed until a heuristically determined limit is reached
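The three steps for proteins can be illustrated with a toy sketch. It is not the real Blast implementation: `toy_score` stands in for a PAM matrix entry, the alphabet is reduced to keep the word list small, hits are found with a dictionary lookup, and the stopping rule (give up once the running score falls more than X below the best seen) is one assumed form of the heuristic limit.

```python
from collections import defaultdict
from itertools import product

def toy_score(a, b):
    return 4 if a == b else -1             # assumption: stand-in for a PAM matrix entry

def neighborhood(query, w, T, alphabet, score=toy_score):
    """Map every word scoring >= T against some query w-mer to the query offsets."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(x, y) for x, y in zip(qword, cand)) >= T:
                words["".join(cand)].append(i)
    return words

def find_seeds(words, subject, w):
    """Hash lookup of every subject w-mer; each hit (i, j) is a seed."""
    return [(i, j) for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], ())]

def extend(query, subject, i, j, w, X, score=toy_score):
    """Ungapped extension of the seed in both directions with drop-off X."""
    total = sum(score(query[i + k], subject[j + k]) for k in range(w))
    best = run = right = 0
    k = w
    while i + k < len(query) and j + k < len(subject):
        run += score(query[i + k], subject[j + k])
        if run > best:
            best, right = run, k - w + 1
        if best - run > X:
            break
        k += 1
    total += best
    best = run = left = 0
    k = 1
    while i - k >= 0 and j - k >= 0:
        run += score(query[i - k], subject[j - k])
        if run > best:
            best, left = run, k
        if best - run > X:
            break
        k += 1
    total += best
    return total, (i - left, j - left), w + left + right

query, subject = "KTAYIAK", "GGKTAYLAKGG"
words = neighborhood(query, w=3, T=7, alphabet="AGIKLTY")
seeds = find_seeds(words, subject, w=3)
print(seeds)
print(extend(query, subject, *seeds[0], w=3, X=3))
```

Lowering T enlarges the word list and finds more seeds at the cost of more extensions; raising it does the opposite.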
15
Blast on nucleic acids
Words are the w-mers in query A
Letters are compressed, four to a byte
Filter database B for very common words to avoid false positives
Extend seeds as in proteins
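Two of these ideas are easy to show in code: packing four letters into a byte (2 bits each) and looking up exact query words while skipping very common ones. This is a sketch under assumptions: the bit layout, the function names, and the `skip` set are invented for the example, and the scan runs over plain strings for readability where a real tool would scan the packed bytes.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into bytes, four letters per byte, first letter in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for k, ch in enumerate(seq):
        out[k // 4] |= CODE[ch] << (6 - 2 * (k % 4))
    return bytes(out)

def unpack(data, length):
    """Inverse of pack; the length is needed because the last byte may be padded."""
    return "".join("ACGT"[(data[k // 4] >> (6 - 2 * (k % 4))) & 3] for k in range(length))

def query_words(query, w):
    """Index every w-mer of the query by its offsets."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index

def find_hits(query, subject, w, skip=frozenset()):
    """Exact w-mer hits (i, j), ignoring words listed in skip as too common."""
    index = query_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        word = subject[j:j + w]
        if word in skip:
            continue
        hits.extend((i, j) for i in index.get(word, ()))
    return hits

packed = pack("ACGTA")
print(list(packed), unpack(packed, 5))
print(find_hits("ACGTAC", "TTACGTTT", w=4))
```

Packing cuts the database to a quarter of its one-byte-per-letter size, which matters because the scan touches every position of B.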
16
What does Blast give you?
Efficiency
A rigorous statistical theory which gives the probability of a segment pair occurring by chance