Sequence similarity (II)


1 Sequence similarity (II)

2 Schedule
Mar 23: midterm assigned; alignment
Mar 30: midterm due; prot struct/drugs
April 6: teams assigned; prot struct/drugs
April 13: RNA structure
April 20: RNA structure
April 27: team rpts
May 4: team rpts
May 11: final (in class)

3 General gap penalties
Alignments can no longer be scored as the sum of their parts (individual columns).
They are still the sum of blocks, each holding one matched pair of letters or one gap.
Blocks are: matched letters, s-gap, t-gap.
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
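The block decomposition on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the +1/-1 letter scores, and the penalty w(k) = 2 + k are assumptions made up for the example, and "s-gap" is read here as a run of dashes in the s row.

```python
def split_blocks(s_row, t_row):
    """Split two aligned rows into blocks.

    Returns a list of ('match', a, b), ('s-gap', k) or ('t-gap', k).
    Assumption: 's-gap' means a run of k dashes in the s row.
    """
    blocks, i, n = [], 0, len(s_row)
    while i < n:
        if s_row[i] == "-":
            j = i
            while j < n and s_row[j] == "-":
                j += 1
            blocks.append(("s-gap", j - i))
            i = j
        elif t_row[i] == "-":
            j = i
            while j < n and t_row[j] == "-":
                j += 1
            blocks.append(("t-gap", j - i))
            i = j
        else:
            blocks.append(("match", s_row[i], t_row[i]))
            i += 1
    return blocks


def score_blocks(blocks, pair_score, gap_penalty):
    """Alignment score = sum of block scores; a gap of length k costs gap_penalty(k)."""
    total = 0
    for block in blocks:
        if block[0] == "match":
            total += pair_score(block[1], block[2])
        else:
            total -= gap_penalty(block[1])
    return total


# The alignment from the slide, with the block separators removed.
blocks = split_blocks("AAC---AGATAAC", "ACTCGGT---AAT")
# Hypothetical scoring: +1 match, -1 mismatch, w(k) = 2 + k.
score = score_blocks(blocks, lambda a, b: 1 if a == b else -1, lambda k: 2 + k)
```

The slide's alignment splits into nine blocks: seven matched pairs, one s-gap and one t-gap, each gap of length 3.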

4 Smith Waterman – local alignment
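The slide gives only the title, so as a reminder here is a minimal sketch of the Smith-Waterman recurrence. It returns only the best local score, and it assumes a linear gap penalty and +1/-1 letter scores; those defaults are illustrative choices, not values from the lecture.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of s and t (linear gap penalty, assumed)."""
    rows, cols = len(s) + 1, len(t) + 1
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: that is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The only difference from the global recurrence is the extra 0 in the maximum and taking the best cell anywhere in the table rather than the bottom-right corner.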

5 DP for general gaps
Requires three arrays, one for each block type.
Time complexity is cubic in the sequence length.
This is expensive at best, prohibitive for large problems.
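A sketch of the three-array DP, assuming the usual textbook formulation: array a holds alignments ending in a matched pair, b those ending in a gap in s, c those ending in a gap in t. The names and the example scoring are assumptions for illustration. The inner loop over the gap length k is what makes the running time cubic.

```python
NEG = float("-inf")


def general_gap_score(s, t, pair_score, w):
    """Global alignment score with an arbitrary gap penalty w(k), k >= 1."""
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in matched letters
    b = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in a gap in s
    c = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in a gap in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)
    for i in range(1, m + 1):
        c[i][0] = -w(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = pair_score(s[i - 1], t[j - 1]) + max(
                a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1]
            )
            # Try every length k for the final gap: this loop costs O(n) per cell.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])


# Hypothetical scoring for the example: +1 match, -1 mismatch, w(k) = 2 + k.
example = general_gap_score("ACGT", "AT", lambda x, y: 1 if x == y else -1, lambda k: 2 + k)
```

A block of one type is never followed by a block of the same type, which is why b only looks back at a and c, and c only at a and b.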

6 Affine gap penalty
Charge h for each gap, plus g * len(gap): a gap of length k costs w(k) = h + g*k.
The DP with affine gaps still has quadratic complexity!
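With an affine penalty a gap can be extended one position at a time, so no loop over the gap length is needed. The following is a sketch under the assumption w(k) = h + g*k, using three arrays as in the general case; the function name and the example values of h and g are illustrative.

```python
NEG = float("-inf")


def affine_gap_score(s, t, pair_score, h, g):
    """Global alignment score with gap penalty w(k) = h + g*k, in O(len(s)*len(t))."""
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in matched letters
    b = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in a gap in s
    c = [[NEG] * (n + 1) for _ in range(m + 1)]  # ends in a gap in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = pair_score(s[i - 1], t[j - 1]) + max(
                a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1]
            )
            # Open a new gap (cost h + g) or extend the current one (cost g).
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


# Hypothetical parameters: +1 match, -1 mismatch, h = 2, g = 1.
example = affine_gap_score("ACGT", "AT", lambda x, y: 1 if x == y else -1, 2, 1)
```

Each cell now takes constant time, so the total work is proportional to the product of the two sequence lengths.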

7 Point accepted mutations
Some mutations are more likely than others.
In proteins, some amino acids are more similar than others (size, charge, hydrophobicity).
A point accepted mutation (PAM) matrix is a table giving the probability of each amino-acid substitution over a fixed period of evolutionary time.

8 PAM matrices
For each amino acid a, the entries M_ab sum to 1 over all b (each row is a probability distribution).
A 'unit of evolution' is the time in which 1 in 100 amino acids is expected to change.

9 Scoring matrix
Consider aligned letters a, b.
Pr(b is a mutation of a) = M_ab
Pr(b is a random occurrence) = p_b
Score(a,b) = 10 log(M_ab / p_b)
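A small numeric sketch of the log-odds score. The two-letter alphabet and all probabilities below are made up for illustration, and the logarithm is assumed to be base 10, which the slide does not state.

```python
import math


def log_odds_score(M, p, a, b):
    """Score(a, b) = 10 * log10(M[a][b] / p[b]); base 10 is an assumption."""
    return 10 * math.log10(M[a][b] / p[b])


# Toy numbers, not a real PAM matrix: M[a][b] = Pr(b is a mutation of a).
M = {"X": {"X": 0.9, "Y": 0.1}, "Y": {"X": 0.2, "Y": 0.8}}
# Toy background frequencies: p[b] = Pr(b is a random occurrence).
p = {"X": 0.5, "Y": 0.5}

score_same = log_odds_score(M, p, "X", "X")  # ratio 1.8 > 1, so positive
score_diff = log_odds_score(M, p, "X", "Y")  # ratio 0.2 < 1, so negative
```

A pair scores above zero exactly when it is more likely to arise by mutation than by chance, which is why summing scores along an alignment corresponds to multiplying likelihood ratios.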

10 Blast
Basic Local Alignment Search Tool
Def: a 'segment' is a contiguous subsequence (without gaps).
Def: a 'segment pair' is two segments of equal length.
Rem: the score of a segment pair is the sum of the scores of its aligned letters.
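The definition of a segment-pair score translates directly into code. In this sketch the function name and the +1/-1 letter scores are assumptions; in practice the letter scores would come from a scoring matrix.

```python
def segment_pair_score(seg_a, seg_b, pair_score):
    """Sum of letter-pair scores for two gap-free segments of equal length."""
    if len(seg_a) != len(seg_b):
        raise ValueError("a segment pair needs two segments of equal length")
    return sum(pair_score(x, y) for x, y in zip(seg_a, seg_b))


# Hypothetical letter scores: +1 for identical letters, -1 otherwise.
example = segment_pair_score("ACGT", "ACCT", lambda x, y: 1 if x == y else -1)
```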

11 What Blast does
Input:
– a PAM matrix
– a database of sequences B
– a query sequence A
– a threshold S
Output:
– all segment pairs (A, B) with score > S

12 How Blast works
Compile short, high-scoring strings (words).
Search for hits -- each hit gives a seed.
Extend seeds.
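The three steps can be sketched as a toy seed-and-extend routine. This is not the real BLAST implementation: the words here are simply the exact w-mers of the query, the extension is ungapped, and the stopping rule (give up once the running score falls more than x_drop below the best seen) is one common heuristic chosen for illustration. All names and parameters are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact w-mer shared by both."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def extend_seed(query, subject, i, j, w, pair_score, x_drop):
    """Ungapped extension of the seed at (i, j) to the right, then to the left.

    Returns (score, (query_start, subject_start, length)).
    """
    cur = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    best, best_right, k = cur, w, w
    while i + k < len(query) and j + k < len(subject):
        cur += pair_score(query[i + k], subject[j + k])
        k += 1
        if cur > best:
            best, best_right = cur, k
        elif best - cur > x_drop:
            break
    cur, best_left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        k += 1
        cur += pair_score(query[i - k], subject[j - k])
        if cur > best:
            best, best_left = cur, k
        elif best - cur > x_drop:
            break
    return best, (i - best_left, j - best_left, best_left + best_right)


score = lambda x, y: 1 if x == y else -1  # hypothetical letter scores
seeds = find_seeds("GGACGTACGG", "TTACGTACTT", 4)
result = extend_seed("GGACGTACGG", "TTACGTACTT", 2, 2, 4, score, 2)
```

In this example the seed ACGT at position 2 of both sequences grows into the segment pair ACGTAC with score 6.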

13 Z-scores
Given an alignment of A, B, how significant is it?
Permute A many times.
Align each permutation with B.
Collect the scores.
Z-score = (score - mean) / standard deviation
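The permutation procedure can be sketched as follows, reading the formula as (score - mean) / standard deviation. The alignment scorer is passed in as a function; the toy scorer used in the example just counts identical letters at the same positions and stands in for a real alignment algorithm. The function names, the number of permutations and the fixed random seed are assumptions.

```python
import random
import statistics


def z_score(a, b, align_score, n_perm=100, seed=0):
    """Z-score of align_score(a, b) against scores of shuffled copies of a."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    observed = align_score(a, b)
    letters = list(a)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(letters)  # permuting keeps the letter composition of a
        scores.append(align_score("".join(letters), b))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("all permuted scores are equal; Z-score is undefined")
    return (observed - mean) / sd


def identity_count(x, y):
    """Toy stand-in for an aligner: identical letters at equal positions."""
    return sum(1 for p, q in zip(x, y) if p == q)


z = z_score("ACGTACGTACGT", "ACGTACGTACGT", identity_count)
```

Shuffling preserves the composition of A, so the Z-score measures how much of the score comes from the order of the letters rather than from their frequencies.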

14 Blast on proteins
Words are w-mers which score at least T against some w-mer of A.
Use hashing or a DFA (deterministic finite automaton) to search for hits.
Extend each seed until a heuristically determined limit is reached.
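A brute-force sketch of how the word list for a protein query could be compiled: enumerate every w-mer over the alphabet and keep those scoring at least T against some w-mer of the query. The three-letter alphabet and the +2/-1 letter scores are made up to keep the example small; a real run would use the 20 amino acids and a scoring matrix.

```python
from itertools import product


def neighborhood_words(query, w, T, alphabet, pair_score):
    """Map each word scoring >= T against a query w-mer to the query positions it matches."""
    words = {}
    for i in range(len(query) - w + 1):
        q = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(pair_score(x, y) for x, y in zip(q, cand)) >= T:
                words.setdefault("".join(cand), []).append(i)
    return words


# Toy alphabet and hypothetical scores: +2 for identical letters, -1 otherwise.
words = neighborhood_words("ABCA", 3, 3, "ABC", lambda x, y: 2 if x == y else -1)
```

The resulting dictionary is exactly the kind of table a hash-based search would use: look up each w-mer of a database sequence and, on a hit, start extending at the stored query position.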

15 Blast on nucleic acids
Words are the w-mers in query A.
Letters are compressed, four to a byte.
Filter database B for very common words to avoid false positives.
Extend seeds as in proteins.
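Packing four letters into a byte works because each nucleotide needs only two bits. The sketch below shows one possible encoding; the code assignment (A=0, C=1, G=2, T=3) and the bit order within a byte are assumptions, not the layout any particular tool uses.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per letter
LETTERS = "ACGT"


def pack(seq):
    """Pack a nucleotide string into bytes, four letters per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        # Letter i goes into byte i // 4, at bit offset 2 * (i % 4).
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` letters from packed bytes."""
    return "".join(LETTERS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


packed = pack("ACGTAC")
```

The length has to be stored separately, since the final byte may be only partly filled.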

16 What does Blast give you?
Efficiency.
A rigorous statistical theory which gives the probability of a segment pair occurring by chance.

