Published by Randall Harvey. Modified over 8 years ago.
1
Sequence Alignment
2
Assignment Read Lesk, 160-194. Problem: Given two sequences R and S of length n, how many alignments of R and S are possible? If you don’t find an exact answer, how tight a Big-O bound can you derive? (optional)
3
The ‘bio’ statement of the problem Given two or more sequences: Measure their similarity Establish a correspondence Find conserved and varied locations Infer evolutionary relationships
4
The ‘cs’ statement Given two or more sequences: Establish an optimal residue-residue correspondence
5
Definition of alignment An alignment is a set of correspondences between pairs of residues which preserves their order. Example:
a b c d e –
a – c d e f
Note: gaps are permitted in both sequences
6
Definition of ‘optimal’ Requires a scoring system May include positive and negative values Best (highest scoring) of all possible values Question: given two sequences of length n, how many alignments are possible?
7
Dot plots (figure: dot plot with W H I R L I N G along one axis)
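A dot plot can be built by marking every cell where the residue of one sequence equals the residue of the other. The sketch below is only an illustration: the function names `dot_plot` and `render` and the second sequence are assumptions, not taken from the slides.

```python
def dot_plot(row_seq, col_seq):
    """Return a grid of booleans: cell [i][j] is True when row_seq[i] == col_seq[j]."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Draw the dot plot as text, one line per residue of row_seq."""
    grid = dot_plot(row_seq, col_seq)
    lines = ["  " + " ".join(col_seq)]  # header row with the column sequence
    for residue, row in zip(row_seq, grid):
        marks = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + marks)
    return "\n".join(lines)


# Hypothetical pair of sequences, chosen only for illustration.
print(render("WHIRLING", "WHIRLIGIG"))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree without gaps.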
9
Dotplots and alignments Dotplots are visual representations of similarity Any path from upper left to lower right, using only S, E and SE moves, is an alignment
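If an alignment is taken to be exactly such a path, the number of alignments can be counted with a small recurrence: each point is reached from the north, the west or the northwest. The sketch below assumes paths run over lattice points from (0, 0) to (n, m); `count_paths` is a hypothetical name. Other definitions of "the same alignment" (for example, merging paths that differ only in the order of adjacent gaps) would give different counts.

```python
def count_paths(n, m):
    """Count lattice paths from (0, 0) to (n, m) using S, E and SE steps."""
    # paths[i][j] = number of ways to reach point (i, j).
    # Along the top row and left column there is only one way: straight moves.
    paths = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paths[i][j] = (
                paths[i - 1][j]        # arrive with an S move
                + paths[i][j - 1]      # arrive with an E move
                + paths[i - 1][j - 1]  # arrive with an SE move
            )
    return paths[n][m]


for n in range(1, 6):
    print(n, count_paths(n, n))
```

For n = 1, 2, 3 this gives 3, 13 and 63 paths, so the count grows quickly with sequence length.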
10
Edit distance The minimal number of edit operations (insert/delete, change) to transform one sequence to another Operations can be weighted: –Indels by length –Transformations by type
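As a minimal sketch of the unweighted case, the function below computes the edit distance with every insert, delete and change costing 1. The name `edit_distance` and the unit costs are assumptions; a weighted scheme would replace the three `+ 1` style terms with its own costs.

```python
def edit_distance(a, b):
    """Minimal number of single-residue inserts, deletes and changes turning a into b."""
    # prev[j] = distance between the prefix of a seen so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletes
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # change ca to cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("abcde", "acdef"))  # delete b, insert f: distance 2
```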
11
A weighted scheme Transitions (a↔g, c↔t) are more common than transversions
     a   g   c   t
a   20  10   5   5
g   10  20   5   5
c    5   5  20  10
t    5   5  10  20
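The scheme can be written as a small scoring function: 20 for identical bases, 10 for a transition, 5 for a transversion. This is a sketch under the reading that a and g form one class (purines) and c and t the other (pyrimidines); the function names are hypothetical.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_score(x, y):
    """Score one pair of bases: identity 20, transition 10, transversion 5."""
    if x == y:
        return 20
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 10  # transition: both bases in the same chemical class
    return 5       # transversion: one purine, one pyrimidine


def ungapped_score(r, s):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(r) != len(s):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(x, y) for x, y in zip(r, s))


print(ungapped_score("acgt", "gcga"))  # 10 + 20 + 20 + 5 = 55
```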
12
Gap penalties For DNA alignment, CLUSTAL-W uses:
– +1 for a match
– 0 for a mismatch
– 10 for gap initiation
– 0.1 for gap extension
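To see how these numbers combine, the sketch below scores an alignment that is already written out as two gapped rows. It assumes one common convention, which the slide does not spell out: the first residue of a gap costs the initiation penalty and each further residue costs the extension penalty. `gap_penalty` and `score_alignment` are hypothetical names, and `-` stands for a gap.

```python
from itertools import groupby


def gap_penalty(length, initiation=10.0, extension=0.1):
    """Cost of one gap of `length` residues (assumed convention, see above)."""
    if length <= 0:
        return 0.0
    return initiation + extension * (length - 1)


def score_alignment(top, bottom, match=1.0, mismatch=0.0, gap="-"):
    """Score two gapped rows: pair scores minus a penalty per run of gaps."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")

    def kind(column):
        x, y = column
        if x == gap:
            return "gap in top row"
        if y == gap:
            return "gap in bottom row"
        return "residue pair"

    score = 0.0
    # groupby collects consecutive columns of the same kind, so each
    # uninterrupted run of gaps in one row is penalised once as a whole.
    for label, run in groupby(zip(top, bottom), key=kind):
        run = list(run)
        if label == "residue pair":
            score += sum(match if x == y else mismatch for x, y in run)
        else:
            score -= gap_penalty(len(run))
    return score


print(score_alignment("acgt--a", "acgtgga"))
```

With these values a single gap outweighs many matches, so gaps are only opened when they buy a long stretch of agreement.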
13
Dynamic programming Gives global optimum Takes O(nm) time Doesn’t distinguish among equal-scoring alignments
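A minimal sketch of global alignment by dynamic programming, in the style usually attributed to Needleman and Wunsch. The scores (match +1, mismatch -1, gap -1 per residue) and the name `global_align` are assumptions chosen for brevity, not values from the slides. The table has (n+1)(m+1) cells and each is filled in constant time, which is where the O(nm) comes from.

```python
def global_align(r, s, match=1, mismatch=-1, gap=-1):
    """Return (score, top_row, bottom_row) of one optimal global alignment."""
    n, m = len(r), len(s)

    def pair(i, j):
        return match if r[i - 1] == s[j - 1] else mismatch

    # score[i][j] = best score for aligning r[:i] with s[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align r[i-1] with s[j-1]
                score[i - 1][j] + gap,             # r[i-1] against a gap
                score[i][j - 1] + gap,             # s[j-1] against a gap
            )

    # Trace back from the corner, following any move that explains the score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(r[i - 1]); bottom.append(s[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(r[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top_row, bottom_row = global_align("abcde", "acdef")
print(best)
print(top_row)
print(bottom_row)
```

The traceback here always tries the diagonal move first, so when several alignments share the best score it reports only one of them.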
14
Variations on the question Small sequence vs small sequence (how close are these two?) A small sequence against a very long sequence (Is this gene’s relative in the database?) Closest subsequences (do these sequences share a motif?)
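For the ‘closest subsequences’ question, one well-known variant of the dynamic programming table lets an alignment start and end anywhere by never letting a cell drop below zero (the Smith-Waterman idea). The sketch below returns only the best local score; the scoring values and the name `local_alignment_score` are assumptions.

```python
def local_alignment_score(r, s, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of subsequences of r and s."""
    prev = [0] * (len(s) + 1)
    best = 0
    for x in r:
        curr = [0]
        for j, y in enumerate(s, start=1):
            cell = max(
                0,                                            # start a fresh alignment here
                prev[j - 1] + (match if x == y else mismatch),  # extend diagonally
                prev[j] + gap,                                # gap in s
                curr[j - 1] + gap,                            # gap in r
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared motif "acgt" is found despite unrelated flanking residues.
print(local_alignment_score("xxxacgtxxx", "yyacgtyy"))  # 4 matches * 2 = 8
```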
15
Blast-style searches Answers the ‘relative’ question Heuristic (but statistically good, for the simplest model) Method: –Find local alignments –Find paths close to local alignments
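The two steps of the method can be illustrated with a much simplified seed-and-extend sketch. This is not the actual Blast implementation: it assumes exact word matches of length `k` as seeds, ungapped extension, and a simple rule that stops extending once the score has fallen `drop` below the best seen. All names and parameter values are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps; return (score, query_start, subject_start, length)."""

    def extend(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # the extension has stopped paying off
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + k, j + k, 1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    total = k * match + left_score + right_score
    return total, i - left_len, j - left_len, left_len + k + right_len


query, subject = "ACGTAC", "GGACGTACGG"
hits = [extend_seed(query, subject, i, j) for i, j in find_seeds(query, subject)]
print(max(hits))
```

Because only positions near a seed are examined, the search avoids filling the full dynamic programming table, at the price of possibly missing alignments that contain no exact word match.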
16
P score Probability that alignment would arise by chance What if short vs long search gives a P-value of 10E-2? 10E-4?
17
Z-score, E-value Z-value is a measure of ‘unlikelihood’ of match, from known mean and deviation E-value is the expected number of sequences that give the same Z-score or better with a random probe E is the usual Blast statistic E <= 0.02 is ‘good’
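One way to get a feel for these statistics is to estimate the mean and deviation empirically by shuffling one sequence many times. The sketch below does that with a toy score (identical residues at the same position); the shuffling approach, the toy score and all names are assumptions for illustration, not the statistics Blast itself computes. `expected_hits` further assumes that every database sequence is an independent comparison with the same chance probability.

```python
import random
import statistics


def match_count(r, s):
    """Toy similarity score: identical residues at the same position."""
    return sum(x == y for x, y in zip(r, s))


def z_score(r, s, score=match_count, trials=200, seed=0):
    """How many standard deviations the observed score lies above shuffled scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score(r, s)
    residues = list(s)
    background = []
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, random order
        background.append(score(r, "".join(residues)))
    mean = statistics.mean(background)
    deviation = statistics.pstdev(background)
    if deviation == 0:
        raise ValueError("shuffled scores do not vary; Z-score is undefined")
    return (observed - mean) / deviation


def expected_hits(p_per_sequence, database_size):
    """Expected number of chance hits when each comparison has probability p."""
    return p_per_sequence * database_size


print(z_score("acgtacgtacgtacgt", "acgtacgtacgtacgt"))
print(expected_hits(1e-4, 1000))
```

The second function shows why a P-value must be read against the size of the search: a chance probability that looks small for one comparison can still produce hits when multiplied over a long database.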
18
The Blast family Blast Blastp (protein-protein) Blastx (nucleotide-protein) Tblastn (amino-nucleotide) Tblastx (n-n) Psi-blast (improved a-a)