1
Sequence Alignment III CIS 667 February 10, 2004
2
Extensions to the Basic Algorithm
We have seen that the basic dynamic programming algorithm can be used to solve global, semi-global, and local alignment. We can extend the algorithm so that gap penalties reflect the cost of gaps more accurately.
3
General Gap Penalty Functions
Gaps are caused by mutations. A single large gap is more likely than several smaller ones, and gap penalties should reflect that. Let w(k) denote a gap penalty function: the cost of a gap of k spaces. So far we have been using the linear function w(k) = bk.
4
General Gap Penalty Functions
We can modify the basic algorithm to compute a score with a general gap penalty w (i.e. any function). The modified algorithm is slower, however: O(n^3). The new scoring scheme is no longer additive. The space required is also larger.
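To see where the cubic time comes from, here is a minimal sketch in Python, assuming the common three-table formulation in which each table records how the alignment ends. The function name, the match/mismatch values, and the convention that w(k) is a positive cost subtracted from the score are assumptions for illustration, not part of the slides. The inner loops over the gap length k are what raise the running time from O(n^2) to O(n^3).

```python
def general_gap_score(s, t, w, match=1, mismatch=-1):
    """Global alignment score with an arbitrary gap cost w(k), in O(n^3) time.

    a[i][j]: best score of s[:i] vs t[:j] ending with s[i-1] aligned to t[j-1]
    b[i][j]: best score ending with a gap in s (characters of t against spaces)
    c[i][j]: best score ending with a gap in t (characters of s against spaces)
    """
    n, m = len(s), len(t)
    neg = float("-inf")
    a = [[neg] * (m + 1) for _ in range(n + 1)]
    b = [[neg] * (m + 1) for _ in range(n + 1)]
    c = [[neg] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -w(j)          # t[:j] aligned against one gap of j spaces
    for i in range(1, n + 1):
        c[i][0] = -w(i)          # s[:i] aligned against one gap of i spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Try every possible length k for the final gap: this is the extra factor of n.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[n][m], b[n][m], c[n][m])


# Any function of the gap length can be plugged in.
print(general_gap_score("CAGT", "CCAAGGTTCAGT", lambda k: 2 * k))   # linear cost
print(general_gap_score("CAGT", "CCAAGGTTCAGT", lambda k: 2 + k))   # affine cost
```

Keeping three tables instead of one is also why the space requirement grows.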
5
Affine Gap Penalty Functions
Can we do better than O(n^3) and still have a reasonable function? Yes. We need w(k) ≤ kw(1). An affine function, w(k) = h + gk for k ≥ 1 with w(0) = 0 and h, g > 0, works. Think of h as the cost of opening a gap and g as the cost of extending it. We can develop an algorithm with time complexity O(n^2).
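One way to reach O(n^2) is sketched below in Python, assuming the usual three-table recurrence for affine gaps. The function name and the match/mismatch values are illustrative assumptions; h and g are treated as positive costs subtracted from the score, following w(k) = h + gk. Because extending a gap always costs g, each cell only needs to look at its immediate neighbours rather than at every possible gap length.

```python
def affine_global_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Global alignment score with gap cost w(k) = h + g*k, in O(n^2) time.

    a[i][j]: best score ending with s[i-1] aligned to t[j-1]
    b[i][j]: best score ending with a space in s
    c[i][j]: best score ending with a space in t
    """
    n, m = len(s), len(t)
    neg = float("-inf")
    a = [[neg] * (m + 1) for _ in range(n + 1)]
    b = [[neg] * (m + 1) for _ in range(n + 1)]
    c = [[neg] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Open a new gap (pay h + g) or extend the current one (pay g only).
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])


print(affine_global_score("CAGT", "CCAAGGTTCAGT"))
```

With these assumed parameters the best alignment of CAGT against CCAAGGTTCAGT uses a single gap of eight spaces, since every extra gap would pay h again.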
6
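Gap Penalties - Overview
Imagine we want to align CAGT with CCAAGGTTCAGT.
Bad alignment:
C-A-G-T-----
CCAAGGTTCAGT
Better alignment:
--------CAGT
CCAAGGTTCAGT
Gap cost with linear gap penalty (-2 per space): -16 for both alignments.
Gap cost with affine gap penalty (h = -2 for the first space of a gap, g = -1 for each additional space): -12 for the bad alignment, -9 for the better one.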
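The gap costs in this example can be checked with a few lines of Python. This is a small sketch with hypothetical helper names; it assumes the convention that reproduces the slide's figures, namely that the first space of a gap costs -2 and every further space in the same gap costs -1.

```python
import re


def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


def linear_cost(row, per_space=-2):
    """Every space costs the same, so only the total number of spaces matters."""
    return sum(per_space * k for k in gap_lengths(row))


def affine_cost(row, first_space=-2, extra_space=-1):
    """Assumed convention: first space of a gap costs first_space, the rest extra_space."""
    return sum(first_space + extra_space * (k - 1) for k in gap_lengths(row))


bad = "C-A-G-T-----"      # four gaps: lengths 1, 1, 1, 5
better = "--------CAGT"   # one gap of length 8

print(linear_cost(bad), linear_cost(better))   # -16 -16
print(affine_cost(bad), affine_cost(better))   # -12 -9
```

The linear penalty cannot tell the two alignments apart, while the affine penalty prefers the one with a single long gap.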
7
Multiple Sequence Alignment
Once a protein sequence is newly determined, an important goal is to assign possible functions to it. The first step is to search for similar sequences in the DNA and protein sequence databases. If more than one similar sequence is found, the next step is to align all of the sequences together in a multiple alignment.
8
Multiple Sequence Alignment
Multiple alignments are a key starting point for predicting protein secondary structure, residue accessibility, and function. They also provide the basis for the most sensitive sequence searching algorithms.
9
Multiple Sequence Alignment
A multiple sequence alignment is simply an alignment that contains more than two sequences:
MPQILLL
MLR-LL-
MK-ILLL
MPPVLIL
10
Multiple Sequence Alignment
We must decide how to score a multiple alignment. One possibility is the sum-of-pairs function: add up the pairwise scores of all pairs of characters in a column to get the score of the column. Note that in a multiple sequence alignment a column may contain two spaces; the score of (-,-) is then usually set to 0.
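As an illustration of the sum-of-pairs function, the Python sketch below scores a small four-sequence alignment. The pairwise values (match +1, mismatch -1, character against space -2) and the function names are assumptions chosen for simplicity; for proteins a substitution matrix would normally supply the pairwise scores. Only the rule that (-,-) scores 0 comes from the text above.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, space=-2):
    """Score one pair of symbols taken from the same column."""
    if x == "-" and y == "-":
        return 0                      # two spaces against each other cost nothing
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch


def sp_column(column):
    """Sum of the pairwise scores over all pairs of rows in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_score(rows):
    """Sum-of-pairs score of a whole alignment given as equal-length strings."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(sp_column(col) for col in zip(*rows))


rows = ["MPQILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print([sp_column(col) for col in zip(*rows)])  # [6, -4, -9, -7, 6, 0, -3]
print(sp_score(rows))                          # -11
```

A column of k rows contributes k(k-1)/2 pairs, so scoring an alignment of length L takes time proportional to L times k squared.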
11
Multiple Sequence Alignment
A straightforward dynamic programming approach to multiple sequence alignment results in an algorithm whose running time is exponential in the number of sequences. Heuristics can be used to reduce the complexity in most cases.
12
Multiple Sequence Alignment
Automatic alignment programs such as CLUSTAL W can be used to produce multiple alignments. The PSI-BLAST program uses multiple sequence alignments to make more sensitive searches of protein sequence databases than is possible with a single sequence.
13
PAM Matrices
When comparing protein sequences, we need a more complex scoring scheme. A mismatch between two amino acids with similar biochemical properties should score higher than a mismatch between two dissimilar ones, because evolution is more likely to replace an amino acid with a similar one (e.g. same size, both hydrophobic, etc.).
14
PAM Matrices
PAM stands for Point Accepted Mutations or Percent of Accepted Mutations. A 1-PAM matrix reflects an amount of evolution producing on average one mutation per hundred amino acids. A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart; it works well for long, weakly similar sequences. Small PAM values are good for short, similar sequences.
15
PAM-250 Matrix
16
BLOSUM Matrices
Another widely used set of matrices is BLOSUM (Blocks Substitution Matrix). BLOSUM is often better for highly divergent sequences, and PAM for more highly similar sequences.
17
BLAST
BLAST (Basic Local Alignment Search Tool) is a family of sequence similarity tools. It can be used to search sequence databases worldwide, and can be run locally or via a web-based interface on a server. Given a query sequence, BLAST returns all matches above a user-defined threshold.
18
BLAST
BLAST uses a heuristic technique:
1. Compile a list of high-scoring words (use a PAM matrix to score words w characters long).
2. Search for matches in the database (use a hash table to speed up the search); call a match a seed.
3. Extend the seeds in both directions until the score of the extension falls below a limit.
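The seed-and-extend idea can be sketched in a few lines of Python. This is a heavily simplified illustration, not BLAST itself: it assumes that only exact copies of the query's words count as seeds (real BLAST also accepts similar words that score above a threshold under a substitution matrix), it uses match +1 and mismatch -1 instead of a PAM matrix, and it reads "falls below a limit" as "drops more than a fixed amount below the best score seen so far". All function and parameter names are hypothetical.

```python
from collections import defaultdict


def index_words(db, w):
    """Hash table mapping every length-w word of the database to its positions."""
    index = defaultdict(list)
    for pos in range(len(db) - w + 1):
        index[db[pos:pos + w]].append(pos)
    return index


def extend_one_way(query, db, q, d, step, match, mismatch, drop):
    """Walk from (q, d) in direction step; return (best gain, length at best gain)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        gain += match if query[q] == db[d] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > drop:
            break                     # score fell too far below the best: stop
        q += step
        d += step
    return best, best_len


def seed_and_extend(query, db, w=3, match=1, mismatch=-1, drop=2):
    """Return (score, query start, db start, length) for each extended seed."""
    index = index_words(db, w)
    hits = set()
    for qpos in range(len(query) - w + 1):
        for dpos in index.get(query[qpos:qpos + w], []):
            right, rlen = extend_one_way(query, db, qpos + w, dpos + w, 1, match, mismatch, drop)
            left, llen = extend_one_way(query, db, qpos - 1, dpos - 1, -1, match, mismatch, drop)
            hits.add((w * match + left + right, qpos - llen, dpos - llen, w + llen + rlen))
    return sorted(hits, reverse=True)


print(seed_and_extend("GATTACA", "CCGATTACATT"))
# [(7, 0, 2, 7), (3, 1, 8, 3)]
```

Several seeds on the same diagonal extend to the same segment, which is why the results are collected in a set before sorting. A user-defined score threshold, as described earlier, would simply filter this list.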