Sequence Alignment III CIS 667 February 10, 2004.


1 Sequence Alignment III CIS 667 February 10, 2004

2 Extensions to the Basic Algorithm
We have seen that the basic dynamic programming algorithm can be used to solve
• Global alignment
• Semi-global alignment
• Local alignment
We can extend the algorithm to reflect the cost of gaps more accurately

3 General Gap Penalty Functions
Gaps are caused by mutations
• A single large gap is more likely than several smaller ones
• Gap penalties should reflect that
• Let w(k) denote a gap penalty function (the cost of a gap of k spaces)
• We have been using w(k) = bk, a linear function

4 General Gap Penalty Functions
We can modify the basic algorithm to compute a score with a general gap penalty w (i.e. any function)
The modified algorithm is slower, however: O(n^3)
• The scoring scheme of the new algorithm is no longer additive
• The space required is also larger
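
The cubic cost comes from trying every possible gap length at every cell. A minimal sketch of one common formulation follows, assuming three tables (last column is a match/mismatch, a gap in s, or a gap in t), w(k) given as a non-negative cost that is subtracted from the score, and illustrative match/mismatch values of +1/-1. The function name and parameters are hypothetical, not taken from the slides.

```python
NEG_INF = float("-inf")

def align_general_gap(s, t, w, match=1, mismatch=-1):
    """Best global alignment score when a gap of k spaces costs w(k) >= 0."""
    n, m = len(s), len(t)
    # a: last column pairs s[i] with t[j]; b: ends with a gap in s; c: ends with a gap in t
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -w(j)
    for i in range(1, n + 1):
        c[i][0] = -w(i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Trying every gap length k is the extra loop that makes this O(n^3).
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[n][m], b[n][m], c[n][m])
```

Any function of k can be passed as w, for example `align_general_gap("CAGT", "CCAAGGTTCAGT", lambda k: 3)` for a gap cost that does not depend on length.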

5 Affine Gap Penalty Functions
Can we do better than O(n^3) and still have a reasonable function?
• Yes. We need to have w(k) ≤ kw(1)
• An affine function w(k) = h + gk for k > 0, with w(0) = 0 and h, g > 0, works
• Think of h as the cost of opening a gap and g as the cost of extending a gap
• We can develop an algorithm with time complexity O(n^2)
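
With an affine penalty, each cell only needs to know whether the previous column already ends in a gap, so the inner loop over gap lengths disappears. The sketch below is one way to write this, assuming w(k) = h + gk is a positive cost subtracted from the score and illustrative match/mismatch values of +1/-1; the function name and defaults are hypothetical.

```python
NEG_INF = float("-inf")

def align_affine(s, t, h=2, g=1, match=1, mismatch=-1):
    """Best global alignment score when a gap of k spaces costs h + g*k."""
    n, m = len(s), len(t)
    # a: last column pairs s[i] with t[j]; b: ends with a gap in s; c: ends with a gap in t
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a new gap costs h + g; extending an existing one costs only g.
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])
```

Each cell does a constant amount of work, which gives the O(n^2) running time for two sequences of length about n.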

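6 Gap Penalties - Overview
Imagine we want to align CAGT with CCAAGGTTCAGT
Bad alignment:
C-A-G-T-----
CCAAGGTTCAGT
Better alignment:
--------CAGT
CCAAGGTTCAGT
Gap cost with linear gap penalty (-2 per space): -16 for both alignments
Gap cost with affine gap penalty (h = -2, g = -1): -12 for the bad alignment, -9 for the better one
(these totals charge h for the first space of each gap and g for each additional space)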
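
The totals on this slide can be checked mechanically. The sketch below counts the runs of spaces in the gapped row and prices them. It assumes that the affine figures charge -2 for the first space of a gap and -1 for every further space, which is the reading that reproduces -12 and -9; the function names are hypothetical.

```python
import re

def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]

def linear_cost(row, per_space=-2):
    """Every space costs the same, so only the total number of spaces matters."""
    return per_space * sum(gap_lengths(row))

def affine_cost(row, first=-2, additional=-1):
    """First space of each gap costs `first`, each later space `additional`."""
    return sum(first + additional * (k - 1) for k in gap_lengths(row))

bad = "C-A-G-T-----"
better = "--------CAGT"
print(linear_cost(bad), linear_cost(better))   # -16 -16
print(affine_cost(bad), affine_cost(better))   # -12 -9
```

The linear penalty cannot tell the two alignments apart, while the affine penalty prefers the single long gap.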

7 Multiple Sequence Alignment
Once a protein sequence is newly determined, an important goal is to assign possible functions to it
• First search for similar sequences in the DNA and protein sequence databases
• If more than one similar sequence is found, the next step is to build a multiple alignment of all of the sequences

8 Multiple Sequence Alignment
Multiple alignments are a key starting point for
• Protein secondary structure prediction
• Residue accessibility
• Function
They also provide the basis for the most sensitive sequence searching algorithms

9 Multiple Sequence Alignment
A multiple sequence alignment is simply an alignment that contains more than two sequences
MPQILLL
MLR-LL-
MK-ILLL
MPPVLIL

10 Multiple Sequence Alignment
We must decide how to score a multiple alignment
One possibility is the sum-of-pairs function
• Simply add up the pairwise scores of all pairs in a column to get the score of the column
• Note that in multiple sequence alignment two spaces may be paired in a column; the score of (-,-) is then usually set to 0
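
A minimal sketch of sum-of-pairs scoring follows. The pairwise values (+1 match, -1 mismatch, -2 for a residue against a space) are illustrative assumptions, and the function names are hypothetical; only the rule that (-,-) scores 0 comes from the slide.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, space=-2):
    """Score one pair of symbols taken from the same column."""
    if x == "-" and y == "-":
        return 0                      # two spaces: set to 0
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch

def sum_of_pairs(alignment):
    """Add the pairwise scores of all pairs in every column."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)          # iterate column by column
        for x, y in combinations(column, 2)    # all unordered pairs of rows
    )

rows = ["MPQILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sum_of_pairs(rows))   # -11 under the assumed pair scores
```

For protein sequences the match/mismatch values would normally come from a substitution matrix instead of the fixed +1/-1 used here.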

11 Multiple Sequence Alignment
A straightforward dynamic programming approach to multiple sequence alignment results in an algorithm whose running time is exponential in the number of sequences
• Heuristics can be used to reduce the complexity in most cases
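
To see where the exponential growth comes from, the sketch below counts the work for k sequences of length n, assuming the direct generalization of the pairwise table: one cell per combination of prefixes, and one predecessor per non-empty choice of which sequences advance. The function name is hypothetical.

```python
def msa_dp_size(k, n):
    """Table size and predecessors per cell for k sequences of length n."""
    cells = (n + 1) ** k          # one cell for every tuple of prefix lengths
    predecessors = 2 ** k - 1     # each sequence contributes a residue or a space, not all spaces
    return cells, predecessors

for k in (2, 3, 4):
    print(k, msa_dp_size(k, 100))
```

For two sequences this is the familiar quadratic table with three predecessors per cell; every extra sequence multiplies the table size by n + 1.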

12 Multiple Sequence Alignment
Automatic alignment programs such as CLUSTAL W can be used to produce multiple alignments
The PSI-BLAST program uses multiple sequence alignments to make more sensitive searches of protein sequence databases than is possible with a single sequence

13 PAM Matrices
When comparing protein sequences, we need a more complex scoring scheme
• A mismatch between two amino acids with similar biochemical properties should score higher than one between two dissimilar ones
• Evolution is more likely to replace an amino acid with a similar one (e.g. same size, both hydrophobic, etc.)

14 PAM Matrices
PAM - Point Accepted Mutations or Percent of Accepted Mutations
• The 1-PAM matrix reflects an amount of evolution producing on average one mutation per hundred amino acids
• The 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart
• It works well for long, weakly similar sequences
• Small PAM values are good for short, similar sequences
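
One way to picture "250 units of evolution" is to apply the 1-PAM step 250 times. The sketch below assumes the usual construction in which the n-PAM mutation probability matrix is the 1-PAM matrix raised to the n-th power. It uses a made-up two-letter alphabet with a 1% change rate, not real amino acid data, and it stops at probabilities (real PAM scoring matrices are 20 x 20 and are further converted to log-odds scores).

```python
def mat_mul(x, y):
    """Plain matrix product of two square lists of lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical toy "1-PAM": each of two letters changes with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0])   # probabilities of staying the same / having changed
```

Because positions can mutate more than once or mutate back, 250 units of evolution do not mean that every position differs.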

15 PAM-250 Matrix

16 BLOSUM Matrices
Another widely used set of matrices is BLOSUM - Blocks Substitution Matrix
• BLOSUM is often better for highly divergent sequences
• PAM is often better for highly similar sequences

17 BLAST
BLAST (Basic Local Alignment Search Tool) is a family of sequence similarity tools
• Can be used to search sequence databases worldwide
• Can be run locally, or via a web-based interface on a server
• Given a query sequence, BLAST returns all matches above a user-defined threshold

18 BLAST
BLAST uses a heuristic technique
• Compile a list of high-scoring words (use a PAM matrix to score words w characters long)
• Search for matches in the database (use a hash table to speed up the search); call a match a seed
• Extend the seeds in both directions until the score of the extension falls below a limit
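
The three steps can be sketched in a much simplified form. The code below is not the BLAST implementation. It assumes exact word matches instead of PAM-scored word lists, +1/-1 match/mismatch scores, no gaps, and a stopping rule that ends an extension once the running score falls `drop` below the best score seen so far. All function and parameter names are hypothetical.

```python
from collections import defaultdict

def build_index(db, w):
    """Hash table from every word of length w in db to its start positions."""
    index = defaultdict(list)
    for pos in range(len(db) - w + 1):
        index[db[pos:pos + w]].append(pos)
    return index

def extend_one_way(query, db, q, d, step, match, mismatch, drop):
    """Walk from (q, d) in direction step; return (best gain, characters kept)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        gain += match if query[q] == db[d] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain >= drop:
            break                     # score fell too far below the best seen
        q += step
        d += step
    return best, best_len

def blast_like(query, db, w=3, match=1, mismatch=-1, drop=2, threshold=5):
    """Return sorted (score, query_start, db_start, length) for extended seeds."""
    index = build_index(db, w)
    hits = set()
    for q in range(len(query) - w + 1):
        for d in index.get(query[q:q + w], ()):          # each lookup hit is a seed
            r_gain, r_len = extend_one_way(query, db, q + w, d + w, 1, match, mismatch, drop)
            l_gain, l_len = extend_one_way(query, db, q - 1, d - 1, -1, match, mismatch, drop)
            score = w * match + l_gain + r_gain
            if score >= threshold:
                hits.add((score, q - l_len, d - l_len, w + l_len + r_len))
    return sorted(hits, reverse=True)

print(blast_like("CAGT", "CCAAGGTTCAGT", w=3, threshold=4))   # [(4, 0, 8, 4)]
```

Both seeds in this example (CAG and AGT) extend to the same four-character match, so it is reported once.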

