9/19/07BCB 444/544 F07 ISU Dobbs #13- Star Alignment; HMMs1 BCB 444/544 Lecture 13 Star Alignment & Clustal (for MSA) Perhaps: Profiles & Hidden Markov.

1 9/19/07 BCB 444/544 F07 ISU Dobbs #13 - Star Alignment; HMMs
BCB 444/544 Lecture 13: Star Alignment & Clustal (for MSA). Perhaps: Profiles & Hidden Markov Models (HMMs)
#13_Sept19

2 Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST. Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Profiles & Hidden Markov Models. Chp 6 - pp 79-84. Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315 http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

3 Assignments & Announcements
√ Sun Sept 16 - Study Guide for Exam 1 was posted
√ Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover: Lectures 2-12 (thru Mon Sept 17); Labs 1-4; HW2; all assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

4 Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
√ Scoring Function
√ Exhaustive Algorithms
Heuristic Algorithms: Star Alignment, Clustal
√ Practical Issues
First, review MSA scoring briefly, then back to Star Alignment & ClustalW

5 Scoring an Alignment (in Lecture 12, so will be covered on Exam 1)
In practice, simple scoring functions are used.
Usually, columns are scored independently: the score of alignment m is built from a gap penalty and the scores of its columns, where m_i is the ith column of alignment m.
[Figure: example multiple sequence alignments, with and without gaps]

6 Sum of Pairs (SP) Score
SP = sum of pairs = sum of the scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix.
Compute for each column $m_i$: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^l$ is residue l of column $m_i$ and s is the PAM or BLOSUM score.
[Figure: example alignment with one column m_i highlighted]

7 Example: Calculating SP Score
Alignment M has three columns: m_1 = (F, F, F), m_2 = (-, Y, -), m_3 = (G, D, G).
Scores used (BLOSUM 60): s(F,F) = 5, s(G,G) = 4, s(G,D) = -3. Gap penalty = -8; s(-,-) = 0.
S(m) = S(m_1) + S(m_2) + S(m_3)
     = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
     = 15 - 16 + 0 + 4 - 6 = -3
(I added more colors to this slide)
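
The sum-of-pairs calculation on this slide can be sketched in Python as follows. This is only an illustration: the function names and the dictionary layout are assumptions, and the substitution table holds just the three BLOSUM 60 values that appear in the slide's arithmetic.

```python
from itertools import combinations

# Only the substitution scores used in the slide's example (keys are sorted pairs).
SUBST = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}
GAP_PENALTY = -8  # residue vs. gap, as stated on the slide


def pair_score(a, b):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # s(-,-) = 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    return SUBST[tuple(sorted((a, b)))]


def sp_score(rows):
    """Sum-of-pairs score: add pair scores over all columns and all row pairs."""
    total = 0
    for column in zip(*rows):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


print(sp_score(["F-G", "FYD", "F-G"]))
```

With the three rows above, the columns are (F, F, F), (-, Y, -) and (G, D, G), so the call evaluates the same sum as the slide.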

8 Algorithms & Software for MSA? #1
Exhaustive Methods
√ Multidimensional dynamic programming (DP)
Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
Progressive (Star Alignment, Clustal)
Iterative
Block-based

9 Dynamic Programming for MSA
As with pairwise alignments, MSAs can be computed by dynamic programming* (*if you're not in a rush!)
[Figure: 2D vs. 3D dynamic programming matrices]

10 Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z, the main iteration loop (3D) is:
$$S(i,j,k) = \max \begin{cases}
S(i-1,j-1,k-1) + \delta(x_i, y_j, z_k) \\
S(i-1,j-1,k) + \delta(x_i, y_j, -) \\
S(i-1,j,k-1) + \delta(x_i, -, z_k) \\
S(i-1,j,k) + \delta(x_i, -, -) \\
S(i,j-1,k-1) + \delta(-, y_j, z_k) \\
S(i,j-1,k) + \delta(-, y_j, -) \\
S(i,j,k-1) + \delta(-, -, z_k)
\end{cases}$$
Here $\delta$ scores a single alignment column.
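
A minimal Python sketch of this three-sequence recurrence is given below. It returns only the optimal score, with no traceback. The column score is assumed to be a sum-of-pairs score with match = +1, mismatch = -1, residue/gap = -1 and gap/gap = 0; these values and all function names are illustrative assumptions, not taken from the lecture.

```python
from itertools import product


def column_score(a, b, c, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column (assumed scoring scheme)."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue  # gap vs. gap contributes 0
        if p == "-" or q == "-":
            total += gap
        else:
            total += match if p == q else mismatch
    return total


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D dynamic programming."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    S = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    S[0][0][0] = 0
    # The 2^3 - 1 = 7 predecessor moves: each sequence advances (1) or gets a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    a = x[pi] if di else "-"
                    b = y[pj] if dj else "-"
                    c = z[pk] if dk else "-"
                    cand = S[pi][pj][pk] + column_score(a, b, c)
                    if cand > S[i][j][k]:
                        S[i][j][k] = cand
    return S[nx][ny][nz]


print(align3_score("AT", "AT", "AT"))
```

The seven entries of `moves` correspond one-to-one to the seven cases of the recurrence; the same construction with `repeat=k` gives the 2^k - 1 neighbors per cell for k sequences.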

11 What Happens to Computational Complexity? (3D)
Given k sequences of length n:
Space for matrix: $O(n^k)$
Neighbors/cell: $2^k - 1$
Time to compute SP score: $O(k^2)$
Overall runtime: $O(k^2 2^k n^k)$ Wow!!!

12 What's so bad about those exponents?
Example: Running Time of DP for MSA
Overall runtime: $O(k^2 2^k n^k)$
# Sequences    Running Time
2              1 second
3              2 minutes
4              5 hours
5              3 weeks
6              9 years
Sequences? Globins, only ~150 aa!!
But: There are fast heuristics
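
To get a feel for how quickly the bound k^2 2^k n^k grows, the short Python sketch below evaluates it for k = 2..6 and n = 150. The function name is an assumption, and the numbers are raw operation counts from the bound. They are not meant to reproduce the timings in the table above, which depend on constant factors and hardware not given on the slide.

```python
def msa_dp_cost(k, n):
    """Value of the bound k^2 * 2^k * n^k for k sequences of length n."""
    return k**2 * 2**k * n**k


n = 150  # roughly the globin length mentioned on the slide
for k in range(2, 7):
    print(k, msa_dp_cost(k, n))
```

Each additional sequence multiplies the bound by a factor of more than n, which is why exact DP becomes impractical after only a handful of sequences.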

13 Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Often: use guide tree to determine order of alignments
2 Examples: Star Alignment, ClustalW
[Figure: multiple alignment built by adding sequences 1, 2, 3, 4]

14 Guide Trees
Binary tree
Leaves correspond to sequences
Internal nodes represent alignments
Root corresponds to final MSA
[Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes align ATC with ATG and TCG with TCC; the root holds the MSA ATC-, ATG-, -TCC, -TCG]

15 Star Alignment - skipped on Monday: will NOT be covered on Exam 1
Back to 2 Examples of Progressive Alignment Heuristics for MSA:
1. STAR Alignment
2. Clustal

16 Star Alignment
Fast heuristic to compute MSA
Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality
Algorithm:
1. Compute pairwise similarities
2. Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
3. Add sequences in decreasing order of similarity to center $s_c$
4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

17 Does that function look familiar?
Step 2 - Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
[Figure: aligned sequences FGGHL-GF, F-GHLPGF, FGGHP-FG, shown with FGGHL-GF]
Steiner consensus sequence or string: given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
"String" equivalent of arithmetic mean: the consensus sequence is the string that minimizes the sum of edit distances to members of a family of strings (thus maximizing similarity score…)
Recall: Consensus sequence = single sequence (more accurately, "model") that represents the most common residue of each column in an MSA

18 Step 3 - Add sequences in decreasing order of similarity to center $s_c$
Sequences: s_1: MPE, s_2: MKE, s_3: MSKE, s_4: SKE. In the star, s_2 is the center.
Pairwise alignments with the center:
MPE / MKE
MSKE / M-KE
MKE / SKE
MSA built in the order S_2 + S_3 + S_1 + S_4:
after adding S_3: MSKE, M-KE
after adding S_1: M-PE, MSKE, M-KE
after adding S_4: S-KE, M-PE, MSKE, M-KE
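
Steps 1 to 3 (pairwise similarities, choice of center, order of addition) can be sketched in Python as below. The slide does not state a scoring scheme, so match = +2, mismatch = -1 and gap = -2 are assumed purely for illustration, and the function names are hypothetical.

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def star_order(seqs):
    """Return (center index, other indices in decreasing similarity to the center)."""
    k = len(seqs)
    sim = [[nw_score(seqs[i], seqs[j]) if i != j else 0 for j in range(k)]
           for i in range(k)]
    # Step 2: the center maximizes the sum of similarities to all other sequences.
    center = max(range(k), key=lambda c: sum(sim[c][i] for i in range(k) if i != c))
    # Step 3: remaining sequences, most similar to the center first (ties keep input order).
    others = sorted((i for i in range(k) if i != center),
                    key=lambda i: sim[center][i], reverse=True)
    return center, others


center, others = star_order(["MPE", "MKE", "MSKE", "SKE"])
print(center, others)
```

Under the assumed scores, the example sequences give MKE (index 1) as the center and the order MSKE, MPE, SKE, which agrees with the S_2 + S_3 + S_1 + S_4 order on the slide. Other scoring schemes may produce ties or a different order.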

19 Step 4 - Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$
Pairwise alignments:
S_c AA--CCTT
S_1 AATGCC--

S_c A-ACC-TT
S_2 AGACCGT-
Merged multiple alignment:
S_1 A-ATGCC---
S_c A-A--CC-TT
S_2 AGA--CCGT-
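
One way to carry out this merge is sketched below in Python: for every slot between center residues, reserve as many columns as the longest insertion any pairwise alignment places there, then pad the other rows with gaps. The function names and the choice to left-justify insertions within a slot are assumptions made for illustration.

```python
def split_by_center(center_aln, other_aln, n):
    """Split a pairwise alignment into insertions (opposite center gaps) and matched symbols."""
    inserts = [""] * (n + 1)  # inserts[j]: residues placed before center residue j
    matched = [""] * n        # matched[j]: symbol aligned to center residue j
    j = 0
    for c, o in zip(center_aln, other_aln):
        if c == "-":
            inserts[j] += o
        else:
            matched[j] = o
            j += 1
    return inserts, matched


def merge_star(pairwise):
    """Merge (center_aln, other_aln) pairs into one MSA that preserves each pairwise alignment."""
    center = pairwise[0][0].replace("-", "")
    n = len(center)
    parts = [split_by_center(c, o, n) for c, o in pairwise]
    # Columns reserved in each slot = longest insertion seen there in any pairwise alignment.
    width = [max(len(ins[j]) for ins, _ in parts) for j in range(n + 1)]
    center_row = "".join("-" * width[j] + center[j] for j in range(n)) + "-" * width[n]
    rows = []
    for ins, matched in parts:
        row = "".join(ins[j].ljust(width[j], "-") + matched[j] for j in range(n))
        rows.append(row + ins[n].ljust(width[n], "-"))
    return center_row, rows


center_row, rows = merge_star([("AA--CCTT", "AATGCC--"), ("A-ACC-TT", "AGACCGT-")])
print(center_row)
for row in rows:
    print(row)
```

For the two pairwise alignments on the slide, this produces the same three rows as the merged alignment shown there.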

20 Complexity of Star Alignment?
Given k sequences of length n, and an upper bound l for alignment length, we need:
$O(k^2 n^2)$ to compute the alignments
$O(k^2)$ to compute the center
$O(k^2 l)$ to build the multiple alignment
Overall: $O(k^2 n^2)$
Duh - Is this really much better than $O(k^2 2^k n^k)$? YES!
Remember: k = # of sequences, n = length of sequences

21 CLUSTAL: Overview (Progressive Alignment)
1. Compute pairwise alignments (DP)
2. Convert similarities into distances. Distance between a pair = # of mismatched positions in alignment (divided by total # of matches)
3. Build guide tree from distances by Neighbor Joining
4. Align with respect to guide tree
[Figure: pairwise alignments (1 + 2, 3 + 4, 1 + 3, 1 + 4, 2 + 4, 2 + 3), then distance matrix, then guide tree, then progressive alignment]
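
Step 2 can be illustrated with a small Python function. The slide's wording is terse, so this sketch assumes that the denominator is the number of aligned positions in which both sequences have a residue (columns containing a gap are skipped); the function name is hypothetical and this is not claimed to be Clustal's exact formula.

```python
def alignment_distance(a, b):
    """Fraction of mismatches among gap-free columns of a pairwise alignment."""
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


print(alignment_distance("ATC", "ATG"))
print(alignment_distance("ATC-", "-TCC"))
```

Applying such a function to every pair of sequences fills the distance matrix that the tree-building step (Neighbor Joining) takes as input.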

22 CLUSTAL: Example
[Figure: CLUSTAL example, labeled 1-5]

23 One "small" problem? Finding the Guide Tree
Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves
Problem: Such a tree might not exist!
[Figure: distance matrix and the guide tree derived from it]

24 CLUSTAL W Tree
Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

25 Algorithms & Software for MSA? #2
√ Exhaustive Methods
Multidimensional dynamic programming (DP)
Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
√ Progressive (Star Alignment, Clustal)
Iterative
Block-based

26 Algorithms & Software for MSA? #3 (will NOT be covered on Exam 1)
Heuristic Methods - continued
Progressive alignments (Star Alignment, Clustal). Others: T-Coffee, DbClustal - see text: can be better than Clustal. Match closely-related sequences first using a guide tree
Partial order alignments (POA): doesn't rely on guide tree; adds sequences in order given
PRALINE: preprocesses input sequences by building profiles for each
Iterative methods: idea is that an optimal solution can be found by repeatedly modifying existing suboptimal solutions (e.g., PRRN)
Block-based alignment: multiple re-building attempts to find best alignment (e.g., DIALIGN2 & Match-Box)
Local alignments: Profiles, Blocks, Patterns - more on these soon!

27 Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
√ Position Specific Scoring Matrices (PSSMs)
√ PSI-BLAST
First, review above briefly, then: Profiles; Markov Models & Hidden Markov Models

28 PSI-BLAST (covered in Lecture 12, so will be covered on Exam 1)
Position Specific Iterated BLAST
Intuition: substitution matrices should be "sensitive" to protein context, e.g., larger penalty for Ala → Gly substitution if in a helix rather than in a loop
Basic idea:
Use BLAST with high stringency to generate a set of closely related sequences
Align those sequences to create a new substitution matrix for each position
Use this matrix (iteratively) to find additional sequences

29 PSI-BLAST Pseudocode
Convert query to PSSM (Position-Specific Scoring Matrix), or a Profile
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
}
Print current set of homologs
One step of the loop requires a user-defined threshold.
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.
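
The control flow of the pseudocode can be made concrete with a toy Python sketch. Nothing here calls BLAST: `search` and `build_model` are caller-supplied placeholders, and the toy database, the "model" (simply the set of sequences found so far) and the one-mismatch search rule are all hypothetical stand-ins chosen so the loop can actually run.

```python
def iterative_search(query, search, build_model, max_iterations=5):
    """PSI-BLAST-style loop; `search` and `build_model` are supplied by the caller."""
    homologs = {query}
    model = build_model(homologs)      # "Convert query to PSSM"
    for _ in range(max_iterations):
        hits = set(search(model))      # "BLAST database with PSSM"
        new = hits - homologs
        if not new:                    # "Stop if no new homologs are found"
            break
        homologs |= new                # "Add new homologs to PSSM"
        model = build_model(homologs)
    return homologs


# Toy stand-ins (hypothetical): no real PSSM or database search is involved.
TOY_DB = ["AAAA", "AAAT", "AATT", "ATTT", "GGGG"]


def toy_build_model(seqs):
    return frozenset(seqs)


def toy_search(model, max_mismatches=1):
    def close(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches
    return [s for s in TOY_DB if any(close(s, m) for m in model)]


found = iterative_search("AAAA", toy_search, toy_build_model)
print(sorted(found))
```

In this toy run each iteration reaches one sequence further from the query, which also shows how an iterative search can drift away from the starting sequence when the inclusion rule is too permissive.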

30 What is a PSSM? Position-Specific Scoring Matrix
A PSSM is:
a representation of a motif
an n by m matrix, where n is size of alphabet & m is length of sequence
a matrix of scores in which the entry at (i, j) is the score assigned by the PSSM to letter i at the jth position
[Table: example PSSM, 20-letter alphabet (rows A R N D C Q E G H I L K M F P S T W Y V) by 8-residue sequence (columns). Most values were garbled in extraction; rows that survived intact:
N   0  6  0  0  0 -3  0  1
Q   1  0  1 -2  5 -3 -2  0
L  -2 -3 -2 -4 -2  0 -4 -3
Example: "K" at position 3 gets a score of 2]
Also sometimes called: Position Weight Matrix (PWM)
Note: Assumes positions are independent
(I added more text to this slide) Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA

31 Assigning a "Match" Score with a PSSM
PSSM assigns sequence NMFWAFGH a score of: 0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12
[Table: the same 20 by 8 PSSM as on slide 30]
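
Scoring a sequence against a PSSM is a per-position lookup followed by a sum, which the Python sketch below illustrates. The PSSM here is deliberately partial: it contains only the entries that appear as terms in the slide's sum, plus the "K at position 3" entry from the previous slide. The list-of-dictionaries layout and the function name are assumptions.

```python
# One dict per position; only entries quoted on the slides are filled in.
PSSM = [
    {"N": 0},
    {"M": -2},
    {"F": -3, "K": 2},
    {"W": -2},
    {"A": -1},
    {"F": 6},
    {"G": 6},
    {"H": 8},
]


def pssm_score(seq, pssm):
    """Sum the position-specific score of each residue (positions treated independently)."""
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column[residue] for residue, column in zip(seq, pssm))


print(pssm_score("NMFWAFGH", PSSM))
```

Because each position is looked up independently, the sketch also makes the "positions are independent" assumption of a PSSM explicit.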

32 Creating a PSSM from 1 Sequence
Query sequence: RNRGQFGH (length L)
Start from the BLOSUM62 matrix (20 by 20) and build a 20 by L PSSM: for each query position, copy the BLOSUM62 column of the residue at that position (e.g., the R column for each R in the query).
[Table: the resulting 20 by 8 PSSM, the same one shown on slide 30]

33 Creating a PSSM from Multiple Sequences
1. Discard columns that contain gaps in query sequence
2. Compute relative sequence weights
3. Compute PSSM entries, taking into account: observed residues in column; sequence weights; substitution matrix

34 1 - Discard Columns with Gaps in Query
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA
After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
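
The column filter can be written in a few lines of Python. The slide does not say which row is the query; the sketch assumes it is the first one (any row with gaps in the four removed columns would behave the same way), and the function name is hypothetical.

```python
def drop_query_gap_columns(alignment, query_index=0):
    """Remove every alignment column in which the query row has a gap."""
    query = alignment[query_index]
    keep = [i for i, symbol in enumerate(query) if symbol != "-"]
    return ["".join(row[i] for i in keep) for row in alignment]


msa = [
    "EEFG----SVDGLVNNA",  # assumed query
    "GGIG----PVD-LVNNA",
    "FEPEGPEKGMWGLVNNA",
]
for row in drop_query_gap_columns(msa):
    print(row)
```

Note that gaps in non-query rows (such as the one in GGIGPVD-LVNNA) are kept; only the query decides which columns survive.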

35 2 - Compute Sequence Weights
Smaller weights are assigned to redundant sequences; larger weights are assigned to unique sequences
EEFGSVDGLVNNA 1.2
QKYGRLDVMINNA 1.2
RRLGTLNVLVNNA 0.8
GGIGPVDLLVNNA 0.8
KALGGFNVIVNNA 1.1
ARFGKIDTLIPNA 0.9
FEPEGMWGLVNNA 1.1
AQLKTVDVLINGA 1.3
How are weights determined? Based on branch lengths in guide tree: the value for each sequence is then used to multiply raw alignment scores
Goal of weighting? To decrease matching scores of frequent characters in MSA & increase scores of infrequent characters
(Info re: weights was added to this slide)
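
As a rough illustration of how such weights can enter the next step, the Python sketch below computes weighted residue frequencies for the first alignment column using the weights listed above. Treating the weights as multipliers on residue counts is an assumption made for this example, and the function name is hypothetical; the slide itself only says the weights multiply raw alignment scores.

```python
from collections import defaultdict


def weighted_column_frequencies(column, weights):
    """Residue frequencies in one column, each sequence counted with its weight."""
    totals = defaultdict(float)
    for residue, weight in zip(column, weights):
        if residue != "-":
            totals[residue] += weight
    norm = sum(totals.values())
    return {residue: total / norm for residue, total in totals.items()}


first_column = "EQRGKAFA"  # first residue of each of the eight sequences
weights = [1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3]
freqs = weighted_column_frequencies(first_column, weights)
print(round(freqs["A"], 4))
```

Unweighted, A would have frequency 2/8; with these weights its share is (0.9 + 1.3) / 8.4, slightly higher because the two A-containing sequences together carry more than a quarter of the total weight.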

36 3 - Compute PSSM Entries (simplified version)
Observed residues in column: E Q R G K A F A
PSSM column = observed residue frequencies / background frequencies
Background frequencies (usually derived from a large sequence database):
A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
T 0.063   V 0.073   W 0.016   Y 0.034
(This slide was modified)

37 PSSM Entries = Log-Odds Scores
Example: observed frequency of residue "A" under the foreground model (i.e., the PSSM) vs. the background model
1. Estimate probability of observing each residue (probability of A given M, where M is the PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is the background model)
3. Take log so that scores can be added (rather than multiplied)
In symbols: $\text{score}(A) = \log \frac{P(A \mid M)}{P(A \mid B)}$
(This slide was modified)
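
The three steps can be sketched in Python for the example column E Q R G K A F A from the previous slide. The background values are the ones listed there; the choice of log base 2, the absence of pseudocounts (unobserved residues are simply omitted), and the function name are assumptions of this simplified sketch.

```python
import math
from collections import Counter

# Background frequencies from the previous slide (only residues seen in the column).
BACKGROUND = {"A": 0.085, "E": 0.065, "F": 0.040, "G": 0.072,
              "K": 0.056, "Q": 0.042, "R": 0.054}


def log_odds_column(column, background):
    """Log-odds score (base 2) of each observed residue in one alignment column."""
    counts = Counter(column)
    total = len(column)
    scores = {}
    for residue, count in counts.items():
        p_model = count / total                     # step 1: P(residue | M)
        odds = p_model / background[residue]        # step 2: divide by P(residue | B)
        scores[residue] = math.log2(odds)           # step 3: take the log
    return scores


scores = log_odds_column("EQRGKAFA", BACKGROUND)
print({residue: round(score, 2) for residue, score in scores.items()})
```

A positive score means the residue is more common in this column than expected from the background; in practice pseudocounts are usually added so that unobserved residues do not get a score of minus infinity.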

38 Why (not) PSI-BLAST?
PSI-BLAST weights sequences according to observed diversity specific to the family under investigation
Advantage: if the sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
Disadvantage: if any non-homologous sequences are included in PSSMs, the PSSMs become "corrupted" and "pull in" additional non-homologous sequences, resulting in false positive hits

39 How to Use PSI-BLAST Effectively
Set initial thresholds high
Inspect each iteration's result for suspicious sequences (when in doubt, leave it out!)
Do several iterations (~5), or until no new sequences are found
Make initial search very broad:
First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
Then use that PSSM to search in a more restricted domain, if possible
Be particularly cautious about matches to sequences with highly biased amino acid content

40 Summary: DP, BLAST & PSI-BLAST
Dynamic programming is O(NM) for pairwise alignment
BLAST is O(M)
BLAST produces an index of words in the query sequence that allows fast matching to the database
At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
PSI-BLAST iterates BLAST, adding new homologs at each iteration

41 Applications of MSA
Building phylogenetic trees
Finding conserved patterns: regulatory motifs (TF binding sites); splice sites; protein domains
Identifying and characterizing protein families: find out which protein domains have the same function
Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
DNA fragment assembly (in genomic sequencing)

42 Application: Discover Conserved Patterns
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
Is there a conserved cis-acting regulatory sequence?
Example: TATA box = transcriptional promoter element
[Figure: sequence logo]

43 Sequence Motifs (Patterns)
Other types of representations?
√ Consensus Sequence
√ PSSM - Position-Specific Scoring Matrix
√ Sequence Logo - "enhanced" consensus sequence, in which symbol size reflects information entropy
Profile HMM - Hidden Markov Model
Information entropy??? "In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data." - Wikipedia
Check out this fun website: Tom Schneider, NCIF http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
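
To make the entropy idea concrete, the Python sketch below computes the Shannon entropy of one alignment column and the corresponding information content, taken here as log2(alphabet size) minus entropy, which is the quantity usually plotted as stack height in a sequence logo. The function names are hypothetical, a 4-letter DNA alphabet is assumed by default, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_content(column, alphabet_size=4):
    """Bits of information: maximum possible entropy minus observed entropy."""
    return math.log2(alphabet_size) - column_entropy(column)


for column in ("AAAA", "AACC", "ACGT"):
    print(column, column_entropy(column), information_content(column))
```

A perfectly conserved column has zero entropy and the maximum information content, while a column with all four bases equally represented carries no information.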

