9/19/07 BCB 444/544 F07 ISU Dobbs #13 - Star Alignment; HMMs


BCB 444/544 Lecture 13
Star Alignment & Clustal (for MSA)
Perhaps: Profiles & Hidden Markov Models (HMMs)
#13_Sept19

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST
    Chp 6 - pp (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Profiles & Hidden Markov Models
    Chp 6 - pp
    Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:
Fri Sept 21 - EXAM 1

Assignments & Announcements
√ Sun Sept 16 - Study Guide for Exam 1 was posted
√ Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
    Lectures 2-12 (thru Mon Sept 17)
    Labs 1-4
    HW2
    All assigned reading: Chps 2-6 (but not HMMs)
    Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
    √ Scoring Function
    √ Exhaustive Algorithms
    Heuristic Algorithms
        Star Alignment
        Clustal
    √ Practical Issues
First, review MSA scoring briefly, then back to Star Alignment & ClustalW

Scoring an Alignment - in Lecture 12, so will be covered on Exam 1
In practice, simple scoring functions are used
Usually, columns are scored independently:
    $S(m) = G + \sum_i S(m_i)$
    where $m_i$ is the ith column of alignment $m$ and $G$ is the gap penalty
[Figure: example multiple alignment illustrating column-by-column scoring]

Sum of Pairs (SP) Score
SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix
Compute for each column $m_i$:
    $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
    where $s$ is the PAM or BLOSUM score and $m_i^l$ is the residue of sequence $l$ in column $m_i$
[Figure: example multiple alignment with one column $m_i$ highlighted]

Example: Calculating SP Score
Alignment M (columns m_1, m_2, m_3):
    F - G
    F Y D
    F - G
Scores used (BLOSUM 60): s(F,F) = 5, s(G,G) = 4, s(G,D) = -3
Gap penalty = -8; s(-,-) = 0
$S(m) = S(m_1) + S(m_2) + S(m_3)$
     $= 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)$
     $= 3(5) + 2(-8) + 0 + 4 + 2(-3)$
     $= -3$
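The worked example above can be checked with a short script. This is a minimal sketch, not the lecture's own code: the function names are made up for illustration, the row order of the three sequences is an assumption (any order gives the same SP score), and the matrix holds only the three BLOSUM entries that the example actually uses.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, matrix, gap_penalty):
    """Score one pair of symbols taken from the same alignment column."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0, as stated on the slide
    if a == GAP or b == GAP:
        return gap_penalty            # residue aligned to a gap
    return matrix[frozenset((a, b))]  # symmetric lookup: s(a,b) = s(b,a)

def sp_score(alignment, matrix, gap_penalty=-8):
    """Sum-of-pairs score: sum s(.,.) over all sequence pairs in every column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b, matrix, gap_penalty)
                     for a, b in combinations(column, 2))
    return total

# Only the entries needed for the slide's example (hypothetical container format).
matrix = {frozenset("F"): 5, frozenset("G"): 4, frozenset("GD"): -3}
alignment = ["F-G", "FYD", "F-G"]

print(sp_score(alignment, matrix))  # -3
```

Columns are scored independently here, which is exactly the simplification the scoring slides describe.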

Algorithms & Software for MSA? #1
Exhaustive Methods
    √ Multidimensional dynamic programming (DP)
    Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"
        web-based version available - see textbook
    Full DP Optimal Global Alignment?
        Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
    Progressive (Star Alignment, Clustal)
    Iterative
    Block-based

Dynamic Programming for MSA
As with pairwise alignments, MSAs can be computed by dynamic programming*
*(if you're not in a rush!)
[Figure: 2D and 3D dynamic programming matrices]

Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:
$$
S(i,j,k) = \max \begin{cases}
S(i-1,\, j-1,\, k-1) + \delta(x_i,\, y_j,\, z_k) \\
S(i-1,\, j-1,\, k) + \delta(x_i,\, y_j,\, -) \\
S(i-1,\, j,\, k-1) + \delta(x_i,\, -,\, z_k) \\
S(i-1,\, j,\, k) + \delta(x_i,\, -,\, -) \\
S(i,\, j-1,\, k-1) + \delta(-,\, y_j,\, z_k) \\
S(i,\, j-1,\, k) + \delta(-,\, y_j,\, -) \\
S(i,\, j,\, k-1) + \delta(-,\, -,\, z_k)
\end{cases}
$$
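A minimal sketch of this three-sequence recurrence follows. It assumes a sum-of-pairs column score with illustrative values (match +1, mismatch -1, residue-gap -2, gap-gap 0); these numbers and the function names are assumptions for demonstration, not values from the lecture. The loop over the 7 non-zero moves corresponds to the 7 cases of the recurrence.

```python
from itertools import product

def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one 3-symbol column (plays the role of delta)."""
    score = 0
    for a, b in ((col[0], col[1]), (col[0], col[2]), (col[1], col[2])):
        if a == "-" and b == "-":
            continue                  # gap-gap pairs score 0
        if a == "-" or b == "-":
            score += gap
        else:
            score += match if a == b else mismatch
    return score

def align3_score(x, y, z):
    """Optimal global SP score for three sequences by full 3D DP."""
    S = {(0, 0, 0): 0}
    moves = [m for m in product((0, 1), repeat=3) if m != (0, 0, 0)]  # 2^3 - 1 = 7
    # product() visits cells in lexicographic order, so predecessors come first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = (x[i - 1] if di else "-",
                   y[j - 1] if dj else "-",
                   z[k - 1] if dk else "-")
            best = max(best, S[(pi, pj, pk)] + sp_column(col))
        S[(i, j, k)] = best
    return S[(len(x), len(y), len(z))]

print(align3_score("ACG", "ACG", "ACG"))  # 9
```

The table holds one cell per (i, j, k) triple, which is why space grows as n^k for k sequences.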

What Happens to Computational Complexity?
Given k sequences of length n:
    Space for matrix: $O(n^k)$
    Neighbors/cell: $2^k - 1$
    Time to compute SP score: $O(k^2)$
    Overall runtime: $O(k^2 2^k n^k)$  Wow!!!

What's so bad about those exponents?
Example: Running Time of DP for MSA
Overall runtime: $O(k^2 2^k n^k)$

# Sequences    Running Time
2              1 second
3              2 minutes
4              5 hours
5              3 weeks
6              9 years

Sequences? Globins: only ~150 aa!!
But: There are fast heuristics
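To get a feel for the growth, the expression k^2 2^k n^k can simply be evaluated. This sketch (the function name is hypothetical) reports only the relative growth of that expression; big-O hides constant factors, so it is not expected to reproduce the wall-clock times in the table above.

```python
def msa_dp_cost(k, n):
    """Value of k^2 * 2^k * n^k, the operation-count expression for full DP."""
    return k**2 * 2**k * n**k

n = 150  # roughly the globin length mentioned on the slide
base = msa_dp_cost(2, n)
for k in range(2, 7):
    # Growth factor relative to the pairwise (k = 2) case.
    print(k, msa_dp_cost(k, n) // base)
```

Each additional sequence multiplies the cost by roughly 2n, which is what makes exhaustive DP impractical beyond a handful of sequences.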

Progressive Alignment
Multiple alignment by adding sequences
Heuristic procedure:
    1. Align most similar sequences first
    2. Add sequences progressively
Often: use guide tree to determine order of alignments
2 Examples:
    Star Alignment
    ClustalW

Guide Trees
Binary tree
Leaves correspond to sequences
Internal nodes represent alignments
Root corresponds to final MSA
[Figure: guide tree for sequences ATC, ATG, TCG, TCC; internal nodes hold the partial alignments and the root holds the final MSA]

Star Alignment - skipped on Monday: will NOT be covered on Exam 1
Back to 2 Examples of Progressive Alignment Heuristics for MSA:
    1. STAR Alignment
    2. Clustal

Star Alignment
Fast heuristic to compute MSA
Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality
Algorithm:
    1. Compute pairwise similarities
    2. Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
    3. Add sequences in decreasing order of similarity to center $s_c$
    4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$
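Steps 1 and 2 of the algorithm can be sketched as follows. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the lecture does not specify them. The four short sequences are the ones used in the Step 3 example of this lecture.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score, keeping only two DP rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Index c maximizing the sum of S(s_c, s_i) over all i != c."""
    def total(c):
        return sum(nw_score(seqs[c], s) for i, s in enumerate(seqs) if i != c)
    return max(range(len(seqs)), key=total)

seqs = ["MPE", "MKE", "MSKE", "SKE"]
c = choose_center(seqs)
print(c, seqs[c])  # 1 MKE
```

With these assumed scores the center is MKE, which matches the center used in the lecture's example; a different scoring scheme could pick a different center.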

Does that function look familiar?
Step 2 - Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
Example alignment from the slide:
    FGGHL-GF
    F-GHLPGF
    FGGHP-FG
    FGGHL-GF
Steiner consensus sequence or string: Given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
"String" equivalent of arithmetic mean: consensus sequence is the string that minimizes the sum of edit distances to members of a family of strings (thus, maximizing similarity score...)
Recall: Consensus sequence = single sequence (more accurately, "model") that represents the most common residue of each column in an MSA

Step 3 - Add sequences in decreasing order of similarity to center $s_c$
Sequences:
    s1: MPE
    s2: MKE (center)
    s3: MSKE
    s4: SKE
Pairwise alignments with the center:
    MPE     MSKE    MKE
    MKE     M-KE    SKE
Progressive build-up:
    s2 + s3:        MSKE
                    M-KE
    + s1:           M-PE
                    MSKE
                    M-KE
    + s4:           S-KE
                    M-PE
                    MSKE
                    M-KE
Order of addition: s2 + s3 + s1 + s4

Step 4 - Produce a multiple alignment M such that for every i: the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$
Pairwise alignments:
    s_c  AA--CCTT
    s_1  AATGCC--

    s_c  A-ACC-TT
    s_2  AGACCGT-
Merged multiple alignment:
    s_1  A-ATGCC---
    s_c  A-A--CC-TT
    s_2  AGA--CCGT-
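One way to perform this merge is to record, for every position of the center, the longest gap run that any pairwise alignment opens there, and then pad every row to that width. The sketch below is an illustrative implementation under that assumption, with hypothetical function names; it is checked against the slide's example but is not taken from the lecture.

```python
def merge_star(pairwise):
    """Merge (gapped_center, gapped_other) pairs into rows of one MSA.

    Returns the merged center row first, then one row per input pair.
    """
    center = pairwise[0][0].replace("-", "")
    n = len(center)
    # gaps_before[j]: widest gap run any alignment opens before center residue j
    # (j == n means after the last residue).
    gaps_before = [0] * (n + 1)
    for c_aln, _ in pairwise:
        j = run = 0
        for ch in c_aln:
            if ch == "-":
                run += 1
            else:
                gaps_before[j] = max(gaps_before[j], run)
                j, run = j + 1, 0
        gaps_before[n] = max(gaps_before[n], run)

    def expand(c_aln, o_aln):
        out, inserted, j = [], [], 0
        for c, o in zip(c_aln, o_aln):
            if c == "-":
                inserted.append(o)        # residue inserted relative to center
            else:
                out.append("".join(inserted).ljust(gaps_before[j], "-"))
                out.append(o)
                inserted, j = [], j + 1
        out.append("".join(inserted).ljust(gaps_before[n], "-"))
        return "".join(out)

    return [expand(center, center)] + [expand(c, o) for c, o in pairwise]

rows = merge_star([("AA--CCTT", "AATGCC--"), ("A-ACC-TT", "AGACCGT-")])
print("\n".join(rows))
```

Residues that two different sequences insert at the same center position end up in shared columns without being aligned to each other, which is the usual limitation of this "once a gap, always a gap" style of merging.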

Complexity of Star Alignment?
Given k sequences of length n, and an upper bound l for alignment length
We need:
    $O(k^2 n^2)$ to compute the alignments
    $O(k^2)$ to compute the center
    $O(k^2 l)$ to build multiple alignment
Overall: $O(k^2 n^2)$
Duh - Is this really much better than $O(k^2 2^k n^k)$? YES!
Remember: k = # of sequences, n = length of sequences

CLUSTAL: Overview
Progressive Alignment: Pairwise Alignments -> Distance Matrix -> Guide Tree
    1. Compute pairwise alignments (DP)
    2. Convert similarities into distances
       Distance between a pair = # of mismatched positions in alignment (divided by total # of matches)
    3. Build guide tree from distances by Neighbor Joining
    4. Align with respect to guide tree
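Step 2 can be illustrated with a small helper. This sketch reads "total # of matches" as the number of aligned residue pairs (columns where neither sequence has a gap); that reading is an assumption, and the function name is hypothetical.

```python
def alignment_distance(a, b):
    """Fraction of mismatches among columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant (assumption)
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)

print(alignment_distance("MPE", "MKE"))    # one mismatch in three aligned positions
print(alignment_distance("MSKE", "M-KE"))  # gap column ignored, no mismatches
```

Applying this to every pair of sequences fills the distance matrix that the guide-tree step consumes.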

CLUSTAL: Example

One "small" problem? Finding the Guide Tree
Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves
    Distance Matrix -> Guide Tree
Problem: Such a tree might not exist!
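Whether such a tree exists can be tested with the four-point condition: a distance matrix fits a tree exactly only if, for every four leaves, the two largest of the three pairwise sums are equal. The lecture does not mention this test; the sketch below is added only to make "might not exist" concrete, and the function name and example matrices are made up.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix d."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must agree
            return False
    return True

# Distances read off a real tree ((A,B),(C,D)) with all edge lengths 1.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]
# Same shape, but one distance perturbed so that no tree fits exactly.
not_tree = [[0, 1, 1, 1.5],
            [1, 0, 1, 1],
            [1, 1, 0, 1],
            [1.5, 1, 1, 0]]

print(is_additive(tree_like), is_additive(not_tree))  # True False
```

Heuristics such as Neighbor Joining still return a tree when the condition fails; the path lengths then only approximate the input distances.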

CLUSTAL W Tree
Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

Algorithms & Software for MSA? #2
√ Exhaustive Methods
    Multidimensional dynamic programming (DP)
    Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"
        web-based version available - see textbook
    Full DP Optimal Global Alignment?
        Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
    √ Progressive (Star Alignment, Clustal)
    Iterative
    Block-based

Algorithms & Software for MSA? #3 (will NOT be covered on Exam 1)
Heuristic Methods - continued
    Progressive alignments (Star Alignment, Clustal)
        Others: T-Coffee, DbClustal - see text: can be better than Clustal
        Match closely-related sequences first using a guide tree
    Partial order alignments (POA)
        Doesn't rely on guide tree; adds sequences in order given
    PRALINE
        Preprocesses input sequences by building profiles for each
    Iterative methods
        Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (e.g., PRRN)
    Block-based Alignment
        Multiple re-building attempts to find best alignment (e.g., DIALIGN2 & Match-Box)
    Local alignments
        Profiles, Blocks, Patterns - more on these soon!

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
    √ Position Specific Scoring Matrices (PSSMs)
    √ PSI-BLAST
First, review above briefly, then:
    Profiles
    Markov Models & Hidden Markov Models

PSI-BLAST (Covered in Lecture 12, so will be covered on Exam 1)
Position Specific Iterated BLAST
Intuition: substitution matrices should be "sensitive" to protein context
    e.g., larger penalty for Ala → Gly substitution if in a helix rather than in a loop
Basic idea:
    Use BLAST with high stringency to generate a set of closely related sequences
    Align those sequences to create a new substitution matrix for each position
    Use this matrix (iteratively) to find additional sequences

PSI-BLAST Pseudocode
    Convert query to PSSM (or a Profile)
    do {
        BLAST database with PSSM
        Stop if no new homologs are found
        Add new homologs to PSSM
    }
    Print current set of homologs
Callout on slide: "This step requires a user-defined threshold"
PSSM = Position-Specific Scoring Matrix
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.
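The control flow of the pseudocode can be written as a runnable loop. This is only a sketch of the iteration logic: `search` and `build_pssm` are hypothetical callables supplied by the caller (they stand in for the BLAST search and PSSM construction and are not real BLAST APIs), and the toy "database" below is invented purely to exercise the loop.

```python
def psi_blast(query, search, build_pssm, max_iterations=5):
    """Iterate: search with the current model, stop when no new homologs appear."""
    homologs = {query}
    pssm = build_pssm(homologs)
    for _ in range(max_iterations):
        new = set(search(pssm)) - homologs
        if not new:                  # "Stop if no new homologs are found"
            break
        homologs |= new              # "Add new homologs to PSSM"
        pssm = build_pssm(homologs)
    return homologs

# Toy stand-ins: the "PSSM" is just the member set, and a "hit" is any
# sequence listed as a neighbor of a current member.
neighbors = {"q": ["a"], "a": ["b"], "b": []}
toy_build = lambda members: frozenset(members)
toy_search = lambda pssm: {n for s in pssm for n in neighbors.get(s, [])}

found = psi_blast("q", toy_search, toy_build)
print(sorted(found))  # ['a', 'b', 'q']
```

In a real run the threshold for accepting a hit as a homolog would live inside `search`, which is where a single false inclusion can start pulling in further unrelated sequences.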

What is a PSSM?
Position-Specific Scoring Matrix; also sometimes called Position Weight Matrix (PWM)
A PSSM is:
    a representation of a motif
    an n by m matrix, where n is size of alphabet & m is length of sequence
    a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position
[Figure: PSSM with one row per letter of the alphabet (A R N D C Q E G H I L K M F P S T W Y V) and one column per position of an 8 residue sequence]
Example: "K" at position 3 gets a score of 2
Note: Assumes positions are independent
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA

Assigning a "Match" Score with a PSSM
PSSM assigns sequence NMFWAFGH a score of 12 (the sum of the scores for the residue found at each position)
[Figure: the same PSSM, with the entry for each residue of NMFWAFGH highlighted]
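Scoring a sequence against a PSSM is a position-by-position lookup followed by a sum. In this sketch the PSSM is stored as a list of per-position dictionaries, and the two-position matrix is a made-up toy; none of the numbers come from the slide.

```python
def pssm_score(pssm, seq):
    """Sum the PSSM entry for the residue observed at each position."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal PSSM length")
    return sum(column[residue] for column, residue in zip(pssm, seq))

# Hypothetical 2-position PSSM over a 2-letter alphabet.
toy_pssm = [{"A": 2, "C": -1},
            {"A": -3, "C": 4}]

print(pssm_score(toy_pssm, "AC"))  # 6
print(pssm_score(toy_pssm, "CA"))  # -4
```

Adding per-position scores is only valid because positions are assumed independent, as noted on the previous slide.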

Creating a PSSM from 1 Sequence
Query sequence (length L): RNRGQFGH
For each position, copy the BLOSUM62 row of the residue found there (e.g., the row for R at positions 1 and 3)
The 20 by 20 BLOSUM62 matrix thus yields a 20 by L PSSM
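The construction amounts to copying one substitution-matrix row per query position. The sketch below uses a toy two-letter substitution matrix as a stand-in for BLOSUM62, so both the alphabet and the function name are illustrative assumptions.

```python
def pssm_from_sequence(seq, sub_matrix):
    """One PSSM column per query residue: a copy of that residue's matrix row."""
    return [dict(sub_matrix[residue]) for residue in seq]

# Toy symmetric substitution matrix over a 2-letter alphabet (stand-in for BLOSUM62).
toy_matrix = {"A": {"A": 4, "G": 0},
              "G": {"A": 0, "G": 6}}

pssm = pssm_from_sequence("AG", toy_matrix)
print(pssm)  # [{'A': 4, 'G': 0}, {'A': 0, 'G': 6}]
```

A PSSM built this way scores any sequence exactly as the underlying substitution matrix would score it against the query without gaps; position-specific information only enters once more sequences are added.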

Creating a PSSM from Multiple Sequences
    1. Discard columns that contain gaps in query sequence
    2. Compute relative sequence weights
    3. Compute PSSM entries, taking into account:
        Observed residues in column
        Sequence weights
        Substitution matrix

1 - Discard Columns with Gaps in Query
Before:
    EEFG----SVDGLVNNA
    QKYG----RLDVMINNA
    RRLG----TLNVLVNNA
    GGIG----PVD-LVNNA
    KALG----GFNVIVNNA
    ARFG----KID-LIPNA
    FEPEGPEKGMWGLVNNA
    AQLK----TVDVLINGA
After:
    EEFGSVDGLVNNA
    QKYGRLDVMINNA
    RRLGTLNVLVNNA
    GGIGPVD-LVNNA
    KALGGFNVIVNNA
    ARFGKID-LIPNA
    FEPEGMWGLVNNA
    AQLKTVDVLINGA
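This filtering step is easy to express in code. The sketch assumes the query is the first row of the alignment (the slide does not say which row is the query, but the first row is consistent with the result shown); the function name is hypothetical.

```python
def drop_query_gap_columns(alignment, query_index=0):
    """Keep only the columns in which the query row has a residue."""
    query = alignment[query_index]
    keep = [i for i, ch in enumerate(query) if ch != "-"]
    return ["".join(row[i] for i in keep) for row in alignment]

before = [
    "EEFG----SVDGLVNNA",
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]
after = drop_query_gap_columns(before)
print("\n".join(after))
```

Note that gaps in other rows (such as the `-` in GGIGPVD-LVNNA) survive; only gaps in the query decide which columns are dropped.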

2 - Compute Sequence Weights
Smaller weights are assigned to redundant sequences
Larger weights are assigned to unique sequences

    Sequence         Weight
    EEFGSVDGLVNNA    1.2
    QKYGRLDVMINNA    1.2
    RRLGTLNVLVNNA    0.8
    GGIGPVDLLVNNA    0.8
    KALGGFNVIVNNA    1.1
    ARFGKIDTLIPNA    0.9
    FEPEGMWGLVNNA    1.1
    AQLKTVDVLINGA    1.3

How are weights determined? Based on branch lengths in guide tree: value for each sequence is then used to multiply raw alignment scores
Goal of weighting? To decrease matching scores of frequent characters in MSA & increase scores of infrequent characters

3 - Compute PSSM Entries (simplified version)
Observed residues in the column: E Q R G K A F A
PSSM column = observed residue frequencies / background frequencies
Background frequencies (one per residue: A C D E F G H I K L M P Q R S T V W Y): usually derived from a large sequence database

PSSM Entries = Log-Odds Scores
    $\text{score}(A) = \log \dfrac{P(A \mid M)}{P(A \mid B)}$
    $P(A \mid M)$: observed frequency of residue "A" under the foreground model M (i.e., the PSSM)
    $P(A \mid B)$: frequency of residue "A" under the background model B
    1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
    2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
    3. Take log so that scores can be added (rather than multiplied)
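The three steps can be sketched for a single column. Several things here are assumptions made for the example rather than the lecture's method: a uniform background of 1/20 per residue, a small pseudocount so that unobserved residues get a finite score, base-2 logs, and no sequence weights. The column used is the first column of the alignment from the earlier slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_column(residues, background=None, pseudocount=1.0):
    """log2(P(a | column) / P(a | background)) for every residue a."""
    if background is None:
        # Assumption: uniform background; real ones come from large databases.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(residues)
    n = len(residues)
    scores = {}
    for aa, bg in background.items():
        freq = (counts[aa] + pseudocount * bg) / (n + pseudocount)  # step 1
        scores[aa] = math.log2(freq / bg)                           # steps 2-3
    return scores

scores = log_odds_column("EQRGKAFA")
for aa in "AEC":
    print(aa, round(scores[aa], 2))
```

Residue A, seen twice in the column, scores highest; residues seen once score lower but still positive; residues never seen score negative.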

Why (not) PSI-BLAST?
PSI-BLAST weights sequences according to observed diversity specific to the family under investigation
Advantage: If sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
Disadvantage: However, if any non-homologous sequences are included in PSSMs, they become "corrupted" and "pull in" additional non-homologous sequences, resulting in false positive hits

How to Use PSI-BLAST Effectively
Set initial thresholds high
Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!)
Do several iterations (~5), or until no new sequences are found
Make initial search very broad
    First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
    Then use that PSSM to search in a more restricted domain, if possible
Be particularly cautious about matches to sequences with highly biased amino acid content

Summary: DP, BLAST & PSI-BLAST
Dynamic programming is O(NM) for pairwise alignment
BLAST is O(M)
    BLAST produces an index of words in query sequence that allows fast matching to the database
    At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
PSI-BLAST iterates BLAST, adding new homologs at each iteration

Applications of MSA
Building phylogenetic trees
Finding conserved patterns:
    Regulatory motifs (TF binding sites)
    Splice sites
    Protein domains
Identifying and characterizing protein families
    Find out which protein domains have same function
Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
Example: TATA box = transcriptional promoter element
[Figure: sequence logo]

Sequence Motifs (Patterns)
Other types of representations?
    √ Consensus Sequence
    √ PSSM - Position-Specific Scoring Matrix
    √ Sequence Logo - "enhanced" consensus sequence, in which symbol size ∝ information entropy
    Profile HMM - Hidden Markov Model
Information entropy??? "In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data." - Wikipedia
Check out this fun website: Tom Schneider, NCIF
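The entropy of an alignment column, and the information content that sets letter heights in a logo, can be computed directly. This is a simplified sketch: it assumes a 4-letter DNA alphabet by default and omits the small-sample correction that logo software typically applies; the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def information_content(column, alphabet_size=4):
    """Maximum possible entropy minus observed entropy (bits)."""
    return math.log2(alphabet_size) - column_entropy(column)

print(column_entropy("AAAA"), information_content("AAAA"))  # fully conserved column
print(column_entropy("ACGT"), information_content("ACGT"))  # fully variable column
```

A perfectly conserved column has zero entropy and the full 2 bits of information, so its letter is drawn tallest; a column with all four bases equally frequent carries no information.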