#8 Finish DP, Scoring Matrices, Stats & BLAST

Presentation transcript:

BCB 444/544 Lecture 8, 9/7/07 (#8_Sept7)
Finish: Dynamic Programming; Global vs Local Alignment; Scoring Matrices & Alignment Statistics
BLAST
BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST

Required Reading (before lecture)
√ Last week - for Lectures 4-7: Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
  Xiong: Chp 3
  Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909 http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
√ Wed Sept 5 - for Lecture 7 & Lab 3: Database Similarity Searching: BLAST (nope, more DP)
  Chp 4 - pp 51-62
Fri Sept - for Lecture 8 (will finish on Monday): BLAST variations; BLAST vs FASTA

Assignments & Announcements
√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM. Send via email to Pete Zaback petez@iastate.edu (for now, no late penalty - just send ASAP)
√ Wed Sept 5 - Notes for Lecture 5 posted online; HW#2 posted online, sent via email & handed out in class
Fri Sept 14 - HW#2 due by 5 PM
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
  √ Evolutionary Basis
  √ Sequence Homology versus Sequence Similarity
  √ Sequence Similarity versus Sequence Identity
  Methods - cont
  Scoring Matrices
  Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods
  √ Global and Local Alignment
  √ Alignment Algorithms
    √ Dot Matrix Method
    Dynamic Programming Method - cont
      Gap penalties
      DP for Global Alignment
      DP for Local Alignment
  Scoring Matrices
    Amino acid scoring matrices
    PAM
    BLOSUM
    Comparisons between PAM & BLOSUM
  Statistical Significance of Sequence Alignment

Dynamic Programming - 4 Steps:
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal alignment(s) that generated optimal score

1- Define Score of Optimal Alignment using Recursion
Define: S(i,j) = score of the optimal alignment of x[1..i] with y[1..j]
  α = Match Reward
  β = Mismatch Penalty
  γ = Gap penalty
Initial conditions: S(0,0) = 0; S(i,0) = -i·γ; S(0,j) = -j·γ
Recursive definition: for 1 ≤ i ≤ N, 1 ≤ j ≤ M:
  S(i,j) = max { S(i-1,j-1) + σ(xi,yj),  S(i-1,j) - γ,  S(i,j-1) - γ }
  where σ(xi,yj) = α (match) or β (mismatch)

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
Construct sequence vs sequence matrix
Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]
(Figure: DP matrix indexed 1..N along one axis and 1..M along the other, with S(0,0)=0 in the upper left corner, S(i,j) at a generic cell, and S(N,M) in the lower right corner.)

How do we calculate S(i,j)? i.e., Score for alignment of x[1..i] to y[1..j]?
1 of 3 cases → optimal score for this subproblem:
  xi aligns to yj:     x1 x2 . . . xi-1 xi   over   y1 y2 . . . yj-1 yj    score: S(i-1,j-1) + σ(xi,yj)
  xi aligns to a gap:  x1 x2 . . . xi-1 xi   over   y1 y2 . . . yj  -      score: S(i-1,j) - γ
  yj aligns to a gap:  x1 x2 . . . xi  -     over   y1 y2 . . . yj-1 yj    score: S(i,j-1) - γ

Specific Example: Scoring Consequence?
Note: I changed sequences on this slide (to match the rest of DP example)
In each case the last column holds position i of x and/or position j of y.
Case 1: Line up xi with yj (Match Bonus)
  x: C - T C G C A
  y: C A T - T C A
Case 2: Line up xi with space (Space Penalty)
  x: C - T C G C - A
  y: C A T - T C A -
Case 3: Line up yj with space (Space Penalty)
  x: C - T C G C A -
  y: C A T - T C - A

Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a pointer matrix)
Initialization: S(0,0) = 0
Recursion: S(i,j) is the best of
  S(i-1,j-1) + σ(xi,yj), where σ(xi,yj) = α or β
  S(i-1,j) - γ
  S(i,j-1) - γ
α = Match Reward, β = Mismatch Penalty, γ = Gap penalty
(Figure: DP matrix from S(0,0) to S(N,M), with arrows into cell S(i,j) from its diagonal and its two adjacent neighbors.)

Fill in the DP matrix !!
+10 for match, -2 for mismatch, -5 for space
Top sequence: C T C G C A G C; left sequence: C A T T C A C
First row (λ against the top sequence): 0 -5 -10 -15 -20 -25 -30 -35 -40
First column (λ against the left sequence): 0 -5 -10 -15 -20 -25 -30 -35
First cells filled in, in the row for the first C: 10, 5
We first compute T[i, j] (the score S(i,j)) for the smallest possible values of i and j, then for increasing values of i and j
Usually performed with a table of size (n + 1) X (m + 1)

3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment
+10 for match, -2 for mismatch, -5 for space

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33

We first compute T[i, j] (the score S(i,j)) for the smallest possible values of i and j, then for increasing values of i and j
Usually performed with a table of size (n + 1) X (m + 1)
The optimal global score is the lower right entry, S(N,M) = 33
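The fill step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `nw_fill` is made up here, the top sequence runs across the columns and the left sequence down the rows, and the space score is passed as a negative number (-5) that is added, which is the same as subtracting a gap penalty of 5.

```python
def nw_fill(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix (hypothetical helper, not from the slides).

    S[i][j] is the best score for aligning left[:i] with top[:j].
    """
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column hold cumulative gap scores.
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        S[i][0] = i * gap
    # Recursion: best of the diagonal, vertical and horizontal predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,  # align the two characters
                          S[i - 1][j] + gap,        # gap in the top sequence
                          S[i][j - 1] + gap)        # gap in the left sequence
    return S


S = nw_fill("CTCGCAGC", "CATTCAC")
print(S[-1][-1])  # 33, the lower right corner S(N,M)
```

Storing the whole matrix costs O(mn) space, which is what makes the traceback in step 4 possible.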


4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of alignment:
  A horizontal move puts a gap in left sequence
  A vertical move puts a gap in top sequence
  A diagonal move uses one character from each sequence

Traceback to Recover Alignment
(Figure: the filled-in DP matrix with traceback arrows drawn from the lower right corner, 33, back to the upper left.)
Can have >1 optimal alignment; this example has 2

Traceback to Recover Alignment
(Figure: the filled-in DP matrix with the traceback arrows in red.)
Where did red arrows come from?

Traceback to Recover Alignment
+10 for match, -2 for mismatch, -5 for space
Where did 33 come from? Match = 10, so 33-10 = 23. Must have come from diagonal.
Where did 23 come from? (Not a match) Left? 28-5 = 23; Diag? 13-2 = 11; Top? 8-5 = 3. Only the left neighbor gives 23, so it came from the left.

Traceback to Recover Alignment
+10 for match, -2 for mismatch, -5 for space
Where did 8 come from? Two possibilities: 13-5 = 8 (from the left) or 10-2 = 8 (from the diagonal)
Then, follow both paths

Traceback to Recover Alignment
Path #1 (top character with left character):
  C with C; - with A; T with T; C with -; G with T; C with C; A with A; G with -; C with C
Great - but what are the alignments?

Traceback to Recover Alignment
Path #2 (top character with left character):
  C with C; - with A; T with T; C with T; G with -; C with C; A with A; G with -; C with C
Great - but what are the alignments?

What are the 2 Global Alignments with Optimal Score = 33?
Top:  C T C G C A G C
Left: C A T T C A C
1:  C - T C G C A G C   (aligned left sequence to be filled in)
2:  C - T C G C A G C   (aligned left sequence to be filled in)
A horizontal move puts a gap in left sequence
A vertical move puts a gap in top sequence
A diagonal move uses one character from each sequence

What are the 2 Global Alignments with Optimal Score = 33?
Top:  C T C G C A G C
Left: C A T T C A C
1:  C - T C G C A G C
    C A T - T C A - C
2:  C - T C G C A G C
    C A T T - C A - C
A horizontal move puts a gap in left sequence
A vertical move puts a gap in top sequence
A diagonal move uses one character from each sequence

Check Traceback?
(Figure: the filled-in DP matrix with the two optimal paths, 1 and 2, each step labelled d = diagonal, h = horizontal or v = vertical.)
A horizontal move puts a gap in left sequence
A vertical move puts a gap in top sequence
A diagonal move uses one character from each sequence
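One way to check the traceback by machine is to enumerate every path that reproduces the stored scores. The sketch below is an assumption-laden illustration: `nw_fill` and `all_global_alignments` are hypothetical names, the left sequence indexes the rows, and the space score is passed as -5. Instead of a separate pointer matrix it re-tests, at each cell, which neighbors could have produced the stored value.

```python
def nw_fill(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global DP matrix; S[i][j] scores left[:i] against top[:j]."""
    S = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        S[0][j] = j * gap
    for i in range(1, len(left) + 1):
        S[i][0] = i * gap
        for j in range(1, len(top) + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


def all_global_alignments(top, left, match=10, mismatch=-2, gap=-5):
    """Return every optimal alignment as a (top_row, left_row) pair of strings."""
    S = nw_fill(top, left, match, mismatch, gap)
    found = []

    def walk(i, j, top_rev, left_rev):
        if i == 0 and j == 0:
            # Rows were built back to front, so reverse them.
            found.append((top_rev[::-1], left_rev[::-1]))
            return
        if i > 0 and j > 0:
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:   # diagonal move
                walk(i - 1, j - 1, top_rev + top[j - 1], left_rev + left[i - 1])
        if j > 0 and S[i][j] == S[i][j - 1] + gap:   # horizontal: gap in left sequence
            walk(i, j - 1, top_rev + top[j - 1], left_rev + "-")
        if i > 0 and S[i][j] == S[i - 1][j] + gap:   # vertical: gap in top sequence
            walk(i - 1, j, top_rev + "-", left_rev + left[i - 1])

    walk(len(left), len(top), "", "")
    return found


for top_row, left_row in all_global_alignments("CTCGCAGC", "CATTCAC"):
    print(top_row)
    print(left_row)
    print()
```

For the example sequences this prints the two alignments with score 33. The number of optimal paths can grow very quickly for longer sequences, so enumerating all of them is only sensible for small examples.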

Local Alignment: Motivation
To "ignore" stretches of non-coding DNA:
  Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  Local alignment between two protein-encoding sequences is likely to be between two exons
To locate protein domains or motifs:
  Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  Local sequence similarities may indicate "functional modules"
Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes, vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
Match: +2; Mismatch or space: -1
Best local alignment:
  G G T C T G A G
  A A A C - G A -
The aligned region is C T G A over C - G A: 2 - 1 + 2 + 2, so Score = 5

Local Alignment: Algorithm
S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
S[i, j] = Score for optimally aligning the prefix x[1..i] with the prefix y[1..j]
Initialize top row & leftmost column of matrix with cumulative gap penalties

Filling in DP Matrix for Local Alignment
+1 for a match, -1 for a mismatch, -5 for a space
(Figure: local-alignment DP matrix for top sequence C T C G C A G C and left sequence C A T T C A C; the nonzero entries shown are 1 and 2.)

Traceback - for Local Alignment
+1 for a match, -1 for a mismatch, -5 for a space
(Figure: the same local-alignment DP matrix, with tracebacks starting from the cells holding the highest score, 2.)

What are the 4 Local Alignments with Optimal Score = 2?
Top:  C T C G C A G C
Left: C A T T C A C
1:  C T C G C A G C   (aligned part of left sequence to be filled in)
2:  C T C G C A G C   (aligned part of left sequence to be filled in)
3:  C T C G C A G C   (aligned part of left sequence to be filled in)
4:  C T C G C A G C   (aligned part of left sequence to be filled in)
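The local-alignment fill differs from the global one in two ways: the first row and column stay at 0, and every cell is floored at 0 so that an alignment can start anywhere. The floor at 0 is the standard Smith-Waterman rule and is stated here as background, since the slide text only shows the initialization. A minimal sketch, with hypothetical function names and the left sequence indexing the rows:

```python
def sw_fill(top, left, match=1, mismatch=-1, gap=-5):
    """Fill the local-alignment DP matrix; H[i][j] is the best score of an
    alignment ending at left[i-1] and top[j-1] (or 0 if nothing scores above 0)."""
    H = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for i in range(1, len(left) + 1):
        for j in range(1, len(top) + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(0,                       # start a fresh alignment here
                          H[i - 1][j - 1] + sigma,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H


def best_cells(H):
    """Return the highest score and every (row, column) cell that holds it."""
    best = max(max(row) for row in H)
    cells = [(i, j) for i, row in enumerate(H) for j, v in enumerate(row) if v == best]
    return best, cells


print(best_cells(sw_fill("CTCGCAGC", "CATTCAC")))
# (2, [(2, 6), (5, 3), (6, 6), (7, 5)])
print(best_cells(sw_fill("GGTCTGAG", "AAACGA", match=2, mismatch=-1, gap=-1))[0])
# 5
```

Each of the four cells with score 2 is the end point of one optimal local alignment; tracing back from a cell until a 0 is reached recovers the aligned segments.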

Some Results re: Alignment Algorithms (for ComS, CprE & Math types!)
Most pairwise sequence alignment problems can be solved in O(mn) time
Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]
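To give a feel for the space saving, the sketch below computes only the optimal global score while keeping two rows of the matrix at a time. This is not the algorithm of [Myers88], which also recovers the alignment itself in linear space; it is just the simpler score-only idea, with a hypothetical function name and the example scoring (+10, -2, -5) as defaults.

```python
def nw_score_two_rows(x, y, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score using O(min(m, n)) extra space."""
    if len(x) > len(y):
        x, y = y, x                     # scoring is symmetric; keep rows short
    prev = [j * gap for j in range(len(x) + 1)]
    for c in y:
        curr = [prev[0] + gap]          # first column: one more gap
        for j in range(1, len(x) + 1):
            sigma = match if c == x[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma,  # diagonal
                            prev[j] + gap,        # from the row above
                            curr[j - 1] + gap))   # from the cell to the left
        prev = curr                     # discard the older row
    return prev[-1]


print(nw_score_two_rows("CTCGCAGC", "CATTCAC"))  # 33
```

Because earlier rows are thrown away, no traceback is possible from this version alone.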

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap
Total Gap Penalty is a linear function of gap length:
  W = γ + δ × (k - 1)
  where γ = gap opening penalty, δ = gap extension penalty, k = length of gap
Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
Can also be solved in O(nm) time using DP
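A small numeric illustration of the formula, compared with the per-space penalty used in the DP example. The opening and extension values (10 and 1) are made-up illustrative numbers, not values given in the lecture, and both function names are hypothetical.

```python
def affine_gap_penalty(k, gap_open=10, gap_extend=1):
    """Total penalty for one gap of length k: opening cost plus (k - 1) extensions."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


def per_space_penalty(k, gap=5):
    """Penalty when every space costs the same, as in the -5 per space example."""
    return gap * k


for k in (1, 2, 4, 8):
    print(k, affine_gap_penalty(k), per_space_penalty(k))
```

With these illustrative numbers a single long gap (k = 8) costs 17 under the affine scheme but 40 under the per-space scheme, so the affine scheme favors one long gap over many scattered short ones.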

Methods
  √ Global and Local Alignment
  √ Alignment Algorithms
    √ Dot Matrix Method
    √ Dynamic Programming Method - cont
      Gap penalties
      DP for Global Alignment
      DP for Local Alignment
  Scoring Matrices
    Amino acid scoring matrices
    PAM
    BLOSUM
    Comparisons between PAM & BLOSUM
  Statistical Significance of Sequence Alignment

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in closely related proteins
Model includes defined rate for each type of sequence change
Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  PAM1 - for less divergent sequences (shorter time)
  PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
Doesn't rely on a specific evolutionary model
Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
  BLOSUM45 - for more divergent sequences
  BLOSUM62 - for less divergent sequences

PAM250 vs BLOSUM62
See Text: Fig 3.5 = PAM250; Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed (it is symmetric)
Here: s(a,b) corresponds to score of aligning character a with character b
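Because s(a,b) = s(b,a), only one triangle of the matrix needs to be stored. The sketch below shows that idea with a tiny two-letter table; the three scores are toy values chosen for illustration and the names `HALF_MATRIX`, `s` and `score_ungapped` are hypothetical. A real program would load a full 20 x 20 PAM or BLOSUM table instead.

```python
# Toy half matrix: keys are alphabetically sorted residue pairs (illustrative values only).
HALF_MATRIX = {
    ("A", "A"): 4,
    ("A", "R"): -1,
    ("R", "R"): 5,
}


def s(a, b):
    """Score for aligning residue a with residue b, using the stored triangle."""
    return HALF_MATRIX[tuple(sorted((a, b)))]


def score_ungapped(u, v):
    """Sum of s(a, b) over the columns of two equal-length, gap-free sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(u, v))


print(s("A", "R"), s("R", "A"))      # same score in either order
print(score_ungapped("ARA", "RRA"))  # -1 + 5 + 4 = 8
```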

Which is Better? PAM or BLOSUM
PAM matrices
  derived from evolutionary model
  often used in reconstructing phylogenetic trees - but, not very good for highly divergent sequences
BLOSUM matrices
  based on direct observations
  more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

Which Type of Matrix Should You Use?
Several other types of matrices available:
  Gonnet & Jones-Taylor-Thornton: very robust in tree construction
"Best" matrix depends on task: different matrices for different applications
ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result

Sequence Alignment Statistics
Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
"Gumbel extreme value distribution" - a highly skewed distribution with a long tail

How to Assess Statistical Significance of an Alignment?
Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
If score is in extreme margin, then unlikely due to random chance
P-value = probability that original alignment is due to random chance (lower P is better)
  P = 10^-5 to 10^-50: sequences have clear homology
  P > 10^-1: no better than random
Check out: PRSS (Probability of Random Shuffles) http://www.ch.embnet.org/software/PRSS_form.html
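The shuffling idea can be sketched as a small permutation test. This is an illustration only, not how PRSS is implemented: PRSS fits an extreme value distribution to the shuffled scores, whereas this sketch just counts how often a shuffled sequence scores at least as well. The function names, the number of shuffles and the scoring defaults are all assumptions.

```python
import random


def sw_score(x, y, match=1, mismatch=-1, gap=-5):
    """Best local alignment score, keeping only two rows of the DP matrix."""
    prev = [0] * (len(x) + 1)
    best = 0
    for c in y:
        curr = [0]
        for j in range(1, len(x) + 1):
            sigma = match if c == x[j - 1] else mismatch
            curr.append(max(0, prev[j - 1] + sigma, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Estimate P as the fraction of shuffled versions of y scoring >= the real score."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    observed = sw_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)            # same composition, random order
        if sw_score(x, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)


print(shuffle_p_value("CTCGCAGC", "CATTCAC"))
seq = "ACGTTGCAAGCTTAGCCATGGATC"
print(shuffle_p_value(seq, seq))
```

With only 200 shuffles the smallest P that can be reported is 1/201, so very small P-values such as 10^-5 need either far more shuffles or a fitted distribution.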

Chp 4 - Database Similarity Searching
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching
  Unique Requirements of Database Searching
  Heuristic Database Searching
  Basic Local Alignment Search Tool (BLAST)
  FASTA
  Comparison of FASTA and BLAST
  Database Searching with Smith-Waterman Method

Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
  guaranteed to give best answer (identifies optimal solution)
  can be very time/space intensive!
  e.g., Dynamic Programming as in Smith-Waterman algorithm
Heuristic - does NOT test every possibility
  no guarantee that answer is best (but, often can identify optimal solution)
  sacrifices accuracy (potentially) for speed
  uses "rules of thumb" or "shortcuts"
  e.g., BLAST & FASTA

Today's Lab: focus on BLAST (Basic Local Alignment Search Tool)
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters) from query sequence
2. Search database to identify sequences that contain matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match (seed) in both directions, while calculating alignment score at each step
5. Continue extension until score drops below a threshold (due to mismatches)
Contiguous aligned segment pair (no gaps) is called: High Scoring Segment Pair (HSP)
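The seed-and-extend steps can be sketched as a toy program. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only (protein BLAST also accepts similar "neighborhood" words scored with a substitution matrix), plain match/mismatch scores, and a stopping rule that halts once the running score falls more than `x_drop` below the best score seen. All function names and parameter values are assumptions.

```python
def query_words(query, w=3):
    """Every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def find_seeds(query, subject, w=3):
    """(query_pos, subject_pos) pairs where a query word occurs exactly in subject."""
    index = {}
    for i, word in enumerate(query_words(query, w)):
        index.setdefault(word, []).append(i)
    seeds = []
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, w=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of one seed; returns (score, aligned query segment)."""
    score = best = w * match
    lo, hi = q, q + w                        # best segment so far is query[lo:hi]
    i, j = q + w, s + w
    while i < len(query) and j < len(subject):   # extend to the right
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, hi = score, i
        elif best - score > x_drop:
            break
    score = best
    i, j = q - 1, s - 1
    while i >= 0 and j >= 0:                     # extend to the left
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, lo = score, i
        elif best - score > x_drop:
            break
        i -= 1
        j -= 1
    return best, query[lo:hi]


query, subject = "AACGTTGCA", "TTCGTTGGA"
print(find_seeds(query, subject))            # [(2, 2), (3, 3), (4, 4)]
print(extend_seed(query, subject, 2, 2))     # (5, 'CGTTG')
```

The three seeds lie on the same diagonal and extend to the same ungapped segment pair, which is why practical implementations merge or skip overlapping seeds.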

Lab3: focus on BLAST (Basic Local Alignment Search Tool)
BLAST Results?
Original version of BLAST? List of HSPs = Maximum Scoring Pairs
More recent, improved version of BLAST? Allows gaps: Gapped Alignment
How? Allows score to drop below threshold (but only temporarily)

BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
Word length? Typically:
  3 aa for protein sequence
  11 nt for DNA sequence
Substitution matrix? Default is BLOSUM62
  Can change under Algorithm Parameters
  Choose other BLOSUM or PAM matrices
Stop-Extension Threshold? Typically:
  22 for proteins
  20 for DNA

BLAST - Statistical Significance?
1. E-value: E = m x n x P
  m = total number of residues in database
  n = number of residues in query sequence
  P = probability that an HSP is result of random chance
  lower E-value, less likely to result from random chance, thus higher significance
2. Bit Score: S'
  normalized score, to account for differences in sequence length & size of database
3. Low Complexity Masking
  remove repeats that confound scoring
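For reference, a common way to write these quantities comes from Karlin-Altschul statistics, which is background added here rather than something stated on the slide: E = K·m·n·e^(-λS) for a raw score S, the bit score is S' = (λS - ln K) / ln 2, and the two combine to E = m·n·2^(-S'). The sketch below just evaluates those formulas. The values of λ and K depend on the scoring system; the numbers used in the example, like the database and query sizes, are illustrative placeholders.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, K):
    """Expected number of chance HSPs with at least this raw score."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Same E-value, computed from the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative placeholder parameters (assumptions, not values from the lecture).
lam, K = 0.318, 0.134
m, n = 1_000_000, 250       # residues in database, residues in query

for raw in (40, 60, 80):
    bits = bit_score(raw, lam, K)
    print(raw, round(bits, 1), e_value_from_raw(raw, m, n, lam, K))
```

The E-value falls exponentially as the score rises and grows in proportion to both database size and query length, which is why the same alignment can look less significant against a larger database.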

BLAST - a Family of Programs: Several different BLAST "flavors"
BLASTN -
BLASTP -
BLASTX -
TBLASTN -