Published by Carmella Simmons. Modified over 8 years ago.
1
9/6/07 BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST
BCB 444/544 Lab 3: BLAST, Scoring Matrices & Alignment Statistics
Sept 6
2
Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
  guaranteed to give best answer (identifies optimal solution)
  can be very time/space intensive!
  e.g., Dynamic Programming as in Smith-Waterman algorithm
Heuristic - does NOT test every possibility
  no guarantee that answer is best (but often can identify optimal solution)
  sacrifices accuracy (potentially) for speed
  uses "rules of thumb" or "shortcuts"
  e.g., BLAST & FASTA
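To make the "exhaustive" side concrete, here is a minimal sketch of Smith-Waterman scoring by dynamic programming. The match, mismatch and gap values are illustrative assumptions (not from the slides), and a simple linear gap cost is used to keep the example short.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Every cell of the (len(a)+1) x (len(b)+1) table is filled, so both
    time and space grow as O(len(a) * len(b)).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Because the whole table is filled, the result is guaranteed optimal for the chosen scoring scheme, which is exactly the cost a heuristic such as BLAST tries to avoid.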
3
Today's Lab: focus on BLAST
Basic Local Alignment Search Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters) from query sequence
2. Search database to identify sequences that contain matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match (seed) in both directions, while calculating alignment score at each step
5. Continue extension until score drops below a threshold (due to mismatches)
6. Contiguous aligned segment pair (no gaps) is called a High Scoring Segment Pair (HSP)
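The steps above can be sketched in a few lines of Python. This is a simplified illustration, not NCBI's implementation: it seeds only on exact word matches, uses a hypothetical +1/-1 identity score instead of a substitution matrix, and stops extending once the running score falls more than `drop` below the best score seen. All function names and parameter values are assumptions made for this example.

```python
def identity_score(x, y):
    """Toy scoring function (assumption): +1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def query_words(query, w=3):
    """Step 1: map every length-w word in the query to its start positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def extend(query, subject, qi, si, w, score_fn, drop):
    """Steps 3-5: score the seed, then extend it without gaps in both directions.

    Returns (score, query_start, subject_start, length) of the best segment.
    """
    seed = sum(score_fn(query[qi + k], subject[si + k]) for k in range(w))
    # Extend to the right of the seed.
    best_r, run, right, k = 0, 0, 0, w
    while qi + k < len(query) and si + k < len(subject):
        run += score_fn(query[qi + k], subject[si + k])
        if run > best_r:
            best_r, right = run, k - w + 1
        if best_r - run > drop:
            break
        k += 1
    # Extend to the left of the seed.
    best_l, run, left, k = 0, 0, 0, 1
    while qi - k >= 0 and si - k >= 0:
        run += score_fn(query[qi - k], subject[si - k])
        if run > best_l:
            best_l, left = run, k
        if best_l - run > drop:
            break
        k += 1
    return seed + best_l + best_r, qi - left, si - left, left + w + right


def find_hsps(query, subject, w=3, drop=3, score_fn=identity_score):
    """Steps 2 and 6: find seeds in the subject and collect the extended HSPs."""
    words = query_words(query, w)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in words.get(subject[si:si + w], []):
            hsps.add(extend(query, subject, qi, si, w, score_fn, drop))
    return sorted(hsps, reverse=True)
```

For example, `find_hsps("TTTACGTACCC", "GGGACGTACAAA")` reports the shared segment `ACGTAC` as the top-scoring HSP, found without ever filling a full dynamic programming table.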
4
Today's Lab: focus on BLAST
Basic Local Alignment Search Tool
Results?
  Original version of BLAST? List of HSPs = Maximum Scoring Pairs
  More recent, improved version of BLAST? Allows gaps: Gapped Alignment
  How? Allows score to drop below threshold (but only temporarily)
5
BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
Word length? Typically:
  3 aa for protein sequence
  11 nt for DNA sequence
Substitution matrix? Default is BLOSUM62
  Can change under Algorithm Parameters
  Choose other BLOSUM or PAM matrices
Stop Extension Threshold? Typically:
  22 for proteins
  20 for DNA
6
BLAST - a few more details
BLAST is a family of programs with several "variants"
  BLASTN -
  BLASTP -
  BLASTX -
  TBLASTN -
  TBLASTX -
Statistical Significance?
  E-value: E = m x n x P
    m = total number of residues in database
    n = number of residues in query sequence
    P = probability that an HSP is result of random chance
    lower E-value, less likely to result from random chance, thus higher significance
  Bit Score: S' is normalized, to account for sequence length differences & size of database
Low Complexity Masking - remove repeats that confound scoring
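A small numeric sketch may help show how the E-value scales with search space. The first function is the slide's formula as written. The second assumes the standard Karlin-Altschul relation in which P is taken as 2 raised to the negative bit score; that relation is not stated on the slide, and the numbers in the usage note are made up for illustration.

```python
def e_value(m, n, p):
    """E = m x n x P, as given on the slide.

    m: total number of residues in the database
    n: number of residues in the query
    p: probability that an HSP is the result of random chance
    """
    return m * n * p


def e_value_from_bit_score(m, n, bit_score):
    """Same product with P = 2 ** -bit_score (assumption: Karlin-Altschul form)."""
    return m * n * 2.0 ** (-bit_score)
```

With the same hit, doubling the database size m doubles E, which is why an alignment that looks significant against a small database can look unremarkable against a large one.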
7
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
8
PAM Matrix
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in closely related proteins
  Model includes defined rate for each type of sequence change
  Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
    PAM1 - for less divergent sequences (shorter time)
    PAM250 - for more divergent sequences (longer time)
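The link between the suffix number and "time" can be illustrated with a toy example. In the Dayhoff construction, mutation probabilities for larger n are obtained by multiplying the PAM1 probability matrix by itself n times. The sketch below uses a made-up two-letter alphabet with a 1% change per step; the values in `TOY_PAM1` are hypothetical and are not real amino acid data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(pam1, n):
    """Return pam1 raised to the n-th power (n repeated evolutionary steps)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter alphabet: each letter stays unchanged with
# probability 0.99 per step and switches with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = extrapolate(TOY_PAM1, 250)
```

After 250 steps the probability of seeing the original letter has fallen from 0.99 to just over 0.5 in this toy model, which is the sense in which PAM250 describes more divergent sequences than PAM1.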
9
BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
  Doesn't rely on a specific evolutionary model
  Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
    BLOSUM45 - for more divergent sequences
    BLOSUM62 - for less divergent sequences
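As a rough illustration of how observed substitution frequencies become matrix entries, the sketch below computes a log-odds score in half-bit units, the scaling used for BLOSUM62. The function name and the frequencies in the usage note are hypothetical, and `p_ab` is assumed to be the frequency of the ordered pair (a, b) so that the expected frequency is simply `q_a * q_b`.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded.

    p_ab: observed frequency of a aligned with b (ordered pair, assumption)
    q_a, q_b: background frequencies of a and b
    """
    expected = q_a * q_b
    return round(2 * math.log2(p_ab / expected))
```

With made-up background frequencies of 0.1 each, a pair observed twice as often as expected (`p_ab = 0.02`) scores +2, a pair observed half as often (`p_ab = 0.005`) scores -2, and a pair observed exactly as often as expected scores 0.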
10
BLOSUM62 Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
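A short sketch of how s(a,b) is used: the score of an ungapped alignment is the sum of s(a,b) over its aligned columns. The table below holds only the BLOSUM62 entries for five amino acids (A, I, L, V, W), and the helper names are assumptions made for this example.

```python
# Subset of BLOSUM62 covering A, I, L, V, W (upper triangle only).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0, ("A", "W"): -3,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "V"): 3, ("I", "W"): -3,
    ("L", "L"): 4, ("L", "V"): 1, ("L", "W"): -2,
    ("V", "V"): 4, ("V", "W"): -3,
    ("W", "W"): 11,
}


def s(a, b):
    """Look up s(a, b); the matrix is symmetric, so the order does not matter."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def score_ungapped(x, y):
    """Sum s(a, b) over the columns of two equal-length aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(x, y))
```

For instance, `score_ungapped("WILA", "WLVA")` adds 11 (W/W), 2 (I/L), 1 (L/V) and 4 (A/A). Note how the rare tryptophan match contributes far more than the alanine match.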
12
Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties
  used to reflect cost differences between opening a gap and extending an existing gap
Total Gap Penalty is a linear function of gap length:
  $W = \gamma + \delta \times (k - 1)$
  where
    $\gamma$ = gap opening penalty
    $\delta$ = gap extension penalty
    $k$ = length of gap
Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
Alignment with affine gap penalties can also be solved in O(nm) time using DP
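The total gap penalty (opening penalty plus extension penalty times the gap length minus one) is easy to check numerically. In this sketch the function names and the penalty values of 10 and 1 are illustrative assumptions, not defaults of any particular program.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty for one gap of length k: open + extend * (k - 1)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (k - 1)


def constant_gap_penalty(k, penalty):
    """Constant penalty: the same cost regardless of gap length."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return penalty


# One gap of length 4 versus four separate gaps of length 1
# (illustrative values: opening = 10, extension = 1).
one_long_gap = affine_gap_penalty(4, 10, 1)
four_short_gaps = 4 * affine_gap_penalty(1, 10, 1)
```

Here the single long gap costs 13 while the four separate gaps cost 40, so the affine scheme favors a few long gaps over many scattered ones.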