Published by Viljem Kuzmanović. Modified over 6 years ago.
1
BCB 444/544 Lecture 9 Finish: Scoring Matrices & Alignment Statistics
#9 Scoring Statistics BCB 444/544 9/10/07 Lecture 9 Finish: Scoring Matrices & Alignment Statistics BLAST vs FASTA (not yet!) Smith-Waterman Algorithm #9_Sept10 BCB 444/544 F07 ISU Dobbs #9 - Scoring Statistics BCB 444/544 Fall 07 Dobbs
2
Required Reading (before lecture)
Mon Sept 10 - for Lecture 9
  BLAST variations; BLAST vs FASTA, SW
  Chp 4 - pp 51-62
Wed Sept 12 - for Lecture 10 & Lab 4
  Multiple Sequence Alignment (MSA)
  Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 11
  Position Specific Scoring Matrices & Profiles
  Chp 6 - pp (but not HMMs)
Good Additional Resource re: Sequence Alignment? Wikipedia:
3
Assignments & Announcements - #1
Revised Grading Policy has been posted online (see Handout) - Please review!
Mon Sept 10 - Lab 3 Exercise due 5 PM: to:
Thu Sept 13 - Graded Lab 3 will be returned at beginning of Lab 4
Fri Sept HW#2 due by 5 PM (106 MBB)
  Study Guide for Exam 1 will be posted by 5 PM
4
Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2 will be posted on by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept Exam 1 - Will cover:
  Lectures 2-12
  Labs 1-4
  HW2
  All assigned reading: Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming
5
Chp 3- Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
  √ Evolutionary Basis
  √ Sequence Homology versus Sequence Similarity
  √ Sequence Similarity versus Sequence Identity
  √ Methods - (Dot Plots, DP; Global vs Local Alignment)
  √ Scoring Matrices (PAM vs BLOSUM)
  Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
6
First, let's re-visit DP for Local Alignment:
explaining "confusion" in Lecture 8 on Friday was sent on Sunday (so you wouldn't try to do HW2 without a better explanation!)
Answers to DP Examples given in Lectures are included in Lecture PPTs for Lectures 8 (Friday) & 9 (Today):
  Global Alignment
  Local Alignment
7
What are the 2 Global Alignments with Optimal Score = 33?
Top:  C T C G C A G C
Left: C A T T C A C

1:
C - T C G C A G C
C A T T - C A - C

2:
C - T C G C A G C
C A T - T C A - C

Check the scores: +10 for match, -2 for mismatch, -5 for space
(each alignment: 5 matches, 1 mismatch, 3 spaces = 50 - 2 - 15 = 33)
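One way to check the optimal score is to fill in the global-alignment DP matrix by program. The sketch below is a minimal illustration under the slide's scoring scheme (+10 match, -2 mismatch, -5 space); the function name and variable names are made up for this example and are not from the lecture.

```python
def global_alignment_score(top, left, match=10, mismatch=-2, space=-5):
    """Return the optimal global alignment score (score only, no traceback)."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs one space per character.
    for j in range(1, cols):
        table[0][j] = j * space
    for i in range(1, rows):
        table[i][0] = i * space
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two characters
                table[i - 1][j] + space,     # space in the top sequence
                table[i][j - 1] + space,     # space in the left sequence
            )
    return table[rows - 1][cols - 1]


optimal = global_alignment_score("CTCGCAGC", "CATTCAC")
print(optimal)
```

Unlike local alignment, the optimal global score is always read from the lower right corner of the matrix.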
8
Local Alignment: Motivation
To "ignore" stretches of non-coding DNA:
  Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  Local alignment between two protein-encoding sequences is likely to be between two exons
To locate protein domains or motifs:
  Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities
  Local sequence similarities may indicate "functional modules"

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
  vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
  Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein
9
Local Alignment: Example
G G T C T G A G
A A A C G A

Match: +2
Mismatch or space: -1

Best local alignment:
G G T C T G A G
A A A C - G A -
(aligned region: C T G A over C - G A; 3 matches and 1 space = 6 - 1)
Score = 5
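The score in this example can be reproduced with a few lines of code. This is a minimal sketch of local-alignment scoring (score only, no traceback), assuming the slide's scheme of +2 for a match and -1 for a mismatch or space; the function name is hypothetical.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, space=-1):
    """Return the best local alignment score between seq_a and seq_b."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    # Top row and leftmost column stay at 0.
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            table[i][j] = max(
                0,                           # never let a cell go negative
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + space,
                table[i][j - 1] + space,
            )
            best = max(best, table[i][j])    # optimum can be anywhere in the matrix
    return best


example_score = local_alignment_score("GGTCTGAG", "AAACGA")
print(example_score)
```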
10
Local Alignment: Algorithm
This slide has been changed!
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
   In local alignment, no negative scores
   Assign "0" to cells with negative scores
3) Optimal score? In highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)
11
Local Alignment DP: Initialization & Recursion
New Slide
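As a sketch of what initialization and recursion look like for local alignment with a simple per-space penalty: the symbols below are assumed notation, not necessarily the slide's. F(i,j) is the matrix cell for the first i characters of x and the first j characters of y, s(x_i, y_j) is the match/mismatch score, and d > 0 is the penalty per space (d = 5 in the example that follows).

```latex
\begin{aligned}
F(i,0) &= 0, \qquad F(0,j) = 0 \\[4pt]
F(i,j) &= \max
\begin{cases}
0 \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) - d \\
F(i,\,j-1) - d
\end{cases}
\end{aligned}
```

The only differences from the global recursion are the zero initialization and the extra "0" option inside the max.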
12
Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"
[DP matrix figure, partially filled: top sequence C T C G C A G C, left sequence C A T T C A C]
+1 for match, -1 for mismatch, -5 for space
13
Traceback - for Local Alignment
[DP matrix figure with traceback: top sequence C T C G C A G C, left sequence C A T T C A C]
+1 for match, -1 for mismatch, -5 for space
14
What are the 4 Local Alignments with Optimal Score = 2?
C T C G C A G C
C A T T C A C

1:
2:
3:
4:
15
What are the 4 Local Alignments with Optimal Score = 2?
C T C G C A G C
C A T T C A C

1: (top positions 5-6, left positions 1-2)
C A
C A

2: (top positions 2-3, left positions 4-5)
T C
T C

3: (top positions 5-6, left positions 5-6)
C A
C A

4: (top positions 2-5, left positions 4-7)
T C G C
T C A C

Check the scores: +1 for match, -1 for mismatch, -5 for space
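The four optimal local alignments can also be found by program: fill in the matrix, locate every cell holding the maximum, and trace back from each until a "0" is reached. The sketch below assumes the slide's scoring scheme (+1 match, -1 mismatch, -5 space); the function name is hypothetical, and for simplicity it follows only one traceback path per optimal cell (preferring the diagonal).

```python
def local_alignments(top, left, match=1, mismatch=-1, space=-5):
    """Return (best score, list of (top segment, left segment)), one per optimal cell."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]

    def pair_score(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                0,
                table[i - 1][j - 1] + pair_score(i, j),
                table[i - 1][j] + space,
                table[i][j - 1] + space,
            )

    best = max(max(row) for row in table)
    alignments = []
    if best == 0:
        return best, alignments

    for start_i in range(1, rows):
        for start_j in range(1, cols):
            if table[start_i][start_j] != best:
                continue
            i, j = start_i, start_j
            top_part, left_part = [], []
            # Trace back until a cell containing 0 is reached.
            while table[i][j] > 0:
                if table[i][j] == table[i - 1][j - 1] + pair_score(i, j):
                    top_part.append(top[j - 1])
                    left_part.append(left[i - 1])
                    i, j = i - 1, j - 1
                elif table[i][j] == table[i - 1][j] + space:
                    top_part.append("-")
                    left_part.append(left[i - 1])
                    i -= 1
                else:
                    top_part.append(top[j - 1])
                    left_part.append("-")
                    j -= 1
            alignments.append(("".join(reversed(top_part)),
                               "".join(reversed(left_part))))
    return best, alignments


best_score, found = local_alignments("CTCGCAGC", "CATTCAC")
print(best_score)
for top_segment, left_segment in found:
    print(top_segment, left_segment)
```

With a space penalty this large relative to the match score, none of the optimal alignments contains a gap.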
16
Some Results re: Alignment Algorithms (for ComS, CprE & Math types)
Most pairwise sequence alignment problems can be solved in O(mn) time
Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

for Biologists: Big O notation, used when analyzing algorithms for efficiency, refers to the time or number of steps it takes to solve a problem, expressed as a function of the size of the problem
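To give a feel for the space saving, the sketch below computes a global alignment score while keeping only two rows of the DP matrix in memory. It illustrates just the easy half of the idea: recovering the alignment itself in linear space needs an additional divide-and-conquer step, which is not shown here. The function name and the default scores (+10, -2, -5, as in the earlier global example) are choices made for this illustration.

```python
def global_score_two_rows(top, left, match=10, mismatch=-2, space=-5):
    """Optimal global alignment score using O(len(top)) memory instead of O(mn)."""
    previous = [j * space for j in range(len(top) + 1)]
    for i in range(1, len(left) + 1):
        current = [i * space] + [0] * len(top)
        for j in range(1, len(top) + 1):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            current[j] = max(
                previous[j - 1] + pair,   # diagonal
                previous[j] + space,      # from the row above
                current[j - 1] + space,   # from the cell to the left
            )
        previous = current                # the older row is no longer needed
    return previous[-1]


two_row_score = global_score_two_rows("CTCGCAGC", "CATTCAC")
print(two_row_score)
```

The running time is still proportional to m x n; only the memory use changes.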
17
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
18
PAM Matrix: Point Accepted Mutation
I added 2 bullets to this slide
Relies on "evolutionary model" based on observed differences in closely related proteins [Dayhoff78]
Model includes defined rate for each type of sequence change
Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  e.g., PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed
PAM1 matrix is used as basis for calculating other matrices: assumes that repeated mutations would follow same pattern as those in PAM1 matrix, and multiple substitutions can occur at the same site
PAM1 - for less divergent sequences (shorter time)
PAM250 - for more divergent sequences (longer time)
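The idea of deriving higher-numbered PAM matrices from PAM1 can be sketched as repeated matrix multiplication of a mutation probability matrix with itself. The example below uses a made-up two-letter alphabet with a 1% chance of change per step, purely for illustration; it is not Dayhoff's data, and the later conversion of probabilities into log-odds scores is not shown.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, n):
    """Multiply a square matrix by itself n times (n = 0 gives the identity)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical "PAM1-like" matrix: each letter stays the same with probability 0.99.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

After 250 steps the probability of seeing the original letter is far below 0.99 but not 0.99 to the 250th power either, because multiple substitutions at the same site (including changes back) are accounted for.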
19
BLOSUM: BLOck SUbstitution Matrix
I added 2 bullets to this slide
Based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins (in BLOCKS database) [Henikoff & Henikoff92]
Doesn't rely on a specific evolutionary model
Suffix number (n) reflects expected similarity: avg % aa identity in MSA from which matrix was generated
  e.g., BLOSUM62 is derived from sequence alignments of proteins with no more than 62% identity
Blocks database contains ungapped aligned segments corresponding to the most highly conserved regions of proteins
BLOSUM45 - for more divergent sequences
BLOSUM62 - for less divergent sequences
20
Scoring Matrices: What are the scores?
See Xiong Textbook:
  Fig 3.5 = PAM250
  Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed (it is symmetric)
s(a,b) corresponds to score of aligning character a with character b
These are log-odds scores: each entry ~ log( freq(observed) / freq(expected) )
  + : more likely than random
  0 : at random base rate
  - : less likely than random
21
Log-odds scoring
What are the odds that this alignment is meaningful?
  x1 x2 x3 ... xN
  y1 y2 y3 ... yN
If sequences are not related: we're observing a chance event, & the probability is:
  $P(x, y \mid \text{unrelated}) = \prod_{i} p_{x_i} \prod_{i} p_{y_i}$
  where px is the probability of x, py is probability of y
If sequences are related by evolution: they are derived from a common ancestor, & the probability is:
  $P(x, y \mid \text{related}) = \prod_{i} p_{x_i y_i}$
  where pxy is the joint probability that x and y evolved from the same ancestor
22
Log-odds scoring matrix
Odds ratio = Relative likelihood of the 2 possibilities:
  $\frac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})} = \prod_{i} \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$
Alignment score = Log-odds ratio:
  $S = \sum_{i} s(x_i, y_i)$, where $s(x_i, y_i) = \log \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$
Thus, s(xi, yi) gives the substitution matrix score for the pair xi, yi. Together, all the scores s(xi, yi) define the log-odds scoring matrix
23
How do we estimate s(x, y)?
The score for matching x and y is:
  $s(x, y) = \log \frac{p_{xy}}{p_x\, p_y}$
pxy is the probability that x and y are substituted for one another (aligned) in related sequences
px is the probability of amino acid x (on average ~ 5% with 20 amino acids; similarly for py)
Trusted (manual) alignments of related sequences provide information about biologically permissible mutations
Frequency of amino acid substitutions in trusted alignments is used to generate substitution matrices
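A small numeric illustration of the log-odds idea may help. In the sketch below all probabilities are made-up values (not taken from any real substitution matrix), log base 2 is an arbitrary choice, and the function name is hypothetical.

```python
import math


def log_odds_score(p_xy, p_x, p_y, base=2.0):
    """Log of (observed pair frequency) / (frequency expected by chance)."""
    return math.log(p_xy / (p_x * p_y), base)


# Hypothetical background frequencies: 5% each, so chance pairing = 0.05 * 0.05 = 0.0025.
p_x = p_y = 0.05

more_than_chance = log_odds_score(0.01, p_x, p_y)      # pair seen 4x more often than chance
at_chance = log_odds_score(0.0025, p_x, p_y)           # pair seen exactly as often as chance
less_than_chance = log_odds_score(0.000625, p_x, p_y)  # pair seen 4x less often than chance

print(more_than_chance, at_chance, less_than_chance)
```

Because the scores are logarithms, adding them up along an alignment corresponds to multiplying the underlying odds ratios column by column.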
24
A Few Words about Parameter Selection in Sequence Alignment
Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function
In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically meaningful alignments and alignment scores?
25
Which is Better Substitution Matrix? PAM or BLOSUM
PAM matrices
  derived from evolutionary model
  often used in reconstructing phylogenetic trees - but not very good for highly divergent sequences
BLOSUM matrices
  based on direct observations
  more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment
26
Empirical Tests May be Needed:
Several other types of matrices available:
  Gonnet & Jones-Taylor-Thornton: very robust in tree construction
"Best" matrix depends on task: different matrices for different applications
ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result
27
How Should Gaps be Scored?
So far, we've used a simple linear gap penalty function:
  a gap of length k incurs penalty w(k) = k x (penalty per space)
However, in biological sequences, gaps often occur in clusters:
  AGKLAVRSTMIESTRVILTWRKW
  AGKLAVRS------RVILTWRKW
More realistic? "Affine" gap penalty: penalty for one long gap is smaller than penalty for many smaller gaps that add up to same size
  w(k) = (gap opening penalty) + (k - 1) x (gap extension penalty)
28
Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap
Total Gap Penalty is a function of gap length:
  W = γ + δ x (k - 1)
  where γ = gap opening penalty, δ = gap extension penalty, k = length of gap
Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
Can also be solved in O(nm) time using DP
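The effect of an affine penalty is easy to see numerically. The sketch below compares one gap of length 6 with six separate gaps of length 1; the opening and extension values (10 and 1) are arbitrary illustrative choices, and the function name is hypothetical.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Total penalty for one gap: opening cost plus (length - 1) extensions."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(6, gap_open=10, gap_extend=1)
six_short_gaps = 6 * affine_gap_penalty(1, gap_open=10, gap_extend=1)
print(one_long_gap, six_short_gaps)
```

Setting the extension penalty equal to the opening penalty gives back the simple linear penalty, where only the total number of spaces matters.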
29
Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty
Alignment score is sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap
  a b c - - d
  a c c e f d
  => (match scores: values from substitution matrix) - (gap opening + extension: 10 + 2) = 12 (alignment score)
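The calculation can be sketched in code. The example below scores an already-aligned pair of rows, charging the opening penalty for the first space in a gap and the extension penalty for each additional space. The substitution values (+9 for identical letters, -3 otherwise) are illustrative assumptions and are not the values on the slide; the gap opening (10) and extension (2) values follow the slide. The function name is hypothetical.

```python
def score_alignment(row1, row2, sub_score, gap_open, gap_extend):
    """Score two aligned rows of equal length; '-' marks a space."""
    total = 0
    gap_row = None  # which row the current gap run is in: 1, 2, or None
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            current_row = 1 if x == "-" else 2
            if gap_row == current_row:
                total -= gap_extend      # continuing an existing gap
            else:
                total -= gap_open        # first space of a new gap
                gap_row = current_row
        else:
            total += sub_score(x, y)     # value from the substitution matrix
            gap_row = None
    return total


def toy_matrix(x, y):
    """Assumed toy substitution scores, for illustration only."""
    return 9 if x == y else -3


alignment_score = score_alignment("abc--d", "accefd", toy_matrix,
                                  gap_open=10, gap_extend=2)
print(alignment_score)
```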
30
Sequence Alignment Statistics
Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
"Gumbel extreme value distribution" - a highly skewed distribution with a long tail toward high scores
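For reference, the Gumbel extreme value distribution has a simple closed form. In the sketch below, μ (the location of the peak) and λ (which controls how fast the tail decays) are notation assumed for this illustration, since the slide gives no symbols.

```latex
P(S \ge x) \;=\; 1 - \exp\!\left(-e^{-\lambda (x - \mu)}\right)
```

For scores x well above μ, this probability falls off roughly as $e^{-\lambda (x - \mu)}$, which is the long tail referred to above.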
31
How to Assess Statistical Significance of an Alignment?
Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
If score is in extreme margin, then unlikely due to random chance
P-value = probability that an alignment score at least this high would arise by random chance (lower P means alignment more significant)
  P = sequences have clear homology
  P > alignment is no better than random
Check out: PRSS (Probability of Random Shuffles)
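The shuffling idea can be sketched in a few lines. The example below is only an illustration in the spirit of a shuffle test, not the PRSS program itself: it estimates an empirical P-value by counting how often a shuffled sequence scores at least as high as the original. The test sequence, the scoring values (+2, -1, -1), the number of shuffles, and all function names are assumptions made for this example.

```python
import random


def local_score(seq_a, seq_b, match=2, mismatch=-1, space=-1):
    """Best local alignment score, computed two rows at a time."""
    previous = [0] * (len(seq_a) + 1)
    best = 0
    for char_b in seq_b:
        current = [0] * (len(seq_a) + 1)
        for j, char_a in enumerate(seq_a, start=1):
            pair = match if char_a == char_b else mismatch
            current[j] = max(0, previous[j - 1] + pair,
                             previous[j] + space, current[j - 1] + space)
            best = max(best, current[j])
        previous = current
    return best


def shuffle_p_value(seq_a, seq_b, shuffles=200, seed=0):
    """Fraction of shuffled versions of seq_b scoring >= the real score."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = local_score(seq_a, seq_b)
    letters = list(seq_b)
    at_least_as_good = 0
    for _ in range(shuffles):
        rng.shuffle(letters)           # same composition, random order
        if local_score(seq_a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_good + 1) / (shuffles + 1)


sequence = "ACGTTGCAACGTAGCTAGCT"       # hypothetical test sequence
p_identical = shuffle_p_value(sequence, sequence)
print(p_identical)
```

With only a few hundred shuffles the smallest P-value that can be reported is about 1/(number of shuffles), which is why real tools fit an extreme value distribution to the shuffled scores rather than simply counting.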