Pairwise Sequence Alignment (cont.)


Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 6, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

4 Basic Questions in Pairwise Alignment
- Given: X = x1,…,xn and Y = y1,…,ym
- Possible alignments of X and Y: A = {a1,…,ak}
- Model: a scoring function s: A → ℝ
- Task: find the best alignment(s) a*, e.g., S(a*) = 21
- Q1: How should we define s? (Modeling evolution)
- Q2: How should we define A? (Application-specific)
- Q3: How can we find a* quickly? (Dynamic programming)
- Q4: Is the alignment biologically meaningful, or just the best alignment of two unrelated sequences?
- Q1 & Q4 are related! (Models for scores)

The Rest of This Lecture
- Q4: How to assess the significance of an alignment score?
  - Classic approach: extreme value distribution
  - Bayesian approach: model comparison
- Q1: How to define the scoring function?
  - Define the substitution score s
  - Define the gap penalty function g

First, Q4: Assessing Score Significance
- In general, a larger s ⇒ a more significant alignment. The question is how large s should be.
- Factors to be considered:
  - Sequence length: longer sequences are expected to give higher scores
  - Number of sequences in the database: the score of the best alignment is expected to be higher for a larger DB
  - Evolution time: longer evolution causes more mismatches, making a lower score more significant
- The challenge is how to quantify all these…

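Two Basic Approaches
- The classical approach: extreme value distribution
  - Assume a null (random) model M0 for scores
  - P(Score > s | M0, x, y) = ?
- The Bayesian approach: model comparison
  - Assume two models for (x, y): random, M0; aligned, M1
  - P(M1|x,y) / P(M0|x,y) = ?
  - By Bayes' rule, the posterior odds factor as
    $\frac{P(M_1 \mid x,y)}{P(M_0 \mid x,y)} = \frac{P(x,y \mid M_1)}{P(x,y \mid M_0)} \cdot \frac{P(M_1)}{P(M_0)}$
    where the log of the first factor is the log-odds score of the alignment and the second factor is the prior odds.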

Extreme Value Distribution
- EVD: the asymptotic distribution of the maximum M_N of a series of N independent normal random variables is
  $P(M_N \le x) \approx \exp\left(-e^{-\lambda (x - \mu)}\right)$
  where λ (scale) and μ (location, the mode of the distribution) are constants.
- In general, the maximum of a large number of separate scores follows this distribution
- Example: the best local match score between two long sequences

EVD of the Best Score in Ungapped Local Alignment
- The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean
  $E(S) = K m n \, e^{-\lambda S}$
  where K and λ are parameters and m, n are the sequence lengths.
- The probability that there is a match of score greater than S is
  $P(x > S) = 1 - e^{-E(S)}$
- K and λ can be fit using randomly generated data
- This gives a way to test statistical significance, e.g., p(x>21) = 0.01 vs. p(x>21) = 0.3
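As a rough illustration of the significance test above, the sketch below evaluates the expected number of chance matches, K·m·n·e^(−λS), and the resulting probability of at least one such match. The function names and the values of K and λ are made up for illustration; in practice both parameters have to be fit to randomly generated data for the scoring system in use.

```python
import math

def expected_hits(score, m, n, K, lam):
    """Mean of the Poisson count of unrelated local matches scoring above `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one unrelated match scoring above `score`."""
    # 1 - exp(-E), written with expm1 for accuracy when E is tiny
    return -math.expm1(-expected_hits(score, m, n, K, lam))

# Hypothetical parameters (assumptions, not fitted values)
K, lam = 0.1, 0.3

# The same score is less surprising when the sequences are longer
print(p_value(21, m=100, n=100, K=K, lam=lam))
print(p_value(21, m=1000, n=1000, K=K, lam=lam))
```

The comparison shows why sequence lengths must enter the calculation: a fixed score becomes less significant as the search space m·n grows.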

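Bayesian Model Comparison
- Assumptions:
  - M is a model for related sequences
  - R is a model for unrelated (random) sequences
  - Ungapped alignment, n = m
  - Alignment of each pair is independent
- Posterior log-odds:
  $\log \frac{P(M \mid x,y)}{P(R \mid x,y)} = \log \frac{P(x,y \mid M)}{P(x,y \mid R)} + \log \frac{P(M)}{P(R)}$
  The first term is the score S(x,y); the second term is the prior (subjective!).
- Under the independence assumption, the score is a sum over aligned positions:
  $S(x,y) = \sum_i s(x_i, y_i), \qquad s(a,b) = \log \frac{p(ab \mid M)}{p(a \mid R)\, p(b \mid R)}$
- This partially addresses Q1: how to design the scoring function?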
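A minimal sketch of the model comparison, assuming an ungapped alignment of equal-length sequences and independent positions. The two-letter alphabet, the probability tables, and the function names are hypothetical, chosen only to keep the arithmetic visible.

```python
import math

def log_odds_score(x, y, p_joint, q):
    """S(x,y): sum over positions of log( p(a,b|M) / (q(a) * q(b)) )."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(math.log(p_joint[(a, b)] / (q[a] * q[b])) for a, b in zip(x, y))

def posterior_related(score, prior_related):
    """P(M|x,y) from the log-odds score and the (subjective) prior P(M)."""
    # Posterior log-odds = score + log prior odds; the logistic function
    # converts log-odds back into a probability.
    log_posterior_odds = score + math.log(prior_related / (1.0 - prior_related))
    return 1.0 / (1.0 + math.exp(-log_posterior_odds))

# Hypothetical toy model over the alphabet {A, B}
q = {"A": 0.5, "B": 0.5}                          # background frequencies under R
p_joint = {("A", "A"): 0.4, ("B", "B"): 0.4,      # pair probabilities under M
           ("A", "B"): 0.1, ("B", "A"): 0.1}

S = log_odds_score("AAB", "AAB", p_joint, q)
print(S, posterior_related(S, prior_related=0.01))
```

A small prior P(M), as in a database search where most sequences are unrelated, means the score has to be correspondingly larger before the posterior favors M.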

Q1: How to Estimate Probabilities?
- General idea: exploit sequences with known ("reliable") alignments
- Simplest method: maximum likelihood estimator
- Improved method: consider evolution time (phylogenetic tree, to be covered later)
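The simplest method can be sketched as frequency counting over trusted ungapped alignments. This is an illustrative sketch, not the procedure behind any published matrix: the function name and toy data are hypothetical, pairs are counted symmetrically, and no pseudocounts are added, so unobserved pairs simply receive no score.

```python
import math
from collections import Counter

def estimate_scores(aligned_pairs):
    """Maximum likelihood estimates of p(a,b|M), q(a), and s(a,b) = log(p/(q*q))."""
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in aligned_pairs:
        for a, b in zip(x, y):
            # Count each aligned column in both orders so p(a,b) = p(b,a)
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    p = {ab: c / total_pairs for ab, c in pair_counts.items()}
    q = {a: c / total_residues for a, c in residue_counts.items()}
    s = {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}
    return p, q, s

# Hypothetical "reliable" alignment over a two-letter alphabet
p, q, s = estimate_scores([("AABB", "AABA")])
print(s[("A", "A")], s[("A", "B")])
```

Identities come out with positive scores and the mismatch with a negative one, because matches occur more often in the alignment than chance pairing of the background frequencies would predict.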

Dayhoff PAM Matrices
- Estimate p(b|a,t,M) (substitution probabilities) rather than p(ba|M)
- Use sufficiently similar sequence pairs to estimate p(b|a,t=1,M)
- Compute p(b|a,t+1,M) based on p(b|a,t,M)
- Compute the score matrix (e.g., PAM 250)
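The extrapolation step can be sketched as repeated matrix multiplication: if P(1)[a][b] = p(b|a,t=1,M), then P(t+1) = P(t)·P(1). The code below is a simplified illustration on a hypothetical two-letter alphabet; it uses natural-log scores without the scaling and rounding applied to published PAM tables, and the helper names are made up.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def substitution_probs(P1, t):
    """Return P(t) with P(t)[a][b] = p(b|a,t,M), for an integer t >= 1."""
    Pt = P1
    for _ in range(t - 1):
        # p(c|a,t+1) = sum over b of p(b|a,t) * p(c|b,1)
        Pt = mat_mul(Pt, P1)
    return Pt

def score_matrix(Pt, q):
    """Log-odds scores s(a,b) = log( p(b|a,t) / q(b) ), natural log, unscaled."""
    n = len(Pt)
    return [[math.log(Pt[a][b] / q[b]) for b in range(n)] for a in range(n)]

# Hypothetical one-step substitution matrix; each row sums to 1
P1 = [[0.99, 0.01],
      [0.02, 0.98]]
q = [2 / 3, 1 / 3]   # stationary frequencies of P1, used as the background

P250 = substitution_probs(P1, 250)
print(score_matrix(P250, q))
```

As t grows, each row of P(t) approaches the background frequencies, so the scores shrink toward zero; this is the sense in which high-numbered PAM matrices are suited to distant relationships.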

BLOSUM Matrices
- Limitation of PAM: short-time substitutions are dominated by trivial changes in the codon triplets
- BLOSUM tries to improve the estimation of p(ab|M,t) by re-sampling the aligned, ungapped sequence regions (e.g., based on PAM)
- Time t is now tied to a threshold of sequence similarity, leading to different variants (e.g., BLOSUM50 & BLOSUM62)

Estimating Gap Penalties
- Again, the basic idea is to exploit known alignments
- Basic assumptions:
  - The gap-open score d is linear in log(t)
  - The gap-extend score e is constant
- Example: $\gamma(g) = A + B \log(t) + C \log(g)$, the penalty for a gap of length g at evolution time t, with constants A, B, and C
- In practice, people choose the gap costs empirically for given substitution scores.
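For concreteness, the sketch below evaluates two gap penalty functions: the common affine form built from a gap-open cost d and a gap-extend cost e, and the logarithmic example above. The sign convention (penalties returned as negative scores), the function names, and all constants are assumptions for illustration only.

```python
import math

def affine_gap(g, d, e):
    """Affine penalty for a gap of length g >= 1: open once, extend g - 1 times."""
    return -d - (g - 1) * e

def log_gap(g, t, A, B, C):
    """Example form A + B*log(t) + C*log(g) for gap length g and evolution time t."""
    return A + B * math.log(t) + C * math.log(g)

# Hypothetical constants, chosen only to show the shape of each function
for g in (1, 2, 5, 10):
    print(g, affine_gap(g, d=12, e=2), log_gap(g, t=250, A=-8.0, B=-1.5, C=-2.0))
```

The affine penalty keeps growing linearly with gap length, while the logarithmic form flattens out, so long gaps cost relatively less under the second model.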