CISC667, F05, Lec7, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence pairwise alignment Score statistics –Bayesian –Extreme value distribution.

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Lecture 6 CS5661 Pairwise Sequence Analysis-V Relatedness –“Not just important, but everything” Modeling Alignment Scores –Coin Tosses –Unit Distributions.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
Random Walks and BLAST Marek Kimmel (Statistics, Rice)
Hidden Markov Models Theory By Johan Walters (SR 2003)
Searching Sequence Databases
Lecture 6, Thursday April 17, 2003
S. Maarschalkerweerd & A. Tjhang1 Probability Theory and Basic Alignment of String Sequences Chapter
Lecture outline Database searches
Heuristic alignment algorithms and cost matrices
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Genome Analysis 2007 Lecture 7 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Iterative homology searching (PSI-BLAST)
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
CHAPTER 6 Statistical Analysis of Experimental Data
Comparing Database Search Methods & Improving the Performance of PSI-BLAST Stephen Altschul.
CISC667, F05, Lec6, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment Smith-Waterman (local alignment)
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Protein Sequence Comparison Patrice Koehl
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Statistics in Bioinformatics May 12, 2005 Quiz 3-on May 12 Learning objectives-Understand equally likely outcomes, counting techniques (Example, genetic.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sequence comparison: Local alignment
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Statistical hypothesis testing – Inferential statistics I.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karin-Altschul.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Comp. Genomics Recitation 3 The statistics of database searching.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.
Significance in protein analysis
. Fasta, Blast, Probabilities. 2 Reminder u Last classes we discussed dynamic programming algorithms for l global alignment l local alignment l Multiple.
Chapter 7 Point Estimation of Parameters. Learning Objectives Explain the general concepts of estimating Explain important properties of point estimators.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
CISC667, S07, Lec7, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms:
Describing a Score’s Position within a Distribution Lesson 5.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Scoring Sequence Alignments Calculating E
Blast Basic Local Alignment Search Tool
Sequence comparison: Local alignment
CONCEPTS OF HYPOTHESIS TESTING
Sequence comparison: Significance of similarity scores
Pairwise Sequence Alignment (cont.)
CISC 667 Intro to Bioinformatics (Spring 2007) Review session for Mid-Term CISC667, S07, Lec14, Liao.
Lecture 6: Sequence Alignment Statistics
Sequence comparison: Significance of similarity scores
1-month Practical Course Genome Analysis Iterative homology searching
Searching Sequence Databases
Presentation transcript:

CISC667, F05, Lec7, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence pairwise alignment Score statistics –Bayesian –Extreme value distribution –E-value

CISC667, F05, Lec7, Liao Significance of scores Goals for sequence alignments: (1) whether and (2) how two sequences are related. It is rare that you have just two particular sequences to compare. More often, you have one query sequence and a large database of sequences. Database searching: find all sequences in the database that are related to the query sequence. Solution: (1)For each sequence in the database, use Smith-Waterman/FASTA/BLAST to align with the query sequence and return the score of the optimal alignment. (2) Rank the sequences by the score. Q: how good is a score?

CISC667, F05, Lec7, Liao Score statistics [Karlin & Altschul 1990] -The score of an ungapped alignment is H i,j = max{H i-1,j-1 + s(x i,y i ), 0} - ∑ a,b  alphabet s(a,b)p(a)p(b) < 0  most regions receive zero score. -The scores of individual sites are independent. -The landscape of non-zero regions are “islands” in the sea. -The optimal alignment score S is the global maximum of these island peaks: S = max{σ 1, σ 2, σ 1,… σ κ }

CISC667, F05, Lec7, Liao –The island peak σ i satisfies a Poisson distribution: Pr(σ i > x) = (constant) e -λ x –The parameter λ is the positive root of the equation ∑ a,b  alphabet e s(a,b) p(a) p(b) = 1

CISC667, F05, Lec7, Liao The probability that the maximum S is smaller than x is P(S x) ) ] → exp[- κ e -λx ] when κ → . This is a form of Extreme Value Distribution. Definition: Take N samples from the density g(u), the probability that the largest amongst them is less than x is G(x) N, where G(x) = ∫ -∞ x g(u) du. The density is Ng(x)G(x) N-1. The limit for large N of Ng(x)G(x) N-1 is called the EVD for g(x). (Draw a EVD curve). p-value = probability of at least one sequence scoring with S > x in the given database. P(S > x) = 1- exp[- κ e -λx ]. E-value = expected number of scores better than S in a database search. E(S) = kmn e –λS.

CISC667, F05, Lec7, Liao Notes: All of the above discussions only applicable to local alignments. For gapped local alignments, same statistics are believed to apply, although not proved. The trick is to learn parameters λ and K. These values depend upon the substitution matrix and can be estimated from randomly generated data. Score statistics for global alignments are not well known. Q: What is a bit score in the blast search result? A: The bit score is defined as S’ = (λS – ln K)/ln2 it is then convenient to calculate the e-value E(S) = mn 2 –S’

CISC667, F05, Lec7, Liao Question: What threshold to use? Answer: No absolute answer, it varies. E.g., E-value < 1. E(S) = kmn e –λS.  S > T + log(mn)/λ. More details are referred to the text and the following. 1.Y.K. Yu & T. Hwa, “Statistical significance of probabilistic sequence alignment and related local hidden Markov models”, J. Computational Biology 8(2001)

CISC667, F05, Lec7, Liao Bayesian perspective P(M|x,y) = P(x,y|M)P(M)/P(x,y) = P(x,y|M)P(M) / [P(x,y|R)P(R) + P(x,y|M)P(M)] [P(x,y|M)P(M)/P(x,y|R)P(R)] = [1+ P(x,y|M)P(M)/P(x,y|R)P(R)] if S = log[P(x,y|M)/P(x,y|R)], this is the score returned from a SW alignment. S’ = S + log[P(M)/P(R)] then P(M|x,y) = exp(S’) / [1 + exp(S’)] (draw a curve and explain)

CISC667, F05, Lec7, Liao