Sequence comparison: Significance of similarity scores


Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

One-minute responses Sample problems and lecture are both helpful. I like the practice problems. Interesting topics. Good, thought-provoking examples. I enjoy the practice problems and would be interested in a harder example. Example problems are too easy. Example problems are challenging but not too challenging. Great contents covered and pace. I like including optional problems. It is helpful to have Lindsay walk around during class with explanations. I like learning the terminology. I like spending more time with coding and sample problems. Lots of new Python commands, but no way around that. I like when each example adds to the complexity of the code. Appreciated the review problems for the matrix. Enjoyed learning about global and local alignment.

One-minute responses Lecture pace was good. Great speed. Good pacing. x5 This was the first class where I was able to complete the two practice problems. Thank you for slowing down. Might be nice to have a compiled list of commands. I need more practice with these commands. I got a little lost at the local alignment explanation. I prefer to read before class, so knowing the reading assignment in advance would be helpful. For a challenge, could you suggest how many lines of code to use in the examples? For the practice problems, it would be helpful to have 1 minute of silence at the start of each problem.

Local DP in equation form
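
The equation on this slide did not survive extraction. A common way to write the local alignment recurrence is sketched below. It assumes a linear gap penalty d that is negative and added to the score (as with d = -5 in the example in this deck); the symbols F for the DP matrix and s(x_i, y_j) for the substitution score are assumed names, not taken from the slide.

```latex
F(i,0) = 0, \qquad F(0,j) = 0

F(i,j) = \max
\begin{cases}
0 \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
```

The extra 0 in the maximum is what keeps every cell non-negative; dropping it (and initializing the borders with cumulative gap penalties) gives the global recurrence.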

Local alignment Two differences with respect to global alignment: No score is negative. Traceback begins at the highest score in the matrix and continues until you reach 0. Global alignment algorithm: Needleman-Wunsch. Local alignment algorithm: Smith-Waterman.

Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d = -5. [Substitution matrix over A, C, G, T; the entries that survive extraction are 2, -7 and -5.]

Local alignment (solution) [DP matrix for AAG against GAAGGC; the surviving cell values are 2, 4 and 6.] Optimal local alignment: AAG aligned with AAG.
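
The worked problem can be checked with a short Smith-Waterman sketch. The full substitution matrix is not recoverable from the transcript, so the scores used here are an assumption pieced together from the surviving values: 2 for a match, -5 for a transition (A<->G, C<->T) and -7 for a transversion. The function names are illustrative.

```python
# Assumed substitution scores (reconstructed, not confirmed by the slide):
# match = 2, transition (A<->G, C<->T) = -5, transversion = -7.
PURINES = {"A", "G"}


def sub_score(a, b):
    """Return the assumed substitution score for nucleotides a and b."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -5 if same_class else -7


def smith_waterman(x, y, gap=-5):
    """Return (best score, aligned x, aligned y) for the best local alignment."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,  # local alignment: no score is negative
                F[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback starts at the highest score and stops when a 0 is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = smith_waterman("AAG", "GAAGGC")
print(score, top, bottom)  # 6 AAG AAG
```

Under these assumed scores the best path runs 2, 4, 6 along one diagonal, which agrees with the values that survive in the slide.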

Are these proteins homologs?

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
           L P    W L   Y N    Y C     L
SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF
NO (score = 9)

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
           L P    W LDATYKNYA  Y C     L
SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF
MAYBE (score = 15)

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
       RVV L PS   W LDATYKNYA  Y CDVTYKL
SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF
YES (score = 24)
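
The slide does not say how these three scores were computed. Counting the aligned positions where both sequences carry the same residue reproduces 9, 15 and 24, so the sketch below assumes that scheme; `identity_score` is an illustrative name.

```python
def identity_score(a, b):
    """Count aligned positions where both sequences have the same residue."""
    return sum(1 for p, q in zip(a, b) if p == q and p != "-")


seq1 = "RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY"
candidates = {
    "NO": "QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF",
    "MAYBE": "QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF",
    "YES": "RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF",
}

for label, seq2 in candidates.items():
    print(label, identity_score(seq1, seq2))
# NO 9
# MAYBE 15
# YES 24
```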

Significance of scores [Figure: the sequences HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE are passed to a homology detection algorithm, which returns a score of 45.] Low score = unrelated. High score = homologs. How high is high enough?

Other significance questions Pairwise sequence comparison scores. Gene expression measurements. Sequence motif scores.

The null hypothesis We are interested in characterizing the distribution of scores from sequence comparison algorithms. We would like to measure how surprising a given score is, assuming that the two sequences are not related. This assumption is called the null hypothesis. The purpose of most statistical tests is to determine whether the observed results provide a reason to reject the hypothesis that they are merely a product of chance factors.

Sequence similarity score distribution Search a randomly generated database of DNA sequences using a randomly generated DNA query. What will be the form of the resulting distribution of pairwise sequence comparison scores? [Plot: frequency versus sequence comparison score, with the curve left as a question mark.]

Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from non-homologous and homologous pairs. [Plot: frequency versus score value; the high scores in the right tail come from homology.]

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database. [Plot: frequency versus score value.]
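
An empirical null distribution of this kind can be simulated on a small scale. The sketch below scores a random DNA query against a random DNA database and prints a text histogram. The scoring values (match +1, mismatch -1, gap -2), the sequence lengths and the database size are all arbitrary assumptions chosen for illustration; BLAST itself is not used.

```python
import random
from collections import Counter


def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman score of x against y (score only, no traceback)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def random_dna(n, rng):
    """Return a uniformly random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def null_scores(query_len=30, db_size=200, seq_len=60, seed=0):
    """Score one random query against db_size random, unrelated sequences."""
    rng = random.Random(seed)
    query = random_dna(query_len, rng)
    return [local_score(query, random_dna(seq_len, rng)) for _ in range(db_size)]


scores = null_scores()
for value, count in sorted(Counter(scores).items()):
    print(f"{value:3d} {'#' * count}")
```

Because every sequence here is random, all of the scores arise by chance, which is exactly the situation the null hypothesis describes.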

Computing a p-value The probability of observing a score > X is the area under the curve to the right of X. This probability is called a p-value. p-value = Pr(data|null). Out of 1685 scores, 28 receive a score of 20 or better. Thus, the p-value associated with a score of 20 is approximately 28/1685 = 0.0166. [Plot: frequency versus score value.]
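
The counting argument translates directly into code. This is a minimal sketch; `empirical_p_value` is an illustrative name, and the score list is made-up data shaped to match the slide's counts (1685 scores, 28 of them at 20 or better).

```python
def empirical_p_value(null_scores, observed):
    """Fraction of null scores that are at least as large as the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


# Hypothetical null scores: 28 values of 25 and 1657 values of 10.
null = [25] * 28 + [10] * (1685 - 28)
print(round(empirical_p_value(null, 20), 4))  # 0.0166
```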

Problems with empirical distributions We are interested in very small probabilities. These are computed from the tail of the distribution. Estimating a distribution with accurate tails is computationally very expensive.

A solution Solution: Characterize the form of the distribution mathematically. Fit the parameters of the distribution empirically, or compute them analytically. Use the resulting distribution to compute accurate p-values.

Extreme value distribution This distribution is characterized by a larger tail on the right.

Computing a p-value The probability of observing a score > 4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)

Extreme value distribution Compute P(S > x) = 1 - exp(-e^(-x)) for x = 4.

Computing a p-value Calculator keys: 4, +/-, inv, ln, +/-, inv, ln, +/-, +, 1, = Solution: 0.018149
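
Read left to right, the key sequence evaluates 1 - exp(-e^(-4)), the right-tail area of the unscaled extreme value distribution. The same calculation in Python, as a quick check (`evd_p_value` is an illustrative name):

```python
import math


def evd_p_value(x):
    """Right-tail probability P(S > x) = 1 - exp(-exp(-x)) for the unscaled EVD."""
    return 1.0 - math.exp(-math.exp(-x))


print(f"{evd_p_value(4):.6f}")  # 0.018149
```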

Scaling the EVD An extreme value distribution derived from, e.g., the Smith-Waterman algorithm will have a characteristic mode μ and scale parameter λ. These parameters depend upon the size of the query, the size of the target database, the substitution matrix and the gap penalties. [Plot panels labeled gap penalty = 5 and gap penalty = 7.]
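
The scaled formula itself is missing from the transcript. Assuming the usual form of the extreme value (Gumbel) distribution with mode μ and scale parameter λ, the right-tail probability would be:

```latex
P(S > x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)
```

Setting μ = 0 and λ = 1 recovers the unscaled expression 1 - exp(-e^(-x)) used for x = 4 earlier in the deck.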

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with 45?
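
A sketch of the calculation, assuming the scaled tail formula P(S > x) = 1 - exp(-e^(-λ(x - μ))), which the transcript does not show explicitly. `scaled_evd_p_value` is an illustrative name; `math.expm1` is used because 1 - exp(-z) loses precision when z is tiny.

```python
import math


def scaled_evd_p_value(x, mu, lam):
    """Right-tail probability 1 - exp(-exp(-lam * (x - mu))) of a scaled EVD."""
    z = math.exp(-lam * (x - mu))
    return -math.expm1(-z)  # same as 1 - exp(-z), but accurate for very small z


p = scaled_evd_p_value(45, mu=25, lam=0.693)
print(f"{p:.3e}")  # roughly 9.565e-07
```

Since λ = 0.693 is close to ln 2 and the score sits 20 units above the mode, the result is close to 2^-20, or about one in a million.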


Summary A distribution plots the frequency of a given type of observation. The area under any distribution is 1. Most statistical tests compare observed data to the expected result according to the null hypothesis. Sequence similarity scores follow an extreme value distribution, which is characterized by a larger tail. The p-value associated with a score is the area under the curve to the right of that score.