Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Searching Sequence Databases
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
March 2006Vineet Bafna Database Filtering. March 2006Vineet Bafna Project/Exam deadlines May 2 – Send to me with a title of your project May 9 –
Fa05CSE 182 CSE182-L4: Keyword matching. Fa05CSE 182 Backward scoring Defin S b [i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m]
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
Fa05CSE 182 CSE182-L5: Position specific scoring matrices Regular Expression Matching Protein Domains.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Similar Sequence Similar Function Charles Yan Spring 2006.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.
Sequence alignment, E-value & Extreme value distribution
CSE182-L5: Scoring matrices Dictionary Matching
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
An Introduction to Bioinformatics
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
Bioinformatics Computing 1 CMP 807 – Day 2 Kevin Galens.
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Vineet Bafna. How can we compute the local alignment itself?
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Dicitionary matching Pattern matching
Alignment IV BLOSUM Matrices
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching

Fa05CSE 182 Class Mailing List To subscribe, send to You can subscribe from the course web page Use the list for all course related queries, discussions,…

Fa05CSE 182 Silly Quiz Name a famous Bioinformatics Researcher Name a famous Bioinformatics Researcher who is a woman

Fa05CSE 182 Scoring DNA DNA has structure.

Fa05CSE 182 DNA scoring matrices So far, we considered a simple match/mismatch criterion. The nucleotides can be grouped into Purines (A,G) and Pyrimidines. Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)

Fa05CSE 182 Scoring proteins –Scoring protein sequence alignments is a much more complex task than scoring DNA Not all substitutions are equal –Problem was first worked on by Pauling and collaborators –In the 1970s, Margaret Dayhoff created the first similarity matrices. “One size does not fit all” Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant Different proteins might evolve at different rates and we need to normalize for that

Fa05CSE 182 PAM 1 distance Two sequences are 1 PAM apart if they differ in 1 % of the residues. PAM 1 (a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] 1% mismatch

Fa05CSE 182 PAM1 matrix Align many proteins that are very similar –Is this a problem? PAM1 distance is the probability of a substitution when 1% of the residues have changed Estimate the frequency P b|a of residue a being substituted by residue b. S(a,b) = log 10 (P ab /P a P b ) = log10(P b|a /P b )

Fa05CSE 182 PAM 1

Fa05CSE 182 PAM distance Two sequences are 1 PAM apart when they differ in 1% of the residues. When are 2 sequences 2 PAMs apart? 1 PAM 2 PAM

Fa05CSE 182 Higher PAMs PAM 2 (a,b) = ∑ c PAM 1 (a,c). PAM 1 (c,b) PAM 2 = PAM 1 * PAM 1 (Matrix multiplication) PAM 250 –= PAM 1 *PAM 249 –= PAM 1 250

Fa05CSE 182 Note: This is not the score matrix: What happens as you keep increasing the power?

Fa05CSE 182 Scoring using PAM matrices Suppose we know that two sequences are 250 PAMs apart. S(a,b) = log 10 (P ab /P a P b )= log 10 (P b|a /P b ) = log 10 (PAM 250 (a,b)/P b )

Fa05CSE 182 BLOSUM series of Matrices Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database. BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability. –In practice BLOSUM62 seems to work very well.

Fa05CSE 182 PAM vs. BLOSUM What is the correspondence? PAM1 Blosum1 PAM2 Blosum2 Blosum62 PAM250 Blosum100

Fa05CSE 182 Dictionary Matching, R.E. matching, and position specific scoring

Fa05CSE 182 Dictionary Matching Q: Given k words (s i has length l i ), and a database of size n, find all matches to these words in the database string. How fast can this be done? 1:POTATO 2:POTASSIUM 3:TASTE P O T A S T P O T A T O dictionary database

Fa05CSE 182 Dict. Matching & string matching How fast can you do it, if you only had one word of length m? –Trivial algorithm O(nm) time –Pre-processing O(m), Search O(n) time. Dictionary matching –Trivial algorithm (l 1 +l 2 +l 3 …)n –Using a keyword tree, l p n (l p is the length of the longest pattern) –Aho-Corasick: O(n) after preprocessing O(l 1 +l 2..) We will consider the most general case

Fa05CSE 182 Direct Algorithm P O P O P O T A S T P O T A T O P O T A T O Observations: When we mismatch, we (should) know something about where the next match will be. When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

Fa05CSE 182 PO T A TO T UISM SET A The Trie Automaton Construct an automaton A from the dictionary –A[v,x] describes the transition from node v to a node w upon reading x. –A[u,’T’] = v, and A[u,’S’] = w –Special root node r –Some nodes are terminal, and labeled with the index of the dictionary word. 1:POTATO 2:POTASSIUM 3:TASTE w vu S r

Fa05CSE 182 An O(l p n) algorithm for keyword matching Start with the first position in the db, and the root node. If successful transition –Increment current pointer –Move to a new node –If terminal node “success” Else –Retract ‘current’ pointer –Increment ‘start’ pointer –Move to root & repeat

Fa05CSE 182 Illustration: PO T A TO T UISM SET A P O T A S T P O T A T O l c v S 1

Fa05CSE 182 Idea for improving the time P O T A S T P O T A T O Suppose we have partially matched pattern i (indicated by l, and c), but fail subsequently. If some other pattern j is to match –Then prefix(pattern j) = suffix [ first c-l characters of pattern(i)) l c 1:POTATO 2:POTASSIUM 3:TASTE P O T A S S I U M T A S T E Pattern i Pattern j

Fa05CSE 182 Improving speed of dictionary matching Every node v corresponds to a string s v that is a prefix of some pattern. Define F[v] to be the node u such that s u is the longest suffix of s v If we fail to match at v, we should jump to F[v], and commence matching from there Let lp[v] = |s u | PO T A TO T UISM SET A S

Fa05CSE 182 An O(n) alg. For keyword matching Start with the first position in the db, and the root node. If successful transition –Increment current pointer –Move to a new node –If terminal node “success” Else (if at root) –Increment ‘current’ pointer –Mv ‘start’ pointer –Move to root Else –Move ‘start’ pointer forward –Move to failure node

Fa05CSE 182 Illustration P O T A S T P O T A T O POTATO T UISM SET A lc v S 1

Fa05CSE 182 Time analysis In each step, either c is incremented, or l is incremented Neither pointer is ever decremented (lp[v] < c-l). l and c do not exceed n Total time <= 2n P O T A S T P O T A T O lc

Fa05CSE 182 Blast: Putting it all together Input: Query of length m, database of size n Select word-size, scoring matrix, gap penalties, E-value cutoff

Fa05CSE 182 Blast Steps 1.Generate an automaton of all query keywords. 2.Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits. 3.Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties. 4.For each alignment with score S, compute the bit-score, E- value, and the P-value. Sort according to increasing E-value until the cut-off is reached. 5.Output results.

Fa05CSE 182 Protein Sequence Analysis What can you do if BLAST does not return a hit? –Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity. A: Accept hits at higher P-value. –This increases the probability that the sequence similarity is a chance event. –How can we get around this paradox? –Reformulated Q: suppose two sequences B,C have the same level of sequence similarity to sequence A. If A& B are related in function, can we assume that A& C are? If not, how can we distinguish?