
Constructing Probability Matrices Redux

Suppose we live in a world with only 3 amino acids: Alanine, Leucine, and Serine. Furthermore, suppose:

Alanine ↔ Leucine with probability 0.2
Alanine ↔ Serine with probability 0.1
Leucine ↔ Serine with probability 0.3

We will assume that these probabilities are for changes that take place during one time unit.

We can summarize these observations using the language of probability theory. We will use the notation (A|L, t) to mean: "A certain position in our sequence initially contains Leucine and at time t it contains Alanine." Another way of saying this is, "After t time units the position contains Alanine, given that it initially contained Leucine." That is, the vertical bar means "given", so we read it as: Alanine given Leucine after t time units. We then write:

Pr(A|A, 1) = 0.7   Pr(A|L, 1) = 0.2   Pr(A|S, 1) = 0.1
Pr(L|A, 1) = 0.2   Pr(L|L, 1) = 0.5   Pr(L|S, 1) = 0.3
Pr(S|A, 1) = 0.1   Pr(S|L, 1) = 0.3   Pr(S|S, 1) = 0.6

The probabilities for each starting amino acid must sum to 1, which fixes the "no change" entries; for example, Pr(A|A, 1) = 1 - 0.2 - 0.1 = 0.7.

The above can be summarized in a table, called a matrix:

1\2   A     L     S
A     0.7   0.2   0.1
L     0.2   0.5   0.3
S     0.1   0.3   0.6

What about the probabilities two time units later? For example, what is the probability that a position that was originally Alanine is Alanine two time units later? This can happen in three ways:

A → A → A
A → L → A
A → S → A

In our original notation, we are saying:

(A|A, 2) = (A|A, 1) and (A|A, 1), or (L|A, 1) and (A|L, 1), or (S|A, 1) and (A|S, 1)

Thus, to compute the probability:

Pr(A|A, 2) = Pr(A|A,1)Pr(A|A,1) + Pr(L|A,1)Pr(A|L,1) + Pr(S|A,1)Pr(A|S,1)
           = 0.7*0.7 + 0.2*0.2 + 0.1*0.1
           = 0.49 + 0.04 + 0.01
           = 0.54

We will work out the 8 other second time unit transition probabilities in class.
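The two-step calculation can be checked with a few lines of Python. This is only a sketch: the dictionary name `P1` and the function `two_step` are made up for illustration, and `P1[start][end]` holds what the slides write as Pr(end|start, 1).

```python
# One-step transition probabilities from the slides.
# P1[start][end] is the probability that a position holding `start`
# holds `end` one time unit later, i.e. Pr(end|start, 1).
P1 = {
    "A": {"A": 0.7, "L": 0.2, "S": 0.1},
    "L": {"A": 0.2, "L": 0.5, "S": 0.3},
    "S": {"A": 0.1, "L": 0.3, "S": 0.6},
}

def two_step(start, end):
    """Sum over every intermediate amino acid: start -> mid -> end."""
    return sum(P1[start][mid] * P1[mid][end] for mid in P1)

print(round(two_step("A", "A"), 2))  # 0.54
```

Calling `two_step` for the other eight (start, end) pairs gives the remaining two-time-unit probabilities.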

      A      L      S
A     0.54   0.27   0.19
L     0.27   0.38   0.35
S     0.19   0.35   0.46

After we compute all 9 of the probabilities for the transitions after 2 time units, we have the table above. Each of its nine cells required three multiplications and two additions to compute. That is, there were 27 multiplications and 18 additions required to produce the table.

The Matrix Connection

Consider the matrix, M, that we constructed earlier when we made the table of probabilities. In matrix algebra, the product of two matrices is defined as follows: to compute the product of two matrices A and B, the value placed in row i and column j is obtained by multiplying each value in row i of A by its corresponding element in column j of B and summing the results. A translation, by way of an illustration, follows.

Let's suppose we want to square M, i.e. multiply M by itself. To compute the value of the product matrix M^2 in row 2, column 3, we multiply each element in row 2 of the first matrix by its corresponding element in column 3 of the second matrix and sum the results:

0.2*0.1 + 0.5*0.3 + 0.3*0.6 = 0.02 + 0.15 + 0.18 = 0.35

But this is exactly how we calculated Pr(S|L, 2)! This agreement between M^2 and the table of transition probabilities holds for each position. It appears that matrix multiplication is exactly what we need to generate the table of transition probabilities after t time units.
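As a rough check of the row-by-column rule, the sketch below squares M in plain Python. The helper name `mat_mul` is an illustrative choice, not something from the slides; rows and columns are ordered A, L, S.

```python
def mat_mul(A, B):
    """Entry (i, j) is row i of A times column j of B, summed."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# One-time-unit matrix from the slides (rows and columns ordered A, L, S).
M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

M2 = mat_mul(M, M)
for row in M2:
    print([round(x, 2) for x in row])
# [0.54, 0.27, 0.19]
# [0.27, 0.38, 0.35]
# [0.19, 0.35, 0.46]
```

Row 2, column 3 (index `M2[1][2]` in Python's zero-based counting) is the 0.35 worked out by hand.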

Thus, if we use the rules of matrix multiplication, the table of transition probabilities after t time units is M multiplied by itself t times, i.e. M^t. Since the rules of matrix multiplication and those for computing the transition probabilities are essentially the same, we have a marriage made by the divine. So let's use them to our advantage.

Recall the PAM1 probability matrix from last period. That matrix is not symmetric (symmetric meaning that the values mirrored across the diagonal are equal), but the rules for transition probabilities and matrix multiplication are the same. Therefore we can apply our previous observations.

This matrix has 20 rows and 20 columns. Multiplying it by itself requires 20 multiplications and 19 additions to compute the value in each of its 400 positions. That is:

400*(20 + 19) = 400*39 = 15,600 operations

That just gets us to the PAM2 matrix. Most applications use the PAM250 matrix, which by repeated multiplication means roughly:

250*15,600 = 3,900,000 operations

This is a hefty load even with a computer, to say nothing of accuracy lost to computer word size limitations. Fortunately, matrix algebra has ways of cutting way down on the number of operations. To learn more, see your local friendly neighborhood mathematician.
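One well-known way to cut down the work is repeated squaring, which may or may not be the shortcut the slides have in mind. The sketch below applies it to the 3-amino-acid matrix M from earlier, since the PAM1 values are not reproduced here. The names `mat_mul` and `mat_pow` are illustrative, and the final line reuses the slides' figure of 400*(20 + 19) operations per 20-by-20 product.

```python
def mat_mul(A, B):
    """Plain row-by-column product of two square matrices (lists of lists)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, t):
    """Return (M**t, number of matrix products used) by repeated squaring."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    base, products = M, 0
    while t > 0:
        if t & 1:                  # current binary digit of t is 1
            result = mat_mul(result, base)
            products += 1
        t >>= 1
        if t:                      # more digits remain, so square the base
            base = mat_mul(base, base)
            products += 1
    return result, products

M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

M250, products = mat_pow(M, 250)
print(products)                    # 13
print(products * 400 * (20 + 19))  # 202800
```

In this sketch the 250th power takes 13 matrix products (one of them with the identity matrix) instead of roughly 250, so a 20-by-20 matrix would need on the order of 200,000 operations rather than 3,900,000.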

Back to the Topic of Sequence Alignments

We now leave the realm of exact alignments and adopt some heuristics. (A heuristic is an exploratory problem-solving method based on experience and relying on past results for improvement of the technique. NOTE: it is not an algorithm that has been proven to be correct.)

BLAST - Basic Local Alignment Search Tool

Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman. Originally for ungapped local comparison of sequences, it has since been expanded to handle comparisons of gapped sequences. There have been several extensions of the technique and improvements to the basic tool throughout the 14 years of its life thus far.

Needleman-Wunsch, SemiGlobal Alignment, and Smith-Waterman assume we know which two sequences we need to compare. BLAST is designed to do a database search for possible matching sequences:

1. There is no known starting point to begin the matches.
2. There is not a well-established format for the information stored in the data.
3. It is like searching for a file in a cluttered office (see Professor Leinbach's or Professor James' offices for reference).

The amazing thing is that BLAST has been so successful!

Consider this sequence:

gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa

We want to search for possible matches to gain clues to its identity.

1. Place a sliding window 11 or 12 nucleotides long over the sequence.
2. Extract the window subsequence and compress it to 3 bytes. Code a as 00, c as 01, g as 10, and t as 11 (all in binary). Thus, the 11 characters take up 22 bits: 3 bytes with two bits unused.
3. Using a Finite State Automaton, eliminate subsequences that occur with a very high frequency in the database.
4. If the subsequence survives, i.e. is determined to be relatively rare, use a hash table to locate sequences in the database that contain that 11 nucleotide subsequence.
5. Extend the match in both directions, scoring the extension until the score drops below a predetermined threshold. If it survives for the length of the original sequence, report the result.
6. Slide the window down one character and repeat the steps.

The above is only an approximate description of the BLAST algorithm.
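Steps 1, 2, and 4 above can be sketched in Python as follows. This is an illustration under stated assumptions, not BLAST's actual code: the names `pack`, `windows`, and `build_index` are made up, the "database" is a single short subject fragment (taken from the BLAST hit reported in these slides), and the frequency filter of step 3 and the extension of step 5 are left out.

```python
from collections import defaultdict

# 2-bit codes from the slides: a=00, c=01, g=10, t=11.
CODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
WIDTH = 11  # window length in nucleotides

def pack(word):
    """Pack a nucleotide word into an integer, 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value

def windows(seq, width=WIDTH):
    """Yield (start, word) for every window, sliding one base at a time."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]

def build_index(subject):
    """Hash table: packed word -> list of start positions in the subject."""
    index = defaultdict(list)
    for pos, word in windows(subject):
        index[pack(word)].append(pos)
    return index

query = "gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa"
# Toy one-sequence "database" (hypothetical stand-in for a real one).
subject = "ggagtgtcaacatttatggctgaaatgctggaaacagc"

index = build_index(subject)
hits = [(q_pos, s_pos)
        for q_pos, word in windows(query)
        for s_pos in index.get(pack(word), [])]

# 11 bases need 22 bits, so the packed word fits in 3 bytes.
print(pack(query[:WIDTH]).to_bytes(3, "big").hex())
print(hits)  # zero-based (query position, subject position) seed hits
```

Each hit is an exact 11-nucleotide match that a real search would then try to extend in both directions.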

Here is one of the BLAST results for our sequence:

Score = 44.1 bits (22), Expect =
Identities = 34/38 (89%), Gaps = 0/38 (0%)
Strand = Plus/Plus

Query  12    GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC  49
             ||||| || ||||||||| | |||||||||||||||||
Sbjct  5030  GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC  5067

With a report of:

gi| |gb|DQ | Physcomitrella patens DNA mismatch repair protein MSH2 gene

The result, along with 68 others, was reported in about 30 seconds of searching on a Friday afternoon before a holiday. The search involved over 3 million subject sequences accounting for over 16 billion characters!
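The "Identities" line of this report can be reproduced by comparing the two aligned segments position by position. A minimal sketch (variable names are illustrative):

```python
query = "GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC"
sbjct = "GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC"

# The alignment has no gaps, so position i in one lines up with i in the other.
matches = sum(q == s for q, s in zip(query, sbjct))
midline = "".join("|" if q == s else " " for q, s in zip(query, sbjct))

print(query)
print(midline)
print(sbjct)
print(f"Identities = {matches}/{len(query)} ({matches / len(query):.0%})")
# Identities = 34/38 (89%)
```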

For protein sequences the window is 4 amino acids long:

1 amino acid = 3 nucleotides
4 amino acids = 12 nucleotides = 24 bits = 3 bytes coded (at 2 bits per nucleotide, as above)

From Krane and Raymer, page 49: