Bioinformatics in Biosophy

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Heuristic alignment algorithms and cost matrices
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Sequence Alignment III CIS 667 February 10, 2004.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Scoring matrices Identity PAM BLOSUM.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Sequence alignment, E-value & Extreme value distribution
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Pairwise & Multiple sequence alignments
An Introduction to Bioinformatics
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Pairwise Sequence Analysis-III
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM
BIOINFORMATICS Sequence Comparison
Alignment IV BLOSUM Matrices
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Bioinformatics in Biosophy 1 PAM & Blosum Park, Jong Hwa MRC-DUNN Hills Road Cambridge CB2 2XY England Bioinformatics in Biosophy Next: 02/06/2001

Family business and Matrix 1. Proteins can always be grouped together. 2. As they can be grouped (clustered), there must be a way to represent a whole group of proteins. 3. How are we going to represent the groups of proteins to represent some important information effectively? 4. In Bioinformatics, multiple sequence alignment is one of the most common way of representing the whole groups. (Why?) 5. Once alignments are made, there are so many things to do, to know the protein families better. (Why?) 6. Clustering proteins in various other ways.

Some Blocks in Alignments

From sequence Blocks  PAM matrix We use PAM units to measure the amount of evolutionary distance between two amino acid sequences (at least 85% identical) Two sequences S1 and S2 are said to be one PAM unit diverged if a series of accepted point mutations (and no insertions of deletions) has converted S1 to S2 with an average of one accepted point-mutation event per 100 amino acids in the two sequences. The term ``accepted'' here means a mutation that was incorporated into the protein and passed to its progeny. Therefore, either the mutation did not change the function of the protein or the change in the protein was beneficial to the organism.

PAM Note that two sequences which are one PAM unit diverged do not necessarily differ in one percent, as often mistakenly thought, because a single position may undergo more than a single mutation. The difference between the two notions grows as the number of units does.

2 problems with PAM There are two main problems with the notion of the PAM units: 1. First, practically all the sequences we can obtain today are extracted from extant organisms. We almost do not know any protein sequences where one is actually derived from the other. The lack of ancestral protein sequences is handled by assuming that amino acid mutations are reversible and equally likely in either direction. This assumption, together with the additivity property of the PAM units derived from its definition, imply that given two amino acid sequences: Si and Sj whose mutual ancestor is Sij we have: d(Si,Sj) = d(Si,Sij) + d(Sij,Sj) when d(i,j) is the PAM distance between amino acid sequences i and j.

The InDel problem 2. The second problem, which is more difficult to overcome, is that we disregard here insertions and deletions which may occur during evolution, hence we can not be sure of the correct correspondence between sequence positions. In order to know the exact correspondence one has to be able to identify the true historical gaps, or, at least to identify large intervals along the two sequences where the correspondence is correct. This can not always be done with certainty, especially when the two sequences are distantly diverged.

From seq. Blocks to BLOSUM The first stage of building a matrix is eliminating sequences, which are identical in more than x% of their amino acid sequence. This is done to avoid bias of the result in favor of a certain protein. The elimination is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster. The matrix built from blocks with no more the x% of similarity is called BLOSUM-x (e.g. the matrix built using sequences with no more then 50% similarity is called BLOSUM-50.) http://www.math.tau.ac.il/~rshamir/algmb/scribe/html/lec03/node10.html

Building Blocks Matrix (Blosum) The second stage is counting the pairs of amino acids in each column of the multiple alignment. For example in a column with the acids AABACA (as in the first column in the block in figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC, and one BC. The probability qi,j for a pair of amino acids in the same column to be Ai and Aj is calculated, as well as the probability pi of a certain amino acid to be Ai. In the third stage the log odd ratio is calculated. As final result we consider the rounded 2si,j, this value is stored in the (i,j) entry of the BLOSUM-x matrix. We take the log value of the probability in order to allow computing the total score of all substitutions using summation rather than multiplication.

Making Blosum In contrast to the PAM matrices, more sequences are examined in the process of computing the BLOSUM matrix. Moreover, the sequences are of specific nature of resemblance, and therefore the two sets of matrices differ. Comparing the efficiency of two matrices is done by calculating the ratio between the number of pairs of similar sequences discovered by a certain matrix but not discovered by another one and the number of pairs missed by the first but found by the other. According to this comparison BLOSUM-62 is found to be better than other BLOSUM-x matrices as well as than PAM-x matrices.

Making New FASTA exchange matrix Get some structural or sequence alignments Find non-gap blocks in the alignments Count the occurrences of mutations for single amino acid Calculate the log odd ratio of expected frequency vs observed frequency Produce a FASTA format Matrix Use your own matrix to run a FASTA program (Fasta -s option). fasta -s my.mat query.spfa lib.spfa

Making your own Matrix 2 http://bioinfo.mbb.yale.edu/align/scop/tables/allStrAlignments.txt Let’s make a Multiple Residue Column (MRC) Matrix. Get the alignment file: bioinfo-238/teaching mat/practical_ resource/ Parse the alignments to get dipeptide entries and occurrences Make 400x400 matrix with occurrences(frequency) Calculate the log odd ratio of all the dipeptides Display the rankings of similar non-identical dipeptides.

How to improve exchange Matrix? Use sequence alignment from the best people or structure/sequence align algorithm Use very large set of alignments Apply some protein family specific matrix (for example, transmembrane proteins) Use multiple residue columns to make the Matrix

PAM to quantify homology automatically Problem: we have two sequences SEQVENCE STRVCTVT How similar or Homologous are they? A simple way: Compositional Identity Another Simple way: Sequence Identity Advanced way: Similarity Score Most advanced: Statistical Significance

Simple Sequence Similarity A simple way: Compositional Identity Compositional ID: SEQVENCE <> STRVCTVT Shows how similar the sequences without giving order information Another Simple way: Sequence Identity Simple to understand, but far from accurate To have more meaningful similarity, we can use PAM, BLOSUM, your own MATRICES,,,

Similarity Scores for a pair of sequences Advanced way: Similarity Score Align the two sequences Calculate the score based on Matrices used. Most advanced: Statistical Significance Produces some random sequences for one of the sequences and compare them to the other sequences many times (~100) and check what is the control level alignment score. Onc can use reversed sequences instead of random sequences.

How to count similarity? If two sequences are very similar, we can just count the identical and similar sequences and calculate score ABCDEKKKMLOVE ** *.**:.: * ABEDCKKRNLIKE However, biological sequences have Insertions and Deletions. ABCDEKKKABCDEMLOVE ABEDCKKR-----NLOVE

How do we know the correct Insertions and Deletions positions? Normally it is not easy when sequences are very distant. Below twilight zone => Nearly impossible to have correct alignment. What is the ‘correct’ alignment?  This is a non-sense for diverged sequences Around and below 40% sequence identity, most sequence alignments are WRONG. The wrong regions are mostly Insertion and Deletion locations (gaps) The best we can do is to find where the gaps are and give up perfect Alignment. Putting gaps correctly in any sequence alignment is NOT solved completely.

Finding gaps as best as we can Usually, the best method is using NNN. NNN is time consuming. So, Needleman & Wunsch made a computer algorithm. Simple Dynamic Programming (J. Mol. Biol. 48, 443-453 [1970])

Ref: http://www.finchcms.edu/biochem/Walters/nw.html Let's do a simple example, adapted from Needleman & Wunsch's original paper. First, place the sequences on a matrix of cells. At each cell where the amino acids are identical, enter a value of 1. All the other cells are implicitly given a score of 0 (zero). A D C N S R Q C L C R P M A 1 S 1 C 1 1 1 N 1 R 1 1 K D 1 P 1 Ref: http://www.finchcms.edu/biochem/Walters/nw.html

Now, starting at the C-terminal ends of the sequences and working toward the origins, add to each cell the maximum value from among all the cells downstream from it (not including cells directly below or directly to the right. Let's do a few cycles of this and see how the matrix develops. Start with the last column and last row, adding in the zeros. A D C N S R Q C L C R P M A 1 0 S 1 0 C 1 1 1 0 N 1 0 R 1 1 0 K 0 D 1 0 P 0 0 0 0 0 0 0 0 0 0 0 1 0

A D C N S R Q C L C R P M A 8 7 6 6 5 4 4 3 3 2 1 0 0 S 7 7 6 6 6 4 4 3 3 2 1 0 0 C 6 6 7 6 5 4 4 4 3 3 1 0 0 S 6 6 6 5 6 4 4 3 3 2 1 0 0 N 5 5 5 6 5 4 4 3 3 2 1 0 0 R 4 4 4 4 4 5 4 3 3 2 2 0 0 C 3 3 4 3 3 3 3 4 3 3 1 0 0 K 3 3 3 3 3 3 3 3 3 2 1 0 0 C 2 2 3 2 2 2 2 3 2 3 1 0 0 R 2 1 1 1 1 2 1 1 1 1 2 0 0 D 1 2 1 1 1 1 1 1 1 1 1 0 0 P 0 0 0 0 0 0 0 0 0 0 0 1 0

ADCNS-RQCLCR-PM | | | | | || | ASC-SNR-CKCRDP- ADC-NSRQCLCR-PM http://www.perlmonth.com/columns/modules/Perl/Align/NW/NW.htm Additional ref: Smith, T.F. and Waterman, M.S. 1981. ``Identification of common molecular subsequences'' Journal of Molecular Biology. 147: 195-197

FASTA FASTA is a program for rapid alignment of pairs of protein and DNA sequences. Rather than comparing individual residues in the two sequences, FASTA instead looks for matching sequence patterns or words, called k-tuples, and then attempts to build a local alignment based upon these word matches.

FASTA In the initial search for regions of similarity, FASTA uses a computer method known as hash coding. In this method, a lookup table showing the positions of each sequence word of length k, called a k-tuple, is constructed for each sequence. The relative positions of each word in the two sequences is then calculated by subtracting the position in the first sequence from that in the second. Words that have the same offset position reveal a region of alignment between the two sequences. The number of comparisons increases linearly in proportion to average sequence length. In contrast, the time taken in dot matrix and dynamic programming methods increases as the square of the average sequence length. The k-tuple length is user-defined and is usually 1 or 2 for protein sequences (i.e. either the positions of each of the individual 20 amino acids or the positions of each of the 400 possible dipeptides are located).

FASTA position 1 2 3 4 5 6 7 8 9 10 11 protein 1 n c s p t a . . . . . protein 2 . . . . . a c s p r k position in offset amino acid protein A protein B pos A - posB ----------------------------------------------------- a 6 6 0 c 2 7 -5 k - 11 n 1 - p 4 9 -5 r - 10 s 3 8 -5 t 5 - Note the common offset for the 3 amino acids c,s and p A possible alignment is thus quickly found - protein 1 n c s p t a | | | protein 2 a c s p r k