Introduction to bioinformatics lecture 8

Introduction to bioinformatics lecture 8 Deriving amino acid exchange matrices (II) and Multiple sequence alignment (I) In this lecture we will start with the first part of the “multiple sequence alignment chapter”: amino acid substitution matrices.

Summary: Dayhoff’s PAM-matrices
> Derived from global alignments of closely related sequences.
> Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
> The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples. Extensions of Dayhoff’s methodology:
> Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).
> Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.
> Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.

The BLOSUM matrices (BLOcks SUbstitution Matrix) The BLOSUM series of matrices were created by Steve Henikoff and colleagues (PNAS 89:10915). Derived from local, un-gapped alignments of distantly related sequences. All matrices are directly calculated; no extrapolations are used. Again: the observed frequency of each pair is compared to the expected frequency (which is essentially the product of the frequencies of each residue in the dataset). Then: Log-odds matrix.
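The observed-versus-expected comparison can be sketched in a few lines of Python. This is a minimal illustration, not the Henikoffs' actual pipeline: the three-letter alphabet, the pair counts and the function name `log_odds_matrix` are made up for the example, and the half-bit scaling (2 x log2) is assumed because it is the unit in which BLOSUM62 is usually distributed.

```python
import math
from collections import defaultdict

def log_odds_matrix(pair_counts, scale=2.0):
    """Turn unordered pair counts into rounded log-odds scores.

    pair_counts maps a residue pair (a, b) with a <= b to the number of
    times that pair was observed in aligned columns (hypothetical format).
    """
    total = sum(pair_counts.values())
    # Observed pair frequencies q_ab.
    q = {pair: n / total for pair, n in pair_counts.items()}
    # Background residue frequencies p_a, derived from the same pairs.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency of an unordered pair under independence.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores

# Made-up counts: identities are common, A/W and S/W exchanges are rare.
counts = {("A", "A"): 40, ("A", "S"): 10, ("S", "S"): 30,
          ("A", "W"): 2, ("S", "W"): 2, ("W", "W"): 16}
scores = log_odds_matrix(counts)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Pairs seen more often than expected by chance get positive scores, pairs seen less often get negative scores; a rare residue such as W in this toy data earns a higher identity score than a common one.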

The Blocks Database The Blocks Database contains multiple alignments of conserved regions in protein families. Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.

The Blocks Database Gapless alignment blocks

The BLOSUM series BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90. The number after the matrix name (e.g. BLOSUM62) is the percent identity threshold applied to the blocks (in the BLOCKS database) used to construct the matrix: within each block, sequences that are at least 62% identical are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than 62% identical. No extrapolations are made in going to higher evolutionary distances. High number: closely related sequences. Low number: distant sequences. BLOSUM62 is the most popular: best for general alignment.

The log-odds matrix for BLOSUM62

PAM versus BLOSUM
PAM:
> Based on an explicit evolutionary model
> Derived from small, closely related proteins with ~15% divergence
> Higher PAM numbers to detect more remote sequence similarities
> Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
> Based on empirical frequencies
> Uses much larger, more diverse set of protein sequences (30-90% ID)
> Lower BLOSUM numbers to detect more remote sequence similarities
> Errors in BLOSUM arise from errors in alignment

Comparing exchange matrices To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair. … As two protein sequences diverge over time, information about the evolutionary process at work is lost (e.g. back mutations). Therefore, matrices with larger entropy values are more sensitive to less divergent sequences, while matrices with smaller entropy values are more sensitive to distantly related sequences.
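As a rough sketch of how H behaves, the Python below uses the usual definition of relative entropy for a log-odds matrix, H = sum over pairs of q_ab * log2(q_ab / e_ab), in bits. The two-letter alphabet and both sets of pair frequencies are invented for illustration; `relative_entropy` is a hypothetical helper name.

```python
import math

def relative_entropy(q, p):
    """Relative entropy H in bits per aligned residue pair.

    q maps unordered pairs (a, b), a <= b, to target frequencies;
    p maps residues to background frequencies.
    """
    h = 0.0
    for (a, b), q_ab in q.items():
        # Expected frequency of the unordered pair under the random model.
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        h += q_ab * math.log2(q_ab / e_ab)
    return h

p = {"A": 0.5, "S": 0.5}
# Made-up target frequencies: few exchanges (close) vs many (distant).
close = {("A", "A"): 0.45, ("A", "S"): 0.10, ("S", "S"): 0.45}
distant = {("A", "A"): 0.30, ("A", "S"): 0.40, ("S", "S"): 0.30}
random_model = {("A", "A"): 0.25, ("A", "S"): 0.50, ("S", "S"): 0.25}

h_close = relative_entropy(close, p)
h_distant = relative_entropy(distant, p)
h_random = relative_entropy(random_model, p)
print(round(h_close, 3), round(h_distant, 3), round(h_random, 3))
```

The matrix built from the less divergent data carries more information per position, and H falls to zero when the target frequencies are indistinguishable from the random model.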

Specialized matrices Claverie (J.Mol.Biol 234:1140) developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences. These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs which blindly translate DNA (e.g. BLASTX, TBLASTN).

Specialized matrices Rather than starting from alignments generated by sequence comparison, Risler et al (1988) and later Overington et al (1992) only considered proteins for which an experimentally determined three dimensional structure was available. They then aligned similar proteins on the basis of their structure rather than sequence and used the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Risler or Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures (particularly in the Risler et al study) limited the reliability of their statistics. Overington et al (1992) developed further matrices that consider the local environment of the amino acids.

A note on reliability All these matrices are designed using standard evolutionary models. It is important to understand that evolution is not the same for all proteins, nor even for all regions of the same protein. No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids. Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995) In this work, a set of amino acid sequences matched by superposition of known protein tertiary topologies is used to test the alignment accuracy of the different method/matrix/penalty combinations. The percentage identity resulted from application of the N/W (Needleman and Wunsch, 1970) alignment algorithm and the gonnet p residue substitution matrix with gap penalties optimized over all the data.

Summary If an ORF exists, then align at the protein level. Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score. The evolutionary and random models depend on the generalized data used to derive them. This is not an ideal solution. Apart from the PAM and BLOSUM series, a great number of further matrices have been developed. Matrices have been made based on DNA, protein structure, information content, etc. For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well. Remember that gap penalties are always a problem; unlike the matrices themselves, there is no formal way to calculate their values -- you can follow recommended settings, but these are based on trial and error and not on a formal framework.

Biological definitions for related sequences Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues. Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution. Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event. Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

So this means … Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Multiple sequence alignment Sequences can be conserved across species and perform similar or identical functions. > hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved; > identification of regions or domains that are critical to functionality. Sequences can be mutated or rearranged to perform an altered function. > which changes in the sequences have caused a change in the functionality. Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself How do we get a multiple alignment (three or more sequences)? What is our aim? Do we go for maximum accuracy, least computational time, or the best compromise? What do we want to achieve each time?

Sequence-sequence alignment Example of two sequences which are aligned by the dynamic programming algorithm of Needleman-Wunsch. As you already know from earlier lectures, each sequence is placed along the sides of the matrix. Each element in the matrix represents two residues of the sequence being aligned at that position. To calculate the score in each position (i,j), one looks at the alignment that has already been made up to that point and finds the best way to continue. Having gone through the entire matrix in this way, one can go back and trace which way through the matrix gives the best alignment.
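The fill-then-trace-back procedure can be written out as a short Python sketch. To keep it small it assumes a simple match/mismatch score and a linear gap penalty instead of an amino acid exchange matrix with affine gaps; the parameter values and the function name `needleman_wunsch` are illustrative choices, not taken from the lecture.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Forward pass: extend the best alignment made up to each cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,           # gap in b
                              score[i][j - 1] + gap)           # gap in a

    # Traceback: walk from the bottom-right corner to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

result = needleman_wunsch("ACGT", "AGT")
print(result)  # (1, 'ACGT', 'A-GT')
```

Swapping `sub` for a lookup in an exchange matrix such as BLOSUM62 turns this into the protein version; the recurrence and the traceback stay the same.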

Multiple alignment methods Multi-dimensional dynamic programming > extension of pairwise sequence alignment. Progressive alignment > incorporates phylogenetic information to guide the alignment process. Iterative alignment > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Simultaneous multiple alignment Multi-dimensional dynamic programming. The combinatorial explosion: 2 sequences of length n require n^2 comparisons. The number of comparisons increases exponentially with the number of sequences, i.e. n^N, where n is the length of the sequences and N is the number of sequences. Impractical for even a small number of short sequences.
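To get a feel for the numbers, the Python sketch below evaluates n^N for sequences of length 100. It also multiplies by 2^N - 1, the number of neighbouring cells each cell of an N-dimensional table depends on (every sequence either advances or takes a gap, excluding the all-gap case); `full_dp_cost` is a hypothetical helper name.

```python
def full_dp_cost(n, num_seqs):
    """Return (cells, predecessor lookups) for a full N-dimensional DP table."""
    cells = n ** num_seqs
    # Each cell is computed from 2^N - 1 predecessor cells.
    predecessors = 2 ** num_seqs - 1
    return cells, cells * predecessors

for k in (2, 3, 4, 5):
    cells, lookups = full_dp_cost(100, k)
    print(f"N={k}: {cells:.1e} cells, {lookups:.1e} predecessor lookups")
```

Each additional sequence multiplies the table size by n, which is why exhaustive dynamic programming is abandoned after a handful of sequences.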

Multi-dimensional dynamic programming (Murata et al., 1985) [Figure: three-dimensional dynamic programming matrix with axes Sequence 1, Sequence 2 and Sequence 3]

The MSA approach MSA (Lipman et al., 1989, PNAS 86, 4412) MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube. Calculate all pair-wise alignment scores. Use the scores to predict a tree. Calculate pair weights based on the tree (lower bound). Produce a heuristic alignment based on the tree. Calculate the maximum weight for each sequence pair (upper bound). Determine the spatial positions that must be calculated to obtain the optimal alignment. Perform the optimal alignment. Report the weight found compared to the maximum weight previously found (measure of divergence). Extremely slow and memory intensive. Max 8-9 sequences of ~250 residues.

The DCA approach DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73) Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences. This procedure is re-iterated until the sequences are sufficiently short. Optimal alignment by MSA. Finally, the resulting short alignments are concatenated.

So in effect … [Figure: axes labelled Sequence 1, Sequence 2 and Sequence 3]

The progressive alignment method Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related. Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree. But before going into details, some notes on multiple alignment profiles … Progressive methods do not optimize a score function!

How to represent a block of sequences? Historically: consensus sequence – single sequence that best represents the amino acids observed at each alignment position. Modern methods: Alignment profile – representation that retains the information about frequencies of amino acids observed at each alignment position.
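The difference between the two representations is easy to see in code. The Python sketch below builds both from the same un-gapped block; the five toy sequences and the helper names `column_frequencies` and `consensus` are invented for the example.

```python
from collections import Counter

def column_frequencies(block):
    """Alignment profile: per-column residue frequencies of an un-gapped block."""
    n = len(block)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*block)]

def consensus(block):
    """Consensus sequence: the most frequent residue in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*block))

# Toy block of five aligned sequences (equal length, no gaps).
block = ["ACDA", "ACDL", "ACEV", "GCDV", "ACDV"]
profile = column_frequencies(block)
print(consensus(block))  # ACDV
print(profile[3])
```

The consensus keeps only V for the last column, while the profile remembers that A and L were each seen in a fifth of the sequences.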

Multiple alignment profiles (Gribskov et al. 1987) Gribskov created a probe: group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins). Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe). The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

Multiple alignment profiles [Figure: profile along alignment position i, with core regions flanking a gapped region; each position holds the residue frequencies fA, fC, fD, …, fW, fY together with gap penalties (gapo, gapx).] NB. In Gribskov’s approach, gap-penalties are position-dependent. This is to say that where gaps appear in some of the probe sequences, the insertion/deletion-penalty for those positions is lower than elsewhere.

Profile building Example: each amino acid (A, C, D, …, W, Y) is represented as a frequency at each position i, and gap penalties as weights. [Figure: example profile with residue frequencies per position and position-dependent gap penalties of 1.0, 0.5 and 1.0.]

Profile-sequence alignment [Figure: a profile with rows A, C, D, …, V, W, Y aligned against a sequence]

Sequence to profile alignment
Profile position: A 0.4, L 0.2, V 0.4
Score of amino acid L in a sequence that is aligned against this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
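The weighted sum above can be checked with a few lines of Python. The exchange values used here are the BLOSUM62 entries for these pairs as commonly distributed (verify against your own matrix file); the dictionary `S`, the lookup `s` and `profile_column_score` are hypothetical names for this sketch.

```python
# Assumed BLOSUM62 entries for the pairs needed in this example.
S = {("A", "L"): -1, ("L", "L"): 4, ("L", "V"): 1}

def s(x, y):
    """Symmetric lookup in the (partial) exchange matrix."""
    return S[tuple(sorted((x, y)))]

def profile_column_score(residue, column):
    """Frequency-weighted exchange score of one residue against one profile column."""
    return sum(freq * s(residue, aa) for aa, freq in column.items())

column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = profile_column_score("L", column)
print(round(score, 2))  # 0.8
```

With these values the score is 0.4 * (-1) + 0.2 * 4 + 0.4 * 1 = 0.8.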

Profile-profile alignment [Figure: two profiles, each with rows A, C, D, …, V, W, Y, aligned against each other]

Profile to profile alignment
Profile 1 position: A 0.4, L 0.2, V 0.4
Profile 2 position: G 0.75, S 0.25
Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for amino acid pair (x,y).
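The same calculation for two profile columns, as a Python sketch. The exchange values are the BLOSUM62 entries for these pairs as commonly distributed (check them against your own matrix file); `S`, `s` and `column_pair_score` are hypothetical names.

```python
from itertools import product

# Assumed BLOSUM62 entries for the six pairs in this example.
S = {("A", "G"): 0, ("G", "L"): -4, ("G", "V"): -3,
     ("A", "S"): 1, ("L", "S"): -2, ("S", "V"): -2}

def s(x, y):
    """Symmetric lookup in the (partial) exchange matrix."""
    return S[tuple(sorted((x, y)))]

def column_pair_score(col1, col2):
    """Sum over all residue pairs of freq1 * freq2 * s(x, y)."""
    return sum(f1 * f2 * s(x, y)
               for (x, f1), (y, f2) in product(col1.items(), col2.items()))

col1 = {"A": 0.4, "L": 0.2, "V": 0.4}
col2 = {"G": 0.75, "S": 0.25}
score = column_pair_score(col1, col2)
print(round(score, 2))  # -1.7
```

The negative result reflects that small residues (G, S) rarely replace the hydrophobic A/L/V mix in this toy column, so a good alignment would avoid pairing these two positions.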

So, for scoring profiles … Think of sequence-sequence alignment. Same principles but more information for each position. Reminder: the sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j) (i.e. the substitution matrix values at each alignment position, minus penalties if applicable). Profile alignment scores are exactly the same, but the positional scores are more complex.