Introduction to bioinformatics 2007 Lecture 9

Multiple Sequence Alignment (I)

Biological definitions for related sequences: Homologues are similar sequences that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues. Orthologues are similar sequences in two different organisms that have arisen due to a speciation event; they typically retain identical or similar functionality throughout evolution. Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event. Xenologues are similar sequences that have arisen not through vertical transfer, i.e. (normal) heredity, but through horizontal transfer events (symbiosis, viruses, etc.).

So this means … Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Information content of a multiple alignment: Sequences can be conserved across species and perform similar or identical functions. A multiple alignment of such sequences holds information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved, and so helps to identify regions or domains that are critical to functionality. Sequences can also be mutated or rearranged to perform an altered function; the alignment can then indicate which changes in the sequences have caused a change in functionality.

Multiple alignment idea Take three or more related sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment. Ideally, the sequences are orthologous, but often include paralogues.

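Scoring a multiple alignment: You can score a multiple alignment by taking all the pairs of aligned sequences and adding up the pairwise scores $S_{a,b}$, where $S_{a,b} = \sum_{\text{positions}} M(aa_i, aa_j) - \text{gap penalties}$ is the alignment score of the sequence pair $a$, $b$ (substitution matrix values at each aligned position minus gap penalties). The total, $\sum_{a<b} S_{a,b}$, is referred to as the Sum-of-Pairs score.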
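The Sum-of-Pairs idea can be illustrated with a minimal Python sketch. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) and the three-row toy alignment are assumptions made for this example, not values from the lecture; a real method would use an exchange matrix such as PAM250 or BLOSUM62 and possibly affine gap penalties.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one aligned position of two rows (toy values, assumed)."""
    if x == "-" and y == "-":
        return 0  # a column of two gaps carries no information for this pair
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Add up the pairwise scores S_{a,b} over all pairs of aligned rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for a, b in combinations(alignment, 2):
        total += sum(pair_score(x, y) for x, y in zip(a, b))
    return total


msa = ["ACG-T", "ACGAT", "A-GAT"]  # hypothetical alignment
print(sum_of_pairs(msa))  # 3
```

Here the three pairwise scores are 2, -1 and 2, which add up to 3. Note that the number of pairs grows quadratically with the number of sequences.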

Multiple sequence alignment: why? It is the most important means to assess the relatedness of a set of sequences. It gives information about the structure/function of a query sequence (conservation patterns). It is used to construct a phylogenetic tree and to put together a set of sequenced fragments (fragment assembly). Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction).

What to ask yourself How do we get a multiple alignment? (three or more sequences) What is our aim? Do we go for max accuracy? Least computational time? Or the best compromise? What do we want to achieve each time?

Multiple alignment methods: Multi-dimensional dynamic programming > extension of pairwise sequence alignment. Progressive alignment > incorporates phylogenetic information to guide the alignment process. Iterative alignment > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Exhaustive & heuristic algorithms. Exhaustive approaches: examine all possible aligned positions simultaneously; look for the optimal solution by (multi-dimensional) DP; very (very) slow. Heuristic approaches: a strategy to find a near-optimal solution (by using rules of thumb); shortcuts are taken by reducing the search space according to certain criteria; much faster.

Simultaneous multiple alignment: multi-dimensional dynamic programming suffers from a combinatorial explosion. DP using two sequences of length $n$ takes $n^2$ comparisons. The number of comparisons increases exponentially with the number of sequences, i.e. $n^N$, where $n$ is the length of the sequences and $N$ is the number of sequences. This is impractical even for small numbers of short sequences.
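To get a feeling for the growth, the following small Python sketch prints $n^N$ for sequences of length 100. The function name and the chosen length are illustrative assumptions; the count ignores constant factors, and each cell of an N-dimensional matrix additionally has to consider up to $2^N - 1$ neighbouring cells.

```python
def dp_comparisons(n, num_sequences):
    """Cells to fill when aligning num_sequences sequences of length n at once."""
    return n ** num_sequences


for num_sequences in range(2, 7):
    # Scientific notation keeps the rapidly growing numbers readable.
    print(num_sequences, f"{dp_comparisons(100, num_sequences):.0e}")
```

Going from two to six sequences of length 100 raises the count from ten thousand to a million million cells.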

Sequence-sequence alignment by Dynamic Programming: Example of two sequences which are aligned by the dynamic programming algorithm of Needleman-Wunsch. As you already know from earlier lectures, the two sequences are placed along the two sides of the matrix. Each element in the matrix represents two residues, one from each sequence, being aligned at that position. To calculate the score in each position (i,j), one looks at the alignment that has already been made up to that point and finds the best way to continue. Having gone through the entire matrix in this way, one can go back and trace which way through the matrix gives the best alignment.
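The matrix fill and traceback described above can be sketched in Python as follows. This is a minimal illustration under assumed scoring values (match +1, mismatch -1, linear gap penalty -2), not the scoring scheme used in the lecture; real applications would use an exchange matrix and usually affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell extends the best of its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several paths give the same score, this sketch prefers the diagonal move; other tie-breaking rules give equally optimal alignments.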

Multi-dimensional dynamic programming (Murata et al., 1985) [Figure: dynamic programming matrix with Sequence 1, Sequence 2 and Sequence 3 along its three axes]

The MSA approach (Lipman et al. 1989). Key idea: restrict the computational costs by determining a minimal region within the n-dimensional matrix that contains the optimal path.

The MSA method in detail. Let’s consider 3 sequences. (1) Calculate all pair-wise alignment scores by dynamic programming. (2) Use the scores to predict a tree. (3) Produce a heuristic multiple alignment based on the tree (quick & dirty). (4) Calculate the maximum cost for each sequence pair from the multiple alignment (upper bound) and determine the paths with lower costs. (5) Determine the spatial positions that must be calculated to obtain the optimal alignment (intersecting areas, or ‘hypersausage’, around the matrix diagonal). Note: redundancy caused by highly correlated sequences is avoided. Lipman et al. used weighting schemes for the aligned sequences (based on phylogenetic trees), as similar sequences should not dominate the multiple sequence alignment.

The DCA (Divide-and-Conquer) approach (Stoye et al. 1997). Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences. This procedure is reiterated until the sequences are sufficiently short; the short sequence families are then aligned optimally by MSA. Finally, the resulting short alignments are concatenated.

So in effect …

The progressive alignment method. Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related. Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree. But before going into details, some notes on multiple alignment profiles … Progressive methods do not optimize a score function!

Making a guide tree. Pairwise alignments (all-against-all) of sequences 1 to 5 give similarity scores (Score 1-2, Score 1-3, …, Score 4-5). Using a similarity criterion, these scores are collected in a 5×5 similarity matrix, from which the guide tree is derived.
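A guide tree can be derived from such a matrix by hierarchical clustering. The Python sketch below uses average-linkage (UPGMA-style) clustering on distances; the slides do not name the clustering method, so this choice is an assumption, and the distance values for the five sequences A to E are hypothetical.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Average-linkage clustering; returns the guide tree as nested tuples.

    dist maps frozenset({x, y}) to the distance between sequences x and y.
    """
    # Each cluster is keyed by its (sub)tree and stores its member sequences.
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


# Hypothetical pairwise distances (e.g. derived from alignment scores).
raw = {("A", "B"): 4, ("A", "C"): 8, ("A", "D"): 9, ("A", "E"): 14,
       ("B", "C"): 9, ("B", "D"): 10, ("B", "E"): 15,
       ("C", "D"): 2, ("C", "E"): 13, ("D", "E"): 14}
dist = {frozenset(pair): d for pair, d in raw.items()}

print(guide_tree("ABCDE", dist))  # ('E', (('C', 'D'), ('A', 'B')))
```

With these distances C and D are joined first, then A and B, then the two pairs, and finally E. A progressive method would align the sequences in that order.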

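Progressive multiple alignment. The pairwise scores of sequences 1 to 5 (Score 1-2, Score 1-3, …, Score 4-5) are collected in a 5×5 similarity matrix; the scores are converted to distances; the distances give the guide tree; the guide tree gives the multiple alignment (with iteration possibilities).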

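General progressive multiple alignment technique (follow the generated tree) [Figure: guide tree over the sequences; the closest pair is aligned first, then sequences and already aligned blocks are joined step by step, following the tree towards the root]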

PRALINE progressive strategy. PRALINE is a global progressive alignment algorithm that re-evaluates at each alignment step which sequences or blocks of sequences should be aligned, and hence determines on the fly the order in which sequences are aligned. At each step, PRALINE checks which of the pair-wise alignments (sequence-sequence, sequence-profile, profile-profile) has the highest score; this one gets selected. In addition, by creating pre-profiles, distant sequences are no longer considered independently at the last alignment step. [Figure: the alignment growing step by step: 1 3, then 1 3 2, then 1 3 2 5, then 1 3 2 5 4]

Progressive alignment strategy. (1) All individual pairwise alignments of sequences A to E and construction of a distance matrix [Figure: distance matrix over A to E; surviving values: 11 20 30 27 36 9 33]. (2) Calculating a guide tree: C & D are the closest pair; A & B are the next closest pair. (3) Aligning C/D and A/B separately using dynamic programming. Figure adapted from Xiong, J., “Essential Bioinformatics”.

But how can we align blocks of sequences? The dynamic programming algorithm performs well for pairwise alignment (two axes). So we should try to treat the blocks as a “single” sequence …

How to represent a block of sequences? Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position. Modern methods: alignment profile, a representation that retains the information about frequencies of amino acids observed at each alignment position.

Consensus sequence
Sequence 1  F A T N M G T S D P P T H T R L R K L V S Q
Sequence 2  F V T N M N N S D G P T H T K L R K L V S T
Consensus   F * T N M * * S D * P T H T * L R K L V S *
Problem: loss of information. For larger blocks of sequences it “punishes” more distant members. For example, choose between 0.5 * s(A,V) + 0.5 * s(V,V) and 0.5 * s(A,A) + 0.5 * s(A,V); or even an “intermediate” residue can be chosen.
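The difference between the two representations can be made concrete with a short Python sketch applied to the two sequences above. The function names are illustrative, and the rule of writing `*` where no single residue is most frequent is an assumption that reproduces the consensus line shown on the slide.

```python
from collections import Counter


def column_profile(block):
    """Per-column residue frequencies for a block of aligned sequences."""
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def consensus(block, unknown="*"):
    """Most frequent residue per column; `unknown` if there is no single winner."""
    out = []
    for freqs in column_profile(block):
        ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        out.append(unknown if tie else ranked[0][0])
    return "".join(out)


block = ["FATNMGTSDPPTHTRLRKLVSQ",
         "FVTNMNNSDGPTHTKLRKLVST"]
print(consensus(block))          # F*TNM**SD*PTHT*LRKLVS*
print(column_profile(block)[1])  # {'A': 0.5, 'V': 0.5}
```

The consensus only records that the second column is ambiguous, whereas the profile keeps the information that A and V each occur with frequency 0.5.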

Alignment profiles. Advantage: full representation of the sequence alignment (more information retained). Not only used in alignment methods, but also in sequence-database searching (to detect distant homologues). Also called PSSM (position-specific scoring matrix). Loss of information: all positions are treated as independently distributed, so dependencies between positions (e.g. in motifs) are not captured.

Multiple alignment profiles (Gribskov et al. 1987) Gribskov created a probe: group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins). Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe). The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

Multiple alignment profiles [Figure: profile spanning a core region, a gapped region and another core region; each position i has a column of amino acid frequencies fA, fC, fD, …, fW, fY plus position-dependent gap penalties (gap-open, gap-extension)]. NB. In Gribskov’s approach, gap penalties are position-dependent. This is to say that where gaps appear in some of the probe sequences, the insertion/deletion penalty for those positions is lower than elsewhere.

Profile building. Example: each amino acid is represented as a frequency and gap penalties as weights. [Figure: profile over positions i with rows A, C, D, …, W, Y holding example frequencies (0.5, 0.3, 0.1, 0.5, 0.2, 0.1, …) and position-dependent gap penalties 1.0, 0.5, 1.0]

Profile-sequence alignment [Figure: a profile aligned against the sequence ACD……VWY]

Sequence to profile alignment. Profile position: 0.4 A, 0.2 L, 0.4 V. Score of amino acid L in a sequence that is aligned against this profile position: Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)

Profile-profile alignment [Figure: two profiles, each with rows A, C, D, …, Y, aligned against each other]

Profile to profile alignment. Column 1: 0.4 A, 0.2 L, 0.4 V. Column 2: 0.75 G, 0.25 S. Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions: Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S), where s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, Blosum62) for the amino acid pair (x,y).
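Both scoring rules, sequence to profile and profile to profile, are the same frequency-weighted sum, as the following Python sketch suggests. A single residue is treated as a column in which that residue has frequency 1.0. The exchange values in `TOY_S` are assumptions chosen for illustration only; in practice s(x,y) would be looked up in a full matrix such as PAM250 or Blosum62.

```python
# Toy exchange values s(x, y); symmetric, for illustration only (assumed).
TOY_S = {
    frozenset("LA"): -1, frozenset("L"): 4, frozenset("LV"): 1,
    frozenset("AG"): 0, frozenset("LG"): -4, frozenset("VG"): -3,
    frozenset("AS"): 1, frozenset("LS"): -2, frozenset("VS"): -2,
}


def s(x, y):
    """Look up the exchange value for the amino acid pair (x, y)."""
    return TOY_S[frozenset((x, y))]


def column_score(col1, col2):
    """Frequency-weighted sum of exchange values between two profile columns."""
    return sum(f1 * f2 * s(x, y)
               for x, f1 in col1.items()
               for y, f2 in col2.items())


profile_position = {"A": 0.4, "L": 0.2, "V": 0.4}
other_position = {"G": 0.75, "S": 0.25}

# Sequence to profile: residue L against the profile position.
print(round(column_score({"L": 1.0}, profile_position), 2))        # 0.8
# Profile to profile: the two columns from the example.
print(round(column_score(profile_position, other_position), 2))    # -1.7
```

With the toy values, the first score is 0.4*(-1) + 0.2*4 + 0.4*1 = 0.8, and the six terms of the second sum add up to -1.7.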

So, for scoring profiles … Think of sequence-sequence alignment: same principles, but more information for each position. Reminder: the sequence pair alignment score S comes from the sum of the positional scores $M(aa_i, aa_j)$ (i.e. the substitution matrix values at each alignment position, minus gap penalties if applicable). Profile alignment scores are computed in the same way, but the positional scores are more complex.

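Scoring a profile position. At each position (column) we have different residue frequencies for each amino acid (rows). So instead of saying $S = M(aa_1, aa_2)$ (one residue pair), for each frequency $f > 0$ (the amino acid is actually there at least once) we take: $S = \sum_{i} \sum_{j} f_i \, f_j \, M(aa_i, aa_j)$, where $f_i$ and $f_j$ are the frequencies of amino acids $aa_i$ and $aa_j$ in the two aligned columns.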

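Log-average score. Remember the substitution matrix formula? $M(aa_i, aa_j) = \log \frac{q_{ij}}{p_i \, p_j}$, where $q_{ij}$ is the observed frequency of the aligned pair and $p_i$, $p_j$ are the background frequencies of the two amino acids. In log-average scoring (von Öhsen et al., 2003): $S = \log \sum_{i} \sum_{j} f_i \, f_j \, \frac{q_{ij}}{p_i \, p_j}$. What is the effect? The logarithm is taken of the frequency-weighted average of the odds ratios, instead of taking the frequency-weighted sum of all logs. Mathematically correct, but does it work?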

Progressive alignment strategy. (1) Perform pair-wise alignments of all of the sequences (all against all). (2) Use the alignment scores to make a similarity (or distance) matrix. (3) Use that matrix to produce a guide tree. (4) Align the sequences successively, guided by the order and relationships indicated by the tree. Methods: Biopat (Hogeweg and Hesper 1984; first integrated method ever), MULTAL (Taylor 1987), DIALIGN (1&2, Morgenstern 1996), PRRP (Gotoh 1996), ClustalW (Thompson et al. 1994), PRALINE (Heringa 1999), T-Coffee (Notredame 2000), POA (Lee 2002), MUSCLE (Edgar 2004), ProbCons (Do 2005)