Sequence analysis 2005 - lecture 6 Sequence analysis course Lecture 6 Multiple sequence alignment 2 of 3 Multiple alignment methods.

Presentation transcript:

Sequence analysis lecture 6 Sequence analysis course Lecture 6 Multiple sequence alignment 2 of 3 Multiple alignment methods

Biological definitions for related sequences
- Homologues are similar sequences that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
- Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
- Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
- Xenologues are similar sequences whose relationship stems from horizontal transfer events (through symbiosis, viruses, etc.) rather than from vertical descent.

So this means … [figure; source attribution not preserved in the transcript]

Multiple sequence alignment
- Sequences can be conserved across species and perform similar or identical functions.
  > Alignments hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved.
  > They allow identification of regions or domains that are critical to functionality.
- Sequences can be mutated or rearranged to perform an altered function.
  > Alignments help identify which changes in the sequences have caused a change in functionality.
Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself
- How do we get a multiple alignment (three or more sequences)?
- What is our aim? Do we go for maximum accuracy, least computational time, or the best compromise?
- What do we want to achieve each time?

Sequence-sequence alignment [figure: one sequence aligned against another sequence]

Multiple alignment methods
- Multi-dimensional dynamic programming
  > extension of pairwise sequence alignment.
- Progressive alignment
  > incorporates phylogenetic information to guide the alignment process.
- Iterative alignment
  > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Simultaneous multiple alignment: multi-dimensional dynamic programming
The combinatorial explosion
- 2 sequences of length n: $n^2$ comparisons.
- The number of comparisons increases exponentially with the number of sequences, i.e. $n^N$, where n is the length of the sequences and N is the number of sequences.
- Impractical for even a small number of short sequences.
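To get a feel for this growth, the slide's estimate ($n^N$) can simply be tabulated. The function name `dp_comparisons` is made up for this sketch, and it counts only the comparisons named on the slide, ignoring the extra neighbours each cell has in higher dimensions.

```python
def dp_comparisons(n: int, num_seqs: int) -> int:
    """Number of comparisons for num_seqs sequences of length n (the slide's n**N estimate)."""
    return n ** num_seqs


if __name__ == "__main__":
    # Sequences of ~250 residues, from a pairwise alignment up to 8 sequences.
    for num_seqs in range(2, 9):
        print(f"N={num_seqs}: {dp_comparisons(250, num_seqs):.3e} comparisons")
```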

Multi-dimensional dynamic programming (Murata et al., 1985) [figure: three axes labelled Sequence 1, Sequence 2, Sequence 3]
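As a rough illustration of what dynamic programming over three sequences involves, here is a minimal sketch that fills a three-dimensional score table. It assumes sum-of-pairs column scoring with a linear gap penalty and arbitrary match/mismatch values; none of these choices come from the slide, and the sketch returns only the optimal score, without a traceback. It is not meant as a reproduction of the formulation by Murata et al.

```python
from itertools import product


def sp_column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column (gap-gap pairs score 0)."""
    score = 0
    for x in range(len(column)):
        for y in range(x + 1, len(column)):
            a, b = column[x], column[y]
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                score += gap
            else:
                score += match if a == b else mismatch
    return score


def align3_score(s1, s2, s3):
    """Optimal global sum-of-pairs score for three sequences (O(n^3) cells)."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    table = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    table[0][0][0] = 0
    # Each move advances a non-empty subset of the three sequences: 2^3 - 1 = 7 moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == 0 and j == 0 and k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    column = (
                        s1[pi] if di else "-",
                        s2[pj] if dj else "-",
                        s3[pk] if dk else "-",
                    )
                    best = max(best, table[pi][pj][pk] + sp_column_score(column))
                table[i][j][k] = best
    return table[n1][n2][n3]
```

Each cell looks at seven predecessors instead of the three used in pairwise alignment, which is one reason the cost rises so quickly with the number of sequences.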

The MSA approach
- MSA (Lipman et al., 1989, PNAS 86, 4412)
- MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.
- Calculate all pair-wise alignment scores.
- Use the scores to predict a tree.
- Calculate pair weights based on the tree (lower bound).
- Produce a heuristic alignment based on the tree.
- Calculate the maximum weight for each sequence pair (upper bound).
- Determine the spatial positions that must be calculated to obtain the optimal alignment.
- Perform the optimal alignment.
- Report the weight found compared to the maximum weight previously found (measure of divergence).
- Extremely slow and memory intensive.
- Max 8-9 sequences of ~250 residues.

The DCA approach
- DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)
- Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.
- This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.
- This procedure is re-iterated until the sequences are sufficiently short.
- The short sequence families are then aligned optimally by MSA.
- Finally, the resulting short alignments are concatenated.

So in effect … [figure: Sequence 1, Sequence 2, Sequence 3]

Multiple alignment methods
- Multi-dimensional dynamic programming
  > extension of pairwise sequence alignment.
- Progressive alignment
  > incorporates phylogenetic information to guide the alignment process.
- Iterative alignment
  > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

The progressive alignment method
- Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.
- Principle: construct an approximate phylogenetic tree for the sequences to be aligned, and then build up the alignment by progressively adding sequences in the order specified by the tree.
- But before going into details, some notes on multiple alignment profiles …

How to represent a block of sequences?
- Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position.
- Modern methods: alignment profile, a representation that retains the information about frequencies of amino acids observed at each alignment position.
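The difference between the two representations is easy to see on a toy block. The sketch below is only illustrative: the function names and the five-sequence block are invented for this example, and gap characters are treated like any other symbol.

```python
from collections import Counter


def alignment_profile(block):
    """Per-column residue frequencies for a block of equal-length aligned sequences."""
    if len({len(row) for row in block}) != 1:
        raise ValueError("all rows of an alignment block must have the same length")
    depth = len(block)
    return [
        {aa: count / depth for aa, count in Counter(column).items()}
        for column in zip(*block)
    ]


def consensus_sequence(block):
    """Most frequent residue per column; ties are resolved arbitrarily."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*block))


block = ["ACDA", "ACDA", "ACEV", "GCDV", "ACDL"]  # made-up example block
profile = alignment_profile(block)
consensus = consensus_sequence(block)
# The last column holds A, A, V, V, L: the profile keeps 0.4 / 0.4 / 0.2,
# whereas the consensus has to pick a single letter and loses that information.
```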

Multiple alignment profiles (Gribskov et al. 1987)
- Gribskov created a probe: a group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins).
- Then he constructed a profile, which consists of a position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe).
- The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

Sequence analysis lecture 6 Multiple alignment profiles ACDWYACDWY - i fA.. fC.. fD..  fW.. fY.. Gapo, gapx Position dependent gap penalties Core region Gapped region Gapo, gapx fA.. fC.. fD..  fW.. fY.. fA.. fC.. fD..  fW.. fY..

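Profile building
- Example: each amino acid is represented as a frequency, and gap penalties as weights.
- Gap penalties are position dependent.
[figure: profile with rows A, C, D, …, W, Y and a gap-penalty row, columns indexed by alignment position i; the numeric values were not preserved in the transcript]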

Profile-sequence alignment [figure: a profile with rows A, C, D, …, V, W, Y aligned against a sequence]

Sequence to profile alignment
Profile position built from the alignment column A, A, V, V, L: 0.4 A, 0.2 L, 0.4 V.
Score of amino acid L in a sequence that is aligned against this profile position:
$\mathrm{Score} = 0.4\,s(L,A) + 0.2\,s(L,L) + 0.4\,s(L,V)$
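The same calculation as a small sketch. The exchange values below are illustrative numbers in the style of BLOSUM62 and should be checked against a real matrix before use; the helper names `s` and `sequence_to_profile_score` are invented here.

```python
# Illustrative exchange values (assumed, BLOSUM62-like), keyed by sorted residue pair.
EXCHANGE = {("A", "L"): -1, ("L", "L"): 4, ("L", "V"): 1}


def s(x, y):
    """Symmetric lookup in the toy exchange matrix."""
    return EXCHANGE[tuple(sorted((x, y)))]


def sequence_to_profile_score(residue, column_freqs):
    """Frequency-weighted exchange score of one residue against one profile column."""
    return sum(freq * s(residue, aa) for aa, freq in column_freqs.items())


column = {"A": 0.4, "L": 0.2, "V": 0.4}  # profile position from the slide
score = sequence_to_profile_score("L", column)
# With the toy values: 0.4 * (-1) + 0.2 * 4 + 0.4 * 1
```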

Profile-profile alignment [figure: a profile with rows A, C, D, …, Y aligned against a profile with rows A, C, D, …, V, W, Y]

Profile to profile alignment
Profile position 1: 0.4 A, 0.2 L, 0.4 V. Profile position 2: 0.75 G, 0.25 S.
Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions:
$\mathrm{Score} = 0.4 \cdot 0.75\,s(A,G) + 0.2 \cdot 0.75\,s(L,G) + 0.4 \cdot 0.75\,s(V,G) + 0.4 \cdot 0.25\,s(A,S) + 0.2 \cdot 0.25\,s(L,S) + 0.4 \cdot 0.25\,s(V,S)$
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for the amino acid pair (x,y).
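Written as a double sum over both columns, the calculation looks as follows. As before, the exchange values are illustrative BLOSUM62-style numbers that should be verified against a real matrix, and the function names are invented for this sketch.

```python
# Illustrative exchange values (assumed, BLOSUM62-like), keyed by sorted residue pair.
EXCHANGE = {
    ("A", "G"): 0, ("G", "L"): -4, ("G", "V"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("S", "V"): -2,
}


def s(x, y):
    """Symmetric lookup in the toy exchange matrix."""
    return EXCHANGE[tuple(sorted((x, y)))]


def profile_to_profile_score(col1, col2):
    """Sum over all residue pairs of f1(x) * f2(y) * s(x, y)."""
    return sum(f1 * f2 * s(x, y) for x, f1 in col1.items() for y, f2 in col2.items())


col1 = {"A": 0.4, "L": 0.2, "V": 0.4}  # first profile position from the slide
col2 = {"G": 0.75, "S": 0.25}          # second profile position from the slide
score = profile_to_profile_score(col1, col2)
```

Residues with zero frequency are simply absent from the dictionaries, so only amino acids that actually occur in a column contribute terms.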

So, for scoring profiles …
- Think of sequence-sequence alignment.
- Same principles, but more information for each position.
Reminder:
- The sequence pair alignment score S comes from the sum of the positional scores $M(aa_i, aa_j)$ (i.e. the substitution matrix values at each alignment position, minus gap penalties if applicable).
- Profile alignment scores are computed in exactly the same way, but the positional scores are more complex.

Scoring a profile position
- At each position (column) we have different residue frequencies for each amino acid (rows).
SO:
- Instead of saying $S = M(aa_1, aa_2)$ (one residue pair),
- for every frequency f > 0 (the amino acid is actually there) we take the frequency-weighted sum over all residue pairs between the column of Profile 1 and the column of Profile 2.
[figure: Profile 1 and Profile 2, each with rows A, C, D, …, Y]
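The formula on this slide did not survive in the transcript. Judging from the profile-to-profile example earlier in the lecture, it is presumably the frequency-weighted sum below. The symbols $f_{1,i}$ and $f_{2,j}$ (frequency of amino acid $i$ in the column of Profile 1, and of amino acid $j$ in the column of Profile 2) are names chosen here, not taken from the slide.

```latex
S = \sum_{i=1}^{20} \sum_{j=1}^{20} f_{1,i}\, f_{2,j}\, M(aa_i, aa_j)
```

Terms with $f_{1,i} = 0$ or $f_{2,j} = 0$ vanish, which matches the slide's remark that only amino acids that are actually present contribute.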

Log-average score
- Remember the substitution matrix formula? [formula not preserved in the transcript]
- In log-average scoring (von Ohsen et al, 2003): [formula not preserved in the transcript]
- Why is this so important? Think about it…
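A possible reconstruction of the two formulas, under the usual log-odds assumption: $p_{ab}$ is the probability of seeing residues $a$ and $b$ aligned in related sequences, $p_a$ and $p_b$ are background frequencies, and $\alpha_i$, $\beta_j$ are the residue frequencies in the two profile columns. These symbol names are chosen here, and the exact form should be checked against von Ohsen et al.

```latex
% Substitution matrix entry in log-odds form
s(a,b) = \log \frac{p_{ab}}{p_a\, p_b}

% Log-average score of two profile columns:
% average the odds ratios first, then take the logarithm
S = \log \sum_{i=1}^{20} \sum_{j=1}^{20} \alpha_i\, \beta_j\, \frac{p_{ij}}{p_i\, p_j}
```

Under this reading, the frequency-weighted sum of matrix values averages the logarithms of the odds ratios, whereas the log-average score takes the logarithm of the averaged odds ratios.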

Progressive alignment
1. Perform pair-wise alignments of all of the sequences;
2. Use the alignment scores to produce a dendrogram using neighbour-joining methods (guide tree);
3. Align the sequences sequentially, guided by the relationships indicated by the tree.
Methods:
- Biopat (first method ever)
- MULTAL (Taylor 1987)
- DIALIGN (1&2, Morgenstern 1996)
- PRRP (Gotoh 1996)
- ClustalW (Thompson et al 1994)
- PRALINE (Heringa 1999)
- T-Coffee (Notredame 2000)
- POA (Lee 2002)
- MUSCLE (Edgar 2004)
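Step 2 can be sketched in a few lines. Note the substitution: instead of neighbour-joining, the sketch below uses simple average-linkage (UPGMA-style) clustering, which is shorter to write down and is enough to show how pairwise scores become a merge order. The similarity values, the `1 - score` conversion, and the function names are assumptions made for this example; steps 1 and 3 (the actual alignments) are not implemented.

```python
def scores_to_distances(sim):
    """Convert pairwise similarity scores in [0, 1] into distances (1 - score)."""
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] for j in range(n)] for i in range(n)]


def guide_tree_merges(dist):
    """Average-linkage clustering; returns the merges in the order they happen."""
    clusters = {i: (i,) for i in range(len(dist))}
    merges = []

    def average_distance(c1, c2):
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        keys = sorted(clusters)
        # Pick the pair of clusters with the smallest average distance.
        x, y = min(
            ((p, q) for idx, p in enumerate(keys) for q in keys[idx + 1:]),
            key=lambda pq: average_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        merges.append((clusters[x], clusters[y]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return merges


# Made-up similarity matrix for four sequences: 0 and 1 are close, as are 2 and 3.
similarity = [
    [1.0, 0.9, 0.3, 0.3],
    [0.9, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.8],
    [0.3, 0.3, 0.8, 1.0],
]
merges = guide_tree_merges(scores_to_distances(similarity))
```

In step 3, each merge would correspond to one alignment step: two sequences, a sequence and a profile, or two profiles, depending on what the merged clusters contain.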

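Progressive multiple alignment [figure: pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) fill a 5×5 similarity matrix; the scores are converted to distances to build the guide tree, which leads to the multiple alignment; iteration possibilities are indicated]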

General progressive multiple alignment technique (follow generated tree) [figure: alignment steps following the guide tree up to the root; numeric labels not preserved]

PRALINE progressive strategy [figure: numeric labels not preserved]

There are problems …
Accuracy is very important!
- Alignment errors made during the construction of the MSA cannot be repaired afterwards: they are propagated into the subsequent progressive steps.
- The comparisons of sequences at early steps during progressive alignment cannot make use of information from other sequences.
- It is only later during the alignment progression that more information from other sequences (e.g. through profile representation) becomes employed in the alignment steps.
"Once a gap, always a gap" (Feng & Doolittle, 1987)