Multiple Sequence Comparison.


Multiple Sequence Comparison

/course/eleg667-01-f/Topic-2c
Outline
Motivation
Multiple Sequence Alignment using Dynamic Programming
Multiple Sequence Alignment using Heuristics
Star Alignments
Tree Alignments (CLUSTAL W)
PSI-BLAST and Multiple Sequence Alignment
Evaluation of Alignment Methods
Summary

Pair-wise sequence comparison
“In biomolecular sequences (DNA, RNA, protein), high sequence similarity usually implies significant functional or structural similarity.”
This underlies the effectiveness of pair-wise sequence comparison and of biological database searching.
Goal: find sequences that have common sub-patterns but may not have been known to be biologically related.

Multiple sequence comparison
“Evolutionarily and functionally related molecular sequences can differ significantly at the sequence level and yet preserve similar function and/or structure.”
This underlies the effectiveness of multiple sequence comparison.
Goal: deduce unknown conserved patterns from a set of sequences already known to be biologically related.

Common MSA Applications
Characterization and representation of protein families, and later identification of other potential members of the family.
Identification and representation of conserved sequence features that correlate with structure and function.
Deduction of evolutionary history.

Common MSA Applications (cont.)
To detect or demonstrate homology between new sequences and existing families of sequences.
To help predict the secondary and tertiary structures of new sequences.
To suggest oligonucleotide primers for PCR.

Comparing Multiple Sequences
We can compare multiple sequences by aligning them and assigning a score to each alignment. This is Multiple Sequence Alignment (MSA).

Application of MSA
[Diagram: Homology search (e.g. BLAST) against a database -> top-scoring hits -> MSA -> conserved regions, evolution paths, ...]

Definition of MSA
A Multiple Sequence Alignment is obtained by inserting into each sequence a (possibly zero) number of gaps so that the resulting sequences are of the same length and each column has at least one character different from ‘-’ (gap).
Input sequences:
IMAGINABLE
IMPRACTICABLE
INFALLIBLE
An alignment of length 13:
IM--AG-INABLE
IMPRACTICABLE
IN-FALLI--BLE
An arrangement of length 14, whose third column consists only of gaps and so does not satisfy the definition:
IM---AG-INABLE
IM-PRACTICABLE
IN--FALLI--BLE
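The definition can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_valid_msa` is hypothetical. It tests the three conditions: equal row lengths, no column made only of gaps, and each row spelling its input sequence once the gaps are removed.

```python
def is_valid_msa(rows, originals, gap="-"):
    """Return True if `rows` is a valid multiple alignment of `originals`."""
    if not rows or len(rows) != len(originals):
        return False
    width = len(rows[0])
    # All gapped sequences must have the same length.
    if any(len(r) != width for r in rows):
        return False
    # Every column needs at least one non-gap character.
    for col in zip(*rows):
        if all(c == gap for c in col):
            return False
    # Removing the gaps must give back the input sequences.
    return all(r.replace(gap, "") == s for r, s in zip(rows, originals))


seqs = ["IMAGINABLE", "IMPRACTICABLE", "INFALLIBLE"]
first = ["IM--AG-INABLE", "IMPRACTICABLE", "IN-FALLI--BLE"]
second = ["IM---AG-INABLE", "IM-PRACTICABLE", "IN--FALLI--BLE"]

print(is_valid_msa(first, seqs))   # True
print(is_valid_msa(second, seqs))  # False: the third column is all gaps
```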

How to score an alignment?
The Sum-of-Pairs (SP) score: a multiple alignment implies a pair-wise alignment for each pair of sequences; SP defines the score of a multiple alignment as the sum of the scores of all implied pair-wise alignments.
Scoring: match = 1, mismatch = 0, gap-character = -1, gap-gap = 0. Note: score(-,-) = 0.
Example alignment:
 A  A  C  G  T  A  C  G  A  T  A
 A  -  C  G  T  A  -  A  A  T  G
 G  T  C  G  T  A  -  -  T  T  A
Column scores:
 1 -2  3  3  3  3 -2 -2  1  3  1   (sum = 12)
Scores of the three implied pair-wise alignments: 5, 3 and 4, so the SP score = 12.
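The SP computation can be written in a few lines. This is a small sketch using the scoring values from the slide; the function names `pair_score` and `sp_score` are illustrative, not from the original.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap_char=-1, gap_gap=0, gap="-"):
    """Score one pair of aligned characters."""
    if a == gap and b == gap:
        return gap_gap
    if a == gap or b == gap:
        return gap_char
    return match if a == b else mismatch


def sp_score(rows, **scoring):
    """Sum-of-Pairs score: add the pair scores over every column and every pair of rows."""
    return sum(
        pair_score(a, b, **scoring)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


rows = ["AACGTACGATA", "A-CGTA-AATG", "GTCGTA--TTA"]
print(sp_score(rows))  # 12
```

Summing column by column or pair by pair gives the same total, which is why the slide can show both the per-column scores and the three pair-wise scores.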

MSA using dynamic programming
For k sequences of size n: O(n^k) space and O(k^2 2^k n^k) time.
The DP table has n^k cells, each cell needs 2^k - 1 calculations, and each calculation needs k(k-1)/2 pair comparisons to compute the SP score.
[Figure: DP lattice for three sequences; with k = 3 there are 2^3 - 1 = 7 calculations per cell.]
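To make the cost concrete, here is a sketch of the exact k-dimensional DP with SP scoring. It is written for clarity, not efficiency: it is memoized recursion over the n^k cells, each of which tries the 2^k - 1 possible last columns. The function name `exact_msa` and the default scores (match = 1, mismatch = 0, gap-character = -1, gap-gap = 0) are assumptions for illustration.

```python
from functools import lru_cache
from itertools import combinations, product


def exact_msa(seqs, match=1, mismatch=0, gap_char=-1):
    """Optimal SP-score alignment of a few short sequences by full DP."""
    k = len(seqs)
    # The 2^k - 1 moves: each sequence either advances (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    def column_score(col):
        total = 0
        for a, b in combinations(col, 2):  # k(k-1)/2 pairs per column
            if a == "-" and b == "-":
                continue  # gap-gap scores 0
            if a == "-" or b == "-":
                total += gap_char
            else:
                total += match if a == b else mismatch
        return total

    @lru_cache(maxsize=None)
    def best(pos):
        """Best (score, columns) for the prefixes ending at `pos`."""
        if not any(pos):
            return 0, ()
        candidates = []
        for m in moves:
            prev = tuple(p - d for p, d in zip(pos, m))
            if min(prev) < 0:
                continue
            col = tuple(s[p - 1] if d else "-" for s, p, d in zip(seqs, pos, m))
            score, cols = best(prev)
            candidates.append((score + column_score(col), cols + (col,)))
        return max(candidates, key=lambda c: c[0])

    score, cols = best(tuple(len(s) for s in seqs))
    rows = ["".join(col[i] for col in cols) for i in range(k)]
    return score, rows


score, rows = exact_msa(["AG", "AC", "GC"])
print(score)
print("\n".join(rows))
```

Even for three sequences the table grows as n^3, which is the motivation for the heuristics that follow.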

Recall the pair-wise case
DP table for the sequences GA (columns) and AG (rows):
        G    A
    0  -1   -2
A  -1   0    0
G  -2   0    0
[Figure: the nine cells of the table drawn as a grid of nodes numbered 1 to 9.]
Question: from node 1 to node 9, how many paths are there? In other words, when the DP comparison ends, how many distinct paths have been explored in total for this example?
Answer: let us count. Total = 13.
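The table on this slide can be reproduced with the standard global-alignment recurrence. The sketch below assumes match = 1, mismatch = 0 and gap = -1, which is consistent with the values shown; the function name `nw_table` is illustrative.

```python
def nw_table(row_seq, col_seq, match=1, mismatch=0, gap=-1):
    """Fill the global-alignment DP table (rows: row_seq, columns: col_seq)."""
    n_rows, n_cols = len(row_seq) + 1, len(col_seq) + 1
    table = [[0] * n_cols for _ in range(n_rows)]
    # First column and first row: prefixes aligned against gaps only.
    for i in range(1, n_rows):
        table[i][0] = i * gap
    for j in range(1, n_cols):
        table[0][j] = j * gap
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            same = row_seq[i - 1] == col_seq[j - 1]
            diag = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            table[i][j] = max(diag, up, left)
    return table


for row in nw_table("AG", "GA"):
    print(row)
```

Each cell looks at three predecessors (diagonal, up, left), which is the k = 2 case of the 2^k - 1 calculations per cell.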

Align Multiple Sequences
Assume we have 3 sequences: AG, AC and GC. How do we do DP?
[Figure: three-dimensional DP lattice for the three sequences.]
Question: when the DP comparison ends, how many distinct paths have been explored in total?
Answer: count!
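One way to do the counting is with the same recurrence that DP uses: the number of paths into a cell is the sum of the path counts of its predecessors. The sketch below assumes that a "path" is any sequence of DP moves from the origin to the far corner, where a move advances any non-empty subset of the sequences; the function name `count_paths` is hypothetical.

```python
from functools import lru_cache
from itertools import product


def count_paths(lengths):
    """Count DP paths through the lattice for sequences of the given lengths."""
    k = len(lengths)
    # Each move advances a non-empty subset of the k sequences by one position.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def paths(pos):
        if not any(pos):
            return 1  # the origin is reached in exactly one way
        total = 0
        for m in moves:
            prev = tuple(p - d for p, d in zip(pos, m))
            if min(prev) >= 0:
                total += paths(prev)
        return total

    return paths(tuple(lengths))


print(count_paths((2, 2)))     # 13, the pair-wise example with two sequences of length 2
print(count_paths((2, 2, 2)))  # 409, three sequences of length 2 such as AG, AC, GC
```

Under this interpretation, adding a third sequence of length 2 raises the count from 13 to 409, which illustrates how quickly the search space grows with k.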

MSA Using DP with Heuristics
How can we cut down the search space (the number of calculations) at each step?
One way is to eliminate pair-wise projections that do not contribute to the optimal alignment. The task is to develop such a test.

Other MSA Methods Using Heuristics
Star Alignment: build a multiple alignment based upon the pair-wise alignments between a fixed sequence, called the “center” of the input set, and all others.
Tree Alignment: build a multiple alignment based upon the pair-wise alignments along the edges of a tree relating all the sequences.

Star Alignment
Given k sequences:
Pick one of the sequences as the center.
Find optimal pair-wise alignments between the center sequence and each other sequence.
Aggregate the pair-wise alignments (progressive alignment).

Aggregate Step
Use the center Sc as a guide.
Start with one pair-wise alignment, say Sc and S1, and aggregate the remaining pairs one at a time.
When adding a pair (Si, Sc), progressively add gaps to Sc as needed to suit the new alignment, and never remove gaps.

Star Alignment (cont.)
How should we select the center sequence?
Build a table with the pair-wise similarity score for each pair of sequences.
Choose the sequence with the highest sum of scores.

Star Alignment (cont.)
S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATCCAATTTT
S4 = ATCTTCTT
S5 = ACTGACC
[Figure: DP table for the pair S1 = ATTGCCATT and S2 = ATGGCCATT; score = 7.]
Pair-wise similarity scores:
      S1   S2   S3   S4   S5
S1     .    7   -2    0   -3
S2     7    .   -2    0   -4
S3    -2   -2    .    0   -7
S4     0    0    0    .   -3
S5    -3   -4   -7   -3    .
The row sums are 2, 1, -11, -3 and -17, so S1 is picked as the center.
For k sequences, each of size n: Time = T1 = O((k(k-1)/2) n^2) = O(k^2 n^2).
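The center selection is a row-sum over the score table. A short sketch using the pair-wise scores from this slide follows; the helper name `pick_center` is illustrative.

```python
# Pair-wise similarity scores from the slide (each unordered pair listed once).
scores = {
    ("S1", "S2"): 7, ("S1", "S3"): -2, ("S1", "S4"): 0, ("S1", "S5"): -3,
    ("S2", "S3"): -2, ("S2", "S4"): 0, ("S2", "S5"): -4,
    ("S3", "S4"): 0, ("S3", "S5"): -7,
    ("S4", "S5"): -3,
}


def pick_center(pair_scores):
    """Return the sequence with the highest sum of pair-wise scores, plus all sums."""
    totals = {}
    for (a, b), s in pair_scores.items():
        totals[a] = totals.get(a, 0) + s
        totals[b] = totals.get(b, 0) + s
    center = max(totals, key=totals.get)
    return center, totals


center, totals = pick_center(scores)
print(center)  # S1
print(totals)  # {'S1': 2, 'S2': 1, 'S3': -11, 'S4': -3, 'S5': -17}
```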

Star Alignment (cont.)
Pair-wise alignments with the center S1:
S1 = ATTGCCATT
S2 = ATGGCCATT

S1 = ATTGCCATT--
S3 = ATC-CAATTTT

S1 = ATTGCCATT
S4 = ATCTTC-TT

S1 = ATTGCCATT
S5 = ACTGACC--

Aggregated multiple alignment:
S1 = ATTGCCATT--
S2 = ATGGCCATT--
S3 = ATC-CAATTTT
S4 = ATCTTC-TT--
S5 = ACTGACC----
“Once a gap, always a gap”
For k sequences, each of size n, and an upper bound a on the alignment length:
Time = T2 = O((k-1) n^2 + (k-1)^2 a)
T1 + T2 = O(k^2 n^2 + k n^2 + (k-1)^2 a)
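The aggregate step can be sketched as a column-by-column merge that walks the center row of the growing alignment and the center row of the next pair-wise alignment in parallel. This is one possible implementation of "once a gap, always a gap", not necessarily the one used in the course; the function names are hypothetical, and the input is the set of pair-wise alignments from the slide.

```python
def add_to_star(msa, pair_center, pair_other):
    """Merge one pair-wise alignment (center vs. other) into `msa`, whose first row is the center."""
    center = msa[0]
    merged = [[] for _ in msa]
    new_row = []
    i = j = 0
    while i < len(center) or j < len(pair_center):
        msa_gap = i < len(center) and center[i] == "-"
        pair_gap = j < len(pair_center) and pair_center[j] == "-"
        if msa_gap and not pair_gap:
            # Gap column already in the MSA: keep it and pad the new sequence.
            for row, src in zip(merged, msa):
                row.append(src[i])
            new_row.append("-")
            i += 1
        elif pair_gap and not msa_gap:
            # The new pair needs a gap in the center: open it in every existing row.
            for row in merged:
                row.append("-")
            new_row.append(pair_other[j])
            j += 1
        else:
            # Same center residue (or a gap on both sides): copy both columns.
            for row, src in zip(merged, msa):
                row.append(src[i])
            new_row.append(pair_other[j])
            i += 1
            j += 1
    return ["".join(r) for r in merged] + ["".join(new_row)]


def star_merge(pairwise):
    """pairwise: list of (center_row, other_row) alignments sharing the same center."""
    msa = list(pairwise[0])
    for pair_center, pair_other in pairwise[1:]:
        msa = add_to_star(msa, pair_center, pair_other)
    return msa


pairwise = [
    ("ATTGCCATT", "ATGGCCATT"),      # S1 vs S2
    ("ATTGCCATT--", "ATC-CAATTTT"),  # S1 vs S3
    ("ATTGCCATT", "ATCTTC-TT"),      # S1 vs S4
    ("ATTGCCATT", "ACTGACC--"),      # S1 vs S5
]
for row in star_merge(pairwise):
    print(row)
```

Gaps are only ever added to rows already in the alignment, never removed, so each earlier pair-wise alignment with the center is preserved inside the result.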

Issues in Star Alignment
How to select the best anchor?
How to determine the order of progression?

Tree Alignment
Uses a clustering technique to order groups of related sequences in a hierarchical tree.
Based on the tree hierarchy (ordered from leaves to root), the multiple sequence alignment is generated by aligning and combining groups of sequences.

The Basic Idea of Tree Alignment
[Figure: (a) a set of sequences S1 to S9; (c) a pair-wise distance matrix over S1 to S9.]

Question: in general, given pair-wise distances between a set S of objects (e.g. a distance matrix), how do we derive a weighted tree T where each leaf of T corresponds to an object in S, and the distance between two leaves i and j corresponds to the distance between i and j in S?
Answer: this is an important problem in computational biology and has been studied by many authors using a variety of techniques.

Clustal W – A Tool of Progressive Multiple Sequence Alignment with Improved Sensitivity

CLUSTAL W (Cont.)
All pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences.
A guide tree is calculated from the distance matrix.
The sequences are progressively aligned according to the branching order in the guide tree.

CLUSTAL W (Cont.)
[Figure: distance matrix over S1 to S4 with entries D12, D13, D14, D23, D24, D34, and the guide tree derived from it with leaves S1, S3, S2, S4.]

CLUSTAL W (Cont.)
[Figure: progressive alignment along the guide tree.]
Align the most similar pair (S2, S4), inserting gaps to optimize the alignment.
Align the next most similar pair (S1, S3).
Align the alignments, preserving gaps: a new gap is introduced to optimize the alignment of (S1 S3) with (S2 S4).

Clustal W: Some Implementation Hints

Distance Matrix
Initially all sequences are aligned pair-wise.
[Figure: lower-triangular distance matrix with rows S2 to S8 and columns S1 to S7. Two entries are labelled: D(S1, S2) = 7 and D(S3, S1) = 8. The remaining entries, in extraction order, are: 17 14 11 12 10 13 8 5 11 8 5 13 10 7 8 16 13 10 11 5 13 10 7 8 6 9.]
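The slides do not say how a distance is derived from a pair-wise alignment. One common choice, used here purely as an assumption, is one minus the fraction of identical residues over the columns where neither sequence has a gap. The function names below are hypothetical, and the two example alignments are taken from the star-alignment slides.

```python
def identity_distance(aligned_a, aligned_b, gap="-"):
    """1 - (identical residues / compared positions), skipping columns with a gap."""
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        identical += a == b
    if compared == 0:
        return 1.0  # nothing comparable: treat as maximally distant
    return 1.0 - identical / compared


def distance_matrix(names, aligned_pairs):
    """aligned_pairs maps (name_i, name_j) to the two gapped rows of their alignment."""
    index = {name: k for k, name in enumerate(names)}
    dist = [[0.0] * len(names) for _ in names]
    for (a, b), (row_a, row_b) in aligned_pairs.items():
        d = identity_distance(row_a, row_b)
        dist[index[a]][index[b]] = dist[index[b]][index[a]] = d
    return dist


pairs = {
    ("S1", "S2"): ("ATTGCCATT", "ATGGCCATT"),
    ("S1", "S4"): ("ATTGCCATT", "ATCTTC-TT"),
}
for row in distance_matrix(["S1", "S2", "S4"], pairs):
    print(row)
```

Any other divergence measure could be substituted; only the resulting symmetric matrix matters for building the guide tree.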

Two Options for Pairwise Alignment
Fast approximate method (Bashford, D., Chothia, C., 1987, J. Mol. Biol.): allows large numbers of sequences to be aligned even on a microcomputer.
Full dynamic programming alignments (Myers, E., Miller, W., 1988, CABIOS): two gap penalties and a full weight matrix.

The Guide Tree
Unrooted tree: calculated from the distance matrix (Neighbour-Joining Method).
Rooted tree: calculated from the unrooted tree (Middle Point Method).

Unrooted Tree
The Neighbour-Joining Method provides not only the topology but also the branch lengths (Fitch, Margoliash) of the final tree.
Each node represents a sequence.
Each path length represents the distance between two specific sequences.

Unrooted Tree - Example
[Figure: unrooted tree with leaves S1 to S8, internal nodes A to F, and branch lengths L1 to L8.]

Neighbour Joining Method
[Figure: a star-like tree with leaves S1 to S8 joined at a central node X, and the tree after a pair of leaves is split off through a new internal node Y.]
S12 = sum of all branch lengths = f(D's)
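A single joining step can be sketched as follows. Instead of evaluating the sum of all branch lengths directly, the sketch uses the commonly stated Q-criterion form of neighbour joining, which selects the same pair; this substitution, the function name `nj_join_step`, and the 5-taxon distance matrix are illustrative assumptions and do not come from the slides.

```python
def nj_join_step(dist):
    """One neighbour-joining step on a symmetric distance matrix (n >= 3).

    Returns the joined pair (i, j), the branch lengths from i and j to the
    new internal node, and the distances from the new node to the other leaves.
    """
    n = len(dist)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q-criterion: the smallest value identifies the pair to join.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the new internal node.
    length_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    # Distances from the new node to every remaining leaf.
    new_row = {
        k: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
        for k in range(n)
        if k not in (i, j)
    }
    return (i, j), (length_i, length_j), new_row


# Hypothetical distances between five sequences (not from the slides).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_join_step(example))
```

Repeating the step on the reduced matrix, with the new node replacing the joined pair, yields the unrooted tree with its branch lengths.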

NJ-Method Example

PSI-BLAST
Observation: database searches using position-specific score matrices, also called profiles or motifs, are often much better able to detect weak relationships than database searches that use a simple sequence as the query.

PSI-BLAST (Cont'd)
PSI-BLAST uses a procedure to construct a position-specific score matrix automatically from the output of a BLAST run, and a modified BLAST that operates on such a matrix in place of a simple query.
The resulting PSI-BLAST program is often substantially more sensitive than the corresponding BLAST program.

PSI-BLAST and Multiple Sequence Alignment
PSI-BLAST also produces a multiple sequence alignment with the query sequence as a master template:
Collect all hits with an E-value below a threshold, say 0.01.
Do not include copies of sequences identical to the query.
Retain one copy for each hit which is very similar to the query.
Other details.
The MSA constructed is used by PSI-BLAST for constructing a scoring matrix.

Where Does PSI-BLAST Differ from Other “True” MSA Methods?
PSI-BLAST deals with local alignments, so each column of M (the multiple alignment) may involve a varying number of sequences. In fact, some columns may include only the query sequence itself.
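To illustrate how a scoring matrix can be built from an alignment whose columns have varying depth, here is a deliberately simplified sketch. It is not PSI-BLAST's actual procedure, which involves sequence weighting and a more careful pseudocount scheme; the function name `build_pssm`, the uniform background frequencies, the single pseudocount and the toy rows are all assumptions. A space marks a position that a local hit does not cover.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(rows, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds score per column and residue from equal-length aligned rows.

    '-' is a gap and ' ' means the row does not cover the column; both are
    ignored, so each column is scored from the sequences that reach it.
    """
    background = 1.0 / len(alphabet)  # uniform background (an assumption)
    pssm = []
    for col in zip(*rows):
        counts = Counter(c for c in col if c in alphabet)
        depth = sum(counts.values())
        scores = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount * background) / (depth + pseudocount)
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment: the query in the first row, two local hits below it.
rows = ["MKVL", "MKV ", " KIL"]
pssm = build_pssm(rows)
print(round(pssm[0]["M"], 2), round(pssm[1]["K"], 2))
```

Columns covered by more sequences give more confident scores; a column seen only in the query stays closer to the background because the pseudocount carries relatively more weight.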

Classification of Multiple Sequence Alignment Methods
[Diagram: classification tree of MSA methods. Categories: Progressive, Iterative (local), Global Alignment, Local Alignment. Methods named: DALIGN, PIMA, HMM (HMMT), STAR, Tree, Genetic Algorithm (SAGA), MULTAL, MULTALIGN, PILEUP, CLUSTAL W, PSI-BLAST.]

How to Compare Alignment Software?

CASA: A Server for the Critical Assessment of Protein Sequence Alignment Accuracy
[Diagram: remote users (User 1 to User 4) connect over a high-speed network to a benchmark server that holds a FASTA protein database, CE structural alignments and a web-interactive benchmarking program, which compares sequence alignments against structural alignments. Example alignments shown: ASVIE-AAVI / VIVI-EPAAG and A-SVIE-AAV- / VIVI-EPAAG.]
Remote users:
Download FASTA sequences.
Produce a set of sequence alignments.
Submit the resulting alignments.
The benchmarking program evaluates the parameters.

Methods of Discovery of Biological Sequence Homology
[Diagram: classification tree of homology discovery methods. Labels as extracted: Alignment; Pattern Matching; Pair wise; MSA; Eventuation All and verify; Scan Seeds And ???; Combined; Optimal; Heuristic; Heuristic; FLASH; Global; Local; FAST; BLAST; Progressive; Iterative; MOTIF/ASSET (see slide 41); Discover; DP; PRATT; TEIRESIAS; ATGC.]