Multiple Sequence Alignment (II)

Slides:



Advertisements
Similar presentations
Multiple Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 6, 2005 ChengXiang Zhai Department of Computer Science University.
Advertisements

Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Multiple Sequence Alignment
COFFEE: an objective function for multiple sequence alignments
Multiple alignment: heuristics. Consider aligning the following 4 protein sequences S1 = AQPILLLV S2 = ALRLL S3 = AKILLL S4 = CPPVLILV Next consider the.
Heuristic alignment algorithms and cost matrices
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Multiple Sequences Alignment Ka-Lok Ng Dept. of Bioinformatics Asia University.
Introduction to Bioinformatics Algorithms Multiple Alignment.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures CLUSTAL W Algorithm Lecturer:
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Multiple Alignment. Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments.
Sequence Analysis Tools
Multiple alignment: heuristics
Multiple sequence alignment
Sequence Alignment III CIS 667 February 10, 2004.
BNFO 602 Multiple sequence alignment Usman Roshan.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Introduction to Bioinformatics Algorithms Multiple Alignment.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Chapter 5 Multiple Sequence Alignment.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Alignment Modified from Tolga Can’s lecture notes (METU)
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen
Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Multiple Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW:
Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Multiple Alignment Lecture 20.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.
1 Multiple Sequence Alignment(MSA). 2 Multiple Alignment Number of sequences >2 Global alignment Seek an alignment that maximizes score.
Lecture 11 CS5661 Structural Bioinformatics – Structure Comparison Motivation Concepts Structure Comparison.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
BIOINFORMATICS Ayesha M. Khan Spring Lec-6.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Multiple Sequence Alignment Dr. Urmila Kulkarni-Kale Bioinformatics Centre University of Pune
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
Multiple Sequence Alignment
SMA5422: Special Topics in Biotechnology
CSE 5290: Algorithms for Bioinformatics Fall 2011
In Bioinformatics use a computational method - Dynamic Programming.
Sequence Based Analysis Tutorial
Pairwise Sequence Alignment (cont.)
Multiple Alignment.
Multiple Sequence Alignment (I)
Multiple Sequence Alignment
Multiple Sequence Alignment
Presentation transcript:

Multiple Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 6, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

How can we help biologists do multi-sequence alignment? Dynamic programming for multi-sequence alignment gives an exact solution, but it’s computationally expensive How can we help biologists do multi-sequence alignment?

When finding an exact solution is computationally too expensive, we explore how to find an approximate solution… So, how do we find a good approximation of the optimal multi-sequence alignment?

Inferring Multiple Alignment from Pairwise Alignments From an optimal multiple alignment, we can infer pairwise alignments between all sequences, but they are not necessarily optimal It is difficult to infer a ``good” multiple alignment from optimal pairwise alignments between all sequences

Combining Optimal Pairwise Alignments into Multiple Alignment Can combine pairwise alignments into multiple alignment Can not combine pairwise alignments into multiple alignment

Inferring Pairwise Alignments 3 sequences, 3 comparisons 4 sequences, 6 comparisons 5 sequences, 10 comparisons

Multiple Alignment: Greedy Approach Choose most similar pair of strings and combine into a consensus, reducing alignment of k sequences to an alignment of of k-1 “sequences”. Repeat This is a heuristic greedy method u1= AC-TAC-TAC-T… u2 = TTAATTAATTAA… … uk = CCGGCCGGCCGG… u1= ACGTACGTACGT… u2 = TTAATTAATTAA… u3 = ACTACTACTACT… … uk = CCGGCCGGCCGG k-1 k

Greedy Approach: Example Consider these 4 sequences s1 GATTCA s2 GTCTGA s3 GATATT s4 GTCAGC

Greedy Approach: Example (cont’d) There are = 6 possible alignments s2 GTCTGA s4 GTCAGC (score = 2) s1 GAT-TCA s2 G-TCTGA (score = 1) s3 GATAT-T (score = 1) s1 GATTCA-- s4 G—T-CAGC(score = 0) s2 G-TCTGA s3 GATAT-T (score = -1) s3 GAT-ATT s4 G-TCAGC (score = -1)

Greedy Approach: Example (cont’d) s2 and s4 are closest; combine: s2 GTCTGA s4 GTCAGC s2,4 GTCTGA (consensus) There are many (4) alternative choices for the consensus, let’s assume we randomly choose one new set becomes: s1 GATTCA s3 GATATT s2,4 GTCTGA

Greedy Approach: Example (cont’d) scores are: set is: s1 GAT-TCA s3 GATAT-T (score = 1) s1 GATTCA s3 GATATT s2,4 GTCTGA s1 GATTC--A s2,4 G-T-CTGA (score = 0) s3 GATATT- s2,4 G-TCTGA (score=-1) Take best pair and form another consensus: s1,3 = GATATT (arbitrarily break ties)

Greedy Approach: Example (cont’d) scores is: new set is: s1,3 GATATT s2,4 G-TCTGA (score=-1) s1,3 GATATT s2,4 GTCTGA Form consensus: s1,3,2,4 = GATCTG (arbitrarily break ties)

Progressive Alignment Progressive alignment is a variation of greedy algorithms with a somewhat more intelligent strategy for scheduling the merges Progressive alignment works well for close sequences, but deteriorates for distant sequences Gaps in consensus string are permanent Simplified representation of the alignments Better solution? Use a profile to represent consensus A 3 2 1 T G C ATG-CAA AT-CCA- ACG-CTG Hidden Markov Models (HMMs) capture such a pattern…

Feng-Doolittle Progressive Alignment Step 1: Compute all possible pairwise alignments Step 2: Convert alignment scores to distances Step 3: Construct a “guide tree” by clustering Step 4: Progressive alignment based on the guide tree (bottom up) Note that variations are possible at each step!

Feng-Doolittle: Clustering Example Similarity matrix (from pairwise alignment) Length normalization Convert score to distance X1 X2 X3 X4 X5 X1 15 11 3 4 30 5 3 1 5 25 12 11 3 4 12 40 9 4 1 11 9 30 X2 X3 X4 4.5 X5 Guide tree 10 X5 X3 X1 X2 X4 15 12 X1 X2 X3 X4 X5

Feng-Doolittle: How to generate a multiple alignment? At each step, follow the guide tree and consider all possible pairwise alignments of sequences in the two candidate groups ( 3 cases): Sequence vs. sequence Sequence vs. group (the best matching sequence in the group determines the alignment) group vs. group (the best matching pair of sequences determines the alignment) “Once a gap, always a gap” gap is replaced by a neutral symbol X X can be matched with any symbol, including a gap without penalty

Generating a Multi-Sequence Alignment Align the two most similar sequences Following the guide tree, add in the next sequences, aligning to the existing alignment Insert gaps as necessary Sample output: FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **: Dots and stars show how well-conserved a column is.

Problems with Feng-Doolittle All alignments are completely determined by pairwise sequence alignment (restricted search space) No backtracking (subalignment is “frozen”) No way to correct an early mistake Non-optimality: Mismatches and gaps at highly conserved region should be penalized more, but we can’t tell where is a highly conserved region early in the process  Profile alignment  Iterative refinement

Profile Alignment Aligning two alignments/profiles Treat each alignment as “frozen” Alignment them with a possible “column gap” Fixed for any two given alignments Only need to optimize this part

Iterative Refinement Re-assigning a sequence to a different cluster/profile Repeatedly do this for a fixed number of times or until the score converges Essentially to enlarge the search space

ClustalW: A Multiple Alignment Tool Essentially following Feng-Doolittle Do pairwise alignment (dynamic programming) Do score conversion/normalization (Kimura’s model, not covered) Construct a guide tree (neighbour-journing clustering, will be covered later) Progressively align all sequences using profile alignment

ClustalW Heuristics Avoid penalizing minority sequences Sequence weighting Consider “evolution time” (using different sub. Matrices) More reasonable gap penalty, e.g., Depends on the actual residues at or around the positions (e.g., hydrophobic residues give higher gap penalty) Increase the gap penalty if it’s near a well-conserved region (e.g., perfectly aligned column) Postpone low-score alignment until more profile information is available.

Heuristic 1: Sequence Weighting Motivation: address sample bias Idea: Down weighting sequences that are very similar to other sequences Each sequence gets a weight Scoring based on weights Score for one column w1: peeksavtal w2: peeksavlal w3:egewglvlhv w4:aaektkirsa Sequence weighting

Heuristic 2: Sophisticated Gap Weighting Initially, GOP: “gap open penalty” GEP: “gap extension penalty” Adjusted gap penalty Dependence on the weight matrix Dependence on the similarity of sequences Dependence on lengths of the sequences Dependence on the difference in the lengths of the sequences Position-specific gap penalties Lowered gap penalties at existing gaps Increased gap penalties near existing gaps Reduced gap penalties in hydrophilic stretches Residue-specific penalties

Gap Adjustment Heuristics Weight matrix: Gap penalties should be comparable with weights Similarity of sequences GOP should be larger for closely related sequences Sequence length Long sequences tend to have higher scores Difference in sequence lengths Avoid too many gaps in the short sequence GOP = {GOP+log[min(N,M)]}* (avg residue mismatch score) * (percent identity scaling factor) N, M = sequence lengths GEP = GEP *[1.0+|log(N/M)|] N>M

Gap Adjustment Heuristics (cont.) Position-specific gap penalties Lowered gap penalties at existing gaps Increased gap penalties near existing gaps Reduced gap penalties in hydrophilic stretches (5 AAs) Residue-specific penalties (specified in a table) GOP = GOP * 0.3 *(no. of sequences without a gap/no. of sequences) GOP = GOP * {2+[(8-distance from gap) *2]/8} GOP = GOP * 1/3 If no gaps, and one sequence has a hydrophilic stretch GOP = GOP * avgFactor If no gaps and no hydrophilic stretch. Average over all the residues at the position

Heuristic 3: Delayed Alignment of ‘ Divergent Sequences Divergence measure: Average percentage of identity with any other sequence Apply a threshold (e.g., 40% identity) to detect divergent sequences(“outliers”) Postpone the alignment of divergent sequences until all of the rest have been aligned

What You Should Know Basic steps involved in progressive alignment Major heuristics used in progressive alignment Why a progressive alignment algorithm is not optimal