Needleman-Wunsch with affine gaps


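Needleman-Wunsch with affine gaps
- Gap score for a gap of length g: γ(g) = -d - (g - 1)e, where d = 2 (gap open) and e = 1 (gap extension)
- Traceback precedence on ties: M, Ix, Iy
- PAM250 scores for the residues involved:
        A    C    D
    A   2
    C  -2   12
    D   0   -5    4
- Align the sequences CA and DC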
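A minimal sketch of the exercise, assuming the standard three-state recurrences (match state M, gap states Ix and Iy) and computing only the optimal score, not the traceback. The function names are illustrative, and the A/D score of 0 is the standard PAM250 value rather than something stated on the slide. The precedence M, Ix, Iy only matters for breaking ties during traceback, which this sketch omits.

```python
NEG_INF = float("-inf")

# PAM250 excerpt for A, C, D. The A/D entry (0) is an assumption:
# it is the standard PAM250 value.
PAM250 = {("A", "A"): 2, ("A", "C"): -2, ("C", "C"): 12,
          ("A", "D"): 0, ("C", "D"): -5, ("D", "D"): 4}

def s(a, b):
    """Symmetric lookup of a substitution score."""
    return PAM250.get((a, b), PAM250.get((b, a)))

def affine_nw_score(x, y, d=2, e=1):
    """Optimal global alignment score with gap score -d - (g - 1) * e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x[i] aligned to y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            M[i][j] = best_prev + s(x[i - 1], y[j - 1])
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])

print(affine_nw_score("CA", "DC"))  # 8, from the alignment -CA / DC-
```

The best alignment opens a gap at each end so that the two C residues pair up: 12 for C/C minus 2 for each single-residue gap.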

Multiple sequence alignment Biology 162 Computational Genetics Todd Vision 2 Sep 2004

Preview
- How to score a multiple alignment
  - Sum-of-pairs scores
  - Weighting
- Generalizing pairwise alignment algorithms
  - Full dynamic programming
  - Carrillo-Lipman
- Practical methods
  - Progressive
  - Iterative
  - Stochastic
  - Probabilistic
- Some final thoughts

Multiple sequence alignment (MSA)

Mind the gaps
[Figure: two example alignments, one labeled "Trivial" and one labeled "Difficult"]

Natural score
- Tree score
- Even with a known tree, finding an MSA that optimizes the tree score is NP-hard
[Figure: tree over nodes A, B, C, D, E with edge scores S_AE, S_BD, S_DE, S_CD]

Star-tree scores
- Assume an unresolved phylogeny
- Sum-of-pairs (SP)
- Entropy
- Consistency
  - Weighs agreement with external evidence
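As a rough illustration of the sum-of-pairs idea, the sketch below scores each column by summing a pairwise score over every pair of rows. The scoring function `toy_score` and the choice to score gap/gap pairs as 0 are assumptions made for the example, and columns are treated independently, so gaps are effectively linear here rather than affine.

```python
from itertools import combinations

def toy_score(a, b):
    """Hypothetical scores: match +1, mismatch -1, residue vs gap -2."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_column(column, score=toy_score, gap_gap=0):
    """Sum the scores of all unordered pairs in one alignment column."""
    total = 0
    for a, b in combinations(column, 2):
        total += gap_gap if a == b == "-" else score(a, b)
    return total

def sp_alignment(rows, score=toy_score, gap_gap=0):
    """SP score of a whole alignment, treating columns independently."""
    return sum(sp_column(col, score, gap_gap) for col in zip(*rows))

print(sp_column("AAAC"))                    # 0: three matches, three mismatches
print(sp_alignment(["AAC", "AAC", "A-G"]))  # -1: columns score 3, -3, -1
```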

Entropy as used in a sequence logo
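A small sketch of the per-column quantity behind a sequence logo, assuming DNA (alphabet size 4), ignoring gap characters, and omitting the small-sample correction that logo software often applies. The function names are illustrative.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residue frequencies in one column."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_content(column, alphabet_size=4):
    """Logo height: maximum possible entropy minus observed entropy."""
    return math.log2(alphabet_size) - column_entropy(column)

print(information_content("AAAA"))  # 2.0 bits: perfectly conserved
print(information_content("AACC"))  # 1.0 bit
print(information_content("ACGT"))  # 0.0 bits: no conservation
```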

SP scores: pros and cons
- Pros
  - Easy, intuitive, work OK
- Cons
  - Substitution scores are based on pairs of residues
  - Inconsistent behavior with k (the number of sequences)
    - One mismatch matters more when k is large than when k is small
  - Gap penalties are undefined for s(-,-)

Natural gap penalties
- Gap costs in a multiple alignment should equal the sum of the gap costs in the induced pairwise alignments
- Computationally prohibitive to compute for most algorithms
- Instead, quasi-natural gap costs are computed
  - They are almost always identical to the natural gap costs

Weighted SP scores
- Scores are not independent due to (unaccounted for) shared ancestry
- To correct this, sum-of-pairs scores from related sequences can be down-weighted
- A variety of weighting schemes exist
- Tree-based weighting is simplest
  - Assign weights proportional to the sum of branch lengths on a phylogenetic tree
  - Obviously requires a tree (but we have an approximate tree in some algorithms)

Full dynamic programming
- We have k sequences of length n
- Recursion equations are similar to the pairwise case
- We can use a simple extension of pairwise scoring
- As before, we can guarantee an optimal alignment
- The problem is that we must fill out a k-dimensional hypercube
- Time and space grow exponentially in k
  - At least O(k^2 2^k n^k)
- Computationally prohibitive even for a moderate number of short sequences
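To get a feel for the growth, the sketch below counts the cells of the k-dimensional lattice, (n + 1)^k, and the predecessor lookups, assuming each cell has up to 2^k - 1 predecessors (every non-empty subset of sequences can advance). The additional k^2 factor in O(k^2 2^k n^k) comes from evaluating an SP column score and is left out. The function name is illustrative.

```python
def full_dp_cost(k, n):
    """Return (cells, predecessor lookups) for k sequences of length n."""
    cells = (n + 1) ** k
    predecessors_per_cell = 2 ** k - 1
    return cells, cells * predecessors_per_cell

for k in (2, 3, 5, 10):
    cells, lookups = full_dp_cost(k, 100)
    print(f"k={k:2d}  cells={cells:.3e}  lookups={lookups:.3e}")
```

With sequences of length 100, three sequences already need about a million cells, and ten need roughly 10^20.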

Carrillo-Lipman algorithm
- Reduces the volume of the hypercube that is searched
- Upper bound on score
  - Score of the optimal MSA is less than or equal to the sum of the scores of the optimal pairwise alignments
- Lower bound on score
  - Score of the optimal MSA must be greater than or equal to the score of a heuristic MSA
- Projections in each dimension are defined by the optimal pairwise alignments and the induced heuristic alignments
- The optimum path is bounded by the projections in all dimensions
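The two bounds can be combined into a per-pair threshold. The notation below is an assumption made for this sketch: a^{kl} is the projection of a multiple alignment a onto sequences k and l, \hat{a}^{kl} is the optimal pairwise alignment of k and l, a^{*} is the optimal MSA, and \sigma is the score of the heuristic MSA.

```latex
\begin{align*}
S(a) &= \sum_{k<l} S(a^{kl}), \qquad S(a^{kl}) \le S(\hat{a}^{kl}) \\
\sigma \le S(a^{*}) &\le S(a^{*kl})
  + \sum_{\substack{k'<l' \\ (k',l') \ne (k,l)}} S(\hat{a}^{k'l'}) \\
\Rightarrow\quad S(a^{*kl}) &\ge \beta^{kl}
  = \sigma - \sum_{k'<l'} S(\hat{a}^{k'l'}) + S(\hat{a}^{kl})
\end{align*}
```

Only cells whose projection onto each pair (k, l) lies on some pairwise path scoring at least \beta^{kl} need to be visited, which is what shrinks the searched volume.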

Carrillo-Lipman algorithm

Carrillo-Lipman algorithm
- Only works for the SP scoring function
- Implemented in the MSA program
- Can still only tackle small cases (e.g. 15 sequences of length 300)

Practical global alignment methods
- Progressive
  - Uses a guide tree to reduce the problem to multiple pairwise alignments
- Iterative
  - Initialized with a fast multiple alignment
  - Sequences are randomly partitioned and pairwise aligned until convergence
- Stochastic
  - Genetic algorithms as an example
- Probabilistic
  - Hidden Markov models

Progressive alignment
- Fast, but no guarantee of finding the optimum
- Implementations: Feng-Doolittle, ClustalW, Pileup
- Steps
  1. Compute all k(k-1)/2 pairwise alignments
  2. Use the alignment scores to construct a guide tree
  3. Perform pairwise alignments beginning at the leaves of the guide tree and working toward the root
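The guide-tree step can be sketched as simple average-linkage clustering. This is an illustration only: it assumes the pairwise alignment scores have already been converted to distances, the distances in the example are made up, and real implementations build the tree differently (ClustalW, for instance, uses neighbor-joining). The order in which clusters are merged is the order in which the pairwise or profile alignments would then be performed.

```python
from itertools import combinations

def average_distance(leaves_a, leaves_b, dist):
    """Mean leaf-to-leaf distance between two clusters (average linkage)."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))

def build_guide_tree(names, dist):
    """Return a nested-tuple guide tree built by repeatedly merging
    the two closest clusters."""
    # Each cluster is (subtree, list of leaf names in that subtree).
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(
                clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical distances: s1/s2 and s3/s4 are close, everything else is far.
names = ["s1", "s2", "s3", "s4"]
dist = {frozenset(pair): 8.0 for pair in combinations(names, 2)}
dist[frozenset(("s1", "s2"))] = 1.0
dist[frozenset(("s3", "s4"))] = 2.0
print(build_guide_tree(names, dist))  # (('s1', 's2'), ('s3', 's4'))
```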

Pairwise score matrix
              Seq 2  Seq 3  Seq 4  Seq 5
  Sequence 1  S12    S13    S14    S15
  Sequence 2         S23    S24    S25
  Sequence 3                S34    S35
  Sequence 4                       S45

Guide tree
[Figure: guide tree joining Sequences 1 to 5]

New problem
- How to align a sequence to an alignment? Or two alignments to each other?
- Feng-Doolittle solution
  - Choose the highest-scoring pair of sequences between the two groups to guide the alignment
- ClustalW solution
  - Profile alignment: compute a generalized sum-of-pairs score
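One way to read "generalized sum-of-pairs" is to score a column of one alignment against a column of another by averaging the pairwise score over all cross-pairs of residues. The sketch below does that through residue frequencies. It is an illustration under that assumption, not ClustalW's exact scheme (which also applies sequence weights and position-specific gap penalties), and `toy_score` is a made-up scoring function.

```python
from collections import Counter

def toy_score(a, b):
    """Hypothetical scores: match +1, mismatch -1, anything vs gap 0."""
    if a == "-" or b == "-":
        return 0
    return 1 if a == b else -1

def column_frequencies(column):
    """Relative frequency of each symbol (including '-') in a column."""
    counts = Counter(column)
    return {symbol: c / len(column) for symbol, c in counts.items()}

def profile_column_score(col_a, col_b, score=toy_score):
    """Average pairwise score over all cross-pairs between two columns."""
    fa, fb = column_frequencies(col_a), column_frequencies(col_b)
    return sum(pa * pb * score(a, b)
               for a, pa in fa.items() for b, pb in fb.items())

print(profile_column_score("AA", "AA"))  # 1.0
print(profile_column_score("AA", "AC"))  # 0.0: half matches, half mismatches
```

A score like this can stand in for the substitution score in an ordinary pairwise dynamic program, with profile columns playing the role of residues.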

Profiles
[Figure: Profile I and Profile II, each a table of weights w indexed by residue (a, c, g, t) and position (1 to 4), with a row S equal to 1 at every position]

  pos   1  2  3  4
  a     w  w  w  w
  c     w  w  w  w
  g     w  w  w  w
  t     w  w  w  w
  S     1  1  1  1

ClustalW: ad hoc improvements
- Variable substitution matrix
- Encourage gaps preferentially in structural loops
  - Residue-specific gap penalties
  - Reduced penalties in hydrophilic regions
- Reduced gap penalties in positions already containing gaps
- Increased gap-opening penalties in the flanking sequence of a gap

Progressive alignment: major weakness
- Errors introduced in the alignment of subgroups are propagated through all subsequent steps
- There is no provision for correcting such errors once they happen
- Local optimum versus global optimum

Iterative alignment
- Again capitalizes on the ease of pairwise alignment between groups of sequences
- Allows gaps to be removed and positions to be shifted in each iteration
- Some algorithms guarantee convergence given enough time
- Can be several orders of magnitude slower than progressive methods
- Most successful implementation: PRRN

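Iterative alignment
[Figure: worked example of iterative realignment on short DNA sequences; sequence rows in the order they appear on the slide]
  ACGATAGACAT
  ACG-TACAGAT
  ACGATAGACAT
  ACGATAGACAT
  CGA-TAGAGAC
  CGA-TACAGAC
  ACGATAGACAT
  ACG-TACAGAT
  -CGATAGAGAC
  -CGATACAGAC
  CGA-TAGAGAC
  CGA-TACAGAC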

T-COFFEE
- Uses consistency as an objective function
- Evaluates consistency with pairs of residues found in optimal local alignments and heuristic global alignments
- The consistency function can also incorporate extraneous information (such as structural constraints)
- Among the most successful approaches when % identity is moderate to good

Dialign
- A multiple local alignment algorithm
- Informally, it works by chaining together ungapped segments from dotplots
- Does not explicitly score gaps at all
- May contain unaligned regions flanked by aligned regions

Stochastic methods
- Genetic algorithms (e.g. SAGA)
  - Initialize with a population of heuristic alignments
  - Evaluate the ‘fitness’ of individual alignments
    - Can employ computationally intensive scoring functions
  - Create a new generation of alignments
    - Select parents according to fitness
    - ‘Cross over’ attributes of parents
    - Apply mutation to perturb progeny alignments
  - Return to the ‘evaluate fitness’ step
  - Stopping rule

Probabilistic methods
- Hidden Markov models
  - Models that generate MSAs
  - Many parameters to fit
    - Probability of each residue in each column
    - Probability of entering gap states between columns
  - Perform poorly on unaligned sequences
  - But are commonly used in signature databases
    - Perform well for finding matches to already aligned sequences
    - Efficient algorithms exist for aligning sequences to HMMs

Hidden Markov model

How do you know when you’ve got the right answer?
- Short answer: you don’t.
- Structural superposition is typically used to evaluate methodologies
- BAliBASE: database of curated reference alignments

Comparison of test and reference alignments
- Modified SP score
  - Frequency with which pairs of residues aligned in the test alignment are also aligned in the reference
- Column score
  - Frequency with which entire columns of residues are aligned in both test and reference
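Both measures can be sketched by converting each alignment into residue indices, so that columns are compared by which residues they contain and not by their letters. The normalization is an assumption here: both scores are divided by the number of pairs or columns in the test alignment, following the wording above, whereas some benchmarks normalize by the reference instead. The function names and example alignments are illustrative.

```python
from itertools import combinations

def residue_columns(alignment):
    """For each column, a tuple of per-sequence residue indices
    (None where the sequence has a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for seq_idx, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[seq_idx])
                counters[seq_idx] += 1
        columns.append(tuple(entry))
    return columns

def aligned_pairs(alignment):
    """Set of (seq_i, residue_i, seq_j, residue_j) placed in the same column."""
    pairs = set()
    for col in residue_columns(alignment):
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs

def sp_agreement(test, reference):
    """Fraction of residue pairs aligned in test that are aligned in reference."""
    test_pairs = aligned_pairs(test)
    return len(test_pairs & aligned_pairs(reference)) / len(test_pairs)

def column_agreement(test, reference):
    """Fraction of test columns reproduced exactly in the reference."""
    ref_cols = set(residue_columns(reference))
    test_cols = residue_columns(test)
    return sum(col in ref_cols for col in test_cols) / len(test_cols)

reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
print(sp_agreement(test, reference))      # 0.75: 3 of 4 aligned pairs agree
print(column_agreement(test, reference))  # 0.6: 3 of 5 columns agree
```

Both functions divide by a count taken from the test alignment, so an empty test alignment (or one with no aligned pairs) would need a guard in real use.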

Be skeptical!
- MSA is a hard problem
  - Computationally
  - Biologically
- There is no ‘one size fits all’ algorithm
- No two algorithms need agree

The future of MSA
- Chances are your new sequence matches something already in the database
- It may soon be a rarity to generate an MSA from scratch
- Signature databases currently allow local alignment of a query to a pre-existing local multiple alignment (e.g. InterProScan)

Summary
- Challenges in MSA
  - Even bounded dynamic programming is impractical
  - Appropriate scoring is not obvious
- How MSA is achieved in practice
  - Fastest
    - Progressive pairwise alignment
  - Slower
    - Iterative alignment
    - Stochastic alignment
- Automated MSAs require manual scrutiny

Reading Assignment
- Pertsemlidis A, Fondon JW (2001) Having a BLAST with bioinformatics (and avoiding BLASTphemy), 10 pgs. http://genomebiology.com/2001/2/10/reviews/2002.1

Reading Assignment
- Gusfield D (1997) pgs. 376-381 in Algorithms on Strings, Trees and Sequences.
- Durbin et al. (1998) pgs. 36, 38-41 in Biological Sequence Analysis.