The ideal approach is simultaneous alignment and tree estimation.

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Optimal Sum of Pairs Multiple Sequence Alignment David Kelley.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Structural bioinformatics
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Heuristic alignment algorithms and cost matrices
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
BNFO 602 Multiple sequence alignment Usman Roshan.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
Pairwise alignment Computational Genomics and Proteomics.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
MCB 5472 Lecture #6: Sequence alignment March 27, 2014.
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Biology 4900 Biocomputing.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Xuhua Xia Sequence Alignment Xuhua Xia
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Protein Sequence Alignment and Database Searching.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Sequence Alignment Xuhua Xia
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
INTRODUCTION TO BIOINFORMATICS
Multiple sequence alignment (msa)
Sequence comparison: Local alignment
Dot Plots, Path Matrices, Score Matrices
Sequence Alignment 11/24/2018.
In Bioinformatics use a computational method - Dynamic Programming.
Pairwise Sequence Alignment
Multiple Sequence Alignment (I)
Introduction to Bioinformatics
A T C.
BIOINFORMATICS Sequence Comparison
Basic Local Alignment Search Tool (BLAST)
Multiple Sequence Alignment
Presentation transcript:

The ideal approach is simultaneous alignment and tree estimation. Lecture 5 – Alignment The ideal approach is simultaneous alignment and tree estimation. 1 ACCGAAT--ATTAGGCTC 2 AC---AT--AGTGGGATC 3 AA---AGGCATTAGGATC 4 GA---AGGCATTAGCATC 5 CACGAAGGCATTGGGCTC It looks like there have be two insertion/deletion events, one at positions 3-5 & one at positions 8-9. So really, figuring out where insertions and deletions have occurred requires a historical framework (i.e., a phylogeny), which (at least in this class) is the motivation for collecting sequence data in the first place.

We need a way to evaluate alternative alignments A simple score: D = s + w g Where s is the score for match/mismatch, g is a gap cost and w is a gap weight. Goal: find the globally optimal combination of topology + alignment However, given that phylogenetic estimation from a single alignment is a problem that is so computationally difficult as to require heuristic methods (short cuts), simultaneous alignment is really not feasible. The most commonly use heuristic in MSA is progressive alignment.

Progressive Alignment Overview. Progressive alignment occurs in three steps. 1. A pair-wise alignment is generated for all (n2-n)/2 pairs of sequences. A pair- wise distance is estimated between each of the pairs and this is entered into a distance matrix. 2. That distance matrix is subjected to an algorithmic tree building procedure (usually NJ). 3. This guide tree is used to align progressively more distant pairs of sequences, until all sequences are in the alignment.

Pair-wise alignments are conducted using Needleman-Wunsch (1970) algorithm G A A T T C A G T T A (#1) G G A T C G A (#2) So L1 = 11 and L2 = 7 D = s + w g Si,j = 1 if the nucleotide at position i of sequence #1 is the same as that at position j of sequence #2 Si,j = 0 (mismatch score) g = 0 (gap penalty) w = 1 (gap weight)

Needleman-Wunsch Generate a matrix with the two sequences on each axis (11x7) We initialize it with a row and column of zeroes (gap cost). Mi-1, j-1 + Si,j (match/mismatch in the diagonal) 0+1=1 Mi, j-1 + wg (gap in sequence 1) 0+0=0 Mi-1, j + wg (gap in sequence 2) 0+0=0 The matrix is filled by finding the maximal score Mi,j for each cell. Mi,j = Maximum value of:

Needleman-Wunsch 1 1 1 1 We fill in the entire matrix with the Mi,j for each cell.

Needleman-Wunsch G | - G A | A - T | T - C | A - G | T - T - A | We trace back from the bottom right, following the path with the highest score. D = 6 All paths with a score of 6 represent optimal alignments. All (n2-n)/2 pairwise comparisons – Edit distances are used to form a pairwise matrix

This matrix is subjected to an quick tree-building algorithm (NJ). Again, following the guide tree, we align sequence 3 to the fixed 1/2 alignment. 1 ACCGAAT--ATTAGGCTC 2 AC---AT--AGTGGGATC 3 AA---AGGCATTAGGATC 4 GA---AGGCATTAGCATC 5 CACGAAGGCATTGGGCTC Now, we have a series of positional homologies & we can begin phylogeny estimation.

Elaborations: A substitution matrix is used to calculate the scores for mismatches. Biochemically similar amino acids v. radical changes. Transitions cost less than transversions. Gap costs can be elaborated so that there is a separate cost associated with opening a new gap (GOP) vs. extending a gap (GEP). Each sequence can be weighted according to how different it is from the other sequences. This accounts for the case where one specific subfamily is over represented in the data set. Position-specific gap-open penalties are modified according to residue type, using empirical observations in a set of alignments based on 3D structures.

alignments to generate an initial distance matrix. The largest portion of time is invested in conducting all the pair-wise N-W alignments to generate an initial distance matrix. Use k-mer words (usually k = 6) to derive a distance. Muscle (Edgar. 2004. NAR.32:1792)

F. Models in Alignment One problem is that the final alignment is dependent on the parameters, and the optimum parameter values depend on unknown processes of evolution (i.e., insertion rate, deletion rate, and rates of various substitution types). The Catch-22 is that we need a good tree to estimate these. This is because, as discussed in the beginning of this lecture, alignment and historical information are not independent of each other. There have been two approaches: One is iterative approaches (SATe II; Liu et al. 2012. Syst. Biol. 61:90) Bundles MAFFT, Muscle, & RAxML Also prunes subtrees in intermediate iterations.

Simultaneous Alignment BAliPhy: Suchard M A , Redelings B D. 2006. Bioinformatics, 22:2047-2048 This is a parametric approach that explores a vast state space: ω = (Y,A,τ,T,Θ,Λ). Y – The data (unaligned sequences) A – The alignment Θ – The substitution model = s + w(g) τ - The tree Λ – The indel model T – Its branch lengths