Trees, Stars, and Multiple Biological Sequence Alignment Jesse Wolfgang CSE 497 February 19, 2004.

Importance?
RNA folding (Trifonov, Bolshoi)
Gene regulation (Galas et al.)
Protein structure-function relationships (Wu, Kabat)
Molecular evolution (Dayhoff)

Introduction
Original sequence unknown
  – Must consider all possible transformations
  – Including insertions, deletions, and replacements
Choose the most likely set of transformations
  – With a given model of protein evolution

Sequences and Alignments
An alignment of the input sequences is written as a set of aligned sequences, one per input sequence
K-sequence: sequence of k characters
Each aligned sequence is obtained from its input sequence
  – Blanks are inserted in positions where some of the other sequences have a nonblank character
  – At least one aligned sequence must be nonblank at each position
All aligned sequences share one length, the length of the alignment

Alignments
Ex: sequences DQLF, DNVQ, QGL
  D Q L F
  D N V Q
  Q G L
aligned as
  D - - Q - L F
  D N V Q - - -
  - - - Q G L -
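
The alignment rules above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_alignment` is hypothetical, and the rows used are one possible alignment of DQLF, DNVQ, and QGL with `-` as the blank.

```python
def is_valid_alignment(sequences, aligned, blank="-"):
    """Check the alignment rules: equal row lengths, rows reduce to the
    input sequences when blanks are removed, and no all-blank column."""
    if not aligned or len(sequences) != len(aligned):
        return False
    length = len(aligned[0])
    # All aligned rows must share one length.
    if any(len(row) != length for row in aligned):
        return False
    # Removing the blanks from a row must give back its input sequence.
    if any(row.replace(blank, "") != seq for seq, row in zip(sequences, aligned)):
        return False
    # Every column must contain at least one nonblank character.
    return all(any(row[j] != blank for row in aligned) for j in range(length))


seqs = ["DQLF", "DNVQ", "QGL"]
rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(is_valid_alignment(seqs, rows))  # True
```

An alignment with a column made only of blanks fails the last check, which is why such columns never appear in a multiple alignment.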

Lattices and Paths
Lattice: Cartesian product of strings of squares
A path between the sequences is a set of connected line segments (connected broken line)
A lattice of $n$ sequences with given lengths
  – Consists of $n$-dimensional hypercubes
  – Forms an $n$-dimensional parallelepiped

Paths
2 dimensions: 3 possible paths
3 dimensions: 7 possible paths
$n$ dimensions: $2^n - 1 = O(2^n)$ possible paths
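
The count of possible paths can be reproduced by enumeration. This is an illustrative sketch (the name `lattice_moves` is hypothetical): each step out of a vertex advances along a nonempty subset of the n axes, so there are 2^n - 1 of them.

```python
from itertools import product


def lattice_moves(n):
    """All steps out of a vertex through one n-dimensional cell.

    A step is a 0/1 vector saying which axes advance; the all-zero
    vector (no movement) is excluded.
    """
    return [step for step in product((0, 1), repeat=n) if any(step)]


for n in (2, 3):
    print(n, len(lattice_moves(n)))  # prints "2 3" then "3 7"
```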

Paths
Sequences DQLF, DNVQ, QGL
[Figure: path through the 3-dimensional parallelepiped sublattice, with axes D Q L F, D N V Q, and Q G L; path segments labeled DD-, -N-, QQQ, --G, L-L, F--, -V-]

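Paths and Sequence Length
Sequences: ABCD, ABD, BCD
Note: the length of the alignment lies between the length of the longest sequence and the sum of all the sequence lengths
  A B C D
  A B - D
  - B C D
[Figure: lattice with axes A B C D, A B D, B C D]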

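Paths and Sequence Length
Sequences: ABCD, EFGH, IJK
Note: the length of the alignment lies between the length of the longest sequence and the sum of all the sequence lengths
[Figure: lattice with axes A B C D, E F G H, I J K, and the aligned sequences A B C D, E F G H, I J K]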

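Projections
Sequences DQLF, DNVQ, QGL
The projection of a multiple alignment onto two of its sequences denotes a pairwise alignment of those two sequences
  D Q - L F
  - Q G L -
[Figure: lattice with axes D Q L F, D N V Q, Q G L, and the projection of the path onto the plane of D Q L F and Q G L]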
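
A projection can be computed by keeping two rows of the multiple alignment and dropping the columns where both are blank. The sketch below assumes the three-row alignment of DQLF, DNVQ, and QGL written with `-` for blanks; the function name `project` is hypothetical.

```python
def project(aligned, i, j, blank="-"):
    """Project a multiple alignment onto rows i and j.

    Columns where both chosen rows are blank are dropped, which leaves
    a pairwise alignment of the two sequences.
    """
    kept = [(a, b) for a, b in zip(aligned[i], aligned[j])
            if not (a == blank and b == blank)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(project(rows, 0, 2))  # ('DQ-LF', '-QGL-')
```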

Optimal Paths
A measure is assigned to each path
  – Measure of the similarity among the sequences, based upon a particular metric
For each measure there is at least one path at which it attains a minimum value: the optimal path

Calculating Optimal Paths
Each vertex in L is an end corner of a sublattice
First: compute the score of each of the possible paths for the cube that has a vertex at the original corner
Next: using this information, compute the minimum score to reach the vertices of the cubes adjacent to the original corner
[Figure: lattice with axes D Q L F, D N V Q, Q G L]
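
The corner-by-corner computation can be sketched as a memoised dynamic program over lattice vertices. Everything specific here is an assumption made for illustration: the sum-of-pairs column cost, the values 1 for a mismatch and 2 for a letter against a blank, and the names `pair_cost` and `min_sp_cost`. The table has one entry per lattice vertex, so this is only practical for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product


def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed costs: 0 for equal symbols (including blank vs blank)."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def min_sp_cost(seqs):
    """Minimum sum-of-pairs cost over all paths through the lattice."""
    n = len(seqs)
    steps = [s for s in product((0, 1), repeat=n) if any(s)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0  # the origin corner
        costs = []
        for step in steps:
            prev = tuple(p - d for p, d in zip(pos, step))
            if min(prev) < 0:
                continue  # this step would start outside the lattice
            # The step emits one alignment column: a residue on every
            # axis that advances, a blank on every axis that does not.
            column = [seqs[k][prev[k]] if step[k] else "-" for k in range(n)]
            col_cost = sum(pair_cost(a, b) for a, b in combinations(column, 2))
            costs.append(best(prev) + col_cost)
        return min(costs)

    return best(tuple(len(s) for s in seqs))


print(min_sp_cost(["DQLF", "DNVQ", "QGL"]))
```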

Problems with This Algorithm
Calculates a weighted sum of its projected pairwise alignments
  – Called “Sum-of-the-Pairs” (SP)
Other methods fit biological intuition more closely

Tree-Alignment
Treat sequences as leaves of an evolutionary tree
Reconstruct ancestral sequences which minimize the cost of the tree
  – Must assign sequences to internal nodes
Align the given and reconstructed sequences
Star-alignment: only one internal node

Tree-Alignment
Many different methods exist for calculating tree alignments
This talk discusses the version used by ClustalX

Tree-Alignment in ClustalX
Three main parts
  1. Perform pairwise alignment on all sequences to calculate a distance matrix
  2. Use distance matrix to calculate a guide tree
  3. Sequences are progressively aligned using the branching order in the guide tree

Calculating Distance Matrix
Use standard dynamic programming to find the best alignment
  – Gap penalties for opening a gap and continuing a gap (possibly different)
Divide number of matches by total number of residues compared (excluding gaps) to get the percent identity
Convert to a distance by dividing the percent identity by 100 and subtracting from 1
Gives one entry in the n by n matrix

Calculating Distance Matrix
Ex: sequences ATCG, ATCC, AGGC, AGCC
  A T C G
  A T C C
  3/4 = 75% identity; 75/100 = 0.75; 1 - 0.75 = 0.25
  A T C G
  A G G C
  1/4 = 25% identity; 25/100 = 0.25; 1 - 0.25 = 0.75

Calculating Distance Matrix
        ATCG   ATCT   AGGC   GCAA
ATCG    --
ATCT    0.25   --
AGGC    0.75   0.75   --
GCAA    1      1      1      --
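
The identity-to-distance conversion can be sketched as follows. This is a simplified illustration, not ClustalX's code: it treats identity as a fraction (a percentage divided by 100), so the distance is 1 minus matches over residue pairs compared, and it assumes the two rows are already aligned (here, equal-length sequences compared position by position) rather than running the dynamic programming step. The names `identity_distance` and `distance_matrix` are hypothetical.

```python
def identity_distance(row_a, row_b, blank="-"):
    """1 - (matches / residue pairs compared), skipping gapped columns."""
    compared = [(a, b) for a, b in zip(row_a, row_b)
                if a != blank and b != blank]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally distant
    matches = sum(a == b for a, b in compared)
    return 1.0 - matches / len(compared)


def distance_matrix(rows):
    """Symmetric n by n matrix of pairwise identity distances."""
    n = len(rows)
    return [[0.0 if i == j else identity_distance(rows[i], rows[j])
             for j in range(n)] for i in range(n)]


for line in distance_matrix(["ATCG", "ATCT", "AGGC", "GCAA"]):
    print(line)
```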

Calculating a Guide Tree
Use the Neighbor-Joining method to group sequences
  – Results in an unrooted tree
  – Branch lengths proportional to estimated divergence
“Mid-point” method used to determine root
  – Means of the branch lengths to each side of the root are equal (or approximately equal)
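
As a rough illustration of the grouping step, here is a compact neighbor-joining sketch (neighbor-joining is the method Clustal documents for its guide trees). It is a textbook version written for this transcript, not ClustalX's implementation: the function name, the `(a,b)` naming of internal nodes, and the return format are all assumptions, and mid-point rooting is left out.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    Returns (joins, last_edge): each join is
    (child_a, length_a, child_b, length_b, new_node), and last_edge
    connects the two nodes that remain at the end.
    """
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Choose the pair that minimises the neighbor-joining Q criterion.
        a, b = min(((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
                   key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]])
        len_a = d[a, b] / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        new = f"({a},{b})"
        # Distances from the new internal node to every other node.
        for c in nodes:
            if c not in (a, b):
                d[new, c] = d[c, new] = (d[a, c] + d[b, c] - d[a, b]) / 2
        d[new, new] = 0.0
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, len_a, b, len_b, new))
    a, b = nodes
    return joins, (a, b, d[a, b])


names = ["ATCG", "ATCT", "AGGC", "GCAA"]
dist = [[0.0, 0.25, 0.75, 1.0],
        [0.25, 0.0, 0.75, 1.0],
        [0.75, 0.75, 0.0, 1.0],
        [1.0, 1.0, 1.0, 0.0]]
for join in neighbor_joining(names, dist)[0]:
    print(join)
```

The order in which nodes are joined gives the branching order that progressive alignment later follows.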

Calculating a Guide Tree
[Figure: guide tree with nodes labeled ATCG, ATCT, ATCG, AGGC, AGCC, GCAA, AGAA; values listed for ATCG, ATCT, AGGC, and GCAA (GCAA = 1)]

Calculating a Guide Tree
[Figure: guide tree with nodes labeled ATCG, ATCT, ATCG, AGGC, AGCC, GCAA, AGAA; values listed for ATCG, ATCT, AGGC, and GCAA]

Progressive Alignment
Perform a series of pairwise alignments
  – Slowly align larger and larger groups of sequences
Follow the branching order of the tree
  – From leaves to root
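
One progressive step, aligning two groups of already-aligned sequences, can be sketched with ordinary two-dimensional dynamic programming over columns. This is a simplified stand-in for what ClustalX does: the cost scheme (1 for a mismatch, 2 for a letter against a blank, summed over all cross-group pairs), the name `align_profiles`, and the hand-written merge order in the usage lines are assumptions for illustration.

```python
from itertools import product


def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed costs: 0 for equal symbols (including blank vs blank)."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def align_profiles(rows_a, rows_b):
    """Globally align two groups of already-aligned rows."""
    la, lb = len(rows_a[0]), len(rows_b[0])
    cols_a = ["".join(r[i] for r in rows_a) for i in range(la)]
    cols_b = ["".join(r[j] for r in rows_b) for j in range(lb)]
    gap_a, gap_b = "-" * len(rows_a), "-" * len(rows_b)

    def cost(col_x, col_y):
        # Sum over every pair with one symbol from each group.
        return sum(pair_cost(x, y) for x, y in product(col_x, col_y))

    # score[i][j]: cheapest alignment of the first i and first j columns.
    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + cost(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + cost(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = min(
                score[i - 1][j - 1] + cost(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + cost(cols_a[i - 1], gap_b),
                score[i][j - 1] + cost(gap_a, cols_b[j - 1]))

    # Trace back from the far corner, rebuilding merged columns.
    merged, i, j = [], la, lb
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + cost(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + cost(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(col[k] for col in merged)
            for k in range(len(rows_a) + len(rows_b))]


# Leaves first, then larger groups (merge order chosen by hand here).
group = align_profiles(["ATCG"], ["ATCT"])
group = align_profiles(group, ["AGGC"])
print(group)  # ['ATCG', 'ATCT', 'AGGC']
```

Gaps placed in an earlier step are never reopened in a later one, which is what makes the procedure fast but also dependent on the merge order.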

Progressive Alignment
[Figure: guide tree with nodes labeled ATCG, ATCT, ATCG, AGCC, AGGC, GCAA, AGAA]

Alignment Costs
                    Traditional (SP)   Tree-Alignment   Star-Alignment
Input seq           A, A, A, C, C      A, A, A, C, C    A, A, A, C, C
Reconstructed seq   --                 A, A, C          A
Mismatches          6                  1                2

Alignment Inconsistencies
Different definitions of multiple alignments can yield different optimal alignments
Optimal tree-alignments minimize the number of mutations from theorized common ancestors
SP-alignments maximize the number of positions where aligned sequences agree
  – Sometimes makes more biological sense since certain regions of proteins are more likely to mutate

Alignment Inconsistencies
Ex: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null
Sequences: ACC, ACC, TCT, ATCT
Traditional (SP)
  Input sequences:
    - A C C
    - A C C
    - T C T
    A T C T
  Reconstructed sequences: --
Star-Alignment
  Input sequences:
    A C C -
    A C C -
    T C T -
    A T C T
  Reconstructed sequences:
    A C C -
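
The two alignments in this example can be scored under both definitions with a few lines of code. This sketch uses the stated costs (1 for two different letters, 2 for a letter against a null) and adds two assumptions of its own: null against null costs 0, and the star cost picks the cheapest centre symbol column by column. The names `sp_cost` and `star_cost` are hypothetical.

```python
from itertools import combinations


def pair_cost(a, b, mismatch=1, gap=2):
    """Costs from the example; null vs null is assumed to cost 0."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def sp_cost(rows):
    """Sum-of-pairs cost: every pair of rows, every column."""
    return sum(pair_cost(a, b)
               for column in zip(*rows)
               for a, b in combinations(column, 2))


def star_cost(rows, alphabet="ACGT-"):
    """Cost against a single reconstructed centre sequence, choosing
    the cheapest centre symbol independently in each column."""
    return sum(min(sum(pair_cost(c, x) for x in column) for c in alphabet)
               for column in zip(*rows))


sp_alignment = ["-ACC", "-ACC", "-TCT", "ATCT"]
star_alignment = ["ACC-", "ACC-", "TCT-", "ATCT"]
print(sp_cost(sp_alignment), sp_cost(star_alignment))      # 14 15
print(star_cost(sp_alignment), star_cost(star_alignment))  # 6 5
```

Of these two alignments, the first is cheaper under the SP cost and the second is cheaper under the star cost, so the two definitions rank them in opposite order.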

ClustalX Demo
Multiple sequence alignment program
For more information on ClustalX
  – stalx.htm