

Probabilistic Global Alignment of DNA Under Multiple Conservation Models

Chuong B. Do, Michael Brudno, Serafim Batzoglou
Department of Computer Science, Stanford University, California, USA
{chuongdo, brudno,

1 Why another aligner?

Comparative analysis of genomic sequences from different organisms has long been considered a powerful method for inferring biological function. Most standard global nucleotide aligners use some variation of, or approximation to, the basic Needleman-Wunsch algorithm for modeling evolutionary changes. When organisms have sufficiently diverged, however, simple DNA-level conservation sometimes provides too weak a signal for reliable comparisons, leading to inaccurate alignments. Better results for distantly related species can be achieved by combining several different alignment models to account for the local characteristics of regions of sequences during alignment. We present LAGAN2, a multi-state probabilistic extension of the LAGAN alignment method for pairwise sequence comparison.

2 LAGAN alignment in a nutshell

Given a pair of sequences X and Y:
Step One: Generate local alignments between X and Y.
Step Two: Construct rough global map by O(n log n) chaining.
Step Three: Compute global alignment restricted to areas surrounding chained alignments.

Multiple Conservation Models

[Figure: finite-state machine with states M, G_X, G_Y. KEY: M = match; G_X = gap in sequence X; G_Y = gap in sequence Y.]

[Figure: multi-state model. Panels: CONSERVED NONCODING, CONSERVED CODING FORWARD, CONSERVED CODING REVERSE, NONCONSERVED. States: CC-M, CC-G_X, CC-G_Y (once per coding strand), CN-M, CN-G_X, CN-G_Y, CN-G_X', CN-G_Y', N-M/G.]

Traditional dynamic programming-based alignment methods (such as LAGAN) maximize the score for aligning sequences under a scoring matrix for nucleotide matches and an affine gap penalty, as depicted in the finite-state machine above. Such techniques fail to exploit biological signals that provide stronger evidence for alignment, but the computational cost of more sophisticated, biologically motivated alignment models has generally been prohibitive for sequences longer than 1 megabase. In LAGAN2, we introduce an extended model for more accurate nucleotide-level pairwise alignment of such long genomic DNA.

Amino acid level conservation

Met Glu Val Leu Phe Tyr Ser Asp
ATG GAG GTG CTG TTC TAT TCA GAT
ATG GAA GTC CTC --- TAC AGC GAC
Met Glu Val Leu     Tyr Ser Asp

While the correct alignment is clear in the translated sequences, low nucleotide similarity (see serine) poses a challenge for traditional aligners. Also, note how the final TC of the leucine codon CTC is kept in place, followed by an in-frame gap, whereas a regular aligner would move the TC to the right, generating an out-of-frame gap.

When α is high and β is low: The few large gaps introduced correctly align long conserved stretches, while small indels are missed.
When α is low and β is high: Small insertions/deletions are modeled correctly, but overall alignment quality drops as larger areas of conservation are broken into small nonorthologous regions of misleadingly high nucleotide identity.

Approximate logarithmic gap penalty

Affine gap functions are linear in gap length, γ(x) = αx + β. Logarithmic gaps handle both problems by penalizing small indels linearly but reducing the additional penalty as gaps grow larger. While true logarithmic gap penalties increase alignment runtime by a factor of the log of the length of the larger sequence, we may approximate them using multiple affine penalties.
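As a rough illustration of approximating a logarithmic gap penalty with multiple affine penalties, the sketch below takes the minimum over several affine functions γ_i(x) = α_i·x + β_i. The log-shaped target a·ln(x) + b, the constants a and b, the anchor lengths, and the tangent-line construction are all assumptions made for this example; none of them are parameters or methods taken from the poster.

```python
import math

# Hypothetical constants for a log-shaped target penalty a*ln(x) + b.
A, B = 4.0, 10.0


def log_gap(x, a=A, b=B):
    """Target logarithmic penalty for a gap of length x >= 1."""
    return a * math.log(x) + b


def tangent_pieces(anchors, a=A, b=B):
    """Return (alpha, beta) pairs, one affine penalty per anchor length.

    Each pair is the tangent line to a*ln(x) + b at x0:
    slope a/x0, intercept a*ln(x0) + b - a.
    """
    return [(a / x0, a * math.log(x0) + b - a) for x0 in anchors]


def multi_affine_gap(x, pieces):
    """Minimum over several affine penalties alpha*x + beta.

    The minimum of affine functions is concave and piecewise linear:
    short gaps are charged by the steep piece, long gaps by the flat ones.
    """
    return min(alpha * x + beta for alpha, beta in pieces)


# Three affine pieces anchored at (assumed) gap lengths 1, 10 and 100.
PIECES = tangent_pieces([1, 10, 100])

if __name__ == "__main__":
    for length in (1, 3, 10, 30, 100, 300):
        print(length, round(log_gap(length), 2),
              round(multi_affine_gap(length, PIECES), 2))
```

Because each piece is a tangent to a concave curve, the minimum never falls below the target and touches it at the anchor lengths. In a dynamic program, each affine piece would typically get its own gap state; this is one plausible reading of the separate short-gap and long-gap states in the model key.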
KEY
N-M/G: nonconserved match/gap
CN-M: conserved noncoding match
CN-G_X: conserved noncoding short gap in X
CN-G_Y: conserved noncoding short gap in Y
CN-G_X': conserved noncoding long gap in X
CN-G_Y': conserved noncoding long gap in Y
CC-M: conserved coding forward strand match
CC-G_X: conserved coding forward strand gap in X
CC-G_Y: conserved coding forward strand gap in Y
CC-M: conserved coding reverse strand match
CC-G_X: conserved coding reverse strand gap in X
CC-G_Y: conserved coding reverse strand gap in Y

3 Classical alignment models

[Figure: gap penalty γ(x) plotted against gap length x.]

The Nonconserved State

The nonconserved state allows us to model regions in which the low or nonexistent conservation fails to produce a meaningful alignment; with such a state, we can prevent overprediction as well as suggest boundaries for local conservation.

5 Results

The systems were tested on sequence from the cystic fibrosis transmembrane conductance regulator region for 12 species. Pairwise alignments were performed between the human sequence and each of the other 11 species, including baboon, cat, chicken, chimp, cow, dog, fugu, pig, rat, and zebrafish. Representative species are used to illustrate the distribution of predicted states per nucleotide. Despite the low state frequencies for conserved coding states, exon alignment accuracy remained high in all species as measured by percentage of exon length covered in the alignments.

Table 1. Exon coverage and nucleotide-level prediction accuracy

          | Coverage                                                    | Prediction Accuracy
Organism  | < 70% Correct | 70%+ Correct | 90%+ Correct | 100% Correct | Exon Sensitivity | Exon Specificity
Fugu      | 5%            | 99%          | 87%          | 81%          | 86.7%            | 63.5%
Zebrafish | 0%            | 100%         | 98%          | 94%          | 97.6%            | 57.4%
Chicken   | 0%            | 100%         |              | 99%          | 88.1%            | 79.6%

Table 2. Representative State Distributions

Organism | Nonconserved | Conserved Noncoding | Conserved Coding
Fugu     | 99.081%      | 0.053%              | 0.867%
Chicken  | 99.080%      | 0.553%              | 0.368%
Mouse    | 59.834%      | 39.321%             | 0.845%

6 References

Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative Sequencing Program, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Research, 2003 Apr; 13(4):
Kent WJ, Zahler AM. Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment. Genome Research, 2000 Aug; 10(8):
Miller W, Myers EW. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, (2):

Thanks to Eugene Davydov and Marina Sirota for useful conversations.