Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University.

Stringology 2004, CRI, Haifa

Outline of Talk
1. Sequence composition and composition match
2. Composition alignment algorithm
3. Composition match scoring functions
4. Growth of local composition alignment scores
5. Limiting the length of a composition match
6. Biological examples

Goal Identify features in DNA sequences that are not accurately described by position-specific patterns. A position-specific pattern, P, has the form P = p_1 p_2 p_3 … p_k, where each p_i is either a single specific character or a choice (weighted or unweighted) of characters. In DNA there are features that are characterized by composition rather than by position-specific patterns.

Sequence Composition Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then C(S) = (f_{σ_1}, f_{σ_2}, f_{σ_3}, …, f_{σ_|Σ|}) is the composition of S, where f_{σ_i} is the fraction of the characters in S that are σ_i.

Composition Example S = ACTGTACCTGGCGCTATT, C(S) = (0.17, 0.28, 0.22, 0.33), with components in the order (A, C, G, T). Note that the order of letters in S is irrelevant, as it has no effect on the composition.
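The composition vector in this example can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `composition` and the fixed alphabet order "ACGT" are assumptions made for illustration, and the string is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter.

    Hypothetical helper; assumes s is non-empty.
    """
    counts = Counter(s)
    return tuple(counts[a] / len(s) for a in alphabet)

comp = composition("ACTGTACCTGGCGCTATT")
# Components are in the order (A, C, G, T), rounded to two decimals.
print(tuple(round(f, 2) for f in comp))
```

Because only letter counts enter the computation, any permutation of the string gives the same vector.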

Composition and Sequence Features
- Isochores: multi-megabase regions, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.
- CpG Islands: several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.
- Protein binding regions: tens of nucleotides. Dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.

Composition Match We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching. Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T). For example, S and T below have a composition match:
S = ACTGTACCTGGCGCTATT
T = AAACCCCCGGGGTTTTTT
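The definition of a composition match can be checked directly. The sketch below compares letter counts rather than fractions, which is equivalent when the lengths are equal and avoids floating-point comparison; the name `is_composition_match` is a hypothetical helper, not something defined in the talk.

```python
from collections import Counter

def is_composition_match(s, t):
    """True if s and t have equal length and the same letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(is_composition_match(S, T))
```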

Composition Alignment Problem
Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.
Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.

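Example of composition alignment
S = AACGTCTTTGAGCTC
T = AGCCTGACTGCCTA
Alignment:
AACGTCTTTGAGCTC
AGCCTGACT-GCCTA
In this alignment, the aligned substrings GTC/CTG (columns 4 to 6) and AGCTC/GCCTA (columns 11 to 15) have equal composition.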

Related Work
- Alignment allowing adjacent letter swap: O(nm), Lowrance and Wagner (1975).
- All swapped matchings of a pattern in a text: O(n m^{1/3} log m log |Σ|), Amir, Aumann, Landau, Lewenstein, Lewenstein (2000); O(n log m log |Σ|), Amir, Cole, Hariharan, Lewenstein, Porat (2001).
- Composition naming: O(n log m log |Σ|), Amir, Apostolico, Landau, Satta (2003).

Composition Alignment using Dynamic Programming Given two sequences, S and T, the best alignment of the prefix strings S[1, i] = s_1 … s_i and T[1, j] = t_1 … t_j ends in one of four ways:
1. mismatch,
2. insertion,
3. deletion, or
4. composition match

Ways an Alignment Can End
- mismatch: S: C G T over T: C G A
- insertion or deletion: S: C A T over T: C A -, or S: C A - over T: C A A
- composition match: X: C G T A C over Y: C G C T A

Ways an Alignment Can End (continued) Note that a composition match ending at (i, j) aligns suffixes of S[1, i] and T[1, j] of equal length l, where 1 ≤ l ≤ min(i, j, limit).

Time Complexity Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is O(nmZ) where Z is the time required per (i, j) pair to find the best length l for the composition match.
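The dynamic program described above can be sketched as follows. This is an illustrative global-alignment version under stated assumptions, not the author's implementation: the parameter names and default values (`match`, `mismatch`, `gap`, `c`, `limit`) are hypothetical, the composition match score is taken to be cm(k) = ck, and each cell simply tries every suffix length l up to `limit`, so Z is proportional to `limit` here.

```python
def composition_alignment_score(S, T, match=1, mismatch=-1, gap=-1, c=1.0, limit=3):
    """Global composition alignment score (sketch; all parameters are assumptions)."""
    m, n = len(S), len(T)
    NEG = float("-inf")
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:  # single character match or mismatch
                sub = match if S[i - 1] == T[j - 1] else mismatch
                best = max(best, D[i - 1][j - 1] + sub)
            if i > 0:  # character of S aligned to a gap
                best = max(best, D[i - 1][j] + gap)
            if j > 0:  # character of T aligned to a gap
                best = max(best, D[i][j - 1] + gap)
            # Composition match of suffixes of length l, 1 <= l <= min(i, j, limit).
            # diff holds (count in S suffix) - (count in T suffix) per letter;
            # nonzero counts the letters whose difference is not zero.
            diff, nonzero = {}, 0
            for l in range(1, min(i, j, limit) + 1):
                for ch, delta in ((S[i - l], 1), (T[j - l], -1)):
                    old = diff.get(ch, 0)
                    new = old + delta
                    diff[ch] = new
                    nonzero += (new != 0) - (old != 0)
                if nonzero == 0:  # the two suffixes have equal composition
                    best = max(best, D[i - l][j - l] + c * l)
            D[i][j] = best
    return D[m][n]

print(composition_alignment_score("AC", "CA", limit=2))  # composition match allowed
print(composition_alignment_score("AC", "CA", limit=0))  # standard alignment only
```

Setting `limit=0` disables composition matches, which gives a convenient baseline for comparing against standard alignment with the same parameters.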

Computing length of the shortest composition match Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.

Shortest suffix match length For example, let
S = AACGTCTTTGAGCT
T = AGCCTGACTGCCTA
The table (columns: k, shortest suffix match length) states that for k = 6, the shortest suffixes which have a composition match have length 3:
S = AACGTC...
T = AGCCTG...

Composition difference We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y: CD(x, y) = (c_{σ_1}, …, c_{σ_|Σ|}), where c_{σ_i} is the difference between the number of times σ_i occurs in x and in y.
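As a small illustration of this definition, the sketch below computes the composition difference of two strings. The function name `composition_difference` and the alphabet order "ACGT" are assumptions for the example; each component is the count in x minus the count in y.

```python
def composition_difference(x, y, alphabet="ACGT"):
    """Per-letter count in x minus count in y (hypothetical helper)."""
    return tuple(x.count(a) - y.count(a) for a in alphabet)

# Prefixes of length 6 from the example strings S and T.
print(composition_difference("AACGTC", "AGCCTG"))
# Their length-3 suffixes, which form a composition match.
print(composition_difference("GTC", "CTG"))
```

Two equal-length strings have a composition match exactly when their composition difference is the zero vector.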

Using composition difference Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k – g.

Sorting to find shortest composition matches Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
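The sorting idea can be sketched for one pair of equal-length strings as follows. This is an illustration under assumptions, not the talk's code: the function name `shortest_suffix_matches` is hypothetical, the alphabet is fixed to "ACGT", and the result is a dictionary that maps each prefix length k to its shortest matching suffix length, omitting any k that has no match.

```python
def shortest_suffix_matches(S, T, alphabet="ACGT"):
    """Map prefix length k to the shortest composition-matching suffix length."""
    if len(S) != len(T):
        raise ValueError("S and T must have equal length")
    index = {a: p for p, a in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    tuples = [(tuple(diff), 0)]  # composition difference of the empty prefixes
    for k, (s, t) in enumerate(zip(S, T), start=1):
        diff[index[s]] += 1
        diff[index[t]] -= 1
        tuples.append((tuple(diff), k))
    # Python's sort is stable, so equal differences stay in increasing k order.
    tuples.sort(key=lambda item: item[0])
    shortest = {}
    for (cd_g, g), (cd_k, k) in zip(tuples, tuples[1:]):
        if cd_g == cd_k:  # adjacent equal differences give a match of length k - g
            shortest[k] = k - g
    return shortest

print(shortest_suffix_matches("AACGTCTTTGAGCT", "AGCCTGACTGCCTA"))
```

For the example strings this reports length 3 at k = 6, in agreement with the value quoted from the table.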

Time complexity for composition matches O(nm|Σ|) time suffices to find the shortest composition match lengths for all index pairs of two strings of lengths n and m. In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.

Composition match scoring functions We have explored:
Functions based on match length, k:
Function 1: cm(k) = ck
Function 2: cm(k) = c√k
where c is a constant.
Functions based on substring composition:
Function 4: cm(C, B, k) = ck · H(C, B)
where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
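The three scoring functions can be written out as below. This is a sketch under assumptions: the function names are hypothetical, relative entropy is taken to be the sum over letters of C_i · log2(C_i / B_i) with base-2 logarithms (the slide does not state the base), and every background frequency in B is assumed to be positive.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """Relative entropy of composition C against background B (log base 2 assumed)."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
gc_only = (0.0, 0.5, 0.5, 0.0)  # composition in (A, C, G, T) order
print(cm_linear(4), cm_sqrt(4), cm_entropy(gc_only, uniform, 4))
```

Under this definition a match whose composition equals the background scores zero with Function 4, while a match with skewed composition scores higher.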

Additive and subadditive scoring functions The functions based on length are additive or subadditive: cm(i + j) ≤ cm(i) + cm(j) Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches. Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
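The additive and subadditive properties of the two length-based functions can be verified in a line each. The short derivation below assumes c ≥ 0 and match lengths i, j ≥ 1; that assumption is ours and is not stated on the slide.

```latex
\begin{align*}
\text{Function 1: } & cm(i+j) = c(i+j) = ci + cj = cm(i) + cm(j) \\
\text{Function 2: } & (\sqrt{i} + \sqrt{j})^2 = i + j + 2\sqrt{ij} \ge i + j
  \;\Longrightarrow\;
  cm(i+j) = c\sqrt{i+j} \le c\sqrt{i} + c\sqrt{j} = cm(i) + cm(j)
\end{align*}
```

So Function 1 is additive and Function 2 is subadditive, which is the condition the lemma requires.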

The limit parameter Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful. The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.
Table (sequence length = 100, randomly generated DNA, all letters p = 0.25): columns for limit = 1, 2, 5, 10.

Growth of local alignment score Function 1

Global score as a predictor of local parameter suitability: Function 1

Growth of local alignment score Function 2

Global score as a predictor of local parameter suitability: Function 2

Limit values for DNA
Function 1: cm(k) = ck: Limit ≤ 3.
Function 2: cm(k) = c√k: Limit ≤ 10.
Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Biological examples Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site. Two local alignment scores were produced using function 1, W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.

Example 1 Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.
Composition Alignment:
GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
||||<>|<>||<>| ||||<>||<> | |||||| <>|<> ||||<><> |<>| || ||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC
Standard Alignment: CGCCGCCGCCGCGCCGCCGCCG

Example 2 Composition alignment of two promoter sequences. Composition changes at vertical line. Compositions in the order (A, C, G, T): Left: (0.01, 0.61, 0.30, 0.08), Right: (0.19, 0.16, 0.56, 0.09).
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
|<><>|||| <>|||||| || |<>||||| <>|||| |||| || || | |<><>| | |<>|<>|<>|||| |
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG

Conclusion We:
- define a new alignment problem based on composition matching and test several scoring functions
- show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet
- show that alignment using scoring functions based only on length requires finding only shortest composition matches
- give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments