MUMmer 游騰楷杜海倫王慧芬曾俊雄 2007/01/02. Outlines Suffix Tree MUMmer 1.0 MUMmer 2.1 MUMmer 3.0 Conclusion.

Presentation transcript:

MUMmer. 游騰楷, 杜海倫, 王慧芬, 曾俊雄. 2007/01/02

Outline: Suffix Tree, MUMmer 1.0, MUMmer 2.1, MUMmer 3.0, Conclusion

Suffix Tree 游騰楷

Trie A trie, also called a prefix tree, is a tree that stores a set of strings; it works like a dictionary of those strings. Advantages compared to a BST: –Search time. –Space. –Prefix matching. –Balance.
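To make the definition concrete, here is a minimal sketch of a trie built from nested dictionaries. The class name, the method names and the "$" end-of-word marker are illustrative assumptions, not something taken from the slides.

```python
class Trie:
    """Minimal trie: every node is a dict mapping a character to a child node."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker; assumes "$" never occurs in a word

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            if ch not in node:
                return None
            node = node[ch]
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and "$" in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["aeef", "ad", "bbfe", "bbfg", "c"]:
    trie.insert(w)
print(trie.contains("ad"), trie.contains("bbf"), trie.has_prefix("bbf"))
```

Lookup cost depends only on the length of the key, not on how many strings are stored, which is the "search time" advantage listed on the slide.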

Suffix Trie and Suffix Tree A suffix tree, Tree(T), is a compact trie that represents all the suffixes of a string T.

Suffix Trie and Suffix Tree (cont.) [Figure: the suffix trie and the suffix tree of abaab, side by side.]

Suffix Trie and Suffix Tree (cont.) Let |T| = n. A suffix trie needs O(n^2) space. –Consider T = a^{n/2} b^{n/2}. A suffix tree needs O(n) space. –Since every suffix can cause only one branch, the total number of nodes (and edges) cannot exceed O(n). –In a node we record only the starting position of the corresponding suffix. –In an edge we need not record the whole substring, only its starting and ending positions.
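The quadratic worst case can be checked numerically. The sketch below builds the suffix trie naively (nested dictionaries, a hypothetical helper written for this illustration) and counts its nodes for T = a^{n/2} b^{n/2}; for this family the count works out to (n/2 + 1)^2.

```python
def suffix_trie_node_count(text):
    """Insert every suffix of text into a trie; return the node count (root included)."""
    root = {}
    count = 1  # the root
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


for half in (2, 4, 8, 16):
    text = "a" * half + "b" * half
    print(len(text), suffix_trie_node_count(text))
```

Doubling n roughly quadruples the node count, whereas a suffix tree of the same string has at most about 2n nodes.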

Suffix Trie and Suffix Tree (cont.) [Figure: suffix tree of abaab (suffixes 1:abaab, 2:baab, 3:aab, 4:ab, 5:b) and of abaab$, with edges labelled by (start, end) position pairs such as (1,1), (2,2), (2,5), (3,5), (4,5).]

Suffix Tree: Full Text Index P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
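The equivalence can be sketched with an uncompressed suffix trie standing in for Tree(T): the pattern occurs exactly when its characters can be followed from the root. The function names are illustrative, and a real index would use the compact suffix tree rather than this quadratic-space trie.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (quadratic space; for illustration only)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """True if a path spelling pattern exists from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


index = build_suffix_trie("abaab")
print(occurs(index, "baa"), occurs(index, "bb"))
```

The search takes time proportional to the length of P, independent of the length of T.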

Linear Time Construction 1973: Weiner gave the first linear-time algorithm. 1976: McCreight gave a simpler, more space-efficient algorithm. Neither works on-line from left to right. 1992: Ukkonen gave a left-to-right on-line algorithm.

On-line Construction of Suffix Trie With O(n^2) time available, this is quite easy. We construct Trie_i from Trie_{i-1}. We need to record the current end points of every suffix.
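A minimal sketch of this quadratic on-line construction, assuming nested dictionaries for nodes: the list `ends` holds the current end point of every suffix, and reading one more character extends each of them by one step. The names are illustrative.

```python
def online_suffix_trie(text):
    """Build the suffix trie one character at a time (Trie_i from Trie_{i-1})."""
    root = {}
    ends = []  # current end node of every suffix started so far
    for ch in text:
        ends.append(root)  # the suffix starting at this character begins at the root
        # extend every suffix by ch, creating a child only where none exists yet
        ends = [node.setdefault(ch, {}) for node in ends]
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())


print(count_nodes(online_suffix_trie("abaaba")))
```

Each character touches every active end point, which is where the O(n^2) total comes from; the linear-time algorithm avoids visiting end points that can no longer branch.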

[Figure: step-by-step on-line construction of Trie(abaaba).]

Two Lemmas Call the current end point of suffix i C_i. 1. Once some C_i branches out, it never branches again. 2. If C_i does not branch, then C_j does not branch either for any j > i.

Ideas to Find Suffix Tree Thus –Once a suffix has branched, we never consider it again. –Branches happen in order, from suffixes with small indices to suffixes with large indices. Each suffix goes through 3 phases: 1. Going along the original tree. 2. Branching a new leaf. 3. Growing the leaf.

Ideas to Find Suffix Tree (cont.) 3 phases: 1. Going along the original tree. 2. Branching a new leaf. 3. Growing the leaf. We record only the longest suffix that has not branched yet. When it branches, we look for the next suffix. To do this, we need suffix links.

Suffix Links A suffix link is a pointer from an internal node xS to another internal node S, where x ∈ Σ and S ∈ Σ*. [Figure: suffix tree of "abaabab$" with leaves abaabab$, baabab$, aabab$, abab$, bab$, ab$, b$, $.]

Using Suffix Link to Find Common Substring (1) [Figure: suffix tree of abaabab$.] Consider abaaaba

Using Suffix Link to Find Common Substring (2) [Figure: suffix tree of abaabab$.] Query abaaaba, current match: abaa

Using Suffix Link to Find Common Substring (3) [Figure: suffix tree of abaabab$.] Query abaaaba, current match: baa

Using Suffix Link to Find Common Substring (4) [Figure: suffix tree of abaabab$.] Query abaaaba, current match: aa

Using Suffix Link to Find Common Substring (5) [Figure: suffix tree of abaabab$.] Query abaaaba

Using Suffix Link to Find Common Substring (6) [Figure: suffix tree of abaabab$.] Query abaaaba, current match: aaba

Using Suffix Link to Find Common Substring (7) [Figure: suffix tree of abaabab$.] Query abaaaba, remaining matches: aba, ba, a
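The walk-through above computes, for every position of the query, the longest substring starting there that also occurs in the text. The sketch below reproduces those lengths without a suffix tree: it uses Python's substring test in place of tree traversal, and keeps only the key observation that suffix links exploit, namely that dropping the first character of a match of length L leaves a match of length at least L - 1. The function name is an assumption, and this stand-in is not linear time.

```python
def matching_statistics(text, query):
    """For each i, the length of the longest prefix of query[i:] that occurs in text."""
    lengths = []
    carry = 0  # length already known to match, inherited from the previous position
    for i in range(len(query)):
        length = carry
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
        carry = max(length - 1, 0)  # what a suffix link would preserve
    return lengths


ms = matching_statistics("abaabab", "abaaaba")
print(ms, max(ms))
```

The maximum of these lengths is the length of the longest common substring; for this example it is 4 (abaa and aaba).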

Construct Suffix Tree As in matching, we use suffix links to find the next suffix.

Suffix Tree(abaabab$) [Figure: construction steps for the prefixes a, ab, aba, abaa, abaab and abaaba; edges are labelled with (start, end) pairs such as "1,-", "2,-", "1,1" and "4,-", and the steps are annotated "1 step" or "2 step".]

[Figure: construction continues with abaabab. Annotations: "Branch"; "Use suffix link to find this, and go down by 'bab'"; "Nearest ancestor internal node". New edges "2,3" and "7,-" appear as abab, bab and ab branch.]

Suffix Tree(abaabab$) [Figure: the final tree, with edges "1,1", "2,2", "3,3", "4,-", "7,-", "8,-" and leaves abaabab$, baabab$, aabab$, abab$, bab$, ab$, b$, $.]

Time Complexity For phase 3, we do nothing. Phase 1 takes O(n) time in total. In phase 2, we branch O(n) times. The only bottleneck is how fast we can find the next position.

Time Complexity (cont.) Suppose we branch at the red circle, which is at distance t from its parent. The walk to the next suffix then passes through at most t internal nodes. By amortized analysis, the total number of internal nodes the algorithm passes through is O(n).

Time Complexity (cont.) Thus, constructing suffix tree needs O(n) time and O(n) space.

Applications Longest common substring. Repeats of a pattern in a string. Approximate matching. Etc.

Suffix Array Suffixes of hattivatti: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε. Sorted: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti. Suffix array of hattivatti: (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6). A pattern such as "att" is located by binary search. The suffix array can be constructed from the suffix tree in linear time (DFS).
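The example can be reproduced with a short sketch. It builds the array by plain sorting (not the linear-time DFS route mentioned on the slide) and uses 1-based positions with the empty suffix included, to match the numbers above. Function names are illustrative.

```python
def suffix_array(text):
    """1-based start positions of all suffixes, empty suffix included, in sorted order."""
    n = len(text)
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])


def prefix_range(text, sa, pattern):
    """Binary search for the half-open range of sa whose suffixes start with pattern."""
    def head(k):  # the suffix at rank k, cut to the pattern length
        return text[sa[k] - 1:][:len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose head is >= pattern
        mid = (lo + hi) // 2
        if head(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first rank whose head is > pattern
        mid = (lo + hi) // 2
        if head(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


text = "hattivatti"
sa = suffix_array(text)
first, last = prefix_range(text, sa, "att")
print(sa, sa[first:last])
```

The positions returned for "att" are 7 and 2, the two places where the pattern occurs in hattivatti.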

Special Thanks to Esko Ukkonen, Hsueh-I Lu, Wikipedia. Some figures in these slides are based on their slides.

MUMmer 1.0 – Alignment of whole genomes D 杜海倫 2007/01/02

About The System - MUMmer For rapidly aligning whole genome sequences –Assumption: the sequences are closely related –Output: alignment of the input sequences, highlighting the exact differences in the genomes (SNPs, insertions, significant repeats, tandem repeats, reversals) –Main ideas: suffix tree, longest increasing subsequence (LIS), Smith-Waterman alignment

Alignment steps 1. Perform a maximal unique match (MUM) decomposition of the two genomes -> suffix tree. 2. Sort the MUMs and extract the longest possible set of matches that occur in the same order in both genomes -> LIS. 3. Close the gaps -> Smith-Waterman alignment. 4. Output!

Step 1 Suffix tree

Step 1 (cont’) Construct a suffix tree T for genome A. Add the suffixes of genome B (implementation: A + dummy character + B). Find unique matching sequences: an internal node with exactly two child nodes, such that the children are leaf nodes from different genomes. Find MUMs: for highly similar genomes, require MUM length >= 50bp; for more distantly related genomes, require MUM length >= 20bp.
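What counts as a MUM can be illustrated without a suffix tree. The brute-force sketch below (quadratic or worse, so only for tiny inputs, and not the suffix-tree method the slide describes) reports every match that cannot be extended to the left or right and whose text occurs exactly once in each sequence. The function names and the `min_len` parameter are assumptions for this example.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    count, start = 0, text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left, so not maximal
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right as far as possible
            word = a[i:i + length]
            if (length >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, length))
    return mums


print(find_mums("cgtcataaagt", "cgtcctaaagt", min_len=3))
```

On these two short sequences the MUMs are cgtc and taaagt, leaving a single-base gap between them.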

Step 2 Sort the MUMs and find the LIS: O(K log K), which is O(N) –K: the number of MUMs –K << N / log N
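The O(K log K) bound comes from the standard binary-search formulation of LIS. A minimal sketch, assuming each MUM is reduced to a pair (position in A, position in B) and all positions are distinct; the function names are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Indices of one longest strictly increasing subsequence, in O(K log K)."""
    tail_index = []  # tail_index[k]: index of the smallest tail of a run of length k + 1
    tail_value = []  # the corresponding values, kept sorted for bisect
    previous = [None] * len(values)
    for idx, value in enumerate(values):
        k = bisect_left(tail_value, value)
        if k == len(tail_value):
            tail_index.append(idx)
            tail_value.append(value)
        else:
            tail_index[k] = idx
            tail_value[k] = value
        previous[idx] = tail_index[k - 1] if k > 0 else None
    chain = []
    idx = tail_index[-1] if tail_index else None
    while idx is not None:  # walk the predecessor links backwards
        chain.append(idx)
        idx = previous[idx]
    return chain[::-1]


def chain_mums(mums):
    """Sort MUMs by position in A, then keep the longest chain that is also ordered in B."""
    ordered = sorted(mums)
    keep = longest_increasing_subsequence([b for _, b in ordered])
    return [ordered[k] for k in keep]


print(chain_mums([(0, 0), (10, 50), (20, 12), (30, 25), (40, 60)]))
```

In the usage line, the MUM at (10, 50) is out of order relative to its neighbours and is dropped from the chain.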

Step 3 Process each gap as one of four classes –SNP: Genome A: cgtcataaagt, Genome B: cgtcctaaagt –Insert (transposition or simple insertion): Genome A: cgtctaaagtggggaaaactctgg, Genome B: cgtctaaagt........ctctgg –Polymorphic region (should be aligned): Genome A: cgtctaaagtggggaaaactctgg, Genome B: cgtctaaagtatgacaggctctgg –Repeat: Genome A: aaggaaggaaggagct, Genome B: aaggaagg....agct

Result and Discussion Comparing two strains of tuberculosis –H37Rv and CDC1551 –>99% identical –Able to catalog all SNPs, all insertions of every length, and all tandem repeats with different copy numbers –Performance (DEC Alpha 4100): 5s for step 1, 45s for step 2, 5s for step 3
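Gaps that need real alignment are handed to Smith-Waterman. The sketch below computes only the best local alignment score with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not MUMmer's parameters.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b, computed row by row."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, hence the floor at 0
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("cgtcataaagt", "cgtcctaaagt"))
```

For the SNP example the best score aligns the two sequences end to end: ten matching bases and one mismatch.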

Result and Discussion (cont’)

Comparing two Mycoplasma genomes –M. genitalium (580074nt) and M. pneumoniae (226000nt) –Performance (DEC Alpha 4100): 6.5s for step 1, 0.02s for step 2, 116s for step 3 –FASTA: many hours

Result and Discussion (cont’) [Figure panels: FASTA, 25mers, MUMmer.]

Result and Discussion (cont’) Comparing human and mouse –222930bp of human chromosome 12 (accession no. U47924) and bp of mouse chromosome 6 (accession no. AC002397) – Performance -29s 1.6s for step1

Result and Discussion (cont’)

MUMmer 2.1 王慧芬

MUMmer 2.1 "Fast algorithms for large-scale genome alignment and comparison" by Delcher, Phillippy, Carlton and Salzberg, Nucleic Acids Research

Agenda Key in MUMmer 1 Improvements in MUMmer 2.1 Technical improvements in MUMmer 2.1 Application to DNA sequence alignment – Alignment of incomplete genomes

MUMmer 1 Key –Build a suffix tree containing the 2 input sequences –Find all maximal unique matches (MUMs) between them. MUM (Maximal Unique Match) –A substring that occurs exactly once in each input sequence, with the two copies matching exactly –Cannot be extended in either direction

Improvements in MUMmer 2.1 Faster and uses less memory, each by a factor of nearly three. Able to align DNA or protein sequences. Aligning the 4.7 Mb genome of E. coli and the 3.0 Mb large chromosome of V. cholerae:
          MUMmer 1        MUMmer 2
Time      74s (1GHz)      27s (1GHz)
Memory    293MB           100MB

Technical improvements in MUMmer 2.1 A reduction in amount of memory used to store suffix trees –Kurtz (1999) technique is used An alternative algorithm to find initial exact matches Cluster matches

Alternative to find initial exact matches MUMmer 1: –Built a suffix tree containing both input sequences. MUMmer 2: –Uses the Chang-Lawler (1994) method, which reduces running time. –Builds a suffix tree storing only one sequence (the reference). –The second sequence (the query) is streamed against the suffix tree. Memory usage is reduced by at least half, and once the suffix tree is built, arbitrarily long or multiple queries can be streamed against it.

Alternative to find initial exact matches (cont.) MUMmer 2 (cont.): –Identify where the query sequence would branch off from the tree, to find all matches. –Unique match: wherever a branch occurs at a tree position with just a single leaf beneath it. –Maximal match: suffix links are used to find the next (extended) match; by checking the character immediately preceding the start of a match, we can determine whether it is maximal. Finding all maximal matches takes time proportional to the length of the query.
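The two checks described here, uniqueness in the reference and the test on the preceding character, can be sketched by brute force. This is not the streaming suffix-tree method and is far slower than linear in the query length; the function name and `min_len` are assumptions for illustration.

```python
def reference_unique_maximal_matches(ref, qry, min_len=1):
    """Matches (ref_pos, qry_pos, length) that are maximal and unique in ref."""
    matches = []
    for j in range(len(qry)):
        length = 0
        # longest prefix of qry[j:] that occurs anywhere in ref
        while j + length < len(qry) and qry[j:j + length + 1] in ref:
            length += 1
        if length < min_len:
            continue
        word = qry[j:j + length]
        i = ref.find(word)
        if ref.find(word, i + 1) != -1:
            continue  # occurs more than once in the reference
        if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
            continue  # preceding characters agree, so the match is not maximal
        matches.append((i, j, length))
    return matches


print(reference_unique_maximal_matches("abaabab", "abaaaba", min_len=2))
```

Only the longest match starting at each query position needs to be tested: any shorter prefix of it also occurs at that longer match's reference position, so it is either not right-maximal or not unique in the reference.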

Unique match

Maximal match: suffix links are used to find the extended match

Cluster matches MUMmer 1: –Aligns 2 complete sequences, assuming no rearrangement. –That is, it finds a single longest alignment. MUMmer 2: –After matches are identified, the interval length between neighbouring matches is checked. –If the interval between two matches is shorter than a user-defined gap length, the matches are joined into a cluster.
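A simplified sketch of the clustering rule, assuming matches are (reference start, query start, length) triples and that a match joins the previous one when the interval between them is non-negative and below `max_gap` in both sequences. The actual MUMmer clustering criteria may differ; the names and the rule as coded here are illustrative.

```python
def cluster_matches(matches, max_gap):
    """Group matches whose intervals to the previous match are shorter than max_gap."""
    clusters = []
    for match in sorted(matches):
        if clusters:
            ref_start, qry_start, length = clusters[-1][-1]
            ref_gap = match[0] - (ref_start + length)
            qry_gap = match[1] - (qry_start + length)
            if 0 <= ref_gap < max_gap and 0 <= qry_gap < max_gap:
                clusters[-1].append(match)
                continue
        clusters.append([match])  # too far away (or out of order): start a new cluster
    return clusters


print(cluster_matches([(0, 0, 10), (15, 14, 8), (200, 300, 12), (215, 318, 5)], max_gap=10))
```

Because each cluster is handled separately, rearranged regions can end up in different clusters instead of being forced into one long alignment.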

Alignment of incomplete genomes DNA sequencing Shotgun sequencing Terms Finishing NUCmer (NUCleotide MUMmer)

DNA sequencing –The human genome is approx. 3 billion bases long. –Sequencing machine can generate sequences for fragments in bp long. –To read the DNA, the genome is broken up into many tiny pieces (reads), each of which is read individually. –After all pieces are read, they are assembled in the correct order.

Shotgun sequencing 1. Extract DNA. 2. Fragment DNA. 3. Clone DNA. 4. Sequence both ends of clones – bp each read. 5. Assemble: reads are assembled to reconstruct the genome. 6. Finish sequencing (close gaps).

Terms Raw sequence –Unassembled sequence reads Contig –Overlapping reads are joined into longer composite sequences, called contigs Finished sequence –Complete sequence of a genome with no gaps and an accuracy of >99.9% Full shotgun coverage –Genome coverage in random raw sequence required to produce finished sequence, 8-10 fold (8-10X)

Finishing The process of –determining the order and orientation of all the contigs, and –generating additional sequence to fill in all the gaps between them (closing all the gaps). The finishing phase is dispensed with in many projects. However, MUMmer 2.1 serves as the base of a program for the “finishing phase” that can –align multiple contigs to a completed reference genome, and –align one set of contigs to another set of contigs.

NUCmer (NUCleotide MUMmer) A multiple-contig alignment program built on MUMmer 2.1. 3 steps. Outputs.

NUCmer – 3 steps Step 1 –Input: two sets of DNA sequences in 2 multi-fasta files representing partial or complete assemblies. –Each DNA sequence is a contig. –Creates a map of all contig positions within each of the multi-fasta files. –Concatenates the contigs of each file separately. –Runs MUMmer to find all exact matches between the two genomes. –These matches are mapped back to the separate contigs.

NUCmer – 3 steps (cont.) Step 2: Clustering –MUMs (output of step 1) are clustered together if they are separated by no more than a user-defined distance. Step 3 –Run a modified Smith-Waterman DP alignment to align the sequence between MUMs (output of step 2). Result –Alignment of “every sequence contig in the 1st file” to “every sequence contig in the 2nd file” –Order, orientation, and percent identity and coverage of contigs
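The position map and the mapping back can be sketched as follows. The separator character "#", the contig names and the function names are assumptions made for this example; they are not NUCmer's internals.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="#"):
    """Join (name, sequence) pairs into one string; record where each contig starts."""
    starts, names, parts = [], [], []
    offset = 0
    for name, sequence in contigs:
        starts.append(offset)
        names.append(name)
        parts.append(sequence)
        offset += len(sequence) + len(separator)
    return separator.join(parts), starts, names


def map_back(position, starts, names):
    """Translate a position in the concatenation to (contig name, offset in contig)."""
    k = bisect_right(starts, position) - 1  # last contig starting at or before position
    return names[k], position - starts[k]


text, starts, names = concatenate_contigs([("c1", "ACGT"), ("c2", "GG"), ("c3", "TTA")])
print(text, map_back(6, starts, names))
```

A separator that cannot occur in real sequence keeps exact matches from running across contig boundaries, so every reported match maps back to a single contig.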

Improvements in MUMmer 3.0 d 曾俊雄

Overview Functionality vs. Modularity

What’s New? Optimized suffix-tree library (rewrite) Non-unique maximal matches (New!) Distant matches (2.1)

Optimized Suffix-Tree The most significant improvement. –multi-contig query against multi-contig reference (continue) –rewrite, more compact (continue)

MUMmer 3.0, page 4

multi-contig query against multi-contig reference –already in MUMmer 2.1 through Nucmer package –now imported into the core

more compact suffix-tree

Some previous work: –Manber & Myers : 18.8n~22.4n bytes (DNA) –Kärkkäinen : 15n~18n (?) –Crochemore and Vérin : 32.7n –The strmat software package by Knight, Gusfield and Stoye : 24n~28n (string length at most 2^23)

Basic Idea: –Use different data structure for different kinds of tree nodes –duplicated information should be removed

Terms: –depth

–head position ab i w a b the longest w, the smallest i

Implementation by Stefan Kurtz : –n+5q (6n) integers !! for leaf node, depth and head position and some others are not required –(3+1/16)n integers !! more observation

a aW if the head position of aW is known, the head position of W will satisfy the constraint: aW.headposition + 1 >= W.headposition

Case 1: equality (=). Case 2: strict inequality (>). In case 1, the internal node representing aW doesn't even need to store the head position.

Non-unique Maximal Matches 1.0: matches must be unique in both sequences –may miss matches 2.0: uniqueness required only in the reference sequence –may still miss matches

Now there is a command-line option to generate all maximal matches, regardless of uniqueness –at the cost of a very large output file

Distant Matches 1.0: only 100% matches are allowed –less sensitive 2.1: distant (not 100%) matches are allowed –through extension packages 3.0: improved

Conclusion

MUMmer is good at aligning closely related species. Distant matches have been taken into account as MUMmer improved. The open-source license may bring some advantages. The space problem is still there.