Suffix Trees and Derived Applications Carl Bergenhem and Michael Smith.


SimpleScalar Suite: A Linux-based cache simulator. It allows simulation of predefined cache environments and cross-compiles code for simulation through GCC on Linux. Fortran or C code can be compiled specifically for SimpleScalar so that the code executes completely while statistics are kept.

Sim-cache: Code run through sim-cache uses the following parameters:
– Number of sets in the structure
– Block size
– Associativity
– Replacement policy
What this lets us do: simulate how well a program's cache behavior will hold up on different types of CPUs.
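To make the four parameters concrete, here is a minimal sketch of a set-associative cache model. It is not SimpleScalar code: the class name `SetAssociativeCache`, the helper `simulate`, and the choice of LRU as the replacement policy are assumptions made for illustration only.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache model (hypothetical): nsets sets of up to `assoc` blocks, LRU replacement."""

    def __init__(self, nsets, block_size, assoc):
        self.nsets = nsets
        self.block_size = block_size
        self.assoc = assoc
        # One OrderedDict per set; keys are tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(nsets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size  # memory block containing the address
        index = block % self.nsets          # set that this block maps to
        tag = block // self.nsets           # identifies the block within its set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)       # evict the least recently used block
        lines[tag] = True
        self.misses += 1
        return False


def simulate(addresses, nsets, block_size, assoc):
    """Replay a list of byte addresses and return (hits, misses)."""
    cache = SetAssociativeCache(nsets, block_size, assoc)
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses
```

Replaying the same address trace with different values of `nsets`, `block_size`, and `assoc` shows how the hit and miss counts shift with the cache geometry, which is the kind of comparison the slide describes.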

Idea of a Suffix Tree: A suffix tree is a data structure that contains a path from the root to a leaf for each suffix of the input string. Ex: A seven-letter string will have seven leaves.

Idea of a Suffix Tree: The internal nodes of the tree are created when the start of one suffix is the same as the start of another. Ex: In “banana”, the suffixes “anana” and “ana” both start with “ana”, so they share the same path from the root until the point where they diverge.

Building a Tree: Starting from an empty root, we build the suffix tree for “banana”. The first step...

Building a Tree

Recap: As seen, creating the suffix tree is a simple process that takes a number of iterations equal to the length of the input string, one per suffix.
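The one-iteration-per-suffix process can be sketched as follows. This is a simplified illustration, not the construction from the slides: it builds an uncompressed suffix trie out of nested dictionaries (a simpler relative of the suffix tree, with one node per character), and it assumes a `$` end marker that does not occur in the text. The function names are hypothetical.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text (followed by the terminator) into a nested-dict trie."""
    assert terminator not in text
    s = text + terminator
    root = {}
    for start in range(len(text)):      # one iteration per suffix of the input
        node = root
        for ch in s[start:]:
            # Reuse the existing child if another suffix already created it.
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count leaves below a node; a node with no children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("banana")
print(count_leaves(trie))               # one leaf per suffix
print(sorted(trie["a"]["n"]["a"]))      # branches where "anana" and "ana" diverge
```

This naive insertion costs time quadratic in the length of the string, since each suffix is walked character by character; it is meant only to show the shared paths and the one-leaf-per-suffix property.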

Use: Fast String Comparisons. Once the tree for one string is built, a second string can be compared against it in a number of character comparisons of at most the length of that second string.
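A minimal sketch of this lookup, assuming the same nested-dictionary suffix trie as a stand-in for the suffix tree (the names `build_suffix_trie` and `contains` are hypothetical). The search walks down from the root one pattern character at a time, so it never makes more comparisons than the pattern has characters.

```python
def build_suffix_trie(text, terminator="$"):
    """Nested-dict trie holding every suffix of text, each ended by the terminator."""
    s = text + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return (found, comparisons) for a substring query against the indexed text."""
    node = trie
    comparisons = 0
    for ch in pattern:
        comparisons += 1
        if ch not in node:
            return False, comparisons   # the path breaks: pattern is not a substring
        node = node[ch]
    return True, comparisons


trie = build_suffix_trie("banana")
print(contains(trie, "nan"))
print(contains(trie, "nab"))
```

The cost of a query depends on the length of the pattern, not on the length of the indexed text, which is why building the tree once pays off over many queries.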

Example

REPuter: The REPuter algorithm is a genome-analysis algorithm that uses the suffix tree to efficiently find maximal repeats.

Maximal Repeats: Here, a maximal repeat is a substring that occurs at least twice within a string and whose length is at least a set threshold.

Example: With a threshold value of 2, the word “banana” has the following maximal repeats: “ana” appears twice, “an” appears twice, and “na” appears twice.

Use: Scientists use the REPuter algorithm to find repeated substrings of at least a certain length within a genome sequence. A useful extension of this algorithm is to find similar substrings, which can account for mutations in the DNA.

How It Works: The REPuter algorithm traverses the entire suffix tree. Whenever it reaches a node with two or more children whose path from the root spells a string at least as long as the threshold, that string is a valid maximal repeat.
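The traversal can be sketched as below. This is an illustration of the node test described above, not REPuter's actual implementation: it again assumes a nested-dictionary suffix trie with a `$` end marker, and the function names are hypothetical. A node with two or more children in the trie corresponds to an internal node of the suffix tree.

```python
def build_suffix_trie(text, terminator="$"):
    """Nested-dict trie holding every suffix of text, each ended by the terminator."""
    s = text + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def branching_repeats(text, threshold):
    """Map each branching substring of length >= threshold to its occurrence count."""
    root = build_suffix_trie(text)
    found = {}

    def visit(node, prefix):
        # Recursive walk; fine for short strings, too deep for genome-sized input.
        if not node:
            return 1                    # a leaf stands for one suffix, i.e. one occurrence
        leaves = sum(visit(child, prefix + ch) for ch, child in node.items())
        if len(node) >= 2 and len(prefix) >= threshold:
            found[prefix] = leaves      # leaves below the node = occurrences of prefix
        return leaves

    visit(root, "")
    return found


print(branching_repeats("banana", 2))
```

For “banana” with a threshold of 2, the nodes that pass the test are “ana” and “na”. The repeat “an” does not get a node of its own, because every occurrence of “an” is followed by “a”; it lies on the path leading to the “ana” node.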

Example

PSP Algorithm: Probe Selection Problem (PSP) Algorithm
– Relies upon the suffix tree to function.
– Takes a set S of genomic sequences.
– In order to find an oligonucleotide (probe) for each sequence, a suffix tree of all the sequences is used.
– Allows the probe to be identified in such a way that hybridization can occur for a specific sequence and that sequence only.
– Also yields the temperature at which the hybridization can occur.