A database index to large biological sequences

A database index to large biological sequences. Ela Hunt, Malcolm Atkinson, Rob Irving, Dept of Computing Science. VLDB, Rome, 11th Sept 2001.

Overview: current status of biological sequence analysis; proposal: indexed access to sequences; suffix trees; how to build a large suffix tree; recent measurements.

Biological sequence searching: Queries are performed each time a sequence is identified, to check whether it is a new sequence and what it is related to. Data: 16 GB of DNA {A,C,G,T}, 200 MB of protein (20 letters), 1 char = 1 B. Approximate matching uses cost matrices and gap cost functions. Query lengths range from 5 to xM letters (comparing 2 genomes).

Current approach: BLAST (Basic Local Alignment Search Tool, Altschul et al. '90, '97): serial scanning, CPU intensive; computer farms, over 400 at the Sanger Centre, UK; web interface (a query at a time), email interface, no DB integration; based on heuristics and statistical measures of the goodness of match (reflecting the size of the database queried). Then filter relevant matches, refine the alignment using another package or edit manually for input to other tools, and use SQL to place on a map.

Example: Rat genes which might be involved in hypertension (1000 x 300-600 chars). Match against the human and mouse genomes (6 GB). Find out which rat genes map to known/unknown genes. (Figure: human chromosome 21 fragment.)

Motivation for sequence indexing: faster (economy); remove reliance on the external service and network delays (user independence); integrate fully with a database engine (convenience); exhaustive instead of heuristic (quality); enable different statistics in sequence evaluation (flexibility).

Indexed query scenario: One suffix tree index to all data of one type, implemented as a partitioned (possibly distributed) tree. The query is broken into fragments, and results are processed based on user requirements (threshold, statistics). Display in biological context (organism, chromosome, match quality) as genome maps.

Suffix trees: O(n) construction (Weiner, McCreight, Ukkonen); suffix links need extra space; space optimisations by Kurtz. Naïve tree: construction is O(n^2) in the worst case, O(n log n) on average. Navarro and Baeza-Yates build a suffix array for 10 MB and a suffix tree for 1 MB (using 64 MB RAM) and claim that a suffix tree larger than RAM is not practical => OUR WORK.

Suffix tree (no links): Compressed trie of all suffixes (add $). Each child starts with a different letter. String length n, n leaves (each suffix represents a unique path from root to leaf). [Figure: example suffix tree with edge labels $, A, T$ and CAT$ and leaves numbered 1 to 5.]

Suffix links enable O(n) tree construction: A suffix link points from a node indexing the suffix aw, where |a| = 1, to a node indexing the suffix w. [Figure: suffix tree for ACACACAC$ with suffix links.]

Consequences of suffix links: additional storage (one reference per node); need to traverse the links (a large part of the tree) at construction time, with random access to disk; need to update the tree to add links and nodes as new suffixes are being added (scattered updates).

Removal of suffix links: [Figure: the tree split into a subtree for $, a subtree for A*, a subtree for C*.] Text max 20.5 MB using 2 GB RAM. Text: 200-300 MB.

Tree representation: Text: array of bytes, a byte per letter. Original tree (Ukkonen) node: left index, right index, suffix number, child, sib, suffix link. Our implementation node: left index, child, sib. Look up the right index in the child (inner node) or in the global text length (leaf). Calculate the suffix number at query time: leftIndex - string depth + 1.
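The compact layout on this slide (left index, child, sib, with the right index and suffix number derived rather than stored) can be sketched roughly as follows. This is only an illustration, not the authors' Java/PJama code: the class `CompactNode`, the helper names, the 0-based left index, the exclusive right index, and the example text `ACAT$` are all assumptions. It also assumes that the first child of an inner node continues the same occurrence of the text, so that its left index marks where the parent's edge label ends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompactNode:
    """Hypothetical compact node: only left index, first child, next sibling."""
    left: int                               # 0-based index of the first letter of the incoming edge
    child: Optional["CompactNode"] = None   # first child, None for a leaf
    sib: Optional["CompactNode"] = None     # next sibling, None if last


def right_index(node, text_length):
    """Exclusive end of the incoming edge label, derived instead of stored."""
    if node.child is None:
        return text_length      # leaf: the label runs to the end of the text
    return node.child.left      # inner node: read it from the first child


def leaf_paths(root, text):
    """Return {suffix number: path label} for every leaf below root."""
    result = {}

    def visit(node, depth, prefix):
        label = text[node.left:right_index(node, len(text))]
        if node.child is None:
            # Suffix number computed at query time: leftIndex - string depth + 1.
            result[node.left - depth + 1] = prefix + label
            return
        kid = node.child
        while kid is not None:
            visit(kid, depth + len(label), prefix + label)
            kid = kid.sib

    kid = root.child
    while kid is not None:
        visit(kid, 0, "")
        kid = kid.sib
    return result


# Hand-built tree for the illustrative text "ACAT$".
TEXT = "ACAT$"
leaf_t_under_a = CompactNode(left=3)
leaf_cat_under_a = CompactNode(left=1, sib=leaf_t_under_a)
leaf_dollar = CompactNode(left=4)
leaf_t = CompactNode(left=3, sib=leaf_dollar)
leaf_cat = CompactNode(left=1, sib=leaf_t)
node_a = CompactNode(left=0, child=leaf_cat_under_a, sib=leaf_cat)
root = CompactNode(left=0, child=node_a)

paths = leaf_paths(root, TEXT)
```

Here `paths` maps each 1-based suffix number to the suffix spelled out along its root-to-leaf path, which makes it easy to check that the derived right index and suffix number agree with the text.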

Building a suffix tree for arabesque$: A node is either added as a child, or first causes a node split and then its suffix is added as a child. [Figure: the tree after adding the 1st suffix (arabesque$), the 2nd suffix (rabesque$) and the 3rd suffix, which causes a node split: a shared edge a is created with children rabesque$ and besque$.]
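The insertion rule on this slide (add a child, or split a node and then add a child) is the core of the naïve, link-free construction. A minimal in-memory sketch is given below, assuming a text that ends in a unique `$`. The names `Node`, `build_naive_suffix_tree`, `leaves` and `find` are illustrative, and the dictionary-based node is a convenience for readability, not the compact layout used in the talk.

```python
class Node:
    """The edge into this node is labelled text[start:end]."""

    def __init__(self, start, end, suffix=None):
        self.start = start
        self.end = end
        self.suffix = suffix    # 1-based suffix number for leaves, else None
        self.children = {}      # first letter of edge label -> child Node


def build_naive_suffix_tree(text):
    """Insert every suffix top-down from the root, without suffix links."""
    assert text.endswith("$") and text.count("$") == 1
    root = Node(0, 0)
    n = len(text)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this letter: add a leaf as a child.
                node.children[text[pos]] = Node(pos, n, suffix=i + 1)
                break
            # Walk along the edge while the letters agree.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child    # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split the node, then add the leaf.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n, suffix=i + 1)
            break
    return root


def leaves(node):
    """Suffix numbers of all leaves below node."""
    if node.suffix is not None:
        return [node.suffix]
    found = []
    for letter in sorted(node.children):
        found.extend(leaves(node.children[letter]))
    return found


def find(root, text, pattern):
    """Exact query: 1-based start positions of pattern in text."""
    node, pos = root, 0
    while pos < len(pattern):
        child = node.children.get(pattern[pos])
        if child is None:
            return []
        k = child.start
        while k < child.end and pos < len(pattern):
            if text[k] != pattern[pos]:
                return []
            k += 1
            pos += 1
        node = child
    return sorted(leaves(node))


tree = build_naive_suffix_tree("arabesque$")
```

With this tree, `find(tree, "arabesque$", "e")` returns `[5, 9]`, the two positions where the letter e occurs. Each insertion restarts from the root, which is what gives the quadratic worst case mentioned earlier in the talk.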

A protein/DNA tree: DNA: 263 mln letters, 18 GB store, 2 GB log. Protein: 200 mln letters, all existing and predicted proteins from SWISSPROT and TREMBL, 12.5 GB store, 2 GB log. Built using PJama, developed at Glasgow / SunLabs; PJama implements orthogonal persistence for Java (minimal programming cost, 5 lines of code). 68 B per DNA character in the persistent store (incl. overheads, free space), 1.6-1.8 nodes/letter. Protein trees have fewer nodes.

Trees built in stages: Partitions are based on scanning the text for 3-letter prefixes of each suffix: AAA, AAC, AAG, ... Commit after a number of full partitions have been built. Protein: based on 2-character prefixes. Slow: 8-14 hrs for 200-300 MB text (Sun E450, 330 MHz, 2 GB RAM); disk activity dominates (writing to the log and store).
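The staging idea can be illustrated with a small sketch that groups suffix start positions by their fixed-length prefix, one partition per scan of the text. The function name `partitions`, the `"$"` key used for the few suffixes too short to carry a full prefix, and the choice to rescan the text once per prefix are assumptions made for illustration; the talk does not specify these details.

```python
from itertools import product


def partitions(text, alphabet, k):
    """Yield (prefix, suffix start positions), one partition at a time.

    Assumes text consists of letters from alphabet followed by a single '$'.
    """
    n = len(text)
    for letters in product(sorted(alphabet), repeat=k):
        prefix = "".join(letters)
        # One scan of the text per partition.
        starts = [i for i in range(n) if text.startswith(prefix, i)]
        if starts:
            yield prefix, starts
    # Suffixes with fewer than k letters before the terminator.
    tail = list(range(max(0, n - k), n))
    if tail:
        yield "$", tail


parts = dict(partitions("ACACACAC$", "ACGT", 2))
```

Each yielded partition could then be built as an independent subtree and committed before the next one starts, so that only one partition needs to be in memory at a time.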

Tree building in memory using 2 GB RAM justifies the use of the naïve tree

Database issues: light-weight persistence (write once, use many times); space overheads too high; generic persistence mechanism indispensable to investigate alternative data structures; PJama allows for fast experimentation; good performance on querying, but disk activity dominates, optimisation needed.

Exact queries

New results - approximate matching on 200 MB protein: Results for a query of length 300, compared to a full similarity matrix calculation (300*200M) executed under the same conditions, in memory, using unit costs.
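For orientation, the baseline mentioned here, a full similarity matrix with unit costs, can be sketched as the classic dynamic programme for approximate string matching, which fills one cell per (query letter, text letter) pair. This is a generic textbook formulation and not the authors' code; the function name `approximate_matches` and the `max_cost` threshold parameter are assumptions.

```python
def approximate_matches(pattern, text, max_cost):
    """Return (end position, cost) for every text position where the pattern
    matches some substring ending there with edit cost <= max_cost.

    Unit costs: substitution, insertion and deletion each cost 1.
    End positions are 1-based. Work is len(pattern) * len(text) cells.
    """
    m = len(pattern)
    col = list(range(m + 1))    # column for the empty text prefix
    hits = []
    for j, letter in enumerate(text, start=1):
        prev_diag = col[0]      # cell (i-1, j-1); row 0 is always 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == letter else 1
            best = min(prev_diag + cost,    # match or substitution
                       col[i] + 1,          # skip a text letter
                       col[i - 1] + 1)      # skip a pattern letter
            prev_diag = col[i]
            col[i] = best
        if col[m] <= max_cost:
            hits.append((j, col[m]))
    return hits


exact = approximate_matches("CAT", "ACATTCGT", 0)
near = approximate_matches("CAT", "CGT", 1)
```

Only one column of the matrix is kept at a time, so memory is proportional to the query length, while the running time remains proportional to the product of query length and text length, which is the 300*200M figure quoted on the slide.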

Further work: tree compression, alternative data structures, data clustering; a tunable experimental platform (space overheads, performance); results filtering (constructing the result from matching fragments); statistics; parallel distributed tree construction and use (GRID).

Acknowledgements: PJama team at Glasgow and SunLabs. Biologists at Glasgow: Keith Johnson (neurodegenerative diseases), Anna Dominiczak (hypertension).