A database index to large biological sequences


1 A database index to large biological sequences
Ela Hunt, Malcolm Atkinson, Rob Irving
Dept of Computing Science
VLDB, Rome, 11th Sept 2001

2 Overview
Current status of biological sequence analysis
Proposal - indexed access to sequence
Suffix trees
How to build a large suffix tree
Recent measurements

3 Biological sequence searching
Queries are performed each time a sequence is identified, to check whether it is a new sequence and what it is related to
Data: 16 GB of DNA {A,C,G,T}, 200 MB of protein (20 letters), 1 char = 1 B
Approximate matching using cost matrices and gap cost functions
Query lengths: 5 to xM letters (compare 2 genomes)

4 Current approach
BLAST (basic local alignment search tool, Altschul et al. '90, '97): serial scanning, CPU intensive
Computer farms, over 400 at the Sanger Centre UK
Web interface (a query at a time), interface, no DB integration
Based on heuristics and statistical measures of the goodness of match (reflecting the size of the database queried)
Then filter relevant matches, refine the alignment using another package, or edit manually for input to other tools, use SQL to place on a map

5 Example
Rat genes which might be involved in hypertension (1000 x chars)
Match against the human and mouse genomes (6 GB)
Find out which rat genes map to known/unknown genes
Human chromosome 21 fragment

6 Motivation for sequence indexing
Faster (economy)
Remove reliance on the external service and network delays (user independence)
Integrate fully with a database engine (convenience)
Exhaustive instead of heuristics (quality)
Enable different statistics in sequence evaluation (flexibility)

7 Indexed query scenario
One suffix tree index to all data of one type, implemented as a partitioned (possibly distributed) tree
Query broken into fragments, results processed based on user requirements (threshold, statistics)
Display in biological context (organism, chromosome, match quality) as genome maps

8 Suffix trees
O(n) construction: Weiner, McCreight, Ukkonen
Suffix links need extra space; space optimisations by Kurtz
Naïve tree: construction O(n^2) in the worst case, O(n log n) on average
Navarro and Baeza-Yates build a suffix array for 10 MB and a suffix tree for 1 MB (using 64 MB RAM) and claim that a suffix tree > RAM is not practical => OUR WORK

9 Suffix tree (no links)
Compressed trie of all suffixes (add $)
Each child starts with a different letter
String length n, n leaves (each suffix represents a unique path from root to leaf)
[Figure: suffix tree with a root, edge labels $, A, T$ and CAT$, and leaves numbered 1 to 5]

10 Suffix links enable O(n) tree construction
A suffix link points from a node indexing the suffix aw, where |a| = 1, to a node indexing the suffix w
[Figure: suffix tree with suffix links; labels shown include ACACACAC$, AC, C, AC$ and $]

11 Consequences of suffix links
Additional storage (one reference per node)
Need to traverse the links (a large part of the tree) at construction time: random access to disk
Need to update the tree to add links and nodes as new suffixes are being added (scattered updates)

12 Removal of suffix links
[Figure: tree split into a subtree for $, a subtree for A* and a subtree for C*]
Text max 20.5 MB using 2 GB RAM
Text: MB

13 Tree representation
Text: array of bytes, a byte per letter
Our implementation: left index, child, sib
Original tree (Ukkonen): left index, right index, suffix number, child, sib, suffix link
Look up right index in the child (inner node) or in the global text length (leaf).
Calculate suffix number at query time: leftIndex - string depth + 1
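The reduced node layout above can be illustrated with a small hand-built tree. This is only a sketch under stated assumptions: the parallel arrays LEFT, CHILD and SIB, the text ACAT$, the use of the first child's left index as an inner node's right index, and 0-based left indexes with "string depth" taken as the depth of the parent are all illustrative choices, not details confirmed by the slides.

```python
# Hypothetical node layout: three parallel arrays indexed by node number.
# -1 means "no child" / "no sibling". Left indexes are 0-based text positions.
TEXT = "ACAT$"
#        root  A    CAT$ T$   CAT$ T$   $
LEFT  = [0,    0,   1,   3,   1,   3,   4]
CHILD = [1,    2,  -1,  -1,  -1,  -1,  -1]
SIB   = [-1,   4,   3,  -1,   5,   6,  -1]


def right_index(node):
    """Exclusive end of the node's edge label; it is never stored.

    Leaf: the global text length.
    Inner node: the left index of its first child (assumed convention).
    """
    if CHILD[node] == -1:
        return len(TEXT)
    return LEFT[CHILD[node]]


def leaves(node=0, depth=0, path=""):
    """Yield (root-to-leaf string, 1-based suffix number) for every leaf.

    depth is the string depth of the node's parent.
    """
    left = LEFT[node]
    label = TEXT[left:right_index(node)]
    if CHILD[node] == -1:
        # Suffix number computed at query time: leftIndex - string depth + 1
        yield path + label, left - depth + 1
        return
    child = CHILD[node]
    while child != -1:
        yield from leaves(child, depth + len(label), path + label)
        child = SIB[child]


for suffix, number in leaves():
    print(number, suffix)
```

Each node stores three integers in this sketch, compared with six fields in the original layout listed above; the right index and the suffix number are recomputed during traversal instead.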

14 Building a suffix tree for arabesque$
A node is either added as a child or first causes a node split and then its suffix is added as a child.
[Figure: the 1st suffix arabesque$ and the 2nd suffix rabesque$ are added as children; the 3rd suffix causes a node split on the edge starting with a, leaving rabesque$ and besque$ below the new node]
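The insertion rule on this slide can be sketched as a naive, suffix-by-suffix construction without suffix links. The Node class, the dictionary of children and the build function are hypothetical names chosen for illustration; the authors' persistent implementation is not shown in the slides and is likely to differ.

```python
class Node:
    """Edge label is text[start:end]; children are keyed by first letter."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.children = {}


def build(text):
    """Naive suffix tree construction; text must end in a unique terminator.

    Returns the root and a list of (suffix, "child" | "split") events.
    """
    assert text.count(text[-1]) == 1, "terminator must be unique"
    n = len(text)
    root = Node(0, 0)
    events = []
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this letter: add a leaf as a child.
                node.children[text[pos]] = Node(pos, n)
                events.append((text[i:], "child"))
                break
            # Walk along the edge while the letters agree.
            k = 0
            length = child.end - child.start
            while k < length and text[child.start + k] == text[pos + k]:
                k += 1
            if k == length:
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it, then add the new leaf.
            mid = Node(child.start, child.start + k)
            node.children[text[pos]] = mid
            child.start += k
            mid.children[text[child.start]] = child
            mid.children[text[pos + k]] = Node(pos + k, n)
            events.append((text[i:], "split"))
            break
    return root, events


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children.values())


root, events = build("arabesque$")
for suffix, kind in events:
    print(f"{kind:5} {suffix}")
```

For arabesque$ the third suffix (abesque$) shares the letter a with arabesque$ and therefore triggers a split, which matches the slide; every insertion only touches the path it walks, so no suffix links are needed.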

15 A protein/DNA tree
DNA - 263 mln letters, 18 GB store, 2 GB log
Protein mln letters, all existing and predicted proteins from SWISSPROT and TREMBL, 12.5 GB store, 2 GB log
Using PJama developed at Glasgow / SunLabs. PJama implements orthogonal persistence for Java (minimal programming cost, 5 lines of code)
68 B per DNA character in the persistent store (incl. overheads, free space), nodes/letter
Protein trees have fewer nodes

16 Trees built in stages
Partitions based on scanning text for 3-letter prefixes of each suffix: AAA, AAC, AAG, ...
Commit after a number of full partitions has been built
Protein: based on 2-character prefixes
Slow: 8-14 hrs for MB text (Sun E450, 330 MHz, 2 GB RAM); disk activity dominates (writing to the log and store)
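The prefix-based partitioning can be sketched as one scan of the text per prefix, collecting the suffixes that belong to that partition. The function names, the toy text and the decision to skip suffixes shorter than the prefix are assumptions made for illustration; in the staged build described above each partition would be built into a subtree and committed, which this sketch leaves out.

```python
from itertools import product


def suffixes_with_prefix(text, prefix):
    """Scan the text once and return the start positions of all suffixes
    that begin with the given prefix."""
    k = len(prefix)
    return [i for i in range(len(text) - k + 1) if text[i:i + k] == prefix]


def partitions(text, alphabet="ACGT", k=3):
    """Yield (prefix, suffix positions) for each non-empty partition,
    in the order AAA, AAC, AAG, ...

    Suffixes shorter than k letters are not covered here (assumption:
    they would be handled separately).
    """
    for letters in product(alphabet, repeat=k):
        prefix = "".join(letters)
        positions = suffixes_with_prefix(text, prefix)
        if positions:
            yield prefix, positions


text = "ACGTACGTAC"
for prefix, positions in partitions(text):
    # Each partition is independent, so its subtree could be built
    # in memory and committed before the next one starts.
    print(prefix, positions)
```

Because every suffix of length at least k falls into exactly one partition, the subtrees are disjoint below the prefix and can be built one at a time within a fixed memory budget.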

17 Tree building in memory using 2 GB RAM justifies the use of the naïve tree

18 Database issues
Light-weight persistence (write once use many times)
Space overheads too high
Generic persistence mechanism indispensable to investigate alternative data structures
PJama allows for fast experimentation
Good performance on querying, but disk activity dominates, optimisation needed

19 Exact queries

20 New results - approximate matching on 200 MB protein
Results for a query of length 300 compared to a full similarity matrix calculation (300*200M), executed under the same conditions in memory, using unit costs.
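A full matrix calculation with unit costs fills one cell per (query letter, text letter) pair, so a query of length 300 against 200M letters means about 6 * 10^10 cell updates. The sketch below assumes the baseline is the standard dynamic programme for approximate substring matching under unit edit costs; the slides do not give the exact formulation, and the function name and threshold parameter are illustrative.

```python
def approx_match_ends(query, text, max_cost):
    """Return (end position, cost) for every 1-based text position at which
    some substring of text ends within max_cost unit-cost edits of query.

    Only one column of the len(query) x len(text) matrix is kept in memory.
    """
    m = len(query)
    col = list(range(m + 1))  # column 0: matching query[:i] to nothing costs i
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            above_left = prev_diag
            prev_diag = col[i]
            substitution = above_left + (0 if query[i - 1] == ch else 1)
            col[i] = min(substitution, col[i] + 1, col[i - 1] + 1)
        if col[m] <= max_cost:
            hits.append((j, col[m]))
    return hits


print(approx_match_ends("ACGT", "TTACGTTT", 0))
print(approx_match_ends("ACGT", "ACT", 1))
```

The cost of this scan grows with the product of query length and text length regardless of how many matches exist, which is the behaviour an index-based search is being compared against.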

21 Further work
Tree compression, alternative data structures, data clustering
A tunable experimental platform (space overheads, performance)
Results filtering (constructing the result from matching fragments)
Statistics
Parallel distributed tree construction and use (GRID)

22 Acknowledgements
PJama team at Glasgow and SunLabs
Biologists at Glasgow: Keith Johnson (neurodegenerative diseases), Anna Dominiczak (hypertension)

