1
O|B|F Flatfile Indexing
Andrew Dalke
Dalke Scientific Software, LLC
One of the Biohackathon projects
2
Use case
Sally, a bioinformatics researcher, needs fast access to many different records from GenBank. She is in a small group with little experience in database management systems, so she wants a simple system that doesn't involve a client/server model. She also wants her different tools (written for the different Bio* projects) to be able to access the system, so she doesn't need to keep extracting data with one tool for use by another.
3
Background
● Have a set of large data files
● Each contains many records
● Records have identifiers
    ● id, accession, gid, entry name, etc.
● Want to retrieve a record given an identifier
● Don't want to set up a database server
➔ Make an indexer
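A minimal sketch of what such an indexer does, not the O|B|F implementation. It assumes a simplified GenBank-like layout in which each record starts with a `LOCUS` line carrying the identifier and ends with a `//` line; the function names `build_index` and `fetch` are hypothetical.

```python
import io


def build_index(handle):
    """Map each record identifier to (start byte, length).

    Assumption: a record starts with a LOCUS line whose second field is
    the identifier, and ends with a line containing only "//".
    """
    index = {}
    start = handle.tell()
    record_id = None
    while True:
        line = handle.readline()
        if not line:
            break
        if record_id is None and line.startswith(b"LOCUS"):
            record_id = line.split()[1].decode("ascii")
        if line.rstrip() == b"//":
            end = handle.tell()
            if record_id is not None:
                index[record_id] = (start, end - start)
            # The next record begins right after the terminator line.
            start, record_id = end, None
    return index


def fetch(handle, index, record_id):
    """Return the raw bytes of one record using a single seek and read."""
    start, length = index[record_id]
    handle.seek(start)
    return handle.read(length)


if __name__ == "__main__":
    # Made-up data standing in for a large flat file.
    data = io.BytesIO(
        b"LOCUS AB000001\nORIGIN\n//\n"
        b"LOCUS AB000002\nORIGIN\n//\n"
    )
    idx = build_index(data)
    print(fetch(data, idx, "AB000002").decode("ascii"))
```

The file is opened in binary mode so that offsets are true byte positions; the index itself is small and could be saved to disk rather than rebuilt on every run.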
4
Indexer
● Nothing new here
● "Everyone" has written one
● Spec out a standard and use it
5
"Schema"
Primary identifier ➔ (filename, start byte, length)
Secondary identifier * ➔ Primary identifier
....
Secondary identifier * ➔ Primary identifier
(Actually, the filename is normalized to a fileid)
6
Index as flat-file
key_ID.key:
    P12345 \t 1 \t 10000 \t 100
id_ACC.index:
    GI22222 \t P00012
    GI22222 \t P12345
    GI22223 \t P86753
    ....
config.dat:
    index \t flat/1
    fileid_1 \t /path/to/here
    fileid_2 \t /path/to/there
    ....
The .key and .index files are fixed width and sorted. This allows fast binary searches.
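A rough sketch of why fixed-width, sorted records allow binary search: record number n always starts at byte n times the record width, so a lookup needs only a handful of seeks. The layout here is an assumption for illustration (tab-separated fields padded with spaces to `record_width`, no header), and `pack_records` and `find_record` are hypothetical names, not part of the spec.

```python
import io
import os


def pack_records(rows, record_width):
    """Build a sorted, fixed-width table from rows of string fields.

    Assumption: fields are tab-joined and padded with spaces.
    """
    packed = []
    for row in sorted(rows):
        line = "\t".join(row).encode("ascii")
        if len(line) > record_width:
            raise ValueError("record longer than record_width")
        packed.append(line.ljust(record_width, b" "))
    return b"".join(packed)


def find_record(handle, key, record_width):
    """Binary search on the first field; return the fields or None."""
    target = key.encode("ascii")
    handle.seek(0, os.SEEK_END)
    lo, hi = 0, handle.tell() // record_width
    while lo < hi:
        mid = (lo + hi) // 2
        # Fixed width means record `mid` starts at a computable offset.
        handle.seek(mid * record_width)
        fields = handle.read(record_width).rstrip(b" ").split(b"\t")
        if fields[0] == target:
            return [f.decode("ascii") for f in fields]
        if fields[0] < target:
            lo = mid + 1
        else:
            hi = mid
    return None


if __name__ == "__main__":
    # The first row is the slide's example; the second is made up.
    rows = [("P12345", "1", "10000", "100"), ("P00012", "1", "0", "10000")]
    table = io.BytesIO(pack_records(rows, record_width=32))
    print(find_record(table, "P12345", record_width=32))
    print(find_record(table, "P99999", record_width=32))
```

Each probe reads one record, so a table of a million entries is searched in about twenty reads without loading the file into memory.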
7
Index in BerkeleyDB
● Use BDB tables for the key/value information
● Faster
● More scalable
● Easier to edit, modify
● More space efficient
● But it has an external dependency
Client code can determine the format automatically
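One way client code might pick the format automatically is to read the `index` line of config.dat, which the flat-file slide shows as `index \t flat/1`. The sketch below assumes the BerkeleyDB variant is announced as `BerkeleyDB/1`; that value, the function names, and the returned backend labels are assumptions for illustration only.

```python
def read_config(text):
    """Parse config.dat-style text: one tab-separated key/value per line."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        config[key.strip()] = value.strip()
    return config


def index_format(config):
    """Split a value such as 'flat/1' into (name, version)."""
    name, _, version = config["index"].partition("/")
    return name, version


def choose_backend(config):
    """Return a label for the reader to use (placeholder for real classes)."""
    name, _version = index_format(config)
    # "BerkeleyDB" is an assumed format name, not taken from the slides.
    backends = {"flat": "flat-file reader", "BerkeleyDB": "BerkeleyDB reader"}
    if name not in backends:
        raise ValueError("unknown index format: " + name)
    return backends[name]


if __name__ == "__main__":
    text = "index\tflat/1\nfileid_1\t/path/to/here\nfileid_2\t/path/to/there\n"
    config = read_config(text)
    print(index_format(config), choose_backend(config))
```

Keeping the dispatch in one place means tools only need the index directory; the caller never has to say which storage format was used.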
8
Bio* support
Biopython - Andrew Dalke
Bioperl - Michele Clamp & Lincoln Stein
BioJava - Matthew Pocock
BioRuby - Toshiaki Katayama (starting)
BioC - Steve Searle
And they really do interoperate!
9
TODO
Still tweaking the spec
● How to handle format
● non-ASCII filenames / internationalization
Need a cross-platform regression test suite