Published by Bartholomew Henderson. Modified over 9 years ago.
1
SciDB An Open Source Data Base Project by Michael Stonebraker (and others)
2
Outline
Why science folks are unhappy with RDBMS
How we plan to fix that
The details
3
Why SciDB?
“Big science” is very unhappy with RDBMS: astronomy, HEP, fusion, bio, remote sensing
4
Why?
Experience of Sequoia 2000 (mid 1990s): tried to use Postgres for science databases and failed badly
Main science data type is an array; it is horribly inefficient to simulate arrays on top of tables
Required features are absent (provenance, uncertainty, version control)
SQL operations are wrong (regrid, not join)
5
Why SciDB?
Net result: a mentality of “roll your own from the ground up” for every new science project
Realization by the science community that this is long-term suicide
Community wants to get behind something better
Great commonality of needs among domains
6
A Little Context
XLDB-1: genesis of the need
Asilomar conference (March 2008): small conference to generate requirements
7
A Little Context
March 2008 – September 2008
Initial design completed
Fund raising
Recruiting of initial team
Detailed use cases specified
8
Our Partnership
Science and high-end commercial folks, who will put up some resources and review the design
DBMS brain trust, who will design the system, oversee its construction, and perform needed research
Non-profit company, which will manage the open source project and support the resulting system; may need long-term funding help
9
Partners – Science (We are recruiting more…)
LSST astronomy project: DBMS work co-ordinated by SLAC
Pacific Northwest National Laboratory (PNNL): various bio projects
Lawrence Livermore National Laboratory: fusion projects
UCSB: remote sensing
10
Partners -- DBMS
Mike Stonebraker (MIT)
Dave DeWitt (Wisconsin -> Microsoft)
Jignesh Patel (Wisconsin)
Jennifer Widom (Stanford)
Dave Maier (Portland State)
Stan Zdonik (Brown)
Sam Madden (MIT)
Ugur Cetintemel (Brown)
Magda Balazinska (Washington)
Mike Carey (UCI)
11
Partners -- Other
eBay, Vertica, Microsoft, LSST, SLAC
Will hit up NSF and DOE
12
The SciDB Data Model
Nothing (e.g. Hadoop, Pig, Hive, …)?
Most of you have schemas
Hadoop is not a good starting point: slow, no HA
13
The SciDB Data Model
Tables?
Makes a few of you happy
Used by Sloan Sky Survey
But PanStarrs (Alex Szalay) wants arrays and scalability
14
The SciDB Data Model
Arrays?
Superset of tables (a table with a primary key is a 1-D array)
Makes HEP, remote sensing, astronomy, oceanography folks happy
But not biology and chemistry (who want networks and sequences)
15
The SciDB Data Model
Multidimensional grids
Superset of arrays (non-uniform cells)
Makes solid modeling folks happy
But complex and slow
16
SciDB Data Model
Nested multidimensional arrays
Array values are a tuple of values and arrays
Sightings (sid, details) [x, y, z, t]
Objects (type, [sid]) [id]
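To make the nesting concrete, here is a minimal Python sketch of the two example arrays. It is only an illustration under assumptions: the class names, the dictionary-based storage, and the helper `sightings_of` are hypothetical and are not SciDB syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    """Cell value of Sightings (sid, details) [x, y, z, t]."""
    sid: int
    details: str


@dataclass
class SkyObject:
    """Cell value of Objects (type, [sid]) [id]; `sids` is the nested 1-D array."""
    type: str
    sids: list = field(default_factory=list)


# A 4-D array stored sparsely: (x, y, z, t) -> cell value.
sightings = {
    (1, 2, 0, 10): Sighting(sid=100, details="faint"),
    (1, 2, 0, 11): Sighting(sid=101, details="bright"),
    (4, 4, 1, 10): Sighting(sid=102, details="noise"),
}

# A 1-D array indexed by object id; each cell nests an array of sighting ids.
objects = {
    7: SkyObject(type="star", sids=[100, 101]),
}


def sightings_of(object_id):
    """Return the coordinates of every sighting attached to one object."""
    wanted = set(objects[object_id].sids)
    return sorted(coord for coord, s in sightings.items() if s.sid in wanted)
```

Each cell holds a tuple of plain values plus, optionally, another array, which is what lets Objects point at many Sightings.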
17
Basic Arrays
Positive integer dimensions, no gaps
Bounded or unbounded
18
Enhanced Arrays
“Shape” function
Supports irregular boundary
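One way to picture a shape function is as a user-supplied callback that returns the legal column range for each row. The sketch below assumes a 2-D array with a triangular boundary; `triangle_shape` and `ShapedArray` are hypothetical names chosen for illustration, not part of SciDB.

```python
def triangle_shape(i):
    """Hypothetical shape function: row i may hold columns 1..i."""
    return (1, i)


class ShapedArray:
    """2-D array whose per-row column bounds come from a shape function."""

    def __init__(self, n_rows, shape):
        self.n_rows = n_rows
        self.shape = shape
        self.cells = {}

    def in_bounds(self, i, j):
        if not 1 <= i <= self.n_rows:
            return False
        low, high = self.shape(i)
        return low <= j <= high

    def __setitem__(self, key, value):
        i, j = key
        if not self.in_bounds(i, j):
            raise IndexError(f"cell {key} lies outside the array shape")
        self.cells[key] = value

    def __getitem__(self, key):
        i, j = key
        if not self.in_bounds(i, j):
            raise IndexError(f"cell {key} lies outside the array shape")
        return self.cells.get(key)  # None for an in-bounds but unset cell
```

With `a = ShapedArray(3, triangle_shape)`, the cell `a[3, 2]` is legal while `a[2, 3]` raises `IndexError`, because row 2 only extends to column 2.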
19
Enhanced Arrays
Co-ordinate systems
User-defined functions that map integers to something else, e.g. Mercator
Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
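The two access styles can be sketched in Python as follows. This is a sketch under stated assumptions: Python has no `A{...}` syntax, so a hypothetical `at()` method stands in for it, and a simple linear scale (0.5 units per index step) replaces a real projection such as Mercator.

```python
class CoordinateArray:
    """Array addressable by integer index or by user-defined co-ordinates."""

    def __init__(self, to_coord, to_index):
        self.to_coord = to_coord    # integer index -> co-ordinate
        self.to_index = to_index    # co-ordinate -> integer index
        self.cells = {}

    def __setitem__(self, idx, value):
        self.cells[idx] = value

    def __getitem__(self, idx):
        # Plays the role of A[17,36]: plain integer dimensions.
        return self.cells[idx]

    def at(self, *coords):
        # Plays the role of A{...}: map co-ordinates back to integers first.
        idx = tuple(self.to_index(c) for c in coords)
        return self.cells[idx]

    def coords_of(self, idx):
        return tuple(self.to_coord(i) for i in idx)


# Assumed co-ordinate system: each index step is 0.5 units.
A = CoordinateArray(to_coord=lambda i: i * 0.5,
                    to_index=lambda c: round(c / 0.5))
A[17, 36] = 3.2
```

Here `A[17, 36]` and `A.at(8.5, 18.0)` name the same cell; only the addressing differs.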
20
SciDB Query Language
“Parse-tree” representation of array operations
With a “binding” to: MatLab, C++, Python, IDL (there may be more…)
User-extendable operations (Postgres-style)
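A parse-tree representation means a query is a tree of operator nodes rather than a string, so any language binding can build it. The following Python sketch shows the idea; the node types `Scan`, `Filter`, and `Apply` and the `evaluate` function are illustrative assumptions, not the actual SciDB operator set.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Scan:
    """Leaf node: read a named array from the catalog."""
    name: str


@dataclass
class Filter:
    """Keep only the cells for which predicate(index, value) is true."""
    child: Any
    predicate: Callable


@dataclass
class Apply:
    """Apply a user-supplied function to every cell value."""
    child: Any
    func: Callable


def evaluate(node, catalog):
    """Recursively evaluate a parse tree against a dict of sparse arrays."""
    if isinstance(node, Scan):
        return dict(catalog[node.name])
    if isinstance(node, Filter):
        cells = evaluate(node.child, catalog)
        return {k: v for k, v in cells.items() if node.predicate(k, v)}
    if isinstance(node, Apply):
        cells = evaluate(node.child, catalog)
        return {k: node.func(v) for k, v in cells.items()}
    raise TypeError(f"unknown node type: {type(node).__name__}")


catalog = {"A": {(1, 1): 2.0, (1, 2): 5.0, (2, 1): 7.0}}
tree = Apply(Filter(Scan("A"), lambda idx, v: v > 3.0), lambda v: v * 2)
result = evaluate(tree, catalog)
```

Adding a user-defined operation in this sketch amounts to adding one node class and one branch in `evaluate`.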
21
Operations
Standard relational ones (filter, join)
Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)
Plus add your own (Postgres-style)
We need science input here!!!
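As an example of an array operation with no natural SQL equivalent, here is a small regrid sketch in Python. The slides do not define regrid precisely; this version assumes it means coarsening a 2-D grid by averaging non-overlapping `factor` x `factor` blocks, and the function name and signature are hypothetical.

```python
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + dr][c + dc]
                     for dr in range(factor)
                     for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
coarse = regrid(fine, 2)  # [[3.5, 5.5], [11.5, 13.5]]
```

On a table-based system the same operation needs a group-by over computed block keys, which is part of why arrays are treated as the native type here.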
22
Environment and Storage
Extendable grid (cloud) of Linux machines
With built-in high availability and failover
And built-in disaster recovery
23
In Situ Processing
Operate on data without loading it
Supported by a SciDB self-describing file format
And some number of adaptors, e.g. HDF-5, NetCDF
Or write your own
24
Storage Model
Arrays are “chunked” in storage
Chunk size can vary
Chunks are partitioned across the grid
Go for scalability to petabytes
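The chunk-then-partition idea can be sketched in a few lines of Python. This is a simplification under assumptions: it uses one fixed chunk shape (the slides say chunk size can vary), 1-based cell indices, and a hypothetical placement rule that assigns a chunk to a node by the sum of its chunk co-ordinates.

```python
def chunk_of(cell, chunk_shape):
    """Map a 1-based cell index to the co-ordinates of the chunk holding it."""
    return tuple((i - 1) // size for i, size in zip(cell, chunk_shape))


def node_of(chunk, n_nodes):
    """Assumed placement rule: spread chunks over nodes deterministically."""
    return sum(chunk) % n_nodes


def place(cells, chunk_shape, n_nodes):
    """Return {node: set of chunks} for the chunks touched by `cells`."""
    placement = {}
    for cell in cells:
        chunk = chunk_of(cell, chunk_shape)
        placement.setdefault(node_of(chunk, n_nodes), set()).add(chunk)
    return placement


layout = place([(1, 1), (5, 4), (5, 5)], chunk_shape=(4, 4), n_nodes=3)
```

With 4 x 4 chunks on three nodes, the three cells above fall into chunks (0, 0), (1, 0), and (1, 1), which land on nodes 0, 1, and 2 respectively.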
25
Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t)
Uncertainty
Data has error bars
Which must be carried along in the computation (interval arithmetic)
Will look at more sophisticated error models later
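Carrying error bars through a computation with interval arithmetic can be illustrated with a small Python class. This is a minimal sketch, not SciDB's implementation: the `Interval` name is an assumption, only three operators are covered, and it ignores the outward rounding a rigorous interval library would apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie somewhere between low and high."""
    low: float
    high: float

    def __add__(self, other):
        return Interval(self.low + other.low, self.high + other.high)

    def __sub__(self, other):
        # The smallest result pairs our low with their high, and vice versa.
        return Interval(self.low - other.high, self.high - other.low)

    def __mul__(self, other):
        # With mixed signs the extremes can come from any pair of endpoints.
        products = [self.low * other.low, self.low * other.high,
                    self.high * other.low, self.high * other.high]
        return Interval(min(products), max(products))


a = Interval(1.0, 2.0)
b = Interval(-1.0, 3.0)
```

For these inputs, `a + b` is `Interval(0.0, 5.0)`, `a - b` is `Interval(-2.0, 3.0)`, and `a * b` is `Interval(-2.0, 6.0)`: the error bars widen as they propagate.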
26
Other Features
Provenance (lineage)
What calibration generated the data
What was the “cooking” algorithm
In general: repeatability of data derivation
Supported by a command log with query facilities (interesting research problem)
And redo
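A command log with query and redo can be sketched as follows. Everything here is a hypothetical illustration: the `CommandLog` class, its methods, and the `calibrate` and `clip` steps are assumptions used to show the idea, not SciDB's design.

```python
class CommandLog:
    """Record every derivation step so it can be queried and replayed."""

    def __init__(self):
        self.entries = []  # (operation name, function, parameters)

    def run(self, state, op_name, func, **params):
        """Apply one step to `state` and remember how it was done."""
        self.entries.append((op_name, func, params))
        return func(state, **params)

    def query(self, op_name):
        """Answer provenance questions such as 'what calibration was used?'"""
        return [params for name, _, params in self.entries if name == op_name]

    def redo(self, raw):
        """Re-derive the cooked data from raw data by replaying the log."""
        state = raw
        for _, func, params in self.entries:
            state = func(state, **params)
        return state


def calibrate(values, gain):
    return [v * gain for v in values]


def clip(values, limit):
    return [min(v, limit) for v in values]


log = CommandLog()
raw = [1.0, 2.0, 3.0]
cooked = log.run(raw, "calibrate", calibrate, gain=2.0)
cooked = log.run(cooked, "clip", clip, limit=5.0)
```

Because each step is logged with its parameters, `log.redo(raw)` reproduces `cooked`, which is the repeatability the slide asks for.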
27
Other Features
Time travel: don’t fix errors by overwrite, i.e. keep all of the data; supported by an extra array dimension (history)
Spatial support
Named versions: recalibration usually handled this way; supported by allocating an array for the new version and “diffing” against its parent
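Named versions stored as diffs against a parent can be sketched like this in Python. The `VersionedArray` class and its methods are hypothetical, and the sketch assumes a new version only changes or adds cells; deleting cells would need an explicit tombstone marker.

```python
class VersionedArray:
    """Store each named version as a delta against its parent version."""

    def __init__(self, base):
        # version name -> (parent name or None, cells that differ from parent)
        self.versions = {"base": (None, dict(base))}

    def create(self, name, parent, new_cells):
        """Register a new version, keeping only cells that differ from parent."""
        parent_cells = self.read(parent)
        delta = {k: v for k, v in new_cells.items()
                 if parent_cells.get(k) != v}
        self.versions[name] = (parent, delta)

    def read(self, name):
        """Materialize a version by applying deltas from the root downwards."""
        parent, delta = self.versions[name]
        cells = self.read(parent) if parent is not None else {}
        cells.update(delta)
        return cells


va = VersionedArray({(1,): 10.0, (2,): 20.0})
va.create("recal", "base", {(1,): 10.0, (2,): 21.5})
```

After the recalibration above, the `recal` version stores only the single changed cell, while `va.read("base")` still returns the original data untouched.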
28
Other Features
(Optionally) integration of the real-time data capture system
“Cooking” inside DBMS
Makes provenance capture easier
Sometimes important
29
Time Line
Q4/08: start company, begin research activities
Late 2009: demoware available
Late 2010: V1 ships
30
Project Organization (Build-it for real)
CEO (Andy Palmer -- Vertica)
Project management (Bobbi Heath -- Vertica)
CTO (Stonebraker)
31
Project Organization (Design and Research)
Overall co-ordination (Stonebraker, DeWitt)
Storage and execution (Madden, Cetintemel)
Query layer and semantics (Zdonik, Maier)
Provenance (Widom, Patel)
Resource management (Balazinska)
Language bindings (Carey)
32
SciDB Has a Good Chance at Success
Community realizes shared infrastructure is good
“Lighthouse” customers
Strong team
Computation goes inside the DBMS: easier to share and reuse
33
How Can You Help?
Get involved!!!!