SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1.

SciDB An Open Source Data Base Project by Michael Stonebraker (and others)
1

Outline Why science folks are unhappy with RDBMS
How we plan to fix that The details

Why SciDB? “Big science” very unhappy with RDBMS Astronomy HEP Fusion
Bio Remote sensing

Why? Experience of Sequoia 2000 (mid 1990s)
Tried to use Postgres for science databases Failed badly…… Main science data type is an array – horribly inefficient to simulate arrays on top of tables Required features absent (provenance, uncertainty, version control) SQL operations wrong (regrid – not join)

Why SciDB? Net result Community wants to get behind something better
Mentality of “roll your own from the ground up” for every new science project Realization by the science community that this is long-term suicide Community wants to get behind something better Great commonality of needs among domains

A Little Context XLDB-1 Genesis of the need
Asilomar conference (March 2008) Small conference to generate requirements

A Little Context March 2008 – September 2008 Initial design completed
Fund raising Recruiting of initial team Detailed use cases specified

Our Partnership Science and high-end commercial folks DBMS brain trust
Who will put up some resources And review design DBMS brain trust Who will design the system, oversee its construction, and perform needed research Non-profit company Which will manage the open source project And support the resulting system May need long term funding help

(We are recruiting more….)
Partners – Science (We are recruiting more….) LSST astronomy project DBMS work co-ordinated by SLAC Pacific Northwest National Laboratory (PNNL) Various bio projects Lawrence Livermore National Laboratory Fusion projects UCSB Remote sensing

Partners -- DBMS Mike Stonebraker (MIT)
Dave DeWitt (Wisconsin -> Microsoft) Jignesh Patel (Wisconsin) Jennifer Widom (Stanford) Dave Maier (Portland State) Stan Zdonik (Brown) Sam Madden (MIT) Ugur Cetintemal (Brown) Magda Balazinska (Washington) Mike Carey (UCI)

Partners -- Other E-Bay Vertica Microsoft LSST SLAC
Will hit up NSF and DOE

The SciDB Data Model Nothing (e.g. Hadoop, Pig, Hive, …)?
Most of you have schemas Hadoop is not a good starting point Slow No HA

The SciDB Data Model Tables? Makes a few of you happy
Used by Sloan Sky Survey But PanStarrs (Alex Szalay) wants arrays and scalability

The SciDB Data Model Arrays?
Superset of tables (tables with a primary key are a 1-D array) Makes HEP, remote sensing, astronomy, oceanography folks happy But Not biology and chemistry (who wants networks and sequences)

The SciDB Data Model Multidimensional grids
Superset of arrays (non-uniform cells) Makes solid modeling folks happy But Complex and slow

SciDB Data Model Nested multidimensional arrays
Array values are a tuple of values and arrays Sightings (sid, details) [x, y, z, t] Objects (type, [sid]) [id]

Basic Arrays Positive integer dimensions, no gaps Bounded or unbounded

Enhanced Arrays “Shape” function Supports irregular boundary

Enhanced Arrays Co-ordinate systems
User defined functions that map integers to something else E.g. mercator Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}

SciDB Query Language “Parse-tree” representation of array operations
With a “binding” to: MatLab C++ Python IDL There may be more…. User extendable operations (Postgres-style)

Operations Standard relational ones (filter, join)
Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …) Plus add your own (Postgres-style) We need science input here!!!

Environment and Storage
Extendable grid (cloud) of Linux machines With built-in high availability and failover And built in disaster recovery

In Situ Processing Operate on data with loading it
Supported by a SciDB self-describing file format And some number of adaptors, e.g. HDF-5, NetCDF Or write your own

Storage Model Arrays are “chunked” in storage Chunk size can vary
Chunks are partitioned across the grid Go for scalability to petabytes

Which Science Guys Want (These could be in RDBMS, but Aren’t)
Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t) Uncertainty Data has error bars Which must be carried along in the computation (interval arithmetic) Will look at more sophisticated error models later

Other Features Provenance (lineage)
What calibration generated the data What was the “cooking” algorithm In general – repeatability of data derivation Supported by a command log with query facilities (interesting research problem) And redo

Other Features Time travel Spatial support Named versions
Don’t fix errors by overwrite I.e. keep all of the data Supported by an extra array dimension (history) Spatial support Named versions Recalibration usually handled this way Supported by allocating an array for the new version and “diffing” against its parent

Other Features (Optionally) integration of the real time data capture system “cooking” inside DBMS Makes provenance capture easier Sometimes important

Time Line Q4/08 start company, begin research activities Late 2009
Demoware available Late 2010 V1 ships

Project Organization (Build-it for real) CEO (Andy Palmer -- Vertica)
Project management (Bobbi Heath -- Vertica) CTO (Stonebraker)

Project Organization (Design and Research)
Overall co-ordination (Stonebraker, DeWitt) Storage and execution (Madden, Cetintemal) Query layer and semantics (Zdonik, Maier) Provenance (Widom, Patel) Resource management (Balazinska) Language bindings (Carey)

SciDB Has a Good Chance at Success
Community realizes shared infrastructure is good “Lighthouse” customers Strong team Computation goes inside the DBMS Easier to share And reuse

How Can You Help? Get involved!!!!

SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1.

Similar presentations

Presentation on theme: "SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1.

Similar presentations

Presentation on theme: "SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1."— Presentation transcript:

Similar presentations

About project

Feedback