The Science and Fiction of Petascale Analytics
Jacek Becla
Stanford Linear Accelerator Center

2008 MySQL Conference & Expo, Jacek Becla, SLAC

SLAC
• Particle Physics
• Photon Science
• Astrophysics
• Petascale data management
50+ PB images, 20+ PB database

Data Explosion
An enormous amount of digital information is produced
…and processed

Outline
• Reality
• Today’s trends
• Future
  … of petascale analytics
• Data-intensive science & industry

Data-Intensive Scientific Community
• Multi-decade experiments
• Large, multi-tier collaborations
• Distributed, heterogeneous environment
• Specialized software
Science and open source: Contingency, Customizations, Recompilation, Debuggability

Science & Petabytes
"Scientists are drowning in data" -- Jeannette M. Wing, Head, Computer & Information Science & Engineering Directorate at NSF, 03/2008
Scientists were always drowning in data
Early 1990s
Credit: Kirk Borne, GMU

Science & Petabytes
High Energy Physics: BaBar
• 1999 – 2008
• Few TB/sec
  – Small fraction saved
• Billions of collisions
• 4 PB data set
• Petabyte database

Science & Petabytes
High Energy Physics: LHC
• ½ PB/sec
  – Small fraction saved
• Trillions of collisions
• 15 PB/year
  – Starting later this year

Science & Petabytes
NASA: Earth Observing System
• 4 PB in 2005 (images)

Science & Petabytes
Photon Science
• Huge lasers
• Movies of molecules
  – Few MB x 120 Hz
• Few PB/year
Credit: NIF, LLNL

Science & Petabytes
Genomics
• Trying to put together a database of all known DNA sequences
• Multi-petabytes

Science & Petabytes
Astronomy
• Huge telescopes
• Multi-gigapixel cameras
• Getting ready for…
  – Trillions of observations
  – 50+ PB of images
  – 20+ PB database

Science & Petabytes
[Figure; labels: NASA, BaBar, LHC, LSST]

Science, Industry & Petabytes
?
Google, Yahoo!, Microsoft, AT&T, Walmart, eBay, Facebook, few others

Scientific Analytics Today
• Complex computations
  – 100s of attributes per query
• Iterative, successively more restrictive
• Curiosity driven questions
• 3 major query types
  – Needle in haystack
  – Correlations
  – Time series
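
The three query types named on this slide can be illustrated on a toy data set. The sketch below is only an illustration: the record layout, the sample values, and the function names are all hypothetical and do not come from the talk, and a real petascale system would not hold its data in a Python list.

```python
from collections import defaultdict
from itertools import combinations
from math import hypot

# Hypothetical toy records: (object_id, time, x, y, brightness)
observations = [
    (1, 0, 0.0, 0.0, 10.0),
    (1, 1, 0.0, 0.0, 10.5),
    (1, 2, 0.0, 0.0, 55.0),
    (2, 0, 0.3, 0.4, 7.0),
    (2, 1, 0.3, 0.4, 7.1),
    (3, 0, 9.0, 9.0, 3.0),
]

def needle_in_haystack(rows, min_brightness):
    """Return the few rows that satisfy a very selective condition."""
    return [r for r in rows if r[4] >= min_brightness]

def spatial_pairs(rows, max_dist):
    """Return pairs of object ids whose positions lie within max_dist."""
    positions = {r[0]: (r[2], r[3]) for r in rows}
    pairs = []
    for a, b in combinations(sorted(positions), 2):
        dx = positions[a][0] - positions[b][0]
        dy = positions[a][1] - positions[b][1]
        if hypot(dx, dy) <= max_dist:
            pairs.append((a, b))
    return pairs

def time_series(rows):
    """Group brightness measurements per object, ordered by time."""
    series = defaultdict(list)
    for obj, t, _x, _y, brightness in sorted(rows, key=lambda r: (r[0], r[1])):
        series[obj].append((t, brightness))
    return dict(series)

print(needle_in_haystack(observations, 50.0))
print(spatial_pairs(observations, 1.0))
print(time_series(observations)[1])
```

The correlation query compares every pair of objects, which is why this query type tends to be the expensive one as the number of objects grows.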

Hunt for Higgs Boson
HEP: It’s All About “Events”
• Complex hierarchical tree-like structures with many relations
• Events are uncorrelated
[Diagram; labels: Event, TrackList, Tracker, Calor., Track (x5), HitList, Hit (x5)]
Credit: Dirk Düllmann/CERN
Needle in haystack, Spatial correlations, Time series within event

Untangling the Universe
Astronomy: It’s All About “Astronomical Objects”
• Overlapping
• Moving
• Disappearing
• Highly correlated
Needle in haystack, Spatial correlations, Time series

Understanding Dynamics of Biological Processes
Needle in haystack, Correlations, Time series

Future Scientific Analytics
• Seamless integration with raw data
• Annotation and sharing
• Ubiquitous scientific data analytics
  – Instead of analytics for elite scientists
• Mobile anytime anywhere
  – On open source data

Industry & Analytics
• Most queries tool-generated
• Lots of summaries and aggregates
• Some very complex analytics
  – detecting fraudulent activities
  – understanding hacker patterns
  – correlating ads with user behaviors
• Starting to realize huge potential of data/logs
Needle in haystack, Correlations, Time series
Industrial analytics are becoming increasingly complex

Scientific Approach to Petascale Analytics
HEP
• Relational model insufficient
• ODBMS didn’t take off
• Files + metadata in db
• Custom software
• Filtering & grouping
  – Avoids small-granularity random reads
  – Organized activity, introduces delay
Others
• RDBMS – good match
  – but no multi-server setups yet
• Bigger systems
  – files + metadata in db
• Raw data in files
  – …or blobs inside database

Industrial Approach to Petascale Analytics
• Very few use databases for analytics
• Trend: Map/Reduce paradigm
  – M/R, Hadoop, Dryad
  – Bigtable, HBase, Hypertable
  – Sawzall, Pig Latin, LINQ
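
For readers unfamiliar with the Map/Reduce paradigm named on this slide, here is a minimal single-process sketch of the idea. It is not the API of Hadoop or of any other system listed; `map_reduce`, `status_mapper`, `count_reducer`, and the sample log lines are hypothetical names and data made up for illustration. Real frameworks run the map and reduce phases in parallel on many nodes and handle the grouping (shuffle) step over the network.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle phase: group by key
    # reduce phase: collapse each key's values into one result
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical log lines: "<date> <action> <status>"
log_lines = [
    "2008-04-15 search ok",
    "2008-04-15 login failed",
    "2008-04-16 search ok",
    "2008-04-16 login ok",
]

def status_mapper(line):
    """Emit (status, 1) for each log line."""
    _date, _action, status = line.split()
    yield status, 1

def count_reducer(_key, values):
    """Sum the counts emitted for one key."""
    return sum(values)

counts = map_reduce(log_lines, status_mapper, count_reducer)
print(counts)
```

Note that the record format is known only to the mapper, which is the "schema inside code" property of this approach.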

Database… Map/Reduce… Files + Database…
Is it really so different?

Maybe Not! You Must…
• Manage lots of hardware
• Learn to deal with failures
• Parallelize
• Optimize
• Compromise
• Automate

Manage Lots of Hardware
• 6 GB/min: 100 MB/sec (1 disk)
• 1 PB/min: 150,000 disks
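
The arithmetic behind these two figures can be checked with a short calculation. The sketch below assumes decimal units (1 PB = 10^15 bytes) and perfectly parallel, sustained reads at 100 MB/sec per disk; both are assumptions, not statements from the slide. Under them the result is roughly 167,000 disks, the same order of magnitude as the rounded 150,000 quoted above.

```python
MB = 10**6
GB = 10**9
PB = 10**15

def disks_needed(target_bytes_per_min, disk_bytes_per_sec):
    """Number of disks required to sustain the target scan rate (rounded up)."""
    per_disk_per_min = disk_bytes_per_sec * 60
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_bytes_per_min // per_disk_per_min)

per_disk_per_min = 100 * MB * 60      # one disk at 100 MB/sec
disks = disks_needed(1 * PB, 100 * MB)

print(per_disk_per_min // GB, "GB/min per disk")
print(disks, "disks to scan 1 PB in one minute")
```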

Learn to Deal With Failures
Large number of disks
= Large number of nodes
= Constant state of failures
= Must recover transparently
  – don't think RAID or high-end hardware will save you
Treat failures as normal state, not exceptions
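
One common way to "recover transparently" is to keep several copies of each piece of data and quietly move on to the next copy when a node is down. The sketch below illustrates that idea only; it is not taken from the talk, and `ReplicaUnavailable`, `read_with_failover`, `fake_read`, and the node names are hypothetical.

```python
class ReplicaUnavailable(Exception):
    """Raised when a (hypothetical) storage node cannot serve a chunk."""

def read_with_failover(chunk_id, replicas, read_chunk):
    """Try each replica in turn; a dead node is expected, not exceptional."""
    errors = []
    for replica in replicas:
        try:
            return read_chunk(replica, chunk_id)
        except ReplicaUnavailable as exc:
            errors.append(f"{replica}: {exc}")   # remember and keep going
    # Only when every copy is gone does the caller see a failure.
    raise ReplicaUnavailable(f"chunk {chunk_id} unreadable; tried {errors}")

# Toy stand-in for a cluster in which one node is currently down.
DEAD_NODES = {"node-a"}

def fake_read(replica, chunk_id):
    if replica in DEAD_NODES:
        raise ReplicaUnavailable("node is down")
    return f"chunk {chunk_id} from {replica}"

print(read_with_failover(7, ["node-a", "node-b", "node-c"], fake_read))
```

The caller never learns that node-a was down; the failure is absorbed inside the read path, which is the sense in which failure is treated as a normal state.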

Optimize or Go Bankrupt
• How to organize data?
• What to save? What to re-compute?
• How to partition?
• Row or column store?
• What to index?
• CPU/disk balance?
• How much to compress?
• How to formulate query?

Compromise or Die
• Performance killers
  – Transactions
  – Foreign keys

Automate
• 1 PB =
  – 20 years of movies (HD)
  – 2,000 years of MP3 (128 kbits/sec)
• Too much data to browse or comprehend
• Auto-load balance your data
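
The playback figures for 1 PB can be sanity-checked with a few lines of arithmetic. This sketch assumes 1 PB = 10^15 bytes, 128 kbits/sec = 128,000 bits/sec, and 365-day years; the HD bitrate is not given on the slide, so the code only derives the bitrate that "20 years of movies" would imply.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
PB = 10**15  # bytes, decimal units (an assumption)

def years_of_playback(total_bytes, bitrate_bits_per_sec):
    """How long a stream at the given bitrate plays before consuming total_bytes."""
    seconds = total_bytes * 8 / bitrate_bits_per_sec
    return seconds / SECONDS_PER_YEAR

def implied_bitrate_mbps(total_bytes, years):
    """Bitrate (Mbit/sec) at which total_bytes lasts exactly the given number of years."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR) / 1e6

mp3_years = years_of_playback(PB, 128_000)
hd_mbps = implied_bitrate_mbps(PB, 20)

print(f"MP3 at 128 kbits/sec: about {mp3_years:.0f} years")
print(f"20 years of video implies about {hd_mbps:.1f} Mbit/sec")
```

Under these assumptions the MP3 figure comes out just under 2,000 years, and 20 years of video corresponds to a stream of roughly 12 to 13 Mbit/sec.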

And Your Biggest Problem Is...
power and cooling
  – tape is cool
  – flash disks are coming

Hot or Not
Not:
• Big, monolithic systems
• Shared all, shared disks
• Specialized hardware
Hot:
• Lightweight, flexible specialized components with open interfaces
• Commodity hardware
• Shared nothing

Scale or Sophistication?
[Chart; axes: sophistication, scale; labels: Matlab, SAS; DBMS; Map/Reduce]
• Overhead too big for small problems
• Uses resources inefficiently
• Schema inside code
• Costly scalability
• Progressively expensive fault tolerance
• Inflexible schema

What Is Next?
[Chart; axes: sophistication, scale; labels: Matlab, SAS; DBMS; Map/Reduce]
Map/Reduce
• Adding Schema
• SQL (Hive)
• More indexes
DBMS
• New, more scalable engines
• Brand new DBMSes
• Planning to scale

Database Features Needed
Scientific Point of View
• Scalability up to 100s of petabytes (higher tomorrow)
• Parallelized single queries on commodity hardware
• Fault tolerant with intra-query failover
• Procedural user-defined functions/stored procedures that could be executed in parallel
• Shared scans
• Partial results
• Query pause/restart/abort
• Pre-execution query cost estimate
• Resource management system
• Support for arrays as a first-class column type
• Support for provenance of data elements
• Support for uncertainty of data elements
• Support for spatial and temporal operations
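
Of the features listed, "shared scans" is perhaps the easiest to illustrate: when several queries need to read the same large table, the data is read from disk once and every row is offered to all of them. The sketch below shows only the idea, in memory and in one process; `shared_scan`, the toy rows, and the two predicates are hypothetical and not taken from the talk.

```python
def shared_scan(rows, queries):
    """Evaluate several filter queries during a single pass over rows.

    queries maps a query name to a predicate taking one row.
    Returns (results per query, number of rows read).
    """
    results = {name: [] for name in queries}
    rows_read = 0
    for row in rows:                      # the expensive part: one read per row
        rows_read += 1
        for name, predicate in queries.items():
            if predicate(row):
                results[name].append(row)
    return results, rows_read

# Hypothetical toy table of 100 objects.
rows = [{"id": i, "mag": 15 + i % 10, "ra": i * 3.6} for i in range(100)]

queries = {
    "bright": lambda r: r["mag"] < 17,
    "first_quadrant": lambda r: r["ra"] < 90.0,
}

results, rows_read = shared_scan(rows, queries)
print(rows_read, {name: len(found) for name, found in results.items()})
```

Running the two queries separately would read the table twice; here both are answered with 100 row reads, and the saving grows with the number of concurrent queries.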

Will They Be There for LSST?
• Convincing database camp to build it all for us
  – Working with many
• Testing pure Map/Reduce + Bigtable
  – Collaborating with Google
• Prototyping with custom software plus off-the-shelf RDBMS
  – Using MySQL
2009: choosing technology
2010 – 2014: construction
2014 – 2023: production

Summary
• Data avalanche
• Need scalable, sophisticated tools
• You are facing it too
Credit: ncids.org