The Science and Fiction of Petascale Analytics Jacek Becla Stanford Linear Accelerator Center.

1 The Science and Fiction of Petascale Analytics
Jacek Becla, Stanford Linear Accelerator Center

2 SLAC
• Particle Physics
• Photon Science
• Astrophysics
• Petascale data management
50+ PB images, 20+ PB database
2008 MySQL Conference & Expo, Jacek Becla, SLAC

3 Data Explosion
Enormous amount of digital information is produced
…and processed

4 Outline
• Reality
• Today’s trends
• Future
… of petascale analytics
• Data-intensive science & industry

5 Data-Intensive Scientific Community
• Multi-decade experiments
• Large, multi-tier collaborations
• Distributed, heterogeneous environment
• Specialized software
Science, open source: Contingency, Customizations, Recompilation, Debuggability

6 Science & Petabytes
Early 1990s: Scientists were always drowning in data
"Scientists are drowning in data" -- Jeannette M. Wing, Head, Computer & Information Science & Engineering Directorate at NSF, 03/2008
Credit: Kirk Borne, GMU

7 Science & Petabytes
High Energy Physics: BaBar
• 1999 – 2008
• Few TB/sec
  – Small fraction saved
• Billions of collisions
• 4 PB data set
• Petabyte database

8 Science & Petabytes
High Energy Physics: LHC
• ½ PB/sec
  – Small fraction saved
• Trillions of collisions
• 15 PB/year
  – Starting later this year

9 Science & Petabytes
NASA: Earth Observing System
• 4 PB in 2005 (images)

10 Science & Petabytes
Photon Science
• Huge lasers
• Movies of molecules
  – Few MB x 120 Hz
• Few PB/year
Credit: NIF, LLNL

11 Science & Petabytes
Genomics
• Trying to put together database of all known DNA sequences
• Multi-petabytes

12 Science & Petabytes
Astronomy
• Huge telescopes
• Multi-gigapixel cameras
• Getting ready for…
  – Trillions of observations
  – 50+ PB of images
  – 20+ PB database

13 Science & Petabytes
(Chart labels: NASA, BaBar, LHC, LSST)

14 Science, Industry & Petabytes
?
Google, Yahoo!, Microsoft, AT&T, Walmart, eBay, Facebook, few others

15 Scientific Analytics Today
• Complex computations
  – 100s of attributes per query
• Iterative, successively more restrictive
• Curiosity driven questions
• 3 major query types
  – Needle in haystack
  – Correlations
  – Time series
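
The three query types named on this slide can be illustrated on a toy data set. The following is only a sketch: the record layout, the field values, and the function names are all made up for illustration and do not come from the talk.

```python
# Toy records: (object_id, time_step, brightness, temperature).
# All values are invented purely for illustration.
observations = [
    ("a", 0, 1.0, 10.0),
    ("a", 1, 1.1, 11.0),
    ("a", 2, 5.0, 30.0),
    ("b", 0, 2.0, 20.0),
    ("b", 1, 2.1, 21.0),
    ("b", 2, 2.0, 20.0),
]


def needle_in_haystack(rows, threshold):
    """Needle in haystack: keep the few rows whose brightness exceeds a threshold."""
    return [r for r in rows if r[2] > threshold]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_query(rows):
    """Correlations: how strongly brightness and temperature move together."""
    return pearson([r[2] for r in rows], [r[3] for r in rows])


def time_series(rows, object_id):
    """Time series: one object's brightness ordered by time step."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [(r[1], r[2]) for r in ordered if r[0] == object_id]
```

At petascale each of these would have to be partitioned and run in parallel; the sketch only shows the shape of the question being asked.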

16 Hunt for Higgs Boson
HEP: It’s All About “Events”
• Complex hierarchical tree-like structures with many relations
• Events are uncorrelated
(Diagram of an event tree with nodes: Event, TrackList, Tracker, Calor., Track, HitList, Hit. Credit: Dirk Düllmann/CERN)
Query types: Needle in haystack, Spatial correlations, Time series within event

17 Untangling the Universe
Astronomy: It’s All About “Astronomical Objects”
• Overlapping
• Moving
• Disappearing
• Highly correlated
Query types: Needle in haystack, Spatial correlations, Time series

18 Understanding Dynamics of Biological Processes
Query types: Needle in haystack, Correlations, Time series

19 Future Scientific Analytics
• Seamless integration with raw data
• Annotation and sharing
• Ubiquitous scientific data analytics
  – Instead of analytics for elite scientists
• Mobile anytime anywhere
  – On open source data

20 Industry & Analytics
• Most queries tool-generated
• Lots of summaries and aggregates
• Some very complex analytics
  – detecting fraudulent activities
  – understanding hacker patterns
  – correlating ads with user behaviors
• Starting to realize huge potential of data/logs
Query types: Needle in haystack, Correlations, Time series
Industrial analytics are becoming increasingly complex

21 Scientific Approach to Petascale Analytics
HEP
• Relational model insufficient
• ODBMS didn’t take off
• Files + metadata in db
• Custom software
• Filtering & grouping
  – Avoids small-granularity random reads
  – Organized activity, introduces delay
Others
• RDBMS – good match
  – but no multi-server setups yet
• Bigger systems
  – files + metadata in db
• Raw data in files
  – …or blobs inside database

22 Industrial Approach to Petascale Analytics
• Very few use databases for analytics
• Trend: Map/Reduce paradigm
  – M/R, Hadoop, Dryad
  – Bigtable, HBase, Hypertable
  – Sawzall, Pig Latin, LINQ
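
For readers unfamiliar with the Map/Reduce paradigm mentioned on this slide, here is a minimal single-process sketch of its three stages (map, group by key, reduce). It is not the API of Hadoop or any other system listed above; the function names and the sample log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key, as a real framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical example: count log lines per HTTP status code.
log_lines = ["GET /a 200", "GET /b 404", "GET /a 200"]


def status_mapper(line):
    yield line.split()[-1], 1


def sum_reducer(key, values):
    return sum(values)


counts = run_map_reduce(log_lines, status_mapper, sum_reducer)
```

In a distributed setting the map and reduce calls run on many nodes and the grouping step moves data across the network; the sketch keeps everything in one process to show only the data flow.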

23 Database… Map/Reduce… Files + Database…
Is it really so different?

24 Maybe Not! You Must…
• Manage lots of hardware
• Learn to deal with failures
• Parallelize
• Optimize
• Compromise
• Automate

25 Manage Lots of Hardware
• 6 GB / min → 100 MB/sec (1 disk)
• 1 PB / min → 150,000 disks
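
The arithmetic behind this slide can be checked with a few lines. This is a back-of-the-envelope sketch assuming decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and perfectly parallel scanning; the function names are illustrative.

```python
GB = 10**9
PB = 10**15

# Per-disk scan rate quoted on the slide: 6 GB per minute.
DISK_BYTES_PER_MIN = 6 * GB


def disk_mb_per_sec(bytes_per_min=DISK_BYTES_PER_MIN):
    """Convert a per-minute byte rate into MB per second."""
    return bytes_per_min / 60 / 10**6


def disks_needed(target_bytes_per_min, bytes_per_min=DISK_BYTES_PER_MIN):
    """Disks required to reach the target rate (ceiling division on integers)."""
    return -(-target_bytes_per_min // bytes_per_min)


per_disk = disk_mb_per_sec()        # 100.0 MB/sec
disks_for_pb = disks_needed(PB)     # 166667 disks
```

With these assumptions the count comes out near 167,000, the same order of magnitude as the slide's round figure of 150,000 disks.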

26 Learn to Deal With Failures
Large number of disks
= Large number of nodes
= Constant state of failures
= Must recover transparently
  – don't think RAID or high-end hardware will save you
Treat failures as normal state, not exceptions
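
A rough calculation shows why failures become the normal state at this scale. The sketch below assumes a 3% annual failure rate per disk, which is a hypothetical figure chosen only for illustration and not a number from the talk; the disk count is the one from the previous slide.

```python
def expected_failures_per_day(n_disks, annual_failure_rate):
    """Expected disk failures per day, assuming independent, evenly spread failures."""
    return n_disks * annual_failure_rate / 365


# ASSUMPTION: 3% annual failure rate is illustrative, not from the source.
failures_per_day = expected_failures_per_day(150_000, 0.03)
```

Under that assumption a 150,000-disk system loses about a dozen disks every day, so recovery has to be automatic rather than an operator-driven exception.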

27 Optimize or Go Bankrupt
• How to organize data?
• What to save? What to re-compute?
• How to partition?
• Row or column store?
• What to index?
• CPU/disk balance?
• How much to compress?
• How to formulate query?

28 Compromise or Die
• Performance killers
  – Transactions
  – Foreign keys

29 Automate
• 1 PB =
  – 20 years of movies (HD)
  – 2,000 years of MP3 (128 kbits/sec)
• Too much data to browse or comprehend
• Auto-load balance your data
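
The playback equivalents on this slide can be reproduced with simple unit conversions. The sketch assumes 1 PB = 10^15 bytes and a 365.25-day year; the HD bitrate is not stated in the talk, so the code only derives what bitrate 20 years would imply.

```python
PB = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_of_playback(total_bytes, bits_per_sec):
    """How many years it takes to play total_bytes at a constant bitrate."""
    return total_bytes * 8 / bits_per_sec / SECONDS_PER_YEAR


def implied_bitrate_mbps(total_bytes, years):
    """Bitrate (Mbit/s) at which total_bytes lasts exactly the given number of years."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR) / 10**6


mp3_years = years_of_playback(PB, 128_000)   # MP3 at 128 kbit/s
hd_mbps = implied_bitrate_mbps(PB, 20)       # bitrate implied by "20 years of HD"
```

The MP3 figure comes out just under 2,000 years, and 20 years of video corresponds to a stream of roughly 12 to 13 Mbit/s, which is a plausible HD rate.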

30 And Your Biggest Problem Is...
power and cooling
  – tape is cool
  – flash disks are coming

31 Hot or Not
Not:
• Big, monolithic systems
• Shared all, shared disks
• Specialized hardware
Hot:
• Lightweight, flexible specialized components with open interfaces
• Commodity hardware
• Shared nothing

32 Scale or Sophistication?
(Chart: axes "sophistication" and "scale"; plotted: Matlab, SAS; DBMS; Map/Reduce)
Annotations:
  – Overhead too big for small problems
  – Uses resources inefficiently
  – Schema inside code
  – Costly scalability
  – Progressively expensive fault tolerance
  – Inflexible schema

33 What Is Next?
(Chart: axes "sophistication" and "scale"; plotted: Matlab, SAS; DBMS; Map/Reduce)
Annotations:
  – Adding Schema
  – SQL (hive)
  – More indexes
  – New, more scalable engines
  – Brand new DBMSes
  – Planning to scale

34 Database Features Needed
Scientific Point of View
• Scalability up to 100s of petabytes (higher tomorrow)
• Parallelized single queries on commodity hardware
• Fault tolerant with intra-query failover
• Procedural user-defined functions/stored procedures that could be executed in parallel
• Shared scans
• Partial results
• Query pause/restart/abort
• Pre-execution query cost estimate
• Resource management system
• Support for arrays as a first-class column type
• Support for provenance of data elements
• Support for uncertainty of data elements
• Support for spatial and temporal operations
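
One item in this list, shared scans, is commonly understood as answering several concurrent queries with a single pass over the data instead of one pass per query. The sketch below illustrates that reading only; it is an assumption about what the slide means, and the function and query names are hypothetical.

```python
def shared_scan(rows, predicates):
    """Evaluate several filter queries during one pass over the rows.

    predicates maps a query name to a function row -> bool.
    Returns the matches for each query and the number of rows read.
    """
    results = {name: [] for name in predicates}
    rows_read = 0
    for row in rows:
        rows_read += 1  # each row is read from storage exactly once
        for name, matches in predicates.items():
            if matches(row):
                results[name].append(row)
    return results, rows_read


# Two hypothetical queries sharing one scan of ten rows.
queries = {
    "even": lambda x: x % 2 == 0,
    "big": lambda x: x > 7,
}
results, rows_read = shared_scan(range(10), queries)
```

Running the two queries separately would read 20 rows; sharing the scan reads 10, and the saving grows with the number of concurrent queries, which matters when a full scan touches petabytes.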

35 Will They Be There for LSST?
• Convincing database camp to build it all for us
  – Working with many
• Testing pure Map/Reduce + Bigtable
  – Collaborating with Google
• Prototyping with custom software plus off-the-shelf RDBMS
  – Using MySQL
Timeline:
  – 2009: choosing technology
  – 2010 – 2014: construction
  – 2014 – 2023: production

36 Summary
• Data avalanche
• Need scalable, sophisticated tools
• You are facing it too
Credit: ncids.org

