The Science and Fiction of Petascale Analytics
Jacek Becla, Stanford Linear Accelerator Center

2008 MySQL Conference & Expo, Jacek Becla, SLAC

SLAC
• Particle Physics
• Photon Science
• Astrophysics
• Petascale data management
50+ PB images, 20+ PB database

Data Explosion
An enormous amount of digital information is produced
… and processed

Outline
• Reality
• Today’s trends
• Future
• Data-intensive science & industry
… of petascale analytics

Data-Intensive Scientific Community
• Multi-decade experiments
• Large, multi-tier collaborations
• Distributed, heterogeneous environment
• Specialized software
Science and open source
• Contingency
• Customizations
• Recompilation
• Debuggability

Science & Petabytes
Scientists were always drowning in data
"Scientists are drowning in data" -- Jeannette M. Wing, Head, Computer & Information Science & Engineering Directorate at NSF, 03/2008
Early 1990s
Credit: Kirk Borne, GMU

Science & Petabytes
High Energy Physics: BaBar
• 1999 – 2008
• Few TB/sec
  – Small fraction saved
• Billions of collisions
• 4 PB data set
• Petabyte database

Science & Petabytes
High Energy Physics: LHC
• ½ PB/sec
  – Small fraction saved
• Trillions of collisions
• 15 PB/year
  – Starting later this year

Science & Petabytes
NASA: Earth Observing System
• 4 PB in 2005 (images)

Science & Petabytes
Photon Science
• Huge lasers
• Movies of molecules
  – Few MB x 120 Hz
• Few PB/year
Credit: NIF, LLNL

Science & Petabytes
Genomics
• Trying to put together a database of all known DNA sequences
• Multi-petabytes

Science & Petabytes
Astronomy
• Huge telescopes
• Multi-gigapixel cameras
• Getting ready for…
  – Trillions of observations
  – 50+ PB of images
  – 20+ PB database

Science & Petabytes
[Figure comparing NASA, BaBar, LHC, and LSST]

Science, Industry & Petabytes
?
Google, Yahoo!, Microsoft, AT&T, Walmart, eBay, Facebook, and a few others

Scientific Analytics Today
• Complex computations
  – 100s of attributes per query
• Iterative, successively more restrictive
• Curiosity-driven questions
• 3 major query types
  – Needle in haystack
  – Correlations
  – Time series

Hunt for Higgs Boson
HEP: It’s All About “Events”
• Complex hierarchical tree-like structures with many relations
• Events are uncorrelated
[Diagram of an event tree: Event, TrackList, Tracker, Calor., Track, HitList, Hit. Credit: Dirk Düllmann/CERN]
Needle in haystack, spatial correlations, time series within event

Untangling the Universe
Astronomy: It’s All About “Astronomical Objects”
• Overlapping
• Moving
• Disappearing
• Highly correlated
Needle in haystack, spatial correlations, time series

Understanding Dynamics of Biological Processes
Needle in haystack, correlations, time series

Future Scientific Analytics
• Seamless integration with raw data
• Annotation and sharing
• Ubiquitous scientific data analytics
  – Instead of analytics for elite scientists
• Mobile, anytime, anywhere
  – On open source data

Industry & Analytics
• Most queries tool-generated
• Lots of summaries and aggregates
• Some very complex analytics
  – Detecting fraudulent activities
  – Understanding hacker patterns
  – Correlating ads with user behaviors
• Starting to realize the huge potential of data/logs
Needle in haystack, correlations, time series
Industrial analytics are becoming increasingly complex

Scientific Approach to Petascale Analytics

HEP
• Relational model insufficient
• ODBMS didn’t take off
• Files + metadata in db
• Custom software
• Filtering & grouping
  – Avoids small-granularity random reads
  – Organized activity, introduces delay

Others
• RDBMS – good match
  – But no multi-server setups yet
• Bigger systems
  – Files + metadata in db
• Raw data in files
  – …or blobs inside database

Industrial Approach to Petascale Analytics
• Very few use databases for analytics
• Trend: Map/Reduce paradigm
  – M/R, Hadoop, Dryad
  – Bigtable, HBase, Hypertable
  – Sawzall, Pig Latin, LINQ
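The Map/Reduce paradigm named on this slide can be illustrated with a minimal single-process sketch. It is not the API of Hadoop or any other system listed; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample log lines are hypothetical, chosen only to show the three stages a framework would normally spread across many nodes.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Hypothetical job: count log lines per status code (last field of each line).
def status_mapper(line):
    yield line.split()[-1], 1

def count_reducer(key, values):
    return sum(values)

logs = ["GET /a 200", "GET /b 404", "GET /a 200"]
counts = map_reduce(logs, status_mapper, count_reducer)
print(counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both stages can be run in parallel on separate nodes, which is what makes the paradigm attractive at this scale.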

Database… Map/Reduce… Files + Database…
Is it really so different?

Maybe Not! You Must…
• Manage lots of hardware
• Learn to deal with failures
• Parallelize
• Optimize
• Compromise
• Automate

Manage Lots of Hardware
• 6 GB/min → 100 MB/sec (1 disk)
• 1 PB/min → 150,000 disks
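The arithmetic behind these two figures can be checked with a short sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and takes the slide's 100 MB/sec as the sustained rate of a single disk; the function name `disks_needed` is illustrative only.

```python
import math

MB = 10**6
GB = 10**9
PB = 10**15

def disks_needed(bytes_per_minute, disk_bytes_per_sec=100 * MB):
    """Disks required to sustain a scan rate when all disks are read in parallel."""
    required_bytes_per_sec = bytes_per_minute / 60
    return math.ceil(required_bytes_per_sec / disk_bytes_per_sec)

print(disks_needed(6 * GB))  # scanning 6 GB per minute
print(disks_needed(1 * PB))  # scanning 1 PB per minute
```

Under these assumptions a 1 PB/min scan needs roughly 167,000 disks, the same order of magnitude as the slide's rounded 150,000.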

Learn to Deal With Failures
• Large number of disks = large number of nodes = constant state of failures = must recover transparently
  – Don't think RAID or high-end hardware will save you
Treat failures as the normal state, not as exceptions
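A rough sketch of why failures become the normal state at this scale. The 150,000-disk count comes from the preceding slide; the 3% annual failure rate is an assumed, illustrative value that does not appear in the talk, and the model treats failures as independent and evenly spread over the year.

```python
def expected_failures_per_day(num_disks, annual_failure_rate):
    """Expected disk failures per day, assuming independent failures
    spread evenly over a 365-day year (a simplification)."""
    return num_disks * annual_failure_rate / 365

# 150,000 disks: figure from the talk. 0.03: assumed failure rate, not from the talk.
per_day = expected_failures_per_day(150_000, 0.03)
print(round(per_day, 1))
```

With these inputs the estimate is about a dozen disk failures every day, so recovery has to be automatic rather than an operator-driven exception.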

Optimize or Go Bankrupt
• How to organize data?
• What to save? What to re-compute?
• How to partition?
• Row or column store?
• What to index?
• CPU/disk balance?
• How much to compress?
• How to formulate query?

Compromise or Die
• Performance killers
  – Transactions
  – Foreign keys

Automate
• 1 PB =
  – 20 years of movies (HD)
  – 2,000 years of MP3 (128 kbits/sec)
• Too much data to browse or comprehend
• Auto-load balance your data
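The playback comparisons can be sanity-checked with a small calculation. This sketch assumes a decimal petabyte (10^15 bytes) and continuous playback; the helper names are illustrative. The talk does not state an HD bit rate, so the second function derives the rate implied by "20 years" instead of assuming one.

```python
PB = 10**15                          # decimal petabyte, in bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def playback_years(total_bytes, bits_per_sec):
    """Years of continuous playback that total_bytes holds at a given bit rate."""
    seconds = total_bytes * 8 / bits_per_sec
    return seconds / SECONDS_PER_YEAR

def implied_bit_rate(total_bytes, years):
    """Bit rate (bits/sec) at which total_bytes lasts the given number of years."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR)

mp3_years = playback_years(PB, 128_000)       # MP3 at 128 kbits/sec
hd_mbit_per_sec = implied_bit_rate(PB, 20) / 1e6

print(round(mp3_years))
print(round(hd_mbit_per_sec, 1))
```

The MP3 figure comes out just under 2,000 years, matching the slide, and "20 years of HD" corresponds to a stream of roughly 13 Mbit/sec.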

And Your Biggest Problem Is...
power and cooling
  – Tape is cool
  – Flash disks are coming

Hot or Not

Not
• Big, monolithic systems
• Shared all, shared disks
• Specialized hardware

Hot
• Lightweight, flexible specialized components with open interfaces
• Commodity hardware
• Shared nothing

Scale or Sophistication?
[Chart: sophistication vs. scale, placing Matlab/SAS, DBMS, and Map/Reduce]
• Overhead too big for small problems
• Uses resources inefficiently
• Schema inside code
• Costly scalability
• Progressively expensive fault tolerance
• Inflexible schema

What Is Next?
[Chart: sophistication vs. scale, placing Matlab/SAS, DBMS, and Map/Reduce]
• Adding schema
• SQL (Hive)
• More indexes
• New, more scalable engines
• Brand new DBMSes
• Planning to scale DBMS

Database Features Needed
Scientific Point of View
• Scalability up to 100s of petabytes (higher tomorrow)
• Parallelized single queries on commodity hardware
• Fault tolerant with intra-query failover
• Procedural user-defined functions/stored procedures that could be executed in parallel
• Shared scans
• Partial results
• Query pause/restart/abort
• Pre-execution query cost estimate
• Resource management system
• Support for arrays as a first-class column type
• Support for provenance of data elements
• Support for uncertainty of data elements
• Support for spatial and temporal operations

Will They Be There for LSST?
• Convincing database camp to build it all for us
  – Working with many
• Testing pure Map/Reduce + Bigtable
  – Collaborating with Google
• Prototyping with custom software plus off-the-shelf RDBMS
  – Using MySQL
2009: choosing technology
2010 – 2014: construction
2014 – 2023: production

Summary
• Data avalanche
• Need scalable, sophisticated tools
• You are facing it too
Credit: ncids.org