Amdahl Numbers as a Metric for Data Intensive Computing Alex Szalay The Johns Hopkins University.

Gray’s Laws of Data Engineering
Jim Gray:
–Scientific computing is revolving around data
–Need a scale-out solution for analysis
–Take the analysis to the data!
–Start with the “20 queries”
–Go from “working to working”
DISSC: Data Intensive Scalable Scientific Computing

Astronomy Trends
CMB Surveys (pixels):
1990 COBE, Boomerang 10,000, CBI 50,000, WMAP 1 Million, 2008 Planck 10 Million
Galaxy Redshift Surveys (objects):
1986 CfA, LCRS, 2dF, SDSS
Angular Galaxy Surveys (objects):
1970 Lick 1M, 1990 APM 2M, 2005 SDSS 200M, 2009 PanSTARRS 1200M, 2015 LSST 3000M
Time Domain:
QUEST, SDSS Extension survey, Dark Energy Camera, PanSTARRS, SNAP…, LSST…
Petabytes/year by the end of the decade…

Collecting Data
Very extended distribution of data set sizes: data on all scales!
Most datasets are small and manually maintained (Excel spreadsheets)
The total amount of data is dominated by the other end (large multi-TB archive facilities)
Most bytes today are collected via electronic sensors

Data Analysis
Analysis also happens on all scales
Data is everywhere and will never be at a single location
More data-intensive scalable architectures are needed
Need randomized, incremental algorithms
–Best result in 1 min, 1 hour, 1 day, 1 week
Today most scientific data analysis is done on small to midsize Beowulf clusters bought from faculty startup funds
–Not scalable, not maintainable
National infrastructure is focused on the top facilities
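The “best result in 1 min, 1 hour, 1 day” idea can be sketched as an anytime estimator: process randomly sampled records one at a time and report a progressively tightening answer at each checkpoint. A toy illustration on synthetic data (the dataset and checkpoints are made up for the example):

```python
import random

def anytime_mean(stream, checkpoints):
    """Incrementally estimate the mean, snapshotting the best answer so far."""
    total, n, reports = 0.0, 0, {}
    for x in stream:
        total += x
        n += 1
        if n in checkpoints:
            reports[n] = total / n  # current best estimate
    return reports

random.seed(0)
data = (random.gauss(100.0, 15.0) for _ in range(100_000))
reports = anytime_mean(data, {100, 10_000, 100_000})
for n, est in sorted(reports.items()):
    print(f"after {n:>7} records: mean ~ {est:.2f}")
```

The early snapshots are noisy but usable; the estimate converges as more of the sample streams past, which is exactly the trade-off an incremental algorithm offers.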

Amdahl’s Laws
Gene Amdahl (1965): laws for a balanced system
i. Parallelism: max speedup is 1/s, where s = S/(S+P) is the serial fraction
ii. One bit of IO/sec per instruction/sec (BW)
iii. One byte of memory per one instruction/sec (MEM)
iv. One IO per 50,000 instructions (IO)
Modern multi-core systems move ever farther away from Amdahl’s balance (Bell, Gray and Szalay 2006)
For a Blue Gene, BW=0.001 and MEM=0.12; for the JHU cluster, BW=0.5 and MEM=1.04
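The BW and MEM ratios above can be computed straight from a machine’s spec sheet. A minimal sketch (the 10 GIPS node with 8 Gbit/s of I/O and 16 GB of RAM is an invented example, not one of the systems on the slide):

```python
def amdahl_numbers(instr_per_sec, io_bits_per_sec, ram_bytes):
    """Return the (BW, MEM) Amdahl numbers for a system.

    BW  = bits of I/O per second per instruction per second (balanced: 1)
    MEM = bytes of memory per instruction per second        (balanced: 1)
    """
    bw = io_bits_per_sec / instr_per_sec
    mem = ram_bytes / instr_per_sec
    return bw, mem

# Hypothetical node: 10 GIPS, 8 Gbit/s sequential I/O, 16 GB RAM
bw, mem = amdahl_numbers(10e9, 8e9, 16e9)
print(f"BW = {bw:.2f}, MEM = {mem:.2f}")  # BW = 0.80, MEM = 1.60
```

A number well below 1 means the CPUs outrun the I/O or memory system; data-intensive workloads want both ratios near 1.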

Typical Amdahl Numbers

Commonalities
Huge amounts of data, aggregates needed
–But also need to keep raw data
–Need for parallelism
Requests enormously benefit from indexing
Very few predefined query patterns
–Everything goes… search for the unknown!!
–Rapidly extract small subsets of large data sets
–Geospatial everywhere
Data will never be in one place
–Newest (and biggest) data are live, changing daily
Fits DB quite well, but no need for transactions
Simulations generate even more data
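The indexing point — rapidly extracting a small subset of a large catalog — can be shown with an in-memory SQLite table (the `objects` table and its columns are invented for this sketch; the real SkyServer catalogs are far larger and run on a full DBMS):

```python
import sqlite3, random

random.seed(1)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE objects (objid INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)")
# Load 100,000 synthetic objects with random positions and magnitudes.
db.executemany(
    "INSERT INTO objects (ra, dec, mag) VALUES (?, ?, ?)",
    ((random.uniform(0, 360), random.uniform(-90, 90), random.uniform(14, 24))
     for _ in range(100_000)),
)
# An index on magnitude lets the engine pull the small bright subset
# without scanning all 100,000 rows.
db.execute("CREATE INDEX idx_mag ON objects(mag)")
bright = db.execute("SELECT COUNT(*) FROM objects WHERE mag < 15").fetchone()[0]
plan = db.execute("EXPLAIN QUERY PLAN SELECT * FROM objects WHERE mag < 15").fetchone()
print(bright, "bright objects;", plan[-1])  # plan should mention idx_mag
```

The query plan confirms an index search rather than a full scan — the difference between touching ~10% of the rows and touching all of them, and the gap only widens at archive scale.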

Amdahl Numbers for Data Sets

The Data Sizes Involved

Need Scalable Crawlers
On petascale data sets we need to partition queries
–Queries executed on “tiles” or “tilegroups”
–Databases can offer indexes and fast joins
–Partitions can be invisible to users, or directly addressed for extra flexibility (spatial)
MapReduce/Hadoop
–Non-local queries need an Exchange step
Also need a multi-TB shared data space
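The tile-partitioned execution with an Exchange step can be sketched generically: run the query on each tile independently, regroup intermediate results by key across tiles (the exchange), then reduce each group. A toy sketch with made-up data (the function names and the two-region example are illustrative, not any particular system's API):

```python
from collections import defaultdict
from itertools import chain

def run_query(tiles, local_fn, key_fn, reduce_fn):
    """Execute a query over partitioned data with an exchange step."""
    # 1. Local phase: run on each tile independently (parallelizable).
    partials = [local_fn(tile) for tile in tiles]
    # 2. Exchange: regroup intermediate records by key across all tiles.
    groups = defaultdict(list)
    for rec in chain.from_iterable(partials):
        groups[key_fn(rec)].append(rec)
    # 3. Reduce: combine each group into the final answer.
    return {k: reduce_fn(v) for k, v in groups.items()}

# Toy example: count objects per sky region across three tiles.
tiles = [[("north", 1), ("south", 1)], [("north", 1)], [("south", 1), ("south", 1)]]
counts = run_query(
    tiles,
    local_fn=lambda tile: tile,                  # per-tile scan/filter
    key_fn=lambda rec: rec[0],                   # exchange key: region
    reduce_fn=lambda recs: sum(n for _, n in recs),
)
print(counts)  # {'north': 2, 'south': 3}
```

A purely local query skips step 2 entirely; only aggregates that cross tile boundaries pay the exchange cost, which is why partition-aware query design matters at petascale.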

Emerging Trends for DISSC
Large data sets are here, solutions are not
National infrastructure does not match the power-law distribution of data
Even HPC projects are choking on IO
Scientists are “cheap”, also pushing to the limit
Sociological trends:
–Data collection in ever larger collaborations (VO)
–Analysis decoupled, done off archived data by smaller groups
Data will never be co-located
–“Ferris-wheel”/streaming algorithms (w. B. Grossman)
–“Data pipes” for distributed workflows (w. B. Bauer)
–“Data diffusion” (w. I. Foster and I. Raicu)
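A streaming algorithm in the spirit of the “Ferris-wheel” idea touches each record exactly once and keeps only constant state. Welford’s one-pass mean/variance recurrence is a standard building block of this style (the million-record Gaussian stream here is synthetic):

```python
import random

class RunningStats:
    """Welford's one-pass algorithm: mean and variance in O(1) memory."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n           # update running mean
        self.m2 += delta * (x - self.mean)    # accumulate sum of squares

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

random.seed(42)
rs = RunningStats()
for _ in range(1_000_000):
    rs.push(random.gauss(5.0, 2.0))  # data streams past once, never stored
print(f"mean ~ {rs.mean:.3f}, var ~ {rs.variance:.3f}")
```

However large the archive, the state stays three numbers — the property that lets such analyses ride along as data streams past a fixed compute station.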

Continuing Growth
How long does the data growth continue?
The high end is always linear
The exponential comes from technology + economics → rapidly changing generations
–like CCDs replacing photographic plates, gene sequencers…
How many new generations of instruments are left?
Where are new growth areas emerging?
Software is also an instrument:
–simulations
–virtual data (materialized)
–data cloning

Summary
Data growing exponentially
Data on all scales
Computational problems on all scales
Need a fractal architecture for scientific computing
Data will never be co-located
Bring computations as close to the data as possible
Need an objective measure for DISSC systems
–the Amdahl number!