Ian Foster Computation Institute Argonne National Lab & University of Chicago Towards an Open Analytics Environment (A “Data Cauldron”)

Presentation transcript:

Ian Foster Computation Institute Argonne National Lab & University of Chicago Towards an Open Analytics Environment (A “Data Cauldron”)

2 An Open Analytics Environment
Data in, programs & rules in, results out.
"No limits" on:
- Storage
- Computing
- Format
- Program
Allowing for:
- Versioning
- Provenance
- Collaboration
- Annotation

3 Towards an Open Analysis Environment: (1) Applications
- Astrophysics
- Cognitive science
- East Asian studies
- Economics
- Environmental science
- Epidemiology
- Genomic medicine
- Neuroscience
- Political science
- Sociology
- Solid state physics

4 Towards an Open Analysis Environment: (2) Hardware
- SiCortex: 6K cores, 6 Top/s
- IBM BG/P: 160K cores, 500 Top/s
- PADS (Petascale Active Data Store), with Gbit/s interconnection

5 PADS: Petascale Active Data Store
- 500 TB reliable storage (data & metadata)
- 180 TB at 180 GB/s
- 17 Top/s analysis capability
- 1000 TB tape backup
Functions: data ingest, dynamic provisioning, parallel analysis, remote access, and offload to remote data centers, serving diverse users and diverse data sources.

6 Towards an Open Analysis Environment: (3) Methods
- HPC systems software (MPICH, PVFS, ZeptoOS)
- Collaborative data tagging (GLOSS)
- Data integration (XDTM)
- HPC data analytics and visualization
- Loosely coupled parallelism (Swift, Hadoop)
- Dynamic provisioning, data diffusion (Falkon)
- Service authoring (Introduce, caGrid, gRAVI)
- Provenance recording and query (Swift)
- Service composition and workflow (Taverna)
- Virtualization management (Nimbus)
- Distributed data management (GridFTP, etc.)

8 High-Performance Data Analytics: Functional MRI (Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao)

9 XDTM: XML Data Typing & Mapping (logical vs. physical views)
Example physical layout (an fMRI study directory tree): ./group23 contains per-subject directories (AA, CH, EC); each subject directory holds session directories (e.g., 04nov06aa, 11nov06aa); each session holds an ANATOMY directory (coplanar.hdr, coplanar.img) and a FUNCTIONAL directory of bold1_NNNN.hdr/.img (and .mat) volume files.
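To illustrate the logical-to-physical mapping idea, here is a minimal Python sketch; it is not the actual XDTM implementation (which describes types in XML Schema and applies mappers from Swift), and the class names, helper function, and fixed file names are assumptions chosen to match the directory tree described above.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Volume:                 # one fMRI volume = header + image pair
    header: str
    image: str

@dataclass
class Run:                    # a functional run is an ordered sequence of volumes
    volumes: list = field(default_factory=list)

@dataclass
class Subject:                # a subject has an anatomical volume and a functional run
    anatomy: Volume = None
    functional: Run = field(default_factory=Run)

def map_subject(session_dir):
    """Map a physical session directory (ANATOMY/, FUNCTIONAL/) onto a logical Subject."""
    subj = Subject()
    anat = os.path.join(session_dir, "ANATOMY")
    func = os.path.join(session_dir, "FUNCTIONAL")
    if os.path.isdir(anat):
        subj.anatomy = Volume(os.path.join(anat, "coplanar.hdr"),
                              os.path.join(anat, "coplanar.img"))
    if os.path.isdir(func):
        for name in sorted(os.listdir(func)):
            if name.endswith(".hdr"):          # pair each header with its image file
                base = name[:-len(".hdr")]
                subj.functional.volumes.append(
                    Volume(os.path.join(func, name),
                           os.path.join(func, base + ".img")))
    return subj

# Usage (hypothetical path): subject = map_subject("./group23/AA/04nov06aa")
```

Programs can then operate on the logical Subject/Run/Volume view while the mapper hides how files are scattered on disk.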

10 Tagging & Social Networking. GLOSS: Generalized Labels Over Scientific data Sources (Svetlozar Nestorov and others)

11 DOCK virtual screening workflow (Mike Kubal, Benoit Roux, and others)
Inputs: PDB protein descriptions; one receptor file per protein (~1 MB, defining the pocket to bind to), prepared manually for DOCK6 and for FRED; ~2M ZINC 3-D ligand structures (~6 GB); an NAB script template and parameters (defining flexible residues and the number of MD steps).
Stages:
- DOCK6 and FRED docking of ligands against the receptor: ~4M tasks × 60 s × 1 CPU, ~60K CPU-hours; select the best ~5K
- Amber scoring of the resulting complexes (1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. generate NAB script via perl, 5. RunNABScript): ~10K tasks × 20 min × 1 CPU, ~3K CPU-hours; select the best ~500
- GCMC refinement: ~500 tasks × 10 hr × 100 CPUs, ~500K CPU-hours
For one target: ~4 million tasks and ~500,000 CPU-hours (about 50 CPU-years).
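As a rough sketch of how such a campaign decomposes into many independent tasks with a filtering step between stages (the actual runs used Swift and Falkon on the BG/P; the function bodies and numbers below are placeholders, not the real DOCK6/FRED or Amber invocations):

```python
import random
from functools import partial
from concurrent.futures import ProcessPoolExecutor

def dock(receptor, ligand):
    """Stage 1: dock one ligand against the receptor.
    Placeholder score; a real task would invoke DOCK6 or FRED here."""
    return ligand, random.random()

def amber_score(receptor, docked):
    """Stage 2: rescore one docked complex.
    Placeholder; a real task would Amberize ligand/receptor/complex and run the NAB script."""
    ligand, score = docked
    return ligand, score * random.uniform(0.9, 1.1)

def screen(receptor, ligands, keep=5000):
    with ProcessPoolExecutor() as pool:
        # Millions of short, independent docking tasks: ideal many-task parallelism.
        docked = list(pool.map(partial(dock, receptor), ligands))
        # Keep only the best-scoring candidates for the expensive rescoring stage.
        best = sorted(docked, key=lambda d: d[1])[:keep]
        # Fewer, longer Amber-scoring tasks.
        return list(pool.map(partial(amber_score, receptor), best))

if __name__ == "__main__":
    hits = screen("receptor.pdb", [f"ligand_{i}" for i in range(10000)], keep=50)
    print(len(hits), "candidates rescored")
```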

12 DOCK on BG/P: ~1M Tasks on 118,000 CPUs (Ioan Raicu, Zhao Zhang, Mike Wilde)
- CPU cores: 118,000
- Tasks: ~1M
- Elapsed time: 7257 sec
- Compute time: CPU-years
- Average task time: 667 sec
- Relative efficiency: 99.7% (scaling from 16 to 32 racks)
- Utilization: sustained 99.6%, overall 78.3%
Per-task I/O from GPFS: 1 script (~5 KB), 2 file reads (~10 KB), 1 file write (~10 KB). Cached in RAM from GPFS on the first task per node: 1 binary (~7 MB), static input data (~45 MB).
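As a rough consistency check on these figures (assuming exactly 1,000,000 tasks at the 667 s average, an approximation of the slide's "~1M"), overall utilization works out close to the reported 78.3%:

```python
# Back-of-the-envelope check of overall utilization from the slide's figures.
tasks, avg_task_s = 1_000_000, 667      # assumed task count (slide says ~1M)
cores, elapsed_s = 118_000, 7257
busy_core_seconds = tasks * avg_task_s
available_core_seconds = cores * elapsed_s
print(f"overall utilization ~ {busy_core_seconds / available_core_seconds:.1%}")  # about 77.9%
```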

13 Managing 160,000 Cores
Falkon manages task execution across a storage hierarchy: slower shared storage and high-speed local "disk" on the compute nodes.
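A minimal sketch of the caching idea behind that hierarchy (Falkon's data diffusion is considerably more sophisticated, with cooperative caches and eviction policies; the paths and the copy-on-first-use policy here are assumptions): copy each input from shared storage to node-local storage on first use and serve later tasks from the local copy.

```python
import os
import shutil

SHARED = "/gpfs/inputs"       # hypothetical slower shared file system
LOCAL = "/tmp/data_cache"     # hypothetical fast node-local storage

def fetch(name):
    """Return a node-local path for an input file, copying from shared storage on first use."""
    os.makedirs(LOCAL, exist_ok=True)
    local_path = os.path.join(LOCAL, name)
    if not os.path.exists(local_path):
        # Cache miss: one read from the slow shared file system.
        shutil.copy(os.path.join(SHARED, name), local_path)
    # Cache hit on subsequent tasks: no shared-storage traffic at all.
    return local_path
```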

14 Efficiency (relative to the no-I/O case) for 4-second tasks and data sizes from 1 KB to 1 MB, for CIO and GPFS, on up to 32K processors

15 “MI” workload, 250K tasks, 10MB:10ms ratio, up to 64 nodes using DRP, GCC policy, 2GB caches/node

16 “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node

17 SW workload, 2M tasks, 10MB:10ms ratio, up to 100 nodes with DRP, GCC policy, 50GB caches/node
