Parallel I/O Performance: From Events to Ensembles
Andrew Uselton
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory

Presentation transcript:

Parallel I/O Performance: From Events to Ensembles
Andrew Uselton, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
In collaboration with: Lenny Oliker, David Skinner, Mark Howison, Nick Wright, Noel Keen, John Shalf, Karen Karavanic

Parallel I/O Evaluation and Analysis
The explosion of sensor and simulation data makes I/O a critical component. Petascale I/O requires new techniques for analysis, visualization, and diagnosis. Statistical methods can be revealing. We present case studies and optimization results for:
MADbench – a cosmology application
GCRM – a climate simulation

IPM-I/O is an interposition library that wraps I/O calls with tracing instructions.
[Diagram: a job's input and output pass through the IPM-I/O layer, which produces a trace of read I/O, barrier, and write I/O events.]
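IPM-I/O itself is a compiled interposition layer, but the idea can be sketched in a few lines: each I/O call is wrapped so that one timed record per event lands in a trace buffer. A minimal Python sketch of the concept (not the IPM-I/O implementation; the wrapper functions and record format are illustrative assumptions):

```python
import os
import time

trace = []  # one (operation, bytes, start time, duration) record per I/O event

def traced_write(fd, data):
    """Wrap os.write so the event is timed and appended to the trace."""
    t0 = time.time()
    nbytes = os.write(fd, data)
    trace.append(("write", nbytes, t0, time.time() - t0))
    return nbytes

def traced_read(fd, size):
    """Wrap os.read so the event is timed and appended to the trace."""
    t0 = time.time()
    data = os.read(fd, size)
    trace.append(("read", len(data), t0, time.time() - t0))
    return data
```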

Events to Ensembles
The details of a trace can obscure as much as they reveal, and tracing does not scale. Statistical methods reveal what the trace obscures, and they do scale.
[Figure: a per-task trace (task 0 through task 10,000 versus wall clock time) alongside a histogram (count versus duration) of the same events.]
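One way to see the scaling argument: a full trace grows with tasks times events, while an ensemble summary stays a fixed size. A minimal sketch, assuming per-event records like those in the tracing example above:

```python
import numpy as np

# Collapse per-event records into an ensemble view: a histogram of event
# durations whose size is fixed regardless of how many tasks or events
# contributed to it.
durations = np.array([dur for (_op, _nbytes, _t0, dur) in trace])
counts, bin_edges = np.histogram(durations, bins=50)
```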

Case Study #1: MADCAP analyzes the Cosmic Microwave Background radiation. MADbench is an out-of-core matrix solver that writes and reads all of memory multiple times.

CMB Data Analysis
time domain - O(10^12)
pixel sky map - O(10^8)
angular power spectrum - O(10^4)

MADbench Overview
MADCAP is the maximum-likelihood CMB angular power spectrum estimation code. MADbench is a lightweight version of MADCAP. The calculation is out-of-core due to the large size and number of pixel-pixel matrices.

Computational Structure
I. Compute, Write (loop)
II. Compute/Communicate (no I/O)
III. Read, Compute, Write (loop)
IV. Read, Compute/Communicate (loop)
The compute intensity can be tuned down to emphasize I/O.
[Figure: per-task timeline (task versus wall clock time) showing the four phases.]

MADbench I/O Optimization
[Trace: task versus wall clock time during the Phase II reads, with reads #4 through #8 annotated.]

MADbench I/O Optimization
[Histogram: count versus duration (seconds).]

MADbench I/O Optimization
[Plot: cumulative probability versus duration (seconds).]
A statistical approach revealed a systematic pattern.
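The cumulative-probability view is just the empirical CDF of the event durations. A minimal sketch of computing it (the read_durations input is a hypothetical list of per-read times extracted from a trace):

```python
import numpy as np

def empirical_cdf(durations):
    """Return (x, p) where p[i] is the fraction of events with duration <= x[i].

    Plotting p against x gives a cumulative-probability curve like the one on
    the slide, where steps and plateaus expose systematic patterns."""
    x = np.sort(np.asarray(durations))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

# x, p = empirical_cdf(read_durations)
```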

MADbench I/O Optimization
A Lustre patch eliminated the slow reads.
[Before/after traces: process number versus time.]

Case Study #2: The Global Cloud Resolving Model (GCRM), developed by scientists at CSU, runs at resolutions fine enough to simulate cloud formation and dynamics. Mark Howison's analysis fixed its I/O performance.

GCRM I/O Optimization
At 4 km resolution GCRM is dealing with a lot of data. The goal is to work at 1 km and 40,000 tasks, which will require 16x as much data.
[Trace: task 0 through task 10,000 versus wall clock time, with the desired checkpoint time marked.]

GCRM I/O Optimization
Worst case: 20 seconds. Insight: all 10,000 writes are happening at once.

GCRM I/O Optimization
Collective buffering reduces concurrency. Worst case: 3 seconds.
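Collective buffering routes every task's data through a small number of aggregator tasks, so far fewer writers hit the file system at once. How it is enabled depends on the I/O stack; a hedged sketch using MPI-IO hints from mpi4py (file name and hint values are illustrative, not the settings used for GCRM):

```python
from mpi4py import MPI

# Ask ROMIO to aggregate writes through a limited number of tasks.
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")    # force collective buffering on writes
info.Set("cb_nodes", "48")              # number of aggregator tasks
info.Set("cb_buffer_size", "16777216")  # 16 MiB aggregation buffer

fh = MPI.File.Open(MPI.COMM_WORLD, "checkpoint.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
# Every task calls the collective write, but only the aggregators
# actually issue file-system requests, e.g.:
# fh.Write_at_all(offset, buf)
fh.Close()
```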

GCRM I/O Optimization
[Before/after traces, with the desired checkpoint time marked.]

GCRM I/O Optimization
Still need better worst-case behavior. Insight: aligned I/O. Worst case: 1 second.
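Aligned I/O means rounding requests to file-system boundaries (for Lustre, the stripe size) so each write touches as few stripes, and hence as few storage servers, as possible; HDF5 exposes this through its file alignment property. A library-agnostic sketch of the rounding itself, with a hypothetical 1 MiB stripe size:

```python
STRIPE_SIZE = 1 << 20  # hypothetical Lustre stripe size (1 MiB)

def aligned_extent(offset, length, alignment=STRIPE_SIZE):
    """Round an I/O request outward so it starts and ends on stripe boundaries."""
    start = (offset // alignment) * alignment
    end = ((offset + length + alignment - 1) // alignment) * alignment
    return start, end - start

# Example: any (offset, length) request is expanded to whole-stripe extents
# that the underlying writes can then target.
```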

GCRM I/O Optimization
[Before/after traces, with the desired checkpoint time marked.]

GCRM I/O Optimization
Sometimes the trace view is the right way to look at it: metadata is being serialized through task 0.

GCRM I/O Optimization
Defer metadata operations so there are fewer of them and they are larger.
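This is the same "fewer, larger" principle applied to metadata: batch many small updates and flush them in one larger, less frequent operation instead of issuing a small write per variable per step. A conceptual sketch (write_metadata_block is a hypothetical stand-in for whatever the I/O library does on flush):

```python
pending = []  # buffered metadata records

def record_metadata(entry, flush_threshold=1024):
    """Queue a metadata update instead of writing it immediately."""
    pending.append(entry)
    if len(pending) >= flush_threshold:
        flush_metadata()

def flush_metadata():
    """Issue one large, deferred write for all queued metadata."""
    if pending:
        write_metadata_block(pending)  # hypothetical library call
        pending.clear()
```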

GCRM I/O Optimization
[Before/after traces, with the desired checkpoint time marked.]

Conclusions and Future Work
Traces do not scale and can obscure underlying features.
Statistical methods scale and give useful diagnostic insight into large datasets.
Future work: gather statistical information directly in IPM.
Future work: automatic recognition of model and moments within IPM.

Acknowledgements
Julian Borrill wrote MADCAP/MADbench. Mark Howison performed the GCRM optimizations. Noel Keen wrote the I/O extensions for IPM. Kitrick Sheets (Cray) and Tom Wang (Sun/Oracle) assisted with the diagnosis of the Lustre bug. This work was funded in part by the DOE Office of Advanced Scientific Computing Research (ASCR) under contract number DE-AC02-05CH11231.