On the Path to Petascale: Top Challenges to Scientific Discovery
Presented by Scott A. Klasky, NCCS Scientific Computing End-to-End Task Lead

1. Code Performance
- Computing power for codes like GTC will go up three orders of magnitude!
- Two paths to petascale computing for most simulations:
  - More physics; larger problems.
  - Code coupling.
- My personal definition of leadership-class computing: "Simulation runs on >50% of cores, running for >10 hours."
- One "small" simulation will cost $38,000 on a petaflop computer.
- Science scales with processors.
- XGC and GTC fusion simulations will run on 80% of cores for 80 hours ($400,000/simulation).
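The dollar figures above are simple core-hour arithmetic. A back-of-envelope sketch, using a hypothetical machine size and core-hour rate (the talk gives neither), shows the scaling:

```python
# Back-of-envelope sketch (not from the slides). TOTAL_CORES and RATE are
# hypothetical placeholders chosen only to illustrate the scaling.

def simulation_cost(core_fraction, hours, total_cores, dollars_per_core_hour):
    """Cost of a run using `core_fraction` of the machine for `hours`."""
    return core_fraction * total_cores * hours * dollars_per_core_hour

TOTAL_CORES = 150_000   # hypothetical petaflop-class machine
RATE = 0.05             # hypothetical $/core-hour

# A "small" leadership-class run: >50% of cores for >10 hours.
print(simulation_cost(0.5, 10, TOTAL_CORES, RATE))   # ~$37,500
# An XGC/GTC-style run: 80% of cores for 80 hours.
print(simulation_cost(0.8, 80, TOTAL_CORES, RATE))   # ~$480,000
```

With these made-up inputs the "small" run lands near the quoted $38,000 and the 80-hour run is in the same ballpark as $400,000; the slide presumably assumes slightly different machine parameters.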

Data Generated
- MTTF (mean time to failure) will be ~2 days.
- Restarts contain critical information to replay the simulation at different times.
- Typical restarts = 1/10 of memory, dumped every hour. (The big 3 apps support this claim.)
- Analysis files dump every physical timestep, typically every 5 minutes of simulation.
- Analysis file sizes vary; we estimate that for ITER-size simulations, data output will be roughly 1 GB per 5 minutes.
- Demand: I/O < 5% of the calculation.
- A total simulation will potentially produce ≈1280 TB of restarts + 960 GB of analysis data.
- Need > (16 × 1024 GB)/(3600 s × 0.05) ≈ 91 GB/sec (arithmetic sketched below).
- Asynchronous I/O is needed! (The big 3 apps, combustion, fusion, and astro, allow buffers.)
  - Reduces the required rate to (16 × 1024 GB)/3600 s ≈ 4.5 GB/sec (with lower overhead).
- Get the data off the HPC machine and over to another system!
- Produce HDF5 files on another system (too expensive on the HPC system).
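A minimal sketch of the bandwidth arithmetic, assuming the elided factor on the slide is the 16 TB restart dump expressed in GB (which reproduces both quoted rates and the total data volume):

```python
# Inputs are the numbers stated on the slide: a 16 TB restart dump every hour,
# an 80-hour run, 1 GB of analysis output every 5 minutes, and an I/O budget
# of 5% of wall time.

RESTART_GB  = 16 * 1024   # one restart dump, in GB
DUMP_PERIOD = 3600        # seconds between restart dumps
IO_BUDGET   = 0.05        # I/O may consume <5% of the calculation
RUN_HOURS   = 80

# Synchronous I/O: the dump must fit inside 5% of each hour.
sync_bw = RESTART_GB / (DUMP_PERIOD * IO_BUDGET)
print(f"synchronous restart bandwidth: {sync_bw:.0f} GB/s")    # ~91 GB/s

# Asynchronous I/O: the dump can drain over the whole hour.
async_bw = RESTART_GB / DUMP_PERIOD
print(f"asynchronous restart bandwidth: {async_bw:.1f} GB/s")  # ~4.5 GB/s

# Total data volume over the run.
restart_tb  = RESTART_GB * RUN_HOURS / 1024                    # ~1280 TB
analysis_gb = 1 * (RUN_HOURS * 60 // 5)                        # ~960 GB
print(f"restarts: {restart_tb:.0f} TB, analysis: {analysis_gb} GB")
```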

Workflow automation is desperately needed (with high-speed data-in-transit techniques).
- Need to integrate autonomics into workflows.
- Need to make it easy for the scientists.
- Need to make it fault tolerant/robust (a minimal sketch of such a step follows).
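As a hedged illustration of what that automation might look like, the sketch below wraps one workflow step with retries and logging; the transfer command, host, and file names are hypothetical, and a real system would add provenance tracking, notification, and data-in-transit processing:

```python
# Minimal sketch of a fault-tolerant workflow step: retry on failure and
# report status, instead of a scientist babysitting ssh sessions.
import subprocess, time, logging

logging.basicConfig(level=logging.INFO)

def run_step(cmd, retries=3, delay=60):
    """Run one workflow step, retrying on failure."""
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            logging.info("step succeeded: %s", " ".join(cmd))
            return True
        logging.warning("attempt %d failed: %s", attempt, result.stderr.strip())
        time.sleep(delay)
    logging.error("gave up after %d attempts: %s", retries, " ".join(cmd))
    return False

# Hypothetical chain: move a restart file off the HPC machine, then post-process it.
if run_step(["scp", "user@jaguar.ccs.ornl.gov:/scratch/run01/restart.0001.bp", "."]):
    run_step(["python", "make_plots.py", "restart.0001.bp"])
```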

A few days in the life of a simulation scientist. Day 1, morning:
- 8:00 AM: Get coffee, check to see if the job is running.
  - ssh into jaguar.ccs.ornl.gov (job 1).
  - ssh into seaborg.nersc.gov (job 2). This one is running, yay!
  - Run gnuplot to see if the run is going OK on seaborg. This looks OK.
- 9:00 AM: Look at data from an old run for post-processing.
  - Legacy code (IDL, Matlab) to analyze most data.
  - Visualize some of the data to see if there is anything interesting.
  - Is my job running on jaguar? I submitted this 4K-processor job 2 days ago!
- 10:00 AM: scp some files from seaborg to my local cluster.
  - Luckily I only have 10 files (which are only 1 GB/file).
- 10:30 AM: First file appears on my local machine for analysis.
  - Visualize data with Matlab. Seems to be OK.
- 11:30 AM: See that the second file had trouble coming over.
  - scp the files over again... D'oh!

Day 1, evening:
- 1:00 PM: Look at the output from the second file.
  - Oops, I had a mistake in my input parameters.
  - ssh into seaborg, kill the job. Emacs the input, resubmit the job.
  - ssh into jaguar, check status. Cool, it's running.
  - bbcp 2 files over to my local machine (8 GB/file).
  - Gnuplot the data. This looks OK too, but I still need to see more information.
- 1:30 PM: Files are on my cluster.
  - Run Matlab on the HDF5 output files. Looks good.
  - Write down some information in my notebook about the run.
  - Visualize some of the data. All looks good.
  - Go to meetings.
- 4:00 PM: Return from meetings.
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. My job still isn't running...
- 8:00 PM: Are my jobs running?
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. Cool, my job is running. Run gnuplot. Looks good this time!

And later...
- 4:00 AM: Yawn... is my job on jaguar done?
  - ssh into jaguar. Cool, the job is finished. Start bbcp-ing files over to my work machine (2 TB of data).
- 8:00 AM: bbcp is having trouble. Resubmit some of my bbcp transfers from jaguar to my local cluster.
- 8:00 AM (next day): Oops, still need to get the rest of my 200 GB of data over to my machine.
- 3:00 PM: My data is finally here!
  - Run Matlab. Run EnSight. Oops... something's wrong! Where did that instability come from?
- 6:00 PM: Finish screaming!
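Much of the routine above could be scripted. The following hedged sketch polls a PBS-style batch queue, pulls files with scp, and verifies each transfer with a checksum, so the 11:30 AM "file had trouble coming over" case is caught and retried automatically. Host names, the queue command, the job ID, and paths are assumptions for illustration only:

```python
import subprocess, hashlib, time

HOST = "user@jaguar.ccs.ornl.gov"   # hypothetical account

def remote_md5(path):
    out = subprocess.run(["ssh", HOST, "md5sum", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]

def local_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch(path, dest="."):
    """Copy a remote file, verifying the checksum; retry instead of re-typing scp."""
    local = dest.rstrip("/") + "/" + path.rsplit("/", 1)[-1]
    for _ in range(3):
        ok = subprocess.run(["scp", f"{HOST}:{path}", dest]).returncode == 0
        if ok and local_md5(local) == remote_md5(path):
            return local
    raise RuntimeError(f"transfer of {path} kept failing")

def job_finished(jobid):
    # Assumes a PBS-style queue: `qstat <jobid>` returns nonzero once the job is gone.
    return subprocess.run(["ssh", HOST, "qstat", jobid],
                          capture_output=True).returncode != 0

jobid = "123456.jaguar"             # hypothetical job ID
while not job_finished(jobid):
    time.sleep(600)                 # poll instead of 4:00 AM logins
fetch("/scratch/run01/analysis.0040.h5")
```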

Need metadata integrated into the high-performance I/O, and integrated for simulation monitoring.
- Typical monitoring:
  - Look at volume-averaged quantities.
  - At 4 key times this quantity looks good.
  - The code had 1 error which didn't appear in the typical ASCII output used to generate this graph.
  - Typically users run gnuplot/grace to monitor output.
- More advanced monitoring:
  - In 5 seconds, move 600 MB and process the data. Really need to use an FFT for the 3D data, and then process the data + particles.
  - In 50 seconds (10 time steps), move and process the data: 8 GB for 1/100 of the 30 billion particles.
  - Demand low overhead, <5%!
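A small illustration of the "more advanced monitoring" idea: compute a volume-averaged quantity and a 3D FFT spectrum from each monitoring chunk as it arrives. The field shape and the diagnostics are assumptions for illustration, not the actual GTC/XGC monitoring code:

```python
import numpy as np

def monitor_chunk(phi):
    """phi: 3D array holding one monitoring dump of a field."""
    stats = {
        "volume_avg": float(phi.mean()),           # the usual scalar to plot vs. time
        "rms":        float(np.sqrt((phi**2).mean())),
    }
    # 3D FFT: watch which Fourier modes carry the energy, which can reveal a
    # growing instability long before the volume average looks wrong.
    spectrum = np.abs(np.fft.fftn(phi))**2
    spectrum[0, 0, 0] = 0.0                        # ignore the DC (mean) component
    stats["dominant_mode"] = np.unravel_index(int(np.argmax(spectrum)), spectrum.shape)
    return stats

# A 600 MB chunk of float64 is roughly a 420^3 grid; use a small stand-in here.
phi = np.random.rand(64, 64, 64)
print(monitor_chunk(phi))
```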

Parallel Data Analysis
- Most applications use serial, scalar data-analysis tools:
  - IDL
  - Matlab
  - NCAR Graphics
- Need techniques such as PCA.
- Need help: data-analysis code is written quickly and changed often, so there are no hardened versions... maybe.

Statistical Decomposition of Time-Varying Simulation Data
- Transform to reduce non-linearity in the distribution (often density-based).
- PCA computed via SVD (or ICA, FA, etc.).
- Construction of component movies.
- Interpretation of spatial, time, and movie components.
- Pairs of equal singular values indicate periodic motion (sketched below).
- G. Ostrouchov: ETG GTC simulation data (Z. Lin, UCI and S. Klasky, ORNL); the decomposition shows transitions between wave components in time.
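A minimal sketch of the decomposition: stack the time-varying field into a time-by-space matrix, compute the PCA via SVD, and flag pairs of nearly equal singular values as candidates for periodic motion. The array shapes and the matching tolerance are assumptions:

```python
import numpy as np

def pca_via_svd(frames):
    """frames: array of shape (n_times, nx, ny) from a sequence of dumps."""
    n_t = frames.shape[0]
    X = frames.reshape(n_t, -1)        # one row per time step
    X = X - X.mean(axis=0)             # center each spatial point over time
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # U[:, k] * s[k] is the time behaviour of component k ("movie" amplitude);
    # Vt[k] reshaped to (nx, ny) is its spatial pattern.
    return U, s, Vt

def periodic_pairs(s, rtol=0.05):
    """Indices (k, k+1) whose singular values match within rtol: a hint of periodic motion."""
    return [(k, k + 1) for k in range(len(s) - 1)
            if abs(s[k] - s[k + 1]) <= rtol * s[k]]

frames = np.random.rand(200, 64, 64)   # stand-in for a sequence of field dumps
U, s, Vt = pca_via_svd(frames)
print("candidate periodic component pairs:", periodic_pairs(s))
```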

New Visualization Challenges
- Finding the needle in the haystack.
- Feature identification/tracking!
- Analysis of 5D + time phase space (with 1×10^12 particles)!
- Real-time visualization of codes during execution.
- Debugging visualization.
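As a toy illustration of feature identification (not the method used by these codes), one can threshold a field and label the connected blobs so that a handful of "needles" can be tracked across time steps; the threshold choice here is an assumption:

```python
import numpy as np
from scipy import ndimage

def find_features(field, n_sigma=3.0):
    """Label connected regions where the field exceeds mean + n_sigma * std."""
    threshold = field.mean() + n_sigma * field.std()
    mask = field > threshold
    labels, n_features = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=range(1, n_features + 1))
    return labels, sizes

field = np.random.rand(64, 64, 64)              # stand-in for one field dump
labels, sizes = find_features(field, n_sigma=1.5)
print(f"found {len(sizes)} features; largest spans {int(sizes.max())} cells")
```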

Where is my data?
- ORNL, NERSC, HPSS (NERSC, ORNL), local cluster, laptop?
- We need to keep track of multiple copies.
- We need to query the data; query-based visualization methods.
- We don't want to have to distinguish between different disks/tapes.
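A hedged sketch of the kind of replica catalog this implies: ask for a dataset by name and time step and get back whichever copies exist, regardless of site or storage medium. The schema, site names, and paths are illustrative assumptions, not an existing ORNL service:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE replicas (
                dataset TEXT, timestep INTEGER, site TEXT, path TEXT)""")
db.executemany("INSERT INTO replicas VALUES (?, ?, ?, ?)", [
    ("gtc_run01_potential", 40, "ORNL/scratch", "/scratch/run01/phi.0040.h5"),
    ("gtc_run01_potential", 40, "NERSC/HPSS",   "hpss:/home/u/run01/phi.0040.h5"),
    ("gtc_run01_potential", 40, "laptop",       "~/data/run01/phi.0040.h5"),
])

def locate(dataset, timestep, preferred=("laptop", "ORNL/scratch", "NERSC/HPSS")):
    """Return replicas of (dataset, timestep), nearest preferred site first."""
    rows = db.execute("SELECT site, path FROM replicas WHERE dataset=? AND timestep=?",
                      (dataset, timestep)).fetchall()
    return sorted(rows, key=lambda r: preferred.index(r[0]) if r[0] in preferred else 99)

print(locate("gtc_run01_potential", 40))
```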