HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 1
HPC User Forum 9/10/2008
Scott Klasky
S. Ethier, S. Hodson, C. Jin, Z. Lin, J. Lofstead, R. Oldfield, M. Parashar, K. Schwan, A. Shoshani, M. Wolf, Y. Xiao, F. Zheng

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 2
 GTC
 EFFIS
 ADIOS
 Workflow
 Dashboard
 Conclusions

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy
Current system (275 TF peak):
 Compute nodes: 7,832 (35.2 GF/node)
–1 socket (AM2/HT1) per node, 4 cores per socket (31,328 cores total)
–Core CPU: 2.2 GHz AMD Opteron
–Memory per core: 2 GB (DDR2-800)
 232 service & I/O nodes
 Local storage: ~750 TB, 41 GB/s
 Interconnect: 3D torus, SeaStar 2.1 NIC
 Aggregate memory: 63 TB
 Peak performance: 275 TF
Upgraded system (1.0 PF peak):
 Compute nodes: 13,888 (73.6 GF/node)
–2 sockets (F/HT1) per node, 4 cores per socket (111,104 cores total)
–Core CPU: 2.3 GHz AMD Opteron
–Memory per core: 2 GB (DDR2-800)
 256 service & I/O nodes
 Local storage: ~10 PB, 200+ GB/s
 Interconnect: 3D torus, SeaStar 2.1 NIC
 Aggregate memory: 222 TB
 Peak performance: 1.0 PF
 150 cabinets, 3,400 ft² footprint, multi-megawatt power

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 4
Big Simulations for early 2008: GTC Science Goals and Impact
Science Goals
 Use GTC (classic) to analyze cascades and propagation in Collisionless Trapped Electron Mode (CTEM) turbulence
–Resolve the critical question of ρ* scaling of confinement in large tokamaks such as ITER; what are the consequences of departure from this scaling?
–Avalanches and turbulence spreading tend to break gyro-Bohm scaling, but zonal flows tend to restore it by shearing apart extended eddies: a competition
 Use GTC-S (shaped) to study electron temperature gradient (ETG) drift turbulence and compare against NSTX experiments
–NSTX has a spherical torus with a very low major-to-minor-radius aspect ratio and a strongly shaped cross-section
–NSTX experiments have produced very interesting high-frequency, short-wavelength modes; are these kinetic electron modes?
–ETG is a likely candidate, but only a fully nonlinear kinetic simulation with the exact shape and experimental profiles can address this
Science Impact
 Further the understanding of CTEM turbulence by validation against modulated ECH heat-pulse propagation studies on the DIII-D, JET, and Tore Supra tokamaks
–Is CTEM the key mechanism for electron thermal transport?
–Electron temperature fluctuation measurements will shed light
–Understand the role of the nonlinear dynamics of precession drift resonance in CTEM turbulence
 First direct comparison between simulation and experiment on ETG drift turbulence
–GTC-S possesses the right geometry and the right nonlinear physics to possibly resolve this
–Will help pinpoint the micro-turbulence activity responsible for energy loss through the electron channel in NSTX plasmas

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 5
GTC Early Application: Electron Microturbulence in Fusion Plasmas
 "Scientific discovery": the transition to favorable scaling of confinement for both ions and electrons is now observed in simulations of ITER plasmas: good news for ITER!
 Electron transport is less well understood but more important in ITER, since fusion products first heat the electrons
 Simulating electron turbulence is more demanding due to shorter time scales and smaller spatial scales
 A recent GTC simulation of electron turbulence used 28,000 cores for 42 hours in a dedicated run on Jaguar at ORNL, producing 60 TB of data currently being analyzed; the run pushed 15 billion particles for 4,800 major time cycles
[Figure: confinement scaling for ion transport and electron transport]

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 6
 3D fluid data analysis provides critical information for characterizing microturbulence, such as radial eddy size and eddy auto-correlation time
 The flux-surface electrostatic potential shows a ballooning structure
 Radial turbulence eddies have an average size of ~5 ion gyroradii

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 7
 From the SDM center*
–Workflow engine: Kepler
–Provenance support
–Wide-area data movement
 From universities
–Code coupling (Rutgers)
–Visualization (Rutgers)
 Newly developed technologies
–Adaptable I/O (ADIOS) (with Georgia Tech)
–Dashboard (with SDM center)
[Diagram: EFFIS components (Visualization, Code Coupling, Wide-area Data Movement, Dashboard, Workflow, Adaptable I/O, Provenance and Metadata) layered into Foundation Technologies and Enabling Technologies]
Approach: place highly annotated, fast, easy-to-use I/O methods in the code, which can be monitored and controlled; have a workflow engine record all of the information; visualize this on a dashboard; move desired data to the user's site; and have everything reported to a database.

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 8
 GTC
 EFFIS
 ADIOS
 Conclusions

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 9
 "Those fine fort.* files!"
 Multiple HPC architectures
–BlueGene, Cray, IB-based clusters
 Multiple parallel file systems
–Lustre, PVFS2, GPFS, Panasas, pNFS
 Many different APIs
–MPI-IO, POSIX, HDF5, netCDF
–GTC (fusion) has changed its IO routines 8 times so far, based on performance when moving to different platforms
 Different IO patterns
–Restarts, analysis, diagnostics
–Different combinations provide different levels of IO performance
 Goal: compensate for inefficiencies in the current IO infrastructures to improve overall performance

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 10
 Allows plug-ins for different I/O implementations
 Abstracts the API from the method used for I/O
 Simple API, almost as easy as an F90 write statement
 Best-practice, optimized IO routines for all supported transports "for free"
 Componentization
 Thin API
 XML file
–data groupings with annotation
–IO method selection
–buffer sizes
 Common tools
–Buffering
–Scheduling
 Pluggable IO routines
[Diagram: scientific codes call the ADIOS API, configured by external metadata (XML file); pluggable transport methods include MPI-CIO, LIVE/DataTap, MPI-IO, POSIX IO, pHDF-5, pnetCDF, viz engines, and others, with common buffering, scheduling, and feedback services]

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 11
 Simple API, very similar to standard Fortran or C POSIX IO calls
–As close to identical as possible for the C and Fortran APIs
–open, read/write, close form the core
–set_path, end_iteration, begin/end_computation, and init/finalize are the auxiliaries
 No changes in the API for different transport methods
 Metadata and configuration are defined in an external XML file, parsed once on startup
–Describe the various IO groupings, including attributes and hierarchical path structures for elements, as an adios-group
–Define the transport method used for each adios-group and give parameters for communication/writing/reading
–Change on a per-element basis what is written
–Change on a per-adios-group basis how the IO is handled
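To make the XML description above concrete, here is a minimal sketch of what such a configuration file can look like. It is not taken from the talk; the element and attribute names (adios-config, adios-group, var, method, buffer) follow the publicly documented ADIOS 1.x config.xml format, and the group name, variables, and buffer size are invented for illustration.

<?xml version="1.0"?>
<adios-config host-language="Fortran">
  <!-- one adios-group per logical output grouping (e.g., restart, diagnostics) -->
  <adios-group name="restart">
    <var name="NX" type="integer"/>
    <var name="temperature" type="double" dimensions="NX"/>
  </adios-group>
  <!-- transport method selected per group; changing it requires no source change -->
  <method group="restart" method="MPI"/>
  <!-- buffer size used by the common buffering layer -->
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>

Each adios-group here corresponds to one adios_open ... adios_close sequence in the application code; the full call sequence is shown in the Fortran and C examples near the end of the talk.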

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 12
 ADIOS is an IO componentization, which allows us to
–Abstract the API from the IO implementation
–Switch from synchronous to asynchronous IO at runtime
–Change from real-time visualization to fast IO at runtime
 Combines
–Fast I/O routines
–Ease of use
–Scalable architecture (from hundreds of cores toward millions of procs)
–QoS
–Metadata-rich output
–Visualization applied during simulations
–Analysis and compression techniques applied during simulations
–Provenance tracking

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 13
 ADIOS Fortran and C based API, almost as simple as standard POSIX IO
 External configuration to describe metadata and control IO settings
 Take advantage of existing IO techniques (no new native IO methods)
Fast, simple-to-write, efficient IO for multiple platforms without changing the source code

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 14
 Data groupings
–Logical groups of related items written at the same time
 Not necessarily one group per writing event
 IO methods
–Choose what works best for each grouping
–Vetted, improved, and/or written by experts for each
 POSIX (Wei-keng Liao, Northwestern)
 MPI-IO (Steve Hodson, ORNL)
 MPI-IO Collective (Wei-keng Liao, Northwestern)
 NULL (Jay Lofstead, GT)
 Ga Tech DataTap Asynchronous (Hasan Abbasi, GT)
 phdf5
 others (pnetcdf on the way)

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 15
 Specialty APIs
–HDF-5: complex API
–Parallel netCDF: no structure
 File-system-aware middleware
–MPI ADIO layer: file system connection, complex API
 Parallel file systems
–Lustre: metadata server issues
–PVFS2: client complexity
–LWFS: client complexity
–GPFS, pNFS, Panasas: may have other issues

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 16
 Platforms tested
–Cray CNL (ORNL Jaguar)
–Cray Catamount (SNL Red Storm)
–Linux Infiniband/Gigabit (ORNL Ewok)
–BlueGene/P now being tested/debugged
–Looking for future OS X support
 Native IO methods
–MPI-IO independent, MPI-IO collective, POSIX, NULL, Ga Tech DataTap asynchronous, Rutgers DART asynchronous, POSIX-NxM, phdf5, pnetcdf, kepler-db

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 17
 MPI-IO method
–GTC and GTS codes have achieved over 20 GB/sec on the Cray XT at ORNL
 30 GB diagnostic files every 3 minutes, 1.2 TB restart files every 30 minutes, 300 MB other diagnostic files every 3 minutes
 DART: <2% overhead for writing 2 TB/hour with the XGC code
 DataTap vs. POSIX
–1 file per process (POSIX)
–5 seconds for GTC computation
–~25 seconds for POSIX IO
–~4 seconds with DataTap

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 18
 June 7, 2008: 24-hour GTC run on Jaguar at ORNL
–93% of the machine (28,672 cores)
–MPI-OpenMP mixed model on quad-core nodes (7,168 MPI procs)
–Three interruptions in total (simple node failures), with multi-hour runs between restarts
–Wrote 65 TB of data at >20 GB/sec (25 TB for post analysis)
–IO overhead ~3% of wall-clock time
–Mixed IO methods (synchronous MPI-IO and POSIX IO) configured in the XML file

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 19
[Figure: Chimera IO performance (supernova code), 2x scaling. Plotted values are the minimum over 5 runs with 9 restarts/run; error bars show the maximum time for each method.]

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 20
Chimera Benchmark Results
 Why is ADIOS faster than pHDF5? ADIOS_MPI_IO vs. pHDF5 with the MPI independent IO driver
 [Table: per-function call counts and times, summed over all PEs (parallelism not shown)]
–ADIOS_MPI_IO profile: write, MPI_File_open, MPI_Recv, buffer_write, fopen, bp_calsize_stringtag, other (~40)
–pHDF5 profile: write, MPI_Bcast (sync), MPI_File_open, MPI_File_set_size, MPI_Comm_dup, H5P/H5D/etc., other (~20)
 Runs used 512 cores, 5 restart dumps
 Conversion time on 1 processor for the 2048-core job = 3.6 s (read) + 5.6 s (write) + 9.6 s (other) = 18.8 s

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 21
 A research transport to study asynchronous data movement
 Uses server-directed I/O to maintain high bandwidth and low overhead for data extraction
 I/O scheduling is performed to limit the perturbation caused by asynchronous I/O

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 22
 Due to perturbations caused by asynchronous I/O, the overall performance of the application may actually get worse
 We schedule the data movement using application state information to prevent asynchronous I/O from interfering with MPI communication
 800 GB of data
–With scheduled I/O, moving the data takes 2x longer, but the overhead on the application is 2x lower

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 23
XML configuration file: …
Fortran 90 code:
! initialize the system, loading the configuration file
adios_init ("config.xml", err)
! open a write path for that type
adios_open (h1, "output", "restart.n1", "w", err)
adios_group_size (h1, size, total_size, comm, err)
! write the data items
adios_write (h1, "g_NX", 1000, err)
adios_write (h1, "g_NY", 800, err)
adios_write (h1, "lo_x", x_offset, err)
adios_write (h1, "lo_y", y_offset, err)
adios_write (h1, "l_NX", x_size, err)
adios_write (h1, "l_NY", y_size, err)
adios_write (h1, "temperature", u, err)
! commit the writes for asynchronous transmission
adios_close (h1, err)
…
! do more work
! shut down the system at the end of my run
adios_finalize (mype, err)
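The body of the XML file referenced above did not survive transcription. As a hedged reconstruction only: the sketch below shows what a matching config could look like, using the group name ("output") and the variable names from the Fortran calls above. The element and attribute names follow the publicly documented ADIOS 1.x config.xml format; the types, the global-bounds usage, the buffer size, and the choice of the MPI method are assumptions, not details from the talk.

<?xml version="1.0"?>
<adios-config host-language="Fortran">
  <adios-group name="output">
    <!-- global and local dimensions plus offsets, written as plain integers -->
    <var name="g_NX" type="integer"/>
    <var name="g_NY" type="integer"/>
    <var name="lo_x" type="integer"/>
    <var name="lo_y" type="integer"/>
    <var name="l_NX" type="integer"/>
    <var name="l_NY" type="integer"/>
    <!-- the local patch of the global 2D temperature array -->
    <global-bounds dimensions="g_NX,g_NY" offsets="lo_x,lo_y">
      <var name="temperature" type="double" dimensions="l_NX,l_NY"/>
    </global-bounds>
  </adios-group>
  <!-- transport selection for this group; swap the method without touching the Fortran -->
  <method group="output" method="MPI"/>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>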

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 24
C code:
// parse the XML file and determine buffer sizes
adios_init ("config.xml");
// open and write the retrieved type
adios_open (&h1, "restart", "restart.n1", "w");
adios_group_size (h1, size, &total_size, comm);
adios_write (h1, "n", n);        // int n;
adios_write (h1, "mi", mi);      // int mi;
adios_write (h1, "zion", zion);  // float zion [10][20][30][40];
// write more variables...
// commit the writes for synchronous transmission or
// generally initiate the write for asynchronous transmission
adios_close (h1);
// do more work...
// shut down the system at the end of my run
adios_finalize (mype);
XML configuration file (fragment): … srv=ewok001.ccs.ornl.gov …
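Here too the XML body is mostly lost; only the fragment srv=ewok001.ccs.ornl.gov survives, which looks like a server parameter for an asynchronous transport (for example the Ga Tech DataTap method listed earlier in the talk). The sketch below is a guess at such a config; the method name, parameter syntax, types, and buffer size are all assumptions.

<?xml version="1.0"?>
<adios-config host-language="C">
  <adios-group name="restart">
    <var name="n" type="integer"/>
    <var name="mi" type="integer"/>
    <var name="zion" type="real" dimensions="10,20,30,40"/>
  </adios-group>
  <!-- asynchronous transport; parameters (here, an assumed staging-server setting) go in the element body -->
  <method group="restart" method="DATATAP">srv=ewok001.ccs.ornl.gov</method>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>

Switching this group between a synchronous method (previous slide) and an asynchronous one is just an edit to the method line; the Fortran and C code stays the same, which is the runtime sync-to-async switch claimed on slide 12.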

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 25
 Petascale GTC runs will produce 1 PB per simulation.
 Couple GTC with an edge code (core-edge coupling).
–4 PB of data per run.
–Can't store all of the GTC runs at ORNL unless we go to tape (12 days to pull the data from tape if we get 1 GB/sec).
–1.5 FTE looking at the data.
–Need more 'real-time' analysis of data.
–Workflows, data-in-transit (IO graphs), …?
 Can we create a staging area with "fat nodes"?
–Move data from the compute nodes to fat nodes over the HPC resource's network.
–Reduce data on the fat nodes.
–Allow users to "plug in" analysis routines on the fat nodes.
–How fat?
 Shared memory helps (we don't have to parallelize all analysis codes).
 Typical upper bound for the codes we studied: they write 1/20th of memory per core for analysis. Want 1/20th of the resources (5% overhead). Need 2x memory per core for analysis (2x memory overhead: in data + out data).
 On the Cray at ORNL this means roughly 750 sockets (quad-core) for fat memory, with 34 GB of shared memory.
 Also useful for codes that require memory but not as many nodes.
 Can we have shared memory on this portion?
 What are the other solutions?

HPC User Forum 9/10/08 Managed by UT-Battelle for the Department of Energy 26
 GTC is a code that is scaling to the petascale computers (BG/P, Cray XT).
 Recent changes bring new science and new IO (ADIOS).
 A major challenge going forward is speeding up the data analysis.
 ADIOS is an IO componentization.
–ADIOS is being integrated into Kepler.
–Achieved over 50% of peak IO performance for several codes on Jaguar.
–Can change IO implementations at runtime.
–Metadata is contained in an XML file.
 Petascale science starts with petascale applications.
–Need enabling technologies to scale.
–Need to rethink the ways we do science.