Presentation transcript:

Slide 1: 100 TF Sustained on Cray X Series
Oak Ridge National Laboratory, U.S. Department of Energy
The Center for Computational Sciences
SOS 8, April 13, 2004
James B. White III (Trey)

Slide 2: Disclaimer
 The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.

Slide 3: Disclaimer (cont.)
 Graph-free, chart-free environment
 For graphs and charts

Slide 4: Real 100 TF on Cray Xn
 Who needs capability computing?
 Application requirements
 Why Xn?
 Laundry, Clean and Otherwise
 Rants
  Custom vs. Commodity
  MPI
  CAF
  Cray

Slide 5: Who needs capability computing?
 OMB?
 Politicians?
 Vendors?
 Center directors?
 Computer scientists?

Slide 6: Who needs capability computing?
 Application scientists
 According to the scientists themselves

Slide 7: Personal Communications
 Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
 Climate: LANL, NCAR, ORNL, PNNL
 Materials: Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
 Biology: NCI, ORNL, PNNL
 Chemistry: Auburn, LANL, ORNL, PNNL
 Astrophysics: Arizona, Chicago, NC State, ORNL, Tennessee

Slide 8: Scientists Need Capability
 Climate scientists need simulation fidelity to support policy decisions
  All we can say now is that humans cause warming
 Fusion scientists need to simulate fusion devices
  All we can do now is model decoupled subprocesses at disparate time scales
 Materials scientists need to design new materials
  Just starting to reproduce known materials

Slide 9: Scientists Need Capability
 Biologists need to simulate proteins and protein pathways
  Baby steps with smaller molecules
 Chemists need similar increases in complexity
 Astrophysicists need to simulate nucleosynthesis (high-res 3D CFD, 6D neutrinos, long times)
  Now: low-res 3D CFD, approximate 3D neutrinos, short times

Slide 10: Why Scientists Might Resist
 Capacity is also needed
 Software isn't ready
 They get coerced to run capability-sized jobs on inappropriate systems

Slide 11: Capability Requirements
 Sample DOE SC applications
  Climate: POP, CAM
  Fusion: AORSA, Gyro
  Materials: LSMS, DCA-QMC

Slide 12: Parallel Ocean Program (POP)
 Baroclinic
  3D, nearest-neighbor, scalable
  Memory-bandwidth limited
 Barotropic
  2D implicit system, latency bound
 Ocean-only simulation
  Higher resolution
  Faster time steps
 As the ocean component of CCSM
  Atmosphere dominates

Slide 13: Community Atmosphere Model (CAM)
 Atmosphere component of CCSM
 Higher resolution?
  Physics changes: parameterizations must be retuned, the model must be revalidated
  A major effort, and a rare event
 Spectral transform not dominant
 Dramatic increases in computation per grid point
  Dynamic vegetation, carbon cycle, atmospheric chemistry, …
 Faster time steps

Slide 14: All-Orders Spectral Algorithm (AORSA)
 Radio-frequency fusion-plasma simulation
 Highly scalable
 Dominated by ScaLAPACK
 Still in the weak-scaling regime
 But…
  Expanded physics is reducing ScaLAPACK dominance
  A sparse formulation is in development

Slide 15: Gyro
 Continuum gyrokinetic simulation of fusion-plasma microturbulence
 1D data decomposition
 Spectral method with high communication volume (see the sketch below)
 Some need for increased resolution
 More iterations
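In a 1D-decomposed spectral code, much of that communication volume comes from transposing the decomposed dimension so each stage of the transform has its data locally. A minimal C/MPI sketch of such a transpose step, assuming one contiguous block per rank (illustrative only, not Gyro's actual decomposition or source):

    #include <mpi.h>

    /* Redistribute a 1D-decomposed array so the other dimension becomes
       local: every rank exchanges one block with every other rank. The
       total volume scales with the full field size, so this step is
       bandwidth-heavy as well as latency-sensitive. */
    void transpose_exchange(double *local_blocks, double *transposed,
                            int block_len, MPI_Comm comm)
    {
        MPI_Alltoall(local_blocks, block_len, MPI_DOUBLE,
                     transposed, block_len, MPI_DOUBLE, comm);
    }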

Slide 16: Locally Self-Consistent Multiple Scattering (LSMS)
 Calculates the electronic structure of large systems
 One atom per processor
 Dominated by local DGEMM (sketched below)
 First real application to sustain a TF
 But… moving to a sparse formulation with a distributed solve for each atom
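"Dominated by local DGEMM" means most of the wall time sits in BLAS-3 matrix multiplies like the call sketched below (a hedged illustration using the CBLAS interface, not LSMS source). DGEMM does on the order of n^3 arithmetic on n^2 data, so it reuses each operand many times and can run near peak on vector and superscalar processors alike.

    #include <cblas.h>

    /* C = A * B for n-by-n column-major matrices: the BLAS-3 kernel
       pattern that dominates LSMS. High data reuse keeps it compute
       bound rather than memory-bandwidth bound. */
    void local_matmul(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n, B, n,
                    0.0, C, n);
    }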

Slide 17: Dynamic Cluster Approximation (DCA-QMC)
 Simulates high-temperature superconductors
 Dominated by DGER (BLAS 2); see the sketch below
 Memory-bandwidth limited
 Quantum Monte Carlo, but…
  Fixed start-up cost per process
  Favors fewer, faster processors
  Needs powerful processors to avoid parallelizing each Monte Carlo stream
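The contrast with the BLAS-3 kernel above is exactly why this code is memory-bandwidth limited: a DGER rank-1 update touches every element of the matrix while doing only about two flops per element. A minimal CBLAS sketch (illustrative, not DCA-QMC source):

    #include <cblas.h>

    /* A = A + alpha * x * y^T (DGER, BLAS 2). Each element of A is
       loaded and stored once for roughly two flops, so sustained
       performance tracks memory bandwidth rather than peak flops. */
    void rank1_update(int n, double alpha,
                      const double *x, const double *y, double *A)
    {
        cblas_dger(CblasColMajor, n, n, alpha, x, 1, y, 1, A, n);
    }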

Slide 18: Few DOE SC Applications
 Weak-ish scaling
 Dense linear algebra
 But moving to sparse

Slide 19: Many DOE SC Applications
 "Strong-ish" scaling
  Limited increase in grid points
  Major increase in expense per grid point
  Major increase in time steps
 Fewer, more-powerful processors
 High memory bandwidth
 High-bandwidth, low-latency communication

Slide 20: Why X1?
 "Strong-ish" scaling
  Limited increase in grid points
  Major increase in expense per grid point
  Major increase in time steps
 Fewer, more-powerful processors
 High memory bandwidth
 High-bandwidth, low-latency communication

Slide 21: Tangent: Strongish* Scaling
 Firm
 Semistrong
 Unweak
 Strongoidal
 MSTW (More Strong Than Weak)
 JTSoS (Just This Side of Strong)
 WNS (Well-Nigh Strong)
 Seak, Steak, Streak, Stroak, Stronk
 Weag, Weng, Wong, Wrong, Twong
* Greg Lindahl, Vendor Scum

Slide 22: X1 for 100 TF Sustained?
 Uh, no
 The OS is not scalable or fault-resilient enough for 10^4 processors
 That "price/performance" thing
 That "power & cooling" thing

Slide 23: Xn for 100 TF Sustained
 For DOE SC applications, YES
 Most-promising candidate, or
 Least-implausible candidate

Slide 24: Why X, again?
 Most-powerful processors
  Reduce the need for scalability
  Obey Amdahl's Law (spelled out below)
 High memory bandwidth
  See above
 Globally addressable memory
  Lowest, most hide-able latency
  Scale latency-bound applications
 High interconnect bandwidth
  Scale bandwidth-bound applications
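The Amdahl's Law argument for fewer, more powerful processors, written out explicitly (standard formulation with illustrative numbers, not figures from the talk):

    % Speedup on N processors when a fraction p of the work parallelizes:
    S(N) = \frac{1}{(1 - p) + p/N}
    % Illustrative numbers, with p = 0.99:
    %   N = 10^4 slow processors:  S \approx 1/(0.01 + 0.000099) \approx 99
    %   N = 10^3 processors, each 10x faster: parallel speedup \approx 91,
    %   for a net \approx 10 \times 91 \approx 910 versus one slow processor.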

Slide 25: The Bad News
 Scalar performance
 "Some tuning required"
 Ho-hum MPI latency
 See Rants

Slide 26: Scalar Performance
 Compilation is slow
 Amdahl's Law for single processes
  Parallelization -> vectorization
 Hard to port GNU tools
  GCC? Are you kidding?
  GCC compatibility, on the other hand…
 Black Widow will be better

Slide 27: "Some Tuning Required"
 Vectorization requires
  Independent operations
  Dependence information
  Mapping to vector instructions
 Applications take a wide spectrum of steps to inhibit this
 May need a couple of compiler directives (example below)
 May need extensive rewriting
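When "a couple of compiler directives" are enough, it is usually because the loop really is independent but the compiler cannot prove it. A hedged C sketch is below; the directive spelling is compiler-specific (`#pragma _CRI ivdep` is the Cray C form, `!dir$ ivdep` the Fortran form), and the function and array names are made up for illustration.

    /* The compiler must assume a[] and b[] might overlap, which creates
       a potential dependence and can block vectorization. The directive
       asserts that the iterations are independent. */
    void scale_add(double *a, const double *b, double s, int n)
    {
    #pragma _CRI ivdep
        for (int i = 0; i < n; i++)
            a[i] += s * b[i];
    }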

Slide 28: Application Results
 Awesome
 Indifferent
 Recalcitrant
 Hopeless

Slide 29: Awesome Results
 The 256-MSP X1 is already showing unique capability
 Apps bound by memory bandwidth, interconnect bandwidth, or interconnect latency
 POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, …
 Many examples from DoD

Slide 30: Indifferent Results
 Cray X1 is brute-force fast, but not cost-effective
 Dense linear algebra
 Linpack, AORSA, LSMS

Slide 31: Recalcitrant Results
 The inherent algorithms are fine
 Source code or ongoing code mods don't vectorize
 Significant code rewriting done, ongoing, or needed
 CLM, CAM, Nimrod, M3D

Slide 32: Aside: How to Avoid Vectorization
 Use pointers to add false dependencies (illustrated below)
 Put deep call stacks inside loops
 Put debug I/O operations inside compute loops
 Did I mention using pointers?
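A small illustration of the first item (hypothetical C code, not from any application named above): with plain pointers the compiler must assume `out` may alias `in` and serialize the loop; C99 `restrict` (or a directive like the one above) removes the false dependence.

    /* Possible aliasing: out[] may overlap in[], so the compiler sees a
       potential loop-carried dependence and may refuse to vectorize. */
    void smooth_aliased(double *out, double *in, int n)
    {
        for (int i = 1; i < n - 1; i++)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }

    /* restrict promises the buffers do not overlap; the same loop now
       vectorizes cleanly. */
    void smooth_restrict(double *restrict out, const double *restrict in, int n)
    {
        for (int i = 1; i < n - 1; i++)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }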

Slide 33: Aside: Software Design
 In general, we don't know how to systematically design efficient, maintainable HPC software
 Vectorization imposes constraints on software design
  Bad: existing software must be rewritten
  Good: the resulting software is often faster on modern superscalar systems
 "Some tuning required" for the X series
  Bad: you must tune
  Good: tuning is systematic, not a black art
 Vectorization "constraints" may help us develop effective design patterns for HPC software

Slide 34: Hopeless Results
 Dominated by unvectorizable algorithms
 Some benchmark kernels of questionable relevance
 No known DOE SC applications

Slide 35: Summary
 DOE SC scientists do need 100 TF and beyond of sustained application performance
 The Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond

Slide 36: "Custom" Rant
 "Custom vs. commodity" is a red herring
  CMOS is commodity
  Memory is commodity
  Wires are commodity
 Cooling is independent of vector vs. scalar
  PNNL is liquid-cooling clusters
  Vector systems may move to air cooling
 All vendors do custom packaging
 The real issue: software

Slide 37: MPI Rant
 Latency-bound apps are often limited by MPI_Allreduce(…, MPI_SUM, …), not by ping-pong (see the sketch below)
 An excellent abstraction that is eminently optimizable
 Some apps are limited by point-to-point
 Remote load/store implementations (CAF, UPC) have performance advantages over MPI
  But MPI could be implemented using load/store, inlined, and optimized
  On the other hand, it is easier to avoid pack/unpack with a load/store model
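The pattern in question is the tiny global reduction at the heart of many implicit solvers, for example the dot products in a conjugate-gradient iteration. A minimal C sketch (a hypothetical routine, not POP source):

    #include <mpi.h>

    /* Global dot product: each rank reduces its local piece, then a
       single 8-byte MPI_Allreduce combines the result across all ranks.
       The cost is essentially latency plus reduction-tree depth --
       not bandwidth, and not the same thing as ping-pong latency. */
    double global_dot(const double *x, const double *y, int nlocal,
                      MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < nlocal; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }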

Slide 38: Co-Array Fortran Rant
 There is no such thing as one-sided communication
  It's all two-sided: send+receive, sync+put+sync, sync+get+sync (illustrated below)
  Same parallel algorithms
 CAF mods can be highly nonlocal
  Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
  That is rarely the case for MPI
 We use CAF to avoid MPI-implementation performance inadequacies
  Avoiding nonlocality by cheating with Cray pointers
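The "sync+put+sync" point holds in any one-sided model; since the examples here are in C, it is illustrated below with MPI-2 one-sided operations rather than CAF (a hedged sketch, not taken from any application above). Both the origin and the target participate in the fences, so the communication is still logically two-sided.

    #include <mpi.h>

    /* A "one-sided" put still needs matching synchronization on both
       sides: every rank in the window's communicator calls the fences. */
    void put_block(double *win_buf, int win_len,
                   double *send_buf, int count,
                   int target_rank, MPI_Comm comm)
    {
        MPI_Win win;
        MPI_Win_create(win_buf, (MPI_Aint)win_len * sizeof(double),
                       sizeof(double), MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);                 /* sync */
        MPI_Put(send_buf, count, MPI_DOUBLE,   /* put  */
                target_rank, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                 /* sync */

        MPI_Win_free(&win);
    }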

Slide 39: Cray Rant
 The Cray XD1 (OctigaBay) follows in the tradition of the T3E

Slide 40: Cray Rant
 The Cray XD1 (OctigaBay) follows in the tradition of the T3E
 Very promising architecture
 Dumb name
 An interesting competitor to Red Storm

Slide 41: Questions?
James B. White III (Trey)