Computer and Computational Sciences Division Los Alamos National Laboratory Ideas that change the world Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems Kei Davis and Fabrizio Petrini Performance and Architectures Lab (PAL), CCS-3

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 2 CCS-3 P AL Overview
- In this part of the tutorial we will discuss the characteristics of some of the most powerful supercomputers.
- We classify these machines along three dimensions:
  - Node Integration: how processors and the network interface are integrated in a computing node
  - Network Integration: what primitive mechanisms the network provides to coordinate the processing nodes
  - System Software Integration: how the operating system instances are globally coordinated

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 3 CCS-3 P AL Overview
- We argue that the level of integration in each of the three dimensions, more than other parameters (such as distributed vs. shared memory, or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers.
- In this part of the tutorial we briefly characterize some existing and upcoming parallel computers.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 4 CCS-3 P AL ASCI Q: Los Alamos National Laboratory

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 5 CCS-3 P AL ASCI Q
- Total: 20 TF/s peak, #3 in the top 500
- Systems: 2048 AlphaServer ES45s
- 8,192 EV68 1.25-GHz CPUs with 16-MB cache
- Memory: 22 Terabytes
- System Interconnect: dual-rail Quadrics interconnect, 4096 QSW PCI adapters, four 1024-way QSW federated switches
- Operational in 2002

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 6 CCS-3 P AL Node: HP (Compaq) AlphaServer ES45 System Architecture (block diagram). Recoverable details: up to 32 GB of memory across four memory boards (MMB 0-3); 16 MB cache per CPU; two 256-bit, 125 MHz memory buses (4.0 GB/s each); quad C-chip controller; PCI I/O subsystem with 64-bit buses at 33 MHz (266 MB/s), 66 MHz (528 MB/s) and 500 MHz (4.0 GB/s, hot-swap slots); legacy serial/parallel, keyboard/mouse and floppy I/O.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 7 CCS-3 P AL QsNET: Quaternary Fat Tree. Hardware support for collective communication. MPI latency 4 µs, bandwidth 300 MB/s. Barrier latency less than 10 µs.
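Latency and bandwidth figures like these are typically obtained with a ping-pong microbenchmark. Below is a minimal sketch in C with MPI (not the benchmark actually used on this machine; the message sizes, repetition count and use of blocking MPI_Send/MPI_Recv are illustrative choices):

/* pingpong.c: estimate MPI point-to-point latency and bandwidth between
 * ranks 0 and 1.  Minimal sketch; message sizes and repetition count are
 * illustrative, not those of the original benchmark. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    for (long size = 1; size <= 1 << 20; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8ld bytes  latency %8.2f us  bandwidth %8.2f MB/s\n",
                   size, t * 1e6, size / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}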

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 8 CCS-3 P AL Interconnection Network (diagram): a federated fat tree of 64U64D switches (1st through 16th node groups) organized in switch, mid and super-top levels; the diagram shows 1024 nodes (2x = 2048 nodes).

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 9 CCS-3 P AL System Software
- Operating system is Tru64
- Nodes are organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
- Resource management is executed through Ethernet (RMS)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 10 CCS-3 P AL ASCI Q: Overview
- Node Integration: Low (multiple boards per node, network interface on the I/O bus)
- Network Integration: High (HW support for atomic collective primitives)
- System Software Integration: Medium/Low (TruCluster)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 11 CCS-3 P AL ASCI Thunder, 1,024 Nodes, 23 TF/s peak

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 12 CCS-3 P AL ASCI Thunder, Lawrence Livermore National Laboratory 1,024 Nodes, 4096 Processors, 23 TF/s, #2 in the top 500

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 13 CCS-3 P AL ASCI Thunder: Configuration
- 1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total)
- 2.5 µs, 912 MB/s MPI latency and bandwidth over Quadrics Elan4
- Barrier synchronization 6 µs, allreduce 15 µs
- 75 TB of local disk (73 GB/node UltraSCSI320)
- Lustre file system with 6.4 GB/s delivered parallel I/O performance
- Linux RH 3.0, SLURM, Chaos

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 14 CCS-3 P AL CHAOS: Clustered High Availability Operating System
- Derived from Red Hat, but differs in the following areas:
  - Modified kernel (Lustre and hardware specific)
  - New packages for cluster monitoring, system installation, power/console management
  - SLURM, an open-source resource manager

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 15 CCS-3 P AL ASCI Thunder: Overview
- Node Integration: Medium/Low (network interface on the I/O bus)
- Network Integration: Very High (HW support for atomic collective primitives)
- System Software Integration: Medium (Chaos)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 16 CCS-3 P AL System X: Virginia Tech

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 17 CCS-3 P AL System X
- 1100 dual Apple G5 2 GHz CPU based nodes; 8 billion operations/second per processor (8 GFlops) peak double-precision floating-point performance
- Each node has 4 GB of main memory and 160 GB of Serial ATA storage (176 TB total secondary storage)
- Infiniband interconnect: 8 µs latency and 870 MB/s bandwidth, partial support for collective communication
- System-level fault tolerance (Déjà vu)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 18 CCS-3 P AL System X: Overview
- Node Integration: Medium/Low (network interface on the I/O bus)
- Network Integration: Medium (limited support for atomic collective primitives)
- System Software Integration: Medium (system-level fault tolerance)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 19 CCS-3 P AL BlueGene/L System packaging hierarchy:
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 20 CCS-3 P AL BlueGene/L Compute ASIC (block diagram): two PowerPC 440 cores (one CPU, one I/O processor), each with L1 caches and a "double FPU"; multiported shared SRAM buffer and L2; shared 4 MB EDRAM L3 cache (or memory) with ECC and a shared L3 directory; 144-bit-wide DDR controller with ECC (256/512 MB); torus interface (6 out and 6 in, each at 1.4 Gbit/s); tree interface (3 out and 3 in, each at 2.8 Gbit/s); global interrupt (4 global barriers or interrupts); Gbit Ethernet; JTAG access. IBM CU-11, 0.13 µm process; 11 x 11 mm die; 25 x 32 mm CBGA; 474 pins (328 signal); 1.5/2.5 Volt.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 21 CCS-3 P AL

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 22 CCS-3 P AL (photo) 16 compute cards, 2 I/O cards, DC-DC converters (40 V to 1.5 V and 2.5 V).

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 23 CCS-3 P AL

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 24 CCS-3 P AL BlueGene/L Interconnection Networks
- 3-Dimensional Torus: interconnects all compute nodes (65,536); virtual cut-through hardware routing; 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node); 350/700 GBytes/s bisection bandwidth; communications backbone for computations
- Global Tree: one-to-all broadcast functionality; reduction operations functionality; 2.8 Gb/s of bandwidth per link; latency of a tree traversal on the order of 5 µs; interconnects all compute and I/O nodes (1024)
- Ethernet: incorporated into every node ASIC; active in the I/O nodes (1:64); all external communication (file I/O, control, user interaction, etc.)
- Low-Latency Global Barrier: 8 single wires crossing the whole system, touching all nodes
- Control Network (JTAG): for booting, checkpointing, error logging
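As a back-of-the-envelope check (our arithmetic, derived only from the figures quoted above, assuming the bisection cuts the longest dimension of the 64x32x32 torus):

\[
\text{links across the cut} = 2 \times (32 \times 32) = 2048 \ \text{per direction (the factor 2 is the torus wrap-around)},
\]
\[
B_{\text{bisection}} \approx 2048 \times 1.4\ \text{Gb/s} \approx 358\ \text{GB/s one way} \;\; (\approx 717\ \text{GB/s both ways}),
\]

in agreement with the 350/700 GBytes/s figure quoted for the torus.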

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 25 CCS-3 P AL BlueGene/L System Software Organization
- Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK)
- I/O nodes run Linux and provide O/S services: file access, process launch/termination, debugging
- Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 26 CCS-3 P AL Operating Systems
- Compute nodes: CNK
  - Specialized, simple O/S: about 5000 lines of code, 40 KBytes in core
  - No thread support, no virtual memory
  - Protection: protects the kernel from the application; some network devices in user space
  - File I/O offloaded ("function shipped") to the I/O nodes through kernel system calls
  - "Boot, start the app and then stay out of the way"
- I/O nodes: Linux
  - Linux kernel (2.6 underway) with ramdisk
  - NFS/GPFS client
  - CIO daemon to start/stop jobs and execute file I/O
- Global O/S (CMCS, service node)
  - Invisible to user programs
  - Global and collective decisions
  - Interfaces with external policy modules (e.g., the job scheduler)
  - Commercial database technology (DB2) stores static and dynamic state: partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
  - Scalability, robustness, security
- Execution mechanisms in the core; policy decisions in the service node

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 27 CCS-3 P AL BlueGene/L: Overview
- Node Integration: High (the processing node integrates processors and network interfaces, with the network interfaces directly connected to the processors)
- Network Integration: High (separate tree network)
- System Software Integration: Medium/High (compute kernels are not globally coordinated)
- #2 and #4 in the top500

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 28 CCS-3 P AL Cray XD1

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 29 CCS-3 P AL Cray XD1 System Architecture
- Compute: 12 AMD Opteron 32/64-bit x86 processors; high-performance Linux
- RapidArray Interconnect: 12 communications processors; 1 Tb/s switch fabric
- Active Management: dedicated processor
- Application Acceleration: 6 co-processors
- Processors directly connected to the interconnect

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 30 CCS-3 P AL Cray XD1 Processing Node (chassis diagram): six 2-way SMP blades, six SATA hard drives, four independent PCI-X slots, a 500 Gb/s crossbar switch with 12-port inter-chassis connector, a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector, and 4 fans (chassis front and rear views).

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 31 CCS-3 P AL Cray XD1 Compute Blade (diagram): two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 32 CCS-3 P AL Fast Access to the Interconnect (comparison diagram): the Cray XD1 Opteron reaches memory at 6.4 GB/s (DDR) and reaches the RapidArray interconnect directly, whereas a Xeon server reaches memory at 5.3 GB/s (DDR) and reaches GigE through a 1 GB/s PCI-X bus.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 33 CCS-3 P AL Communications Optimizations: RapidArray Communications Processor
- HT/RA tunnelling with bonding
- Routing with route redundancy
- Reliable transport
- Short message latency optimization
- DMA operations
- System-wide clock synchronization
(Diagram: AMD Opteron 2XX processor connected to the RapidArray communications processor at 3.2 GB/s, with two 2 GB/s RapidArray links.)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 34 CCS-3 P AL Active Management Software (Active Manager System)
- Usability: single-system command and control
- Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing; synchronized Linux kernels

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 35 CCS-3 P AL Cray XD1: Overview
- Node Integration: High (direct access from HyperTransport to RapidArray)
- Network Integration: Medium/High (HW support for collective communication)
- System Software Integration: High (compute kernels are globally coordinated)
- Early stage

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 36 CCS-3 P AL ASCI Red STORM

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 37 CCS-3 P AL Red Storm Architecture
- Distributed-memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
- 108 compute node cabinets and 10,368 compute node processors (AMD Opteron, 2.0 GHz)
- ~10 TB of DDR 333 MHz memory
- Red/Black switching: ~1/4, ~1/2, ~1/4
- 8 service and I/O cabinets on each end (256 processors for each color)
- 240 TB of disk storage (120 TB per color)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 38 CCS-3 P AL Red Storm Architecture
- Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
- Partitioned Operating System (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
- Separate RAS and system management network (Ethernet)
- Router table-based routing in the interconnect

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 39 CCS-3 P AL Red Storm architecture (diagram): users connect to the service partition; the machine is partitioned into net I/O, service, compute, and file I/O (/home) sections.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 40 CCS-3 P AL System Layout (27 x 16 x 24 mesh) (diagram): normally unclassified and normally classified sections at the two ends, with switchable nodes and disconnect cabinets in between.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 41 CCS-3 P AL Red Storm System Software
- Run-Time System: logarithmic loader; fast, efficient node allocator; batch system (PBS); libraries (MPI, I/O, Math)
- File systems being considered include PVFS (interim file system), Lustre (Pathforward support), Panasas, ...
- Operating Systems: Linux on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; Linux on RAS nodes

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 42 CCS-3 P AL ASCI Red Storm: Overview
- Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
- Network Integration: Medium (no support for collective communication)
- System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)
- Expected to become the most powerful machine in the world (competition permitting)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 43 CCS-3 P AL Overview (summary of the per-machine assessments above)

Machine      | Node Integration | Network Integration | Software Integration
-------------|------------------|---------------------|---------------------
ASCI Q       | Low              | High                | Medium/Low
ASCI Thunder | Medium/Low       | Very High           | Medium
System X     | Medium/Low       | Medium              | Medium
BlueGene/L   | High             | High                | Medium/High
Cray XD1     | High             | Medium/High         | High
Red Storm    | High             | Medium              | Medium/High

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 44 CCS-3 P AL A Case Study: ASCI Q
- We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer.
- Our hands-on experience with ASCI Q shows that the system software and global coordination are fundamental in a large-scale parallel machine.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 45 CCS-3 P AL ASCI Q
- 2,048 ES45 AlphaServers, with 4 processors/node
- 16 GB of memory per node
- 8,192 processors in total
- 2 independent network rails, Quadrics Elan3
- > 8192 cables
- 20 Tflops peak, #2 in the top 500 lists
- A complex human artifact

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 46 CCS-3 P AL Dealing with the complexity of a real system
- In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.
- This methodology is based on an arsenal of analytical models, custom microbenchmarks, full applications, and discrete event simulators.
- We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 47 CCS-3 P AL Overview
- Our performance expectations for ASCI Q, and the reality
- Identification of performance factors: application performance and its breakdown into components
- Detailed examination of system effects: a methodology to identify operating system effects; the effect of scaling, up to 2000 nodes / 8000 processors; quantification of the impact
- Towards the elimination of overheads: demonstrated over 2x performance improvement
- Generalization of our results: application resonance
- Bottom line: the importance of integrating the various system software activities across nodes

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 48 CCS-3 P AL Performance of SAGE on 1024 nodes
- Performance is consistent across QA and QB (the two segments of ASCI Q, with 1024 nodes / 4096 processors each)
- Measured time is 2x greater than the model predicts (4096 PEs). There is a difference; why? (Lower is better.)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 49 CCS-3 P AL Using fewer PEs per Node Test performance using 1,2,3 and 4 PEs per node Lower is better!

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 50 CCS-3 P AL Using fewer PEs per node (2) Measurements match model almost exactly for 1,2 and 3 PEs per node! Performance issue only occurs when using 4 PEs per node

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 51 CCS-3 P AL Mystery #1 SAGE performs significantly worse on ASCI Q than was predicted by our model

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 52 CCS-3 P AL SAGE performance components: look at SAGE in terms of its main components, Put/Get (point-to-point boundary exchange) and Collectives (allreduce, broadcast, reduction). The performance issue seems to occur only on collective operations.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 53 CCS-3 P AL Performance of the collectives
- Measure collective performance separately, with 4 processes per node
- The collectives (e.g., allreduce and barrier) mirror the performance of the application
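A minimal sketch of this kind of collective microbenchmark in C with MPI (not the exact code run on ASCI Q; the iteration count and the single-element MPI_SUM payload are illustrative choices):

/* collectives.c: average time of MPI_Allreduce and MPI_Barrier.
 * Illustrative sketch; run with 4 processes per node to mimic the tests above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int reps = 10000;
    double in = rank, out;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_allreduce = (MPI_Wtime() - t0) / reps;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t_barrier = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("%d processes: allreduce %.2f us, barrier %.2f us\n",
               nprocs, t_allreduce * 1e6, t_barrier * 1e6);
    MPI_Finalize();
    return 0;
}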

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 54 CCS-3 P AL Identifying the problem within SAGE (diagram): simplify from the full application (SAGE) down to the allreduce operation.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 55 CCS-3 P AL Exposing the problems with simple benchmarks (diagram): start from allreduce microbenchmarks and add complexity. Challenge: identify the simplest benchmark that exposes the problem.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 56 CCS-3 P AL Interconnection network and communication libraries
- The initial (obvious) suspects were the interconnection network and the MPI implementation
- We tested in depth the network, the low-level transmission protocols and several allreduce algorithms
- We also implemented allreduce in the Network Interface Card
- By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
- But we only got small improvements in SAGE (5%)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 57 CCS-3 P AL Mystery #2 Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to a small performance improvement

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 58 CCS-3 P AL Computational noise n After having ruled out the network and MPI we focused our attention on the compute nodes n Our hypothesis is that the computational noise is generated inside the processing nodes n This noise “freezes” a running process for a certain amount of time and generates a “computational” hole

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 59 CCS-3 P AL Computational noise: intuition (diagram): running 4 processes (P0-P3) on all 4 processors of an AlphaServer ES45; the computation of one process is interrupted by an external event (e.g., a system daemon or the kernel).

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 60 CCS-3 P AL Computational noise: 3 processes on 3 processors (diagram): running 3 processes (P0-P2) on 3 processors of an AlphaServer ES45 leaves one processor idle; the "noise" can run on the 4th processor without interrupting the other 3 processes.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 61 CCS-3 P AL Coarse grained measurement (timeline diagram): we execute a computational loop for 1,000 seconds on all 4,096 processors of QB and compare the start and end times across processes.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 62 CCS-3 P AL Coarse grained computational overhead per process n The slowdown per process is small, between 1% and 2.5% lower is better

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 63 CCS-3 P AL Mystery #3 Although the “noise” hypothesis could explain SAGE’s suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 64 CCS-3 P AL Fine grained measurement n We run the same benchmark for 1000 seconds, but we measure the run time every millisecond n Fine granularity representative of many ASCI codes
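The fine-grained measurement can be sketched as follows (a minimal illustration in C with MPI, not the exact kernel used in the study: work(), calibrate() and the output file name noise.<rank>.dat are our own illustrative choices). Each process calibrates a nominally 1 ms compute chunk, runs it back-to-back for the duration of the experiment, and records how long each chunk actually took; any sample well above 1 ms is a chunk that was delayed by system activity.

/* noise.c: measure per-chunk slowdown of a nominally 1 ms compute granule.
 * Sketch of the fine-grained measurement described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;

static void work(long iters)            /* pure computation: no I/O, no comms */
{
    double x = 1.0;
    for (long i = 0; i < iters; i++)
        x = x * 1.0000001 + 1e-7;
    sink = x;
}

static long calibrate(double target_s)  /* find an iteration count that takes ~target_s */
{
    long iters = 1000;
    for (;;) {
        double t0 = MPI_Wtime();
        work(iters);
        double dt = MPI_Wtime() - t0;
        if (dt >= target_s)
            return (long)(iters * target_s / dt);
        iters *= 2;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const double granule = 1e-3;        /* 1 ms chunks, as in the experiment */
    const long   samples = 1000000;     /* ~1000 s of measurement per process */
    long iters = calibrate(granule);

    double *dt = malloc(samples * sizeof(double));
    for (long i = 0; i < samples; i++) {
        double t0 = MPI_Wtime();
        work(iters);
        dt[i] = MPI_Wtime() - t0;       /* > granule means the chunk was delayed */
    }

    /* Dump per-process samples; per-node histograms are built off-line. */
    char name[64];
    snprintf(name, sizeof name, "noise.%d.dat", rank);
    FILE *f = fopen(name, "w");
    for (long i = 0; i < samples; i++)
        fprintf(f, "%.9f\n", dt[i]);
    fclose(f);

    free(dt);
    MPI_Finalize();
    return 0;
}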

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 65 CCS-3 P AL Fine grained computational overhead per node
- We now compute the slowdown per node, rather than per process
- The noise has a clear, per-cluster structure (optimum is 0; lower is better)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 66 CCS-3 P AL Finding #1 Analyzing noise on a per-node basis reveals a regular structure across nodes

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 67 CCS-3 P AL Noise in a 32-Node Cluster
- The Q machine is organized in 32-node clusters (TruCluster)
- In each cluster there is a cluster manager (node 0), a quorum node (node 1) and the RMS data collection (node 31)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 68 CCS-3 P AL Per-node noise distribution
- We plot the distribution of one million 1 ms computational chunks
- In an ideal, noiseless machine the distribution is a single bar at 1 ms containing 1 million points per process (4 million per node)
- Every outlier identifies a computation that was delayed by external interference
- We show the distributions for the standard cluster nodes, and also for nodes 0, 1 and 31

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 69 CCS-3 P AL Cluster Nodes (2-30): 10% of the time, the execution of the 1 ms chunk of computation is delayed.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 70 CCS-3 P AL Node 0, Cluster Manager n We can identify 4 main sources of noise

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 71 CCS-3 P AL Node 1, Quorum Node n One source of heavyweight noise (335 ms!)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 72 CCS-3 P AL Node 31: many fine-grained interruptions, between 6 and 8 milliseconds.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 73 CCS-3 P AL The effect of the noise
- An application is usually a sequence of computation phases, each followed by a synchronization (collective)
- If an event delays a single node, it can therefore affect all the other nodes
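In skeleton form, the pattern just described is the loop below (a minimal sketch; compute_granule() is a hypothetical stand-in for roughly 1 ms of application work). Because MPI_Allreduce cannot complete until the last process arrives, a delay on any one node stretches the step for all of them.

/* bsp_step.c: skeleton of a bulk-synchronous step (compute + collective).
 * compute_granule() stands in for ~1 ms of application work (illustrative). */
#include <mpi.h>
#include <stdio.h>

static double compute_granule(void)
{
    volatile double x = 1.0;
    for (long i = 0; i < 200000; i++)    /* tune for roughly a millisecond of work */
        x = x * 1.0000001 + 1e-7;
    return x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    const int nsteps = 10000;
    double t0 = MPI_Wtime(), local, global;
    for (int step = 0; step < nsteps; step++) {
        local = compute_granule();       /* may be stretched by local noise */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        /* the step ends only when the slowest process has arrived */
    }
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("avg step time: %.3f ms\n", (MPI_Wtime() - t0) / nsteps * 1e3);
    MPI_Finalize();
    return 0;
}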

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 74 CCS-3 P AL Effect of System Size n The probability of a random event occurring increases with the node count
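A one-line model makes this concrete (a sketch under the assumption that noise hits each process independently with probability p during a single compute granule): the chance that at least one of N processes is hit, and therefore that the next collective is delayed, is

\[
P_{\text{delay}}(N) = 1 - (1 - p)^{N}, \qquad \text{e.g. } p = 0.001,\ N = 4096:\ P_{\text{delay}} \approx 0.98,
\]

so even rare per-process events become near-certain per-iteration delays at the full scale of the machine.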

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 75 CCS-3 P AL Tolerating Noise: Buffered Coscheduling (BCS) We can tolerate the noise by coscheduling the activities of the system software on each node

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 76 CCS-3 P AL Discrete Event Simulator: used to model noise
- The DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise
- The noise model closely approximates the experimental data
- The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64) (lower is better)
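The flavor of that simulation can be conveyed with a much simpler toy model (this is not the authors' DES; the node count, harmonic periods, durations and per-node placement below are illustrative placeholders). Every node advances in fixed 1 ms granules, periodic noise events stretch the granule on the nodes they hit, and each collective completes only when the slowest node arrives.

/* noise_sim.c: toy model of bulk-synchronous progress under periodic noise.
 * Each step, every node computes 'granule' seconds plus any noise events that
 * fire in its window; the collective finishes when the slowest node does. */
#include <stdio.h>

#define NODES 1024

typedef struct { double period, duration; int every; } Harmonic;

/* 'every': the harmonic runs on every k-th node (1 = all nodes).
 * These values are illustrative placeholders, not measured data. */
static const Harmonic noise[] = {
    { 0.010, 0.0003, 1  },   /* fine-grained, high-frequency, on all nodes        */
    { 1.000, 0.335,  32 },   /* heavyweight, infrequent, one node per 32-node cluster */
};

int main(void)
{
    const double granule = 0.001;   /* 1 ms of computation between collectives */
    const int steps = 100000;
    double global_t = 0.0;
    const double ideal = steps * granule;

    for (int s = 0; s < steps; s++) {
        double slowest = granule;
        for (int n = 0; n < NODES; n++) {
            double delay = granule;
            double t0 = global_t + n * 7.3e-5;      /* arbitrary per-node phase offset */
            for (unsigned h = 0; h < sizeof noise / sizeof noise[0]; h++) {
                if (n % noise[h].every != 0) continue;
                /* noise events fire at multiples of 'period'; count the ones
                 * falling inside this node's compute window */
                long k0 = (long)(t0 / noise[h].period) + 1;
                long k1 = (long)((t0 + granule) / noise[h].period);
                if (k1 >= k0)
                    delay += (k1 - k0 + 1) * noise[h].duration;
            }
            if (delay > slowest) slowest = delay;
        }
        global_t += slowest;        /* the collective waits for the slowest node */
    }
    printf("ideal %.1f s, with noise %.1f s, slowdown %.2fx\n",
           ideal, global_t, global_t / ideal);
    return 0;
}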

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 77 CCS-3 P AL Finding #2 On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 78 CCS-3 P AL Incremental noise reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, niff)
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s)
3. Moved several daemons from nodes 1 and 2 to node 0 on each cluster

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 79 CCS-3 P AL Improvements in the Barrier Synchronization Latency

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 80 CCS-3 P AL Resulting SAGE Performance: nodes 0 and 31 were also configured out in the optimization.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 81 CCS-3 P AL Finding #3 We were able to double SAGE’s performance by selectively removing noise caused by several types of system activities

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 82 CCS-3 P AL Generalizing our results: application resonance
- The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that hurts it.
- Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application.
- Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude.
- Performance can be enhanced by selectively removing sources of noise.
- We can provide a reasonable estimate of the performance improvement knowing the computational granularity of a given application.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 83 CCS-3 P AL Cumulative Noise Distribution, Sequence of Barriers with No Computation: most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 84 CCS-3 P AL Conclusions
- A combination of measurement, simulation and modeling was used to identify and resolve performance issues on Q:
  - Modeling was used to determine that a problem exists
  - Computation kernels were developed to quantify O/S events
  - The effect increases with the number of nodes
  - The impact is determined by the computational granularity of an application
- Application performance has significantly improved
- The method is also being applied to other large systems

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 85 CCS-3 P AL About the authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL), where he is currently working on system software solutions for the reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.
Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa. Before his appointment at LANL he was a research fellow of the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff of the Hewlett-Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.