Exascale Evolution
Brad Benton, IBM
March 15, 2010

Agenda
Exascale Challenges
On the Path to Exascale: A Look at Blue Waters

Exascale Challenges

Exascale Challenges
Challenges at every level of system design
  –Managing 500M to 1B (most likely heterogeneous) cores
  –Programming models to exploit multi-core + accelerators
  –Interconnect
    How will IB/RC scale to exascale?
    How do we “get off the bus”?
    How can we put more capability in the interconnect?
  –Power Management
    Power vs. Performance tradeoffs

Exascale Challenges
Challenges at every level of system design
  –Resilience/Fault-Tolerance
    At this scale, something will always be broken or in the process of breaking
  –Development Environment/Performance Tuning
  –Workflow Management/Process Steering
  –Data Management/Storage/Visualization

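A back-of-the-envelope sketch of why something is always broken at this scale. It assumes independent failures at a constant rate, so the system-level mean time between failures is the per-node figure divided by the node count. The five-year node MTBF and the node counts are hypothetical illustration values, not figures from this talk.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF under the assumption of independent, constant-rate node failures."""
    return node_mtbf_hours / node_count


# Hypothetical input: each node fails on average once every five years.
node_mtbf = 5 * HOURS_PER_YEAR

for nodes in (1_000, 100_000, 1_000_000):
    minutes = system_mtbf_hours(node_mtbf, nodes) * 60
    print(f"{nodes:>9,} nodes: one failure every {minutes:,.1f} minutes on average")
```

Under these assumed inputs the interval between failures shrinks from days to minutes as the node count grows, which is the motivation for treating resilience as a design requirement rather than an exception path.
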
Exascale Challenges
Resiliency/Fault-Tolerance
  –F/T Model
    Fault Detection
    Fault Isolation
    Fault Containment
    Fault Recovery
    Re-integration
  –Software Resiliency
    More than just checkpoint/restart
    Containers/virtualization suspend/migrate/resume

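The five F/T model stages can be read as an ordered pipeline. The sketch below is one possible way to express that ordering; the `handle_fault` function, the handler callbacks, and the component name are hypothetical and do not describe any IBM or Blue Waters implementation.

```python
from enum import Enum, auto


class FaultStage(Enum):
    """Stages in the order listed on the slide."""
    DETECTION = auto()
    ISOLATION = auto()
    CONTAINMENT = auto()
    RECOVERY = auto()
    REINTEGRATION = auto()


# Enum members iterate in definition order, which gives the pipeline order.
PIPELINE = list(FaultStage)


def handle_fault(component, handlers):
    """Run each stage's handler in order and return the stages that completed.

    Processing stops at the first stage that has no handler or whose handler
    returns False, so a later stage never runs before an earlier one succeeds.
    """
    completed = []
    for stage in PIPELINE:
        handler = handlers.get(stage)
        if handler is None or not handler(component):
            break
        completed.append(stage)
    return completed


# Hypothetical handlers that always succeed, for illustration only.
handlers = {stage: (lambda component: True) for stage in FaultStage}
stages = handle_fault("node-42", handlers)
print([stage.name for stage in stages])
```

Stopping at the first failed stage reflects the idea that a component should not be re-integrated until recovery has actually succeeded.
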
Programming Models
MPI
  –Will it survive in an exascale world? (Its demise was predicted at petascale, but it seems to be doing okay)
Evolve hybrid language models: MPI + “What?”
  –OpenMP
  –GPU Accelerators (CUDA, OpenCL)
  –PGAS languages
Greater Exploitation of Autotuning, i.e., programs that write programs
  –ATLAS
  –FFTW
  –IBM HPC Toolkit has some of this

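A minimal sketch of the autotuning idea: time several variants of a kernel on the target machine and keep the fastest. This only illustrates empirical parameter search in the spirit of ATLAS and FFTW; it is not how either library is implemented, and `blocked_sum`, `autotune`, and the candidate block sizes are hypothetical names and values.

```python
import timeit


def blocked_sum(data, block):
    """Toy tunable kernel: sum a list in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time the kernel for each candidate parameter and return the fastest one."""
    timings = {}
    for block in candidates:
        runs = timeit.repeat(lambda: kernel(data, block), number=5, repeat=repeats)
        timings[block] = min(runs)  # best-of-N reduces timing noise
    best = min(timings, key=timings.get)
    return best, timings


data = list(range(100_000))
candidates = [16, 256, 4096]
best_block, timings = autotune(blocked_sum, data, candidates)
print(f"fastest block size on this machine: {best_block}")
```

The winning block size depends on the machine the search runs on, which is the point: the tuning decision is made by measurement rather than fixed in the source.
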
On the Path to Exascale: A Look at Blue Waters

NCSA Blue Waters
Joint effort between NCSA and University of Illinois
First deliverable of a system based on PERCS technology (2011)
Will be the world’s first sustained petascale system for open scientific research
For more detailed information: http://

Blue Waters Overview
Approximately 10 PF/s peak
More than 300,000 cores (homogeneous)
More than 1 Petabyte memory
More than 10 Petabytes disk storage
More than 0.5 Exabyte archival storage
More than 1 PF/s sustained on scientific applications

Building Blue Waters
Power7 Chip
  –8 cores, 32 threads
  –L1, L2, L3 cache (32 MB)
  –Up to 256 GF (peak)
  –45 nm technology
Multi-chip Module
  –4 Power7 chips
  –128 GB memory
  –512 GB/s memory bandwidth
  –1 TF (peak)
Router
  –1,128 GB/s bandwidth
IH Server Node
  –8 MCMs (256 cores)
  –1 TB memory
  –8 TF (peak)
  –Fully water cooled
Blue Waters Building Block
  –32 IH server nodes
  –32 TB memory
  –256 TF (peak)
  –4 Storage systems
  –10 Tape drive connections
Blue Waters
  –~1 PF sustained
  –>300,000 cores
  –>1 PB of memory
  –>10 PB of disk storage
  –~500 PB of archival storage
  –>100 Gbps connectivity
Blue Waters is built from components that can also be used to build systems with a wide range of capabilities, from deskside to beyond Blue Waters.
Blue Waters will be the most powerful computer in the world for scientific research when it comes on line in Summer of
CI Days, 22 February 2010, University of Kentucky

Power7 Chip: Computational Heart of Blue Waters
Base Technology
  –45 nm, 576 mm²
  –1.2 B transistors
Chip
  –8 cores
  –12 execution units/core
  –1, 2, 4 way SMT/core
  –Up to 4 FMAs/cycle
  –Caches
    32 KB I, D-cache, 256 KB L2/core
    32 MB L3 (private/shared)
  –Dual DDR3 memory controllers
    128 GB/s peak memory bandwidth (1/2 byte/flop)
  –Clock range of 3.5 – 4 GHz
[Figure: Quad-chip MCM; Power7 Chip]

High-End Server Resilience

Feeds and Speeds per MCM
32 cores
8 Flop/cycle per core
4 threads per core max
3.5 – 4 GHz
1 TF/s
32 MB L3 per chip
512 GB/s memory BW (0.5 Byte/flop)
800 W (0.8 W per GF/s)

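The per-MCM figures can be cross-checked with simple arithmetic. The sketch below uses only numbers from the slides (32 cores, 8 flop/cycle per core, 3.5 to 4 GHz, 512 GB/s, 800 W). The power ratio comes out in watts per GF/s, since 800 W spread over roughly 1 TF/s is about 0.8 W per GF/s.

```python
CORES_PER_MCM = 32
FLOPS_PER_CYCLE_PER_CORE = 8   # 4 FMAs per cycle, each counted as 2 flops
MEM_BW_GB_S = 512
POWER_W = 800


def peak_gflops(clock_ghz):
    """Peak GF/s for one MCM at the given clock rate."""
    return CORES_PER_MCM * FLOPS_PER_CYCLE_PER_CORE * clock_ghz


for clock in (3.5, 4.0):
    gf = peak_gflops(clock)
    print(f"{clock} GHz: {gf:.0f} GF/s peak, "
          f"{MEM_BW_GB_S / gf:.2f} byte/flop, "
          f"{POWER_W / gf:.2f} W per GF/s")
```

At the top of the clock range the peak is 1,024 GF/s, which the slide rounds to 1 TF/s, and the memory balance is 0.5 byte/flop.
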
First Level Interconnect
L-Local
HUB to HUB Copper Wiring
256 Cores
ONE DRAWER: 8 MCMs, 32 chips, 256 cores

Interconnect: 1.1 TB/s HUB
192 GB/s Host Connection
336 GB/s to 7 other local nodes in the same drawer
240 GB/s to local-remote nodes in the same supernode (4 drawers)
320 GB/s to remote nodes
40 GB/s to general purpose I/O

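The 1.1 TB/s headline is the sum of the five link classes listed on this slide, and it matches the 1,128 GB/s router figure on the Building Blue Waters slide. A quick check follows; the per-neighbor figure assumes the 336 GB/s is split evenly across the 7 local links, which the slide does not state.

```python
hub_links_gb_s = {
    "host connection": 192,
    "7 other local nodes in the same drawer": 336,
    "local-remote nodes in the same supernode": 240,
    "remote nodes": 320,
    "general purpose I/O": 40,
}

total_gb_s = sum(hub_links_gb_s.values())
print(f"total hub bandwidth: {total_gb_s} GB/s (~{total_gb_s / 1000:.1f} TB/s)")

# Assumption: the drawer-local bandwidth is divided evenly among the 7 neighbors.
per_local_link_gb_s = hub_links_gb_s["7 other local nodes in the same drawer"] / 7
print(f"per drawer-local neighbor: {per_local_link_gb_s:.0f} GB/s")
```
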
Second Level Interconnect
Optical ‘L-Remote’ Links from HUB
Construct Super Node (4 CECs)
1,024 Cores
ONE SUPERNODE: 4 drawers, 32 MCMs, 128 chips, 1024 cores

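The packaging hierarchy rolls up by multiplication. The sketch below reproduces the drawer and supernode counts from the slides. The supernode count at the end is only a lower bound implied by "more than 300,000 cores"; it is not a configuration given in this talk.

```python
import math

# Counts taken from the slides.
CORES_PER_CHIP = 8
CHIPS_PER_MCM = 4
MCMS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4

cores_per_mcm = CORES_PER_CHIP * CHIPS_PER_MCM
cores_per_drawer = cores_per_mcm * MCMS_PER_DRAWER
cores_per_supernode = cores_per_drawer * DRAWERS_PER_SUPERNODE
mcms_per_supernode = MCMS_PER_DRAWER * DRAWERS_PER_SUPERNODE
chips_per_supernode = CHIPS_PER_MCM * mcms_per_supernode

print(f"MCM: {cores_per_mcm} cores, drawer: {cores_per_drawer} cores, "
      f"supernode: {cores_per_supernode} cores")

# Lower bound only: fewest whole supernodes that hold 300,000 cores.
min_supernodes = math.ceil(300_000 / cores_per_supernode)
print(f"at least {min_supernodes} supernodes for 300,000 cores")
```
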
Rack Components
BPA
  –200 to 480 Vac
  –370 to 575 Vdc
  –Redundant Power
  –Direct Site Power Feed
  –PDU Elimination
WCU
  –Facility Water Input
  –100% Heat to Water
  –Redundant Cooling
  –CRAH Eliminated
Storage Unit
  –4U
  –0-6 / Rack
  –Up To 384 SFF DASD / Unit
  –File System
CECs
  –2U
  –1-12 CECs/Rack
  –256 Cores
  –128 SN DIMM Slots / CEC
  –8, 16, (32) GB DIMMs
  –17 PCI-e Slots
  –Imbedded Switch
  –Redundant DCA
  –NW Fabric
  –Up to: 3072 cores, 24.6 TB (49.2 TB)
Rack
  –990.6 mm w (39” w) x 72” d x 83” h
  –~2948 kg (~6500 lbs)
Compute, Storage, Switch
100% Cooling, PDU Eliminated
Input: 8 Water Lines, 4 Power Cords
Out: ~100 TFLOPs / 24.6 TB / 153.5 TB, 192 PCI-e 16x / 12 PCI-e 8x

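The rack maxima follow from the per-CEC figures at 12 CECs per rack. In the sketch below, the 8 TF per CEC is carried over from the IH server node figure on the Building Blue Waters slide, on the assumption that a CEC and an IH server node are the same 256-core drawer. The reading of 17 PCI-e slots as 16 x16 plus 1 x8 per CEC is likewise an inference from the rack totals, not something the slide states.

```python
CECS_PER_RACK = 12
CORES_PER_CEC = 256
DIMM_SLOTS_PER_CEC = 128
PCIE_SLOTS_PER_CEC = 17
PEAK_TF_PER_CEC = 8  # assumption: one CEC equals one IH server node (8 TF peak)


def rack_memory_tb(dimm_gb):
    """Rack memory in TB (1 TB = 1000 GB) with every DIMM slot populated."""
    return CECS_PER_RACK * DIMM_SLOTS_PER_CEC * dimm_gb / 1000


max_cores = CECS_PER_RACK * CORES_PER_CEC
pcie_slots = CECS_PER_RACK * PCIE_SLOTS_PER_CEC
peak_tf = CECS_PER_RACK * PEAK_TF_PER_CEC

print(f"cores: {max_cores}")
print(f"memory: {rack_memory_tb(16):.1f} TB with 16 GB DIMMs, "
      f"{rack_memory_tb(32):.1f} TB with 32 GB DIMMs")
print(f"PCI-e slots: {pcie_slots} (slide lists 192 x16 + 12 x8)")
print(f"peak: {peak_tf} TF (slide rounds to ~100 TFLOPs)")
```
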
How does this affect OFA?
Blue Waters can connect externally via PCIe devices (e.g., InfiniBand) as needed
Blue Waters interconnect
  –Is RDMA based
  –Is not InfiniBand (or iWARP or RoCEE)
  –Has hardware support for Global Shared Memory
Pendulum is swinging back to proprietary interconnects (at least at IBM)
Is there a path to OFA compatibility?
  –How can/should OFA accept/support new/different RDMA interconnects?
  –How can/should IBM work w/OFA for embracing new interconnect technologies?

Exascale Evolution
Technical evolution is not always in a straight line
Different technologies evolve at different times and rates
  –e.g., Blue Waters is not a direct descendant of RoadRunner/Cell, but rather of POWER/Federation/SP
Reaching exascale levels will require the consolidation and continued evolution of multiple technologies