Exascale Computing: Challenges and Opportunities Ahmed Sameh and Ananth Grama NNSA/PRISM Center, Purdue University

Path to Exascale
– Hardware Evolution
– Key Challenges for Hardware
– System Software
  – Runtime Systems
  – Programming Interfaces / Compilation Techniques
– Algorithm Design
– DoE's Efforts in Exascale Computing

Hardware Evolution
– Processor/Node Architectures
– Coprocessors
  – SIMD units (GPGPUs)
  – FPGAs
– Memory/I/O Considerations
– Interconnects

Processor/Node Architectures – Intel Platforms: The Sandy Bridge Architecture. Up to 8 cores (16 threads), up to 3.8 GHz (Turbo Boost), DDR memory at 51 GB/s, 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3.

Processor/Node Architectures – Intel Platforms: Knights Corner (MIC). Over 50 cores, each operating at about 1.2 GHz and supported by 512-bit vector processing units, 8 MB of cache, and four threads per core. It can be coupled with up to 2 GB of GDDR5 memory and is manufactured in a 22 nm process.
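For context, a rough peak double-precision estimate under stated assumptions (50 cores, 1.2 GHz, 512-bit vectors holding 8 doubles, and one fused multiply-add per lane per cycle; the FMA assumption is ours, not from the slide):

\[
50 \times 1.2\ \text{GHz} \times 8\ \text{lanes} \times 2\ \tfrac{\text{FLOPs}}{\text{lane}} \approx 960\ \text{GFLOPS} \approx 1\ \text{TFLOPS (DP)}.
\]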

Processor/ Node Architectures AMD Platforms

Processor/Node Architectures – AMD Platforms: Llano APU. Four x86 cores (Stars architecture), 1 MB L2 per core, and an on-chip GPU with 480 stream processors.

Processor/Node Architectures – IBM POWER7. Eight cores (32 threads), up to 4.25 GHz, 32 KB L1 (2 cycles), 256 KB L2 (8 cycles), 32 MB of L3 (embedded DRAM), and up to 100 GB/s of memory bandwidth.

Coprocessor/GPU Architectures – NVIDIA Fermi (GeForce 590)/Kepler/Maxwell. Sixteen streaming multiprocessors (SMs), each with 32 stream processors (512 CUDA cores total), 48 KB of shared memory per SM, 768 KB L2, 772 MHz core clock, 3 GB GDDR5, 1.6 TFLOPS peak (single precision).
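As a consistency check under one assumption not stated on the slide (Fermi's shader clock runs at twice the 772 MHz core clock, roughly 1.54 GHz), the single-precision peak works out to:

\[
512\ \text{cores} \times 2\ \tfrac{\text{FLOPs}}{\text{cycle}} \times 1.544\ \text{GHz} \approx 1.58\ \text{TFLOPS}.
\]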

Coprocessor/FPGA Architectures – Xilinx/Altera/Lattice Semiconductor FPGAs typically interface to PCI/PCIe buses and can accelerate compute-intensive applications by orders of magnitude.

Petascale Parallel Architectures: Blue Waters
– IH server node: 8 QCMs (256 cores), 8 TF (peak), 1 TB memory, 4 TB/s memory bandwidth, 8 hub chips, power supplies, PCIe slots, fully water cooled.
– Quad-chip module (QCM): 4 POWER7 chips, 128 GB memory, 512 GB/s memory bandwidth, 1 TF (peak).
– Hub chip: 1,128 GB/s bandwidth.
– POWER7 chip: 8 cores, 32 threads; L1, L2, and L3 cache (32 MB); up to 256 GF (peak); 128 GB/s memory bandwidth; 45 nm technology.
– Blue Waters building block: 32 IH server nodes, 256 TF (peak), 32 TB memory, 128 TB/s memory bandwidth, 4 storage systems (>500 TB), 10 tape drive connections.

Petascale Parallel Architectures: Blue Waters Each MCM has a hub/switch chip. The hub chip provides 192 GB/s to the directly connected POWER7 MCM; 336 GB/s to seven other nodes in the same drawer on copper connections; 240 GB/s to 24 nodes in the same supernode (composed of four drawers) on optical connections; 320 GB/s to other supernodes on optical connections; and 40 GB/s for general I/O, for a total of 1,128 GB/s peak bandwidth per hub chip. System interconnect is a fully connected two-tier network. In the first tier, every node has a single hub/switch that is directly connected to the other 31 hub/switches in the same supernode. In the second tier, every supernode has a direct connection to every other supernode.
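The quoted per-hub total is simply the sum of the listed links:

\[
192 + 336 + 240 + 320 + 40 = 1{,}128\ \text{GB/s}.
\]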

Petascale Parallel Architectures: Blue Waters I/O and Data Archive Systems
– Storage subsystems
  – On-line disks: > 18 PB (usable)
  – Archival tapes: up to 500 PB
– Sustained disk transfer rate: > 1.5 TB/s
– Fully integrated storage system: GPFS + HPSS

Petascale Parallel Architectures: XT6. Each blade has two Gemini interconnect chips (at the back of the blade) and four two-socket server nodes with their associated memory banks. Up to 192 cores go into a rack, and 2,304 cores into a system cabinet (12 racks), for roughly 20 TFLOPS per cabinet. The largest current installation is a 20-cabinet system at Edinburgh (roughly 360 TFLOPS).

Current Petascale Platforms

System attribute          | ORNL Jaguar (#1) | NCSA Blue Waters | LLNL Sequoia
Vendor (model)            | Cray (XT5)       | IBM (PERCS)      | IBM (BG/Q)
Processor                 | AMD Opteron      | IBM POWER7       | PowerPC
Peak perf. (PF)           | 2.3              | ~10              | ~20
Sustained perf. (PF)      |                  | ≳ 1              |
Cores/chip                | 6                | 8                | 16
Processor cores           | 224,256          | > 300,000        | > 1.6M
Memory (TB)               | 299              | ~1,200           | ~1,600
On-line disk storage (PB) | 5                | > 18             | ~50
Disk transfer (TB/s)      | 0.24             | > 1.5            |
Archival storage (PB)     | 20               | up to 500        |

Dunning et al. 2010

Heterogeneous Platforms: Tianhe-1A
14,336 Xeon X5670 processors and 7,168 NVIDIA Tesla M2050 general-purpose GPUs. Theoretical peak performance of about 4.7 petaFLOPS. 112 compute cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets. Each compute cabinet is composed of four frames, each frame containing eight blades plus a 16-port switching board. Each blade is composed of two nodes, with each compute node containing two six-core Xeon X5670 processors and one NVIDIA M2050 GPU. 2 PB of disk and 262 TB of RAM. The Arch interconnect links the server nodes together using optical-electric cables in a hybrid fat-tree configuration. The switch at the heart of Arch has a bi-directional bandwidth of 160 Gb/s, a latency of 1.57 microseconds per node hop, and an aggregate bandwidth of more than 61 Tb/s.
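The component counts are mutually consistent:

\[
112 \times 4 \times 8 \times 2 = 7{,}168\ \text{nodes}, \qquad
7{,}168 \times 2 = 14{,}336\ \text{Xeons}, \qquad
7{,}168 \times 1 = 7{,}168\ \text{GPUs}.
\]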

Heterogeneous Platforms: RoadRunner 13K Cell processors, 6500 Opteron 2210 processors, 103 TB RAM, 1.3 PFLOPS.

From 20 to 1000 PFLOPS
Several critical issues must be addressed in hardware, systems software, algorithms, and applications:
– Power (GFLOPS/W; see the estimate below)
– Fault tolerance (MTBF and high component count)
– Runtime systems, programming models, compilation
– Scalable algorithms
– Node performance (especially in view of limited memory)
– I/O (especially in view of limited I/O bandwidth)
– Heterogeneity (application composition)
– Application-level fault tolerance
– (and many, many others)
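As a rough yardstick (the 20 MW power envelope is a commonly cited DoE target, not a number from these slides), sustaining an exaflop within that budget requires roughly:

\[
\frac{10^{18}\ \text{FLOPS}}{20 \times 10^{6}\ \text{W}} = 50\ \text{GFLOPS/W}.
\]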

Exascale Hardware Challenges
DARPA Exascale Technology Study [Kogge et al.]
– Evolutionary strawmen
  – "Heavyweight" strawman based on commodity-derived microprocessors
  – "Lightweight" strawman based on custom microprocessors
– Aggressive strawman
  – "Clean sheet of paper" CMOS silicon

Exascale Hardware Challenges. Supply voltages are unlikely to decrease significantly, and processor clocks are unlikely to increase significantly.

Exascale Hardware Challenges

Current HPC System Characteristics [Kogge]
– Power distribution: Processors 56%, Routers 33%, Memory 9%, Random 2%
– Silicon area distribution: Memory 86%, Random 8%, Processors 3%, Routers 3%
– Board area distribution: White space 50%, Processors 24%, Memory 10%, Routers 8%, Random 8%

Exascale Hardware Challenges

Faults and Fault Tolerance: estimated chip counts in exascale systems; failures in current terascale systems.

Faults and Fault Tolerance: failures in time (FIT, failures per 10^9 hours) for a current Blue Gene system.

Faults and Fault Tolerance: the projected mean time to interrupt for a 220K-socket system in 2015 is, in the best case, about 24 minutes!

Faults and Fault Tolerance: even at one socket failure every 10 years on average (!), application utilization drops toward 0% at 220K sockets!
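The 24-minute figure follows directly from dividing the per-socket MTBF by the socket count (assuming independent, identically distributed failures):

\[
\text{MTTI} \approx \frac{10\ \text{years}}{220{,}000\ \text{sockets}}
\approx \frac{10 \times 8{,}766\ \text{h}}{220{,}000}
\approx 0.4\ \text{h} \approx 24\ \text{min}.
\]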

So what do we learn?
– Power is a major consideration.
– Faults and fault tolerance are major issues.
– For these reasons, an evolutionary path to exascale is unlikely to succeed.
– Constraints on power density constrain processor speed, thus emphasizing concurrency.
– The level of concurrency needed to reach exascale is projected to exceed 10^9 cores.

DoE’s View of Exascale Platforms

Exascale Computing Challenges: Programming Models, Compilers, and Runtime Systems
– Is CUDA/Pthreads/MPI the programming model of choice? Unlikely, considering heterogeneity.
– Partitioned global address space (PGAS) models
– One-sided communications (which often underlie PGAS; see the sketch below)
– Node performance (autotuning libraries)
– Novel models (e.g., fault-oblivious programming models)
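As a concrete illustration of the one-sided style that PGAS runtimes typically build on, here is a minimal MPI-3 RMA sketch (our own example, not from the slides): each rank exposes a window and writes one value directly into its right neighbor's memory, without the neighbor posting a matching receive.

```c
/* One-sided (RMA) communication sketch: each rank puts its rank id
 * into the exposed window of its right neighbor.
 * Build: mpicc rma_demo.c -o rma_demo    Run: mpirun -np 4 ./rma_demo */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int recv_buf = -1;                  /* memory exposed to remote puts */
    MPI_Win win;
    MPI_Win_create(&recv_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % size;      /* target of the one-sided write */

    MPI_Win_fence(0, win);              /* open access epoch */
    MPI_Put(&rank, 1, MPI_INT,          /* origin data: my rank id */
            right, 0, 1, MPI_INT, win); /* target rank, displacement 0 */
    MPI_Win_fence(0, win);              /* close epoch; puts are now visible */

    printf("rank %d received %d from its left neighbor\n", rank, recv_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```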

Exascale Computing Challenges: Algorithms and Performance
Need for extreme scalability (10^8 cores and beyond).
– Consideration 0: Amdahl! Speedup is limited by 1/s, where s is the serial fraction of the computation (see the bound below).
– Consideration 1: Useful work at each processor must amortize overhead. Overhead (communication, synchronization) typically increases with the number of processors; in that case, constant work per processor (weak scaling) does not amortize the overhead, resulting in reduced efficiency.
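In standard notation (s the serial fraction, p the processor count, W the useful work, and T_o(p) the total overhead; the symbols are ours), the two considerations read:

\[
S(p) \;=\; \frac{1}{s + \dfrac{1-s}{p}} \;\le\; \frac{1}{s},
\qquad
E(p) \;=\; \frac{S(p)}{p} \;=\; \frac{1}{1 + T_o(p)/W},
\]

so efficiency falls whenever the overhead T_o(p) grows faster than the work W assigned to the machine.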

Exascale Computing Challenges – Algorithms and Performance: Scaling
Memory constraints fundamentally limit scaling, placing the emphasis on strong-scaling performance.
Key challenges:
– Reducing global communication
– Increasing locality in a hierarchical fashion (off-chip, off-blade, off-rack, off-cluster); a simple stencil example appears below
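A hedged illustration of why locality dominates under strong scaling: for a 3-D grid of N points partitioned into cubic blocks across p processes (our simplified model, not from the slides), per-process computation shrinks like N/p while the halo exchanged with neighbors shrinks only like (N/p)^{2/3}, so

\[
\frac{\text{communication}}{\text{computation}}
\;\sim\; \frac{(N/p)^{2/3}}{N/p}
\;=\; \left(\frac{p}{N}\right)^{1/3},
\]

which grows as p increases at fixed N; hence the need to reduce global communication and exploit hierarchical locality.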

Exascale Computing Challenges – Algorithms: Dealing with Faults
– Hardware and system software mechanisms for fault tolerance may be inadequate (system-level checkpointing is infeasible given limited I/O bandwidth).
– Application checkpointing may not be feasible either.
– Can we design algorithms that are inherently oblivious to faults? (A toy sketch follows.)
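To make the idea concrete, here is a toy, purely illustrative sketch (our own, not an algorithm from the slides): a Jacobi sweep for -u'' = 1 on a 1-D grid in which a random fraction of point updates is "lost" each sweep, as if the owning component had failed. The stale values are simply reused and the iteration keeps converging, only more slowly, which is the flavor of fault-oblivious iterative methods.

```c
/* Toy fault-oblivious Jacobi iteration: dropped updates keep stale values. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 128            /* interior grid points */
#define SWEEPS 20000
#define DROP_RATE 0.05   /* fraction of updates "lost" per sweep */

int main(void) {
    double u[N + 2] = {0.0}, unew[N + 2] = {0.0};
    const double h = 1.0 / (N + 1), f = 1.0;
    srand(1234);

    for (int sweep = 0; sweep < SWEEPS; ++sweep) {
        for (int i = 1; i <= N; ++i) {
            if ((double)rand() / RAND_MAX < DROP_RATE)
                unew[i] = u[i];      /* update lost: reuse the stale value */
            else
                unew[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f);
        }
        for (int i = 1; i <= N; ++i) u[i] = unew[i];
    }

    /* residual norm of -u'' = f as a convergence check */
    double res = 0.0;
    for (int i = 1; i <= N; ++i) {
        double r = f - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
        res += r * r;
    }
    printf("residual norm after %d sweeps with %.0f%% dropped updates: %.3e\n",
           SWEEPS, 100.0 * DROP_RATE, sqrt(res));
    return 0;
}
```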

Exascale Computing Challenges: Input/Output and Data Analysis
– Constrained I/O bandwidth
– Unfavorable secondary-storage-to-RAM ratio
– High latencies to remote disks
– Optimizations through the system interconnect
– Integrated data analytics

Exascale Computing Challenges

Exascale Computing Challenges

Exascale Consortia and Projects: DoE Workshops
– Challenges for Understanding the Quantum Universe and the Role of Computing at the Extreme Scale (Dec '08)
– Forefront Questions in Nuclear Science and the Role of Computing at the Extreme Scale (Jan '09)
– Science-Based Nuclear Energy Systems Enabled by Advanced Modeling and Simulation at the Extreme Scale (May '09)
– Opportunities in Biology at the Extreme Scale of Computing (Aug '09)
– Discovery in Basic Energy Sciences: The Role of Computing at the Extreme Scale (Aug '09)
– Architectures and Technology for Extreme Scale Computing (Dec '09)
– Cross-Cutting Technologies for Computing at the Exascale Workshop (Feb '10)
– The Role of Computing at the Extreme Scale / National Security (Aug '10)

DoE's Exascale Investments: Driving Applications

DoE’s Approach to Exascale Computations

Scope of DoE’s Exascale Initiative

Budget 2012

Thank you!