Principles of Scalable HPC System Design
March 6, 2012
Sue Kelly, Sandia National Laboratories
Abstract: Sandia National Laboratories has a long history of successfully applying high performance computing (HPC) technology to solve scientific problems. We drew upon our experiences with numerous architectural and design features when planning our most recent computer systems. This talk will present the key issues that were considered. Important principles are performance balance between the hardware components and scalability of the system software. The talk will conclude with lessons learned from the system deployments.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Outline
A definition of HPC for scientific applications
Design Principles
–Partition Model
–Network Topology
–Balance of Hardware Components
–Scalable System Software
Lessons Learned
What is High Performance Computing?
(n.) A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors.
Will not talk about embarrassingly parallel applications
The idea/premise of scientific parallel processing is not new
The Partition Model: Match the hardware & software to its function
Applies to both hardware and software
Physically and logically divide the system into functional units
Compute hardware has a different configuration than service & I/O hardware
Only run the software necessary to perform the function
Usage Model: Partitions cooperate to appear as one system Linux Login (Service) Node Compute Resource I/O
Mesh/Torus topologies are scalable 12,960 Compute Node Mesh X=27 Y=20 Z=24 Torus Interconnect in Z 310 Service & I/O Nodes
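As a rough illustration of why this topology scales, the sketch below lists the direct neighbors of a node in a 27 x 20 x 24 arrangement. It assumes X and Y are plain mesh dimensions and only Z wraps around, which is one reading of "Torus Interconnect in Z"; the names DIMS, WRAPS, and neighbors are invented for this example.

```python
# Dimensions from the slide: 27 x 20 x 24 = 12,960 compute nodes.
DIMS = (27, 20, 24)
# Assumption: X and Y are mesh dimensions (no wraparound); only Z is a torus.
WRAPS = (False, False, True)


def neighbors(x, y, z):
    """Return the coordinates of the nodes directly linked to (x, y, z)."""
    coord = (x, y, z)
    result = []
    for axis in range(3):
        for step in (-1, 1):
            pos = coord[axis] + step
            if WRAPS[axis]:
                pos %= DIMS[axis]          # torus: wrap around the ring
            elif not 0 <= pos < DIMS[axis]:
                continue                   # mesh edge: no link in this direction
            nxt = list(coord)
            nxt[axis] = pos
            result.append(tuple(nxt))
    return result


total_nodes = DIMS[0] * DIMS[1] * DIMS[2]
print(total_nodes)               # node count implied by the dimensions
print(neighbors(0, 0, 0))        # corner node: fewer links in X and Y
print(len(neighbors(13, 10, 12)))  # interior node
```

Each node has at most six links no matter how many nodes the machine has, so growing the system adds nodes and links without changing the per-node hardware.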
Minimize communication interference Jobs occupy disjoint regions simultaneously Example – red, green, and blue jobs: Z=24 X=27 Y=20 12,960 Compute Nodes
Hardware Performance Characteristics that Lead to a Balanced System
Network bandwidth must balance with processor speed and operations per second, which must balance with memory bandwidth and capacity, which must balance with file system I/O bytes per second
In Addition to Balanced Hardware, System Software must be Scalable
Scalable System Software Concept #1 Do things in a hierarchical fashion
Job Launch is Hierarchical
[Figure labels: Login Node (Linux, User Application, User Login & Start App, Job Launch); Job Scheduler Node (Batch mom, Scheduler, Batch Server, Job Queues); Database Node (CPU Inventory Database); Compute Node Allocator; Fan out application]
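To see why fanning out the application beats contacting every node from one place, here is a small sketch that counts forwarding rounds. The forwarding model (every node that already has the application forwards it to a fixed number of new nodes per round) and the fan-out values are assumptions for illustration; the slide does not give them.

```python
def fanout_rounds(num_nodes, fanout):
    """Count rounds until num_nodes nodes have received the application.

    Hypothetical model: the launcher hands the application to one node;
    in each round, every node that has it forwards it to `fanout` new nodes.
    """
    reached = 1
    rounds = 0
    while reached < num_nodes:
        reached += reached * fanout
        rounds += 1
    return rounds


NODES = 12_960  # compute node count from the topology slide

# A launcher that contacts nodes one at a time needs one step per node.
print("serial steps:", NODES)
for fanout in (2, 8, 32):  # illustrative fan-out values
    print("fan-out", fanout, "rounds:", fanout_rounds(NODES, fanout))
```

Under this model the number of rounds grows with the logarithm of the node count, while a serial launch grows linearly.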
System monitoring is hierarchical
Scalable System Software Concept #2 Minimize Compute Node Operating System Overhead
Operating System Interruptions Impede Progress of the Application
System monitoring is out of band and non-invasive
Scalable System Software Concept #3 Minimize Compute Node Interdependencies
Calculating Weather Minute by Minute Calc 1 0 min Calc 2 1 min Calc 3 2 min Calc 4 3 min 4 min
Calculation with Breaks Calculation with Asynchronous Breaks Calc 1 0 min Wait 1 min Calc 2 2 min Calc 3 3 min Wait 4 min 5 min Calc 4 6 min
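The timeline above can be reproduced with a toy model in which every calculation step ends only when the slowest node has finished. The two-node setup and the placement of the breaks are assumptions chosen to match the 4-minute and 6-minute totals on the slides.

```python
def total_minutes(calc_minutes, breaks):
    """Total run time when all nodes synchronize after every step.

    breaks[node][step] is the interruption (in minutes) that node suffers
    during that step. A step lasts as long as its slowest node.
    """
    steps = len(breaks[0])
    total = 0
    for step in range(steps):
        total += calc_minutes + max(node[step] for node in breaks)
    return total


no_breaks = [[0, 0, 0, 0], [0, 0, 0, 0]]
# Both nodes take their one-minute break during the same step.
synchronized = [[1, 0, 0, 0], [1, 0, 0, 0]]
# Each node takes its break during a different step.
asynchronous = [[1, 0, 0, 0], [0, 0, 1, 0]]

print(total_minutes(1, no_breaks))
print(total_minutes(1, synchronized))
print(total_minutes(1, asynchronous))
```

Each node loses only one minute to its own break, yet the asynchronous case costs the whole job two minutes, because every node also waits out the other node's break.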
Run Time Impact of Linux System Services (aka Daemons)
Say breaks take 50 µs and occur once per second
–On one CPU, wasted time is 50 µs every second: negligible 0.005% impact
–On 100 CPUs, wasted time is 5 ms every second: negligible 0.5% impact
–On 10,000 CPUs, wasted time is 500 ms every second: significant 50% impact
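The arithmetic on this slide can be checked with a few lines of code. The sketch assumes breaks of 50 microseconds once per second on every CPU, that the breaks never overlap, and that all CPUs wait for each one; that is the worst case the slide appears to describe. The function names are invented for this example.

```python
BREAK_US = 50              # length of one break in microseconds (from the slide)
BREAKS_PER_SECOND = 1      # each CPU is interrupted once per second (from the slide)
US_PER_SECOND = 1_000_000


def wasted_us_per_second(num_cpus):
    """Microseconds lost per second of run time.

    Worst-case assumption: breaks on different CPUs never overlap,
    and every CPU waits for each one.
    """
    return num_cpus * BREAK_US * BREAKS_PER_SECOND


def impact_percent(num_cpus):
    """Share of each second lost to breaks, capped at 100%."""
    return min(100.0, 100 * wasted_us_per_second(num_cpus) / US_PER_SECOND)


for cpus in (1, 100, 10_000):
    print(cpus, "CPUs:", wasted_us_per_second(cpus), "us wasted,",
          impact_percent(cpus), "% impact")
```

The per-CPU overhead is constant; the total grows only because the application is synchronized across all CPUs.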
Scalable System Software Concept #4 Avoid linear scaling of buffer requirements
Connection-oriented protocols have to reserve buffers for the worst case
If each node reserves a 100 KB buffer for each of its peers, that is 1 GB of memory per node for 10,000 processors.
Need to communicate using collective algorithms
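A quick calculation shows how the reserved memory grows with the number of peers. The 100 KB buffer and the 10,000 peers come from the slide (using decimal units); the four-neighbor case is a hypothetical stand-in for a scheme in which each node only talks to a small, fixed set of partners.

```python
BUFFER_BYTES = 100_000  # 100 KB per peer, decimal units as on the slide


def reserved_bytes(num_peers, buffer_bytes=BUFFER_BYTES):
    """Memory one node must set aside if it keeps a buffer for every peer."""
    return num_peers * buffer_bytes


all_peers = reserved_bytes(10_000)  # one buffer per processor, as on the slide
few_peers = reserved_bytes(4)       # hypothetical: only four fixed partners

print(all_peers / 1e9, "GB per node with a buffer for every peer")
print(few_peers / 1e3, "KB per node with four partners")
```

With per-peer buffers the cost on every node rises linearly with machine size, while a fixed partner count keeps it constant.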
Scalable System Software Concept #5 Parallelize wherever possible
Use parallel techniques for I/O
[Figure labels: Compute Nodes; I/O Nodes; High Speed Network; Parallel File System Servers (190 + MDS); RAIDs; 10.0 GigE Servers (50); Login Servers (10); 10 Gbit Ethernet; 1 Gbit Ethernet]
–140 MB/s per FC X 2 X 190 = 53 GB/s
–500 MB/s X 50 = 25 GB/s
–1.0 GigE X 10
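The aggregate bandwidth figures follow from multiplying the per-link rate by the number of links and servers. The sketch below redoes that arithmetic; pairing the 500 MB/s figure with the 50 GigE servers is inferred from the matching counts, and the function name is invented for this example.

```python
def aggregate_mb_per_s(per_link_mb_per_s, links_per_server, servers):
    """Total bandwidth when all servers drive all of their links at once."""
    return per_link_mb_per_s * links_per_server * servers


# 140 MB/s per FC link, 2 links per server, 190 file system servers.
file_system = aggregate_mb_per_s(140, 2, 190)
# 500 MB/s per server, 50 servers (assumed to be the 10 GigE servers).
ten_gige = aggregate_mb_per_s(500, 1, 50)

print(file_system / 1000, "GB/s")  # the slide rounds this to 53 GB/s
print(ten_gige / 1000, "GB/s")
```

No single server comes close to these rates; the totals are reached only because many servers work in parallel.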
Summary of Principles
Partition the hardware and software
Hardware
–For scalability and upgradability, use a mesh network topology
–Determine the right balance of processor speed, memory bandwidth, network bandwidth, and I/O bandwidth for your applications
System Software
–Do things in a hierarchical fashion
–Minimize compute node OS overhead
–Minimize compute node interdependencies
–Avoid linear scaling of buffer requirements
–Parallelize wherever possible
Lessons Learned
Seek first to emulate
–Learn from the past
–Simulate the future
Need technology philosophers Tilt Meters Historians
Even Tiger Woods has a coach
The big bang only worked once
–Deploy test platforms early and often
Build de-scalable, scalable systems
–Don’t forget that you have to get it running first!
–Leave the support structures (even non-scalable development tools) in working condition; you’ll need to debug some day
Only dead systems never change
–Nobody ever built just one system, even when successfully deploying just one system
–Nothing is ever done just once
Build scaffolding that meets the structure
–Is build and test infrastructure in place FIRST?
–Will it effectively support both the team and the project?