1
Bill Camp, Jim Tomkins & Rob Leland
2
How Much Commodity is Enough? the Red Storm Architecture
William J. Camp, James L. Tomkins & Rob Leland CCIM, Sandia National Laboratories Albuquerque, NM
3
Sandia MPPs (since 1987)
- 1987: 1024-processor nCUBE10 [512 Mflops]
- …-processor nCUBE-2 machines [2 Gflops]
- …-processor CM-200
- 1991: 64-processor Intel iPSC-860
- …: ~3700-processor Intel Paragon [180 Gflops]
- 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
- 1997--present: >2800 processors in Cplant Linux Cluster [~3 Tflops]
- 2003: 1280-processor IA32 Linux cluster [~7 Tflops]
- 2004: Red Storm: ~11,600-processor Opteron-based MPP [>40 Tflops]
4
Our rubric (since 1987)
- Complex, mission-critical, engineering & science applications
- Large systems (1000's of PE's) with a few processors per node
- Message passing paradigm
- Balanced architecture
- Use commodity wherever possible
- Efficient systems software
- Emphasis on scalability & reliability in all aspects
- Critical advances in parallel algorithms
- Vertical integration of technologies
5
A partitioned, scalable computing architecture
[Diagram: partitioned system with Net I/O, Service, Users, File I/O, Compute, and /home partitions]
6
Computing domains at Sandia
[Chart: computing domains at Sandia by processor count (1 to 10^4): Desktop, Beowulf clusters, Cplant Linux Supercluster, and Red Storm, spanning the Volume, Mid-Range, and Peak markets]
Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from one cabinet on up).
7
Red Storm Architecture
- True MPP, designed to be a single system--not a cluster
- Distributed memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
- 108 compute node cabinets and 10,368 compute node processors (AMD Opteron, 2.0 GHz)
- ~10 or 20 TB of DDR 333 MHz memory
- Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
- 12 Service, Visualization, and I/O cabinets on each end (640 S, V & I processors for each color)
- 240 TB of disk storage (120 TB per color) initially
8
Red Storm Architecture
- Functional hardware partitioning: service and I/O nodes, compute nodes, visualization nodes, and RAS nodes
- Partitioned Operating System (OS): LINUX on service, visualization, and I/O nodes; LWK (Catamount) on compute nodes; LINUX on RAS nodes
- Separate RAS and system management network (Ethernet)
- Router table-based routing in the interconnect
- Less than 2 MW total power and cooling
- Less than 3,000 ft² of floor space
9
Usage Model
[Diagram: Unix (Linux) login node with a Unix environment, the compute resource, I/O, and batch processing]
User sees a coherent, single system.
10
Thor’s Hammer Topology
- 3D-mesh compute node topology: 27 x 16 x 24 (x, y, z)
- Red/Black split: 2,688 -- 4,992 -- 2,688
- Service, Visualization, and I/O partitions: 3 cabinets on each end of each row
- 384 full-bandwidth links to the compute node mesh
- Not all nodes have a processor--all have routers
- 256 PE's in each Visualization partition--2 per board
- 256 PE's in each I/O partition--2 per board
- 128 PE's in each Service partition--4 per board
11
3-D Mesh topology (Z direction is a torus)
[Diagram: 27 x 16 x 24 (X x Y x Z) compute node mesh of 10,368 nodes, with 640 Visualization, Service & I/O nodes on each end; the Z direction is a torus]
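To make the topology concrete, here is a minimal sketch (not Sandia's routing code) of how a node's directly connected neighbors could be enumerated in a 27 x 16 x 24 mesh whose Z dimension wraps around as a torus; nodes on the X and Y faces simply have fewer mesh neighbors.

```python
# Hypothetical illustration of the 27 x 16 x 24 mesh with a torus in Z.
# Dimensions come from the slides; the code itself is illustrative only.
DIMS = (27, 16, 24)  # (X, Y, Z)

def neighbors(x, y, z):
    """Return the coordinates of a node's directly connected neighbors."""
    nbrs = []
    for axis, delta in ((0, -1), (0, +1), (1, -1), (1, +1), (2, -1), (2, +1)):
        coord = [x, y, z]
        coord[axis] += delta
        if axis == 2:                        # Z wraps around (torus)
            coord[2] %= DIMS[2]
            nbrs.append(tuple(coord))
        elif 0 <= coord[axis] < DIMS[axis]:  # X and Y are open mesh edges
            nbrs.append(tuple(coord))
    return nbrs

# An interior node has 6 neighbors; a corner node still has Z neighbors
# on both sides because of the wrap-around.
print(len(neighbors(13, 8, 12)))  # -> 6
print(neighbors(0, 0, 0))         # includes (0, 0, 23) via the Z torus
```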
12
Thor’s Hammer Network Chips
- The 3D mesh is created by the SEASTAR ASIC: a HyperTransport interface and 6 network router ports on each chip
- In the compute partition each processor has its own SEASTAR
- In the service partition, some boards are configured like the compute partition (4 PE's per board); others have only 2 PE's per board but still have 4 SEASTARs, so the network topology is uniform
- SEASTAR was designed by Cray to our spec's and fabricated by IBM
- It is the only truly custom part in Red Storm--and it complies with the HyperTransport (HT) open standard
13
Node architecture
[Diagram: each node couples an AMD Opteron CPU with 1 (or 2) GB or more of DRAM and a NIC + router ASIC providing six links to other nodes in X, Y, and Z. ASIC = Application-Specific Integrated Circuit, or a "custom chip".]
14
System Layout (27 x 16 x 24 mesh)
[Diagram: system layout with a normally unclassified section and a normally classified section, switchable nodes between them, and disconnect cabinets]
15
Thor’s Hammer Cabinet Layout
[Diagram: compute node cabinet, 2 ft x 4 ft, front and side views showing CPU boards, fan, power supply, and cables]
Compute node partition (96 PE's per cabinet):
- 3 card cages per cabinet
- 8 boards per card cage
- 4 processors per board
- 4 NIC/router chips per board
- N + 1 power supplies
- Passive backplane
Service, Viz, and I/O node partition:
- 2 (or 3) card cages per cabinet
- 2 (or 4) processors per board
- 2-PE I/O boards have 4 PCI-X busses
16
Performance
- Peak of 41.4 (46.6) TF based on 2 floating point instruction issues per clock at 2.0 GHz (see the arithmetic below)
- We required a 7-fold speedup versus ASCI Red, but based on our benchmarks we expect performance will be 8--10 times faster than ASCI Red
- Expected MP-Linpack performance: ~ TF
- Aggregate system memory bandwidth: ~55 TB/s
- Interconnect performance:
  - Latency <2 µs (neighbor), <5 µs (full machine)
  - Link bandwidth ~6.0 GB/s bi-directional
  - Minimum bisection bandwidth ~2.3 TB/s
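The peak figures follow directly from the clock rate and issue width quoted above. The sketch below redoes that arithmetic, assuming (as the deck's node counts suggest) that the larger 46.6 TF number also counts the 2 x 640 service, visualization, and I/O processors.

```python
# Worked arithmetic for the quoted peak figures.  Assumption: the larger
# number includes the 2 x 640 service/visualization/I/O processors.
flops_per_clock = 2          # floating point issues per clock (from the slide)
clock_hz = 2.0e9             # 2.0 GHz Opteron
per_proc_peak = flops_per_clock * clock_hz       # 4 GF per processor

compute_procs = 10_368
svc_viz_io_procs = 2 * 640

print(per_proc_peak * compute_procs / 1e12)                       # ~41.5 TF
print(per_proc_peak * (compute_procs + svc_viz_io_procs) / 1e12)  # ~46.6 TF
```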
17
Performance: I/O system and node memory
I/O system performance:
- Sustained file system bandwidth of 50 GB/s for each color
- Sustained external network bandwidth of 25 GB/s for each color
Node memory system:
- Page miss latency to local memory is ~80 ns
- Peak bandwidth of ~5.4 GB/s for each processor
18
Red Storm System Software
Operating systems:
- LINUX on service and I/O nodes
- Sandia's LWK (Catamount) on compute nodes
- LINUX on RAS nodes
Run-time system:
- Logarithmic loader (fast, efficient)
- Node allocator
- Batch system -- PBS
- Libraries -- MPI, I/O, Math
File systems being considered include:
- PVFS -- interim file system
- Lustre -- design intent
- Panasas -- possible alternative
- …
19
Red Storm System Software
Tools:
- Compilers (all IA32 compilers, all AMD 64-bit compilers) -- Fortran, C, C++
- Debugger -- TotalView (also examining alternatives)
- Performance tools (was going to be Vampir until Intel bought Pallas--now?)
System management and administration:
- Accounting
- RAS GUI interface
20
Comparison of ASCI Red and Red Storm
(ASCI Red vs. Red Storm)
- Full system operational time frame: June 1997 (processor and memory upgrade in 1999) vs. August 2004
- Theoretical peak (TF), compute partition alone: 3.15 vs. 41.47
- MP-Linpack performance (TF): 2.38 vs. >30 (estimated)
- Architecture: Distributed Memory MIMD (both)
- Number of compute node processors: 9,460 vs. 10,368
- Processor: Intel P 333 MHz vs. AMD 2 GHz
- Total memory: 1.2 TB vs. 10.4 TB (up to 80 TB)
- System memory bandwidth: 2.5 TB/s vs. 55 TB/s
- Disk storage: 12.5 TB vs. 240 TB
- Parallel file system bandwidth: 1.0 GB/s each color vs. 50.0 GB/s each color
- External network bandwidth: 0.2 GB/s each color vs. 25 GB/s each color
21
Comparison of ASCI Red and Red Storm
(ASCI Red vs. Red Storm, continued)
- Interconnect topology: 3D mesh (x, y, z) 38 x 32 x 2 vs. 3D mesh (x, y, z) 27 x 16 x 24
- Interconnect performance:
  - MPI latency: 15 µs 1 hop, 20 µs max vs. 2.0 µs 1 hop, 5 µs max
  - Bi-directional bandwidth: 800 MB/s vs. 6.0 GB/s
  - Minimum bisection bandwidth: 51.2 GB/s vs. 2.3 TB/s
- Full system RAS:
  - RAS network: 10 Mbit Ethernet vs. 100 Mbit Ethernet
  - RAS processors: 1 for each 32 CPUs vs. 1 for each 4 CPUs
- Operating system:
  - Compute nodes: Cougar vs. Catamount
  - Service and I/O nodes: TOS (OSF1 UNIX) vs. LINUX
  - RAS nodes: VxWorks vs. LINUX
- Red/Black switching: 2260 -- 4940 -- 2260 vs. 2688 -- 4992 -- 2688
- System footprint: ~2500 ft² vs. ~3000 ft²
- Power requirement: 850 KW vs. 1.7 MW
22
Red Storm Project 23 months, design to First Product Shipment!
- System software is a joint project between Cray and Sandia: Sandia is supplying the Catamount LWK and the service node run-time system; Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the TotalView port
- Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involves an FPGA implementation of the SEASTAR NIC/router (Starfish); final checkout is on the real SEASTAR-based system
- System design is going on now:
  - Cabinets--exist
  - SEASTAR NIC/router--released to fabrication at IBM earlier this month
- Full system to be installed and turned over to Sandia in stages culminating in August--September 2004
23
New Building for Thor’s Hammer
24
Designing for scalable supercomputing
Challenges in:
- Design
- Integration
- Management
- Use
25
SUREty for Very Large Parallel Computer Systems
SURE poses computer system requirements:
- Scalability -- full system hardware and system software
- Usability -- required functionality only
- Reliability -- hardware and system software
- Expense minimization -- use commodity, high-volume parts
26
SURE Architectural tradeoffs:
- Processor and memory subsystem balance
- Compute vs. interconnect balance
- Topology choices
- Software choices
- RAS
- Commodity vs. custom technology
- Geometry and mechanical design
27
Sandia strategies:
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in Red Storm there is basically one truly custom part!)
- Leverage experience with previous scalable supercomputers
28
System Scalability Driven Requirements
Overall system scalability: complex scientific applications such as molecular dynamics, hydrodynamics, and radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).
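As a reference for what "scaled parallel efficiency" means here, the sketch below computes weak-scaling efficiency under the usual definition (work per processor held fixed, so ideal behavior is constant run time); the timing numbers are made up purely for illustration.

```python
# Scaled (weak-scaling) parallel efficiency: work per processor is held
# fixed, so E(N) = T(1) / T(N).  Timings below are illustrative only.
def scaled_efficiency(t_one_proc, t_n_procs):
    return t_one_proc / t_n_procs

t_single = 100.0        # seconds on 1 processor (hypothetical)
t_full_system = 180.0   # seconds on ~20,000 processors (hypothetical)

eff = scaled_efficiency(t_single, t_full_system)
print(f"scaled efficiency = {eff:.0%}")   # 56% -> meets the >50% target
```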
29
Scalability: system software
System software performance scales nearly perfectly with the number of processors, to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with the system size:
- Full re-boot time scales logarithmically with the system size
- Job loading is logarithmic with the number of processors (see the sketch after this list)
- Parallel I/O performance is not sensitive to the number of PE's doing I/O
- Communication network software must be scalable:
  - No connection-based protocols among compute nodes
  - Message buffer space independent of the number of processors
- Compute node OS gets out of the way of the application
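To illustrate why job loading can be logarithmic in the number of processors (the "logarithmic loader" mentioned later in the deck), here is a minimal sketch, not Sandia's loader, of a binary-tree fan-out: every node that already holds the executable forwards it to one new node per step, so coverage doubles each step and ~30,000 nodes are reached in about 15 steps.

```python
# Minimal sketch of tree-based fan-out, the idea behind a logarithmic
# loader: holders of the executable each send it to one more node per step.
import math

def fanout_steps(num_nodes):
    """Steps needed until every node has received the image."""
    have_image = 1
    steps = 0
    while have_image < num_nodes:
        have_image *= 2        # every holder sends to one new node
        steps += 1
    return steps

for n in (100, 1_000, 10_368, 30_000):
    print(n, fanout_steps(n), math.ceil(math.log2(n)))  # steps == ceil(log2 N)
```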
30
Hardware scalability Balance in the node hardware:
- Memory BW must match CPU speed: ideally 24 Bytes/flop (never yet done; see the balance arithmetic below)
- Communications speed must match CPU speed
- I/O must match CPU speeds
- Scalable system SW (OS and libraries)
- Scalable applications
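A small worked example of the bytes-per-flop balance this slide is about, using figures quoted elsewhere in the deck (4 GF peak per 2.0 GHz Opteron, ~5.4 GB/s peak memory bandwidth per processor). The 24 Bytes/flop "ideal" is read here, as an assumption, in the usual way: two 8-byte loads and one 8-byte store per floating point operation.

```python
# Bytes/flop balance for a Red Storm node, from figures quoted in the deck.
peak_flops = 2 * 2.0e9          # 2 FP issues/clock at 2.0 GHz = 4 GF/s
mem_bw = 5.4e9                  # ~5.4 GB/s peak memory BW per processor

ideal = 3 * 8                   # 2 loads + 1 store of 8-byte operands per flop
actual = mem_bw / peak_flops    # ~1.35 bytes/flop in practice

print(f"ideal  : {ideal} bytes/flop")
print(f"actual : {actual:.2f} bytes/flop")
```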
31
SW Scalability: Kernel tradeoffs
Unix/Linux has more functionality, but impairs efficiency on the compute partition at scale. Say the interruptions are uncorrelated, last 50 µs, and occur once per second on each CPU:
- On 100 CPUs, the wasted time is 5 ms every second: a negligible 0.5% impact
- On 1,000 CPUs, the wasted time is 50 ms every second: a moderate 5% impact
- On 10,000 CPUs, the wasted time is 500 ms every second: a significant 50% impact
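A minimal sketch of the arithmetic behind those three bullets, under the slide's assumption that the interruptions are uncorrelated, so a bulk-synchronous application ends up waiting through each one in turn.

```python
# OS-noise arithmetic from the slide: uncorrelated 50 microsecond
# interruptions, once per second per CPU, serialize across a
# bulk-synchronous application.
INTERRUPT_S = 50e-6      # 50 microseconds per interruption
RATE_HZ = 1.0            # one interruption per second per CPU

for cpus in (100, 1_000, 10_000):
    wasted_per_second = cpus * INTERRUPT_S * RATE_HZ
    print(f"{cpus:>6} CPUs: {wasted_per_second*1e3:6.1f} ms wasted/s "
          f"({wasted_per_second:.1%} impact)")
```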
32
Usability
Application code support -- software that supports scalability of the computer system:
- Math libraries
- MPI support for the full system size
- Parallel I/O library
- Compilers
Tools that scale to the full size of the computer system:
- Debuggers
- Performance monitors
Full-featured LINUX OS support at the user interface
33
Reliability
- Light Weight Kernel (LWK) OS on the compute partition: much less code fails much less often
- Monitoring of correctable errors: fix soft errors before they become hard
- Hot swapping of components: the overall system keeps running during maintenance
- Redundant power supplies & memories
- Completely independent RAS system monitors virtually every component in the system
34
Economy
- Use high-volume parts where possible
- Minimize power requirements: cuts operating costs and reduces the need for new capital investment
- Minimize system volume: reduces the need for large new capital facilities
- Use standard manufacturing processes where possible--minimize customization
- Maximize reliability and availability per dollar
- Maximize scalability per dollar
- Design for integrability
35
Economy Red Storm leverages economies of scale
- AMD Opteron microprocessor & standard memory
- Air cooled
- Electrical interconnect based on InfiniBand physical devices
- Linux operating system
Selected use of custom components:
- System chip ASIC: critical for communication-intensive applications
- Light Weight Kernel: truly custom, but we already have it (4th generation)
36
Cplant on a slide
- Goal: MPP "look and feel"
- Start ~1997, upgrade ~annually
- Alpha & Myrinet, mesh topology
- ~3000 procs (3 Tf) in 7 systems
- Configurable to ~1700 procs
- Red/Black switching
- Linux w/ custom runtime & mgmt.
- Production operation for several yrs.
[Diagram: Cplant system structure modeled on ASCI Red, with compute nodes, service nodes, I/O nodes, system support, sys admin, users, file I/O (/home and other), Ethernet, ATM, HiPPI, and operator(s)]
37
IA-32 Cplant on a slide
- Goal: mid-range capacity
- Started 2003, upgrade annually
- Pentium-4 & Myrinet, Clos network
- 1280 procs (~7 Tf) in 3 systems
- Currently configurable to 512 procs
- Linux w/ custom runtime & mgmt.
- Production operation for several yrs.
[Diagram: same system structure as Cplant, modeled on ASCI Red, with compute, service, and I/O nodes]
38
Observation: for most large scientific and engineering applications, performance is determined more by parallel scalability than by the speed of individual CPUs. There must be balance between processor, interconnect, and I/O performance to achieve overall performance. To date, only a few tightly coupled parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.
39
Let’s Compare Balance In Parallel Systems
Machine | Network Link BW (MB/s) | Communications Balance (Bytes/flop) | Node Speed Rating (MFlops)
ASCI RED | … | 2 (1.33) | 400
T3E | … | 1 | 1200
ASCI RED** | 800 (533) | (1.2) 0.67 | 666
Cplant | 140 | 0.14 | 1000
Blue Mtn* | 800 | 1.6 | 500
Blue Mtn** | 1200 (9600*) | 0.02 (0.16*) | 64000
Blue Pacific | 300 (132) | 0.11 (0.05) | 2650
White | 2000 | 0.083 | 24000
Q** | … | 0.04 | 2500
Q* | 650 | 0.2 | 10000
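The communications balance column is simply link bandwidth divided by node floating point speed. The sketch below reproduces two rows of the table and, as an extra illustration, applies the same ratio to Red Storm's quoted ~6.0 GB/s bi-directional link bandwidth and 4 GF per-processor peak.

```python
# Communications balance = network link BW / node floating point speed.
def comm_balance(link_bw_mbytes_s, node_mflops):
    return link_bw_mbytes_s / node_mflops   # bytes per flop

print(comm_balance(2000, 24000))   # White    -> ~0.083 B/flop (matches the table)
print(comm_balance(140, 1000))     # Cplant   -> 0.14 B/flop   (matches the table)
print(comm_balance(6000, 4000))    # Red Storm: ~6.0 GB/s link, 4 GF node -> 1.5 B/flop
```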
40
Comparing Red Storm and BGL
(Blue Gene Light** vs. Red Storm*)
- Node speed: … GF vs. … GF (1x)
- Node memory: … GB (1--8) vs. … GB (4x nominal)
- Network latency: … µsecs vs. … µsecs (2/7x)
- Network BW: … GB/s vs. … GB/s (22x)
- BW Bytes/Flops: … vs. … (22x)
- Bi-section B/F: … vs. … (24x)
- #nodes/problem: … vs. … (1/4x)
*100 TF version of Red Storm; **360 TF version of BGL
41
Fixed problem performance
Molecular dynamics problem (LJ liquid)
42
Parallel Sn Neutronics (provided by LANL)
43
Scalable computing works
44
Balance is critical to scalability
[Chart: performance on peak, Linpack, and scientific & engineering codes]
45
Relating scalability and cost
[Chart: efficiency ratio (MPP vs. cluster), averaged over the five codes that consume >80% of Sandia's cycles, compared against the cost ratio of 1.8; where the efficiency ratio falls below that line the cluster is more cost effective, above it the MPP is more cost effective]
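A minimal sketch of the crossover logic implied by the chart: the MPP is worth its price premium once its efficiency advantage over the cluster exceeds the cost ratio (taken here as the 1.8 from the slide). The efficiency numbers in the example are made up for illustration.

```python
# Cost-effectiveness crossover implied by the slide.  Example efficiencies
# are illustrative, not measured data.
COST_RATIO = 1.8     # MPP cost / cluster cost (from the slide)

def mpp_more_cost_effective(mpp_eff, cluster_eff):
    # Work per dollar ~ efficiency / cost, so compare the efficiency
    # ratio against the cost ratio.
    return (mpp_eff / cluster_eff) > COST_RATIO

print(mpp_more_cost_effective(0.60, 0.45))   # ratio 1.33 -> cluster wins
print(mpp_more_cost_effective(0.55, 0.25))   # ratio 2.2  -> MPP wins
```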
46
Scalability determines cost effectiveness
Sandia's top-priority computing workload:
[Chart: 55M node-hrs fall in the region where a cluster is more cost effective, 380M node-hrs in the region where an MPP is more cost effective, with the crossover near 256 processors]
47
Scalability also limits capability
[Chart: reaching the same capability requires ~3x more processors on the less scalable system]
48
Commodity nearly everywhere-- Customization drives cost
- The Earth Simulator and Cray X-1 are fully custom vector systems with good balance; this drives their high cost (and their high performance)
- Clusters are nearly entirely high-volume with no truly custom parts, which drives their low cost (and their low scalability)
- Red Storm uses custom parts only where they are critical to performance and reliability: high scalability at minimal cost/performance
50
Scaling data for some key engineering codes
- Random variation at small proc. counts
- Large differential in efficiency at large proc. counts
51
Scaling data for some key physics codes
Los Alamos’ Radiation transport code