SGI’2000 Parallel Programming Tutorial
Supercomputers 2
With acknowledgement to Igor Zacharov and Wolfgang Mertz, SGI European Headquarters

Classification of Computers

MIMD
  Multiprocessors (single address space, shared memory)
    UMA (central memory)
      PVP (Cray T90)
      SMP (Intel SHV, SUN E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
    NUMA (distributed memory)
      COMA (KSR-1, DDM)
      CC-NUMA (SGI Origin2000, SN1 (SGI3000), Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
      NCC-NUMA (Cray T3D, IBM SP3)
  Multicomputers (multiple address spaces)
    NORMA (no remote memory access)
      Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, "Beowulf", etc.): loosely coupled, multiple OS
      "MPP" (Intel TFLOPS, TM-5): tightly coupled, single OS

Abbreviations
  MIMD      Multiple Instructions, Multiple Data
  PVP       Parallel Vector Processor
  UMA       Uniform Memory Access
  SMP       Symmetric Multi-Processor
  NUMA      Non-Uniform Memory Access
  COMA      Cache-Only Memory Architecture
  NORMA     No-Remote Memory Access
  CC-NUMA   Cache-Coherent NUMA
  NCC-NUMA  Non-Cache-Coherent NUMA
  MPP       Massively Parallel Processor

Design Space of Competing Computer Architectures

Structure of an SMP System (1): Central Bus
[Figure: processors (each with a cache), I/O, and main memory banks, all attached to a central bus]
o Does NOT scale, due to bus saturation
o The bus is a very complex component
o High memory latency due to the complexity

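The bus-saturation point can be illustrated with a deliberately simple model. This is a sketch under stated assumptions, not a measurement: the function name `bus_speedup` and the bandwidth figures are hypothetical, and the model ignores caches and arbitration overhead. It treats the bus as a fixed total bandwidth shared by all processors.

```python
def bus_speedup(processors, bus_bandwidth, demand_per_processor):
    """Idealised speedup of a memory-bound workload on a shared bus.

    Assumption: every processor asks for `demand_per_processor` units of
    memory bandwidth, and the bus delivers at most `bus_bandwidth` in total.
    Both values are hypothetical and use the same unit (e.g. MB/s).
    """
    wanted = processors * demand_per_processor
    delivered = min(wanted, bus_bandwidth)
    # Speedup relative to a single processor running at its full demand.
    return delivered / demand_per_processor


if __name__ == "__main__":
    # Hypothetical figures: 1200 MB/s bus, 200 MB/s demand per processor.
    for p in (1, 2, 4, 8, 16, 32):
        print(p, bus_speedup(p, 1200, 200))
```

In this model the speedup grows linearly until the summed demand reaches the bus bandwidth and then stays flat, however many processors are added.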
Structure of an SMP System (2): Central Crossbar
[Figure: processors (each with a cache), I/O, and main memory banks, all attached to a central crossbar]
o Scales very well
o The crossbar is a very complex component
o High memory latency due to the complexity

Structure of an SMP System (3): Origin SGI NUMA Architecture
[Figure: SGI NUMA hypercube with a global switch interconnect; N = nodeboard, R = router; I/O attaches to the nodeboards]

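The hypercube interconnect can be reasoned about with a short sketch. It assumes an ideal binary hypercube in which routers are numbered so that directly linked routers differ in exactly one bit; the function names are hypothetical and the actual cabling of a given Origin configuration may differ.

```python
def router_hops(a, b):
    """Links traversed between routers a and b in an ideal hypercube.

    Assumption: neighbouring routers differ in exactly one address bit,
    so the shortest path length is the Hamming distance of the addresses.
    """
    return bin(a ^ b).count("1")


def max_hops(num_routers):
    """Worst-case hop count (diameter) for a hypercube of num_routers routers."""
    if num_routers < 1 or num_routers & (num_routers - 1):
        raise ValueError("num_routers must be a power of two")
    return num_routers.bit_length() - 1


if __name__ == "__main__":
    for routers in (2, 4, 8, 16):
        print(routers, "routers: at most", max_hops(routers), "hops")
```

The worst-case distance grows with the logarithm of the router count, which is why remote latency in such a topology rises slowly as the system gets larger.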
Systems are built from Modules
o Deskside (1 module)
o Rack (2 modules): 16 CPUs
o Multi-rack (4 modules): 32 CPUs
o Etc., up to 128 CPUs

New High-End Products
Origin 3000 Servers – Onyx 3 Systems, IRIX 6.5
o SGI Origin 3200, SGI Onyx 3200
o SGI Origin 3400, SGI Onyx 3400
o SGI Origin 3800, SGI Onyx 3800

SGI 3800 System (16-512p)
[Figure: rack layouts of the minimum (16p) system and the 128p system, built from C-Bricks, R-Bricks (8-port routers), I-Bricks, P-, I-, or X-Bricks, and Power Bays]
[Figure: 128p system topology across Racks 1 to 4, each rack shown with two R-Bricks (R) and eight C-Bricks (C)]

ASCI Blue Mountain, Los Alamos National Laboratory
o Origin 2000 with 3+ Tflops peak
o 1+ Tflops application performance
o 48 systems with 128 CPUs each = 6144 CPUs
o 1536 Gbyte memory
o 76 Tbyte disk space

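The Blue Mountain figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers on the slide; it assumes 1 Gbyte = 1024 Mbyte and takes "3+ Tflops" as exactly 3.0, so the per-CPU peak is a lower bound. The variable names are illustrative.

```python
# Figures taken from the slide.
systems = 48
cpus_per_system = 128
total_memory_gbyte = 1536
peak_tflops = 3.0  # the slide says "3+", so this is a lower bound

total_cpus = systems * cpus_per_system

# Assumption: 1 Gbyte = 1024 Mbyte.
memory_per_cpu_mbyte = total_memory_gbyte * 1024 / total_cpus

# Assumption: 1 Tflops = 1000 Gflops.
peak_per_cpu_gflops = peak_tflops * 1000 / total_cpus

print("total CPUs:", total_cpus)
print("memory per CPU (Mbyte):", memory_per_cpu_mbyte)
print("peak per CPU (Gflops, lower bound):", round(peak_per_cpu_gflops, 3))
```

The product 48 x 128 reproduces the 6144 CPUs stated on the slide, and the memory works out to 256 Mbyte per CPU.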
Memory hierarchy
[Figure: memory hierarchy, speed of access versus device capacity (size), spanning registers, the cache subsystem, memory, and disk]
o Capacity labels: 64 registers, 32 KB (L1), 8 MB (L2), ~ s GB (memory), disk
o Access-time labels: 1/clock, ~2-3 cy, ~10 cy, ~ cy (NUMA), ~4000 cy
[Figure: remote latency (ns) versus system size (4p to 512p), comparing SN-MIPS latency with Origin2000 latency]

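The cost of falling through the hierarchy can be estimated with the standard average-access-time formula. This is a minimal sketch: the L1 and L2 costs follow the slide (~2-3 cy taken as 2.5, ~10 cy), while the main-memory cost `mem_cy` and the hit rates are hypothetical placeholders, because the slide's NUMA memory figure did not survive extraction.

```python
def average_access_cycles(l1_hit, l2_hit, l1_cy=2.5, l2_cy=10.0, mem_cy=100.0):
    """Average cycles per memory access for a two-level cache.

    l1_hit: fraction of all accesses served by L1.
    l2_hit: fraction of L1 misses served by L2.
    l1_cy, l2_cy: access costs from the slide (~2-3 cy, ~10 cy).
    mem_cy: HYPOTHETICAL main-memory cost; on a NUMA machine it also
            depends on how far away the memory is.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cy + l1_miss * (l2_cy + l2_miss * mem_cy)


if __name__ == "__main__":
    # Hypothetical hit rates, local versus more distant memory.
    for mem_cy in (100.0, 200.0):
        print(mem_cy, average_access_cycles(0.95, 0.80, mem_cy=mem_cy))
```

Because the memory term is multiplied by both miss rates, good cache reuse hides most of the extra cost of remote memory in this model.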
NUMAflex™ Flexible Configuration
Scale in Any and All Dimensions
[Figure: configuration space with axes I/O, CPU, and Storage; example workloads: web serving, weather simulation, repository / archive, signal processing, media streaming, traditional big supercomputer]