Parallel Computers Today
LANL / IBM Roadrunner: > 1 PFLOPS
Two Nvidia 8800 GPUs: > 1 TFLOPS
Intel 80-core chip: > 1 TFLOPS
TFLOPS = 10^12 floating point ops/sec
PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15)
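To make the units concrete, here is a minimal Python sketch of what the factor of 1000 between TFLOPS and PFLOPS means for run time. The workload of 10^18 operations and the helper name `seconds_to_run` are illustrative assumptions, not from the slides, and the estimate assumes a machine sustains its peak rate, which real applications rarely do.

```python
# Unit prefixes for floating point rates (operations per second).
TFLOPS = 10**12
PFLOPS = 10**15


def seconds_to_run(total_flops, rate_flops_per_sec):
    """Idealized run time: assumes the machine sustains the given rate."""
    return total_flops / rate_flops_per_sec


# Hypothetical workload, chosen only for illustration: 10^18 operations.
workload = 10**18

t_tera = seconds_to_run(workload, 1 * TFLOPS)  # 1,000,000 s, about 11.6 days
t_peta = seconds_to_run(workload, 1 * PFLOPS)  # 1,000 s, about 17 minutes

print(t_tera, t_peta)
```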
Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)
Beowulf (18-processor cluster, lab machine)
AMD Opteron quad-core die
The nVidia G80 GPU
128 streaming floating point processors
1.5 GB shared RAM with 86 GB/s bandwidth
500 Gflop/s on one chip (single precision)
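The peak rate and the memory bandwidth quoted for the G80 together imply how much arithmetic a kernel must do per byte fetched before the chip stops being limited by memory. The sketch below works that ratio out in Python. It assumes the bandwidth figure of 86 is in gigabytes per second and that operands are 4-byte single-precision floats; the variable names are illustrative.

```python
# Figures from the slide; the bandwidth unit (gigabytes/s) is an assumption.
peak_flops = 500e9          # 500 Gflop/s, single precision
bandwidth_bytes = 86e9      # 86 GB/s to the on-board RAM
bytes_per_float = 4         # single-precision operand size

# Machine balance: operations the chip can do in the time it takes
# to move one byte (or one float) from memory.
flops_per_byte = peak_flops / bandwidth_bytes
flops_per_float = flops_per_byte * bytes_per_float

print(round(flops_per_byte, 2), round(flops_per_float, 1))
```

Under these assumptions the ratio is a little under 6 operations per byte, or roughly 23 per float loaded, so a kernel that does less work than that per operand cannot reach the quoted peak.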
The Computer Architecture Challenge
Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
Originally, because linear algebra is the middleware of scientific computing.
Nowadays, mostly for bragging rights.
PA = LU
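For reference, Gaussian elimination with partial pivoting factors a square matrix A as PA = LU, where P permutes rows, L is unit lower triangular, and U is upper triangular. The following is a minimal, unblocked Python sketch of that factorization, meant to show the arithmetic rather than how a tuned benchmark code is written. The function names `lu_partial_pivot` and `matmul` and the 3x3 test matrix are illustrative assumptions.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U.

    Returns (perm, l, u): perm[i] is the row of A that lands in row i
    of P*A, l is unit lower triangular, and u is upper triangular.
    """
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))

    for k in range(n):
        # Partial pivoting: largest entry in column k at or below row k.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]  # carry earlier multipliers along
            perm[k], perm[p] = perm[p], perm[k]
        # Eliminate the entries below the pivot, recording multipliers in L.
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]
            l[i][k] = m
            for j in range(k, n):
                u[i][j] -= m * u[k][j]

    for i in range(n):
        l[i][i] = 1.0
    return perm, l, u


def matmul(x, y):
    """Plain triple-loop matrix product, used only to check the result."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


a = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
perm, l, u = lu_partial_pivot(a)

pa = [a[i] for i in perm]   # apply the row permutation P to A
lu = matmul(l, u)
max_err = max(abs(pa[i][j] - lu[i][j]) for i in range(3) for j in range(3))
print(perm, max_err)
```

The three nested loops perform about (2/3)n^3 floating point operations on n^2 data, which is why dense elimination rewards machines built for high arithmetic rates.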
Top 500 List
Generic Parallel Machine Architecture
Key architecture question: Where is the interconnect, and how fast?
Key algorithm question: Where is the data?
[Figure: storage hierarchy. Each of several processors has its own chain of Proc, Cache, L2 Cache, L3 Cache, Memory, with potential interconnects between the chains.]
Multicore SMP Systems
[Figure: block diagrams of three multicore SMP systems.]
Intel Clovertown: Core2 cores with 4MB shared L2 caches; FSBs at 10.6 GB/s each; chipset (4x64b controllers); Fully Buffered DRAM, 10.6 GB/s (write).
Sun Niagara2: eight MT UltraSparc cores, each with an FPU and 8K D$; crossbar switch, 179 GB/s (fill) and 90 GB/s (writethru); 4MB shared L2 (16 way); 4x128b FBDIMM memory controllers; Fully Buffered DRAM, 42.7 GB/s (read) and 21.3 GB/s (write).
AMD Opteron: Opteron cores, each with a 1MB victim cache; Memory Controller / HT; DDR2 DRAM at 10.6 GB/s; HT at 4 GB/s (each direction).
More Detail on GPU Architecture
Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts 1-arch/feeding_the_beast_perrone.pdf
Cray XMT (highly multithreaded shared memory)