Clusters of Computational Accelerators
Jan Prins
UNC-Chapel Hill
Topics
  - Similarity of accelerator architectures
  - Proof-of-concept kernels for high performance applications
  - New application areas for accelerator architectures
Accelerator architectures
  Existing commodity accelerators
    - Sony/Toshiba/IBM Cell BE
    - Nvidia G80 GPU, programmed through the Compute Unified Device Architecture (CUDA)
    - ATI R600 (almost)
  Related developments
    - Intel demonstrates 80-core TFlop chip
    - multicore projects
  March of progress
    - next generation GPUs
    - Roadrunner to be based on 2nd gen Cell
Cell BE and Nvidia G80
  [Figure: block diagrams of the Cell BE and the GeForce 8800 GTX]
  Similarities
    - 8 cores
    - local store
    - vectors/SIMD (4 vs 16)
    - high speed device memory
  Differences
    - Cell: integrated PPC, EIB
    - G80: local caching, extensive multithreading
Programming for the memory hierarchy
  [Figure: simple uniprocessor memory hierarchy (UMH) and simple parallel memory hierarchy (PMH); levels shown: ALU, Regs, Cache, Local Memory, Global address space]
  Figure legend
    - length: latency
    - thickness: bandwidth
    - aspect ratio: size of transfers
Accelerator memory hierarchy
  [Figure: Device Memory at the top, Local Stores below it, Vector elts at the bottom]
  Parallelism
    - Vector / SIMD
    - multithreading
    - multiprocessing
Programming accelerators
  Package the inherent parallelism available in the problem
    - to provide the concurrency and parallel slack needed at every level of the PMH
    - serialize where needed to reach an appropriate level of reuse
  Programming models
    - explicit notion of locality
    - CUDA
    - UPC
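A minimal sketch of the "serialize where needed to reach an appropriate level of reuse" idea, using blocked matrix multiplication as a stand-in example (the slide does not name a specific computation). The function name `matmul_tiled` and the `tile` parameter are hypothetical, and plain Python lists stand in for a local store; the output tiles are independent and could be spread across cores, while the loop over `k0` is serialized so that each staged tile is reused many times.

```python
def matmul_tiled(a, b, tile=2):
    """Blocked matrix product c = a * b on lists of lists (illustrative only)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # Each (i0, j0) output tile is independent: this is the parallel
            # slack that could be mapped onto separate cores.
            for k0 in range(0, k, tile):
                # Serialized loop: stage one tile of a and one of b, standing
                # in for a transfer from device memory to a local store.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Every staged element is reused up to `tile` times here.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for kk, a_val in enumerate(a_row):
                            acc += a_val * b_tile[kk][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

Larger tiles raise reuse per transfer but need more local storage, which is the trade-off the PMH picture is meant to expose.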
Clusters of Accelerators

  Scale          PMH level              Peak perf   Cost
  Rack           Global address space   20 TF       $250K
  Node           Local                  400 GF      $4K
    CPU          L2/L3
    core         L1
  Accelerator    Device                 200 GF      $1K
    SIMD         Vector
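The peak and cost figures on this slide can be compared level by level with a little arithmetic. The sketch below uses only the three (peak, cost) pairs given on the slide; the names `LEVELS`, `dollars_per_gflop`, and `peak_ratio` are hypothetical, and reading a peak ratio as "units of the inner level per unit of the outer level" is an assumption, since the slide does not state node or accelerator counts.

```python
# (peak performance in GFLOP/s, cost in dollars), taken from the slide
LEVELS = {
    "rack": (20_000, 250_000),
    "node": (400, 4_000),
    "accelerator": (200, 1_000),
}


def dollars_per_gflop(level):
    """Cost divided by peak performance for one level of the hierarchy."""
    peak_gf, cost = LEVELS[level]
    return cost / peak_gf


def peak_ratio(outer, inner):
    """Ratio of peak performance between two levels (assumed to be a unit count)."""
    return LEVELS[outer][0] / LEVELS[inner][0]


if __name__ == "__main__":
    for name in LEVELS:
        print(name, dollars_per_gflop(name), "$/GFLOP")
    print("rack/node peak ratio:", peak_ratio("rack", "node"))
    print("node/accelerator peak ratio:", peak_ratio("node", "accelerator"))
```

Under these numbers the accelerator level is the cheapest per peak GFLOP, at 5 $/GFLOP against 10 for a node and 12.5 for a rack.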
Proof of concept kernels
  Demonstrating performance of accelerator clusters
    - the challenge lies toward the bottom of the parallel memory hierarchy
    - proof-of-concept kernels can establish viability and scaling
  Example
    - n-body kernels demonstrated to achieve strong performance on Cell and G80
  Consequence
    - Folding@home clients developed for PlayStation and for PCs with high-end ATI GPUs
    - full GROMACS acceleration on Cell and NAMD acceleration on G80 underway
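To make the n-body example concrete, here is a minimal sketch of a direct all-pairs kernel in plain Python. The slide does not say which n-body formulation was used, so the all-pairs method, gravitational units with G = 1, the softening term `eps`, and the names `accelerations` and `tile` are all assumptions. Source bodies are processed one tile at a time, standing in for staging data in a local store, and each target body `i` is independent of the others.

```python
import math


def accelerations(pos, mass, tile=4, eps=1e-9):
    """All-pairs gravitational accelerations (G = 1), sources processed in tiles."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for j0 in range(0, n, tile):
        # Stage one tile of source bodies; on an accelerator this would be a
        # transfer from device memory into a local store.
        src = [(pos[j], mass[j]) for j in range(j0, min(j0 + tile, n))]
        for i in range(n):
            # Each target body i is independent, so this loop is parallel.
            xi, yi, zi = pos[i]
            ax = ay = az = 0.0
            for (xj, yj, zj), mj in src:
                dx, dy, dz = xj - xi, yj - yi, zj - zi
                # Softening keeps the self-interaction finite; it contributes
                # zero because dx = dy = dz = 0 when i == j.
                r2 = dx * dx + dy * dy + dz * dz + eps
                w = mj / (r2 * math.sqrt(r2))
                ax += dx * w
                ay += dy * w
                az += dz * w
            acc[i][0] += ax
            acc[i][1] += ay
            acc[i][2] += az
    return acc
```

Each staged source body is reused by all n targets, so the arithmetic per transferred byte grows with n; that is one plausible reason this kind of kernel suits the bottom of the hierarchy.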
New application domains
  - Database and data mining operations
  - Stream mining
Stream mining applications
  - Sampling
  - Aggregation
  - Summarization
  - Clustering
  - Dimensionality reduction
      - PCA, SVD
      - subspace clustering
  - Classification
  - Anomaly detection
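As one illustration of the "Sampling" item, the sketch below implements reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. The slide does not name a sampling method, so this choice, the function name `reservoir_sample`, and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (t + 1) by drawing a
            # slot index from 0..t inclusive and replacing only if it is < k.
            j = rng.randint(0, t)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 5, random.Random(0))
    print(sample)
```

Memory use is fixed at k items no matter how long the stream runs, and each item is examined exactly once.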
Challenges
  - Continuous data flow
  - Limited storage space
  - Limited communication bandwidth through the hierarchy
  - Detecting and modeling changes
  - Visualization
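The first two challenges, continuous data flow and limited storage, are commonly met with one-pass algorithms that keep constant state. As a hedged sketch, the class below maintains a running mean and variance with Welford's method; the slide does not prescribe this technique, and the name `RunningStats` and its methods are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance in constant space (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value from the stream into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of the values seen so far (0.0 if fewer than two)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
    print(stats.mean, stats.variance())
```

The state is three numbers regardless of stream length, so nothing has to be stored or re-read from a higher level of the hierarchy.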
Conclusions
  - Techniques to effectively exploit accelerator clusters are relatively independent of the particular choice of accelerator
  - Application demonstrations can follow a spiral development model focusing on implementation of key kernels
  - Data mining and stream mining are important application areas that may be well served by accelerator architectures