Research in Edge Computing
PI: Dinesh Manocha
Co-PI: Ming C. Lin
UNC Chapel Hill
PM: Larry Skelly, DTO
DTO Contract Period: October 1, 2006 – September 30, 2007
Administered through RDECOM
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Cell Processor
- Cell-based products increasingly visible:
  - PlayStation 3
  - dual-Cell blades for IBM blade centers
  - accelerator cards
  - HDTVs
  - Cell clusters for supercomputing, e.g. the LANL Roadrunner (a petaFLOP machine)
- Very high-speed computation is possible when the Cell is programmed carefully
Cell Processor: Overview
- Cell architecture
  - PowerPC Element
  - Synergistic Processing Elements
  - interconnect and memory access
- Cell programming techniques
  - SIMD operations
  - data access and storage use
  - application partitioning
- Compilers and tools
Introduction to the Cell BE
- CBE: the "Cell Broadband Engine", also known as the BE or the Cell processor
- The CBE includes:
  - a PPC core with a "traditional" memory subsystem
  - 8 "synergistic processing elements" (SPEs)
  - a very high bandwidth internal Element Interconnect Bus (EIB)
  - 2 I/O interfaces
The PowerPC Element
- "PPE": a 64-bit PowerPC core with VMX, implementing the standard 64-bit PowerPC architecture
- Nominally intended for:
  - control functions
  - the OS and management of system resources (including SPE "threads")
  - enabling legacy software with good performance on control code
- Standard hierarchical memory: L1 and L2 on chip
- Simple microarchitecture: dual-thread, in-order, dual-issue
- VMX (AltiVec): 128-bit SIMD vector registers and operations
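For concreteness, a minimal sketch of a VMX/AltiVec operation on the PPE in C, using the standard altivec.h intrinsics (compile with -maltivec); the function name is illustrative:

    #include <altivec.h>   /* VMX/AltiVec intrinsics, available on the PPE */

    /* 4-wide single-precision add: one vec_add intrinsic compiles to a
       single VMX SIMD instruction on 128-bit vector registers. */
    vector float vmx_add4(vector float a, vector float b)
    {
        return vec_add(a, b);
    }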
The Synergistic Processing Element
- Synergistic Processing Unit (SPU)
  - a compute-optimized processing element
  - 128-bit-wide data path
  - 128 128-bit-wide registers
  - branch hints (no branch prediction)
- Local Store (LS)
  - 256 KB local to the SPU, holding both instructions and data
  - flat memory (no caches), no protection
- Memory Flow Controller (MFC)
  - communication (DMA, mailboxes)
  - synchronization between the SPE and the other processing elements in the system
Element Interconnect Bus
- The EIB consists of four rings
  - each ring is 16 B wide
  - two clockwise rings, two counterclockwise rings
- 12 units are attached to the rings: the 8 SPEs, the PPC core, the MIF (external main memory), and the 2 I/O interfaces
  - each unit is attached to all four rings
- Peak bus bandwidth is 96 B per core cycle each way: four rings, at most 3 simultaneous transfers per ring, at 8 B per core cycle each (4 × 3 × 8 B = 96 B)
The Cell Architecture vs. a Conventional CPU (Pentium): The Memory Subsystem
- SPE memory architecture
  - each SPU has its own "flat" local memory
  - this memory is managed by explicit software control: code and data are moved in and out by DMA
  - programming the DMA is explicit in SPU code (or PPC code)
- Many DMA transactions can be "in flight" simultaneously
  - an SPE can have 16 simultaneous outstanding DMA requests
  - DMA latencies can be hidden using multiple buffers and loop blocking in code (see the sketch below)
  - this contrasts with traditional hierarchical memory architectures, which support few simultaneous memory transactions
- Implications for programming the CBE: applications must be partitioned across the processing elements, taking into account the limited local memory available to each SPU
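A minimal sketch of this explicit, double-buffered DMA pattern in SPU-side C, assuming the spu_mfcio.h interface from the IBM SDK; the chunk size and the process() routine are illustrative assumptions:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096                 /* illustrative transfer size (bytes) */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    void process(char *data, int n);   /* hypothetical compute routine */

    /* Double buffering: DMA the next chunk into one buffer while
       computing on the other, hiding the DMA latency. */
    void stream(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);     /* start first transfer */
        for (int i = 1; i <= nchunks; i++) {
            int nxt = cur ^ 1;
            if (i < nchunks)                          /* prefetch chunk i */
                mfc_get(buf[nxt], ea + (uint64_t)i * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* wait for chunk i-1 */
            mfc_read_tag_status_all();
            process((char *)buf[cur], CHUNK);
            cur = nxt;
        }
    }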
GPU vs. Cell
- GPU: coprocessor to a host CPU, off-chip. Cell: host processor (PPU) on-chip.
- GPU: porting code is not an option; code must be rewritten from scratch. Cell: code ports to the PPU with no major modifications (but not so for the SPUs).
- GPU: graphics-centric data operations. Cell: vectors of standard variable types.
- GPU: communication through device memory. Cell: fast on-chip ring bus between the cores.
- GPU: cache sizes small. Cell: cache and SPE LS sizes known.
- GPU: not completely IEEE FP compliant.
Next-Generation GPU vs. Cell
- AMD/ATI Fusion
  - CPU and GPU(s) together on one chip
  - other accelerators (e.g. video decoders) on chip
  - how does this compare to the PPE + SPE(s)?
- NVIDIA G80
  - features the CUDA programming model and tools
  - massively multi-threaded
- GPUs are evolving into many-core computing (e.g. Intel/Microsoft)
Cell: SIMD Vector Instructions
- Example: a 4-wide add (add vc,va,vb)
  - each of the 4 elements in register VA is added to the corresponding element in register VB
  - the 4 results are placed in the corresponding slots of register VC
- SIMD programming in C: intrinsics provide access to the SIMD assembler instructions, e.g. c = spu_add(a,b)
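A minimal SPU-side version of the slide's example, assuming the SDK's spu_intrinsics.h header and its vec_int4 type (four 32-bit integers per 128-bit register); the function name is illustrative:

    #include <spu_intrinsics.h>

    /* One spu_add intrinsic compiles to a single 4-wide SIMD add:
       the "add vc,va,vb" form shown above. */
    vec_int4 add4(vec_int4 va, vec_int4 vb)
    {
        return spu_add(va, vb);
    }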
Data-Parallel Programming: Application Partitioning
- SPMD (Single Program Multiple Data) approach
- Data blocks are partitioned into as many sub-blocks as there are SPEs (see the PPE-side sketch below)
- May require coordination among the SPEs between functions, e.g. if there is interaction between data sub-blocks
- Essentially all data movement is SPE-to-main-memory or main-memory-to-SPE
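A hedged PPE-side sketch of this partitioning, assuming the libspe2 interface (spe_context_create / spe_program_load / spe_context_run) with one pthread per SPE; the embedded SPE program handle spmd_kernel and the control-block layout are illustrative assumptions, and each SPE is expected to DMA its own sub-block in and out:

    #include <stdint.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define NUM_SPES 8

    extern spe_program_handle_t spmd_kernel;  /* hypothetical embedded SPE program */

    typedef struct {          /* control block read by each SPE via DMA */
        uint64_t ea;          /* effective address of this SPE's sub-block */
        uint32_t n;           /* number of elements in the sub-block */
        uint32_t pad;
    } cb_t;

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_program_load(ctx, &spmd_kernel);
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);  /* arg becomes the SPE's argp */
        spe_context_destroy(ctx);
        return NULL;
    }

    /* Partition data[0..n) into NUM_SPES contiguous sub-blocks, one per SPE. */
    void spmd_launch(float *data, uint32_t n)
    {
        static cb_t cb[NUM_SPES] __attribute__((aligned(16)));
        pthread_t th[NUM_SPES];
        uint32_t chunk = (n + NUM_SPES - 1) / NUM_SPES;
        for (int i = 0; i < NUM_SPES; i++) {
            uint32_t lo = (uint32_t)i * chunk < n ? i * chunk : n;
            uint32_t hi = (uint32_t)(i + 1) * chunk < n ? (i + 1) * chunk : n;
            cb[i].ea = (uint64_t)(uintptr_t)(data + lo);
            cb[i].n  = hi - lo;
            pthread_create(&th[i], NULL, run_spe, &cb[i]);
        }
        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(th[i], NULL);
    }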
Building Code for Cell Blades
- Executables for the PPE and the SPEs are built separately
  - the SPE executable is "shrink-wrapped" inside the PPE executable
- Compilers for the PPE: gcc and xlc, essentially standard compilers targeting PowerPC
- Compilers for the SPE: gcc (from Sony) and xlc (a "preliminary" version)
- Debugging: gdb support, static analysis
Acknowledgements
- Jeff Derby, STSM, IBM STG
- Ken Vu, STSM, IBM STG
- Sources of some diagrams (used with permission):
  - Ken Vu, STSM, IBM STG
  - Dave Luebke, Steve Molnar, NVIDIA
  - Jan Prins, Stephen Olivier, UNC-CH Computer Science
Cell Resources
Useful websites:
- developerWorks: http://www-128.ibm.com/developerworks/power/cell/
- SDK: http://www.alphaworks.ibm.com/tech/cellsw
- Simulator: http://www.alphaworks.ibm.com/tech/cellsystemsim
- XLC/C++: http://www.alphaworks.ibm.com/tech/cellcompiler
- BSC Linux for Cell: http://www.bsc.es/projects/deepcomputing/linuxoncell/
- Techlib: http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
- Support forum: http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=739&cat=28
Cell Processor for Edge Computing
Pros:
- Cell has very attractive architectural features
- Some limited support from IBM and third-party vendors (e.g. Mercury)
- Possibilities as an accelerator for high-performance computing
Cons:
- Programming support for the Cell is very limited
- The only killer app for the Cell is the PS3
- Not much is known about future generations of the Cell processor
- The recent trend toward GPGPU and many-core processing is more promising
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Proceedings of IEEE
- Based on the Edge Computing Workshop held at UNC Chapel Hill in May 2006
- Co-editors: Ming C. Lin, Michael Macedonia, Dinesh Manocha
Proceedings of IEEE: Schedule
- Proposal submitted to IEEE: Aug 2006
- Proposal approved: Sep 2006
- Author invitations: 22 authors invited to submit papers (Oct '06 – Jan '07)
  - academic researchers in computer architecture, high-performance computing, compilers, GPUs, etc.
  - industrial researchers: Intel, Microsoft, NVIDIA, ATI/AMD, Ericsson
Proceedings of IEEE: Current Status
- 13 invitations accepted
- 6 article submissions so far (undergoing review)
- The remaining articles are expected during May–June 2007
- Article details available at: http://gamma.cs.unc.edu/PROC_IEEE/
Proceedings of IEEE: Next Steps
- Article review process: work closely with the authors on revisions
- Have final versions ready by Sep–Oct '07
- Co-editors write an overview article
- Send to IEEE for final publication (Dec '07 expected timeframe)
- Risks: delays on the part of the authors; the IEEE publication schedule
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Numerical Algorithms Using GPUs
- Dense matrix computations: LU decomposition, SVD computation, QR decomposition
- Sorting
- FFT computations
- Thousands of downloads from our web sites
Numerical Algorithms Using GPUs
Overview: Graphics processors are programmable vector processors with high memory bandwidth. Our numerical algorithms on GPUs achieve high performance using:
- efficient memory representations
- pipelined data processing across the GPU stages
- data-parallel nested-loop operations
- a simple cache-based GPU memory model
- single-precision floating-point implementations

Fundamental numerical algorithms:
- LU and SVD algorithms
- fast Fourier transforms (FFT)
- dense matrix multiplication
- sparse matrix-vector multiplication (SpMV)

Advantages of our algorithms:
- efficient mapping to the rasterization operations on GPUs
- performance that improves at the GPU growth rate
- cache-efficient input data reordering
- exploitation of the high memory bandwidth

High memory bandwidth efficiency:
- 32–55 GB/s observed memory throughput in our algorithms, exploiting the GPU cache architecture
- 40 GB/s and 3.37 GFLOPS with our SpMV implementation on the NVIDIA G80

Application performance growth rate:
- measured on 3 generations of NVIDIA GeForce GPUs, released in 2004, 2005, and 2006
- improvement of 1.3–3x per year; 3–17 single-precision GFLOPS

Comparable performance:
- IMKL optimized routines with 4 threads on a $2,000 dual Opteron 280 system ≈ our algorithms on a $500 NVIDIA 7900 GTX GPU
Current Focus: Sparse Matrix Computations
- The number of non-zero elements in the matrix is much smaller than its total size
- Data mining computations: WWW search; data streaming (multi-dimensional scaling)
- Scientific computation: differential equation solvers; VLSI CAD simulation
Motivation
- The latest GPUs have very high computing power and memory bandwidth
  - NVIDIA GeForce 8800 GTX (G80): peak performance of 330 GFLOPS, peak memory bandwidth of 86 GB/s
- The latest GPUs are extremely flexible and programmable
  - CUDA on NVIDIA GPUs
  - CTM on ATI GPUs
- Sparse matrix operations can be parallelized and implemented on the GPU
Applications
- Numerical methods: linear systems, least-squares problems, eigenvalue problems
- Scientific computing and engineering: fluid simulation
- Data mining
Sparse Matrix: Challenges
- Poor temporal and spatial locality
- Indirect and irregular memory accesses
- Very high performance on GPUs requires high arithmetic intensity, i.e. many arithmetic operations relative to memory accesses
- Distributing the computation across many cores and optimizing for the memory hierarchy are both challenging for sparse matrix algorithms on the GPU
Sparse Matrix Representations
- CSR (Compressed Sparse Row) representation
  - an array of the non-zero elements and their corresponding column indices
  - row pointers
- Block CSR (BCSR) representation
  - divide the matrix into blocks
  - each block is treated like a dense matrix
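A minimal C sketch of the CSR layout just described; the type and field names are illustrative:

    /* CSR: values[] holds the nnz non-zero elements row by row,
       col_idx[] holds the column of each value, and row_ptr[i] is the
       offset in values[] where row i starts (row_ptr has nrows + 1
       entries, with row_ptr[nrows] == nnz). */
    typedef struct {
        int    nrows, nnz;
        float *values;    /* nnz non-zero elements       */
        int   *col_idx;   /* nnz column indices          */
        int   *row_ptr;   /* nrows + 1 row start offsets */
    } csr_t;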
SpMV Using CSR
- Each element of the product A*b is the dot product of a sparse row of A with the dense vector b
- Accesses to the elements of A are sequential, but accesses to b are random; only some elements of b are needed for each dot product
- CPU implementation: each dot product is computed sequentially (see the sketch below)
- GPU implementation: each multiplication in a dot product is computed in parallel, using the many processors in the GPU
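A sequential C sketch of the CPU implementation, using the csr_t layout above:

    /* y = A * b: one sequentially computed dot product per row. */
    void spmv_csr(const csr_t *A, const float *b, float *y)
    {
        for (int i = 0; i < A->nrows; i++) {
            float dot = 0.0f;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                dot += A->values[k] * b[A->col_idx[k]];  /* random access into b */
            y[i] = dot;
        }
    }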
Approach
- CSR representation for the sparse matrix-vector multiplication
- The computation is distributed evenly among many threads
Our Approach
- values: the non-zero elements of the sparse matrix
- Each cell carries its row id and column id
- Each thread computes one multiplication
[Figure: the values array tagged with row ids R1 R1 R2 R3 R3 R3 R4 R4 R4 (row 1 has 2 elements, row 2 has 1 element), mapped one element per thread T1, T2, T3, ..., Tn]
Approach (continued)
- Each computation (thread) is distributed among the processing units (a C sketch of the per-thread work follows below)
[Figure: threads T1, T2, T3, ..., Tn mapped onto the processing units]
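A hedged C sketch of the per-thread work under this one-multiplication-per-thread scheme; on the GPU this would be a CUDA kernel body with tid derived from the thread and block indices, and the row_id array and the final per-row reduction are assumptions rather than the exact implementation:

    /* Thread tid computes one product; partial products that share a
       row id must then be summed (a per-row reduction, not shown) to
       form the output element y[row_id[tid]]. */
    void spmv_thread(int tid, const float *values, const int *col_idx,
                     const int *row_id, const float *b, float *partial)
    {
        partial[tid] = values[tid] * b[col_idx[tid]];
    }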
Current Results
- Memory bandwidth: 40.32 GB/s
- Floating-point throughput: 3.37 GFLOPS
- Our GPU implementation is 30x faster than our CPU implementation (preliminary analysis)
Ongoing Work
- Register and cache blocking: reorganizing the data structure and the computation to improve reuse of values
- Sparse matrix singular value decomposition (SVD) on the GPU
- Extensive performance analysis
- Application: conjugate gradient
Ongoing Work
- Extensive performance analysis on different kinds of sparse matrices (benchmarking):
  - NAS Parallel Benchmarks: developed for the performance evaluation of highly parallel supercomputers; mimic the characteristics of large-scale computational fluid dynamics applications
  - Matrix Market (NIST): a visual repository of test data
  - SPARSITY (UCB/LLNL)
  - OSKI (Optimized Sparse Kernel Interface) (UCB)