Research in Edge Computing


1 Research in Edge Computing
PI: Dinesh Manocha
Co-PI: Ming C. Lin
UNC Chapel Hill
PM: Larry Skelly, DTO

2 DTO Contract Period: October 01, 2006 – September 30, 2007
Administered through RDECOM

3 Goals
Comparison of Cell vs. GPU features for Edge Computing
Proceedings of IEEE: Special issue on Edge Computing
GPU-based algorithms for numerical computations and data mining

4 Goals
Comparison of Cell vs. GPU features for Edge Computing
Proceedings of IEEE: Special issue on Edge Computing
GPU-based algorithms for numerical computations and data mining

5 Cell Processor
Cell-based products increasingly visible:
PlayStation 3
Dual-Cell blades for IBM blade centers
Accelerator cards
Cell clusters for supercomputing, e.g. the LANL RoadRunner (a PetaFLOP machine)
HDTV
Very high-speed computation possible when programmed carefully

6 Cell Processor: Overview
Cell Architecture: PowerPC Element; Synergistic Processing Elements; Interconnect and Memory Access
Cell Programming Techniques: SIMD operations; data access and storage use; application partitioning; compilers and tools

7 Introduction to the Cell BE
CBE, or "Cell Broadband Engine"; also known as: BE, Cell processor
CBE includes:
PPC core with "traditional" memory subsystem
8 "synergistic processing elements" (SPEs)
Very high bandwidth internal Element Interconnect Bus (EIB)
2 I/O interfaces

8 The PowerPC Element
"PPE" – 64-bit PowerPC core with VMX; standard 64-bit PowerPC architecture
Nominally intended for: control functions; OS and management of system resources (including SPE "threads"); enabling legacy software with good performance on control code
Standard hierarchical memory: L1 and L2 on chip
Simple microarchitecture: dual-thread, in-order, dual-issue
VMX (AltiVec): 128-bit SIMD vector registers and operations

9 The Synergistic Processing Element
Synergistic Processing Unit (SPU): compute-optimized processing element; 128-bit-wide data path; 128 registers, each 128 bits wide; branch hint (no branch prediction)
Local Store (LS): 256 KB local to the SPU; holds both instructions and data; flat memory (no caches); no protection
Memory Flow Controller (MFC): communication (DMA, mailbox); synchronization between the SPE and other processing elements in the system

10 Element Interconnect Bus
EIB consists of four rings: each ring is 16 bytes wide; two clockwise rings, two counter-clockwise rings
12 units attached to the rings: 8 SPEs, the PPC, the MIF (external main memory), and 2 I/O interfaces
Each unit is attached to all four rings
Peak bus bandwidth is 96 bytes per core cycle each way: four rings, with at most 3 simultaneous transfers per ring at 8 bytes per core cycle each (4 × 3 × 8 B = 96 B)

11 The Cell Architecture vs. Conventional CPU (Pentium): The Memory Subsystem
SPE memory architecture: each SPU has its own "flat" local memory; management of this memory is by explicit software control; code and data are moved into and out of this memory by DMA; programming the DMA is explicit in SPU code (or PPC code)
Many DMA transactions can be "in flight" simultaneously: an SPE can have 16 simultaneous outstanding DMA requests; DMA latencies can be hidden using multiple buffers and loop blocking in code (see the sketch below); this is in contrast to traditional, hierarchical memory architectures that support few simultaneous memory transactions
Implications for programming the CBE: applications must be partitioned across the processing elements, taking into account the limited local memory available to each SPU
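To make the multiple-buffer idea concrete, here is a minimal double-buffering sketch in C for the SPU side, using the MFC DMA intrinsics from the Cell SDK's spu_mfcio.h; the chunk size and the process_chunk helper are our own illustration, not part of the SDK:

    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA transfer; both buffers must fit in the 256 KB local store */

    static char buf[2][CHUNK] __attribute__((aligned(128)));  /* DMA prefers 128-byte alignment */

    void process_chunk(char *data, int n);  /* hypothetical per-chunk computation */

    void stream_in(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        /* kick off the first transfer on tag `cur` */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            /* prefetch the next chunk on the other tag while this one is consumed */
            if (i + 1 < nchunks)
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK, CHUNK, next, 0, 0);

            /* wait only for the buffer we are about to use */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            process_chunk(buf[cur], CHUNK);
            cur = next;
        }
    }

While one buffer is being processed, the DMA for the other is in flight, hiding most of the transfer latency.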

12 GPU vs. Cell
GPU: coprocessor to a host CPU, off-chip | Cell: host processor (PPU) on-chip
GPU: porting code not an option; must rewrite code from scratch | Cell: code ports to the PPU with no major modifications (but not so for the SPUs)
GPU: graphics-centric data operations | Cell: vectors of standard variable types
GPU: communication using device memory | Cell: fast on-chip ring bus between cores
GPU: cache sizes small | Cell: cache / SPE LS size known
Not completely IEEE FP compliant

13 Next-Gen GPU vs. Cell
AMD/ATI Fusion: CPU and GPU(s) together on one chip; other accelerators (e.g. video decoders) on chip; how does this compare to PPE + SPE(s)?
NVIDIA G80: features the CUDA programming model and tools; massively multi-threaded
GPUs are evolving into many-core computing (e.g. Intel/Microsoft)

14 Cell: SIMD Vector Instructions
Example: a 4-wide add; each of the 4 elements in register VA is added to the corresponding element in register VB, and the 4 results are placed in the corresponding slots in register VC
SIMD programming in C: intrinsics provide access to SIMD assembler instructions, e.g. c = spu_add(a, b) maps to add vc,va,vb
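A minimal sketch of this in C using the SPU intrinsics from spu_intrinsics.h (the function name and loop are ours; spu_add and the vector types come from the Cell SDK):

    #include <spu_intrinsics.h>

    /* 4-wide integer add: each lane of va[i] is added to the matching lane of vb[i] */
    void vec_add(const vector signed int *va, const vector signed int *vb,
                 vector signed int *vc, int nvec)
    {
        for (int i = 0; i < nvec; i++)
            vc[i] = spu_add(va[i], vb[i]);   /* one SIMD add handles 4 elements */
    }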

15 Data Parallel Programming: Application Partitioning
SPMD (Single Program Multiple Data) approach
Data blocks are partitioned into as many sub-blocks as there are SPEs (see the partitioning sketch below)
May require coordination among SPEs between functions, e.g. if there is interaction between data sub-blocks
Essentially all data movement is SPE-to-main-memory or main-memory-to-SPE
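A sketch of how the PPE side might carve one large array into per-SPE sub-blocks before handing them to the SPEs (NUM_SPES, the work_t descriptor and the float element type are our assumptions for illustration):

    #define NUM_SPES 8

    /* hypothetical work descriptor handed to each SPE, e.g. via a mailbox message */
    typedef struct {
        unsigned long long ea;  /* effective address of this SPE's sub-block */
        unsigned int       n;   /* number of elements in the sub-block       */
    } work_t;

    /* divide n elements as evenly as possible among the SPEs */
    void partition(unsigned long long base_ea, unsigned int n, work_t work[NUM_SPES])
    {
        unsigned int chunk = n / NUM_SPES, rem = n % NUM_SPES, offset = 0;
        for (int s = 0; s < NUM_SPES; s++) {
            work[s].n  = chunk + ((unsigned int)s < rem ? 1 : 0);  /* spread the remainder */
            work[s].ea = base_ea + (unsigned long long)offset * sizeof(float);
            offset    += work[s].n;
        }
    }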

16 Building Code for Cell Blades
Executables for the PPE and SPE are built separately; the SPE executable is "shrink-wrapped" inside the PPE executable
Compilers for the PPE: gcc, xlc (essentially standard compilers targeting PowerPC)
Compilers for the SPE: gcc (from Sony), xlc ("preliminary" version)
Debugging: gdb support, static analysis

17 Acknowledgements
Jeff Derby, STSM, IBM STG
Ken Vu, STSM, IBM STG
Source of some diagrams (used with permission):
Ken Vu, STSM, IBM STG
Dave Luebke, Steve Molnar, NVIDIA
Jan Prins, Stephen Olivier, UNC-CH Computer Science

18 Cell Resources
Useful websites: developerWorks, SDK, Simulator, XLC/C++, BSC Linux for Cell, Techlib, support forum

19 Cell Processor for Edge Computing
Pros:
Cell has very nice architectural features
Some limited support from IBM and 3rd-party vendors (e.g. Mercury)
Possibilities as an accelerator for high-performance computing
Cons:
Programming support for Cell is very limited
The only killer app for Cell is the PS3
Not much is known about future generations of the Cell processor
The recent trend of GPGPU and many-core processing is more promising

20 Goals
Comparison of Cell vs. GPU features for Edge Computing
Proceedings of IEEE: Special issue on Edge Computing
GPU-based algorithms for numerical computations and data mining

21 Proceedings of IEEE
Based on the Edge Computing Workshop held at UNC Chapel Hill in May 2006
Co-editors: Ming C. Lin, Michael Macedonia, Dinesh Manocha

22 Proceedings of IEEE: Schedule
Proposal submitted to IEEE: Aug 2006
Proposal approved: Sep 2006
Author invitations: invited 22 authors to submit papers (Oct '06 – Jan '07)
Academic researchers in computer architecture, high-performance computing, compilers, GPUs, etc.
Industrial researchers: Intel, Microsoft, NVIDIA, ATI/AMD, Ericsson

23 Proceedings of IEEE: Current Status
13 invitations accepted
6 article submissions so far (undergoing review)
The rest of the articles are expected during May–June 2007
Article details available at:

24 Proceedings of IEEE: Next Steps
Article review process: work closely with the authors on revisions
Have final versions ready (by Sep–Oct '07)
Co-editors write an overview article
Send to IEEE for final publication (Dec '07 expected timeframe)
Risks: delays on the part of the authors; the IEEE publication schedule

25 Goals
Comparison of Cell vs. GPU features for Edge Computing
Proceedings of IEEE: Special issue on Edge Computing
GPU-based algorithms for numerical computations and data mining

26 Numerical algorithms using GPUs
Dense matrix computations: LU decomposition, SVD computation, QR decomposition
Sorting
FFT computations
Thousands of downloads from our WWW sites

27 Numerical algorithms using GPUs
Overview: graphics processors are programmable vector processors with high memory bandwidth. Our novel numerical algorithms on GPUs achieve high performance using: efficient memory representations; pipelined data processing across the different GPU stages; data-parallel nested-loop operations; a simple cache-based GPU memory model; single-precision floating-point implementations.
Fundamental numerical algorithms: LU and SVD algorithms; fast Fourier transforms (FFT); dense matrix multiplication; sparse matrix-vector multiplication (SpMV)
Advantages of our algorithms: efficient mapping to the rasterization operations on GPUs; performance progress at the GPU growth rate; cache-efficient input data reordering; exploitation of the high memory bandwidth
High memory bandwidth efficiency: 32–55 GB/s observed memory throughput in our algorithms; exploiting the GPU cache architecture; we achieve 40 GB/s and 3.37 GFLOPS with our SpMV implementation on the NVIDIA G80
Application performance growth rate: performance on 3 generations of NVIDIA GeForce GPUs, released in 2004, 2005 and 2006; improvement of 1.3–3× per year; 3–17 single-precision GFLOPS
Comparable performance: IMKL optimized routines with 4 threads on a $2,000 dual Opteron 280 system ≈ our algorithms on a $500 NVIDIA 7900 GTX GPU

28 Current Focus: Sparse Matrix Computations
The number of non-zero elements in the matrix is much smaller than its total size
Data mining computations: WWW search; data streaming (multi-dimensional scaling)
Scientific computation: differential equation solvers; VLSI CAD simulation

29 Motivation
Latest GPUs have very high computing power and memory bandwidth
On the NVIDIA GeForce 8800 GTX (G80): peak performance 330 GFLOPS; peak memory bandwidth 86 GB/s
Latest GPUs are extremely flexible and programmable: CUDA on NVIDIA GPUs; CTM on ATI GPUs
Sparse matrix operations can be parallelized and implemented on the GPU

30 Applications
Numerical methods: linear systems, least-squares problems, eigenvalue problems
Scientific computing and engineering: fluid simulation
Data mining

31 Sparse Matrix: Challenges
Poor temporal and spatial locality
Indirect and irregular memory accesses
To get very high performance from GPUs, we need very high arithmetic intensity relative to the memory accesses (see the estimate below)
Distributing the computation across many cores and optimizing for the memory hierarchy is challenging for sparse matrix algorithms on the GPU
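As a rough, back-of-the-envelope illustration of why this matters (our estimate, assuming 4-byte single-precision values and 4-byte column indices, as in the CSR layout on the next slide):

    /* CSR SpMV moves roughly 12 bytes per non-zero (value, column index, vector
       element) and performs 2 FLOPs (one multiply, one add), so the kernel is
       memory-bound: its GFLOPS ceiling is bandwidth * (2 / 12). */
    double spmv_gflops_bound(double bandwidth_gb_per_s)
    {
        const double flops_per_nnz = 2.0;
        const double bytes_per_nnz = 12.0;  /* 4 B value + 4 B index + 4 B vector element */
        return bandwidth_gb_per_s * flops_per_nnz / bytes_per_nnz;
        /* e.g. 40 GB/s -> ~6.7 GFLOPS ceiling; the observed 3.37 GFLOPS sits below
           this once row pointers and imperfect reuse of the vector are counted */
    }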

32 Sparse matrix representations
CSR (Compressed Sparse Row) representation: an array of non-zero elements and corresponding column indices, plus row pointers (a minimal C sketch follows)
Block CSR (BCSR) representation: divide the matrix into blocks; each block is treated like a dense matrix
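A minimal C sketch of the CSR layout (the field names are our own):

    /* Compressed Sparse Row: an m x n matrix with nnz non-zero entries */
    typedef struct {
        int    m, n, nnz;
        float *val;      /* nnz non-zero values, stored row by row */
        int   *col_idx;  /* nnz column indices, one per value      */
        int   *row_ptr;  /* m + 1 offsets: row i occupies val[row_ptr[i]] .. val[row_ptr[i+1]-1] */
    } csr_t;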

33 SpMV using CSR
Each element of the product (A * b) is the dot product of a sparse row of A and the dense vector b
Accesses to the elements of A are sequential, but accesses to b are random; only some elements of b are required for each dot product
CPU implementation: each dot product is computed sequentially
GPU implementation: each multiplication in the dot product is computed in parallel using the many processors on the GPU (see the sketch below)
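A reference CSR SpMV in C, written the way the sequential CPU version computes each dot product; in the GPU version each multiply in the inner loop would instead be assigned to its own thread (a sketch reusing the csr_t layout above):

    /* y = A * b, with A stored in CSR form */
    void spmv_csr(const csr_t *A, const float *b, float *y)
    {
        for (int i = 0; i < A->m; i++) {              /* one dot product per row      */
            float sum = 0.0f;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->val[k] * b[A->col_idx[k]];  /* sequential in A, random in b */
            y[i] = sum;
        }
    }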

34 Approach
CSR representation for the sparse matrix-vector multiplication
Computation is distributed evenly among many threads

35 Our Approach
values: the non-zero elements in the sparse matrix
Each cell carries its row id and column id
Each thread computes one multiplication
[Diagram: the values array annotated with row ids (e.g. row 1 has 2 elements, row 2 has 1 element), mapped one element per thread T1 … Tn]

36 Approach (continued)
Each computation (thread) is distributed among the processing units
[Diagram: threads T1 … Tn mapped onto the GPU's processing units]

37 Current Results
Performance numbers: memory bandwidth 40.32 GB/s; floating-point throughput 3.37 GFLOPS
Our GPU implementation is 30 times faster than our CPU implementation (preliminary analysis)

38 Ongoing work
Register and cache blocking: reorganization of the data structure and computation to improve reuse of values
Sparse matrix singular value decomposition (SVD) on the GPU
Extensive performance analysis
Application: conjugate gradient

39 Ongoing work
Extensive performance analysis on different kinds of sparse matrices (benchmarking):
NAS Parallel Benchmarks – developed for the performance evaluation of highly parallel supercomputers; mimic the characteristics of large-scale computational fluid dynamics applications
Matrix Market (NIST) – a visual repository of test data
SPARSITY – UCB/LLNL
OSKI (Optimized Sparse Kernel Interface) – UCB

