Research in Edge Computing
PI: Dinesh Manocha
Co-PI: Ming C. Lin
UNC Chapel Hill
PM: Larry Skelly, DTO
DTO Contract Period: October 1, 2006 – September 30, 2007
Administered through RDECOM
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Cell Processor
- Cell-based products increasingly visible:
  - PlayStation 3
  - dual-Cell blades for IBM blade centers
  - accelerator cards
  - HDTVs
  - Cell clusters for supercomputing, e.g. the LANL Roadrunner (a petaFLOP machine)
- Very high-speed computation is possible when the Cell is programmed carefully
Cell Processor: Overview
- Cell architecture
  - PowerPC Element
  - Synergistic Processing Elements
  - interconnect and memory access
- Cell programming techniques
  - SIMD operations
  - data access and storage use
  - application partitioning
- Compilers and tools
Introduction to the Cell BE
- CBE: the "Cell Broadband Engine", also known as the BE or the Cell processor
- The CBE includes:
  - a PPC core with a "traditional" memory subsystem
  - 8 "synergistic processing elements" (SPEs)
  - a very high bandwidth internal Element Interconnect Bus (EIB)
  - 2 I/O interfaces
The PowerPC Element
- "PPE": a 64-bit PowerPC core with VMX, implementing the standard 64-bit PowerPC architecture
- Nominally intended for:
  - control functions
  - the OS and management of system resources (including SPE "threads")
  - enabling legacy software with good performance on control code
- Standard hierarchical memory: L1 and L2 on chip
- Simple microarchitecture: dual-thread, in-order, dual-issue
- VMX (AltiVec): 128-bit SIMD vector registers and operations
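For concreteness, a minimal sketch of a VMX/AltiVec operation on the PPE in C, using the standard altivec.h intrinsics (compile with -maltivec); the function name is illustrative:

    #include <altivec.h>   /* VMX/AltiVec intrinsics, available on the PPE */

    /* 4-wide single-precision add: one vec_add intrinsic compiles to a
       single VMX SIMD instruction on 128-bit vector registers. */
    vector float vmx_add4(vector float a, vector float b)
    {
        return vec_add(a, b);
    }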
The Synergistic Processing Element
- Synergistic Processing Unit (SPU)
  - a compute-optimized processing element
  - 128-bit-wide data path
  - 128 128-bit-wide registers
  - branch hints (no branch prediction)
- Local Store (LS)
  - 256 KB local to the SPU, holding both instructions and data
  - flat memory (no caches), no protection
- Memory Flow Controller (MFC)
  - communication (DMA, mailboxes)
  - synchronization between the SPE and the other processing elements in the system
Element Interconnect Bus
- The EIB consists of four rings
  - each ring is 16 B wide
  - two clockwise rings, two counterclockwise rings
- 12 units are attached to the rings: the 8 SPEs, the PPC core, the MIF (external main memory), and the 2 I/O interfaces
  - each unit is attached to all four rings
- Peak bus bandwidth is 96 B per core cycle each way: four rings, at most 3 simultaneous transfers per ring, at 8 B per core cycle each (4 × 3 × 8 B = 96 B)
The Cell Architecture vs. a Conventional CPU (Pentium): The Memory Subsystem
- SPE memory architecture
  - each SPU has its own "flat" local memory
  - this memory is managed by explicit software control: code and data are moved in and out by DMA
  - programming the DMA is explicit in SPU code (or PPC code)
- Many DMA transactions can be "in flight" simultaneously
  - an SPE can have 16 simultaneous outstanding DMA requests
  - DMA latencies can be hidden using multiple buffers and loop blocking in code (see the sketch below)
  - this contrasts with traditional hierarchical memory architectures, which support few simultaneous memory transactions
- Implications for programming the CBE: applications must be partitioned across the processing elements, taking into account the limited local memory available to each SPU
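A minimal sketch of this explicit, double-buffered DMA pattern in SPU-side C, assuming the spu_mfcio.h interface from the IBM SDK; the chunk size and the process() routine are illustrative assumptions:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096                 /* illustrative transfer size (bytes) */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    void process(char *data, int n);   /* hypothetical compute routine */

    /* Double buffering: DMA the next chunk into one buffer while
       computing on the other, hiding the DMA latency. */
    void stream(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);     /* start first transfer */
        for (int i = 1; i <= nchunks; i++) {
            int nxt = cur ^ 1;
            if (i < nchunks)                          /* prefetch chunk i */
                mfc_get(buf[nxt], ea + (uint64_t)i * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* wait for chunk i-1 */
            mfc_read_tag_status_all();
            process((char *)buf[cur], CHUNK);
            cur = nxt;
        }
    }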
GPU vs. Cell
- GPU: coprocessor to a host CPU, off-chip. Cell: host processor (PPU) on-chip.
- GPU: porting code is not an option; code must be rewritten from scratch. Cell: code ports to the PPU with no major modifications (but not so for the SPUs).
- GPU: graphics-centric data operations. Cell: vectors of standard variable types.
- GPU: communication through device memory. Cell: fast on-chip ring bus between the cores.
- GPU: cache sizes small. Cell: cache and SPE LS sizes known.
- GPU: not completely IEEE FP compliant.
Next-Generation GPU vs. Cell
- AMD/ATI Fusion
  - CPU and GPU(s) together on one chip
  - other accelerators (e.g. video decoders) on chip
  - how does this compare to the PPE + SPE(s)?
- NVIDIA G80
  - features the CUDA programming model and tools
  - massively multi-threaded
- GPUs are evolving into many-core computing (e.g. Intel/Microsoft)
Cell: SIMD Vector Instructions
- Example: a 4-wide add (add vc,va,vb)
  - each of the 4 elements in register VA is added to the corresponding element in register VB
  - the 4 results are placed in the corresponding slots of register VC
- SIMD programming in C: intrinsics provide access to the SIMD assembler instructions, e.g. c = spu_add(a,b)
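A minimal SPU-side version of the slide's example, assuming the SDK's spu_intrinsics.h header and its vec_int4 type (four 32-bit integers per 128-bit register); the function name is illustrative:

    #include <spu_intrinsics.h>

    /* One spu_add intrinsic compiles to a single 4-wide SIMD add:
       the "add vc,va,vb" form shown above. */
    vec_int4 add4(vec_int4 va, vec_int4 vb)
    {
        return spu_add(va, vb);
    }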
Data-Parallel Programming: Application Partitioning
- SPMD (Single Program Multiple Data) approach
- Data blocks are partitioned into as many sub-blocks as there are SPEs (see the PPE-side sketch below)
- May require coordination among the SPEs between functions, e.g. if there is interaction between data sub-blocks
- Essentially all data movement is SPE-to-main-memory or main-memory-to-SPE
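A hedged PPE-side sketch of this partitioning, assuming the libspe2 interface (spe_context_create / spe_program_load / spe_context_run) with one pthread per SPE; the embedded SPE program handle spmd_kernel and the control-block layout are illustrative assumptions, and each SPE is expected to DMA its own sub-block in and out:

    #include <stdint.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define NUM_SPES 8

    extern spe_program_handle_t spmd_kernel;  /* hypothetical embedded SPE program */

    typedef struct {          /* control block read by each SPE via DMA */
        uint64_t ea;          /* effective address of this SPE's sub-block */
        uint32_t n;           /* number of elements in the sub-block */
        uint32_t pad;
    } cb_t;

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_program_load(ctx, &spmd_kernel);
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);  /* arg becomes the SPE's argp */
        spe_context_destroy(ctx);
        return NULL;
    }

    /* Partition data[0..n) into NUM_SPES contiguous sub-blocks, one per SPE. */
    void spmd_launch(float *data, uint32_t n)
    {
        static cb_t cb[NUM_SPES] __attribute__((aligned(16)));
        pthread_t th[NUM_SPES];
        uint32_t chunk = (n + NUM_SPES - 1) / NUM_SPES;
        for (int i = 0; i < NUM_SPES; i++) {
            uint32_t lo = (uint32_t)i * chunk < n ? i * chunk : n;
            uint32_t hi = (uint32_t)(i + 1) * chunk < n ? (i + 1) * chunk : n;
            cb[i].ea = (uint64_t)(uintptr_t)(data + lo);
            cb[i].n  = hi - lo;
            pthread_create(&th[i], NULL, run_spe, &cb[i]);
        }
        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(th[i], NULL);
    }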
Building Code for Cell Blades
- Executables for the PPE and the SPEs are built separately
  - the SPE executable is "shrink-wrapped" inside the PPE executable
- Compilers for the PPE: gcc and xlc, essentially standard compilers targeting PowerPC
- Compilers for the SPE: gcc (from Sony) and xlc (a "preliminary" version)
- Debugging: gdb support, static analysis
Acknowledgements
- Jeff Derby, STSM, IBM STG
- Ken Vu, STSM, IBM STG
- Sources of some diagrams (used with permission):
  - Ken Vu, STSM, IBM STG
  - Dave Luebke, Steve Molnar, NVIDIA
  - Jan Prins, Stephen Olivier, UNC-CH Computer Science
Cell Resources
Useful websites:
- developerWorks: http://www-128.ibm.com/developerworks/power/cell/
- SDK: http://www.alphaworks.ibm.com/tech/cellsw
- Simulator: http://www.alphaworks.ibm.com/tech/cellsystemsim
- XLC/C++: http://www.alphaworks.ibm.com/tech/cellcompiler
- BSC Linux for Cell: http://www.bsc.es/projects/deepcomputing/linuxoncell/
- Techlib: http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
- Support forum: http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=739&cat=28
Cell Processor for Edge Computing
Pros:
- Cell has very attractive architectural features
- Some limited support from IBM and third-party vendors (e.g. Mercury)
- Possibilities as an accelerator for high-performance computing
Cons:
- Programming support for the Cell is very limited
- The only killer app for the Cell is the PS3
- Not much is known about future generations of the Cell processor
- The recent trend toward GPGPU and many-core processing is more promising
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Proceedings of IEEE
- Based on the Edge Computing Workshop held at UNC Chapel Hill in May 2006
- Co-editors: Ming C. Lin, Michael Macedonia, Dinesh Manocha
Proceedings of IEEE: Schedule
- Proposal submitted to IEEE: Aug 2006
- Proposal approved: Sep 2006
- Author invitations: 22 authors invited to submit papers (Oct '06 – Jan '07)
  - academic researchers in computer architecture, high-performance computing, compilers, GPUs, etc.
  - industrial researchers: Intel, Microsoft, NVIDIA, ATI/AMD, Ericsson
Proceedings of IEEE: Current Status
- 13 invitations accepted
- 6 article submissions so far (undergoing review)
- The remaining articles are expected during May–June 2007
- Article details available at: http://gamma.cs.unc.edu/PROC_IEEE/
Proceedings of IEEE: Next Steps
- Article review process: work closely with the authors on revisions
- Have final versions ready by Sep–Oct '07
- Co-editors write an overview article
- Send to IEEE for final publication (Dec '07 expected timeframe)
- Risks: delays on the part of the authors; the IEEE publication schedule
Goals
- Comparison of Cell vs. GPU features for Edge Computing
- Proceedings of IEEE: special issue on Edge Computing
- GPU-based algorithms for numerical computations and data mining
Numerical Algorithms Using GPUs
- Dense matrix computations: LU decomposition, SVD computation, QR decomposition
- Sorting
- FFT computations
- Thousands of downloads from our web sites
Numerical Algorithms Using GPUs
Overview: Graphics processors are programmable vector processors with high memory bandwidth. Our numerical algorithms on GPUs achieve high performance using:
- efficient memory representations
- pipelined data processing across the GPU stages
- data-parallel nested-loop operations
- a simple cache-based GPU memory model
- single-precision floating-point implementations

Fundamental numerical algorithms:
- LU and SVD algorithms
- fast Fourier transforms (FFT)
- dense matrix multiplication
- sparse matrix-vector multiplication (SpMV)

Advantages of our algorithms:
- efficient mapping to the rasterization operations on GPUs
- performance that improves at the GPU growth rate
- cache-efficient input data reordering
- exploitation of the high memory bandwidth

High memory bandwidth efficiency:
- 32–55 GB/s observed memory throughput in our algorithms, exploiting the GPU cache architecture
- 40 GB/s and 3.37 GFLOPS with our SpMV implementation on the NVIDIA G80

Application performance growth rate:
- measured on 3 generations of NVIDIA GeForce GPUs, released in 2004, 2005, and 2006
- improvement of 1.3–3x per year; 3–17 single-precision GFLOPS

Comparable performance:
- IMKL optimized routines with 4 threads on a $2,000 dual Opteron 280 system ≈ our algorithms on a $500 NVIDIA 7900 GTX GPU
Current Focus: Sparse Matrix Computations
- The number of non-zero elements in the matrix is much smaller than its total size
- Data mining computations: WWW search; data streaming (multi-dimensional scaling)
- Scientific computation: differential equation solvers; VLSI CAD simulation
Motivation
- The latest GPUs have very high computing power and memory bandwidth
  - NVIDIA GeForce 8800 GTX (G80): peak performance of 330 GFLOPS, peak memory bandwidth of 86 GB/s
- The latest GPUs are extremely flexible and programmable
  - CUDA on NVIDIA GPUs
  - CTM on ATI GPUs
- Sparse matrix operations can be parallelized and implemented on the GPU
Applications
- Numerical methods: linear systems, least-squares problems, eigenvalue problems
- Scientific computing and engineering: fluid simulation
- Data mining
Sparse Matrix: Challenges
- Poor temporal and spatial locality
- Indirect and irregular memory accesses
- Very high performance on GPUs requires high arithmetic intensity, i.e. many arithmetic operations relative to memory accesses
- Distributing the computation across many cores and optimizing for the memory hierarchy are both challenging for sparse matrix algorithms on the GPU
Sparse Matrix Representations
- CSR (Compressed Sparse Row) representation
  - an array of the non-zero elements and their corresponding column indices
  - row pointers
- Block CSR (BCSR) representation
  - divide the matrix into blocks
  - each block is treated like a dense matrix
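A minimal C sketch of the CSR layout just described; the type and field names are illustrative:

    /* CSR: values[] holds the nnz non-zero elements row by row,
       col_idx[] holds the column of each value, and row_ptr[i] is the
       offset in values[] where row i starts (row_ptr has nrows + 1
       entries, with row_ptr[nrows] == nnz). */
    typedef struct {
        int    nrows, nnz;
        float *values;    /* nnz non-zero elements       */
        int   *col_idx;   /* nnz column indices          */
        int   *row_ptr;   /* nrows + 1 row start offsets */
    } csr_t;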
SpMV Using CSR
- Each element of the product A*b is the dot product of a sparse row of A with the dense vector b
- Accesses to the elements of A are sequential, but accesses to b are random; only some elements of b are needed for each dot product
- CPU implementation: each dot product is computed sequentially (see the sketch below)
- GPU implementation: each multiplication in a dot product is computed in parallel, using the many processors in the GPU
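A sequential C sketch of the CPU implementation, using the csr_t layout above:

    /* y = A * b: one sequentially computed dot product per row. */
    void spmv_csr(const csr_t *A, const float *b, float *y)
    {
        for (int i = 0; i < A->nrows; i++) {
            float dot = 0.0f;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                dot += A->values[k] * b[A->col_idx[k]];  /* random access into b */
            y[i] = dot;
        }
    }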
Approach
- CSR representation for the sparse matrix-vector multiplication
- The computation is distributed evenly among many threads
Our Approach
- values: the non-zero elements of the sparse matrix
- Each cell carries its row id and column id
- Each thread computes one multiplication
[Figure: the values array tagged with row ids R1 R1 R2 R3 R3 R3 R4 R4 R4 (row 1 has 2 elements, row 2 has 1 element), mapped one element per thread T1, T2, T3, ..., Tn]
Approach (continued)
- Each computation (thread) is distributed among the processing units (a C sketch of the per-thread work follows below)
[Figure: threads T1, T2, T3, ..., Tn mapped onto the processing units]
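A hedged C sketch of the per-thread work under this one-multiplication-per-thread scheme; on the GPU this would be a CUDA kernel body with tid derived from the thread and block indices, and the row_id array and the final per-row reduction are assumptions rather than the exact implementation:

    /* Thread tid computes one product; partial products that share a
       row id must then be summed (a per-row reduction, not shown) to
       form the output element y[row_id[tid]]. */
    void spmv_thread(int tid, const float *values, const int *col_idx,
                     const int *row_id, const float *b, float *partial)
    {
        partial[tid] = values[tid] * b[col_idx[tid]];
    }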
Current Results
- Memory bandwidth: 40.32 GB/s
- Floating-point throughput: 3.37 GFLOPS
- Our GPU implementation is 30x faster than our CPU implementation (preliminary analysis)
Ongoing Work
- Register and cache blocking: reorganizing the data structure and the computation to improve reuse of values
- Sparse matrix singular value decomposition (SVD) on the GPU
- Extensive performance analysis
- Application: conjugate gradient
Ongoing Work
- Extensive performance analysis on different kinds of sparse matrices (benchmarking):
  - NAS Parallel Benchmarks: developed for the performance evaluation of highly parallel supercomputers; mimic the characteristics of large-scale computational fluid dynamics applications
  - Matrix Market (NIST): a visual repository of test data
  - SPARSITY (UCB/LLNL)
  - OSKI (Optimized Sparse Kernel Interface) (UCB)