The Potential of the Cell Processor for Scientific Computing
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick (Lawrence Berkeley National Laboratory). ACM International Conference on Computing Frontiers, May 2-6, 2006, Italy. Presentation by Aarul Jain.

 Introduce a performance model of Cell.
 Implement key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs.
 Verify the performance model's predictions against published results and against implementations run on IBM's Cell full-system simulator.
 Compare Cell performance to the same benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures.
 Propose micro-architectural modifications that could significantly improve the efficiency of double-precision calculations.

 Details and results from the paper.
◦ Programming model used.
◦ Performance model used for simulation.
◦ "Cell+" architecture for DP performance improvement.
◦ Dense matrix-matrix multiply.
◦ Sparse matrix-vector multiply.
◦ Stencil computations.
◦ Fast Fourier transforms.
 Comments/Critiques
 Project
 Q/A

 Three programming models were considered:
◦ Task parallelism.
◦ Pipelined parallelism.
◦ Data parallelism.
 The data-parallel programming model was used.
 The kernels rely heavily on SIMD intrinsics rather than plain C.
 Double buffering is used to overlap data movement with computation on the SPEs.
 The first kernel took one month to implement, in roughly 600 lines of code.
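To illustrate the double-buffering pattern described above, here is a minimal sketch in C for a single SPE. The helpers dma_get, dma_wait, and compute_tile are hypothetical stand-ins for the real SPE intrinsics (mfc_get, mfc_write_tag_mask/mfc_read_tag_status_all, etc.), so this is a sketch of the pattern, not the paper's code:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real SPE DMA intrinsics: */
extern void dma_get(void *local, const char *global, int bytes, int tag);
extern void dma_wait(int tag);
extern void compute_tile(char *tile);

#define TILE_BYTES 16384
static char buf[2][TILE_BYTES];   /* two local-store buffers */

void process_stream(const char *global_src, int ntiles) {
    int cur = 0;
    dma_get(buf[cur], global_src, TILE_BYTES, /*tag=*/cur);  /* prefetch tile 0 */
    for (int i = 0; i < ntiles; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < ntiles)   /* start fetching tile i+1 while tile i computes */
            dma_get(buf[nxt], global_src + (size_t)(i + 1) * TILE_BYTES,
                    TILE_BYTES, /*tag=*/nxt);
        dma_wait(/*tag=*/cur);   /* ensure tile i has arrived */
        compute_tile(buf[cur]);  /* computation overlaps the in-flight DMA */
        cur = nxt;
    }
}
```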

 The model exploits the deterministic behavior of the software-controlled memory.
 It also relies on the in-order execution and fixed load-store latency of the SPEs.
 Step 1: segment the code into snippets that operate on data already present in the SPE local store, and perform static timing analysis on each snippet's assembly.
 Step 2: build a model that tabulates the time required for the DMA loads and stores of the operands each snippet needs.
 Total time is the sum over the outer-loop iterations, where each iteration's time is the maximum of the snippet's compute time and its DMA transfer time (since double buffering overlaps the two).
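A minimal sketch of how such a model composes, assuming per-snippet compute times and DMA times have already been obtained (the names are illustrative, not from the paper):

```c
/* Performance-model sketch: with double buffering, each outer-loop
 * iteration is bounded by whichever is slower, computation or DMA. */
double predicted_time(const double *compute_cycles,  /* static timing per snippet */
                      const double *dma_cycles,      /* modeled DMA time per snippet */
                      int iterations) {
    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        double c = compute_cycles[i];
        double d = dma_cycles[i];
        total += (c > d) ? c : d;   /* max(): the faster activity is fully hidden */
    }
    return total;
}
```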

 Double-precision operations are implemented using a 9-cycle pipelined FMA datapath with 4 cycles of overhead for data movement.
 The SPE stalls for 6 cycles after issuing a DP instruction.
 Much of the detail about the Cell+ architecture is not discussed in the paper (proprietary?).
 The authors propose a design with a longer forwarding network that eliminates all but one of the stalls.
 More details on the SPE pipeline may be found in:
◦ B. Flachs et al., "A Streaming Processing Unit for a Cell Processor," ISSCC Dig. Tech. Papers, Feb. 2005.
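For context, a back-of-the-envelope check of what these issue rates imply, assuming the commonly cited 3.2 GHz clock and a 2-wide double-precision SIMD FMA (4 flops per issue); this arithmetic is mine, not quoted from the slides. With the 6-cycle stall, a DP instruction issues every 7 cycles; with all but one stall removed, one issues every 2 cycles:

$$\text{Cell DP peak} \approx 8~\text{SPEs} \times \tfrac{4~\text{flops}}{7~\text{cycles}} \times 3.2~\text{GHz} \approx 14.6~\text{GFlop/s}$$

$$\text{Cell+ DP peak} \approx 8~\text{SPEs} \times \tfrac{4~\text{flops}}{2~\text{cycles}} \times 3.2~\text{GHz} = 51.2~\text{GFlop/s}$$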

 General Matrix Multiply and Add (GEMM)
◦ Column-major layout
◦ Block data layout
 Each matrix is broken into 8n x n element tiles designed to fit in the memory available on the Cell chip.
 These are further divided into n x n element tiles that fit in the eight SPE local stores.
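To make the tiling concrete, here is a minimal blocked matrix multiply in plain C (no SIMD, no DMA); the tile edge B is chosen here so three B x B double tiles fit in a 256 KB local store. This is a generic sketch of the blocking idea, not the paper's kernel:

```c
/* Blocked C += A*B: operate on BxB tiles so each tile pair's working
 * set fits in an SPE-sized local store. N must be a multiple of B
 * in this simplified sketch. */
#define N 512
#define B 64   /* 3 tiles of 64x64 doubles = 96 KB < 256 KB local store */

void gemm_blocked(const double A[N][N], const double Bm[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* multiply one tile pair; on Cell this inner kernel would be
                 * SIMDized and its operands staged into local store via DMA */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += a * Bm[k][j];
                    }
}
```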

 Storage formats:
◦ Compressed Sparse Row (CSR)
◦ Blocked Compressed Sparse Row (BCSR)
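For reference, the standard CSR sparse matrix-vector multiply kernel that these formats underpin looks like this (a generic textbook sketch, not the paper's SPE code):

```c
/* y = A*x with A in Compressed Sparse Row form:
 * rowptr[i]..rowptr[i+1]-1 indexes the nonzeros of row i,
 * col[] holds their column indices, val[] their values. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect access to x: the hard part */
        y[i] = sum;
    }
}
```

BCSR stores small dense blocks of nonzeros instead of single entries, which amortizes the index overhead and enables SIMDization of the inner loop.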

 Two kernels are used, derived from the Chombo and Cactus toolkits.
 Both apply a 7-point stencil in 3D at each grid point.
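A 7-point 3D stencil updates each point from itself and its six face neighbors; a generic sketch follows (illustrative coefficients, not Chombo's or Cactus's actual operators):

```c
/* One sweep of a 7-point 3D stencil (a Laplacian-like operator).
 * 'in' and 'out' are NxNxN grids, heap-allocated by the caller;
 * boundary points are left untouched. */
#define N 128

void stencil7(const double in[N][N][N], double out[N][N][N],
              double alpha, double beta) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                out[i][j][k] = alpha * in[i][j][k]
                             + beta * (in[i-1][j][k] + in[i+1][j][k]
                                     + in[i][j-1][k] + in[i][j+1][k]
                                     + in[i][j][k-1] + in[i][j][k+1]);
}
```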

 Computational intensity is lower than for matrix multiply.
 Both 1D and 2D versions were analyzed.
 Look-up tables were used.
 No double buffering.
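To illustrate why a look-up table helps, here is a generic radix-2 FFT in C with the roots of unity precomputed into a table; this is a textbook Cooley-Tukey formulation, not the paper's Cell implementation:

```c
#include <complex.h>
#include <math.h>   /* M_PI is POSIX; define it yourself on strict C compilers */

/* In-place radix-2 Cooley-Tukey FFT, n a power of two.
 * tw[k] = exp(-2*pi*I*k/n) is precomputed once, trading a table
 * look-up for repeated sin/cos evaluation inside the butterflies. */
void fft(double complex *x, const double complex *tw, int n) {
    /* bit-reversal permutation */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    /* butterfly stages */
    for (int len = 2; len <= n; len <<= 1) {
        int step = n / len;                 /* stride into the twiddle table */
        for (int i = 0; i < n; i += len)
            for (int k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * tw[k * step];
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
            }
    }
}

/* Table setup: tw[k] = exp(-2*pi*I*k/n) for k in [0, n/2). */
void make_twiddles(double complex *tw, int n) {
    for (int k = 0; k < n / 2; k++)
        tw[k] = cexp(-2.0 * M_PI * I * k / n);
}
```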

 This is the broadest quantitative study of Cell's performance to date.
 Cell's three-level, software-controlled memory architecture provides several advantages over mainstream cache-based architectures.
 Disadvantage: lack of unaligned load support.
 The authors propose the Cell+ architecture for improving DP performance.

 Cell's architecture is unique; will future architectures be based on Cell?
 The authors have done considerable work in analyzing Cell's performance.
 Critique 1
 Critique 2

 Title: FAST FOURIER TRANSFORM IMPLEMENTATION ON CELL BROADBAND ENGINE ARCHITECTURE
 Main objectives:
◦ Explore the Cell architecture and identify its limitations and advantages.
◦ Get familiar with the Cell programming environment.
