IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.

IBM Research © 2008 Outline
- History: the data challenge
- Motivation for multicore
- Implications for programmers
- How Cell addresses these implications
- Examples
  – 2D/3D FFT: medical imaging, petroleum, general HPC…
  – Green's functions: seismic imaging (petroleum)
  – String matching: network processing (DPI and intrusion detection)
  – Neural networks: finance

IBM Research © 2008 Chapter 1: The Beast is Hungry!

IBM Research © 2008 The Hungry Beast
[Diagram: a data pipe feeding data ("food") to the processor ("beast")]
- Pipe too small = starved beast
- Pipe big enough = well-fed beast
- Pipe too big = wasted resources

IBM Research © 2008 The Hungry Beast
[Diagram: a data pipe feeding data ("food") to the processor ("beast")]
- Pipe too small = starved beast
- Pipe big enough = well-fed beast
- Pipe too big = wasted resources
- If flops grow faster than pipe capacity… the beast gets hungrier!

IBM Research © 2008 Move the food closer
- Example: Intel Tulsa
  – Xeon MP 7100 series
  – 65 nm, 349 mm², 2 cores
  – W
  – ~54.4 SP GFlops
  – /processor/xeon/index.htm
- Large cache on chip
  – ~50% of area
  – Keeps data close for efficient access
- If the data is local, the beast is happy!
  – True for many algorithms

IBM Research © 2008 What happens if the beast is still hungry?
[Diagram: data pipe with an on-chip data cache]
- If the data set doesn't fit in cache
  – Cache misses
  – Memory latency exposed
  – Performance degraded
- Several important application classes don't fit
  – Graph searching algorithms
  – Network security
  – Natural language processing
  – Bioinformatics
  – Many HPC workloads

IBM Research © 2008 Make the food bowl larger
[Diagram: data pipe with a larger on-chip data cache]
- Cache size steadily increasing
- Implications
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS

IBM Research © 2008 Make the food bowl larger
[Diagram: data pipe with a larger on-chip data cache]
- Cache size steadily increasing
- Implications
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS
- But…
  – Important application working sets are growing faster
  – Multicore is even more demanding on cache than uni-core

IBM Research © 2008 Chapter 2: The Beast Has Babies

IBM Research © 2008 Power Density – The fundamental problem

IBM Research © 2008 What’s causing the problem?
[Figure: power density (W/cm²) rising steeply as gate length (microns) shrinks toward 65 nm; inset of the gate stack]
- Gate dielectric approaching a fundamental limit (a few atomic layers)
- Power, signal jitter, etc.

IBM Research © 2008 Diminishing Returns on Frequency
[Figure: performance vs. power for frequency-driven design points]
- In a power-constrained environment, raising chip clock speed yields diminishing returns.
- The industry has moved to lower-frequency multicore architectures.

IBM Research © 2008 Power vs. Performance Trade-Offs
- We need to adapt our algorithms to get performance out of multicore

IBM Research © 2008 Implications of Multicore
- There are more mouths to feed
  – Data movement will take center stage
- Complexity of cores will stop increasing… and has started to decrease in some cases
- Complexity increases will center around communication
- Assumption
  – Achieving a significant % of peak performance is important

IBM Research © 2008 Chapter 3: The Proper Care and Feeding of Hungry Beasts

IBM Research © 2008 Cell/B.E. Processor: 200 GFLOPS, ~70 W

IBM Research © 2008 Feeding the Cell Processor
- 8 SPEs, each with
  – LS (local store)
  – MFC (memory flow controller)
  – SXU
- PPE (64-bit Power Architecture with VMX)
  – OS functions
  – Disk IO
  – Network IO
[Block diagram: the EIB (up to 96 B/cycle) connects the 8 SPEs (SPU/SXU, LS, MFC), the PPE (PPU/PXU, L1, L2), the MIC (dual XDR memory) and the BIC (FlexIO); element ports run at 16 B/cycle (one at 2x16 B/cycle), the PPU's L2 port at 32 B/cycle]

IBM Research © 2008 Cell Approach: Feed the beast more efficiently
- Explicitly "orchestrate" the data flow between main memory and each SPE's local store
  – Use the SPE's DMA engine to gather and scatter data between main memory and local store
  – Enables detailed programmer control of data flow
    Get/Put data when and where you want it
    Hides latency: simultaneous reads, writes and computes
  – Avoids restrictive HW cache management
    Unlikely to determine optimal data flow
    Potentially very inefficient
  – Allows more efficient use of the existing bandwidth

IBM Research © 2008 Cell Approach: Feed the beast more efficiently
- Explicitly "orchestrate" the data flow between main memory and each SPE's local store
  – Use the SPE's DMA engine to gather and scatter data between main memory and local store
  – Enables detailed programmer control of data flow
    Get/Put data when and where you want it
    Hides latency: simultaneous reads, writes and computes
  – Avoids restrictive HW cache management
    Unlikely to determine optimal data flow
    Potentially very inefficient
  – Allows more efficient use of the existing bandwidth
- BOTTOM LINE: It's all about the data!
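As a concrete illustration of the Get/Put-with-overlap idea, here is a minimal double-buffering sketch for one SPE using the Cell SDK's spu_mfcio.h DMA intrinsics. The block size, the tag assignment, and the process_block() kernel are assumptions for the sketch, not code from the talk.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 16384                    /* 16 KB per transfer (the MFC's max DMA size) */

    /* Two local-store buffers, one DMA tag group per buffer. */
    static volatile uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

    /* Hypothetical compute kernel: transforms one block in place in the local store. */
    extern void process_block(volatile uint8_t *block, int bytes);

    /* Stream 'nblocks' CHUNK-sized blocks from main-memory address 'ea_in' to
     * 'ea_out', overlapping DMA with computation by ping-ponging two buffers. */
    void stream_blocks(uint64_t ea_in, uint64_t ea_out, int nblocks)
    {
        int cur = 0;

        /* Prime the pipeline: fetch block 0 into buffer 0 under tag 0. */
        mfc_get(buf[0], ea_in, CHUNK, 0, 0, 0);

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;

            /* Prefetch block i+1 into the other buffer.  The barriered form
             * (mfc_getb) makes the MFC drain the previous put from that buffer
             * (same tag group) before overwriting it, without stalling the SPU. */
            if (i + 1 < nblocks)
                mfc_getb(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            /* Block only on the current buffer's tag group, then compute on it. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process_block(buf[cur], CHUNK);

            /* Write the result back under the same tag and move on. */
            mfc_put(buf[cur], ea_out + (uint64_t)i * CHUNK, CHUNK, cur, 0, 0);

            cur = nxt;
        }

        /* Drain any outstanding puts before returning. */
        mfc_write_tag_mask((1 << 0) | (1 << 1));
        mfc_read_tag_status_all();
    }

The point of the two tag groups is that the SPU only ever blocks on the buffer it is about to compute on; the other buffer's traffic stays in flight behind the computation.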

IBM Research © 2008 Cell Comparison: ~4x the FLOPS at ~½ the power
- Both 65 nm technology (die photos shown to scale)

IBM Research © 2008 Memory-Managing Processor vs. Traditional General-Purpose Processor
[Die photos comparing the Cell BE (IBM) with traditional AMD and Intel general-purpose processors]

IBM Research © 2008 Examples of Feeding Cell
- 2D and 3D FFTs
- Seismic imaging
- String matching
- Neural networks (function approximation)

IBM Research © 2008 Feeding FFTs to Cell
[Diagram: the input image is read in buffers of rows, processed tile by tile, and written back transposed]
- SIMDized data
- DMAs double buffered
- Pass 1: for each buffer
  – DMA Get buffer
  – Do four 1D FFTs in SIMD
  – Transpose tiles
  – DMA Put buffer
- Pass 2: for each buffer
  – DMA Get buffer
  – Do four 1D FFTs in SIMD
  – Transpose tiles
  – DMA Put buffer
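The two passes have identical structure. The sketch below, in plain C with a hypothetical fft_1d() kernel and the DMA and SIMD details reduced to comments, shows how FFT-rows-then-transpose turns a 2D FFT into two unit-stride passes over the data; it is an illustration of the scheme, not the talk's code.

    #include <complex.h>

    /* Hypothetical in-place 1D FFT of one length-n row (assumed, not from the talk). */
    extern void fft_1d(float complex *row, int n);

    /* One pass over an n x n image: 1D-FFT every row, store the result transposed.
     * On Cell each strip of rows would be DMA-"Get" into the local store, four FFTs
     * done in SIMD across the strip, the tiles transposed, and the strip DMA-"Put"
     * back; here the data movement is just array indexing. */
    static void fft_rows_then_transpose(const float complex *in, float complex *out, int n)
    {
        static float complex row[4096];                            /* assumes n <= 4096 */
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) row[c] = in[r * n + c];    /* "DMA Get"         */
            fft_1d(row, n);                                        /* 1D FFT            */
            for (int c = 0; c < n; c++) out[c * n + r] = row[c];   /* transposed "Put"  */
        }
    }

    /* Full 2D FFT = two identical passes: the transpose in pass 1 turns the
     * column FFTs of pass 2 into row FFTs, so both passes stream unit-stride data. */
    void fft_2d(float complex *image, float complex *scratch, int n)
    {
        fft_rows_then_transpose(image, scratch, n);   /* pass 1: rows    */
        fft_rows_then_transpose(scratch, image, n);   /* pass 2: columns */
    }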

IBM Research © D FFTs  Long stride trashes cache  Cell DMA allows prefetch Single ElementData envelope Stride 1 Stride N 2 N

IBM Research © 2008 Feeding Seismic Imaging to Cell
[Diagram: the data volume and the Green's function G centered at a point (X,Y)]
- New G at each (x,y)
- Radial symmetry of G reduces BW requirements

IBM Research © 2008 Feeding Seismic Imaging to Cell
[Diagram: the data volume partitioned into slabs, one per SPE (SPE 0 through SPE 7)]

IBM Research © 2008 Feeding Seismic Imaging to Cell
- For each X
  – Load next column of data
  – Load next column of indices
  – For each Y
    Load Green's functions
    SIMDize Green's functions
    Compute convolution at (X,Y)
  – Cycle buffers
[Diagram: a data buffer of H x (2R+1) columns and a Green's index buffer around the point (X,Y), where R is the Green's function radius]
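A schematic of that loop nest in plain C; the buffer shapes and the load_column / green_fn / convolve helpers are illustrative guesses at the structure the slide describes, not the talk's code.

    /* Schematic of the per-column seismic imaging loop.
     * H    : column height (samples per trace)
     * R    : Green's function radius, so a window of 2R+1 columns stays resident
     * data : ring buffer of 2R+1 columns of input data in local store
     * gidx : per-point indices selecting which Green's function to apply
     */
    #define H 512
    #define R 16
    #define W (2 * R + 1)

    extern void  load_column(float dst[H], int x);        /* stands in for a DMA Get   */
    extern void  load_index_column(int dst[H], int x);    /* stands in for a DMA Get   */
    extern const float *green_fn(int index);              /* fetch and SIMDize G       */
    extern float convolve(float window[W][H], int y, const float *g);
    extern void  store_result(int x, int y, float value); /* stands in for a DMA Put   */

    void image_volume(int nx)
    {
        static float data[W][H];     /* data buffer: 2R+1 resident columns  */
        static int   gidx[H];        /* Green's index buffer for column X   */

        for (int x = 0; x < nx; x++) {
            load_column(data[x % W], x);        /* load next column of data    */
            load_index_column(gidx, x);         /* load next column of indices */

            for (int y = 0; y < H; y++) {
                const float *g = green_fn(gidx[y]);        /* load + SIMDize G     */
                store_result(x, y, convolve(data, y, g));  /* convolution at (X,Y) */
                /* (in a full pipeline the output point lags the newest column by R) */
            }
            /* Cycle buffers: the ring index x % W overwrites the oldest column. */
        }
    }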

IBM Research © 2008 Feeding String Matching to Cell
- Find (lots of) substrings in (long) string
- Build graph of words & represent as DFA
- Problem: Graph doesn't fit in LS
[Diagram: DFA built from the sample word list "the", "that", "math"]

IBM Research © 2008 Feeding String Matching to Cell

IBM Research © 2008 Hiding Main Memory Latency

IBM Research © 2008 Software Multithreading
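On Cell, software multithreading here means keeping several independent DFA traversals in flight on each SPE: while one traversal waits for its next block of DFA states to arrive by DMA, the others compute, so main-memory latency is hidden behind useful work. A minimal round-robin sketch of that pattern, assuming a blocked DFA layout and hypothetical fetch/advance helpers (not the talk's code); each context is assumed to start with its first state block already resident.

    /* Software multithreading on one SPE: keep NCTX independent DFA traversals alive. */
    #define NCTX 8

    typedef struct {
        const char *input;      /* remaining input for this traversal            */
        int         state;      /* current DFA state                             */
        int         block;      /* which state block is (being) fetched          */
        int         pending;    /* 1 while the DMA for 'block' is outstanding    */
    } dfa_ctx_t;

    extern void start_fetch_block(int block, int tag);  /* async DMA Get of one state block */
    extern int  fetch_done(int tag);                    /* nonblocking tag-status check     */
    extern int  advance(dfa_ctx_t *c);                  /* consume input until the next     */
                                                        /* transition leaves the resident   */
                                                        /* block; returns 0 when finished   */

    void run_contexts(dfa_ctx_t ctx[NCTX])
    {
        int live = NCTX;
        while (live > 0) {
            for (int i = 0; i < NCTX; i++) {
                dfa_ctx_t *c = &ctx[i];
                if (c->input == 0)
                    continue;                       /* this traversal already finished */
                if (c->pending) {
                    if (!fetch_done(i))
                        continue;                   /* block not here yet: yield       */
                    c->pending = 0;
                }
                if (!advance(c)) {                  /* ran off the end of the input    */
                    c->input = 0;
                    live--;
                    continue;
                }
                start_fetch_block(c->block, i);     /* prefetch the next state block   */
                c->pending = 1;                     /* and yield to the next context   */
            }
        }
    }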

IBM Research © 2008 Feeding Neural Networks to Cell
- Neural net function F(X)
  – RBF, MLP, KNN, etc.
- If too big for LS, BW bound
[Diagram: input X of D dimensions feeds N basis functions (dot product + nonlinearity) through a DxN parameter matrix to produce the output F]
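As one concrete instance of that structure (an MLP-style form chosen for illustration; the talk does not specify the network used in the finance application):

    F(X) \;=\; \sum_{n=1}^{N} c_n \,\varphi\!\left(w_n \cdot X\right),
    \qquad X \in \mathbb{R}^{D},\quad w_n \in \mathbb{R}^{D},\quad n = 1,\dots,N

Evaluating F streams the DxN parameter matrix [w_1 … w_N] past X once; when that matrix does not fit in the local store, the stream from main memory rather than the arithmetic sets the run time, which is what "BW bound" means here.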

IBM Research © 2008 Convert BW Bound to Compute Bound
- Split function over multiple SPEs
- Avoids unnecessary memory traffic
- Reduce compute time per SPE
- Minimal merge overhead
[Diagram: each SPE evaluates a slice of the network; the partial results are merged]
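A plain-C sketch of the split-and-merge, with the per-SPE slices evaluated serially here for clarity; the sizes, the tanh nonlinearity, and eval_slice() itself are assumptions for the sketch, not the talk's code.

    #include <math.h>

    /* Split an N-basis-function network across NSPE workers: worker s keeps only
     * columns [s*N/NSPE, (s+1)*N/NSPE) of the DxN parameter matrix resident, so
     * no worker streams the whole matrix, and each partial result is one float. */
    #define D    64      /* input dimensions             */
    #define N    4096    /* basis functions              */
    #define NSPE 8       /* workers sharing the evaluation */

    /* Partial evaluation of one slice: sum_{n=first}^{last-1} c[n] * phi(w_n . x) */
    static float eval_slice(const float *x, const float (*w)[D], const float *c,
                            int first, int last)
    {
        float acc = 0.0f;
        for (int n = first; n < last; n++) {
            float dot = 0.0f;
            for (int d = 0; d < D; d++)           /* dot product              */
                dot += w[n][d] * x[d];
            acc += c[n] * tanhf(dot);             /* example nonlinearity     */
        }
        return acc;
    }

    /* On Cell, each of the NSPE workers would run eval_slice() on its resident
     * slice in parallel; the partial sums are then merged with a trivial add. */
    float eval_network(const float *x, const float (*w)[D], const float *c)
    {
        float partial[NSPE];
        for (int s = 0; s < NSPE; s++)            /* one iteration per SPE    */
            partial[s] = eval_slice(x, w, c, s * (N / NSPE), (s + 1) * (N / NSPE));

        float F = 0.0f;                           /* minimal merge: a sum     */
        for (int s = 0; s < NSPE; s++)
            F += partial[s];
        return F;
    }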

IBM Research © 2008 Moral of the Story: It's All About the Data!
- The data problem is growing: multicore
- Intelligent software prefetching
  – Use DMA engines
  – Don't rely on HW prefetching
- Efficient data management
  – Multibuffering: hide the latency!
  – BW utilization: make every byte count!
  – SIMDization: make every vector count!
  – Problem/data partitioning: make every core work!
  – Software multithreading: keep every core busy!

IBM Research © 2008 Backup

IBM Research © 2008 Abstract
Technological obstacles have prevented the microprocessor industry from continuing to increase performance through higher chip clock speeds. In reaction to these constraints, the industry has taken the multicore path. Multicore processors promise tremendous GFLOPS of performance but raise the challenge of how to program them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the design of the Cell/B.E. processor addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.