IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.


1 IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.

2 IBM Research © 2008 mpp@us.ibm.com Outline
• History: Data challenge
• Motivation for multicore
• Implications for programmers
• How Cell addresses these implications
• Examples
  – 2D/3D FFT: Medical Imaging, Petroleum, general HPC…
  – Green’s Functions: Seismic Imaging (Petroleum)
  – String Matching: Network Processing (DPI & Intrusion Detection)
  – Neural Networks: Finance

3 Chapter 1: The Beast is Hungry!

4 The Hungry Beast
[Figure labels: Processor (“beast”), Data (“food”), Data Pipe]
• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources

5 The Hungry Beast
[Figure labels: Processor (“beast”), Data (“food”), Data Pipe]
• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources
• If flops grow faster than pipe capacity… the beast gets hungrier!
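
The starved-beast argument can be put as a back-of-the-envelope bound: sustained throughput is capped either by peak compute or by pipe bandwidth multiplied by the flops performed per byte moved. The sketch below is only an illustration of that arithmetic; the function names and the machine numbers in the usage lines are assumptions, not figures from the talk.

```python
def attainable_gflops(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """Upper bound on sustained GFLOPS for one kernel.

    The processor ("beast") cannot exceed its peak, and it cannot consume
    operands ("food") faster than the data pipe delivers them.
    """
    return min(peak_gflops, pipe_gb_per_s * flops_per_byte)


def is_starved(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """True when the pipe, not the processor, limits performance."""
    return pipe_gb_per_s * flops_per_byte < peak_gflops


# Hypothetical machine: 100 GFLOPS peak fed by a 25 GB/s pipe.
print(attainable_gflops(100.0, 25.0, 0.5))  # low reuse: pipe-bound
print(attainable_gflops(100.0, 25.0, 8.0))  # high reuse: compute-bound
```

Doubling the peak while leaving the pipe unchanged does nothing for the first kernel, which is the sense in which the beast "gets hungrier".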

6 Move the food closer
• Example: Intel Tulsa
  – Xeon MP 7100 series
  – 65nm, 349 mm², 2 Cores
  – 3.4 GHz @ 150W
  – ~54.4 SP GFlops
  – http://www.intel.com/products/processor/xeon/index.htm
• Large cache on chip
  – ~50% of area
  – Keeps data close for efficient access
• If the data is local, the beast is happy!
  – True for many algorithms

7 What happens if the beast is still hungry?
[Figure labels: Data, Cache]
• If the data set doesn’t fit in cache
  – Cache misses
  – Memory latency exposed
  – Performance degraded
• Several important application classes don’t fit
  – Graph searching algorithms
  – Network security
  – Natural language processing
  – Bioinformatics
  – Many HPC workloads

8 Make the food bowl larger
[Figure labels: Data, Cache]
• Cache size steadily increasing
• Implications
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS

9 Make the food bowl larger
[Figure labels: Data, Cache]
• Cache size steadily increasing
• Implications
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS
• But…
  – Important application working sets are growing faster
  – Multicore even more demanding on cache than uni-core

10 Chapter 2: The Beast Has Babies

11 Power Density – The fundamental problem

12 What’s causing the problem?
• Gate dielectric approaching a fundamental limit (a few atomic layers)
• Power, signal jitter, etc...
[Figure labels: Gate Stack; 65 nm. Plot: Power Density (W/cm²) vs. Gate Length (microns)]

13 Diminishing Returns on Frequency
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower frequency multicore architectures.
[Figure label: Frequency-Driven Design Points]

14 Power vs Performance Trade Offs
[Figure: power vs. performance comparison chart]
We need to adapt our algorithms to get performance out of multicore

15 Implications of Multicore
• There are more mouths to feed
  – Data movement will take center stage
• Complexity of cores will stop increasing
  … and has started to decrease in some cases
• Complexity increases will center around communication
• Assumption
  – Achieving a significant % of peak performance is important

16 Chapter 3: The Proper Care and Feeding of Hungry Beasts

17 Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W

18 Feeding the Cell Processor
• 8 SPEs each with
  – LS
  – MFC
  – SXU
• PPE
  – OS functions
  – Disk IO
  – Network IO
[Figure: Cell block diagram. PPE (64-bit Power Architecture with VMX; PPU, PXU, L1, L2) and eight SPEs (SPU, SXU, LS, MFC) connected by the EIB (up to 96B/cycle), with the MIC to Dual XDR™ memory and the BIC to FlexIO™; links labeled 16B/cycle and 32B/cycle]

19 Cell Approach: Feed the beast more efficiently
• Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
  – Use SPE’s DMA engine to gather & scatter data between main memory and local store
  – Enables detailed programmer control of data flow
      Get/Put data when & where you want it
      Hides latency: Simultaneous reads, writes & computes
  – Avoids restrictive HW cache management
      Unlikely to determine optimal data flow
      Potentially very inefficient
  – Allows more efficient use of the existing bandwidth

20 Cell Approach: Feed the beast more efficiently
• Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
  – Use SPE’s DMA engine to gather & scatter data between main memory and local store
  – Enables detailed programmer control of data flow
      Get/Put data when & where you want it
      Hides latency: Simultaneous reads, writes & computes
  – Avoids restrictive HW cache management
      Unlikely to determine optimal data flow
      Potentially very inefficient
  – Allows more efficient use of the existing bandwidth
• BOTTOM LINE: It’s all about the data!
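
The "simultaneous reads, writes & computes" idea is usually realized as double buffering: while the SPE computes on one local-store buffer, its DMA engine is already filling the other. The sketch below only imitates that schedule on a host machine. A single-worker thread pool stands in for the DMA engine, and `dma_get` and `process_double_buffered` are hypothetical names rather than Cell SDK calls. The matching DMA put of results is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(main_memory, start, size):
    """Stand-in for an asynchronous DMA get: copy one chunk into a local buffer."""
    return list(main_memory[start:start + size])


def process_double_buffered(main_memory, chunk, compute):
    """Apply compute() chunk by chunk, prefetching the next chunk meanwhile."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        # Prime the pipeline: request the first buffer before any compute.
        pending = dma_engine.submit(dma_get, main_memory, 0, chunk)
        for start in range(0, len(main_memory), chunk):
            current = pending.result()  # wait only for the buffer needed now
            nxt = start + chunk
            if nxt < len(main_memory):
                # Start fetching the next buffer; it proceeds during compute.
                pending = dma_engine.submit(dma_get, main_memory, nxt, chunk)
            results.extend(compute(current))
    return results


doubled = process_double_buffered(list(range(10)), 4, lambda buf: [2 * v for v in buf])
print(doubled)
```

The point of the ordering is that the wait for a buffer happens only after the next transfer has been issued, so transfer latency is hidden whenever compute time per chunk is at least as long as the transfer.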

21 Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology (to scale)

22 Memory Managing Processor vs. Traditional General Purpose Processor
[Figure labels: IBM, AMD, Intel, Cell BE]

23 Examples of Feeding Cell
• 2D and 3D FFTs
• Seismic Imaging
• String Matching
• Neural Networks (function approximation)

24 Feeding FFTs to Cell
[Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
• SIMDized data
• DMAs double buffered
• Pass 1: For each buffer
  – DMA Get buffer
  – Do four 1D FFTs in SIMD
  – Transpose tiles
  – DMA Put buffer
• Pass 2: For each buffer
  – DMA Get buffer
  – Do four 1D FFTs in SIMD
  – Transpose tiles
  – DMA Put buffer
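
The two identical passes work because a 2D FFT is separable: row FFTs followed by a transpose, applied twice, give the full transform in its original orientation. The sketch below shows that structure only. It is plain Python with no SIMD or DMA, a whole-matrix transpose stands in for the per-tile transposes, and the helper names and the power-of-two size restriction are assumptions of this example.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def fft_pass(image, rows_per_buffer=4):
    """One pass: row FFTs buffer by buffer, then write the result transposed."""
    out = []
    for r in range(0, len(image), rows_per_buffer):
        buf = image[r:r + rows_per_buffer]     # corresponds to "DMA Get buffer"
        out.extend(fft1d(row) for row in buf)  # up to four 1D FFTs per buffer
    return transpose(out)                      # stands in for tile transposes + "DMA Put"


def fft2d(image):
    """Two identical passes yield the 2D FFT in the original orientation."""
    return fft_pass(fft_pass(image))


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
spectrum = fft2d(image)
print(abs(spectrum[0][0]))  # DC term equals the sum of all samples
```

Writing each pass's output transposed means both passes read contiguous rows, so neither has to walk the image with a long stride.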

25 3D FFTs
• Long stride thrashes cache
• Cell DMA allows prefetch
[Figure labels: Single Element, Data envelope, Stride 1, Stride N², N]

26 Feeding Seismic Imaging to Cell
• New G at each (x,y)
• Radial symmetry of G reduces BW requirements
[Figure labels: Data, Green’s Function, (X,Y)]

27 Feeding Seismic Imaging to Cell
[Figure labels: Data; SPE 0, SPE 1, SPE 2, SPE 3, SPE 4, SPE 5, SPE 6, SPE 7]

28 Feeding Seismic Imaging to Cell
[Figure labels: Data; SPE 0, SPE 1, SPE 2, SPE 3, SPE 4, SPE 5, SPE 6, SPE 7]

29 Feeding Seismic Imaging to Cell
• For each X
  – Load next column of data
  – Load next column of indices
  – For each Y
      Load Green’s functions
      SIMDize Green’s functions
      Compute convolution at (X,Y)
  – Cycle buffers
[Figure labels: Data buffer, Green’s Index buffer, (X,Y); dimensions H, 2R+1, R]
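
The loop above can be sketched as a convolution in which each output point selects its own Green's function through an index buffer. The slides do not say how the radial symmetry is exploited, so this example assumes one plausible scheme: each G is stored as a short 1D table indexed by radius rounded to the nearest integer, instead of a full (2R+1) x (2R+1) stencil. All names are hypothetical, and the buffering and SIMD steps are omitted.

```python
import math


def table_length(R):
    """Number of radius bins needed for a window of half-width R."""
    return int(round(math.hypot(R, R))) + 1


def convolve_radial(data, index, green_tables, R):
    """out[y][x] = sum of G_k(r) * data[y+dy][x+dx] over |dx|, |dy| <= R,
    where k = index[y][x] picks the Green's function for that point."""
    H, W = len(data), len(data[0])
    # Radius bin of every window offset, computed once and shared by all points.
    radius = [[int(round(math.hypot(dx, dy))) for dx in range(-R, R + 1)]
              for dy in range(-R, R + 1)]
    out = [[0.0] * W for _ in range(H)]
    for x in range(W):                      # "For each X"
        for y in range(H):                  # "For each Y"
            g = green_tables[index[y][x]]   # "Load Green's functions"
            acc = 0.0
            for dy in range(-R, R + 1):
                yy = y + dy
                if not 0 <= yy < H:
                    continue                # treat samples outside the grid as zero
                for dx in range(-R, R + 1):
                    xx = x + dx
                    if 0 <= xx < W:
                        acc += g[radius[dy + R][dx + R]] * data[yy][xx]
            out[y][x] = acc
    return out


data = [[1.0] * 3 for _ in range(3)]
index = [[0] * 3 for _ in range(3)]
print(convolve_radial(data, index, [[1.0, 1.0]], R=1))
```

Under this assumption a table holds roughly 1.4R + 1 values rather than (2R+1)² stencil entries, which is one way the symmetry could cut the bandwidth needed per output point.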

30 Feeding String Matching to Cell
• Find (lots of) substrings in (long) string
• Build graph of words & represent as DFA
• Problem: Graph doesn’t fit in LS
Sample Word List: “the”, “that”, “math”
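
One common way to build a single graph from a word list and scan a long string once is the Aho-Corasick automaton. The slides do not name the algorithm used, so treat this as an assumed stand-in. The sketch keeps failure links rather than expanding them into a full DFA transition table, and it does not address the local-store capacity problem raised above.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links (Aho-Corasick) for the given words."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    # Breadth-first pass: each state's failure link points at the longest
    # proper suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, word) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits


automaton = build_automaton(["the", "that", "math"])
print(find_all("mathe that", automaton))
```

Each input character triggers a data-dependent lookup into the graph, which is why the access pattern is hard on caches and why the graph's size relative to the local store matters.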

31 Feeding String Matching to Cell

32 Hiding Main Memory Latency

33 Software Multithreading

34 Feeding Neural Networks to Cell
• Neural net function F(X)
  – RBF, MLP, KNN, etc.
• If too big for LS, BW Bound
[Figure labels: X (D input dimensions); DxN matrix of parameters; N basis functions (dot product + nonlinearity); Output F]

35 Convert BW Bound to Compute Bound
• Split function over multiple SPEs
• Avoids unnecessary memory traffic
• Reduce compute time per SPE
• Minimal merge overhead
[Figure label: Merge]
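
Splitting the function means giving each SPE a slice of the basis functions small enough to stay resident in its local store, so parameters are not re-streamed from main memory for every input. The sketch below assumes the output is a plain sum of basis-function outputs with tanh as the nonlinearity; the slides do not specify either choice. The function names are hypothetical, and the per-SPE work runs sequentially here.

```python
import math


def basis_sum(params, x):
    """Sum of tanh(dot(row, x)) over the basis functions in params.

    params holds one row of D parameters per basis function, i.e. the
    DxN parameter matrix stored transposed.
    """
    return sum(math.tanh(sum(p * v for p, v in zip(row, x))) for row in params)


def split(params, workers):
    """Deal the N basis functions out across the workers (SPEs)."""
    return [params[w::workers] for w in range(workers)]


def evaluate_split(params, x, workers=8):
    # Each partial sum would run on its own SPE against a resident slice.
    partials = [basis_sum(part, x) for part in split(params, workers)]
    # Merge step: a single addition per SPE.
    return sum(partials)


params = [[0.1 * (n + d) for d in range(3)] for n in range(20)]
x = [0.5, -1.0, 2.0]
print(evaluate_split(params, x), basis_sum(params, x))
```

The merge costs one value per SPE regardless of N and D, which matches the "minimal merge overhead" point.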

36 Moral of the Story: It’s All About the Data!
• The data problem is growing: multicore
• Intelligent software prefetching
  – Use DMA engines
  – Don’t rely on HW prefetching
• Efficient data management
  – Multibuffering: Hide the latency!
  – BW utilization: Make every byte count!
  – SIMDization: Make every vector count!
  – Problem/data partitioning: Make every core work!
  – Software multithreading: Keep every core busy!

37 Backup

38 Abstract
Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore processor path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers and how the Cell/B.E. processor’s design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.

