
1 9/18/2018 Accelerating IMA: A Processor Performance Comparison of the Internal Multiple Attenuation Algorithm
Michael Perrone, Mgr, Cell Solution Dept., IBM Research
Work with Gordon Fossum, Jizhu Lu & Billy Robinson

2 Outline
Motivation
Review of IMA
Parallelization of IMA
Cell Processor Implementation
Experimental results

3 Estimating 3D IMA Run Time (Kaplan et al., 2006)
Assume 100 shots, 100 receivers (for both surface dimensions)
1000 pseudo-depth points, 1000 output frequencies
Then IMA operations: ~10^23
BlueGene/L: 50x10^12 FLOPS
IMA Runtime: ~300 years!
But…
Reduced parameter set: IMA Runtime: ~ years
LANL Roadrunner Project: sustained PetaFLOP with today's Cell; IMA Runtime: ~ years
New Cell processor 2010: IMA Runtime: ~ years
Feasible by 2010!

4 Power Density – The fundamental problem
This is chip power, characterized in a design-independent manner.
Active power is due to transistor switching; this is "useful" power in the sense that computation is occurring.
Passive power, or leakage power, is "wasted" in that it is consumed in the absence of computation.
Messages:
Passive power was not a problem a few years ago (note the 1994 value), but now it is as much of a concern as active power
Passive power is escalating faster than active power
Passive power values are nominal; 3-sigma values are slightly higher
Current cooling limit is approximately  W/cm^2
Gate leakage is dictated by the gate oxide material
Sub-threshold leakage is determined by threshold voltage and channel length

5 What’s causing the problem?
[Chart: power density (W/cm^2) vs. gate length (microns) on log-log axes, spanning the 65 nm node; gate dielectric approaching a fundamental limit (a few atomic layers).]

6 Diminishing Returns on Frequency
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
[Chart annotation: frequency-driven design points]

7 Power vs. Performance Trade-offs
We need to adapt our algorithms to get performance out of multicore.
[Chart data points: 0.85, 1, 1.3, 1.45, 1.7]

8 Outline
Motivation
Review of IMA
Parallelization of IMA
Cell Processor Implementation
Experimental results

9 Internal Multiples
[Diagram: air gun, surface, receiver, buried reflectors]

10 Internal Multiples
[Diagram: air gun, surface, receiver, buried reflectors]

11 Internal Multiples
[Diagram: air gun, surface, receiver, buried reflectors; internal multiples marked]

12 Internal Multiples
[Diagram: air gun, surface, receiver, buried reflectors; internal multiples marked]

13 M-OSRP Internal Multiple Attenuation Algorithm
Internal multiples: seismic events that have experienced at least one downward reflection from a buried reflector in the Earth
Problem: internal multiples confuse the signal we're interested in
Solution:
Estimate internal multiples with the IMA algorithm
Subtract internal multiples from the other seismic signals
BUT… IMA can be extremely costly [Kaplan et al., 2005]:
2D: order 4 in the # of frequencies
3D: order 8 in the # of frequencies

14 IMA Computation (following Kaplan et al., 2005)
[Equation slide; the symbols are defined on the next slide]
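The equation itself was an image and is not in the transcript. As a hedged reconstruction, a commonly published form of the 2D inverse-scattering internal-multiple attenuation term (after Araújo and Weglein; the slide's exact notation and prefactors may differ) is:

```latex
b_3(k_g,k_s,\omega) = \frac{1}{(2\pi)^2}\iint dk_1\,dk_2\;
  e^{-iq_1(\varepsilon_g-\varepsilon_s)}\,e^{\,iq_2(\varepsilon_g-\varepsilon_s)}
  \int_{-\infty}^{\infty} dz_1\, b_1(k_g,k_1,z_1)\,e^{\,i(q_g+q_1)z_1}
  \int_{-\infty}^{z_1-\epsilon} dz_2\, b_1(k_1,k_2,z_2)\,e^{-i(q_1+q_2)z_2}
  \int_{z_2+\epsilon}^{\infty} dz_3\, b_1(k_2,k_s,z_3)\,e^{\,i(q_2+q_s)z_3},
\qquad
q_i = \operatorname{sgn}(\omega)\sqrt{\frac{\omega^2}{c_0^2}-k_i^2}.
```

This is consistent with how the deck later uses the computation: rho(k1,k2,kg,ks,w) is the integrand contribution of one (k1,k2) pair, which the collection nodes accumulate into b3(kg,ks,w).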

15 IMA Definitions
The q's are vertical wave numbers.
kg and ks are the Fourier transform variables over geophone and source locations, respectively.
w is the angular temporal frequency.
b1 is an uncollapsed migration which has been transformed to pseudo-depth by using a constant water velocity.
z is the pseudo-depth.
Epsilon is a small positive constant.

16 Outline
Motivation
Review of IMA
Parallelization of IMA
Cell Processor Implementation
Vectorization of the code
Experimental results

17 Parallelization Strategy
Data Nodes <-MPI-> Compute Nodes <-MPI-> Collection Nodes
Data nodes: sufficient to hold the data (1-8 nodes)
Compute nodes: as many as possible (1-1024)
Collection nodes: as many as needed to handle compute output (1-8)
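The three node classes can be sketched as a simple rank-to-role assignment (illustrative Python, not the original MPI code; the function name and default counts are hypothetical, chosen from the slide's ranges):

```python
def assign_roles(n_ranks, n_data=2, n_collect=2):
    """Split MPI-style ranks into the deck's three node classes:
    a few data nodes, a few collection nodes, and everything else computes."""
    if n_ranks < n_data + n_collect + 1:
        raise ValueError("need at least one compute node")
    data = list(range(n_data))
    collect = list(range(n_data, n_data + n_collect))
    compute = list(range(n_data + n_collect, n_ranks))
    return {"data": data, "collect": collect, "compute": compute}

roles = assign_roles(16)
```

In a real run each rank would look itself up in this map and enter the data-serving, compute, or collection loop of slide 18 accordingly.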

18 Parallelization Strategy
Data nodes: read b1(); receive requests; send b1(k1,*) & b1(k2,*); done
Compute nodes: request b1(k1,*) & b1(k2,*); receive b1(k1,*) & b1(k2,*); compute rho(k1,k2,kg,ks,w) for all w; send rho(k1,k2,kg,ks,w) for all w; done
Collection nodes: receive rho(k1,k2,kg,ks,w) for all w; add rho(k1,k2,kg,ks,w) to b3(kg,ks,w) for all w; save b3(kg,ks,w); done

19 Computing Node Control Flow
Original control flow:
For each system, evenly partition the (k1,k2,kg,ks,w) space over the nodes
For each k1 & k2:
  Get b1(k1,*) & b1(k2,*)
  For each kg, ks & w:
    Calculate rho
  Send rho to collection node
Tested on: Woodcrest blade, Opteron blade, BlueGene/L nodes
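The "evenly partition" step can be sketched as follows (illustrative Python, not the original code; only the outer (k1,k2) pairs are shown here, while the original partitioned the full 5-D index space):

```python
def partition_pairs(nk, n_nodes):
    """Evenly distribute the (k1, k2) work pairs across compute nodes.
    Round-robin assignment keeps per-node counts within one of each other."""
    pairs = [(k1, k2) for k1 in range(nk) for k2 in range(nk)]
    return [pairs[r::n_nodes] for r in range(n_nodes)]
```

Each compute node then runs the inner kg/ks/w loops over its own slice, so no inter-node coordination is needed until rho is sent to a collection node.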

20 Outline
Motivation
Review of IMA
Parallelization of IMA
Cell Processor Implementation
Vectorization of the code
Experimental results

21 Introducing Cell BE v1.0
First-generation Cell BE: 90 nm
241M transistors, 235 mm^2
9 cores, 10 threads
>200 GFlops (SP), >20 GFlops (DP)
25 GB/s memory B/W
75 GB/s I/O B/W
>300 GB/s EIB
Top frequency >4 GHz (observed in lab)

22 Heterogeneous Multi-core Architecture

23 1 PPE core:
- VMX unit
- L1, L2 cache
- 2-way SMT

24 8 SPEs:
- 128-bit SIMD instruction set
- Register file: 128 x 128-bit
- Local store: 256 KB
- ECC enabled
- Dedicated asynchronous DMA engine

25 Element Interconnect Bus (EIB) - 96B / cycle bandwidth

26 Cell Processor Review
Heterogeneous multi-core system architecture:
Power Processor Element (PPE) for control tasks
Synergistic Processor Elements (SPEs) for data-intensive processing
Synergistic Processor Element (SPE) consists of:
Synergistic Processor Unit (SPU)
Synergistic Memory Flow Control (MFC): data movement and synchronization; interface to the high-performance Element Interconnect Bus
[Diagram: 8 SPEs (each SXU + local store + MFC, 16B/cycle) and the PPE (PXU, L1, L2; 64-bit Power Architecture with VMX) on the EIB (up to 96B/cycle), with MIC to dual XDR and BIC to FlexIO]

27 Computing Node Control Flow
Original control flow:
For each k1 & k2:
  Get b1(k1,*) & b1(k2,*)
  For each kg, ks & w:
    Calculate rho
  Send rho to collection node
Control flow on Cell:
PPE:
  Create SPE threads (DMA task info)
  For each k1 & k2:
    Get b1(k1,*) & b1(k2,*)
    Partition ks amongst SPEs
    Synchronize SPEs
    Send rho to collection node
SPE:
  Get parameters from PPE
  DMA b1 vectors
  For each kg & ks:
    For each w:
      Calculate rho
  DMA rho to memory
  Synchronize with PPE
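The PPE-side "partition ks amongst SPEs" step amounts to splitting the ks index range into eight near-equal contiguous chunks, one per SPE. A sketch (hypothetical helper in Python, not the original Cell code):

```python
N_SPES = 8

def spe_ranges(nks, n_spes=N_SPES):
    """Contiguous, near-equal ks ranges, one per SPE.
    The first (nks % n_spes) SPEs get one extra index each."""
    base, extra = divmod(nks, n_spes)
    ranges, start = [], 0
    for s in range(n_spes):
        size = base + (1 if s < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

Contiguous chunks matter on Cell: they let each SPE DMA its portion of the data as one stride-1 transfer into its 256 KB local store.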

28 Getting Data to the SPEs
Need to calculate: [equation image]
DMA data: stride 1

29 Getting Data to the SPEs
Need to calculate: [equation image]
DMA data: stride 1

30 Getting Data to the SPEs
Need to calculate: [equation image]
DMA data: stride 1
Reciprocity: [equation image]
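The reciprocity relation on this slide is an image. If, as in standard seismic reciprocity, the data are symmetric under exchange of source and geophone wavenumbers (an assumption here, not confirmed by the transcript), only one triangle of the b1 array needs to be stored and DMA'd, roughly halving data movement:

```python
import numpy as np

rng = np.random.default_rng(0)
nk = 4
a = rng.standard_normal((nk, nk))
b1 = np.triu(a) + np.triu(a, 1).T   # symmetric: b1[kg, ks] == b1[ks, kg]
b1_upper = np.triu(b1)              # store (and transfer) only one triangle

def fetch(upper, kg, ks):
    # recover any (kg, ks) entry from the stored upper triangle alone
    return upper[min(kg, ks), max(kg, ks)]
```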

31 Outline
Motivation
Review of IMA
Parallelization of IMA
Cell Processor Implementation
Vectorization of the code
Experimental results

32 Experimental Setup
Blade: QS20, QS21, eCell, Intel, AMD
Processor: Cell BE, Woodcrest dual-core, Opteron dual-core 2220 SE
Clock: 3.2 GHz, 3.0 GHz
RAM: 1 GB, 2 GB, 4 GB, 8 GB (1 GB DIMMs)
L2 cache: 0.5 MB, 4 MB, 2 MB
Swap: 2 GB, 0 GB, 8 GB
Compiler: ppuxlc++ v0.8.2, spu-gcc 4.1.1, gcc, gcc 3.4.5
MPI: mpich, mpich gcc64

33 Experimental Parameters
# of source locations: 81
# of receiver locations: 81
Shot spacing: 10
Geophone spacing: 10
Reference wave-speed:
Min pseudo-depth: 60
Max pseudo-depth: 250
Epsilon for primary exclusion: 45
Min frequency: 0
Max frequency: 30, 60, 120, 256, 512
# of samples per trace: 512

34 Experiment Results: Speed-Up Relative to Cell BE
Max Frequencies   Woodcrest (4 cores)   Opteron
30                1.29                  1.71
60                2.11                  3.13
120               3.15                  4.37
256               5.39                  6.01
512               7.19                  8.76

35 Experiment Results: BG/L vs Cell BE
Max Freq | Cell Blade | BlueGene/L (1 CPU, 8 CPUs, 32, 64, 256, 512, 1008) | Cell speed-up relative to BG/L
60: 5.97, 893.5, 450.4, 33.8, 20, 214
120: 121.08, 3404, 429.6, 227, 9039.2, 4532.2, 2299, 352
BG/L in double precision; BG/L scalar code
~6 Cell blades = 1 BG/L rack (2048 CPUs)

36 Thanks!
Sam Kaplan, Gordon Fossum, Jizhu Lu, Billy Robinson
Reference: Sam T. Kaplan, Billy Robinson, Kristopher A. Innanen and Arthur B. Weglein, "Optimizing Internal Multiple Attenuation Algorithms for Large Distributed Systems", Mission-Oriented Seismic Research Program (M-OSRP) Annual Report, pages , 2005.

37 Backup Slides

38 Vectorization of Partial Integral
4 partial sums (shown in blue); new 4 values (in yellow)
Use 3 "shuffle" operations to gather the yellow data
Use 4 adds
Shuffle, then add
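The slide's diagram is not in the transcript, so here is one plausible scalar sketch of the scheme it describes (an assumed interpretation, in Python rather than SPE SIMD intrinsics): keep four running partial sums, as the lanes of one 4-wide SIMD register would, then combine them with shuffle-and-add steps at the end.

```python
import numpy as np

def simd_style_sum(x):
    """Sum x by accumulating 4 lane-wise partial sums (like a 4-wide SIMD
    register), then a 2-step shuffle/add horizontal reduction."""
    x = np.asarray(x, dtype=np.float64)
    assert len(x) % 4 == 0
    acc = np.zeros(4)
    for i in range(0, len(x), 4):
        acc = acc + x[i:i + 4]        # one vector add per 4 inputs
    # horizontal reduction: shuffle + add, twice
    acc = acc + acc[[2, 3, 0, 1]]     # fold high half onto low half
    acc = acc + acc[[1, 0, 3, 2]]     # fold odd lanes onto even lanes
    return acc[0]
```

The point of the layout on the slide is the same: the inner loop stays purely vertical (vector adds only), and the cross-lane shuffles are paid once per integral rather than once per element.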

39 Cell Broadband Engine History
IBM, SCEI/Sony, Toshiba alliance formed in 2000
Design Center opened March 2001
~$400M investment, 5 years, 600 people
February 7, 2005: first technical disclosures
January 12, 2006: alliance extended 5 additional years

