Slide 1: Accelerating IMA: A Processor Performance Comparison of the Internal Multiple Attenuation Algorithm
9/18/2018
Michael Perrone, Manager, Cell Solution Dept., IBM Research
Work with Gordon Fossum, Jizhu Lu & Billy Robinson
Slide 2: Outline
- Motivation
- Review of IMA
- Parallelization of IMA
- Cell Processor Implementation
- Experimental results
Slide 3: Estimating 3D IMA Run Time (Kaplan et al., 2006)
- Assume 100 shots, 100 receivers (for both surface dimensions), 1000 pseudo-depth points, 1000 output frequencies
- Then IMA operations: ~10^23
- BlueGene/L: 50 x 10^12 FLOPS, so IMA runtime: ~300 years!
- But with a reduced parameter set, IMA runtime: ~ years
- LANL Roadrunner Project (sustained PetaFLOP with today's Cell), IMA runtime: ~ years
- New Cell processor in 2010, IMA runtime: ~ years
- Feasible by 2010!
Slide 4: Power Density – The Fundamental Problem
Speaker notes: This is chip power, characterized in a design-independent manner. Active power is due to transistor switching; this is "useful" power in the sense that computation is occurring. Passive power, or leakage power, is "wasted" in that it is consumed in the absence of computation.
Messages:
- Passive power was not a problem a few years ago (note the 1994 value), but now it is as much of a concern as active power
- Passive power is escalating faster than active power
- Passive power values are nominal; 3-sigma values are slightly higher
- The current cooling limit is approximately W/cm^2
- Gate leakage is dictated by the gate oxide material
- Sub-threshold leakage is determined by threshold voltage and channel length
Slide 5: What's Causing the Problem?
[Chart: Power Density (W/cm^2) vs. Gate Length (microns), down to the 65 nm node; the gate dielectric is approaching a fundamental limit (a few atomic layers)]
(Speaker notes are identical to Slide 4.)
Slide 6: Diminishing Returns on Frequency
- In a power-constrained environment, raising chip clock speed yields diminishing returns
- The industry has moved to lower-frequency multicore architectures
[Chart: frequency-driven design points]
Slide 7: Power vs. Performance Trade-Offs
- We need to adapt our algorithms to get performance out of multicore
[Chart: relative power/performance design points]
Slide 8: Outline
- Motivation
- Review of IMA
- Parallelization of IMA
- Cell Processor Implementation
- Experimental results
Slides 9–12: Internal Multiples
[Animated diagram, built up over four slides: air gun at the surface, receiver, buried reflectors; the final frames label the internal-multiple ray paths]
Slide 13: M-OSRP Internal Multiple Attenuation Algorithm
- Internal multiples: seismic events that have experienced at least one downward reflection from a buried reflector in the Earth
- Problem: internal multiples obscure the signal we're interested in
- Solution: estimate the internal multiples with the IMA algorithm, then subtract them from the other seismic signals
- BUT... IMA can be extremely costly [Kaplan et al., 2005]:
  - 2D: order 4 in the number of frequencies
  - 3D: order 8 in the number of frequencies
Slide 14: IMA Computation (following Kaplan et al., 2005)
[Equation images not preserved in the export]
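The slide's equations were images and did not survive. As a hedged reconstruction, the widely published 2D form of the internal-multiple attenuator (the form used in the inverse-scattering-series literature that Kaplan et al., 2005 follow), consistent with the symbol definitions on Slide 15, is:

```latex
b_3(k_g, k_s, \omega) = \frac{1}{(2\pi)^2}
  \int_{-\infty}^{\infty} dk_1 \int_{-\infty}^{\infty} dk_2
  \int_{-\infty}^{\infty} dz_1\, e^{i(q_g+q_1)z_1}\, b_1(k_g,k_1,z_1)
  \int_{-\infty}^{z_1-\epsilon} dz_2\, e^{-i(q_1+q_2)z_2}\, b_1(k_1,k_2,z_2)
  \int_{z_2+\epsilon}^{\infty} dz_3\, e^{i(q_2+q_s)z_3}\, b_1(k_2,k_s,z_3)
```

where the vertical wave numbers are $q_i = \operatorname{sgn}(\omega)\sqrt{\omega^2/c_0^2 - k_i^2}$ for $i \in \{g, s, 1, 2\}$.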
Slide 15: IMA Definitions
- q_g, q_s, q_1, q_2 are vertical wave numbers
- k_g and k_s are the Fourier transform variables over geophone and source locations, respectively
- ω is the angular temporal frequency
- b1 is an uncollapsed migration which has been transformed to pseudo-depth by using a constant water velocity c0
- z is the pseudo-depth
- ε is a small positive constant
Slide 16: Outline
- Motivation
- Review of IMA
- Parallelization of IMA
- Cell Processor Implementation
  - Vectorization of the code
- Experimental results
Slide 17: Parallelization Strategy
[Diagram: Data Nodes <-MPI-> Compute Nodes <-MPI-> Collection Nodes]
- Data nodes: enough to hold the data (1–8 nodes)
- Compute nodes: as many as possible (1–1024)
- Collection nodes: as many as needed to handle the compute output (1–8)
Slide 18: Parallelization Strategy (message flow)
Data nodes:
- Read b1()
- Receive request
- Send b1(k1,*) & b1(k2,*)
- Done
Compute nodes:
- Request b1(k1,*) & b1(k2,*)
- Receive b1(k1,*) & b1(k2,*)
- Compute rho(k1,k2,kg,ks,w) for all w
- Send rho(k1,k2,kg,ks,w) for all w
- Done
Collection nodes:
- Receive rho(k1,k2,kg,ks,w) for all w
- Add rho(k1,k2,kg,ks,w) to b3(kg,ks,w) for all w
- Save b3(kg,ks,w)
- Done
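Illustrative only: the three node roles above, simulated with plain Python calls instead of MPI messages. Sizes are toy values, the rho kernel is a stand-in, and the kg, ks loops are elided to a single output trace; none of these names come from the talk's actual source.

```python
# Toy simulation of the data / compute / collection node roles (no MPI).
NK, NW = 4, 8                                  # toy wavenumber / frequency counts
b1 = [[float(k * NW + w) for w in range(NW)] for k in range(NK)]

def data_node(k1, k2):
    """'Receive request ... send b1(k1,*) & b1(k2,*)'."""
    return b1[k1], b1[k2]

def compute_node(k1, k2):
    """'Compute rho(k1,k2,*,*,w) for all w' -- stand-in for the real kernel."""
    v1, v2 = data_node(k1, k2)
    return [a * b for a, b in zip(v1, v2)]

# Collection node: accumulate every rho contribution into b3(w).
b3 = [0.0] * NW
for k1 in range(NK):
    for k2 in range(NK):
        b3 = [acc + r for acc, r in zip(b3, compute_node(k1, k2))]
```

The point of the split is that b1 is read once by the data nodes, streamed on demand to the compute nodes, and only the small accumulated b3 ever reaches the collection nodes.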
Slide 19: Computing Node Control Flow
Tested on Woodcrest blades, Opteron blades, and BlueGene/L nodes.
Original control flow:
- For each system, evenly partition the (k1,k2,kg,ks,w) space over the nodes
- For each k1 & k2:
  - Get b1(k1,*) & b1(k2,*)
  - For each kg, ks & w: calculate rho
  - Send rho to the collection node
Slide 20: Outline
- Motivation
- Review of IMA
- Parallelization of IMA
- Cell Processor Implementation
  - Vectorization of the code
- Experimental results
Slide 21: Introducing Cell BE v1.0
First-generation Cell BE (90 nm):
- 241M transistors, 235 mm^2
- 9 cores, 10 threads
- >200 GFlops single precision, >20 GFlops double precision
- 25 GB/s memory bandwidth
- 75 GB/s I/O bandwidth
- >300 GB/s EIB bandwidth
- Top frequency >4 GHz (observed in lab)
Slide 22: Heterogeneous Multi-core Architecture
[Block diagram of the Cell BE]
Slide 23: PPE
- 1 PPE core
- VMX unit
- L1 and L2 caches
- 2-way SMT
Slide 24: SPEs
- 8 SPEs
- 128-bit SIMD instruction set
- Register file: 128 x 128-bit
- Local store: 256 KB, ECC enabled
- Dedicated asynchronous DMA engine
Slide 25: Element Interconnect Bus (EIB)
- 96 B/cycle bandwidth
Slide 26: Cell Processor Review
- Heterogeneous multi-core system architecture
  - Power Processor Element (PPE) for control tasks
  - Synergistic Processor Elements (SPEs) for data-intensive processing
- Each SPE consists of:
  - Synergistic Processor Unit (SPU)
  - Synergistic Memory Flow Control (MFC): data movement and synchronization; interface to the high-performance Element Interconnect Bus
[Block diagram: 8 SPEs (SXU + local store + MFC), each with a 16 B/cycle port onto the EIB (up to 96 B/cycle); PPE (PXU, L1, L2) on a 64-bit Power Architecture with VMX; MIC to dual XDR memory; BIC to FlexIO]
Slide 27: Computing Node Control Flow
Original control flow:
- For each k1 & k2: get b1(k1,*) & b1(k2,*); for each kg, ks & w, calculate rho; send rho to the collection node
Control flow on Cell:
PPE:
- Create SPE threads; DMA task info
- For each k1 & k2:
  - Get b1(k1,*) & b1(k2,*)
  - Partition ks amongst the SPEs
  - Synchronize SPEs
  - Send rho to the collection node
SPE:
- Get parameters from the PPE
- DMA the b1 vectors
- For each kg & ks:
  - For each w: calculate rho
  - DMA rho to memory
- Synchronize with the PPE
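A minimal sketch of the "partition ks amongst SPEs" step, assuming a contiguous block split with the remainder spread over the first SPEs (the slide does not say which policy was actually used; names are illustrative):

```python
N_SPES = 8  # SPEs per Cell BE processor

def ks_blocks(n_ks, n_spes=N_SPES):
    """Contiguous ks index blocks, one per SPE, sizes differing by at most 1."""
    base, extra = divmod(n_ks, n_spes)
    blocks, start = [], 0
    for spe in range(n_spes):
        size = base + (1 if spe < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

# e.g. 81 receiver wavenumbers, as in the experiments: one SPE gets 11,
# the other seven get 10.
blocks = ks_blocks(81)
```

Contiguous blocks (rather than round-robin) keep each SPE's DMA transfers stride-1, which matters for the MFC.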
Slides 28–30: Getting Data to the SPEs
- Need to calculate: [equation image not preserved]
- DMA data: stride 1
- Reciprocity: [equation image not preserved]
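A toy illustration of why reciprocity helps with stride-1 DMA. The slide's reciprocity equation was an image and is lost; this sketch assumes it states that b1 is symmetric under interchange of its two wavenumber arguments, so a column (normally a strided gather) can be fetched as a contiguous row. All names and shapes here are stand-ins:

```python
n = 4
# A symmetric toy b1: b1[i][j] == b1[j][i].
b1 = [[min(i, j) + 0.5 * max(i, j) for j in range(n)] for i in range(n)]

def fetch_row(k):
    """Contiguous, stride-1 access -- what the SPE's DMA engine wants."""
    return b1[k]

# Column k2 of b1 would need a strided gather; with reciprocity we can
# instead DMA the contiguous row k2 and get the same values.
k2 = 2
column = [b1[i][k2] for i in range(n)]
```

In the real code the payoff is that both b1(k1,*) and b1(k2,*) arrive in single stride-1 DMA transfers, with no scatter/gather lists.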
Slide 31: Outline
- Motivation
- Review of IMA
- Parallelization of IMA
- Cell Processor Implementation
  - Vectorization of the code
- Experimental results
Slide 32: Experimental Setup
- Blades: QS20, QS21, eCell (Cell BE); Intel (Woodcrest dual-core); AMD (Opteron dual-core 2220 SE)
- Clock: 3.2 GHz (Cell), 3.0 GHz
- RAM: 1 GB, 2 GB, 4 GB, 8 GB (1 GB DIMMs)
- L2 cache: 0.5 MB, 4 MB, 2 MB
- Swap: 2 GB, 0 GB, 8 GB
- Compilers: ppuxlc++ v0.8.2 + spu-gcc 4.1.1 (Cell); gcc; gcc 3.4.5
- MPI: mpich; mpich gcc64
(Per-blade column assignment of the remaining values was lost in the export.)
Slide 33: Experimental Parameters
- # of source locations: 81
- # of receiver locations: 81
- Shot spacing: 10
- Geophone spacing: 10
- Reference wave-speed:
- Min pseudo-depth: 60
- Max pseudo-depth: 250
- Epsilon for primary exclusion: 45
- Min frequency: 0
- Max frequency: 30, 60, 120, 256, 512
- # of samples per trace: 512
Slide 34: Experiment Results: Speed-Up Relative to Cell BE

  Max frequencies   Woodcrest   4-core Opteron
  30                1.29        1.71
  60                2.11        3.13
  120               3.15        4.37
  256               5.39        6.01
  512               7.19        8.76
Slide 35: Experiment Results: BG/L vs Cell BE
Columns: Max Freq | Cell Blade | BlueGene/L with 1, 8, 32, 64, 256, 512, and 1008 CPUs | Cell speed-up relative to BG/L
Values as exported (some cells were lost, so column assignment is incomplete):
- Max freq 60: 5.97, 893.5, 450.4, 33.8, 20, 214
- Max freq 120: 121.08, 3404, 429.6, 227, 9039.2, 4532.2, 2299, 352
Notes:
- BG/L in double precision; BG/L scalar code
- ~6 Cell blades = 1 BG/L rack (2048 CPUs)
Slide 36: Thanks!
Sam Kaplan, Gordon Fossum, Jizhu Lu, Billy Robinson
Reference:
Sam T. Kaplan, Billy Robinson, Kristopher A. Innanen and Arthur B. Weglein, "Optimizing Internal Multiple Attenuation Algorithms for Large Distributed Systems", Mission-Oriented Seismic Research Program (M-OSRP) Annual Report, pages , 2005.
Slide 37: Backup Slides
Slide 38: Vectorization of Partial Integral
[Animation: partial sums in blue, the 4 new values in yellow]
- Maintain 4 partial sums in a SIMD register
- Use 3 "shuffle" operations to gather the yellow data
- Use 4 adds
- A final add collapses the partial sums
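The shuffle-and-add reduction above can be emulated in scalar code. This sketch mirrors the 4-wide accumulation pattern only; it uses plain Python, not actual SPU intrinsics, and the function name is illustrative:

```python
def vector_sum(xs):
    """Sum xs the way the slide's SIMD loop does: 4 running partial sums
    (one per vector lane), collapsed by a final horizontal reduction."""
    lanes = [0.0, 0.0, 0.0, 0.0]       # one 128-bit register = 4 floats
    for i, x in enumerate(xs):         # on the SPU: one SIMD add per 4 values
        lanes[i % 4] += x
    # Final "shuffle + add" step: pairwise adds collapse 4 lanes to 1 value.
    return (lanes[0] + lanes[2]) + (lanes[1] + lanes[3])
```

The win on the SPU is that the inner loop issues one vector add per four input values; the cross-lane shuffles are paid only once, outside the loop.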
Slide 39: Cell Broadband Engine History
- IBM, SCEI/Sony, Toshiba alliance formed in 2000
- Design center opened March 2001
- ~$400M investment, 5 years, 600 people
- February 7, 2005: first technical disclosures
- January 12, 2006: alliance extended 5 additional years