Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration. R. Phillip Bording, Husky Energy Chair in Oil and Gas Research, Memorial University of Newfoundland. February 17, 2004. M U N - February 17, 2005 - Phil Bording
Session 1: History of Design. Tycho Brahe; Napier; Charles Babbage (mechanical design); John Atanasoff (storage, spinning capacitor); Konrad Zuse (floating point); Mauchly and Eckert; von Neumann; Harvard memory (separate code memory and data memory); Princeton memory (code and data in one memory).
Session 2: Current Design Issues. Scaling laws; Moore’s Law; transistors (VLSI); memory technology; division of design; the memory challenge; the processor challenge; the ILLIAC and PEPE; IBM 7094; IBM 360/44; IBM 360/95; array processors; the software of array processor calls.
Application Specific Machines
Computing and Calculating Engines
Session 1: History. Vector memory; pipeline arithmetic and array processing; benchmark-driven dollars; Fairhair Syndrome.
Processors; data; memory; ALU; hardwired instructions; processor bottleneck; memory bottleneck; vacuum tubes; core; plated wire; transistors; LSI, 6-T static; VLSI, 2-T dynamic.
Linear Address Space. Addresses run from 0 to Max Address, selected by an address pointer. Latency is the time to access the first word. Bandwidth is the rate of accessing successive words.
von Neumann Architecture (Princeton). Memory holding data and instructions; address pointer; arithmetic logic unit (ALU); program counter, Pc = Pc + 1. Featuring deterministic execution.
After Gustafson, 2004; Bednar, 2004
Bank memory design. Duplicate the memory system, with one design for the subsystem. Use a binary tree design to spread out addresses and data. Fetch/store many words at once. Assume a sequential addressing pattern.
Bank memory design. The wires created a big switch between modules. The slower memory access time was better matched to the faster processor times. Costly to build: significant effort in engineering.
Array memory design. N rows by N columns (N x N bits); M bits on the bus.
Array memory design. Streaming data flow: nibbles, bytes, and words. Sequential access: first word access time = address + latency + data; successive words = data. Random access: indirect addressing, non-uniform strides.
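The sequential-access timing on this slide can be sketched as a small cost model. The function name and the cycle counts below are illustrative assumptions, not figures from the talk; the point is only that a long stream amortizes the first-word cost.

```python
def stream_access_time(n_words, addr_time, latency, data_time):
    """Time to read n_words sequentially from an array memory.

    The first word costs address + latency + data; every successive
    word costs only the data transfer time.
    """
    if n_words <= 0:
        return 0
    first_word = addr_time + latency + data_time
    return first_word + (n_words - 1) * data_time


# Hypothetical timings in clock cycles (assumed for illustration only).
ADDR, LATENCY, DATA = 2, 20, 1

one_word = stream_access_time(1, ADDR, LATENCY, DATA)
long_stream = stream_access_time(1000, ADDR, LATENCY, DATA)
print(one_word, long_stream, long_stream / 1000)
```

Under these assumed timings a single random word pays the full first-word cost, while a 1000-word stream averages close to one cycle per word.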
Benchmark. Scalar operations; array operations; DO-loop domination of codes. Vendors look seriously at the instruction stream. Then comes Linpack: LU decomposition. If it does matrix multiply fast, nothing else matters, or does it???
Fairhair Syndrome. A new world-class machine is designed at MIT, Stanford, or Caltech. Venture capital flows in. The federal government buys 10 new machines. The company goes public. Vulture capitalists sell out. The federal government buys new machines from somebody else, the next fairhair. The company has a stock scandal and goes bankrupt.
Session 2: Current Design Issues. Scaling laws; Moore’s Law; transistors (VLSI); memory technology; division of design; the memory challenge; the processor challenge; the ILLIAC and PEPE; IBM 7094; IBM 360/44; IBM 360/95; array processors; the software of array processor calls; programming models (vectors, shared memory, distributed memory).
Lambda Rules
Division of design. Diagram: one company building both the ALU and the memory, versus Company A and Company B building the ALU and the memory separately, with a weak link between them.
Moore’s Laws. Every 18 months the density of transistors on a VLSI chip doubles. The dollar investment doubles with every new VLSI plant.
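As a quick arithmetic check, the 18-month doubling can be compounded over a decade. This is a minimal sketch; the function name is an assumption made for illustration. The result is in line with the 100-fold speedup in 10 years quoted later in the talk.

```python
def density_growth(years, doubling_months=18.0):
    """Growth factor in transistor density after `years`,
    assuming one doubling every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


growth_10y = density_growth(10)
print(f"{growth_10y:.1f}x in 10 years")
```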
Illiac. 8 x 8 processors; nearest-neighbor connections.
Parallel Element Processing Ensemble (PEPE). Radar processing computer; associative computing. Diagram: data inputs; processing elements P0 ... Pn-3, Pn-2, Pn-1, Pn; data outputs.
IBM Machines. Early 1960’s: 7094, 36-bit arithmetic; 1600 and 1400 processors completely different. Middle 1960’s: new machine, the IBM 360; 32-bit words, 36 bits once memory parity was added (8-bit byte + 1 parity bit); uniform business machine architectures; 32- and 64-bit floating point. No industry standard for the format of floating point.
Array Processors. IBM and CDC designed DMA (Direct Memory Access) processors. This frees the main processor to compute and allows separate simple processors to do the I/O. The idea translated into attached processors for arithmetic processing.
Array Processors. Arrays of data are moved to a local, very high speed memory (fast registers). Arithmetic is performed by special instructions passed to the array processor. Diagram: CPU and array processor.
Software Design Issues. Vector programming; cache programming; message passing programming; NUMA programming; grid programming. ALL of these memory operations have a fixed cost. Code performance improvements are dominated by fixed costs.
Hardware Design Issues. 10 years equals a 100-fold speedup. Memory latency: the cost of getting the first word is a constant. Wires have failed to scale. Bigger cache memories are slower. Code performance improvements are dominated by fixed costs.
Linear Address Space. Addresses run from 0 to Max Address, selected by an address pointer. Latency is the time to access the first word. Bandwidth is the rate of accessing successive words.
von Neumann Architecture (Princeton). Memory holding data and instructions; address pointer; arithmetic logic unit (ALU); program counter, Pc = Pc + 1. Featuring deterministic execution.
Cache Memory Architecture. Main memory is large and slow. Cache is much smaller and much faster. Control logic keeps the main memory coherent. Diagram: address pointer, cache memory, control logic, main memory. Featuring non-deterministic execution.
Cache Memory: Three-Level Architecture. 2 gigahertz clock; cache control logic; address pointer. L1 cache memory: 32 kilobytes. L2 cache memory: 128 kilobytes. L3 cache memory: 16 megabytes. Main memory: multi-gigabytes, large and slow. Relative access costs shown on the diagram: 2X, 8X, 16X, and 160X. Featuring really non-deterministic execution.
Programming Models for Parallel Computing
Distributed Computing: Message Passing Interface. Program address spaces: four separate spaces, each running from 0 to Max. Multiple address pointers.
Distributed Computing with Message Passing. Program address spaces; messages left and right; multiple address pointers.
Multi-Threading: OpenMP Programming Model. Global program address space divided into four local ranges: 0 to n-1, n to 2n-1, 2n to 3n-1, 3n to 4n-1. Address and cache bus with conflict resolution. Multiple address pointers.
Uniqueness of Store: Multi-Threading. Program address space; multiple address pointers. Duplicate pointers to the same location create a conflict on storing a result. So who is managing the multiple pointers? It is the programmer's responsibility.
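One way a programmer can meet that responsibility is to give each thread its own disjoint range of store locations, so no two address pointers ever write the same word. The sketch below uses Python's standard `threading` module as a stand-in for OpenMP; the function name `scale_block` and the array sizes are assumptions made for illustration.

```python
import threading


def scale_block(src, dst, start, stop, factor):
    """Store only into dst[start:stop]; no other thread writes this range."""
    for i in range(start, stop):
        dst[i] = factor * src[i]


n_threads = 4
n = 8  # elements per thread, so thread t owns indices t*n .. (t+1)*n - 1
src = list(range(n_threads * n))
dst = [0] * len(src)

threads = [
    threading.Thread(target=scale_block, args=(src, dst, t * n, (t + 1) * n, 2))
    for t in range(n_threads)
]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(dst[:4], dst[-4:])
```

Because the ranges do not overlap, the result does not depend on the order in which the threads happen to run.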
Multiple Bank Memory Systems. Memory banks 0, 1, 2, 3. Addresses: starting address, +1, +2, +3; and starting address, +N, +2N, +3N. Bank selected by address mod 4. Vector programming model.
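The mod-4 bank selection can be sketched in a few lines to show why the stride matters. The helper names are hypothetical; the example simply counts how many accesses land on each bank for a stride-one stream and for a stride equal to the number of banks.

```python
from collections import Counter


def bank_of(address, n_banks=4):
    """Bank that holds a given word address (address mod n_banks)."""
    return address % n_banks


def banks_touched(start, stride, count, n_banks=4):
    """Count accesses per bank for `count` addresses at a fixed stride."""
    return Counter(bank_of(start + i * stride, n_banks) for i in range(count))


unit = banks_touched(start=0, stride=1, count=8)      # spreads over all banks
conflict = banks_touched(start=0, stride=4, count=8)  # every access hits bank 0
print(dict(unit), dict(conflict))
```

With stride one the four banks share the load evenly; with a stride that is a multiple of the bank count every access queues on the same bank.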
Trends in Technology
A 256 Node SMP Linux Cluster (2001). 512 CPUs, 512 GB, 6 TB SCSI, 1.536 TB local, Gigabit Ethernet. Imagine 20 of these in one room. (Bednar, 2004)
SIZE, COST, and HEAT. The Earth Simulator: 3 megawatts, 500 million US $. It doesn’t simulate global warming, IT CAUSES IT! (Bednar, 2004)
S L O W E R
After Gustafson, 2004; Bednar, 2004
GAP
Computational Earth Sciences
Atmospheric Modeling and Data Assimilation at the DAO. Robert Atlas and the DAO Team, Data Assimilation Office, NASA/GSFC. IWG, November 2001.
The f-v Dynamical Core. Terrain-following Lagrangian control-volume vertical discretization of the basic conservation laws: mass, momentum, total energy. 2D horizontal flux-form semi-Lagrangian discretization: genuinely conservative; Gibbs oscillation free. Absolute vorticity consistently transported with mass dp within the Lagrangian layers. Computationally efficient.
Computational Performance
Progression in model resolution
1990s: 2° x 2.5° (220 km)
2000: 1° x 1.25° (110 km)
2002: 0.5° x 0.625° (55 km)
2004: 0.25° x 0.36° (28 km)
2006: geodesic grid finite-volume at 20 km
2006-2010: up to 10 km; the hydrostatic assumption starts to break down, and this is the transition period to non-hydrostatic dynamics
2010-2020: a revolution in computing technology is to take place
2025: global non-hydrostatic cloud-resolving model with 1 km or finer resolution, capable of resolving individual thunderstorms
Slides from Bob Atlas presentation
Numerical Problem Solving
Problem Solving: 3D Example of Array Addressing. Finite differences on a 3D array; large memory requirement. Wave propagation: FD time-domain algorithm. Psi(i,j,k) = physical variables. How do we address memory? Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base
Problem Solving: 3D Example of Array Addressing. Grid points: (i-1,j,k), (i,j,k), (i+1,j,k). Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base
Array Addressing by Dimension
1D array Psi(Lx): Address = (i-1) + base. Stride-one data.
2D array Psi(Lx,Ly): Address = (j-1)*Lx + (i-1) + base. Stride-N data.
3D array Psi(Lx,Ly,Lz): Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base. Stride-N*N data.
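The 3D address formula can be checked with a short sketch. The function below implements the formula as written on the slide, with 1-based indices and i varying fastest; the grid dimensions are hypothetical values chosen only to make the strides easy to read.

```python
def address(i, j, k, lx, ly, base=0):
    """Linear address of Psi(i, j, k) with 1-based indices, i fastest."""
    return (k - 1) * lx * ly + (j - 1) * lx + (i - 1) + base


# Hypothetical grid dimensions (assumed for illustration only).
LX, LY, LZ = 4, 3, 3

here = address(2, 2, 2, LX, LY)
stride_i = address(3, 2, 2, LX, LY) - here  # neighbour in i: stride one
stride_j = address(2, 3, 2, LX, LY) - here  # neighbour in j: stride Lx
stride_k = address(2, 2, 3, LX, LY) - here  # neighbour in k: stride Lx*Ly
print(here, stride_i, stride_j, stride_k)
```

A finite-difference stencil that reaches neighbours in all three directions therefore mixes stride-one, stride-Lx, and stride-Lx*Ly accesses in every update.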
Cache Memory Access Streams
1D streams: 100%
1D +/-1: 100%
2D +/-1: 100%
2D +/-N: 80%
2D +/-1 and +/-N: 26%
Cache Memory Access Streams
3D +/-1: 100%
3D +/-N: 80%
3D +/-N*N: 28%
3D all: 7%
One Big One versus Many Little Ones
Futures of Micro-poor processors. Lots of arithmetic capability, very hard to use. Market forces will make them good at painting bit maps on screens.
Futures of Micro-poor processors. No relief in memory subsystem design; prefetch will help, but not nearly enough. A million will cost a billion, $$$.
Futures of Micro-poor processors and the Big Switch. The Big Switch is the hot spot and no relief is in sight. No telling what the switch will cost.
Seismic Modeling and the Inverse Problem
12 streamers x 5.1 kilometers long. Data collected for 70 continuous days over 2300 square km.
3D Seismic Modeling. Large-scale 3D: ~200+ wavelengths. Acoustic and elastic wave equations. The inhomogeneous earth has widely varying parameters. Complexity limits the use of 3D elastic modeling. Problem scale: Nx = Ny = Nz ~ 1000; Ntime ~ 10,000; work per grid point ~ 100; number of seismic shots per survey ~ 100,000. A single survey simulation is 10^20 operations.
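The 10^20 figure follows directly from multiplying the problem-scale numbers on this slide. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
nx = ny = nz = 1_000       # grid points per axis
n_time = 10_000            # time steps per shot
work_per_point = 100       # operations per grid point per time step
shots_per_survey = 100_000

ops_per_shot = nx * ny * nz * n_time * work_per_point
ops_per_survey = ops_per_shot * shots_per_survey
print(f"{ops_per_shot:.0e} per shot, {ops_per_survey:.0e} per survey")
```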
The Babbage Difference Engine, circa 1853
Wave Equation Difference Engine (WEDE) for Seismic Modeling. Four processors; acoustic wave equation. My PhD thesis project at the University of Tulsa.
Wave Equation Difference Engine. Finite differences; elastic or acoustic wave equations; regular grids; sponge/one-way wave equation boundary conditions; any source/receiver geometry; explicit, 4th order in time and 8th order in space?
Wave Equation Difference Engine. No cache memory. Deterministic execution. Not MIMD, SIMD, or data flow. Data movement and control match the algorithm. Each grid point has a control word. Three levels of parallelism (amount of parallelism): instruction trees, ~10-20; multiple instructions with selection, ~2-3; multiple grid points, ~hundreds of thousands.
Acoustic, Constant Density. Density is so constant it does not appear in the equation. C is the P-wave velocity. The source energy is in src. Psi is the wave field.
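The equation itself did not survive on this slide. Assuming it is the standard constant-density acoustic wave equation, d2Psi/dt2 = C^2 * Laplacian(Psi) + src, a finite-difference time step can be sketched as below. This is a simplified 1D illustration, second order in time and space with fixed end points, not the higher-order scheme or the sponge boundaries mentioned elsewhere in the talk; the grid size, spacing, velocity, and time step are hypothetical.

```python
def acoustic_step(psi_old, psi, c, dt, dx, src):
    """Advance the 1D wave field one time step (2nd order in time and space)."""
    n = len(psi)
    psi_new = [0.0] * n  # end points stay at zero (simple fixed boundary)
    for i in range(1, n - 1):
        laplacian = (psi[i - 1] - 2.0 * psi[i] + psi[i + 1]) / (dx * dx)
        psi_new[i] = (2.0 * psi[i] - psi_old[i]
                      + (c[i] * dt) ** 2 * laplacian
                      + dt * dt * src[i])
    return psi_new


# Hypothetical model: 101 grid points, 10 m spacing, 1500 m/s, 1 ms steps.
N, DX, DT = 101, 10.0, 0.001
c = [1500.0] * N
psi_old = [0.0] * N
psi = [0.0] * N

for step in range(50):
    src = [0.0] * N
    if step == 0:
        src[N // 2] = 1.0  # single impulse at the centre grid point
    psi_old, psi = psi, acoustic_step(psi_old, psi, c, DT, DX, src)
```

Each grid point update reads only its immediate neighbours and the two previous time levels, which is the regular, local data movement the talk argues can be mapped into hardware.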
Wave Equation Difference Engine: Machine Performance. 100 operations in the pipeline; 1,000,000 grid point processors; 100 megahertz clock; 10^16 operations per second.
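The performance figure is the product of the three numbers on this slide. The sketch below also divides a per-shot operation count by that rate; the 10^15 operations per shot is an assumption taken from the problem-scale slide (1000^3 grid points, 10,000 time steps, 100 operations per point).

```python
ops_in_pipeline = 100
grid_point_processors = 1_000_000
clock_hz = 100_000_000  # 100 megahertz

ops_per_second = ops_in_pipeline * grid_point_processors * clock_hz

# Assumed per-shot work from the problem-scale slide: 1000^3 * 10,000 * 100.
ops_per_shot = 10**15
seconds_per_shot = ops_per_shot / ops_per_second
print(f"{ops_per_second:.0e} ops/s, {seconds_per_shot} s per shot")
```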
Application Specific Parallel Computing. Choose carefully an application which is BIG. Find an algorithm which is suitable: good data locality; regular structure in data movement; high memory data transfers. Map the algorithm into hardware.
Application Specific Parallel Computing: What it is not! Not suitable for just any algorithm. Not general purpose; we will have an efficient but specific memory subsystem. Does not match the alphabet soup: SIMD, MIMD, NUMA, etc.
What do ASP machines need?? A VLSI design team, fabless and good? A clever architect for the problem. A very good memory design!
What do ASP machines do away with?? Language compilers. Outdated junk in the processor design, x86! Cache memories! Non-deterministic execution!
Multiple Bank Memory Systems. Memory banks 0, 1, 2, 3. Addresses: starting address, +1, +2, +3; and starting address, +N, +2N, +3N. Bank selected by address mod 4. As many as are needed!!!!
Pipelined Instruction Trees. Each higher level offers parallel operations. The pipeline assumes all registers are loaded every cycle. Hardwired?? Actually, today the instruction trees could be re-configurable using re-programmable cells!!! Example: r = a+b-x*y
Pipelined Instruction Trees. Diagram: two instruction trees built from +, -, and * nodes over the operands a, b, d, x, y. Multiple trees offer the second level of parallelism.
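A two-level instruction tree for r = a+b-x*y can be modelled in software as a pipeline with one register between the levels. This is only an illustrative sketch of the idea, not the hardware design; the function name and the drain cycle are assumptions.

```python
def pipelined_tree(inputs):
    """Evaluate r = a + b - x*y for a stream of (a, b, x, y) operands.

    Level 1 computes a+b and x*y (independent, so parallel in hardware).
    Level 2 subtracts. A pipeline register sits between the levels, so
    level 2 works on the previous cycle's operands while level 1 takes
    in new ones.
    """
    stage1 = None  # pipeline register between level 1 and level 2
    results = []
    for operands in list(inputs) + [None]:  # one extra cycle to drain
        if stage1 is not None:
            total, product = stage1
            results.append(total - product)       # level 2
        if operands is not None:
            a, b, x, y = operands
            stage1 = (a + b, x * y)                # level 1
        else:
            stage1 = None
    return results


results = pipelined_tree([(1, 2, 3, 4), (5, 6, 7, 8)])
print(results)
```

After the one-cycle fill, the model delivers one result per cycle, which is the behaviour the slide describes when all registers are loaded every cycle.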
Three Levels of Parallelism. Instruction trees, multiple levels; multiple results; multiple grid point processors.
Wave Machine
Imaging Machine
Wave Equation
a) 8th or 10th order in space
b) 4th order in time, tricky but possible
c) Sponge boundary conditions, slowly varying weights along sides
d) Nominal flat topography; new schemes are building in topography
e) Any seismic source location, any geophone location
Elastic Wave Equation
a) Grid point work is about 100 operations
b) About 20,000 time steps per shot
c) 200 wavelengths gives about 160,000 geophone locations
d) Traces have 4096 samples at 2 milliseconds; could be 1 ms
Elastic Wave Equation. Shots are placed at twice the receiver spacing. Number of shots equals 40,000. Model frequency is velocity dependent; assume something on the order of 60 hertz.
Economics. Up-front fixed cost: $5 to $10 million. Each ASP chip is $5 to $10. A petaflop for $5 or $10 million.
Economics. A seismic shot takes 0.1 seconds. A 5-year life is 50,000 models. A realistic 3D elastic seismic model would cost $200.
Comparison. 10 clusters, ~$10 million: 10 models per year. One Waves in Linear Motion Analyzer (WILMA), ~$10 million: 10,000 models per year.
Comparison. Waves in Linear Motion Analyzer: 1000X faster, for the same money!
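The cost and speed comparison can be reproduced from the figures on the economics and comparison slides. The sketch assumes the upper end of the quoted $5 to $10 million machine cost; the variable names are chosen here for illustration.

```python
machine_cost = 10_000_000        # assumed: upper end of $5 to $10 million
wilma_models_per_year = 10_000
cluster_models_per_year = 10
lifetime_years = 5

lifetime_models = wilma_models_per_year * lifetime_years
cost_per_model = machine_cost / lifetime_models
speedup = wilma_models_per_year // cluster_models_per_year
print(lifetime_models, cost_per_model, speedup)
```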
Summary. 1000 megawatts is a good-sized power station. Good memory design is worth the money! Removing the obstacles to efficient computing gives sustainable performance.
Summary. Slower is better. Less power is better. High efficiency is better.
Conclusions. Deterministic computing is important for performance. Application specific computing is a good fit for the wave equation, and very cost effective.
Thanks. SEG Continuing Education; Memorial University of Newfoundland.