Husky Energy Chair in Oil and Gas Research


Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration
R. Phillip Bording, February 17, 2004
Husky Energy Chair in Oil and Gas Research, Memorial University of Newfoundland
M U N - February 17, 2005 - Phil Bording

Session 1: History of Design
Tycho Brahe
Napier
Charles Babbage: mechanical design
John Atanasoff: storage, spinning capacitor
Konrad Zuse: floating point
Mauchly and Eckert
von Neumann
Harvard: separate code memory and data memory
Princeton: one memory for code and data

Session 2: Current Design Issues
Scaling laws
Moore's Law
Transistors: VLSI
Memory: technology
Division of design
The memory challenge
The processor challenge
The ILLIAC, PEPE
IBM 7094
IBM 360/44
IBM 360/95
Array processors
The software of array processor calls

Application Specific Machines

Computing and Calculating Engines

Session 1: History
Vector memory
Pipeline arithmetic: array processing
Benchmark-driven dollars
Fairhair Syndrome

Processors
Data
Memory
ALU
Hardwired instructions
Processor bottleneck
Memory bottleneck
Vacuum tubes
Core
Plated wire
Transistors
LSI: 6 T static
VLSI: 2 T dynamic

Linear Address Space
[Diagram: linear memory from address 0 to Max Address, with an address pointer]
Latency is the time to access the first word.
Bandwidth is the rate of accessing successive words.

von Neumann Architecture (Princeton)
[Diagram: memory with address pointer; Arithmetic Logic Unit (ALU); data/instructions path; program counter, Pc = Pc + 1]
Featuring deterministic execution

[Figure: after Gustafson, 2004; Bednar, 2004]

Bank memory design
Duplicate memory system
One design for subsystem
Use a binary tree design to spread out addresses and data
Fetch/store many words at once
Assume a sequential addressing pattern

Bank memory design
The wires created a big switch between modules
The slower memory access time was better matched to the faster processor times
Costly to build: significant engineering effort

Array memory design
[Diagram: N rows by N columns, NxN bits; M bits on bus]

Array memory design
Streaming data flow: nibbles, bytes, and words
Sequential access
First word access time = add + latency + data
Successive words = data
Random access
Indirect addressing
Non-uniform strides
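
The access-time rule on this slide can be sketched as a small cost model. This is only an illustration: it assumes "add" means the address set-up time, that all three terms share one time unit, and that a fully random access pattern pays the first-word cost on every word. The function names and the sample numbers are hypothetical, not from the slides.

```python
def stream_access_time(n_words, add, latency, data):
    """Time to read n_words sequential words from one stream.

    The first word pays add + latency + data; each later word pays data only.
    All arguments use the same (assumed) time unit, e.g. clock cycles.
    """
    if n_words <= 0:
        return 0
    return (add + latency + data) + (n_words - 1) * data


def random_access_time(n_words, add, latency, data):
    """Assumed worst case: every word pays the full first-word cost."""
    if n_words <= 0:
        return 0
    return n_words * (add + latency + data)


# Illustrative numbers only (not taken from the slides).
sequential = stream_access_time(1000, add=2, latency=50, data=1)
scattered = random_access_time(1000, add=2, latency=50, data=1)
print(sequential, scattered)
```

Under these made-up numbers the sequential stream is about fifty times cheaper, which is the point of the slide: latency is paid once per stream, bandwidth once per word.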

Benchmark
Scalar operations
Array operations
Do-loop domination of codes
Vendors look seriously at instruction stream
Then comes Linpack: LU decomposition
If it does matrix multiply fast, nothing else matters. Or does it?

Fairhair Syndrome
New world-class machine is designed at MIT, Stanford, or Caltech
Venture capital flows in
Federal Government buys 10 new machines
Company goes public
Vulture capitalists sell out
Federal Government buys new machines from somebody else, the next fairhair
Company has stock scandal, goes bankrupt

Session 2: Current Design Issues
Scaling laws
Moore's Law
Transistors: VLSI
Memory: technology
Division of design
The memory challenge
The processor challenge
The ILLIAC, PEPE
IBM 7094
IBM 360/44
IBM 360/95
Array processors
The software of array processor calls
Programming models: vectors, shared memory, distributed memory

Lambda Rules

Division of design
[Diagram labels: One Company: ALU, Memory; Company A and Company B: ALU, Memory; Weak Link]

Moore's Laws
Every 18 months the density of transistors on a VLSI chip doubles
The investment in dollars doubles with every new VLSI plant

Illiac
8 x 8 processors
Nearest-neighbor connections

Parallel Element Processing Ensemble (PEPE)
Radar processing computer
Associative computing
[Diagram: data inputs feed processing elements P0 ... Pn-3, Pn-2, Pn-1, Pn, which produce data outputs]

IBM Machines
Early 1960s:
7094, 36-bit arithmetic
1600 and 1400 processors completely different
Middle 1960s:
New machine: IBM 360
36-bit words, but memory parity was added: 8-bit byte + 1 parity bit
Uniform business machine architectures
32- and 64-bit floating point
No industry standard for floating-point format

Array Processors
IBM and CDC designed DMA (Direct Memory Access) processors
Frees the main processor to compute
Allows separate simple processors to do the I/O
The idea translated into attached processors for arithmetic processing

Array Processors
Arrays of data are moved to a local very high speed memory (fast registers)
Arithmetic is performed by special instructions passed to the array processor
[Diagram: CPU connected to array processor]

Software Design Issues
Vector programming
Cache programming
Message passing programming
NUMA programming
Grid programming
ALL of these memory operations have a fixed cost.
Code performance improvements are dominated by fixed costs.

Hardware Design Issues
10 years equals 100-fold speedup
Memory latency: the cost of getting the first word is a constant
Wires have failed to scale
Bigger cache memories are slower
Code performance improvements are dominated by fixed costs
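
The "10 years equals 100-fold" figure is consistent with the 18-month doubling quoted on the Moore's Laws slide, if one assumes speed tracks transistor density (that link is an assumption of this sketch, not something the slides state). The function name is hypothetical.

```python
def growth_factor(years, doubling_months=18):
    """Growth after `years` when the quantity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


ten_year = growth_factor(10)
print(round(ten_year, 1))  # roughly one hundred
```

Ten years holds 120 / 18, or about 6.7, doublings, and 2 raised to that power is a little over 100. Memory latency, by contrast, is listed on the same slide as a constant, so the gap between the two widens by the same factor.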

Linear Address Space
[Diagram: linear memory from address 0 to Max Address, with an address pointer]
Latency is the time to access the first word.
Bandwidth is the rate of accessing successive words.

von Neumann Architecture (Princeton)
[Diagram: memory with address pointer; Arithmetic Logic Unit (ALU); data/instructions path; program counter, Pc = Pc + 1]
Featuring deterministic execution

Cache Memory Architecture
Main memory is large and slow.
Cache is much smaller and much faster.
Control logic keeps the main memory coherent.
[Diagram: main memory, control logic, cache memory, address pointer]
Featuring non-deterministic execution

Cache Memory: Three-Level Architecture
2 gigahertz clock, cache control logic, address pointer
L1 cache memory: 32 kilobytes, 2X
L2 cache memory: 128 kilobytes, 8X
L3 cache memory: 16 megabytes, 16X
Main memory: multi-gigabytes, large and slow, 160X
Featuring really non-deterministic execution

Programming Models for Parallel Computing

Distributed Computing: Message Passing Interface
[Diagram: four separate program address spaces, each running from 0 to Max, with multiple address pointers]

Distributed Computing with Message Passing
[Diagram: program address spaces exchanging messages left and right, with multiple address pointers]

Multi-Threading: OpenMP Programming Model
[Diagram: one global program address space divided into four local ranges, 0 to n-1, n to 2n-1, 2n to 3n-1, and 3n to 4n-1, joined by an address and cache bus with conflict resolution; multiple address pointers]

Uniqueness of Store: Multi-Threading
[Diagram: one program address space with multiple address pointers]
Duplicate pointers to the same location: conflict on storing a result
So who is managing the multiple pointers?
It is the programmer's responsibility.
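
A minimal sketch of what "the programmer's responsibility" looks like in practice. The slides discuss OpenMP; this example swaps in Python's standard `threading` module purely for illustration, and the names `total`, `total_lock`, and `add_many` are hypothetical. Four threads store to the same location, and an explicit lock serialises each store.

```python
import threading

total = 0                      # the single shared location every thread stores to
total_lock = threading.Lock()  # the programmer-supplied conflict resolution


def add_many(count):
    """Increment the shared total `count` times, one locked update at a time."""
    global total
    for _ in range(count):
        with total_lock:       # only one thread may read-modify-write at once
            total += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # 40000
```

With the lock held around every update the result is always 40000. Without it, the read-modify-write steps of different threads may interleave and updates can be lost, which is the store conflict the slide warns about.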

Multiple Bank Memory Systems
Memory banks: Bank 0, Bank 1, Bank 2, Bank 3
Unit stride: Starting Address, +1, +2, +3
Stride N: Starting Address, +N, +2N, +3N
Bank selection: Mod 4
Vector programming model
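
The "Mod 4" bank selection can be sketched as follows, assuming simple word interleaving in which bank = address mod 4 (the slide shows only the labels, so the exact mapping is an assumption). The function names are hypothetical.

```python
NUM_BANKS = 4


def bank_of(address, num_banks=NUM_BANKS):
    """Bank that holds a word address under assumed word interleaving."""
    return address % num_banks


def banks_touched(start, stride, count, num_banks=NUM_BANKS):
    """Sequence of banks visited by a strided vector access."""
    return [bank_of(start + i * stride, num_banks) for i in range(count)]


unit = banks_touched(start=0, stride=1, count=8)     # spreads over all banks
stride4 = banks_touched(start=0, stride=4, count=8)  # every access hits bank 0
stride3 = banks_touched(start=0, stride=3, count=8)  # odd stride spreads again
print(unit, stride4, stride3)
```

Unit stride cycles through all four banks, so four accesses can overlap. A stride equal to the bank count sends every access to the same bank and serialises them, which is why the stride N case on the slide matters for vector code.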

Trends in Technology

A 256-Node SMP Linux Cluster (2001)
512 CPUs, 512 GB, 6 TB SCSI, 1.536 TB local, gigabit Ethernet
Imagine 20 of these in one room.
Bednar, 2004

SIZE, COST, and HEAT
The EARTH Simulator
3 megawatts
500 million US $
It doesn't simulate global warming, IT CAUSES IT!
Bednar, 2004

S L O W E R

[Figure: after Gustafson, 2004; Bednar, 2004]

GAP

Computational Earth Sciences

Atmospheric Modeling and Data Assimilation at the DAO
Robert Atlas and the DAO Team
Data Assimilation Office, NASA/GSFC
IWG, November 2001

The f-v Dynamical Core
Terrain-following Lagrangian control-volume vertical discretization of the basic conservation laws: mass, momentum, total energy
2D horizontal flux-form semi-Lagrangian discretization
Genuinely conservative
Gibbs oscillation free
Absolute vorticity consistently transported with mass dp within the Lagrangian layers
Computationally efficient

Computational Performance

Progression in model resolution
1990s: 2° x 2.5° (220 km)
2000: 1° x 1.25° (110 km)
2002: 0.5° x 0.625° (55 km)
2004: 0.25° x 0.36° (28 km)
2006: Geodesic grid finite-volume at 20 km
2006-2010: up to 10 km; hydrostatic assumption starts to break down; this is the transition period to non-hydrostatic dynamics
2010-2020: revolution in computing technology is to take place
2025: global non-hydrostatic cloud-resolving model with 1 km or finer resolution, capable of resolving individual thunderstorms
Slides from Bob Atlas presentation

Numerical Problem Solving

Problem Solving: 3D Example of Array Addressing
Finite differences: 3D array
Large memory requirement
Wave propagation, FD time-domain algorithm
Psi(i,j,k) = physical variables
How do we address memory?
Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base

Problem Solving: 3D Example of Array Addressing
[Diagram: neighboring grid points (i-1,j,k), (i,j,k), (i+1,j,k)]
Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base

Array Addressing by Dimension
1D array Psi(Lx): Address = (i-1) + base; stride-one data
2D array Psi(Lx,Ly): Address = (j-1)*Lx + (i-1) + base; stride-N data
3D array Psi(Lx,Ly,Lz): Address = (k-1)*Lx*Ly + (j-1)*Lx + (i-1) + base; stride-N*N data
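
The addressing formula can be checked with a short sketch. It uses the slide's 1-based indices with i varying fastest; the function name `address_3d` and the small grid sizes are hypothetical, chosen only to make the strides easy to read.

```python
def address_3d(i, j, k, lx, ly, base=0):
    """Linear address of Psi(i, j, k) with 1-based indices and i fastest."""
    return (k - 1) * lx * ly + (j - 1) * lx + (i - 1) + base


LX, LY, LZ = 4, 3, 2  # illustrative grid, not from the slides

center = address_3d(2, 2, 1, LX, LY)
stride_i = address_3d(3, 2, 1, LX, LY) - center  # neighbor in i
stride_j = address_3d(2, 3, 1, LX, LY) - center  # neighbor in j
stride_k = address_3d(2, 2, 2, LX, LY) - center  # neighbor in k
print(stride_i, stride_j, stride_k)
```

A step in i moves one word, a step in j moves Lx words, and a step in k moves Lx*Ly words. With Lx = Ly = N these are the stride-one, stride-N, and stride-N*N data streams named on the slide.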

Cache Memory Access Streams
1D streams: 100%
1D +/-1: 100%
2D +/-1: 100%
2D +/-N: 80%
2D +/-1, +/-N: 26%

Cache Memory Access Streams
3D +/-1: 100%
3D +/-N: 80%
3D +/-N*N: 28%
3D all: 7%

One Big One versus Many Little Ones

Futures of Micro-poor processors
Lots of arithmetic capability, very hard to use
Market forces will make them good at painting bit maps on screens

Futures of Micro-poor processors
No relief in memory subsystem design; prefetch will help but not nearly enough
A million will cost a billion, $$$

Futures of Micro-poor processors and the Big Switch
The Big Switch is the hot spot and no relief is in sight.
No telling what the switch will cost.

Seismic Modeling and the Inverse Problem

12 streamers x 5.1 kilometers long
Data collected for 70 continuous days
Over 2300 square km

3D Seismic Modeling
Large-scale 3D: ~200+ wavelengths
Acoustic and elastic wave equations
The inhomogeneous Earth has widely varying parameters.
Complexity limits use of 3D elastic modeling.
Problem scale:
Nx = Ny = Nz ~ 1000
Ntime ~ 10,000
Work per grid point ~ 100
Number of seismic shots per survey ~ 100,000
Single survey simulation is 10^20 operations.
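
The 10^20 figure follows directly from the problem-scale numbers on this slide. The sketch below only multiplies them out; the variable names are hypothetical and the values are the slide's round estimates.

```python
nx = ny = nz = 1000           # grid points per axis
ntime = 10_000                # time steps per shot
work_per_grid_point = 100     # operations per grid point per time step
shots_per_survey = 100_000

grid_points = nx * ny * nz                                # 10^9
ops_per_shot = grid_points * ntime * work_per_grid_point  # 10^15
ops_per_survey = ops_per_shot * shots_per_survey          # 10^20
print(ops_per_shot, ops_per_survey)
```

One shot is about 10^15 operations, and the 100,000 shots of a survey bring the total to 10^20.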

The Babbage Difference Engine, circa 1853

Wave Equation Difference Engine (WEDE) for Seismic Modeling
Four processors
Acoustic wave equation
My PhD thesis project at the University of Tulsa

Wave Equation Difference Engine
Finite differences
Elastic or acoustic wave equations
Regular grids
Sponge/one-way wave equation boundary conditions
Any source/receiver geometry
Explicit 4th order in time and 8th order in space?

Wave Equation Difference Engine
No cache memory
Deterministic execution
Not MIMD, SIMD, or data flow
Data movement and control match the algorithm
Each grid point has a control word
Three levels of parallelism (amount of parallelism):
Instruction trees, ~10-20
Multiple instructions with selection, ~2-3
Multiple grid points, ~hundreds of thousands

Acoustic, Constant Density
Density is so constant it does not appear in the equation.
C is the P-wave velocity.
The source energy is in src.
Psi is the wave field.
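
The equation image did not survive in this transcript. The sketch below assumes the standard constant-density acoustic wave equation, d2Psi/dt2 = C^2 * Laplacian(Psi) + src, and advances it one time step in 1D. It is deliberately simpler than the machine described in these slides: second order in time and space rather than 4th and 8th, and fixed zero boundaries rather than sponge boundaries. The function name and grid values are hypothetical.

```python
def step(psi_prev, psi, c, src, dt, dx):
    """One explicit time step of the 1D constant-density acoustic wave equation.

    psi_prev, psi: wave field at the previous and current time levels
    c: P-wave velocity at each grid point; src: source term at each grid point
    Returns the wave field at the next time level (end points held at zero).
    """
    n = len(psi)
    psi_next = [0.0] * n
    for i in range(1, n - 1):
        # Second-order centred difference for the spatial second derivative.
        laplacian = (psi[i + 1] - 2.0 * psi[i] + psi[i - 1]) / dx**2
        psi_next[i] = (2.0 * psi[i] - psi_prev[i]
                       + dt**2 * (c[i] ** 2 * laplacian + src[i]))
    return psi_next


n = 11
c = [1.0] * n
dx, dt = 1.0, 0.5                  # c * dt / dx = 0.5, within the stability limit
zero = [0.0] * n
src = [0.0] * n
src[5] = 1.0                       # unit source at the centre for one step

psi_1 = step(zero, zero, c, src, dt, dx)
psi_2 = step(zero, psi_1, c, zero, dt, dx)
print(psi_2[4], psi_2[5], psi_2[6])
```

Each grid point needs only its own value and its immediate neighbors, which is the regular, local data movement the Wave Equation Difference Engine slides rely on.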

Wave Equation Difference Engine: Machine Performance
100 operations in pipeline
1,000,000 grid point processors
100 megahertz clock
10^16 operations per second
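
The 10^16 rate is the product of the three figures on this slide, assuming every pipeline stage of every grid point processor completes one operation per clock. The per-shot operation count below reuses the problem-scale numbers from the 3D Seismic Modeling slide; the variable names are hypothetical.

```python
ops_in_pipeline = 100
grid_point_processors = 1_000_000
clock_hz = 100_000_000            # 100 megahertz

ops_per_second = ops_in_pipeline * grid_point_processors * clock_hz

# One shot: 1000^3 grid points x 10,000 time steps x 100 operations each.
ops_per_shot = 1000**3 * 10_000 * 100
seconds_per_shot = ops_per_shot / ops_per_second
print(ops_per_second, seconds_per_shot)
```

At that rate one 10^15-operation shot takes a tenth of a second, which matches the shot time quoted on the Economics slide.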

Application Specific Parallel Computing
Choose carefully an application which is BIG.
Find an algorithm which is suitable:
Good data locality
Regular structure in data movement
High memory data transfers
Map the algorithm into hardware

Application Specific Parallel Computing
What it is not!
Not suitable for just any algorithm
Not general purpose; we will have an efficient but specific memory subsystem.
Does not match the alphabet soup: SIMD, MIMD, NUMA, etc.

What do ASP machines need?
VLSI design team, fabless and good?
Clever architect for the problem
A very good memory design!

What do ASP machines do away with?
Language compilers
Outdated junk in the processor design, x86!
Cache memories!
Non-deterministic execution!

Multiple Bank Memory Systems
Memory banks: Bank 0, Bank 1, Bank 2, Bank 3
Unit stride: Starting Address, +1, +2, +3
Stride N: Starting Address, +N, +2N, +3N
Bank selection: Mod 4
As many as are needed!

Pipelined Instruction Trees
Each higher level offers parallel operations
Pipeline assumes all registers are loaded every cycle
Hardwired? Actually, today the instruction trees could be reconfigurable using reprogrammable cells!
r = a+b-x*y
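
The example expression r = a+b-x*y can be read as a two-level tree: the add and the multiply are independent and sit on one level, and the subtract combines them on the next. The sketch below models that as a two-stage pipeline in software. It illustrates the idea only, it is not the machine's actual design, and the function names are hypothetical.

```python
def evaluate_tree(a, b, x, y):
    """Evaluate r = a + b - x * y level by level."""
    # Level 1: two independent operations that hardware could run in one cycle.
    s = a + b
    p = x * y
    # Level 2: combine the two partial results.
    return s - p


def pipeline(operand_sets):
    """Cycle-by-cycle model: a new operand set enters every cycle.

    Stage 1 holds (a + b, x * y) for one set while stage 2 finishes the
    previous set, so N operand sets take N + 1 cycles.
    """
    stage1 = None
    results = []
    for operands in list(operand_sets) + [None]:  # one extra cycle to drain
        if stage1 is not None:
            s, p = stage1
            results.append(s - p)                 # stage 2 output this cycle
        if operands is None:
            stage1 = None
        else:
            a, b, x, y = operands
            stage1 = (a + b, x * y)               # stage 1 registers reloaded
    return results


print(evaluate_tree(1, 2, 3, 4))
print(pipeline([(1, 2, 3, 4), (5, 6, 7, 8)]))
```

Once the pipeline is full, one result leaves per cycle regardless of tree depth, which is why the slide assumes all registers are loaded every cycle.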

Pipelined Instruction Trees
[Diagram: instruction trees over inputs a, b, x, y, d with +, - and * nodes]
Multiple trees offer the second level of parallelism

Three Levels of Parallelism
Instruction trees, multiple levels
Multiple results
Multiple grid point processors

Wave Machine

Imaging Machine

Wave Equation
a) 8th or 10th order in space
b) 4th order in time, tricky but possible
c) Sponge boundary conditions, slowly varying weights along sides
d) Nominal flat topography; new schemes are building in topography
e) Any seismic source location, any geophone location

Elastic Wave Equation
a) Grid point work is about 100 operations
b) About 20,000 time steps per shot
c) 200 wavelengths gives about 160,000 geophone locations
d) Traces have 4096 samples at 2 milliseconds, could be 1 ms

Elastic Wave Equation
Shots are placed at twice the receiver spacing
Number of shots equals 40,000
Model frequency is velocity dependent; assume something on the order of 60 hertz.
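
The geophone and shot counts on the two Elastic Wave Equation slides are consistent with a square surface grid. The sketch below assumes two receivers per wavelength along each side; the slides do not state that spacing, it is simply the value that reproduces their 160,000. Variable names are hypothetical.

```python
wavelengths_per_side = 200
receivers_per_wavelength = 2      # assumption: not stated on the slides

receivers_per_side = wavelengths_per_side * receivers_per_wavelength
geophone_locations = receivers_per_side ** 2

# Shots at twice the receiver spacing: half as many per side.
shots_per_side = receivers_per_side // 2
shots = shots_per_side ** 2

samples_per_trace = 4096
sample_interval_ms = 2
trace_length_ms = samples_per_trace * sample_interval_ms
print(geophone_locations, shots, trace_length_ms)
```

Under that assumption a 400 x 400 receiver grid gives 160,000 geophone locations, a 200 x 200 shot grid gives 40,000 shots, and each trace records a little over 8 seconds.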

Economics
Up-front fixed cost: $5 to $10 million
Each ASP chip is $5 to $10
A petaflop for $5 or $10 million

Economics
Seismic shot takes 0.1 seconds
5-year life is 50,000 models
A realistic 3D elastic seismic model would cost $200

Comparison
10 clusters: ~$10 million, 10 models per year
One Waves in Linear Motion Analyzer (WILMA): ~$10 million, 10,000 models per year

Comparison
Waves in Linear Motion Analyzer: 1000X faster for the same money!
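
The Economics and Comparison figures can be cross-checked with simple arithmetic. The sketch takes the $10 million end of the quoted cost range and ignores running costs, which the slides do not give; variable names are hypothetical.

```python
machine_cost_usd = 10_000_000     # upper end of the quoted $5 to $10 million
models_over_life = 50_000         # five-year life
life_years = 5

cost_per_model_usd = machine_cost_usd // models_over_life
wilma_models_per_year = models_over_life // life_years

cluster_models_per_year = 10      # ten clusters, roughly the same money
speedup = wilma_models_per_year // cluster_models_per_year
print(cost_per_model_usd, wilma_models_per_year, speedup)
```

The capital cost works out to $200 per model, the 50,000-model life matches the 10,000 models per year on the Comparison slide, and the ratio to 10 models per year is the quoted 1000X.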

Summary
1000 megawatts is a good-sized power station
Good memory design is worth the money!
Removing the obstacles to efficient computing gives sustainable performance

Summary
Slower is better.
Less power is better.
High efficiency is better.

Conclusions
Deterministic computing is important for performance...
Application specific computing is a good fit for the wave equation...
And very cost effective...

Thanks
SEG Continuing Education
Memorial University of Newfoundland