Open Spatial Programming (OpenSPL) and Multiscale Dataflow Computing
Itay Greenspon, HiT Embedded Systems, Holon, Israel, 2014

Outline
– What is OpenSPL
– OpenSPL models
– Spatial arithmetic
– Code examples
– Implementations

OpenSPL Introduction Video

Temporal Computing (1D)
A program is a sequence of instructions. Performance is dominated by:
– Memory latency
– ALU availability
[Figure: CPU/memory timeline. For each instruction the CPU gets the instruction, reads data from memory, computes, and writes the result back; the cycle repeats serially, and the actual computation time is only a small slice of the total.]

Spatial Computing (2D)
[Figure: a 2D fabric of ALUs, buffers, and control units; data streams in, moves synchronously between units, and streams out. Timeline: read data [1..N], computation, and write results [1..N] overlap, so performance is throughput dominated.]

OpenSPL
Launched on Dec 9, 2013.
Founding corporations and founding academic partners: [logos]

OpenSPL in Practice
New CME Electronic Trading Gateway will be going live in March 2014! Webinar Page:
CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. [from Wikipedia]

OpenSPL - Why Now?
Semiconductor technology is ready
– Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5B (Xeon Phi)
Memory performance isn't keeping up
– Memory density has followed the trend set by Moore's law
– But memory latency has grown from tens to hundreds of CPU clock cycles
– As a result, on-die cache has grown from 15% of total die area (at 1 um) to 40% (at 32 nm)
– The memory latency gap could eliminate most of the benefits of CPU improvements
Exascale challenges (10^18 FLOPS)
– Clock frequencies have stagnated in the few-GHz range
– The energy usage and power wastage of modern HPC systems are becoming a huge economic burden that cannot be ignored any longer
– Requirements for annual performance improvements grow steadily
– Programmers continue to rely on sequential execution (the 1D approach)
For affordable exascale systems, a novel approach is needed.

OpenSPL Basics
Control and data flows are decoupled
– both are fully programmable
– they can run in parallel for maximum performance
Operations exist in space and by default run in parallel
– their number is limited only by the available space
All operations can be customized at various levels
– e.g., from the algorithm down to the number representation
Data sets (actions) stream through the operations
The data transport and the processing can be matched
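
As a minimal sketch of "operations exist in space and run in parallel", using only the constructs that appear in the code examples later in this deck: the adder and the multiplier below are two separate units on the substrate, and both produce a result on every tick; neither waits for the other.

SCSVar a = io.input("a", scsInt(32));
SCSVar b = io.input("b", scsInt(32));
// Each operator becomes its own arithmetic unit in space:
SCSVar sum2 = a + a;     // one adder
SCSVar prod = b * b;     // one multiplier, running in parallel with the adder
io.output("s", sum2, scsInt(32));
io.output("p", prod, scsInt(32));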

OpenSPL Models
Memory:
– Fast Memory (FMEM): many, small in size, low latency
– Large Memory (LMEM): few, large in size, high latency
– Scalars: many, tiny, lowest latency, fixed during execution
Execution:
– data sets + scalar settings are sent as atomic "actions"
– all data flows through the system synchronously in "ticks"
Programming:
– the API allows construction of a computation graph (see the scalar sketch below)
– meta-programming allows complex construction (see the sketch after the moving-average example)
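
A hedged sketch of the scalar model: the call io.scalarInput is an assumed name here, patterned after the io.input calls in the examples below, and is not confirmed by this deck. The point it illustrates is from the slide above: a scalar is set once per action and stays fixed while the whole data set streams through.

// Assumption: a scalar-input call analogous to io.input exists
SCSVar gain = io.scalarInput("gain", scsInt(32)); // fixed for the duration of one action
SCSVar x = io.input("x", scsInt(32));
SCSVar y = x * gain;   // every streamed element is scaled by the same scalar
io.output("y", y, scsInt(32));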

OpenSPL Machine
A spatial computing machine system consists of:
– appropriate hardware technology, a.k.a. the Spatial Computing Substrate (SCS): flexible arithmetic/computation units and interconnect
– an SCS-specific compilation tool-chain
– a CPU-based runtime for control of the SCS
Computation is divided into discrete kernels, interconnected by dataflow streams to form bigger entities.
In a spatial system one or more SCS engines exist, each executing a single action at any moment in time.
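
A hypothetical host-side sketch of the execution model described above; every name here is invented for illustration, since the real CPU runtime API is SCS specific and not shown in this deck. The CPU packages a data set plus scalar settings into one atomic action and hands it to an engine.

// Hypothetical host pseudocode -- not a real OpenSPL runtime API
Action action = new Action("movingAverage"); // illustrative names only
action.setScalar("gain", 3);                 // scalar settings, fixed per action
action.setStream("x", inputArray);           // the data set to stream through
action.setStream("y", outputArray);          // where results stream back
engine.run(action);                          // an engine executes one action at a time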

OpenSPL Example: x*x + 30
[Figure: dataflow graph: input x feeds a multiplier (x*x), then an adder (+30), producing output y.]

SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));

OpenSPL Example: Moving Average

Y[n] = (X[n-1] + X[n] + X[n+1]) / 3

SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7,17));
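
Because the kernel is ordinary Java-like code that constructs a graph, a plain loop can unroll into spatial hardware: this is the "meta-programming" from the OpenSPL Models slide. A sketch generalizing the 3-point average to an N-point window, reusing only scsFloat and stream.offset from the example above:

int N = 5; // window size, fixed at graph-construction time
SCSVar x = io.input("x", scsFloat(7,17));
SCSVar sum = x;
// This Java loop runs when the graph is built; each iteration adds
// two more offset taps and adders to the spatial pipeline.
for (int i = 1; i <= N/2; i++) {
    sum = sum + stream.offset(x, -i) + stream.offset(x, i);
}
io.output("y", sum / N, scsFloat(7,17));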

OpenSPL Example: Choices

[Figure: dataflow graph: x feeds a comparison (x > 10) that selects between x + 1 and x - 1 for output y.]

SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));
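
A note on the design choice this example implies: on a spatial substrate one would expect both branches (x + 1 and x - 1) to exist as hardware and compute on every tick, with the comparison merely selecting which result reaches the output, i.e., a multiplexer. A data-dependent choice then costs a little extra space rather than any extra time.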

Spatial Arithmetic
Operations are instantiated as separate arithmetic units
Units along data paths use custom arithmetic and number representations
The above may reduce individual unit sizes
– this can maximize the number of units that fit on a given SCS
Data rates of memory and I/O communication may also be maximized thanks to scaled-down data sizes
[Figure: bit layouts. Standard 32-bit float: sign bit, 8-bit exponent, 23-bit mantissa. Custom 14-bit format: sign bit, 3-bit exponent, 10-bit mantissa (a potentially optimal encoding).]
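
Assuming scsFloat takes (exponent bits, mantissa bits) as in the moving-average example's scsFloat(7,17), the 14-bit format from the figure might be declared as below. This is a sketch; the exact argument convention is an assumption.

// Custom narrow float: 3-bit exponent, 10-bit mantissa (plus sign)
SCSVar x = io.input("x", scsFloat(3,10)); // assumption: same (exp, mantissa) convention
SCSVar y = x * x;                         // multiplier sized for 14-bit floats, not 32-bit
io.output("y", y, scsFloat(3,10));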

Spatial Arithmetic at All Levels
Arithmetic optimizations at the bit level
– e.g., minimizing the number of '1's in binary numbers leads to linear savings in both space and power (the zeros are simply omitted from the implementation)
Higher-level arithmetic optimizations
– e.g., in matrix algebra, the locations of all non-zero elements in sparse matrix computations matter
Spatial encoding of data structures can reduce transfers between memory and computational units (boosting performance and improving efficiency)
– In temporal computing, encoding and decoding take time and can eventually cancel out all of the advantages
– In spatial computing, encoding and decoding just consume a bit of additional space
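
A bit-level illustration of the first point: multiplying by the constant 5 (binary 101, two '1' bits) can be realized with one shift and one add instead of a general multiplier. A sketch, assuming SCSVar supports a shift operator, which is not shown in this deck:

SCSVar x = io.input("x", scsUInt(24));
// 5 * x = (x << 2) + x : only the two '1' bits of the constant
// contribute hardware; the zero bit costs nothing in space.
SCSVar y = (x << 2) + x;  // assumption: '<<' exists on SCSVar
io.output("y", y, scsUInt(24));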

Benchmarking Spatial Computers
Spatial computing systems generate one result during every tick.
SC system efficiency is strongly determined by how efficiently data can be fed from external sources.
Fair comparison metrics are needed, among others:
– computations per cubic foot of datacenter space
– computations per Watt
– operational costs per computation
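
A worked example with illustrative numbers only: an engine producing one result per tick at a 200 MHz tick rate delivers 2 x 10^8 results/s; if it draws 100 W, that is 2 x 10^6 results per joule. These are the kinds of computations-per-Watt figures such metrics would compare across systems.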

SCS Implementation
The Multiscale Dataflow Engine (DFE) by Maxeler is the first SCS implementation, used by:
– Chevron
– ENI
– JP Morgan
– CME Group
Open research areas:
– mapping onto CPUs (e.g., using OpenMP/MPI)
– GPUs
– other accelerator technologies