Why systolic architectures?

Presentation transcript:

Why systolic architectures? Sven Gregorio, Seminar on Computer Architecture. Paper: "Why systolic architectures?", Hsiang-Tsung Kung, Carnegie Mellon University, IEEE Computer, 1982. The paper is 36 years old, and some information may be outdated.

Background, Problem & Goal

Special-purpose systems and their cost: Many high-performance special-purpose systems are produced. General-purpose systems aren't always able to meet performance constraints. The cost of a special-purpose system is composed of design cost and parts cost. Design cost tends to dominate parts cost, because special-purpose systems are usually produced in small quantities. Special-purpose systems are often designed ad hoc: the designs solve one task only and aren’t generalizable, and the same errors are often repeated, most notably I/O imbalance.

Why special-purpose systems? There is interest in speeding up compute-bound computations. Compute-bound: #operations > #inputs + #outputs (e.g. matrix multiplication). Computations that are not compute-bound are I/O-bound (e.g. matrix addition). Compute-bound computations tend to be too taxing for CPUs. Von Neumann bottleneck: for each operation, at least one operand has to be fetched, so compute-bound computations become I/O-bound. Memory bandwidth often isn't enough to keep the CPU pipeline filled. Memory accesses are costly in terms of energy.
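A rough sketch of the compute-bound test from this slide, applied to n x n matrices. The function names and the operation counts (n^3 multiplications plus n^2(n-1) additions for the textbook matrix product, n^2 additions for the matrix sum) are assumptions chosen for illustration; they are not taken from the paper.

```python
def matmul_counts(n):
    """Operation and I/O counts for the textbook n x n matrix product."""
    operations = n**3 + n**2 * (n - 1)  # multiplications + additions
    io_items = 2 * n**2 + n**2          # two input matrices + one output matrix
    return operations, io_items


def matadd_counts(n):
    """Operation and I/O counts for adding two n x n matrices."""
    operations = n**2                   # one addition per output element
    io_items = 2 * n**2 + n**2          # two input matrices + one output matrix
    return operations, io_items


def is_compute_bound(operations, io_items):
    """Slide definition: #operations > #inputs + #outputs."""
    return operations > io_items


if __name__ == "__main__":
    n = 100
    print("matrix multiplication:", is_compute_bound(*matmul_counts(n)))
    print("matrix addition:      ", is_compute_bound(*matadd_counts(n)))
```

Under these counts the gap for multiplication grows linearly with n, while addition stays below the I/O count for every n.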

Memory access energy cost (Dally, HiPEAC 2015): A memory access consumes ~1000x the energy of a complex addition. Adapted from Prof. Onur Mutlu’s slides (Computer Architecture FS2018).

The key architectural requirements: Simple and regular, to decrease the design cost. Modular, so that the design is adjustable to the performance goal. High concurrency, the main way to build faster computer systems. Simple communication, which tends to get more complex as concurrency increases. Balance of computation with I/O: the system shouldn’t spend its time waiting for I/O operations.

The goal: Collect the ideas of the author’s previous work (Kung had already published multiple papers on systolic architectures). Correct the ad hoc approach by providing a general guideline on how to map high-level computations to hardware. The designs should respect the given requirements. The guideline should be easy to use.

Novelty

The conventional approach: I/O bandwidth of 10 MB/s. Each operation uses 2 bytes. At most 5 million operations per second.

The systolic approach: Same conditions as before. Systolic: up to 6x improvement. Systolic: the memory “pumps” data to the processing elements, like the heart pumps blood to the body's cells.
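The arithmetic behind the two slides above can be sketched as follows. The function name and the `uses_per_operand` parameter are hypothetical; the sketch assumes 1 MB = 10^6 bytes and that the 6x figure corresponds to each fetched item being used by six cells before it leaves the array.

```python
def max_ops_per_second(io_bytes_per_second, bytes_per_operation, uses_per_operand=1):
    """Upper bound on throughput when every operation needs fresh data from memory.

    uses_per_operand models how many cells reuse a fetched item (assumption:
    1 for the conventional approach, 6 for the systolic example).
    """
    fetches_per_second = io_bytes_per_second / bytes_per_operation
    return fetches_per_second * uses_per_operand


conventional = max_ops_per_second(10_000_000, 2)
systolic = max_ops_per_second(10_000_000, 2, uses_per_operand=6)

if __name__ == "__main__":
    print(f"conventional: {conventional / 1e6:.0f} million operations/s")
    print(f"systolic:     {systolic / 1e6:.0f} million operations/s")
```

The memory bandwidth is the same in both cases; only the number of operations performed per fetched item changes.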

Both approaches visualized

Key Approach and Ideas

The structure of a systolic architecture: A systolic architecture is composed of multiple processing elements (cells). Only cells at the boundary can be I/O ports of the system. Partial results and inputs flow inside the system. Cells are interconnected to form simple and regular structures: trees, arrays, grids. Image source: Sano K., Nakahara H. (2018) Hardware Algorithms. In: Amano H. (eds) Principles and Structures of FPGAs. Springer, Singapore.

Mechanisms

Problems solvable by systolic architectures: A sample of problems with known systolic solutions. Signal and image processing: convolution, discrete Fourier transform, interpolation. Matrix arithmetic: matrix multiplication, QR decomposition of matrices, systems of linear equations. Non-numeric applications: regular expressions, dynamic programming, encoders (polynomial division).

An exemplar compute-bound problem: the convolution problem. Given: the sequence of weights {w_1, w_2, …, w_k} and the sequence of inputs {x_1, x_2, …, x_n}. Compute: the sequence {y_1, y_2, …, y_{n+1-k}} defined by y_i = w_1 x_i + w_2 x_{i+1} + … + w_k x_{i+k-1}. This problem is regular and compute-bound. There are many related problems, e.g. pattern matching.

Example convolution problem instance: Given the sequence of weights {2, 1, 4} and the sequence of inputs {5, 0, -7, 3, 1}, the output sequence {y_1, y_2, y_3} is computed as follows: y_1 = 2*5 + 1*0 + 4*(-7) = -18; y_2 = 2*0 + 1*(-7) + 4*3 = 5; y_3 = 2*(-7) + 1*3 + 4*1 = -7.
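The example instance can be checked with a direct, non-systolic implementation of the definition. This is a minimal reference sketch; the function name `convolve` is chosen here for illustration.

```python
def convolve(weights, inputs):
    """Direct evaluation of y_i = w_1 x_i + w_2 x_{i+1} + ... + w_k x_{i+k-1}.

    Returns the n + 1 - k outputs (an empty list if there are more weights
    than inputs). Indices are 0-based in the code, 1-based on the slide.
    """
    k, n = len(weights), len(inputs)
    return [
        sum(weights[j] * inputs[i + j] for j in range(k))
        for i in range(n - k + 1)
    ]


if __name__ == "__main__":
    print(convolve([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

Each output needs k multiplications and k - 1 additions, while each input is read up to k times, which is the reuse a systolic design exploits.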

The proposed designs: Three different systolic systems will be presented. 1. Broadcast: a semi-systolic solution where the input sequence is broadcast to the cells. 2. Low-latency: a pure systolic solution with low output latency. 3. High-throughput: a pure systolic solution where no cell is idle during usage.

1. Broadcast
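A cycle-by-cycle software sketch of one possible broadcast design. The slide's figure is not part of this transcript, so the details are assumptions: each cell holds one preloaded weight, one input is broadcast to all cells per cycle, and partial results move one cell to the right per cycle. The function and variable names are hypothetical.

```python
def broadcast_convolution(weights, inputs):
    """Simulate a semi-systolic broadcast array, one loop iteration per clock cycle.

    Assumed design: cell j keeps weights[j]; the current input is broadcast to
    every cell; partial results enter at the left as 0 and exit at the right.
    """
    k, n = len(weights), len(inputs)
    latched = [None] * k  # value each cell produced in the previous cycle
    results = []
    for t, x in enumerate(inputs):
        # A fresh partial result enters only if its k terms fit in the remaining inputs.
        entering = 0 if t + k <= n else None
        # Each cell receives what its left neighbour produced in the previous cycle.
        incoming = [entering] + latched[:-1]
        # All cells work in parallel: add weight * broadcast input to the passing result.
        latched = [
            None if y is None else y + w * x
            for y, w in zip(incoming, weights)
        ]
        if latched[-1] is not None:
            results.append(latched[-1])  # a finished y_i leaves the rightmost cell
    return results


if __name__ == "__main__":
    print(broadcast_convolution([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

In this sketch the first result leaves the array after k cycles and one further result follows in every later cycle. The broadcast wire reaches every cell, which is a likely reason such a design does not scale well.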

2. Low-latency

3. High-throughput

Comparison of the designs Nr Design Advantages Disadvantages 1 Broadcast Simplest design Cells use only 3 I/O ports Does NOT scale well 2 Low-latency Simplest pure systolic design Only half of the cells are used at any given time 3 High-throughput Works with unbounded amount of weights The partial results stay in the cells* Requires a bus to collect results More complex than 1 and 2 Response time depends on the number of weights Requires more I/O *Partial results often carry more bits because of numerical accuracy

Key Results: Methodology and Evaluation

Key properties of systolic architectures: Criteria of systolic designs and their effects. They have simple and regular control flow (effect: simplicity, modularity, expandability, and high performance). They use only a few types of simple cells (effect: simplicity). They use each input data item multiple times (effect: high performance). They are highly concurrent by design (effect: highly scalable; performance increases proportionally with the number of cells).

Summary

Summary: Special-purpose systems often have high design cost, are designed ad hoc, and repeat known errors. Systolic systems are simple and easy to design; avoid the pitfalls of special-purpose system designs; are modular, expandable, and high performance; and are applicable to many (if not all) problems where it makes sense to build special-purpose systems. Systolic systems geared to different applications can be obtained with little effort.

Strengths

Strengths of the paper: Intuitive idea. Well-structured paper with good flow. Many different examples of systolic systems are presented, and the tradeoffs between different designs are discussed. General approach to common problems: many compute-bound problems have a systolic solution. Scalable and adaptive designs, adaptable to different I/O bandwidths and problem sizes. The paper is still relevant today, 36 years later: more than 3000 citations, ~40 citations/year since 2000, and Google’s TPU is a systolic system at its heart.

The heart of Google’s first TPU Source: Google product news (17.5.2017)

Performance of Google’s first TPU Source: Google product news (17.5.2017)


Weaknesses

Weaknesses of the paper: No data to support the claim that systolic architectures are a viable alternative to ad hoc architectures in terms of design time, energy efficiency, and performance. The approach is still limited by the I/O bottleneck; in-memory accelerators don’t share the same bottleneck. It’s difficult to design systolic systems for compute-bound problems that aren’t inherently regular, e.g. sparse matrix multiplication. Difficult to debug: partial results aren’t exposed to the programmer.

Thoughts and Ideas

Thoughts and ideas Can the design of systolic architectures be automated? Still an open problem, but some can be designed automatically How can systolic architectures be specified and verified without building prototypes? Are there simulation frameworks for systolic architectures? Systolic architectures map well to FPGAs

Simplified FPGA schematic. Source: David Norwood’s master's thesis

Thoughts and ideas: Are there compute-bound problems with no systolic solution? Are there alternatives to systolic systems? GPUs, in-memory accelerators, … Can there be general-purpose systolic structures? Yes, iWarp [1990, CMU & Intel].

A view of the iWarp: “The initial demonstration iWarp system in 1990 is an 8x8 torus”. Prof. Thomas Gross (ETHZ) participated in the design of the iWarp.

Takeaways

Key takeaways: General guideline to simple, efficient, and scalable designs. Principled approach to the design of special-purpose systems: avoid designing ad hoc systems when possible, and avoid known pitfalls. Less successful ideas may have an impact in the future; see Google’s TPU.

Open Discussion