Systolic Arrays & Their Applications

Slides:



Advertisements
Similar presentations
Turing Machines January 2003 Part 2:. 2 TM Recap We have seen how an abstract TM can be built to implement any computable algorithm TM has components:
Advertisements

MEMORY popo.
Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.
Load Balancing Parallel Applications on Heterogeneous Platforms.
CSCI 4717/5717 Computer Architecture
Lecture 19: Parallel Algorithms
ARM Cortex A8 Pipeline EE126 Wei Wang. Cortex A8 is a processor core designed by ARM Holdings. Application: Apple A4, Samsung Exynos What’s the.
CSCI 311 – Algortihms and Data Structures Steve Guattery Dana 335a (department office) x7-3828
EECC756 - Shaaban #1 lec # 1 Spring Systolic Architectures Replace single processor with an array of regular processing elements Orchestrate.
Optimal PRAM algorithms: Efficiency of concurrent writing “Computer science is no more about computers than astronomy is about telescopes.” Edsger Dijkstra.
Timed Automata.
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
1 Parallel Algorithms II Topics: matrix and graph algorithms.
System Development. Numerical Techniques for Matrix Inversion.
Numerical Algorithms Matrix multiplication
Henry Hexmoor1 Chapter 7 Henry Hexmoor Registers and RTL.
ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Why Systolic Architecture ? VLSI Signal Processing 台灣大學電機系 吳安宇.
Advanced Topics in Algorithms and Data Structures An overview of the lecture 2 Models of parallel computation Characteristics of SIMD models Design issue.
Examples of Two- Dimensional Systolic Arrays. Obvious Matrix Multiply Rows of a distributed to each PE in row. Columns of b distributed to each PE in.
1 Lecture 25: Parallel Algorithms II Topics: matrix, graph, and sort algorithms Tuesday presentations:  Each group: 10 minutes  Describe the problem,
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
Lecture 21: Parallel Algorithms
CSC401 – Analysis of Algorithms Lecture Notes 12 Dynamic Programming
Systolic Arrays Presentation at UCF by Jason HandUber February 12, 2003.
Registers  Flip-flops are available in a variety of configurations. A simple one with two independent D flip-flops with clear and preset signals is illustrated.
(Page 554 – 564) Ping Perez CS 147 Summer 2001 Alternative Parallel Architectures  Dataflow  Systolic arrays  Neural networks.
1 Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms.
Fall 2001CS 4471 Chapter 2: Performance CS 447 Jason Bakos.
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
What’s on the Motherboard? The two main parts of the CPU are the control unit and the arithmetic logic unit. The control unit retrieves instructions from.
Alternative Parallel Processing Approaches Jonathan Sagabaen.
University of Veszprém Department of Image Processing and Neurocomputing Emulated Digital CNN-UM Implementation of a 3-dimensional Ocean Model on FPGAs.
1 Lecture 21: Core Design, Parallel Algorithms Today: ARM Cortex A-15, power, sort and matrix algorithms.
RM2D Let’s write our FIRST basic SPIN program!. The Labs that follow in this Module are designed to teach the following; Turn an LED on – assigning I/O.
Levels of Architecture & Language CHAPTER 1 © copyright Bobby Hoggard / material may not be redistributed without permission.
Computers Are Your Future Eleventh Edition Chapter 2: Inside the System Unit Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall1.
Chapter One Introduction to Pipelined Processors.
1 High-Performance Grid Computing and Research Networking Presented by Xing Hang Instructor: S. Masoud Sadjadi
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Scheduling Many-Body Short Range MD Simulations on a Cluster of Workstations and Custom VLSI Hardware Sumanth J.V, David R. Swanson and Hong Jiang University.
CS 345: Chapter 10 Parallelism, Concurrency, and Alternative Models Or, Getting Lots of Stuff Done at Once.
CSC401: Analysis of Algorithms CSC401 – Analysis of Algorithms Chapter Dynamic Programming Objectives: Present the Dynamic Programming paradigm.
Basic Sequential Components CT101 – Computing Systems Organization.
Computer Organization - 1. INPUT PROCESS OUTPUT List different input devices Compare the use of voice recognition as opposed to the entry of data via.
Parallel architecture Technique. Pipelining Processor Pipelining is a technique of decomposing a sequential process into sub-processes, with each sub-process.
EE3A1 Computer Hardware and Digital Design
Chapter 4 MARIE: An Introduction to a Simple Computer.
Lab 2 Parallel processing using NIOS II processors
COMP203/NWEN Memory Technologies 0 Plan for Memory Technologies Topic Static RAM (SRAM) Dynamic RAM (DRAM) Memory Hierarchy DRAM Accelerating Techniques.
Stored Program A stored-program digital computer is one that keeps its programmed instructions, as well as its data, in read-write,
Basic Computer Organization Rashedul Hasan.. Five basic operation No matter what shape, size, cost and speed of computer we are talking about, all computer.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #12 – Systolic.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
Specialized Virtual Configurable Arrays Dominique Lavenier - Frederic Raimbault IRISA Rennes, France UBS Vannes, France
Fast VLSI Implementation of Sorting Algorithm for Standard Median Filters Hyeong-Seok Yu SungKyunKwan Univ. Dept. of ECE, Vada Lab.
A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI Computation Lab ECE Department, UC Davis.
VLSI SP Course 2001 台大電機吳安宇 1 Why Systolic Architecture ? H. T. Kung Carnegie-Mellon University.
Buffering Techniques Greg Stitt ECE Department University of Florida.
Sequential Logic Design
Pipelining and Retiming 1
Assembly Language for Intel-Based Computers, 5th Edition
Lecture 16: Parallel Algorithms I
CSE 370 – Winter Sequential Logic - 1
Assoc. Prof. Dr. Peerapol Yuvapoositanon
Digital Circuits and Logic
Basic Computer Organization
Registers Today we’ll see some common sequential devices: counters and registers. They’re good examples of sequential analysis and design. They are also.
Chapter 2: Performance CS 447 Jason Bakos Fall 2001 CS 447.
Presentation transcript:

Systolic Arrays & Their Applications By: Jonathan Break

Overview What it is N-body problem Matrix multiplication Cannon’s method Other applications Conclusion

What Is a Systolic Array? A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions Each processor at each step takes in data from one or more neighbors (e.g. North and West), processes it and, in the next step, outputs results in the opposite direction (South and East). H. T. Kung and Charles Leiserson were the first to publish a paper on systolic arrays in 1978, and coined the name.

What Is a Systolic Array? A specialized form of parallel computing. Multiple processors connected by short wires. Unlike many forms of parallelism which lose speed through their connection. Cells(processors), compute data and store it independently of each other.

Systolic Unit(cell) Each unit is an independent processor. Every processor has some registers and an ALU. The cells share information with their neighbors, after performing the needed operations on the data.

Some simple examples of systolic array models.

N-body Problem Conventional method: N2 Systolic method: N We will need one processing element for each body, lets call them planets.

B2 B3 B1 B4 B6 B5 Six planets in orbit with different masses, and different forces acting on each other, so are systolic array will look like this.

Each processor will hold the newtonian force formula: Fij = kmimj/d2ij , k being the gravitational constant. Then load the array with values. B1 B2 B3 B4 B5 B6 Now that the array has the mass and the coordinates of Each body we can begin are computation.

Fij = kmimj/d2ij B6, B5, B4, B3, B2, B1 B1m B1c B2m B2c B3m B3c B4m

Fij = kmimj/d2ij B6, B5, B4, B3, B2 B1m B1c B2m B2c B3m B3c B4m B4c

Fij = kmimj/d2ij B6, B5, B4, B3 B1m B1c B2m B2c B3m B3c B4m B4c B5*B1

B6, B5, B4 B1m B1c B2m B2c B3m B3c B4*B1 B5*B2 B6*B3 Fij = kmimj/d2ij

B6, B5 B1m B1c B2m B2c B3*B1 B4*B2 B5*B3 B6*B4 Fij = kmimj/d2ij

B6 B1m B1c B2*B1 B3*B2 B4*B3 B5*B4 B6*B5 Fij = kmimj/d2ij

Done Done Done Done Done Done Fij = kmimj/d2ij

Matrix Multiplication a11 a12 a13 a21 a22 a23 a31 a32 a33 b11 b12 b13 b21 b22 b23 b31 b32 b33 c11 c12 c13 c21 c22 c23 c31 c32 c33 * = Conventional Method: N3 For I = 1 to N For J = 1 to N For K = 1 to N C[I,J] = C[I,J] + A[J,K] * B[K,J];

Systolic Method This will run in O(n) time! To run in N time we need N x N processing units, in this case we need 9. P1 P2 P3 P4 P5 P6 P7 P8 P9

We need to modify the input data, like so: a13 a12 a11 a23 a22 a21 a33 a32 a31 Flip columns 1 & 3 b31 b32 b33 b21 b22 b23 b11 b12 b13 Flip rows 1 & 3 and finally stagger the data sets for input.

b33 b23 b13 b32 b22 b12 b31 b21 b11 a13 a12 a11 P1 P2 P3 a23 a22 a21 P4 P5 P6 a33 a32 a31 P7 P8 P9 At every tick of the global system clock data is passed to each processor from two different directions, then it is multiplied and the result is saved in a register.

3 4 2 2 5 3 3 2 5 3 4 2 2 5 3 3 2 5 23 36 28 25 39 34 28 32 37 * = Lets try this using a systolic array. 5 3 2 2 5 4 3 2 2 4 3 P1 P2 P3 3 5 2 P4 P5 P6 5 2 3 P7 P8 P9

Clock tick: 1 5 3 2 2 5 4 3 2 2 4 3*3 3 5 2 5 2 3 P1 P2 P3 P4 P5 P6 P7 P8 P9 9

Clock tick: 2 5 3 2 2 5 3 2 4*2 3*4 3 5 2*3 5 2 3 P1 P2 P3 P4 P5 P6 P7 P8 P9 17 12 6

Clock tick: 3 5 3 2 2*3 4*5 3*2 3 5*2 2*4 5 2 3*3 P1 P2 P3 P4 P5 P6 P7 P8 P9 23 32 6 16 8 9

Clock tick: 4 5 2*2 4*3 3*3 5*5 2*2 5 2*2 3*4 P1 P2 P3 P4 P5 P6 P7 P8 P9 23 36 18 25 33 4 13 12

Clock tick: 5 2*5 3*2 5*3 5*3 5*2 3*2 P1 P2 P3 P4 P5 P6 P7 P8 P9 23 36 28 25 39 19 22 6

23 36 28 25 39 34 32 12 Clock tick: 6 3*5 5*2 2*3 P1 P2 P3 P4 P5 P6 P7

Clock tick: 7 5*5 P1 P2 P3 P4 P5 P6 P7 P8 P9 23 36 28 25 39 34 32 37

23 36 28 25 39 34 32 37 Same answer! In 2n + 1 time, can we do better? The answer is yes, there is an optimization. 23 36 28 25 39 34 28 32 37 P1 P2 P3 P4 P5 P6 P7 P8 P9 23 36 28 25 39 34 32 37

Cannon’s Technique No processor is idle. Instead of feeding the data, it is cycled. Data is staggered, but slightly different than before. Not including the loading which can be done in parallel for a time of 1, this technique is effectively N.

Other Applications Signal processing Image processing Solutions of differential equations Graph algorithms Biological sequence comparison Other computationally intensive tasks

Samba: Systolic Accelerator for Molecular Biological Applications This systolic array contains 128 processors shared into 32 full custom VLSI chips. One chip houses 4 processors, and one processor performs 10 millions matrix cells per second.

Why Systolic? Extremely fast. Easily scalable architecture. Can do many tasks single processor machines cannot attain. Turns some exponential problems into linear or polynomial time.

Why Not Systolic? Expensive. Not needed on most applications, they are a highly specialized processor type. Difficult to implement and build.

Summary What a systolic array is. Step through of matrix multiplication. Cannon’s optimization. Other applications including samba. Positives and negatives of systolic arrays.

References “Systolic Algorithms & Architectures” by Patrice Quinton and Yves Robert, 1991 Prentice Hall International “The New Turing Omnibus” by A.K. Dewdney, New York http://www.irisa.fr/symbiose/people/lavenier/Samba/ http://www.cs.hmc.edu/courses/mostRecent/cs156/html08/slides08.pdf http://www.ntu.edu.sg/home/ecrwan/Phdthesi.htm