Parallelism in Computer Arithmetic: A Historical Perspective


Parallelism in Computer Arithmetic: A Historical Perspective
Behrooz Parhami, University of California, Santa Barbara
August 2018
[Title slide graphic: a timeline spanning the 1950s through the 2010s.]

About This Presentation
This slide show was first developed for an invited talk at a special session on computer arithmetic in honor of Drs. Graham Jullien and William Miller, held on Monday, August 6, at the 61st Midwest Symposium on Circuits and Systems, Windsor, Ontario, Canada, August 5-8, 2018. All rights reserved for the author. ©2018 Behrooz Parhami

Edition   Released    Revised
First     Aug. 2018   -

File: http://www.ece.ucsb.edu/~parhami/pres_folder/parh18-mwscas-parallelism-in-comp-arith.ppt

Parallelism in Computer Arithmetic: A Historical Perspective
Many early parallel processing breakthroughs emerged from the quest for faster and higher-throughput arithmetic operations. Additionally, the influence of arithmetic techniques on parallel computer performance can be seen in diverse areas such as the bit-serial arithmetic units of early massively parallel SIMD computers, pipelining and pipeline chaining in vector machines, the design of floating-point standards to ensure the accuracy and portability of numerically intensive programs, and the prominence of GPUs in today's top-of-the-line supercomputers. This paper presents a few representative samples of the many interactions and cross-fertilizations between the computer-arithmetic and parallel-computation communities through historical perspectives, case studies of the state of the art and practice, and directions for further collaboration.
[Abstract of the talk, included on this slide for completeness.]

My Personal Journey and Career
[Timeline figure: graduation in 1968, i.e., 50 years ago; milestones in 1969, 1970, 1974, 1986, and 1988; 30 years at UCSB; ages of my children, 23-33.]

I. Introduction: What Is Parallelism?
The two extreme views:
- Any circuit that manipulates multiple bits at once is parallel
- Parallelism requires concurrency at the level of large functional blocks
My view: parallel processing is possible at three levels: circuits, function units, and compute nodes.
I will provide an example at each of the three levels:
- Circuit level: Parallel-prefix adders
- Function level: Recursive/divide-and-conquer multiplication
- System level: Discrete Fourier transform, DFT/FFT
The three levels of parallelism are not mutually exclusive and can readily be combined.

II. Circuit-Level Parallelism
Adders and multipliers are our two main workhorses:
- In this section, I cover parallel-prefix adders
- Recursive multiplication is covered in Section III, although it has circuit-level embodiments as well
Parallel-prefix computation:
- Given the inputs x_0, x_1, x_2, x_3, ..., x_(k-1)
- and an associative binary operator ⊕,
- compute all prefixes of the expression x_0 ⊕ x_1 ⊕ x_2 ⊕ x_3 ⊕ ... ⊕ x_(k-1)
Example: indexing via prefix sums. Computing the prefix sums of a bit vector such as 0 1 0 0 1 1 0 1 assigns consecutive indices 1, 2, 3, ... to the positions that hold 1s; see the sketch below.
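As an illustration (my sketch, not from the slides; the function name is mine), here is the generic prefix computation for any associative operator, applied to the indexing example:

```python
from itertools import accumulate
from operator import add

def prefixes(xs, op):
    """All prefixes x0, x0 op x1, x0 op x1 op x2, ... under an associative op."""
    return list(accumulate(xs, op))

bits = [0, 1, 0, 0, 1, 1, 0, 1]
sums = prefixes(bits, add)                         # [0, 1, 1, 1, 2, 3, 3, 4]
# Each 1 in the bit vector receives a distinct index 1, 2, 3, ...
ranks = [s for b, s in zip(bits, sums) if b == 1]  # [1, 2, 3, 4]
print(sums, ranks)
```

A parallel machine can evaluate the same prefixes in O(log k) steps, which is exactly what the carry networks on the following slides exploit.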

Share-Nothing vs. Share-Everything Carry Networks
[Figure: two carry networks over the inputs x0, x1, x2, x3, ..., with carry-in c_in.]
A: Full lookahead. Each carry, and thus each sum bit, is computed independently and in parallel.
B: Ripple-carry. Each carry circuit shares the entire circuit of the previous carry.
A challenge: find circuit-sharing schemes that come close to A in speed and to B in cost. A small software model of scheme B follows.
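This is a minimal Python model of scheme B (the function name and bit-level encoding are mine, for illustration only): every carry is derived from its predecessor, so cost is O(k) but so is the delay.

```python
def ripple_carries(x, y, k, cin=0):
    """Scheme B: derive each carry from the previous one (O(k) cost, O(k) delay)."""
    carries = [cin]
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g, p = xi & yi, xi ^ yi            # generate and propagate for bit i
        carries.append(g | (p & carries[-1]))
    return carries                         # [c_0, c_1, ..., c_k]
```

Scheme A, in contrast, expands each c_i into a flat two-level expression over all lower-order generate/propagate signals, replicating rather than sharing logic.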

The Carry Operator and Block Propagate/Generate
Parallel-prefix carries: denote (g_i, p_i) by x_i and combine these pairs with the carry operator ¢, where (g, p) ¢ (g', p') = (g' ∨ p'g, p'p). The prefixes of x_0 ¢ x_1 ¢ x_2 ¢ x_3 ¢ ... ¢ x_(k-1) then yield the block signals; for example, the prefix x_0 ¢ x_1 ¢ x_2 gives (g_[0,2], p_[0,2]), from which c_3 follows directly. A sketch of this operator in code appears below.
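The carry operator is associative, so any prefix network can evaluate it. A sequential rendering (names are mine; c_in is folded in as a generate-only term at the low end):

```python
def carry_op(lower, upper):
    """The carry operator ¢ on (g, p) pairs: (g, p) ¢ (g', p') = (g' | p'g, p'p)."""
    g1, p1 = lower
    g2, p2 = upper
    return (g2 | (p2 & g1), p2 & p1)

def prefix_carries(gp, cin=0):
    """Carries c_1..c_k from the (g_i, p_i) pairs via a prefix scan of ¢."""
    acc = (cin, 0)                         # carry-in acts as a pure 'generate'
    carries = []
    for pair in gp:
        acc = carry_op(acc, pair)
        carries.append(acc[0])             # c_(i+1) = block generate g_[0,i]
    return carries
```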

The Brent-Kung Carry Network
[Figure: the Brent-Kung parallel-prefix carry network.]
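The Brent-Kung structure combines adjacent pairs, recursively solves the half-size prefix problem, and fans the results back out, giving roughly 2 log2(k) levels but only about 2k cells. A recursive sketch (mine, for illustration; it works for any associative op, including carry_op above):

```python
def brent_kung_prefix(xs, op):
    """Brent-Kung parallel prefix: ~2 log2(k) levels, ~2k operator cells."""
    k = len(xs)
    if k == 1:
        return xs[:]
    # Level in: combine adjacent pairs
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, k - 1, 2)]
    # Recurse on the half-size prefix problem
    sub = brent_kung_prefix(pairs, op)
    # Level out: odd positions are done; even positions need one more op
    out = [xs[0]]
    for i in range(1, k):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], xs[i]))
    return out
```

For carries, brent_kung_prefix(gp_pairs, carry_op) returns the block (g, p) prefixes, whose g components are the carries.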

Design Alternatives and Tradeoffs
- Brent-Kung: 6 levels, 26 cells
- Kogge-Stone: 4 levels, 49 cells
- Hybrid: 5 levels, 32 cells
A virtually unlimited number of hybrid designs is possible. A sketch of the Kogge-Stone alternative follows.
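For contrast with Brent-Kung, a Kogge-Stone-style scan (again my sketch) reaches every prefix in log2(k) levels at the price of about k log2(k) cells:

```python
def kogge_stone_prefix(xs, op):
    """Kogge-Stone parallel prefix: log2(k) levels, ~k log2(k) operator cells."""
    k = len(xs)
    ys = xs[:]
    d = 1
    while d < k:
        # One level: every position i >= d combines with the value d to its left
        ys = [ys[i] if i < d else op(ys[i - d], ys[i]) for i in range(k)]
        d *= 2
    return ys
```

The cell counts quoted on this slide (26 vs. 49 vs. 32) are this level/cost tradeoff playing out for a fixed word width.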

A Taxonomy of Parallel-Prefix Adders
Logic levels = log_2(k) + l; Fanout = 2^f + 1; Wire tracks = 2^t
From: Harris, David, 2003: http://www.stanford.edu/class/ee371/handouts/harris03.pdf

III. Function-Level Parallelism
Multiplication is now just as essential as addition:
- In this section, I cover divide-and-conquer multiplication
- Many other multiplication schemes exist, several of them with parallel-processing connections
Recursive multiplication:
xy = (2^(k/2) x_H + x_L)(2^(k/2) y_H + y_L)
   = 2^k x_H y_H + 2^(k/2) (x_H y_L + x_L y_H) + x_L y_L
   = 2^k p_4 + 2^(k/2) (p_3 + p_2) + p_1
Complexity analysis: T(k) = 4T(k/2) + Θ(log k) = Θ(k^2); A(k) = A(k/2) + Θ(k) = Θ(k)
[Figure: operands x and y split into high and low halves, with the four partial products p_1..p_4 aligned to form the product p.]

Analysis of Recursive Multiplication
xy = (2^(k/2) x_H + x_L)(2^(k/2) y_H + y_L) = 2^k x_H y_H + 2^(k/2) (x_H y_L + x_L y_H) + x_L y_L = 2^k p_4 + 2^(k/2) (p_3 + p_2) + p_1
Complexity analysis (serial): T(k) = 4T(k/2) + Θ(log k) = Θ(k^2); A(k) = A(k/2) + Θ(k) = Θ(k)
Complexity analysis (parallel): T(k) = T(k/2) + Θ(log k) = Θ(log k); A(k) = 4A(k/2) + Θ(k) = Θ(k^2)
Theoretical lower bounds: AT = Ω(k^(3/2)); AT^2 = Ω(k^2)
[Speaker note: If the four partial products are produced concurrently, the coefficient 4 moves from the equation for T(k) to that for A(k), leading to T(k) = O(log k) and A(k) = O(k^2). Both schemes are better than the brute-force, paper-and-pencil algorithm, and both are suboptimal with respect to the theoretical lower bounds AT = Ω(k^(3/2)) and AT^2 = Ω(k^2).]
A direct code rendering appears below.
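A direct Python rendering of the recurrence (my sketch; k is assumed to be a power of two and both operands fit in k bits):

```python
def dac_multiply(x, y, k):
    """Divide-and-conquer multiplication with four half-width products."""
    if k <= 1:
        return x * y                          # 1-bit base case (an AND gate)
    h = k // 2
    x_hi, x_lo = x >> h, x & ((1 << h) - 1)
    y_hi, y_lo = y >> h, y & ((1 << h) - 1)
    p4 = dac_multiply(x_hi, y_hi, h)
    p3 = dac_multiply(x_hi, y_lo, h)
    p2 = dac_multiply(x_lo, y_hi, h)
    p1 = dac_multiply(x_lo, y_lo, h)
    return (p4 << k) + ((p3 + p2) << h) + p1  # 2^k p4 + 2^(k/2)(p3 + p2) + p1

assert dac_multiply(11, 6, 4) == 66
```

Performing the four recursive calls one after another gives the serial analysis; spawning them concurrently gives the parallel one.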

The Trick Proposed by Karatsuba and Ofman
Recursive multiplication: xy = 2^k p_4 + 2^(k/2) (p_3 + p_2) + p_1
Compute the auxiliary term p_5 = (x_H - x_L)(y_H - y_L) = p_4 + p_1 - p_3 - p_2, so that p_3 + p_2 = p_4 + p_1 - p_5
Complexity analysis: A(k) = 3A(k/2) + Θ(k) = Θ(k^1.585), where 1.585 = log_2(3)
The benefit is significant for extremely wide operands: (4/3)^5 ≈ 4.2; (4/3)^10 ≈ 17.8; (4/3)^20 ≈ 315.3; (4/3)^50 ≈ 1,765,781
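A hedged Python sketch of the trick (power-of-two k assumed; the explicit sign handling below stands in for what hardware would do with the possibly negative differences):

```python
def karatsuba(x, y, k):
    """Karatsuba-Ofman: three recursive half-width products instead of four."""
    if k <= 1:
        return x * y
    h = k // 2
    x_hi, x_lo = x >> h, x & ((1 << h) - 1)
    y_hi, y_lo = y >> h, y & ((1 << h) - 1)
    p4 = karatsuba(x_hi, y_hi, h)
    p1 = karatsuba(x_lo, y_lo, h)
    d1, d2 = x_hi - x_lo, y_hi - y_lo         # each fits in h bits plus a sign
    p5 = karatsuba(abs(d1), abs(d2), h)
    if (d1 < 0) != (d2 < 0):                  # restore the sign of d1 * d2
        p5 = -p5
    return (p4 << k) + ((p4 + p1 - p5) << h) + p1   # p3 + p2 = p4 + p1 - p5

assert karatsuba(11, 6, 4) == 66
```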

Improvements to the Karatsuba-Ofman Algorithm
- Original / naive: Θ(k^2)
- Karatsuba-Ofman: Θ(k^1.585)
- Toom / Cook: Θ(k^1.465), Θ(k^1.404), ..., Θ(k^(1+ε))
- Schönhage-Strassen: Θ(k log k log log k)
- Fürer: still faster
Is Θ(k log k) feasible?

Similar Trick Used for Matrix Multiplication
Strassen's trick: eight half-size matrix multiplications are reduced to seven (the seven products are spelled out in the sketch below).
- Original / naive: Θ(n^3)
- Strassen: Θ(n^2.807), where 2.807 = log_2(7)
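A compact sketch of Strassen's recursion (mine, using NumPy; sizes are assumed to be powers of two, and real implementations pad or peel odd dimensions and tune the base-case cutoff):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen matrix multiply: seven half-size products instead of eight."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                          # naive product below the cutoff
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra matrix additions are what erode the benefit in practice, as the next slide shows.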

Strassen Matrix Multiplication in Practice
[Figure: running time in seconds versus matrix size n for the naive Θ(n^3) method and for Strassen's algorithm.]
In practical C++ implementations (your results may vary), the advantage of Strassen's algorithm shows up only for n ≈ 3000 and beyond.
Strassen's method does not show as much improvement as Karatsuba-Ofman's because:
- Its branching reduction factor is 7/8 instead of 3/4
- Matrix addition is relatively more costly than integer addition
Source: https://mikecvet.wordpress.com/2010/04/17/strassens-algorithm-theory-vs-application-part-2/

IV. System-Level Parallelism
Multiple independent or interacting arithmetic streams:
- Early examples included the use of one or more co-processors
- Modern embodiments entail the use of GPUs and the like
- Streamlined arithmetic blocks: no extra features
Discrete Fourier transform (DFT / FFT): given inputs x_0, x_1, ..., x_(n-1), compute the outputs
y_i = Σ_(j=0..n-1) ω_n^(ij) x_j, where ω_n is a primitive nth root of unity
Naive method: Θ(n^2); FFT: Θ(n log n). A code sketch of both follows.
[Figure: block diagram of the DFT and its inverse, mapping x_0, x_1, x_2, ..., x_(n-1) to y_0, y_1, y_2, ..., y_(n-1).]
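The two complexities side by side, in a hedged Python sketch (ω_n = e^(-2πi/n) is one conventional choice of primitive root; n must be a power of two for the radix-2 FFT):

```python
import cmath

def naive_dft(x):
    """Direct evaluation of y_i = sum_j w^(ij) x_j: Theta(n^2) operations."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)          # a primitive nth root of unity
    return [sum(w ** (i * j) * x[j] for j in range(n)) for i in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT: Theta(n log n) operations."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for i in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * i / n) * odd[i]   # twiddle factor
        out[i] = even[i] + t                   # butterfly: sum and ...
        out[i + n // 2] = even[i] - t          # ... difference of two values
    return out
```

Each (sum, difference) pair computed in the loop is one "butterfly", the basic operation of the hardware schemes on the next slides.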

FFT Can Be Performed in Many Different Ways
Quote from The Principles of Computer Hardware: "At least one good reason for studying multiplication and division is that there is an infinite number of ways of performing these operations and hence there is an infinite number of PhDs (or expenses-paid visits to conferences in the USA) to be won from inventing new forms of multiplier." ~ Alan Clements, 1985
The statement above is even more true of DFT / FFT!
- A Google search for FFT yields 28M+ hits
- The 1965 paper by Cooley and Tukey has 14K+ citations
- Many books on the FFT have hundreds to thousands of citations
- New ways of performing the FFT are still being discovered

Computation Scheme for 16-Point FFT
[Figure: the 16-point FFT flow graph.]
With n log n butterfly processors, each performs a single operation; pipelining improves hardware utilization.

Butterfly and Shuffle-Exchange Networks
[Figures: the butterfly network redrawn as a shuffle-exchange network.]
Rearrangement of the nodes makes the inter-column connections identical; the shuffle and shuffle-exchange link pairs are then replaced by separate shuffle and exchange links.

Projections to Reduce Hardware Complexity
The full FFT network: Θ(n log n) cost, Θ(log n) time.
Horizontal projection: Θ(n) cost, Θ(log n) time. Reduces hardware complexity by a factor of log n without increasing the asymptotic time complexity.
Vertical projection: Θ(log n) cost, Θ(n) time. Reduces hardware complexity by a factor of n while increasing the asymptotic time complexity by a factor of n / log n.
[Figure: the two projections of the FFT network.]

Timing of Computations in the Low-Cost FFT Circuit
[Figure: a butterfly processor with its feedback connections, and the schedule of computations.]
The schedule performs an n-point FFT in O(n) time with O(log n) processors.

V. Conclusion: Where Are We, Where Next?
I reviewed only three examples, but there are more:
- Parallel-prefix adders
- Recursive/divide-and-conquer multipliers
- Discrete Fourier transform, DFT / FFT
- Key role of GPUs in building exascale computers
- High-precision, error-free, and wide-range arithmetic
Paths to further connections / interactions:
- Study cross-citation patterns between the two fields
- Redundancy for data preservation and fault tolerance
- New/emerging technologies: QCA, SET, nanomagnets, ...
- Program portability via standardization
- Speculative execution of multiple program paths

Questions or Comments?
parhami@ece.ucsb.edu
http://www.ece.ucsb.edu/~parhami/