
1 Parallelism in Computer Arithmetic: A Historical Perspective
Behrooz Parhami, University of California, Santa Barbara, Aug. 2018
[Timeline graphic: 1950s through 2010s]

2 About This Presentation
This slide show was first developed for an invited talk at a special session on computer arithmetic in honor of Drs. Graham Jullien and William Miller, held on Monday, August 6, at the 61st Midwest Symposium on Circuits and Systems, Windsor, Ontario, Canada, August 5-8, 2018. All rights reserved for the author. ©2018 Behrooz Parhami
Edition: First (released Aug. 2018)

3 Parallelism in Computer Arithmetic: A Historical Perspective
Many early parallel processing breakthroughs emerged from the quest for faster and higher-throughput arithmetic operations. Additionally, the influence of arithmetic techniques on parallel computer performance can be seen in areas as diverse as the bit-serial arithmetic units of early massively parallel SIMD computers, pipelining and pipeline chaining in vector machines, the design of floating-point standards to ensure the accuracy and portability of numerically intensive programs, and the prominence of GPUs in today's top-of-the-line supercomputers. This paper contains a few representative samples of the many interactions and cross-fertilizations between the computer-arithmetic and parallel-computation communities, presenting historical perspectives, case studies of the state of the art and practice, and directions for further collaboration.
(Abstract of the talk, included on this slide for completeness.)

4 My Personal Journey and Career
[Timeline graphic: 50 years since 1968 graduation, with milestones at 1969, 1970, 1974, 1986, and 1988; 30 years at UCSB; a "we are here" marker at 2018; children aged 23-33]

5 I. Introduction: What Is Parallelism?
The two extreme views:
- Any circuit that manipulates multiple bits at once is parallel
- Must have concurrency at the level of large functional blocks
My view: parallel processing is possible at the three levels of circuits, function units, and compute nodes
I will provide an example at each of the three levels:
- Circuit level: parallel-prefix adders
- Function level: recursive/divide-and-conquer multiplication
- System level: discrete Fourier transform, DFT/FFT
The three levels of parallelism are not mutually exclusive and can be readily combined

6 II. Circuit-Level Parallelism
Adders and multipliers are our two main workhorses
- In this section, I cover parallel-prefix adders
- Recursive multiplication is covered in Section III, although it has circuit-level embodiments as well
Parallel-prefix computation:
- Given the inputs x_0, x_1, x_2, x_3, …, x_{k−1}
- And an associative binary operator ⊗
- Compute all the prefixes of the expression x_0 ⊗ x_1 ⊗ x_2 ⊗ x_3 ⊗ … ⊗ x_{k−1}
Example: indexing via prefix sums (see the sketch below)
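A minimal C++ sketch of the prefix computation just defined (added for concreteness; the operator and the indexing-by-prefix-sums example come from the slide, while all names and types are my own). It uses the recursive-doubling pattern that parallel-prefix networks realize in hardware:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Computes all prefixes x0 (op) x1 (op) ... (op) xi of an associative
// operator, using the log-depth recursive-doubling pattern that a
// parallel-prefix network implements with one column of cells per stage.
template <typename T, typename Op>
std::vector<T> parallel_prefix(std::vector<T> x, Op op) {
    const size_t k = x.size();
    for (size_t d = 1; d < k; d *= 2) {           // log2(k) stages
        std::vector<T> next = x;
        for (size_t i = d; i < k; ++i)            // the ops within a stage are
            next[i] = op(x[i - d], x[i]);         // independent: parallel in HW
        x = std::move(next);
    }
    return x;                                     // x[i] is now the i-th prefix
}

int main() {
    // Indexing via prefix sums: flags mark selected items; the running
    // sums give each selected item its rank / output index.
    std::vector<int> flags = {1, 0, 1, 1, 0, 1};
    auto ranks = parallel_prefix(flags, [](int a, int b) { return a + b; });
    for (int r : ranks) std::printf("%d ", r);    // prints: 1 1 2 3 3 4
    std::printf("\n");
}
```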

7 Share-Nothing vs. Share-Everything Carry Networks
A: Full lookahead. Each carry, and thus each sum bit, is computed independently and in parallel
B: Ripple-carry. Each carry circuit shares the entire circuit of the previous carry
Challenge: find circuit-sharing schemes that come close to A in speed and to B in cost
[Figure: carry networks A and B over inputs x_0, x_1, x_2, x_3, … and c_in]

8 The Carry Operator and Block-Propagate/Generate
Parallel-prefix carries: denote (g_i, p_i) by x_i and compute all prefixes of x_0 ¢ x_1 ¢ x_2 ¢ … ¢ x_{k−1} under the carry operator ¢
The prefix (g_[0,2], p_[0,2]), for example, directly yields the carry c_3
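To make the carry-operator prefix concrete, here is a small C++ sketch (my own illustration, not the talk's exact notation, assuming the usual definitions g_i = a_i·b_i and p_i = a_i XOR b_i; the serial fold stands in for the log-depth network):

```cpp
#include <cstdio>
#include <vector>

// (generate, propagate) pair for a bit position or a block of positions.
struct GP { bool g, p; };

// The associative carry operator: combining block [0,i-1] with position i
// gives the block signals (g_[0,i], p_[0,i]).
GP carry_op(GP left, GP right) {
    return { right.g || (right.p && left.g), left.p && right.p };
}

int main() {
    unsigned a = 0b1011, b = 0b0110;             // 4-bit example operands
    bool cin = false;
    std::vector<GP> x(4);
    for (int i = 0; i < 4; ++i) {
        bool ai = (a >> i) & 1, bi = (b >> i) & 1;
        x[i] = { ai && bi, ai != bi };           // g_i = a_i b_i, p_i = a_i XOR b_i
    }
    GP acc = x[0];                               // serial prefix for clarity;
    for (int i = 0; i < 4; ++i) {                // a prefix network computes the
        if (i > 0) acc = carry_op(acc, x[i]);    // same values in log depth
        bool carry = acc.g || (acc.p && cin);    // c_{i+1} from (g_[0,i], p_[0,i])
        std::printf("c%d = %d\n", i + 1, carry); // prints c1=0, c2=1, c3=1, c4=1
    }
}
```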

9 The Brent-Kung Carry Network
[Figure: Brent-Kung parallel-prefix network diagram]

10 Design Alternatives and Tradeoffs
Brent-Kung: 6 levels, 26 cells
Kogge-Stone: 4 levels, 49 cells
Hybrid: 5 levels, 32 cells
A nearly unlimited number of hybrid designs is possible
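As a quick sanity check on the quoted counts (which match word width k = 16, assuming k is a power of 2), the standard closed forms can be evaluated directly; this small C++ program is my illustration, not part of the talk:

```cpp
#include <cmath>
#include <cstdio>

// Closed-form level/cell counts for width-k prefix networks (k a power of 2):
//   Kogge-Stone: log2(k) levels; sum over stages d=1,2,4,... of (k - d)
//                cells, i.e., k*log2(k) - k + 1
//   Brent-Kung:  2*log2(k) - 2 levels; 2*(k - 1) - log2(k) cells
int main() {
    int k = 16;
    int lg = static_cast<int>(std::log2(k));
    std::printf("Kogge-Stone: %d levels, %d cells\n", lg, k * lg - k + 1);
    std::printf("Brent-Kung:  %d levels, %d cells\n", 2 * lg - 2, 2 * (k - 1) - lg);
    // For k = 16 this prints 4 levels / 49 cells and 6 levels / 26 cells,
    // matching the slide.
}
```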

11 A Taxonomy of Parallel-Prefix Adders
From: Harris, David, 2003
Logic levels = log_2 k + ℓ; fanout = 2^f + 1; wire tracks = 2^t

12 III. Function-Level Parallelism
Multiplication is now just as essential as addition
- In this section, I cover divide-and-conquer multiplication
- Many other multiplication schemes exist, several of them with parallel-processing connections
Recursive multiplication (see the sketch below):
xy = (2^{k/2} x_H + x_L)(2^{k/2} y_H + y_L)
   = 2^k x_H y_H + 2^{k/2}(x_H y_L + x_L y_H) + x_L y_L
   = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1
Complexity analysis:
T(k) = 4 T(k/2) + Θ(log k) = Θ(k^2)
A(k) = A(k/2) + Θ(k) = Θ(k)
[Figure: partial products p_1, p_2, p_3, p_4 arranged within the product p of operands x and y]
Speaker note: If the four partial products are produced concurrently, then the coefficient 4 moves from the equation for T(k) to that for A(k), leading to T(k) = O(log k) and A(k) = O(k^2). Both schemes are better than the brute-force (paper-and-pencil) algorithm, and both are suboptimal with respect to the theoretical lower bounds AT = Ω(k^{3/2}) and AT^2 = Ω(k^2).
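A runnable C++ sketch of the four-subproduct recursion above (my own; operand widths are kept small so everything fits in 64 bits). The four recursive calls are independent, which is exactly the parallelism the speaker note refers to:

```cpp
#include <cstdint>
#include <cstdio>

// Divide-and-conquer multiplication of k-bit operands (k a power of 2,
// k <= 32 here so all products fit in 64 bits):
//   x*y = 2^k * xH*yH + 2^(k/2) * (xH*yL + xL*yH) + xL*yL
uint64_t rec_mult(uint64_t x, uint64_t y, int k) {
    if (k <= 4) return x * y;                 // base case: tiny multiplier
    int h = k / 2;
    uint64_t mask = (1ULL << h) - 1;
    uint64_t xH = x >> h, xL = x & mask;
    uint64_t yH = y >> h, yL = y & mask;
    uint64_t p4 = rec_mult(xH, yH, h);        // the four half-width products
    uint64_t p3 = rec_mult(xH, yL, h);        // are independent, so they can
    uint64_t p2 = rec_mult(xL, yH, h);        // be computed concurrently
    uint64_t p1 = rec_mult(xL, yL, h);
    return (p4 << k) + ((p3 + p2) << h) + p1;
}

int main() {
    uint64_t x = 0xDEAD, y = 0xBEEF;          // 16-bit example operands
    std::printf("%llu %llu\n",
        (unsigned long long)rec_mult(x, y, 16),
        (unsigned long long)(x * y));         // both print the same product
}
```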

13 Analysis of Recursive Multiplication
xy = (2^{k/2} x_H + x_L)(2^{k/2} y_H + y_L)
   = 2^k x_H y_H + 2^{k/2}(x_H y_L + x_L y_H) + x_L y_L
   = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1
Complexity analysis (serial):
T(k) = 4 T(k/2) + Θ(log k) = Θ(k^2)
A(k) = A(k/2) + Θ(k) = Θ(k)
Complexity analysis (parallel):
T(k) = T(k/2) + Θ(log k) = Θ(log k)
A(k) = 4 A(k/2) + Θ(k) = Θ(k^2)
Theoretical lower bounds:
AT = Ω(k^{3/2})
AT^2 = Ω(k^2)
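For completeness, unrolling the serial recurrences (a standard step, not shown on the slide) confirms the stated serial bounds:

```latex
T(k) = 4\,T(k/2) + \Theta(\log k)
     = 4^{\log_2 k}\,T(1) + \sum_{i=0}^{\log_2 k - 1} 4^{i}\,
       \Theta\!\bigl(\log(k/2^{i})\bigr)
     = \Theta(k^{2})

A(k) = A(k/2) + \Theta(k)
     = \Theta\!\left(k + k/2 + k/4 + \cdots\right)
     = \Theta(k)
```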

14 The Trick Proposed by Karatsuba and Ofman
Recursive multiplication: xy = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1
Compute the auxiliary term p_5 = (x_H − x_L)(y_H − y_L) = p_4 + p_1 − p_3 − p_2, so that p_3 + p_2 = p_4 + p_1 − p_5
Complexity analysis:
A(k) = 3 A(k/2) + Θ(k) = Θ(k^{1.585}), where 1.585 = log_2 3
The benefit is significant for extremely wide operands:
(4/3)^5 ≈ 4.2, (4/3)^10 ≈ 17.8, (4/3)^20 ≈ 315, (4/3)^50 ≈ 1,765,781
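A C++ sketch of the Karatsuba-Ofman trick under the same small-width assumptions as the previous sketch (again my own illustration, not the talk's code). Signed arithmetic absorbs the possibly negative differences:

```cpp
#include <cstdint>
#include <cstdio>

// Karatsuba-Ofman on k-bit operands (k a power of 2, k <= 32 here so every
// product fits in 64 bits): three half-width multiplications replace four,
// via p5 = (xH - xL)(yH - yL) = p4 + p1 - p3 - p2.
// Assumes arithmetic right shift on negative values (true on mainstream
// compilers, guaranteed from C++20).
int64_t karatsuba(int64_t x, int64_t y, int k) {
    if (k <= 4) return x * y;                        // tiny base-case multiplier
    int h = k / 2;
    int64_t xH = x >> h, xL = x & ((1LL << h) - 1);  // x = 2^h * xH + xL
    int64_t yH = y >> h, yL = y & ((1LL << h) - 1);
    int64_t p4 = karatsuba(xH, yH, h);
    int64_t p1 = karatsuba(xL, yL, h);
    int64_t p5 = karatsuba(xH - xL, yH - yL, h);     // may be negative
    // p3 + p2 recovered as p4 + p1 - p5; multiply rather than shift so
    // that negative intermediates stay well-defined.
    return (1LL << k) * p4 + (1LL << h) * (p4 + p1 - p5) + p1;
}

int main() {
    int64_t x = 0xDEAD, y = 0xBEEF;
    std::printf("%lld %lld\n", (long long)karatsuba(x, y, 16),
                (long long)(x * y));                 // both print the same product
}
```

Each level trades one multiplication for a handful of extra additions and subtractions, which is why the benefit ratios above only become dramatic for very wide operands.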

15 Improvements to Karatsuba-Ofman Algorithm
Original / naive: Θ(k^2)
Karatsuba-Ofman: Θ(k^{1.585})
Toom / Cook: Θ(k^{1.465}), Θ(k^{1.404}), …, Θ(k^{1+ε})
Schönhage-Strassen: Θ(k log k log log k)
Fürer: still faster
Is Θ(k log k) feasible?

16 Similar Trick Used for Matrix Multiplication
Strassen's trick: eight matrix multiplications reduced to seven
Original / naive: Θ(n^3)
Strassen: Θ(n^{2.807}), where 2.807 = log_2 7
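One level of Strassen's recursion, written out in C++ on plain 2x2 matrices as an illustration (my sketch; applying the same seven combinations blockwise to n x n quadrants is what yields the Θ(n^{2.807}) bound):

```cpp
#include <cstdio>

// A 2x2 matrix; in the full algorithm each entry would itself be an
// (n/2) x (n/2) block and each '*' below a recursive call.
struct M2 { double a11, a12, a21, a22; };

// Strassen's seven products and their recombination into C = A * B.
M2 strassen2(const M2& A, const M2& B) {
    double m1 = (A.a11 + A.a22) * (B.a11 + B.a22);
    double m2 = (A.a21 + A.a22) * B.a11;
    double m3 = A.a11 * (B.a12 - B.a22);
    double m4 = A.a22 * (B.a21 - B.a11);
    double m5 = (A.a11 + A.a12) * B.a22;
    double m6 = (A.a21 - A.a11) * (B.a11 + B.a12);
    double m7 = (A.a12 - A.a22) * (B.a21 + B.a22);
    return { m1 + m4 - m5 + m7,  m3 + m5,
             m2 + m4,            m1 - m2 + m3 + m6 };
}

int main() {
    M2 A = {1, 2, 3, 4}, B = {5, 6, 7, 8};
    M2 C = strassen2(A, B);                  // expect 19 22 / 43 50
    std::printf("%g %g\n%g %g\n", C.a11, C.a12, C.a21, C.a22);
}
```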

17 Strassen Matrix Multiplication in Practice
[Plot: running time in seconds vs. matrix size n for practical C++ implementations of the naive Θ(n^3) method and Strassen's method; your results may vary]
Advantages of Strassen's algorithm show up for n ≈ 3000
Strassen's method does not show as much improvement as Karatsuba-Ofman's because:
- Its branching reduction factor is 7/8 instead of 3/4
- Matrix addition is relatively more complex than integer addition

18 IV. System-Level Parallelism
Multiple independent or interacting arithmetic streams:
- Early examples included using one or more co-processors
- Modern embodiments entail the use of GPUs and the like
Streamlined arithmetic blocks: no extra features
Discrete Fourier transform (DFT / FFT):
Inputs x_0, x_1, …, x_{n−1}; outputs y_0, y_1, …, y_{n−1}, where y_i = Σ_{j=0}^{n−1} ω_n^{ij} x_j and ω_n is a primitive nth root of unity
Naive method: Θ(n^2); FFT: Θ(n log n)
[Figure: DFT and inverse-DFT boxes mapping x_0, x_1, x_2, …, x_{n−1} to y_0, y_1, y_2, …, y_{n−1}]
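To ground the definition, here is a direct Θ(n^2) evaluation in C++ (my sketch, not from the talk; it assumes the common convention ω_n = e^{−2πi/n}, which varies across texts):

```cpp
#include <complex>
#include <cstdio>
#include <vector>

// Direct evaluation of y_i = sum_j w_n^(i*j) * x_j with w_n = exp(-2*pi*i/n):
// the Theta(n^2) definition that the FFT reduces to Theta(n log n).
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const double PI = 3.14159265358979323846;
    size_t n = x.size();
    std::vector<std::complex<double>> y(n);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += std::polar(1.0, -2.0 * PI * double(i * j) / double(n)) * x[j];
    return y;
}

int main() {
    std::vector<std::complex<double>> x = {1, 2, 3, 4};
    for (auto& yi : dft(x))
        std::printf("(%.2f, %.2f)\n", yi.real(), yi.imag());
    // prints approximately (10,0), (-2,2), (-2,0), (-2,-2)
}
```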

19 FFT Can Be Performed in Many Different Ways
Quote from The Principles of Computer Hardware: "At least one good reason for studying multiplication and division is that there is an infinite number of ways of performing these operations and hence there is an infinite number of PhDs (or expenses-paid visits to conferences in the USA) to be won from inventing new forms of multiplier." ~ Alan Clements, 1985
The statement above is even more true for DFT / FFT!
- A Google search for FFT yields 28M+ hits
- The 1965 paper by Cooley and Tukey has 14K+ citations
- Many books on FFT have 100s to 1000s of citations
- New ways of performing FFT are still being discovered

20 Computation Scheme for 16-Point FFT
[Diagram: 16-point FFT butterfly network]
n log n butterfly processors, each performing one operation
Pipelining improves hardware utilization
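A recursive radix-2 Cooley-Tukey sketch in C++ (mine, not the talk's; it assumes n is a power of 2 and the same sign convention as the DFT sketch above). The combine loop corresponds to one column of butterflies in the diagram, and its n/2 iterations are independent, which is what a parallel or pipelined realization exploits:

```cpp
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Radix-2 FFT (n a power of 2): two half-size transforms plus one column
// of butterflies per level, giving Theta(n log n) operations overall.
std::vector<cd> fft(const std::vector<cd>& x) {
    size_t n = x.size();
    if (n == 1) return x;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) { even[i] = x[2*i]; odd[i] = x[2*i + 1]; }
    auto E = fft(even), O = fft(odd);
    std::vector<cd> y(n);
    const double PI = 3.14159265358979323846;
    for (size_t i = 0; i < n / 2; ++i) {           // one column of butterflies;
        cd t = std::polar(1.0, -2.0 * PI * i / n) * O[i];
        y[i] = E[i] + t;                           // these n/2 iterations are
        y[i + n / 2] = E[i] - t;                   // independent => parallel
    }
    return y;
}

int main() {
    std::vector<cd> x = {1, 2, 3, 4};
    for (auto& yi : fft(x))
        std::printf("(%.2f, %.2f)\n", yi.real(), yi.imag());
    // matches the direct DFT output for the same input
}
```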

21 Butterfly and Shuffle-Exchange Networks
Rearrangement of nodes makes inter-column connections identical
Shuffle and shuffle-exchange link pairs are replaced by separate shuffle and exchange links

22 Projections to Reduce Hardware Complexity
Original network: n log n cost, log n time
Horizontal projection (n cost, log n time): reduces hardware complexity by a factor of log n, without increasing the asymptotic time complexity
Vertical projection (log n cost, n time): reduces hardware complexity by a factor of n, while increasing the asymptotic time complexity by a factor of n / log n

23 Timing of Computations in Low-Cost FFT Circuit
[Figure: butterfly processor with feedback connections; timing chart]
Scheduling of computations to perform an n-point FFT in O(n) time with O(log n) processors

24 V. Conclusion: Where Are We, Where Next?
I reviewed only 3 examples, but there are more:
- Parallel-prefix adders
- Recursive/divide-and-conquer multipliers
- Discrete Fourier transform, DFT / FFT
- Key role of GPUs in building exascale computers
- High-precision, error-free, and wide-range arithmetic
Paths to further connections / interactions:
- Study cross-citation patterns between the two fields
- Redundancy for data preservation and fault tolerance
- New/emerging technologies: QCA, SET, nanomagnets, …
- Program portability via standardization
- Speculative execution of multiple program paths

25 Questions or Comments? parhami@ece.ucsb.edu

