A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Henry Hoffmann, Anant Agarwal
MIT CSAIL
Boston Area Architecture Conference
21 January 2005
Introduction
Raw is a tiled microarchitecture characterized by:
  Low-latency, high-bandwidth networks
  Relatively small local memories, far from large backing memories
  Scalable hardware design allowing large Raw fabrics to be built
Raw is one of many single-chip, tiled microarchitectures
  These address growing concerns of wire delay and power consumption
The Decoupled Systolic Architecture captures key features
  Provides a theoretical tool to explore performance on tiled architectures
  Allows performance characterization of algorithms
This talk explores practical applications of the theoretical framework
Outline
  Decoupled Systolic Architecture and Stream Algorithms
  Stream Algorithms on Raw
  Experimental Methodology
  Results
  Conclusion
Stream Algorithms
Decoupled Systolic Architecture
  P(R) compute tiles: perform systolic computations, accessing only registers and networks
  M(R) memory tiles: memory management units, the only tiles that can access memory other than registers
Decoupled Systolic Algorithms
  Efficiency:
    E(N, R) = \frac{C(N)}{T(N, R) \cdot (P(R) + M(R))}
  where N = problem size, R = length of array side, C = total number of operations, T = total number of time steps, P(R) + M(R) = total number of tiles
Stream Algorithms: the class of decoupled systolic algorithms whose efficiency approaches 1 for large N and R
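The efficiency definition can be explored numerically. The sketch below is illustrative only: it assumes P(R) = R^2 compute tiles and M(R) = 4R memory tiles (the configuration described on the Methodology slide), and it uses a hypothetical idealized algorithm whose time steps equal C / P(R), meaning every compute tile does useful work on every step. The function names are made up for this example and are not taken from the authors' code.

```python
def compute_tiles(r):
    """P(R): assumed R x R array of compute tiles."""
    return r * r


def memory_tiles(r):
    """M(R): assumed 4R memory tiles on the periphery."""
    return 4 * r


def efficiency(c, t, r):
    """E(N, R) = C(N) / (T(N, R) * (P(R) + M(R)))."""
    return c / (t * (compute_tiles(r) + memory_tiles(r)))


def ideal_steps(c, r):
    """Hypothetical best case: all compute tiles busy on every time step."""
    return c / compute_tiles(r)


if __name__ == "__main__":
    c = 1_000_000  # arbitrary operation count, for illustration only
    for r in (4, 8, 16, 32, 64):
        e = efficiency(c, ideal_steps(c, r), r)
        print(f"R = {r:3d}  E = {e:.3f}")
```

Under these assumptions the efficiency reduces to R / (R + 4), so even a perfectly scheduled algorithm is limited by the share of tiles devoted to memory, and that share shrinks as R grows. This is one way to see why efficiency is defined to approach 1 only for large R.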
Methodology
We use the cycle-accurate Raw simulator
  Assume a 425 MHz clock, the maximum Raw clock speed
Raw emulates the decoupled systolic architecture
  Raw tiles act as compute tiles and do not use the local data cache (D$)
  Augment the Raw simulator with memory tiles on the periphery
  These memory tiles access all data
Implement stream algorithms for:
  Matrix multiplication
  Triangular solver
  LU factorization
  QR factorization
Measure performance as a function of:
  N: problem size (N x N matrices)
  R: array dimensions (R x R array of compute tiles + 4R memory tiles)
Results on Raw Prototype
Fix R = 4 and measure computation rate for kernels
Peak flop rate: 6.8 GFLOPS
[Figure: computation rate (GFLOPS) vs. N for each kernel]
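The 6.8 GFLOPS peak is consistent with each of the 16 compute tiles completing one floating-point operation per cycle at 425 MHz. That per-tile rate is an assumption inferred from the slide figures rather than something the slides state, and the helper below is a hypothetical sketch built on it.

```python
def peak_gflops(r, clock_mhz=425.0, flops_per_tile_per_cycle=1.0):
    """Peak rate of an R x R compute array, in GFLOPS.

    Assumes one floating-point operation per compute tile per cycle;
    memory tiles are assumed to contribute no floating-point work.
    """
    tiles = r * r
    return tiles * clock_mhz * 1e6 * flops_per_tile_per_cycle / 1e9


if __name__ == "__main__":
    for r in (4, 8, 16, 32):
        print(f"R = {r:2d}  peak = {peak_gflops(r):.1f} GFLOPS")
```

Under the same assumption, and if the 1024 tiles reported in the conclusions are all compute tiles (R = 32), the peak would be 435.2 GFLOPS, which would put the 414 GFLOPS matrix multiply result at roughly 95% of peak.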
Results for Large Raw Fabrics
Scale matrix multiplication and QR factorization, N = 1024
Examine computation rate and speedup vs. R = 4
[Figure: two panels plotted against R: speedup vs. R = 4, and computation rate (GFLOPS)]
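For readers who want to reproduce this kind of plot from their own measurements, the sketch below shows one plausible way to compute speedup and parallel efficiency relative to the R = 4 baseline. The definitions are assumptions (speedup as a ratio of computation rates, ideal speedup as the ratio of compute tile counts); the slides do not spell out the exact formulas used, and the rates in the example are invented for illustration.

```python
def speedup(rate, base_rate):
    """Speedup of a larger fabric over the baseline, as a ratio of rates."""
    return rate / base_rate


def parallel_efficiency(rate, base_rate, r, base_r=4):
    """Speedup divided by the ideal speedup.

    Ideal speedup is assumed to be the growth in compute tiles,
    (r / base_r) ** 2; memory tiles are ignored in this sketch.
    """
    ideal = (r / base_r) ** 2
    return speedup(rate, base_rate) / ideal


if __name__ == "__main__":
    # Hypothetical computation rates in GFLOPS, not measured results.
    base_rate = 4.0
    for r, rate in ((8, 15.0), (16, 56.0), (32, 200.0)):
        s = speedup(rate, base_rate)
        pe = parallel_efficiency(rate, base_rate, r)
        print(f"R = {r:2d}  speedup = {s:6.2f}  efficiency = {pe:.2f}")
```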
Conclusions
Raw provides scalable hardware
Stream algorithms provide scalable software
Together they yield high-performance implementations
  Matrix multiply
    Close to ideal speedup, rapidly approaches peak performance
    On 1024 Raw tiles, sustained throughput of 414 GFLOPS
  QR factorization
    Parallel efficiency of 75% on 1024 Raw tiles
    Sustained throughput of 294 GFLOPS
Future Work
Automatic generation of stream algorithms
  Experimenting with a template-based approach
Implementation of an entire application
  Candidate apps: MPEG encode/decode, DSP, scientific simulation
Extend stream algorithm framework
  Develop a robust, formal notion of stream algorithms