A Quantitative Analysis of Stream Algorithms on Raw Fabrics


A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Henry Hoffmann, Anant Agarwal
MIT CSAIL
Boston Area Architecture Conference, 21 January 2005

Introduction
- Raw is a tiled microarchitecture characterized by:
  - Low-latency, high-bandwidth networks
  - Relatively small local memories, far from large backing memories
  - Scalable hardware design allowing large Raw fabrics to be built
- Raw is one of many single-chip, tiled microarchitectures
  - These address growing concerns about wire delay and power consumption
- The Decoupled Systolic Architecture captures the key features
  - Provides a theoretical tool to explore performance on tiled architectures
  - Allows performance characterization of algorithms
- This talk explores practical applications of the theoretical framework

Outline
- Decoupled Systolic Architecture and Stream Algorithms
- Stream Algorithms on Raw
- Experimental Methodology
- Results
- Conclusion

Stream Algorithms
Decoupled Systolic Architecture
- P(R) compute tiles: perform systolic computations, accessing only registers and networks
- M(R) memory tiles: memory management units, the only tiles that can access memory other than registers
Decoupled Systolic Algorithms
- Efficiency: $E(N, R) = \frac{C(N)}{T(N, R) \cdot (P(R) + M(R))}$
  where N = problem size, R = length of array side, C = total number of operations, T = total number of time steps, P(R) + M(R) = total number of tiles
- Stream Algorithms: the class of decoupled systolic algorithms whose efficiency approaches 1 for large N and R
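The efficiency definition above can be sketched as a small calculation. This is a minimal illustration, not code from the talk: the function names are hypothetical, the tile counts P(R) = R^2 and M(R) = 4R are taken from the array configuration described in the methodology (R x R compute tiles plus 4R memory tiles), and the operation and time-step counts in the usage lines are an idealized assumption (every compute tile busy on every step), not measured values.

```python
def compute_tiles(r: int) -> int:
    """P(R): number of compute tiles, assuming an R x R array."""
    return r * r


def memory_tiles(r: int) -> int:
    """M(R): number of memory tiles, assuming 4R tiles on the periphery."""
    return 4 * r


def efficiency(c_ops: int, t_steps: int, r: int) -> float:
    """E(N, R) = C(N) / (T(N, R) * (P(R) + M(R)))."""
    return c_ops / (t_steps * (compute_tiles(r) + memory_tiles(r)))


# Idealized, hypothetical workload: C(N) = N^3 operations spread perfectly
# over the R^2 compute tiles, so T(N, R) = N^3 / R^2 time steps.
n = 1024
for r in (4, 32):
    c_ops = n ** 3
    t_steps = n ** 3 // (r * r)
    print(r, round(efficiency(c_ops, t_steps, r), 3))
```

Under this idealization the efficiency reduces to R / (R + 4), so the memory tiles cost half the array at R = 4 but only about 11% at R = 32, which is the sense in which efficiency approaches 1 as R grows.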

Methodology
- We use the cycle-accurate Raw simulator
  - Assume a 425 MHz clock, the maximum Raw clock speed
- Raw emulates the decoupled systolic architecture
  - Raw tiles act as compute tiles and do not use the local data cache (D$)
  - Augment the Raw simulator with memory tiles on the periphery
  - These memory tiles access all data
- Implement stream algorithms for:
  - Matrix multiplication
  - Triangular solver
  - LU factorization
  - QR factorization
- Measure performance as a function of:
  - N: problem size (N x N matrices)
  - R: array dimensions (R x R array of compute tiles + 4R memory tiles)

Results on Raw Prototype
- Fix R = 4 and measure computation rate for kernels
- Peak flop rate: 6.8 GFLOPS
[Figure: computation rate (GFLOPS) vs. N]
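The quoted peak rate can be checked with a short calculation. The sketch below assumes each compute tile completes one floating-point operation per cycle; that assumption is not stated on the slide, but it is consistent with the 6.8 GFLOPS figure for 16 tiles at 425 MHz. The function name and parameters are hypothetical.

```python
def peak_gflops(r: int, clock_mhz: int = 425, flops_per_cycle: int = 1) -> float:
    """Peak rate of an R x R array of compute tiles, in GFLOPS.

    Assumes flops_per_cycle floating-point operations per tile per cycle
    (an assumption, chosen to match the 6.8 GFLOPS quoted for R = 4).
    """
    tiles = r * r
    return tiles * clock_mhz * flops_per_cycle / 1000


print(peak_gflops(4))   # 16 compute tiles at 425 MHz
print(peak_gflops(32))  # 1024 compute tiles at 425 MHz
```

Under the same assumption, a 32 x 32 array of compute tiles would peak at 435.2 GFLOPS.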

Results for Large Raw Fabrics
- Scale matrix multiplication and QR factorization, N = 1024
- Examine computation rate and speedup vs. R = 4
[Figures: computation rate (GFLOPS) vs. R; speedup vs. R = 4, plotted against R]
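Speedup relative to the R = 4 baseline can be sketched as follows. The slide does not define speedup or parallel efficiency, so this assumes the common definitions: speedup is the ratio of computation rates at a fixed N, and parallel efficiency is speedup divided by the ideal speedup, taken here as the ratio of compute-tile counts. The function names and the rates in the usage lines are hypothetical placeholders, not results from the talk.

```python
def speedup(rate_gflops: float, base_rate_gflops: float) -> float:
    """Speedup of a larger array over the baseline array at the same N."""
    return rate_gflops / base_rate_gflops


def parallel_efficiency(rate_gflops: float, base_rate_gflops: float,
                        r: int, r_base: int = 4) -> float:
    """Speedup divided by ideal speedup (ratio of compute-tile counts).

    Assumed definition; the slides do not spell one out.
    """
    ideal = (r * r) / (r_base * r_base)
    return speedup(rate_gflops, base_rate_gflops) / ideal


# Hypothetical rates for illustration only (not measured values).
base_rate = 5.0     # GFLOPS at R = 4
large_rate = 240.0  # GFLOPS at R = 32
print(speedup(large_rate, base_rate))
print(parallel_efficiency(large_rate, base_rate, r=32))
```

With these placeholder rates, going from 16 to 1024 compute tiles has an ideal speedup of 64, so a measured speedup of 48 corresponds to a parallel efficiency of 0.75.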

Conclusions
- Raw provides scalable hardware
- Stream algorithms provide scalable software
- Together they yield high-performance implementations
  - Matrix multiply
    - Close to ideal speedup, rapidly approaches peak performance
    - On 1024 Raw tiles, sustained throughput of 414 GFLOPS
  - QR factorization
    - Parallel efficiency of 75% on 1024 Raw tiles
    - Sustained throughput of 294 GFLOPS

Future Work
- Automatic generation of stream algorithms
  - Experimenting with a template-based approach
- Implementation of an entire application
  - Candidate apps: MPEG encode/decode, DSP, scientific simulation
- Extend the stream algorithm framework
  - Develop a robust, formal notion of stream algorithms