Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Slide 1: Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Electrical Engineering and Computer Science

Slide 2: Automated C to Gates Solution
SoC design
  – Gops, 200 mW power budget
  – Low level tools ineffective
Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market
[Figure: app.c → LA]

Slide 3: Streaming Applications
Data “streaming” through kernels
Kernels are tight loops
  – FIR, Viterbi, DCT
Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets
[Figure: H.264 Encoder, from Image to Coded Image. Blocks: Motion Estimator, Transform, Quantizer, Coder, Inverse Quantizer, Inverse Transform, Motion Predictor]
[Figure: W-CDMA Transmitter, from Data in to Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]

Slide 4: Software Overview
[Figure: Whole Application, Frontend Analyses, Loop Graph, System Level Synthesis, Accelerator Pipeline, Multifunction Accelerator, SRAM Buffers]

Slide 5: Input Specification
Sequential C program
Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                ... = inp[i][j];
                out[i][j] = ... ;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Slide 6: Performance Specification
High performance DCT
  – Process one 1024x768 image every 2 ms
  – Given 400 MHz clock
      One image every 2 ms × 400 MHz = 800,000 cycles
      One block every 64 cycles
Low performance DCT
  – Process one 1024x768 image every 4 ms
  – One block every 128 cycles
Performance goal: task throughput in number of cycles between tasks
[Figure: Input image (1024 x 768) divided into 8x8 blocks; one task is inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out; Output coeffs]
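The per-block budget on this slide follows from simple arithmetic, sketched below in C. The function names are hypothetical and not part of Streamroller. Exact division gives about 65 and 130 cycles per 8x8 block; the slide's 64 and 128 sit just under those values, presumably as rounded targets.

```c
#include <assert.h>

/* Number of 8x8 blocks in a width x height image. Dimensions are
 * assumed to be multiples of 8, as 1024 and 768 are. */
long blocks_8x8(long width, long height)
{
    assert(width > 0 && height > 0);
    assert(width % 8 == 0 && height % 8 == 0);
    return (width / 8) * (height / 8);
}

/* Cycles available per task when `tasks` tasks must complete within
 * `period_us` microseconds on a `clock_mhz` MHz clock.
 * MHz x microseconds gives cycles directly; the division truncates. */
long cycles_per_task(long clock_mhz, long period_us, long tasks)
{
    assert(clock_mhz > 0 && period_us > 0 && tasks > 0);
    long total_cycles = clock_mhz * period_us;
    return total_cycles / tasks;
}
```

For the high performance case, `cycles_per_task(400, 2000, blocks_8x8(1024, 768))` divides 800,000 cycles among 12,288 blocks.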

Slide 7: Building Blocks
[Figure: Kernel 1, Kernel 2, Kernel 3, Kernel 4; Multifunction Loop Accelerator [CODES/ISSS ’06]; SRAM buffers tmp1, tmp2, tmp3]

Slide 8: System Schema Overview
[Figure: Kernel 1 to Kernel 5 mapped onto LA 1, LA 2, LA 3; timeline of successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) along a time axis, with the task throughput marked]

Slide 9: Cost Components
Cost of loop accelerator data path
  – Cost of FUs, shift registers, muxes, interconnect
Initiation interval (II)
  – Key parameter that decides LA cost
      Low II → high performance → high cost
  – Loop execution time ≈ (trip count) x II
  – Appropriate II chosen to satisfy task throughput
[Figure: High performance: K1, K2, K3 with TC=100 and II=1, Throughput = 1 task/100 cycles. Low performance: K1, K2, K3 with TC=100 and II=2, Throughput = 1 task/200 cycles]
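The relation between trip count, II, and task throughput can be sketched in a few lines of C. The helper names are hypothetical; the only model used is the slide's approximation that loop execution time is trip count times II.

```c
#include <assert.h>

/* Approximate loop execution time from the slide: trip count x II. */
long loop_cycles(long trip_count, long ii)
{
    assert(trip_count > 0 && ii > 0);
    return trip_count * ii;
}

/* Largest II for which a single loop still finishes within the task
 * budget. A larger II means a cheaper accelerator, so this is the
 * cheapest II that meets the throughput. Returns 0 when even II = 1
 * is too slow. */
long max_ii_for_budget(long trip_count, long budget_cycles)
{
    assert(trip_count > 0 && budget_cycles >= 0);
    return budget_cycles / trip_count;
}
```

With TC=100, a budget of 100 cycles per task forces II=1, while a budget of 200 cycles allows II=2, matching the two cases in the figure.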

Slide 10: Cost Components (Contd.)
Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1 to K4 (TC=100) on LA 1, LA 2, LA 3; LA 1 occupied for 200 cycles; Throughput = 1 task / 200 cycles]
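The effect of grouping on throughput can be sketched as follows. This is an illustrative model, not the tool's implementation: it assumes each LA runs its loops one after another, that there is enough buffering for different LAs to overlap fully, and that the busiest LA therefore sets the interval between tasks. The function name and the 0-based `la_of` mapping are hypothetical.

```c
#include <assert.h>

/* Cycles between successive tasks for a given grouping.
 * tc[k] and ii[k] are the trip count and II of loop k, and la_of[k]
 * is the 0-based index of the accelerator that loop k is mapped to.
 * An LA is busy for the sum of TC x II over its loops; the busiest
 * LA determines the task interval. */
long task_interval(const long *tc, const long *ii, const int *la_of,
                   int n_loops, int n_las)
{
    assert(n_loops > 0 && n_las > 0);
    long worst = 0;
    for (int la = 0; la < n_las; la++) {
        long busy = 0;
        for (int k = 0; k < n_loops; k++) {
            assert(la_of[k] >= 0 && la_of[k] < n_las);
            if (la_of[k] == la)
                busy += tc[k] * ii[k];
        }
        if (busy > worst)
            worst = busy;
    }
    return worst;
}
```

With four loops of TC=100 at II=1, placing two of them on the same LA keeps that LA occupied for 200 cycles, so the interval becomes 200 cycles; one loop per LA would give 100 cycles.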

Slide 11: Cost Components (Contd.)
Cost of SRAM buffers for intermediate arrays
More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (II=1, TC=100) on LA 1, LA 2, LA 3 with buffers tmp1 and tmp2; tmp1 buffer in use by LA 2; adjacent tasks use different buffers]

Slide 12: ILP Formulation
Variables
  – II for each loop
  – Which loops are combined into single LA
  – Number of buffers for temp array
Objective function
  – Cost of LAs + cost of buffers
Constraints
  – Overall task throughput should be achieved
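To make the search space concrete, here is a toy exhaustive search over the first two kinds of variables (II per loop and loop-to-LA grouping) for three loops. It is not an ILP and not the authors' formulation: it swaps in brute-force enumeration, ignores buffer cost entirely, uses a made-up cost table `la_cost`, and borrows the 0.5 * (sum + max) estimate from the Multifunction Accelerator Cost slide for shared LAs. All names are hypothetical.

```c
#include <assert.h>

#define N_LOOPS 3   /* loops in the toy system */
#define MAX_II  4   /* largest II considered   */

/* Hypothetical LA cost for II = 1..4 (index 0 unused); made-up values
 * chosen so that cost falls steeply at first and then flattens. */
static const double la_cost[MAX_II + 1] = {0.0, 100.0, 60.0, 58.0, 57.0};

/* Cost of one candidate design, or -1.0 if some LA exceeds the budget.
 * ii[k] is the II of loop k, la[k] the accelerator it is mapped to. */
static double design_cost(const int tc[], const int ii[], const int la[],
                          int budget)
{
    double total = 0.0;
    for (int a = 0; a < N_LOOPS; a++) {
        int busy = 0;
        double sum = 0.0, peak = 0.0;
        for (int k = 0; k < N_LOOPS; k++) {
            if (la[k] != a)
                continue;
            busy += tc[k] * ii[k];
            sum += la_cost[ii[k]];
            if (la_cost[ii[k]] > peak)
                peak = la_cost[ii[k]];
        }
        if (busy > budget)
            return -1.0;
        total += 0.5 * (sum + peak);   /* an empty LA contributes 0 */
    }
    return total;
}

/* Try every (II, LA) choice for every loop and return the cheapest
 * feasible cost, or -1.0 when no design meets the budget. */
double cheapest_design(const int tc[], int budget)
{
    assert(budget > 0);
    int per_loop = MAX_II * N_LOOPS;   /* (II, LA) choices for one loop */
    int combos = 1;
    for (int k = 0; k < N_LOOPS; k++)
        combos *= per_loop;

    double best = -1.0;
    for (int code = 0; code < combos; code++) {
        int ii[N_LOOPS], la[N_LOOPS], rest = code;
        for (int k = 0; k < N_LOOPS; k++) {   /* decode mixed-radix code */
            int choice = rest % per_loop;
            rest /= per_loop;
            ii[k] = 1 + choice % MAX_II;
            la[k] = choice / MAX_II;
        }
        double c = design_cost(tc, ii, la, budget);
        if (c >= 0.0 && (best < 0.0 || c < best))
            best = c;
    }
    return best;
}
```

With three loops of trip count 100 and this cost table, a budget of 200 cycles is met most cheaply by three separate LAs at II=2, while a budget of 400 cycles favors sharing one LA between two loops at II=2. Enumeration grows exponentially with the number of loops, which is why a solver-based formulation is used for real systems.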

Slide 13: Non-linear LA Cost
$II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \dots + 14 \cdot II_{14}$ and $0 \le II_i \le 1$
$Cost(II) = C_1 \cdot II_1 + C_2 \cdot II_2 + C_3 \cdot II_3 + \dots + C_{14} \cdot II_{14}$
$II_{min} \le II \le II_{max}$
[Figure: Relative Cost versus Initiation interval, between II min and II max]
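The two sums on this slide can be read as a table lookup expressed in linear form. The sketch below assumes the selector variables are binary and that exactly one of them is set (a one-hot choice); the slide itself only states 0 ≤ II_i ≤ 1, so that reading is an assumption. Function names and any cost values passed in are hypothetical.

```c
#include <assert.h>

#define N_II 14   /* candidate II values 1..14, as on the slide */

/* Returns 1 when exactly one selector is 1 and the rest are 0. */
int is_one_hot(const int sel[N_II])
{
    int ones = 0;
    for (int i = 0; i < N_II; i++) {
        if (sel[i] != 0 && sel[i] != 1)
            return 0;
        ones += sel[i];
    }
    return ones == 1;
}

/* II = 1*sel[0] + 2*sel[1] + ... + 14*sel[13]. */
int selected_ii(const int sel[N_II])
{
    int ii = 0;
    for (int i = 0; i < N_II; i++)
        ii += (i + 1) * sel[i];
    return ii;
}

/* Cost(II) = C_1*sel[0] + C_2*sel[1] + ... + C_14*sel[13].
 * c[i] holds the cost of an accelerator built with II = i + 1. */
double selected_cost(const double c[N_II], const int sel[N_II])
{
    double cost = 0.0;
    for (int i = 0; i < N_II; i++)
        cost += c[i] * sel[i];
    return cost;
}
```

Under the one-hot assumption, setting only the third selector yields II = 3 and a cost equal to the third table entry, so a non-linear cost curve is captured by expressions that are linear in the selectors.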

Slide 14: Multifunction Accelerator Cost
Worst case: No sharing, Cost = Sum
Realistic case: Some sharing, Cost = Between Sum and Max
Best case: Full sharing, Cost = Max
Impractical to obtain accurate cost of all combinations
$C_{LA} = 0.5 \cdot (SUM_{C_{LA}} + MAX_{C_{LA}})$
[Figure: LA 1, LA 2, LA 3, LA 4 shown for each of the three cases]
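The estimate on this slide is the midpoint between the no-sharing and full-sharing bounds. A minimal C sketch, with a hypothetical function name and illustrative inputs:

```c
#include <assert.h>

/* Estimated cost of a multifunction LA built from n single-function
 * LAs: halfway between the worst case (sum of the individual costs)
 * and the best case (largest individual cost). */
double multifunction_cost(const double *la_costs, int n)
{
    assert(n > 0);
    double sum = 0.0, peak = la_costs[0];
    for (int i = 0; i < n; i++) {
        sum += la_costs[i];
        if (la_costs[i] > peak)
            peak = la_costs[i];
    }
    return 0.5 * (sum + peak);
}
```

For individual costs of 10, 20, 30, and 40, the sum is 100 and the maximum is 40, so the estimate is 70. A single LA on its own is estimated at exactly its own cost.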

Slide 15: Case Study: “Simple” benchmark
[Figure: Loop graph with trip counts (TC); two accelerator pipelines built from LAs, one annotated with 1536 cycles]

Slide 16: Beamformer
Beamformer: 10 loops
Memory cost: 60% to 70%
Up to 20% cost savings due to hardware sharing in multifunction accelerators
Systems at lower throughput have over-designed LAs
  – Not profitable to pick a lower performance LA
Memory buffer cost significant
  – High performance producer consumer better than more buffers

Slide 17: Conclusions
Automated design realistic for system of loops
Designers can move up the abstraction hierarchy
Observations
  – Macro level hardware sharing can achieve significant cost savings
  – Memory cost is significant; need to simultaneously optimize for datapath and memory cost
ILP formulation tractable
  – Solver took less than 1 minute for systems with 30 loops
