Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware WU DI NOV. 3, 2015.

Slides:



Advertisements
Similar presentations
Please do not distribute
Advertisements

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL- PURPOSE PROGRAMS Taylor, M.B.; Kim, J.; Miller, J.; Wentzlaff, D.; Ghodrat,
Embedded Software Optimization for MP3 Decoder Implemented on RISC Core Yingbiao Yao, Qingdong Yao, Peng Liu, Zhibin Xiao Zhejiang University Information.
Introduction CS 524 – High-Performance Computing.
Source Code Optimization and Profiling of Energy Consumption in Embedded System Simunic, T.; Benini, L.; De Micheli, G.; Hans, M.; Proceedings on The 13th.
Optimization Of Power Consumption For An ARM7- BASED Multimedia Handheld Device Hoseok Chang; Wonchul Lee; Wonyong Sung Circuits and Systems, ISCAS.
The Effect of Data-Reuse Transformations on Multimedia Applications for Different Processing Platforms N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.
Please do not distribute
Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma
Architectural and Compiler Techniques for Energy Reduction in High-Performance Microprocessors Nikolaos Bellas, Ibrahim N. Hajj, Fellow, IEEE, Constantine.
Please do not distribute
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science.
The MachSuite Benchmark
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
“Low-Power, Real-Time Object- Recognition Processors for Mobile Vision Systems”, IEEE Micro Jinwook Oh ; Gyeonghoon Kim ; Injoon Hong ; Junyoung.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
Tutorial Outline Time Topic 9:00 am – 9:30 am Introduction 9:30 am – 10:10 am Standalone Accelerator Simulation: Aladdin 10:10 am – 10:30 am Standalone.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
Automated Design of Custom Architecture Tulika Mitra
Chapter 8 Problems Prof. Sin-Min Lee Department of Mathematics and Computer Science.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
Mahesh Sukumar Subramanian Srinivasan. Introduction Embedded system products keep arriving in the market. There is a continuous growing demand for more.
1 Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives Lin, Hai Fei, Yunsi ACM/IEEE.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.
1 Tuning Garbage Collection in an Embedded Java Environment G. Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin Microsystems Design Lab The.
ISSS 2001, Montréal1 ISSS’01 S.Derrien, S.Rajopadhye, S.Sur-Kolay* IRISA France *ISI calcutta Combined Instruction and Loop Level Parallelism for Regular.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
An Integrated Design Environment to Evaluate Power/Performance Tradeoffs for Sensor Network Applications Amol Bakshi, Jingzhao Ou, and Viktor K. Prasanna.
 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh
Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? Wasim Shaikh Date: 10/29/2015.
Hardware Accelerator for Hot-word Recognition Gautam Das Govardan Jonathan Mathews Wasim Shaikh Mojes Koli.
Caches for Accelerators
FPGA Hardware Synthesis Jessica Baxter. Reference M. Haldar, A. Nayak, N. Shenoy, A. Choudhary and P. Banerjee, “FPGA Hardware Synthesis from MATLAB”,
Weekly Report- Reduction Ph.D. Student: Leo Lee date: Oct. 30, 2009.
Lx: A Technology Platform for Customizable VLIW Embedded Processing.
A Design Flow for Optimal Circuit Design Using Resource and Timing Estimation Farnaz Gharibian and Kenneth B. Kent {f.gharibian, unb.ca Faculty.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Sunpyo Hong, Hyesoon Kim
CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.
1 of 14 Lab 2: Formal verification with UPPAAL. 2 of 14 2 The gossiping persons There are n persons. All have one secret to tell, which is not known to.
The Effect of Data-Reuse Transformations on Multimedia Applications for Application Specific Processors N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.
Design and Modeling of Specialized Architectures Yakun Sophia Shao May 9 th, 2016 Harvard University P HD D ISSERTATION D EFENSE.
A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration Wei Zuo1, Warren Kemmerer1, Jong Bin Lim1,
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Please do not distribute
Please do not distribute
Please do not distribute
Adaptive Median Filter
Please do not distribute
Evaluating Register File Size
Please do not distribute
Application-Specific Customization of Soft Processor Microarchitecture
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula
STUDY AND IMPLEMENTATION
Circuit Design Techniques for Low Power DSPs
Final Project presentation
Measuring the Gap between FPGAs and ASICs
Application-Specific Customization of Soft Processor Microarchitecture
Introduction to Optimization
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware WU DI NOV. 3, 2015

Introduction Hardware accelerator A type of specializaion in the form of fixed, custom circuits designed for specific tasks. Types Tight coupled accelerator Loose coupled accelerator Processor core Tight coupled acc Processor core Loose coupled acc

Introduction Pareto optimized design Considering all the factors of an optimization, if we step further on one of the factors of the optimization, there will be negative effect on at least one of the other factors, then this design is called a Pareto optimized design This paper concentrates on the power/performance evaluation of different Pareto optimized loose coupled designs

Methodology - Workflow Use the same C code for both software and accelerator hardware

Methodology - Workflow C to RTL Synthesis – Vivado Different directives swept to get the optimaized design xilinx vivado user guide

Methodology - Workflow RTL Simulation – ModelSim Cycle accurate simulation provides accurate cycle counts and detaild activity factors for the following steps.

Methodology - Workflow RTL Synthesis – Design Compiler 40nm standard cell library Swept 2ns, 5ns, 8ns for clock period Activity factors getnerated from simulation used to estimate power Different types of DesignWare multiplier used

Methodology - Accelerator Workload SHOC Benchmark suite Tride C[i] = A[i] + s*B[i] Reduction b = ∑A[i] Scan B[i] = B[i-1] + A[i] Stencil Dot product of a fixed 3x3 filter and each 3x3 pixel sub block of an image GEMM Integer matrix multiply

Result Recall design spaces swept in the evaluation Directives of Vivado high level synthesis Loop unrolling0 : N Array partioning0 : N/2 Stage of pipeline0 : 7 Stage of multiplier0 : 6 Clock period in Design Compiler 2ns / 5ns / 8ns Results are normalized to that of the Cortex-M0

Result - Triad Completely parallel Simple computation more resources always result in greater performane at a linear power cost. C[i] = A[i] + s*B[i]

Result - Reduction Most efficient implementation is tree-adder Control overhead takes most of the part 2ns clock period results in the best performance Eleminating control overhead leads to better result Higher frequency does NOT always mean higher consumption b = ∑A[i]

Result - Scan Vivado cannot break loop because of the dependence 3 vertical lines conrespond to 3 different clock period reaching the performance limit Circuit or microarchitectural optimization is basically useless due to the data dependency B[i] = B[i-1] + A[i]

Result - Stencil Parallel in coarser granularity Reusing loaed data increases computational density Even modest unrolling factors allow heavy data reuse thus save considerable computational time Dot product of a fixed 3x3 filter and each 3x3 pixel sub block of an image

Result - GEMM Also parallel but at an even coarser granularity Consume more power to achieve the same performance as other algorithms. Integer matrix multiply

Understanding Acceleration Gains Metric to compare the overheads Instruction counts for Cortex-M0 Area for accelerators M0 is much more control intensive Accelerator is more compute intensive

Understanding Acceleration Gains Reduction Most effective due to the elimination of large control overhead Triad Parallel structure makes it performance gains unbounded Scan Low power but limited performace due to data dependency Stencil Parallel but needs multiplier, lee perfromance given a fixed power budget GEMM Substantial performance gains come at a larget power cost

Efficiency Differences in Pareto Optimal Design - Triad Nearly constant energy Linear decrease of overhead from 6 to 1 Intuitive conclusion: optimizing for performance eliminaties overhead

Efficiency Differences in Pareto Optimal Design - Stencil Energy cost range from 0.5x to 3.5x Data reuse and resource scheduling makes the control flow complicated leading to an unintuitive result.

Conclusion Non-intuitive combinations of architectural parameters can yield energy optimal designs. A quatitative formal process for finding optimal accelerator design is needed

Cited by Shao, Y.S.; Reagen, B.; Gu-Yeon Wei; Brooks, D. "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures", Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, On page(s): 97 – 108Shao, Y.S.; Reagen, B.; Gu-Yeon Wei; Brooks, D. "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures", Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, On page(s): 97 – 108 Shao, Y.S.; Reagen, B.; Gu-Yeon Wei; Brooks, D. "The Aladdin Approach to Accelerator Design and Modeling", Micro, IEEE, On page(s): Volume: 35, Issue: 3, May-June 2015Shao, Y.S.; Reagen, B.; Gu-Yeon Wei; Brooks, D. "The Aladdin Approach to Accelerator Design and Modeling", Micro, IEEE, On page(s): Volume: 35, Issue: 3, May-June 2015 Reagen, B.; Adolf, R.; Shao, Y.S.; Gu-Yeon Wei; Brooks, D. "MachSuite: Benchmarks for accelerator design and customized architectures", Workload Characterization (IISWC), 2014 IEEE International Symposium on, On page(s): Reagen, B.; Adolf, R.; Shao, Y.S.; Gu-Yeon Wei; Brooks, D. "MachSuite: Benchmarks for accelerator design and customized architectures", Workload Characterization (IISWC), 2014 IEEE International Symposium on, On page(s):

Questions