The MachSuite Benchmark

Slides:

Advertisements

Similar presentations

1 General-Purpose Languages, High-Level Synthesis John Sanguinetti High-Level Modeling.

Advertisements

Please do not distribute

ECOE 560 Design Methodologies and Tools for Software/Hardware Systems Spring 2004 Serdar Taşıran.

TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.

Autonomic Systems Justin Moles, Winter 2006 Enabling autonomic behavior in systems software with hot swapping Paper by: J. Appavoo, et al. Presentation.

GPGPU Introduction Alan Gray EPCC The University of Edinburgh.

Introductory Comments Regarding Hardware Description Languages.

The Design Process Outline Goal Reading Design Domain Design Flow

© Dr. Alaaeldin Amin 1 Hardware Modeling & Synthesis Using VHDL Very High Speed Integrated Circuits Start Of VHDL Development First Publication.

Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.

1 Chapter 13 Cores and Intellectual Property. 2 Overview FPGA intellectual property (IP) can be defined as a reusable design block (Hard, Firm or soft)

Team W1 Design Manager: Rebecca Miller 1. Bobby Colyer (W11) 2. Jeffrey Kuo (W12) 3. Myron Kwai (W13) 4. Shirlene Lim (W14) Stage II: 26 th January 2004.

Toward Cache-Friendly Hardware Accelerators

Rapid Exploration of Accelerator-rich Architectures: Automation from Concept to Prototyping David Brooks, Yu-Ting Chen, Jason Cong, Zhenman Fang, Brandon.

Please do not distribute

From Concept to Silicon How an idea becomes a part of a new chip at ATI Richard Huddy ATI Research.

University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.

GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British.

Please do not distribute

1 Chapter 2. The System-on-a-Chip Design Process Canonical SoC Design System design flow The Specification Problem System design.

Chap. 1 Overview of Digital Design with Verilog. 2 Overview of Digital Design with Verilog HDL Evolution of computer aided digital circuit design Emergence.

1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.

Tutorial Outline Time Topic 9:00 am – 9:30 am Introduction 9:30 am – 10:10 am Standalone Accelerator Simulation: Aladdin 10:10 am – 10:30 am Standalone.

Architectural Optimizations David Ojika March 27, 2014.

University of Michigan Electrical Engineering and Computer Science 1 Integrating Post-programmability Into the High-level Synthesis Equation* Scott Mahlke.

CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.

Agenda Project discussion Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]pdf.

Automated Design of Custom Architecture Tulika Mitra

1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.

Section 10: Advanced Topics 1 M. Balakrishnan Dept. of Comp. Sci. & Engg. I.I.T. Delhi.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.

COE 405 Design and Modeling of Digital Systems

IEEE ICECS 2010 SysPy: Using Python for processor-centric SoC design Evangelos Logaras Elias S. Manolakos {evlog, Department of Informatics.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

ESL and High-level Design: Who Cares? Anmol Mathur CTO and co-founder, Calypto Design Systems.

An Overview of Hardware Design Methodology Ian Mitchelle De Vera.

ECE 526 – Network Processing Systems Design Network Processor Introduction Chapter 11,12: D. E. Comer.

Modern VLSI Design 4e: Chapter 8 Copyright  2008 Wayne Wolf Topics Modeling with hardware description languages (HDLs).

Modern VLSI Design 3e: Chapter 8 Copyright  1998, 2002 Prentice Hall PTR Topics n Modeling with hardware description languages (HDLs).

Sorting: Implementation Fundamental Data Structures and Algorithms Klaus Sutner February 24, 2004.

IMPLEMENTATION OF MIPS 64 WITH VERILOG HARDWARE DESIGN LANGUAGE BY PRAMOD MENON CET520 S’03.

Caches for Accelerators

FPGA Hardware Synthesis Jessica Baxter. Reference M. Haldar, A. Nayak, N. Shenoy, A. Choudhary and P. Banerjee, “FPGA Hardware Synthesis from MATLAB”,

FPGA-Based System Design Copyright  2004 Prentice Hall PTR Topics n Modeling with hardware description languages (HDLs).

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.

Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware WU DI NOV. 3, 2015.

K-Nearest Neighbor Digit Recognition ApplicationDomainConstraintsKernels/Algorithms Voice Removal and Pitch ShiftingAudio ProcessingLatency (Real-Time)FFT,

Design and Modeling of Specialized Architectures Yakun Sophia Shao May 9 th, 2016 Harvard University P HD D ISSERTATION D EFENSE.

Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.

Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin

Please do not distribute

Please do not distribute

Parallel Patterns.

Please do not distribute

Please do not distribute

Please do not distribute

Ph.D. in Computer Science

James Coole PhD student, University of Florida Aaron Landy Greg Stitt

Topics Modeling with hardware description languages (HDLs).

Please do not distribute

Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula

Topics Modeling with hardware description languages (HDLs).

Survey of Crypto CoProcessor Design

Hardware Description Languages

ESE532: System-on-a-Chip Architecture

Chapter 1 Introduction.

HIGH LEVEL SYNTHESIS.

Hardware Modeling & Synthesis Using VHDL

H a r d w a r e M o d e l i n g O v e r v i e w

Presentation transcript:

The MachSuite Benchmark Brandon Reagen Robert Adolf, Yakun Sophia Shao Sam Xi, Gu-Yeon Wei David Brooks

Who Cares about Accelerators Please do not distribute 4/22/2017 Who Cares about Accelerators Architecture Cause: Transistors scaling Effect: Specialization & SoCs Cool tool.. Actually work!! GYW

Who Cares about Accelerators Please do not distribute 4/22/2017 Who Cares about Accelerators Architecture CAD Cause: Transistors scaling Effect: Specialization & SoCs Cause: RTL design costs Effect: C-to-RTL tools Cool tool.. Actually work!! GYW

Who Cares about Accelerators Please do not distribute 4/22/2017 Who Cares about Accelerators Architecture CAD ASICs Cause: Transistors scaling Effect: Specialization & SoCs Cause: RTL design costs Effect: C-to-RTL tools Cause: Performance needs Effect: Build tuned IC Keep doing what they do H265, speech recognition GYW

Please do not distribute 4/22/2017 What’s Next Architecture CAD ASICs System Integration Composability Flexibility Cause: RTL design costs Effect: C-to-RTL tools Cause: Performance needs Effect: Build tuned IC Keep doing what they do H265, speech recognition GYW

Please do not distribute 4/22/2017 What’s Next Architecture CAD ASICs System Integration Composability Flexibility Faster Turn Around Larger App Space Complex Designs Cause: Performance needs Effect: Build tuned IC Keep doing what they do H265, speech recognition GYW

Please do not distribute 4/22/2017 What’s Next Architecture CAD ASICs System Integration Composability Flexibility Faster Turn Around Larger App Space Complex Designs Not much change Need high perf ICs H.266 Keep doing what they do H265, speech recognition GYW

Please do not distribute 4/22/2017 What’s Missing Architecture CAD ASICs System Integration Composability Flexibility Faster Turn Around Larger App Space Complex Designs Not much change Need high perf ICs H.266 Keep doing what they do H265, speech recognition Well defined specs GYW

Please do not distribute 4/22/2017 What’s Missing Architecture CAD ASICs System Integration Composability Flexibility Faster Turn Around Larger App Space Complex Designs Not much change Need high perf ICs H.266 Keep doing what they do H265, speech recognition Workload definition, common baseline Well defined specs GYW

Please do not distribute 4/22/2017 Tower of Babel Effect Big Problem. Intro: Number of benchmarks that occur 25 recent Arch CAD papers ------------- FFT: Of 25 papers only 1 used across all 8 Come back later Problem: 64 used only ONCE Want general mechanisms/solutions need standards to measure contributions. GYW

MachSuite is/has 19 application specific accelerator workloads HLS and Aladdin compatible Workloads researchers are using today Diverse workloads for app space coverage Establishes standards without stifling creativity

Why MachSuite Existing Benchmarks are not applicable/sufficient Works with Accelerator Simulators and CAD tools Representative applications covering wide space Kernel Selection Algorithm Choice Implementation Details

Why machsuite Comparing benchmarks

Existing Benchmarks are Insufficient High-Level Synthesis Is good at Crypto { AES, DES, SHA } Image/Multimedia { Stencils, JPEG, SAD} Scientific Codes { GEMM, FFT } 3 of 13 Berkeley Dwarves [CHStone, ISCAS]

Existing Benchmarks are Insufficient High-Level Synthesis Is good at Needs Improvement Crypto { AES, DES, SHA } Irregular Behavior { BFS, SPMV CRS} Image/Multimedia { Stencils, JPEG, SAD} Complex App Codes { BackProp, MD } Scientific Codes { GEMM, FFT } Application Space Coverage 3 of 13 Berkeley Dwarves [CHStone, ISCAS] 12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]

Existing Benchmarks not Applicable Many Existing GPU Benchmarks Rodinia, Parboil, SHOC.. GPU and Accelerator design spaces differ Tuned for GPU architecture Implemented in CUDA/OpenCL GPU workloads subset of accelerators

Why machsuite simulator/hls friendly

Works with Accelerator CAD Tools Functions Units Resource Sharing Loop Pipelining Memory Bandwidth Vivado HLS Directives C Code RTL (Hardware Description Language) High-Level Synthesis

Works with Simulators MachSuite

Functions Unit Selection Works with Simulators MachSuite Functions Unit Selection Loop Pipelining Memory Bandwidth Directives Trade-off Power/Performance

Why machsuite workload diversity and coverage

Incorporates Applications of Interest

Covers Application Space FFT GEMM STENCIL 12 of 13 Dwarves

MachSuite Design Existing Benchmarks are not applicable/sufficient Works with Accelerator Simulators and CAD tools Representative applications covering wide space Kernel Selection Algorithm Choice Implementation Details

Machsuite design kernel selection

Kernel Selection Kernel = A specific problem E.g: SORT

Kernel Selection Kernel = A specific problem The Problem E.g: SORT Not all using the same kernels Comparing similar sounding kernels doesn’t work Let’s just pick one

Machsuite design algorithm choice

Algorithm Choice Algorithm = A specific solution A type of kernel E.g: Merge or Radix SORT

Algorithm Choice Algorithm = A specific solution The problem A type of kernel E.g: Merge or Radix SORT The problem Reporting kernel too high level Ideal algorithms different across SoCs Standardization without limitation

Machsuite design implementation details

Implementation Details Implementation = Specific code for algorithm E.g: Stencil in Rodinia vs Parboil

Implementation Details Implementation = Specific code for algorithm E.g: Stencil in Rodinia vs Parboil The problem Can cause misleading results Performance depends on tuning Separate signal from noise

Performance Variance due to Implementation Details Please do not distribute 4/22/2017 Performance Variance due to Implementation Details 1 Kernel 1 Algorithm 1 Implementation Shows Space of possiple hardware designs.. This is a subset, there are THOUSANDS. GYW

Performance Variance due to Implementation Details Please do not distribute 4/22/2017 Performance Variance due to Implementation Details 1 Kernel 1 Algorithm 2 Implementations So we need to pick one [MS gives you] Or at least have a way to talk about changes [MS gives you] with marked up code. -> It’s like a simulator, if you change something you should report your configuration in your evaluation section. [details in tutorial] ~ 10x Performance, same power GYW

Root Causing Inefficiency Please do not distribute 4/22/2017 Root Causing Inefficiency Same directives: - Single port SRAMs - 8 way partition - Same loops pipelined Different Implementations for parallel SCAN So we need to pick one [MS gives you] Or at least have a way to talk about changes [MS gives you] with marked up code. -> It’s like a simulator, if you change something you should report your configuration in your evaluation section. [details in tutorial] GYW

Please do not distribute 4/22/2017 What Happened “Unoptimized C Code” Pipelining result: Target II: 1, Final II: 30 “Optimized C Code” Target II: 1, Final II: 8 Pareto points from design space search Each dot has same directives Only difference is C code implementation 3.75x GYW

What Happened Unoptimized C Code Please do not distribute 4/22/2017 What Happened Unoptimized C Code for i = 1 : Block for radixID : Radix bucket[i*Block+radixID ] += bucket[i*Block+ radixID-1]; Cyclic partitioning Still performing local scans serially All targeting the same “bank” Inner loop unrolled!!! GYW

Please do not distribute 4/22/2017 What Happened Optimized C Code for radixID : Radix for i = 1 : Block bucket[i*Block +radixID ] += bucket[i*Block + radixID-1]; Cyclic partitioning Now, when you pipeline loop you utilize bandwidth. Each “mini scan” pipeline gets its own bank Inner loop unrolled GYW

Please do not distribute 4/22/2017 Solution MEMORY MEMORY SCAN Accelerator SCAN Accelerator Now, sequential accesses map to different SRAMs And you can utilize the available bandwidth. GYW

Please do not distribute 4/22/2017 Solution MEMORY MEMORY SCAN Accelerator SCAN Accelerator Now, sequential accesses map to different SRAMs And you can utilize the available bandwidth. GYW

Please do not distribute 4/22/2017 Solution MEMORY MEMORY ✔ SCAN Accelerator SCAN Accelerator Now, sequential accesses map to different SRAMs And you can utilize the available bandwidth. GYW

MachSuite 19 application specific accelerator workloads Benchmarks work with HLS and Aladdin Represents workloads researchers are using Diverse workloads, broad application space Standards with limited restrictions

MachSuite Available on GitHub http://breagen.github.io/MachSuite/ Publications Aladdin: [ ISCA’14 ] MachSuite: [ IISWC’14 ] Quantifying Acceleration: [ ISLPED’13 ]