The MachSuite Benchmark. Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks.
Who Cares about Accelerators (Please do not distribute, 4/22/2017). Architecture (cause: transistor scaling; effect: specialization and SoCs). CAD (cause: RTL design costs; effect: C-to-RTL tools). ASICs (cause: performance needs; effect: build tuned ICs). Notes: cool tools that actually work; ASICs keep doing what they do, e.g. H.265 and speech recognition.
What's Next. Architecture: system integration, composability, flexibility. CAD: faster turnaround, larger application space, complex designs. ASICs: not much change; keep doing what they do; high-performance ICs are still needed, e.g. H.266.
What's Missing. Architecture: system integration, composability, flexibility. CAD: faster turnaround, larger application space, complex designs. ASICs: not much change; high-performance ICs are still needed, e.g. H.266. Annotations: workload definition and a common baseline; well-defined specs.
Tower of Babel Effect: a big problem. The chart counts how often each benchmark occurs in 25 recent architecture and CAD papers. FFT: of 25 papers, only 1 used across all 8 (come back to this later). Problem: 64 benchmarks were used only once. We want general mechanisms and solutions, so we need standards to measure contributions.
MachSuite is/has: 19 application-specific accelerator workloads; HLS and Aladdin compatible; workloads researchers are using today; diverse workloads for application-space coverage; establishes standards without stifling creativity.
Why MachSuite: existing benchmarks are not applicable or sufficient; works with accelerator simulators and CAD tools; representative applications covering a wide space. Design: kernel selection; algorithm choice; implementation details.
Why MachSuite: Comparing Benchmarks
Existing Benchmarks are Insufficient. High-level synthesis is good at: crypto {AES, DES, SHA}; image/multimedia {stencils, JPEG, SAD}; scientific codes {GEMM, FFT}. That covers 3 of 13 Berkeley Dwarves [CHStone, ISCAS]. Needs improvement: irregular behavior {BFS, SPMV CRS}; complex application codes {BackProp, MD}. Application space coverage: 12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC].
Existing Benchmarks are not Applicable. Many GPU benchmarks exist: Rodinia, Parboil, SHOC, and others. GPU and accelerator design spaces differ: the benchmarks are tuned for GPU architecture; they are implemented in CUDA/OpenCL; GPU workloads are a subset of accelerator workloads.
Why MachSuite: Simulator/HLS Friendly
Works with Accelerator CAD Tools. Flow: C code plus directives go through high-level synthesis (Vivado HLS) to RTL (hardware description language). Directives cover: functional units, resource sharing, loop pipelining, memory bandwidth.
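To make "C code plus directives" concrete, here is a minimal sketch of what annotated C might look like. The kernel name dot_product and the size N are hypothetical and are not taken from MachSuite. The pragmas are standard Vivado HLS directives, and the factor and II values are illustrative assumptions only.

```c
#include <stddef.h>

#define N 64  /* hypothetical problem size, chosen only for illustration */

/* Hypothetical kernel (not a MachSuite benchmark): dot product of two arrays.
 * The pragmas are Vivado HLS directives. An ordinary C compiler ignores them
 * (possibly with a warning), so the function still runs as plain C. */
int dot_product(const int a[N], const int b[N])
{
/* Memory bandwidth: split each array cyclically across 8 banks. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
    int sum = 0;
    for (size_t i = 0; i < N; i++) {
/* Loop pipelining: ask the tool to start one iteration per cycle. */
#pragma HLS PIPELINE II=1
        sum += a[i] * b[i];
    }
    return sum;
}
```

The same source serves as a software reference and as HLS input; only the directives change between hardware design points.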
Works with Simulators: MachSuite. Directives: functional unit selection, loop pipelining, memory bandwidth. Trade-off: power/performance.
Why MachSuite: Workload Diversity and Coverage
Incorporates Applications of Interest
Covers Application Space: FFT, GEMM, STENCIL; 12 of 13 Dwarves.
MachSuite Design. Why: existing benchmarks are not applicable or sufficient; works with accelerator simulators and CAD tools; representative applications covering a wide space. Design: kernel selection; algorithm choice; implementation details.
MachSuite Design: Kernel Selection
Kernel Selection. Kernel = a specific problem, e.g. SORT. The problem: not everyone uses the same kernels, and comparing similar-sounding kernels doesn't work. Let's just pick one.
MachSuite Design: Algorithm Choice
Algorithm Choice. Algorithm = a specific solution, a type of kernel, e.g. merge or radix SORT. The problem: reporting only the kernel is too high level, and the ideal algorithm differs across SoCs. Standardization without limitation.
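To illustrate the kernel/algorithm distinction, the sketch below solves the same SORT kernel with two different algorithms. These are textbook versions written for this note, not the MachSuite sources; the function names and the 8-bit radix digit are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Algorithm 1: bottom-up merge sort (comparison based). */
void merge_sort(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return;  /* allocation failed: leave input untouched */
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = lo + width < n ? lo + width : n;
            size_t hi = lo + 2 * width < n ? lo + 2 * width : n;
            size_t i = lo, j = mid, k = lo;
            /* Merge the sorted runs [lo, mid) and [mid, hi) into tmp. */
            while (i < mid && j < hi)
                tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi) tmp[k++] = a[j++];
        }
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

/* Algorithm 2: least-significant-digit radix sort, 8 bits per pass. */
void radix_sort(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return;  /* allocation failed: leave input untouched */
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* Histogram of the current digit, offset by one slot. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* Prefix sum (scan): count[d] becomes the first output slot of digit d. */
        for (size_t d = 0; d < 256; d++)
            count[d + 1] += count[d];
        /* Stable scatter into tmp by digit. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

Both functions produce the same sorted array, yet their memory access patterns and control flow differ a lot, which is why reporting only "SORT" says little about the hardware being evaluated.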
MachSuite Design: Implementation Details
Implementation Details. Implementation = specific code for an algorithm, e.g. Stencil in Rodinia vs. Parboil. The problem: implementation choices can cause misleading results, and performance depends on tuning. Separate signal from noise.
Performance Variance due to Implementation Details: 1 kernel, 1 algorithm, 1 implementation. The plot shows the space of possible hardware designs. This is a subset; there are thousands.
Performance Variance due to Implementation Details: 1 kernel, 1 algorithm, 2 implementations; roughly 10x performance at the same power. So we need to pick one implementation [MachSuite gives you this], or at least have a way to talk about changes [MachSuite gives you this] with marked-up code. It's like a simulator: if you change something, you should report your configuration in your evaluation section [details in tutorial].
Root-Causing Inefficiency. Same directives: single-port SRAMs; 8-way partition; same loops pipelined. Different implementations of parallel SCAN.
What Happened. "Unoptimized C Code" pipelining result: Target II: 1, Final II: 30. "Optimized C Code": Target II: 1, Final II: 8 (3.75x). The dots are Pareto points from a design space search. Each dot has the same directives; the only difference is the C code implementation.
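To see why the final initiation interval (II) matters, a common first-order model puts the cycle count of a pipelined loop at depth + II * (trip_count - 1). The sketch below encodes that model. The function name, the pipeline depth, and the trip count are hypothetical; only the II values 30 and 8 come from the slide.

```c
/* First-order cycle estimate for a pipelined loop: the first iteration
 * takes `depth` cycles to drain through the pipeline, and every later
 * iteration starts `ii` cycles after the one before it.
 * All parameter values used with this function are illustrative. */
unsigned long pipelined_cycles(unsigned long trip_count,
                               unsigned long ii,
                               unsigned long depth)
{
    if (trip_count == 0)
        return 0;
    return depth + ii * (trip_count - 1);
}
```

For long loops the depth term becomes negligible, so the ratio between the two versions approaches 30 / 8 = 3.75. For example, comparing pipelined_cycles(1024, 30, 10) with pipelined_cycles(1024, 8, 10) gives a ratio just under 3.75.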
What Happened: Unoptimized C Code
    for i = 1 : Block
        for radixID : Radix
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning; inner loop unrolled. Still performing local scans serially, all targeting the same "bank".
What Happened: Optimized C Code
    for radixID : Radix
        for i = 1 : Block
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning; inner loop unrolled. Now, when you pipeline the loop you utilize the bandwidth: each "mini scan" pipeline gets its own bank.
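The loop interchange can be sketched in plain C as follows. This is an illustration, not the MachSuite source: it assumes each radix's local scan occupies a contiguous run of BLOCK entries (index radixID * BLOCK + i), which is an assumption of this sketch and not the slide's exact index expression. The function names and the RADIX and BLOCK sizes are hypothetical.

```c
#define RADIX 4  /* hypothetical number of independent local scans */
#define BLOCK 8  /* hypothetical number of entries per local scan */

/* Position in the outer loop, scan ID in the inner loop:
 * consecutive inner iterations touch addresses BLOCK apart. */
void local_scan_position_outer(int bucket[RADIX * BLOCK])
{
    for (int i = 1; i < BLOCK; i++)
        for (int radixID = 0; radixID < RADIX; radixID++)
            bucket[radixID * BLOCK + i] += bucket[radixID * BLOCK + i - 1];
}

/* Scan ID in the outer loop, position in the inner loop:
 * consecutive inner iterations touch adjacent addresses. */
void local_scan_scan_outer(int bucket[RADIX * BLOCK])
{
    for (int radixID = 0; radixID < RADIX; radixID++)
        for (int i = 1; i < BLOCK; i++)
            bucket[radixID * BLOCK + i] += bucket[radixID * BLOCK + i - 1];
}
```

Under this layout the local scans are independent of each other, so both loop orders compute identical prefix sums. Only the order of memory accesses changes, and that order is what the memory partitioning interacts with.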
Solution. Figure: two MEMORY / SCAN Accelerator diagrams, one marked ✔. Now, sequential accesses map to different SRAMs, and you can utilize the available bandwidth.
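A small model of cyclic partitioning shows why sequential accesses spread across SRAMs. It assumes the usual definition, where element k is stored in bank k mod NUM_BANKS, and uses the 8-way partition from the earlier slide. The helper names bank_of and distinct_banks are hypothetical.

```c
#define NUM_BANKS 8  /* matches the 8-way partition named on the slides */

/* Cyclic partitioning: element k lives in bank k mod NUM_BANKS. */
unsigned bank_of(unsigned index)
{
    return index % NUM_BANKS;
}

/* Count the distinct banks touched by `count` accesses that begin at
 * `start` and advance by `stride` elements each time. */
unsigned distinct_banks(unsigned start, unsigned stride, unsigned count)
{
    unsigned seen = 0;  /* bitmask with one bit per bank */
    unsigned n = 0;
    for (unsigned k = 0; k < count; k++) {
        unsigned bit = 1u << bank_of(start + k * stride);
        if (!(seen & bit)) {
            seen |= bit;
            n++;
        }
    }
    return n;
}
```

With unit stride, eight consecutive accesses land in eight different banks and can proceed in parallel. With a stride that is a multiple of NUM_BANKS, all eight land in one bank and serialize on its single port.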
MachSuite: 19 application-specific accelerator workloads; benchmarks work with HLS and Aladdin; represents workloads researchers are using; diverse workloads, broad application space; standards with limited restrictions.
MachSuite is available on GitHub: http://breagen.github.io/MachSuite/ Publications: Aladdin [ISCA’14]; MachSuite [IISWC’14]; Quantifying Acceleration [ISLPED’13].