1
The MachSuite Benchmark
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
2
Who Cares about Accelerators
Please do not distribute 4/22/2017
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
Cool tool.. Actually work!!
GYW
3
Who Cares about Accelerators
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
Cool tool.. Actually work!!
4
Who Cares about Accelerators
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
ASICs. Cause: performance needs. Effect: build tuned IC.
Keep doing what they do: H.265, speech recognition.
5
What’s Next
Architecture: system integration, composability, flexibility.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
ASICs. Cause: performance needs. Effect: build tuned IC.
Keep doing what they do: H.265, speech recognition.
6
What’s Next
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs. Cause: performance needs. Effect: build tuned IC.
Keep doing what they do: H.265, speech recognition.
7
What’s Next
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-perf ICs, H.266.
Keep doing what they do: H.265, speech recognition.
8
What’s Missing
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-perf ICs, H.266.
Keep doing what they do: H.265, speech recognition.
Well-defined specs.
9
What’s Missing
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-perf ICs, H.266.
Keep doing what they do: H.265, speech recognition.
Workload definition, common baseline.
Well-defined specs.
10
Tower of Babel Effect
Big problem.
Intro: number of benchmarks that occur in 25 recent Arch/CAD papers.
FFT: of 25 papers only 1 used across all 8. Come back later.
Problem: 64 used only ONCE.
Want general mechanisms/solutions; need standards to measure contributions.
11
MachSuite is/has
19 application-specific accelerator workloads
HLS and Aladdin compatible
Workloads researchers are using today
Diverse workloads for app space coverage
Establishes standards without stifling creativity
12
Why MachSuite
Existing benchmarks are not applicable/sufficient
Works with accelerator simulators and CAD tools
Representative applications covering wide space
Kernel selection
Algorithm choice
Implementation details
13
Why MachSuite: Comparing Benchmarks
14
Existing Benchmarks are Insufficient
High-Level Synthesis is good at:
Crypto { AES, DES, SHA }
Image/Multimedia { Stencils, JPEG, SAD }
Scientific Codes { GEMM, FFT }
3 of 13 Berkeley Dwarves [CHStone, ISCAS]
15
Existing Benchmarks are Insufficient
High-Level Synthesis is good at:
Crypto { AES, DES, SHA }
Image/Multimedia { Stencils, JPEG, SAD }
Scientific Codes { GEMM, FFT }
High-Level Synthesis needs improvement:
Irregular Behavior { BFS, SPMV CRS }
Complex App Codes { BackProp, MD }
Application space coverage:
3 of 13 Berkeley Dwarves [CHStone, ISCAS]
12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]
16
Existing Benchmarks not Applicable
Many existing GPU benchmarks: Rodinia, Parboil, SHOC..
GPU and accelerator design spaces differ
Tuned for GPU architecture
Implemented in CUDA/OpenCL
GPU workloads are a subset of accelerator workloads
17
Why MachSuite: Simulator/HLS Friendly
18
Works with Accelerator CAD Tools
High-Level Synthesis (Vivado HLS): C code + directives -> RTL (hardware description language)
Directives: functional units, resource sharing, loop pipelining, memory bandwidth
19
Works with Simulators
MachSuite
20
Works with Simulators
MachSuite
Directives: functional unit selection, loop pipelining, memory bandwidth
Trade-off: power/performance
21
Why MachSuite: Workload Diversity and Coverage
22
Incorporates Applications of Interest
23
Covers Application Space
FFT, GEMM, STENCIL
12 of 13 Dwarves
24
MachSuite Design
Existing benchmarks are not applicable/sufficient
Works with accelerator simulators and CAD tools
Representative applications covering wide space
Kernel selection
Algorithm choice
Implementation details
25
MachSuite Design: Kernel Selection
26
Kernel Selection
Kernel = a specific problem. E.g.: SORT
27
Kernel Selection
Kernel = a specific problem. E.g.: SORT
The problem:
Not all using the same kernels
Comparing similar-sounding kernels doesn’t work
Let’s just pick one
28
MachSuite Design: Algorithm Choice
29
Algorithm Choice
Algorithm = a specific solution, a type of kernel. E.g.: merge or radix SORT
30
Algorithm Choice
Algorithm = a specific solution, a type of kernel. E.g.: merge or radix SORT
The problem:
Reporting kernel too high level
Ideal algorithms different across SoCs
Standardization without limitation
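The kernel/algorithm distinction can be made concrete with a small sketch: one kernel (SORT) and two algorithms (merge sort and radix sort) that both solve it. This is only an illustration in Python, not MachSuite code; the function names and the 4-bit digit width are assumptions chosen for the example.

```python
from typing import Callable, Dict, List


def merge_sort(values: List[int]) -> List[int]:
    """Comparison-based algorithm for the SORT kernel."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    # Merge the two sorted halves.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def radix_sort(values: List[int], bits_per_pass: int = 4) -> List[int]:
    """Digit-by-digit algorithm for the SORT kernel (non-negative ints only)."""
    out = list(values)
    if not out:
        return out
    radix = 1 << bits_per_pass
    max_value = max(out)
    shift = 0
    # One stable bucketing pass per digit, least significant digit first.
    while (max_value >> shift) > 0:
        buckets: List[List[int]] = [[] for _ in range(radix)]
        for v in out:
            buckets[(v >> shift) & (radix - 1)].append(v)
        out = [v for bucket in buckets for v in bucket]
        shift += bits_per_pass
    return out


# Same kernel, two algorithms: a result reported only as "SORT" hides which one ran.
SORT_ALGORITHMS: Dict[str, Callable[[List[int]], List[int]]] = {
    "merge": merge_sort,
    "radix": radix_sort,
}
```

Both entries return the same sorted output, yet their memory access patterns and parallelism differ, which is why naming only the kernel is too coarse when comparing accelerators.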
31
MachSuite Design: Implementation Details
32
Implementation Details
Implementation = Specific code for algorithm E.g: Stencil in Rodinia vs Parboil
33
Implementation Details
Implementation = specific code for algorithm. E.g.: Stencil in Rodinia vs Parboil
The problem:
Can cause misleading results
Performance depends on tuning
Separate signal from noise
34
Performance Variance due to Implementation Details
1 Kernel, 1 Algorithm, 1 Implementation
Shows the space of possible hardware designs. This is a subset; there are THOUSANDS.
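A design space like this is usually summarized by its Pareto-optimal points (the talk refers to Pareto points a few slides later). The sketch below filters a list of (power, execution time) pairs down to that frontier. It is a generic illustration, not the tool flow used for the slides, and any numbers passed to it would be hypothetical.

```python
from typing import List, Tuple

# (power, execution_time); lower is better on both axes. Units are arbitrary.
Design = Tuple[float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs not dominated by any other design.

    A design is dominated when another design is no worse on both axes
    and strictly better on at least one.
    """
    front: List[Design] = []
    best_time = float("inf")
    # Sort by power, then time; a point survives only if it is faster
    # than every point that uses the same or less power.
    for power, time in sorted(set(designs)):
        if time < best_time:
            front.append((power, time))
            best_time = time
    return front
```

For example, calling `pareto_front` on a handful of made-up design points drops every point that is both slower and at least as power-hungry as some other point, leaving the curve one would compare across implementations.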
35
Performance Variance due to Implementation Details
1 Kernel, 1 Algorithm, 2 Implementations
~10x performance, same power
So we need to pick one [MS gives you]. Or at least have a way to talk about changes [MS gives you] with marked-up code. -> It’s like a simulator: if you change something, you should report your configuration in your evaluation section. [details in tutorial]
36
Root Causing Inefficiency
Same directives:
- Single-port SRAMs
- 8-way partition
- Same loops pipelined
Different implementations for parallel SCAN
37
What Happened
“Unoptimized C Code” pipelining result: Target II: 1, Final II: 30
“Optimized C Code” pipelining result: Target II: 1, Final II: 8
3.75x
Pareto points from design space search
Each dot has same directives
Only difference is C code implementation
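The 3.75x figure matches the ratio of the two final initiation intervals (30 / 8). The sketch below works through that arithmetic with the standard pipelined-loop cycle estimate, depth + II * (trip_count - 1). Reading the slide's number as an II ratio is an assumption, and the trip count and pipeline depth used in the usage note are hypothetical.

```python
def pipelined_loop_cycles(trip_count: int, ii: int, depth: int) -> int:
    """Estimate cycles for a pipelined loop.

    The first iteration takes `depth` cycles to drain the pipeline; every
    later iteration starts `ii` cycles after the previous one.
    """
    if trip_count <= 0:
        return 0
    return depth + ii * (trip_count - 1)


def ii_speedup(ii_slow: int, ii_fast: int) -> float:
    """Asymptotic speedup for long loops, where the II term dominates."""
    return ii_slow / ii_fast
```

With the slide's values, `ii_speedup(30, 8)` is 3.75. For a finite loop, for instance a hypothetical trip count of 1024 and depth of 40, the ratio of the two `pipelined_loop_cycles` results is slightly below that and approaches 3.75 as the trip count grows.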
38
What Happened
Unoptimized C Code
for i = 1 : Block
    for radixID : Radix
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning
Inner loop unrolled!!!
Still performing local scans serially
All targeting the same “bank”
39
What Happened
Optimized C Code
for radixID : Radix
    for i = 1 : Block
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning
Inner loop unrolled
Now, when you pipeline the loop, you utilize bandwidth.
Each “mini scan” pipeline gets its own bank
40
Solution
[Figure: two MEMORY / SCAN Accelerator configurations]
Now, sequential accesses map to different SRAMs, and you can utilize the available bandwidth.
41
Solution
[Figure: two MEMORY / SCAN Accelerator configurations]
Now, sequential accesses map to different SRAMs, and you can utilize the available bandwidth.
42
Solution
[Figure: two MEMORY / SCAN Accelerator configurations, one marked ✔]
Now, sequential accesses map to different SRAMs, and you can utilize the available bandwidth.
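Why sequential accesses matter under cyclic partitioning can be seen with a small bank model. The sketch assumes the usual cyclic mapping (element address modulo the number of banks) and the single-port, 8-way setup mentioned earlier in the talk. It is a simplified illustration, not output from an HLS tool, and the helper names are made up for this example.

```python
from collections import Counter
from typing import Iterable, List


def cyclic_bank(address: int, num_banks: int) -> int:
    """Cyclic partitioning: consecutive addresses land in consecutive banks."""
    return address % num_banks


def strided_addresses(start: int, stride: int, count: int) -> List[int]:
    """Addresses touched by `count` parallel accesses with a fixed stride."""
    return [start + k * stride for k in range(count)]


def cycles_for_group(addresses: Iterable[int], num_banks: int,
                     ports_per_bank: int = 1) -> int:
    """Cycles needed to serve one group of accesses issued together.

    Accesses to different banks proceed in parallel; accesses to the same
    bank are serialized over its ports.
    """
    per_bank = Counter(cyclic_bank(a, num_banks) for a in addresses)
    if not per_bank:
        return 0
    # Ceiling division: the busiest bank decides the latency of the group.
    return max(-(-count // ports_per_bank) for count in per_bank.values())
```

With 8 single-port banks, eight unit-stride accesses (`strided_addresses(0, 1, 8)`) hit eight different banks and complete together, while eight accesses with stride 8 all map to one bank and are served one after another.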
43
MachSuite
19 application-specific accelerator workloads
Benchmarks work with HLS and Aladdin
Represents workloads researchers are using
Diverse workloads, broad application space
Standards with limited restrictions
44
MachSuite
Available on GitHub: http://breagen.github.io/MachSuite/
Publications:
Aladdin [ISCA’14]
MachSuite [IISWC’14]
Quantifying Acceleration [ISLPED’13]