Published by Blaze Little. Modified over 9 years ago.
1
VEAL: Virtualized Execution Accelerator for Loops
Nate Clark (1), Amir Hormati (2), Scott Mahlke (2)
(1) Georgia Tech, (2) U. Michigan
2
How to Get Efficiency?
Microarchitecture changes
Multi-/many-core
Heterogeneity
[Figure labels: Core2 Duo, STI Cell]
3
How is Heterogeneity Used?
Control statically placed in binary
[Figure labels: Engineer/Compiler, Program, GPP, Hetero.]
4
Problem With Static Control
Not forward/backward compatible
[Figure labels: Engineer/Compiler, Program, CPU, Hetero. (two CPU/Hetero. pairs)]
5
Solution: Virtualization
Abstract accelerator features
–Reexamine compiler algorithms
Key: do the hard stuff offline
[Figure labels: Engineer/Compiler, Program, Dyn Comp., CPU, Hetero.; phases: Offline, Online]
6
This Paper:
Examines loops as a heterogeneity target
–ASICs often implement loops
Designs a generalized loop accelerator
–Not covered in this talk
Explores how to virtualize loop accelerators
–I.e., abstract the accelerator interface
7
Loop Accelerator Template
8
Why More Efficient Than GPP?
Simple control flow
Decoupled memory accesses
I-Cache unnecessary
Customize execution resources for loops
9
Proposed Loop Accelerator
1 CCA
2 Int units
16 regs
Memory (4x)
–16 input streams
–8 output streams
0.8 mm², 90 nm
10
Modulo Scheduling
+ High-quality software pipelining technique
+ Simple control structure (low HW cost)
- Can be slow, i.e., hard to do dynamically
- Loop restrictions: no side exits, no while loops, must be if-convertible
11
Benchmark Execution Time
12
Modulo Scheduling Basics
[Figure labels: Kernel, FU, C]
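The kernel idea on this slide can be illustrated with a modulo reservation table, the bookkeeping structure commonly used in modulo scheduling. This is a minimal sketch, not the scheduler from the talk: the class name, the method names, and the unit counts (1 CCA, 2 Int) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModuloReservationTable:
    """Tracks busy function-unit slots, indexed by cycle mod II (hypothetical helper)."""
    ii: int      # initiation interval: cycles between successive iteration starts
    units: dict  # assumed resource counts, e.g. {"CCA": 1, "Int": 2}
    used: dict = field(default_factory=dict)

    def is_free(self, unit: str, cycle: int) -> bool:
        # Every iteration replays the same kernel, so only cycle mod II matters.
        slot = (unit, cycle % self.ii)
        return self.used.get(slot, 0) < self.units[unit]

    def reserve(self, unit: str, cycle: int) -> None:
        if not self.is_free(unit, cycle):
            raise ValueError(f"{unit} is full at kernel slot {cycle % self.ii}")
        slot = (unit, cycle % self.ii)
        self.used[slot] = self.used.get(slot, 0) + 1


mrt = ModuloReservationTable(ii=3, units={"CCA": 1, "Int": 2})
mrt.reserve("CCA", 0)
# Cycle 3 wraps onto kernel slot 0, which the single CCA already occupies.
conflict = not mrt.is_free("CCA", 3)
```

Indexing by `cycle % ii` is what keeps the control structure simple: the hardware only needs to step through II kernel slots repeatedly.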
13
Modulo Scheduling Example
1. CCA Mapping
2. II Calculation
3. Priority
4. Scheduling
5. Reg. assignment/communication
[Figure: schedule table with CCA and Int columns over Time 0, 1, 2; priority groups 2, 4, 6; 3, 5; 7]
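The II calculation step can be sketched with the standard lower bounds from the modulo scheduling literature: a resource bound (ResMII) and a recurrence bound (RecMII). This is an illustrative sketch under assumptions, not the code measured in the talk; the function names, the op counts, and the single recurrence are hypothetical.

```python
from math import ceil


def res_mii(op_counts: dict, unit_counts: dict) -> int:
    """Resource bound: the most oversubscribed unit class sets the pace."""
    return max((ceil(op_counts[u] / unit_counts[u]) for u in op_counts), default=1)


def rec_mii(recurrences: list) -> int:
    """Recurrence bound over (total_latency, iteration_distance) dependence cycles."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(op_counts: dict, unit_counts: dict, recurrences: list) -> int:
    # The scheduler starts at this II and increases it if scheduling fails.
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Hypothetical loop: 1 CCA op and 5 integer ops on 1 CCA + 2 Int units,
# with one dependence cycle of latency 2 that spans 1 iteration.
ii = min_ii({"CCA": 1, "Int": 5}, {"CCA": 1, "Int": 2}, [(2, 1)])
```

Here the integer units are the bottleneck (5 ops on 2 units needs 3 cycles), so the minimum II is 3 even though the recurrence alone would allow 2.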
14
Measured Scheduling Overhead
70% priority, 19% CCA
15
Supporting Hybrid Compilation
Original loop:
Loop: 1 ld / 2 add / 3 sub / and sub xor / 5 or / 6 or / 7 add / 8 str
With the CCA subgraph outlined:
Loop: 1 ld / 2 add / 3 sub / 4 brl CCA / 5 or / 6 or / 7 add / 8 str
CCA: and / sub / xor / ret
With added data:
Data: 0 1 4 6 3 …
Loop: 1 ld / 2 add / 3 sub / 4 brl CCA / 5 or …
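One way to read the listing above: the offline compiler outlines the CCA subgraph behind a `brl CCA` call, so an online translator can map the whole subgraph to a CCA when one exists, or fall back to the individual operations otherwise. The sketch below illustrates that reading only. The tuple encoding, the `translate` function, and the `Mem` unit name are assumptions, not part of VEAL's actual interface.

```python
# Hypothetical encoding of the slide's listing: (op_id, mnemonic, call_target).
LOOP = [(1, "ld", None), (2, "add", None), (3, "sub", None),
        (4, "brl", "CCA"),
        (5, "or", None), (6, "or", None), (7, "add", None), (8, "str", None)]

# Outlined subgraph reached through "brl CCA"; "ret" returns to the loop.
FUNCTIONS = {"CCA": ["and", "sub", "xor", "ret"]}


def translate(loop, functions, has_cca: bool):
    """Return (op_id, unit, ops) tuples for a target with or without a CCA."""
    out = []
    for op_id, mnemonic, target in loop:
        if mnemonic == "brl" and target in functions:
            body = [m for m in functions[target] if m != "ret"]
            if has_cca:
                # The subgraph was identified offline, so map it as one CCA op.
                out.append((op_id, "CCA", body))
            else:
                # No CCA on this target: issue the ops individually.
                out.extend((op_id, "Int", [m]) for m in body)
        else:
            unit = "Mem" if mnemonic in ("ld", "str") else "Int"
            out.append((op_id, unit, [mnemonic]))
    return out


with_cca = translate(LOOP, FUNCTIONS, has_cca=True)
without_cca = translate(LOOP, FUNCTIONS, has_cca=False)
```

Because the expensive subgraph search already happened offline, the online step reduces to a lookup on the call target, which fits the talk's point about doing the hard work offline.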
16
Speedups
17
Summary
Virtualization key to heterogeneity
VEAL speedup: 2.54
–2.63 w/o translation (i.e., not binary compatible)
–2.17 fully dynamic
CCA and priority: 89% overhead
–mpeg2dec 2.1 vs. 1.15
18
Thank you! Questions?
http://www.cc.gatech.edu/~ntclark
http://cccp.eecs.umich.edu/