Published by Ελλεν Δαμασκηνός. Modified over 6 years ago.
1
Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
2
A Case for Customization
General purpose processors handle many applications fairly well, but…
  Each application has different requirements
  Need for efficient execution
Impressive design wins through customization
  Performance, power, area
  Up to 3.5x speedup [Hot Chips 16]
3
Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of MPY, LD, SHR, XOR, AND, and MOV operations in which a subgraph is replaced by a single CUSTOM instruction]
4
Traditional vs. Transparent Customization
Traditional
  High non-recurring engineering (NRE) costs
    Reverification of core
    Refabricate lithography masks
    Retarget software tool chain
Transparent
  “Universal” accelerator
  No ISA change
[Figure: CPU blocks for the traditional approach alongside a CPU with a Compute Accelerator (CCA) for the transparent approach]
5
Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
  Exploits subgraph parallelism
  Allows natural data propagation
[Figure: pipeline with Fetch, Issue, and WB stages; the CCA, an array of FUs fed by inputs IN 1 and IN 2, sits alongside the ALUs]
6
CCA Shape
[Figure: example subgraph from 164.gzip built from Or, And, and Mov operations]
An array of function units derived from the target ISA is the obvious structure for the compute accelerator, so what should it look like?
7
CCA Shape
[Figure: example subgraph from Blowfish built from And, Xor, Add, and Mov operations]
8
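CCA Utilization
Dynamic % of subgraphs using FU
[Figure: array of FUs, axis labels 1 to 8, each FU annotated with the dynamic % of subgraphs that use it. Values in extraction order: 100, 59.0, 22.9, 13.1, 6.5, 4.2, 0.3, 91.1, 50.6, 9.9, 4.1, 0.6, 0.2, 0.0, 57.4, 17.8, 6.3, 2.9, 0.1, 18.5, 8.3, 1.6, 8.7, 2.1, 1.2]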
9
CCA Operations
Dynamic opcodes in important subgraphs
  Excluded mpy/div, load/store, branch
  Two main categories – logicals, adds
  Subgraphs rarely have more than 3 dependent adds

Opcode    %
Add       28.7
And       12.5
Move      11.7
Sext      10.4
Lshift     9.8
Or         8.7
Xor        5.1
Sub        4.8
Rshift     2.4
Compare    0.4
10
Proposed CCA Design
4 inputs/2 outputs
Two FU types
  Arith/logic
  Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA array with inputs I1 to I4 and outputs O1 and O2]
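The 4-input/2-output limit can be checked mechanically for a candidate subgraph. The following is a minimal sketch, not the authors' tool: the `Op` record, the helper names, and the example registers (`a`, `b`, `t0`, ...) are all hypothetical, and immediates are assumed not to count as inputs.

```python
from typing import NamedTuple

MAX_INPUTS = 4   # from the slide: 4 inputs
MAX_OUTPUTS = 2  # from the slide: 2 outputs


class Op(NamedTuple):
    opcode: str
    dest: str
    srcs: tuple  # source registers only; immediates are left out (assumption)


def live_ins(ops):
    """Registers read before any op in the subgraph writes them."""
    written, ins = set(), []
    for op in ops:
        for reg in op.srcs:
            if reg not in written and reg not in ins:
                ins.append(reg)
        written.add(op.dest)
    return ins


def live_outs(ops, live_after):
    """Registers written by the subgraph that later code still needs."""
    return sorted({op.dest for op in ops} & set(live_after))


def fits_interface(ops, live_after):
    """True if the subgraph respects the 4-input/2-output limit."""
    return (len(live_ins(ops)) <= MAX_INPUTS
            and len(live_outs(ops, live_after)) <= MAX_OUTPUTS)


# Hypothetical three-op chain with four external inputs.
subgraph = [
    Op("AND", "t0", ("a", "b")),
    Op("XOR", "t1", ("t0", "c")),
    Op("ADD", "t2", ("t1", "d")),
]

print(live_ins(subgraph))
print(fits_interface(subgraph, live_after={"t2"}))
print(fits_interface(subgraph, live_after={"t0", "t1", "t2"}))
```

Whether a subgraph fits depends on which intermediate values are still needed afterwards, so the check takes the set of live registers as an argument rather than guessing it.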
11
Synthesis of CCA
Synopsys design tools, 130nm library

Depth  Configuration         Control (bits)  Delay (ns)  Cell area (mm²)  Subgraphs Supported
7      6A-4L-4A-3L-2A-2L-1L  245             5.62        0.48             99.3%
6      6A-4L-4A-3L-2A-1L     229             4.56        0.45             95.1%
5      6A-4L-4A-2L-1L        197             3.50        0.40             87.6%
4      6A-4L-3A-2L           172             3.19        0.38             81.8%
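The configuration strings in the table can be read as a row-by-row description of the array. The sketch below assumes that `6A` means a row of six arith/logic FUs and `4L` a row of four logic-only FUs, and that a value may skip over rows on its way down. Neither assumption is stated on the slide. The greedy `place` function and the example subgraphs are hypothetical illustrations, not the authors' mapping algorithm.

```python
CONFIGS = {
    7: "6A-4L-4A-3L-2A-2L-1L",
    6: "6A-4L-4A-3L-2A-1L",
    5: "6A-4L-4A-2L-1L",
    4: "6A-4L-3A-2L",
}


def parse_config(config):
    """Split '6A-4L-...' into [(width, kind), ...], first row first."""
    return [(int(tok[:-1]), tok[-1]) for tok in config.split("-")]


def place(ops, rows):
    """Greedy placement of ops (given in dependence order) onto rows.

    Each op is (name, kind, producers) with kind 'A' (needs an arith/logic
    FU) or 'L' (runs on either FU type). Returns {name: row} or None when
    some op cannot be placed.
    """
    free = [width for width, _ in rows]
    row_of = {}
    for name, kind, producers in ops:
        # An op must sit below every op that feeds it.
        start = max((row_of[p] + 1 for p in producers), default=0)
        for r in range(start, len(rows)):
            if free[r] > 0 and (kind == "L" or rows[r][1] == "A"):
                free[r] -= 1
                row_of[name] = r
                break
        else:
            return None
    return row_of


# Total FU count per depth under this reading of the strings.
fu_totals = {d: sum(w for w, _ in parse_config(c)) for d, c in CONFIGS.items()}

# Hypothetical subgraphs: an add/xor chain and four dependent adds.
chain = [("add1", "A", []), ("xor1", "L", ["add1"]),
         ("add2", "A", ["xor1"]), ("xor2", "L", ["add2"])]
four_adds = [("a1", "A", []), ("a2", "A", ["a1"]),
             ("a3", "A", ["a2"]), ("a4", "A", ["a3"])]

print(fu_totals)
print(place(chain, parse_config(CONFIGS[4])))
print(place(four_adds, parse_config(CONFIGS[7])))
```

Under these assumptions the depth-7 configuration still cannot hold four dependent adds, because it has only three arith/logic rows.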
12
CCA Utilization
Design space: selection (static or dynamic) versus realization (static or dynamic)
Dynamic selection, dynamic realization
  + No ISA change
  + No recompile
  – Simple selection
  – Hardware complexity
Static selection, dynamic realization
  + Powerful selection
  + Simple hardware
  – Some ISA change
  – Recompile necessary
Static selection, static realization (ASIPs)
  – ISA change
  – High NRE
13
Dynamic Selection – Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache
[Figure: pipeline (I-Cache, Decode, Execute, Retire) with a trace cache; trace construction is followed by subgraph selection and insertion]
Trace before replacement:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR
Trace after replacement:
  LSR r2, r2, #4
  LD r3
  CUSTOM
  SHR
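The replacement on this slide can be reproduced with a small trace rewriter. This is a sketch under stated assumptions rather than the fill-unit hardware algorithm: the `SUPPORTED` opcode set is chosen so that the result matches the slide, the `Inst` record and function names are hypothetical, memory ordering is not modeled, and the operands and control bits of `CUSTOM` are omitted as they are on the slide.

```python
from typing import NamedTuple


class Inst(NamedTuple):
    opcode: str
    dest: str = ""
    srcs: tuple = ()  # source registers; immediates omitted


# Assumption: opcodes the accelerator executes. Shifts and loads are left
# out so that the sketch reproduces the replacement shown on the slide.
SUPPORTED = {"ADD", "SUB", "AND", "OR", "XOR"}


def select_members(trace):
    """Indices of a dataflow-connected group of supported ops."""
    members, produced = [], set()
    for i, inst in enumerate(trace):
        joins = not members or any(s in produced for s in inst.srcs)
        if inst.opcode in SUPPORTED and joins:
            members.append(i)
            produced.add(inst.dest)
        else:
            produced.discard(inst.dest)  # value no longer comes from the group
    return members


def collapse(trace):
    """Replace the group with one CUSTOM op at its last member's position.

    Skipped instructions inside the group are hoisted above CUSTOM. The
    trace is returned unchanged if hoisting would change results.
    """
    members = select_members(trace)
    if len(members) < 2:
        return list(trace)
    first, last = members[0], members[-1]
    read, written = set(), set()  # registers touched by members so far
    hoisted = []
    for i in range(first, last + 1):
        inst = trace[i]
        if i in members:
            read.update(inst.srcs)
            written.add(inst.dest)
        elif written & set(inst.srcs) or inst.dest in read | written:
            # Hoisting would move this instruction ahead of a member whose
            # result it reads or whose registers it overwrites.
            return list(trace)
        else:
            hoisted.append(inst)
    return list(trace[:first]) + hoisted + [Inst("CUSTOM")] + list(trace[last + 1:])


trace = [
    Inst("ADD", "r4", ("r1",)),
    Inst("LSR", "r2", ("r2",)),
    Inst("XOR", "r5", ("r4", "r2")),
    Inst("LD", "r3"),
    Inst("ADD", "r6", ("r5", "r3")),
    Inst("XOR", "r7", ("r6", "r8")),
    Inst("SHR"),  # operands are not shown on the slide
]

print([inst.opcode for inst in collapse(trace)])
```

The hazard check is deliberately conservative: any read-after-write, write-after-read, or write-after-write conflict between a hoisted instruction and an earlier member cancels the replacement.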
14
Simulation
SimpleScalar – ARM instruction set
  4-wide execution, 1 compute accelerator
  128 RUU entries
  32k-instruction trace cache, 256-instruction traces
  5000-cycle selection/insert latency
  L1 I-cache: 32k, 2-way, 2-cycle hit
  L1 D-cache: 32k, 4-way, 2-cycle hit
15
Varying CCA Latency
[Figure: speedup (axis 1.00 to 1.45) per benchmark and on average as CCA latency varies; legend "Lat" with entries 15, 6, 4, 2, 1. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, mesamipmap, 3des, blowfish, rc4, sha]
16
Static Selection – Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
Control bits stored in a table and inserted at decode
[Figure: pipeline (I-Cache, Decode, Execute, Retire) with a control table consulted at decode. The instruction stream … ADD r4, r1, #1; LSR r2, r2, #4; XOR r5, r4, r2; LD r3; ADD r6, r5, r3; XOR r7, r6, r8; SHR is shown with LD r3 and the markers CCA_Start #2 and CCA_End]
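The table lookup at decode can be sketched as follows. Everything beyond the marker names `CCA_Start #2` and `CCA_End` is an assumption: the layout of the marked stream, the `decode` function, the tuple format of decoded ops, and the placeholder control value `"ctrl-2"` are hypothetical. The sketch also assumes the compiler has already moved unrelated instructions such as the load out of the marked region.

```python
def decode(stream, control_table):
    """Fuse each CCA_Start #k ... CCA_End region into a single CCA op.

    control_table maps the index k to the control bits that were loaded
    into the table at program load time.
    """
    out = []
    region = None  # (table index, instructions collected so far)
    for text in stream:
        if text.startswith("CCA_Start"):
            index = int(text.split("#")[1])
            region = (index, [])
        elif text == "CCA_End":
            index, body = region
            # The covered instructions are kept only for inspection.
            out.append(("CCA", control_table[index], tuple(body)))
            region = None
        elif region is not None:
            region[1].append(text)
        else:
            out.append(("INST", text))
    return out


# Hypothetical marked stream for the example subgraph.
stream = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR",
]
control_table = {2: "ctrl-2"}  # placeholder, not real control bits

for op in decode(stream, control_table):
    print(op)
```

Because the subgraph is already delimited by the compiler, the hardware side reduces to an indexed lookup, which is the "simple hardware" advantage listed for static selection.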
17
Dynamic vs. Static Selection
[Figure: speedup (axis 1.00 to 1.45) of dynamic selection versus static selection per benchmark and on average. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, mesamipmap, 3des, blowfish, rc4, sha]
18
Summary
Transparent instruction set customization
  Benefits of customization without changing ISA
Presented design of a compute accelerator
  Handles the majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
  Table-based static selection – dynamic realization
  Trace cache based dynamic selection – dynamic realization