Published by Lee Allread. Modified over 9 years ago.
1
University of Michigan Electrical Engineering and Computer Science
Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
2
A Case for Customization
General purpose processors handle many applications fairly well, but…
► Each application has different requirements
► Need for efficient execution
Impressive design wins through customization
► Performance, power, area
► Up to 3.5x speedup [Hot Chips 16]
3
Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of XOR, MPY, LD, SHR, MOV, and AND operations, with a subgraph replaced by a single CUSTOM instruction]
4
Traditional vs. Transparent Customization
Traditional
► High Non-Recurring Engineering costs (NRE)
Transparent
► “Universal” accelerator
► No ISA change
[Figure: a traditional customized CPU next to a CPU with a Compute Accelerator (CCA)]
5
Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
► Exploits subgraph parallelism
► Allows natural data propagation
[Figure: pipeline (Fetch, Issue, …, WB) with the CCA placed alongside the ALU; the CCA is an array of FUs fed by inputs IN 1, IN 2, …]
6
CCA Shape: 164.gzip
[Figure: subgraphs from 164.gzip, built from Or, And, and Mov operations, laid out on the FU array]
7
CCA Shape: Blowfish
[Figure: subgraphs from Blowfish, built from And, Xor, Add, and Mov operations, laid out on the FU array]
8
CCA Utilization
Dynamic % of subgraphs using FU

Row \ Col   1      2      3      4      5     6     7
1           100    59.0   22.9   13.1   6.5   4.2   0.3
2           91.1   50.6   9.9    4.1    0.6   0.2   0.0
3           57.4   17.8   6.3    2.9    0.1   0.0
4           18.5   8.3    1.6    0.1    0.0
5           8.7    2.1    0.1    0.0
6           2.1    1.2    0.1    0.0
7           1.2    0.1    0.0
8           0.1    0.0
9
CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories – logicals, adds
Subgraphs rarely have more than 3 dependent adds

Opcode     %
Add        28.7
And        12.5
Move       11.7
Sext       10.4
Lshift     9.8
Or         8.7
Xor        5.1
Sub        4.8
Rshift     2.4
Compare    0.4
10
Proposed CCA Design
4 inputs/2 outputs
Two FU types
► Arith/logic
► Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA schematic with inputs I1 to I4 and outputs O1 and O2]
11
Synthesis of CCA
Synopsys design tools, 130nm library

Depth   Configuration           Control (bits)   Delay (ns)   Cell area (mm²)   Subgraphs supported
7       6A-4L-4A-3L-2A-2L-1L    245              5.62         0.48              99.3%
6       6A-4L-4A-3L-2A-1L       229              4.56         0.45              95.1%
5       6A-4L-4A-2L-1L          197              3.50         0.40              87.6%
4       6A-4L-3A-2L             172              3.19         0.38              81.8%
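The configuration strings in the synthesis table can be read as one row of function units per field. The sketch below parses such a string and greedily places a small subgraph onto the rows. It rests on assumptions that are not stated on the slides: that "A" denotes an arith/logic row and "L" a logic-only row, that adds and subtracts need an "A" row while And/Or/Xor run on either kind, and that a value may skip rows to reach a later one. The names `parse_config` and `place` are illustrative only, and the greedy placement is not the authors' mapping algorithm.

```python
import re

# Assumed opcode classes (not taken from the slides).
ARITH_OPS = {"add", "sub"}        # assumed to need an arith/logic ("A") row
LOGIC_OPS = {"and", "or", "xor"}  # assumed to run on either row type


def parse_config(config):
    """Turn e.g. '6A-4L-3A-2L' into [(6, 'A'), (4, 'L'), (3, 'A'), (2, 'L')]."""
    rows = []
    for field in config.split("-"):
        match = re.fullmatch(r"(\d+)([AL])", field)
        if match is None:
            raise ValueError(f"bad row spec: {field!r}")
        rows.append((int(match.group(1)), match.group(2)))
    return rows


def place(subgraph, rows):
    """Greedily assign each node to the earliest legal row.

    subgraph: list of (name, opcode, predecessor names) in topological order.
    Returns {name: row index}, or None if some node cannot be placed.
    """
    free = [count for count, _ in rows]
    row_of = {}
    for name, opcode, preds in subgraph:
        # A node must sit strictly below every predecessor inside the subgraph.
        row = max((row_of[p] + 1 for p in preds if p in row_of), default=0)
        while row < len(rows):
            kind = rows[row][1]
            supported = opcode in LOGIC_OPS or (opcode in ARITH_OPS and kind == "A")
            if supported and free[row] > 0:
                break
            row += 1
        else:
            return None  # ran out of rows (or the opcode is unsupported)
        free[row] -= 1
        row_of[name] = row
    return row_of


# Three dependent adds: each add has to wait for the next "A" row.
three_adds = [("a1", "add", []), ("a2", "add", ["a1"]), ("a3", "add", ["a2"])]
depth4 = parse_config("6A-4L-3A-2L")
depth7 = parse_config("6A-4L-4A-3L-2A-2L-1L")
print(place(three_adds, depth4))  # does not fit under these assumptions
print(place(three_adds, depth7))
```

Under these assumptions a chain of three dependent adds needs three "A" rows, which ties the row configuration to the earlier observation that subgraphs rarely have more than 3 dependent adds.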
12
CCA Utilization
Design space: subgraph selection (static or dynamic) and realization (static or dynamic)
Static selection, static realization: ASIPs
– ISA change
– High NRE
Static selection, dynamic realization
+ Powerful selection
+ Simple hardware
– Some ISA change
– Recompile necessary
Dynamic selection, dynamic realization
+ No ISA change
+ No recompile
– Simple selection
– Hardware complexity
13
Dynamic Selection – Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache
[Figure: pipeline with I-Cache, Trace Cache, Decode, Execute, and Retire; Trace Construction and Subgraph Selection and Insertion fill the Trace Cache]
Original trace:
    …
    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR …
Trace after subgraph replacement:
    …
    LSR r2, r2, #4
    LD r3
    CUSTOM
    SHR …
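The replacement shown on this slide can be sketched in a few lines. This is a simplified model, not the hardware algorithm: the `Inst` tuple, the opcode-filter `select` function, and the hazard checks in `collapse` are all assumptions made for illustration. Register operands for `LD r3` and `SHR` are elided on the slide, so the sketch assumes only that `LD` writes `r3` and records nothing for `SHR`.

```python
from typing import NamedTuple


class Inst(NamedTuple):
    text: str     # instruction as printed on the slide
    dests: tuple  # registers written
    srcs: tuple   # registers read


CCA_OPCODES = {"ADD", "XOR"}  # assumption: opcodes the accelerator accepts


def select(trace):
    """Naive stand-in for subgraph selection: pick every supported opcode."""
    return [i for i, inst in enumerate(trace) if inst.text.split()[0] in CCA_OPCODES]


def collapse(trace, members):
    """Replace the member instructions with one CUSTOM at the last member's slot.

    Non-member instructions keep their relative order. The rewrite is rejected
    if a non-member that now runs before CUSTOM depends on an earlier member.
    """
    if not members:
        return [inst.text for inst in trace]
    member_set, last = set(members), max(members)
    written, read = set(), set()  # registers touched by members seen so far
    out = []
    for i, inst in enumerate(trace):
        if i in member_set:
            written.update(inst.dests)
            read.update(inst.srcs)
            if i == last:
                out.append("CUSTOM")
            continue
        if i < last:
            if written & set(inst.srcs):
                raise ValueError(f"{inst.text!r} reads a value the subgraph produces")
            if (written | read) & set(inst.dests):
                raise ValueError(f"{inst.text!r} overwrites a register the subgraph uses")
        out.append(inst.text)
    return out


trace = [
    Inst("ADD r4, r1, #1", ("r4",), ("r1",)),
    Inst("LSR r2, r2, #4", ("r2",), ("r2",)),
    Inst("XOR r5, r4, r2", ("r5",), ("r4", "r2")),
    Inst("LD r3", ("r3",), ()),   # address operands elided on the slide
    Inst("ADD r6, r5, r3", ("r6",), ("r5", "r3")),
    Inst("XOR r7, r6, r8", ("r7",), ("r6", "r8")),
    Inst("SHR", (), ()),          # operands elided on the slide
]
print(collapse(trace, select(trace)))
# → ['LSR r2, r2, #4', 'LD r3', 'CUSTOM', 'SHR']
```

Placing `CUSTOM` at the position of the last member keeps `LSR` and `LD` ahead of it, which matches the ordering in the rewritten trace on the slide.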
14
Simulation
SimpleScalar – ARM instruction set
► 4-wide execution, 1 compute accelerator
► 128 RUU entries
► 32k inst. trace cache, 256 inst. traces
► 5000 cycle selection/insert latency
► L1 I-cache: 32k, 2 way, 2 cycle hit
► L1 D-cache: 32k, 4 way, 2 cycle hit
15
Varying CCA Latency
[Figure: bar chart of speedup (y-axis 1.00 to 1.45) per benchmark, one bar each for CCA latency (Lat) 6, 4, 2, and 1]
► SPECint: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf
► MediaBench: cjpeg, djpeg, epic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, unepic
► Encryption: 3des, blowfish, rc4, sha
► Average
16
Static Selection – Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
► Control bits stored in a table and inserted at decode
[Figure: pipeline with I-Cache, Decode, Execute, and Retire; a Control Table feeds Decode]
Original code:
    …
    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR …
Code after compiler selection:
    …
    LSR r2, r2, #4
    LD r3
    CCA_Start #2
    ADD r4, r1, #1
    XOR r5, r4, r2
    ADD r6, r5, r3
    XOR r7, r6, r8
    CCA_End
    SHR …
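One way to picture the table lookup at decode is the sketch below. It assumes that the immediate on `CCA_Start` is an index into the control table, that the decoder emits a single accelerator operation in place of the delimited region, and that a missing table entry falls back to the original instructions. None of these details is confirmed by the slides, and the control-bit value in the example is made up.

```python
def decode(stream, control_table):
    """Replace each CCA_Start ... CCA_End region with one accelerator op.

    stream: list of instruction strings.
    control_table: {index: control bits}, assumed to be filled at load time.
    """
    out = []
    it = iter(stream)
    for inst in it:
        if not inst.startswith("CCA_Start"):
            out.append(inst)
            continue
        index = int(inst.split("#")[1])  # assumed: immediate is the table index
        body = []
        for inner in it:                 # consume the delimited region
            if inner == "CCA_End":
                break
            body.append(inner)
        else:
            raise ValueError("CCA_Start without matching CCA_End")
        if index in control_table:
            out.append(f"CCA ctrl={control_table[index]:#x}")
        else:
            out.extend(body)             # assumed fallback: run the original code
    return out


stream = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR",
]
table = {2: 0b1011}  # hypothetical control bits, for illustration only
print(decode(stream, table))
# → ['LSR r2, r2, #4', 'LD r3', 'CCA ctrl=0xb', 'SHR']
```

Because the original instructions stay in the binary between the two markers, the same code could in principle run with or without a table entry, which is the property the fallback branch models.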
17
Dynamic vs. Static Selection
[Figure: bar chart of speedup (y-axis 1.00 to 1.45) per benchmark, comparing Dynamic Selection and Static Selection]
► SPECint: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf
► MediaBench: cjpeg, djpeg, epic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, unepic
► Encryption: 3des, blowfish, rc4, sha
► Average
18
Summary
Transparent instruction set customization
► Benefits of customization without changing the ISA
Presented the design of a compute accelerator
► Handles the majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
► Table-based static selection – dynamic realization
► Trace cache based dynamic selection – dynamic realization