1
University of Michigan Electrical Engineering and Computer Science
Data-centric Subgraph Mapping for Narrow Computation Accelerators
Amir Hormati, Nathan Clark, and Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
2
Introduction
Migration of applications
Programmability and cost issues in ASICs
More functionality in the embedded processor
3
What Are the Challenges?
Accelerator hardware
Compiler algorithm
4
Configurable Compute Array (CCA)
Array of FUs
Arithmetic/logic 32-bit functional units
Full interconnect between rows
Supports 95 percent of all computation patterns (Nathan Clark, ISCA 2005)
[Figure: CCA with four inputs (Input1 to Input4) and two outputs (Output1, Output2)]
5
Report Card on the Original CCA
Easy to integrate into current embedded systems
High performance gain
However, for the 32-bit general-purpose CCA:
–130nm standard cell library
–Area requirement: 0.3 mm²
–Latency: 3.3 ns
[Figure: die photo of a processor with CCA]
6
Objectives of this Work
Redesign of the CCA hardware
–Area
–Latency
Compilation strategy
–Code quality
–Runtime
7
Width Utilization
Full width of the FUs is not always needed.
Narrower FUs are not the solution.

Benchmark    Less than 16-bit    Less than 8-bit
Rawcaudio    94%                 52%
Rawdaudio    91%                 60%
Epic         80%                 45%
Unepic       74%                 40%
Cjpeg        76%                 49%
Djpeg        70%                 53%

Benchmark    Larger than 16-bit  Larger than 8-bit
3des         86%                 90%
bitcount     80%                 85%
rijndael     50%                 64%
8
Width-Aware Narrow CCA
[Figure: block diagram with labeled blocks: Input Registers, Width Checker (bits [8-31], carry bits), Iteration Controller (Iterate signal), CCA (bits [0-7]), Output Registers (Output 1, Output 2)]
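The width check can be illustrated with a minimal Python sketch. It assumes 32-bit unsigned operands and an 8-bit datapath, as the bit ranges in the figure suggest. The helper name `iterations_needed` is hypothetical, and the sketch ignores the carry bits that the hardware also inspects.

```python
SLICE_BITS = 8   # assumed narrow datapath width (figure labels bits [0-7])
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit operands (figure labels bits [8-31])

def iterations_needed(*operands: int) -> int:
    """Hypothetical helper: number of 8-bit passes needed for the widest operand.

    Carries produced during execution can force extra passes; they are
    not modeled here.
    """
    widest = max((op & WORD_MASK).bit_length() for op in operands)
    # Ceiling division; an all-zero input still needs one pass.
    return max(1, -(-widest // SLICE_BITS))
```

For example, operands that all fit in the low byte need one pass, while any operand with a bit set in [24-31] needs four.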
9
Sparse Interconnect
Rank wires based on utilization.
More than 50% of wires removed.
91% of all patterns are supported.
[Figure: CCA with full interconnect and CCA with sparse interconnect, each with inputs Input1 to Input4 and outputs Output1 and Output2]
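A minimal sketch of utilization-based wire pruning follows. The wire names, the utilization counts, and the rule that a pattern is supported only if all of its wires survive are assumptions for illustration; the slides do not give the actual ranking data or cutoff.

```python
def prune_wires(utilization, keep_fraction=0.5):
    """Keep the most-used wires; ties are broken by wire name."""
    ranked = sorted(utilization, key=lambda w: (-utilization[w], w))
    return ranked[: int(len(ranked) * keep_fraction)]

def pattern_coverage(patterns, kept_wires):
    """Fraction of patterns whose required wires are all still present."""
    kept = set(kept_wires)
    supported = sum(1 for wires in patterns if set(wires) <= kept)
    return supported / len(patterns)

# Hypothetical data: how often each wire is used, and which wires each pattern needs.
usage = {"a": 10, "b": 1, "c": 7, "d": 0}
kept = prune_wires(usage)
coverage = pattern_coverage([{"a"}, {"a", "c"}, {"b"}], kept)
```

Sweeping `keep_fraction` and watching `coverage` is one way to explore the trade-off between interconnect size and supported patterns.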
10
Synthesis Results

Accelerator Configuration            Latency (ns)   Area (mm²)
32-bit with full interconnect        3.30           0.301
32-bit with sparse interconnect      2.95           0.270
16-bit with full interconnect        2.88           0.168
16-bit with sparse interconnect      2.55           0.140
8-bit with full interconnect         2.56           0.080
8-bit with sparse interconnect       2.00           0.070
Width Checker                        0.39           0.002

Synthesized using Synopsys and Encounter in a 130nm library.
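As a quick check on the table, the ratios between the 32-bit full-interconnect baseline and the 8-bit sparse design can be computed directly. Adding the width checker's area to the 8-bit CCA area, and leaving its latency off the critical path, are assumptions of this sketch; the slide does not say how the two are combined.

```python
# (latency in ns, area in mm^2), copied from the synthesis table
baseline = (3.30, 0.301)       # 32-bit with full interconnect
narrow = (2.00, 0.070)         # 8-bit with sparse interconnect
width_checker = (0.39, 0.002)

# Assumption: the clock is set by the CCA latency alone.
clock_speedup = baseline[0] / narrow[0]

# Assumption: the width checker's area is added to the narrow CCA's area.
area_ratio = baseline[1] / (narrow[1] + width_checker[1])

print(f"clock speedup: {clock_speedup:.2f}x, area ratio: {area_ratio:.2f}x")
```

Under these assumptions the ratios come out to about 1.65x and 4.18x, close to the 64% faster clock and 4.2x smaller area quoted in the conclusion.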
11
Compilation Challenges
Best portions of the code
Non-uniform latency
Current solutions:
–Hand coding
–Function intrinsics
–Greedy solution
12
Step 1: Enumeration
[Figure: dataflow graph between live-in and live-out values with eight operations numbered 1 to 8 (ADD, AND, ADD, OR, XOR, AND, ADD, CMP), followed by examples of enumerated subgraphs]
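A minimal sketch of subgraph enumeration, assuming the dataflow graph is given as a node list and a list of (producer, consumer) edges. It brute-forces every node subset up to a size limit and keeps the connected ones. The function names and the size limit are hypothetical, and constraints such as input/output counts are left to later steps.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """True if the nodes form one component, treating edges as undirected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in edges:
            for here, other in ((a, b), (b, a)):
                if here == current and other in nodes and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == nodes

def enumerate_subgraphs(nodes, edges, max_size):
    """All connected node subsets with at most max_size nodes."""
    found = []
    for size in range(1, max_size + 1):
        for subset in combinations(sorted(nodes), size):
            if is_connected(subset, edges):
                found.append(frozenset(subset))
    return found
```

Brute force is exponential in the number of operations, which is why a practical implementation would bound the search space.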
13
Step 2: Subgraph Isomorphism Pruning
Ensure subgraphs can run on accelerator
[Figure: step-by-step mapping of a subgraph with operations 3 AND, 6 SUB, 8 SHL, 10 SHRA, and 11 ADD onto accelerator functional units (<<, *, Logic, >>, +/-) with ports A to H]
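The pruning test can be sketched as a brute-force subgraph-matching check: a candidate is kept only if its operations can be placed on distinct functional units that support them, with every producer-consumer edge landing on an existing wire. The data layout and the name `fits_accelerator` are hypothetical, and the sketch ignores details such as routing a value through an unused FU.

```python
from itertools import permutations

def fits_accelerator(sub_ops, sub_edges, fu_types, fu_wires):
    """Brute-force search for a valid placement of a subgraph on the accelerator.

    sub_ops:   {node: op_class}                e.g. {6: "arith"}
    sub_edges: set of (producer, consumer) node pairs
    fu_types:  {fu: set of supported op classes}
    fu_wires:  set of (source_fu, destination_fu) pairs
    """
    nodes = list(sub_ops)
    # Try every injective assignment of subgraph nodes to functional units.
    for fus in permutations(fu_types, len(nodes)):
        place = dict(zip(nodes, fus))
        ops_ok = all(sub_ops[n] in fu_types[place[n]] for n in nodes)
        wires_ok = all((place[a], place[b]) in fu_wires for a, b in sub_edges)
        if ops_ok and wires_ok:
            return True
    return False

# Hypothetical two-row accelerator: two arithmetic/logic FUs feeding one logic FU.
fu_types = {"r0a": {"arith", "logic"}, "r0b": {"arith", "logic"}, "r1a": {"logic"}}
fu_wires = {("r0a", "r1a"), ("r0b", "r1a")}
```

With this toy accelerator, a SUB feeding an AND fits, while an ADD feeding another ADD does not, because the second row has no arithmetic unit.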
14
Step 3: Grouping
[Figure: the dataflow graph (operations 1 to 8) with candidate subgraphs A to F, shown before and after grouping; after grouping, a combined candidate AC is added]
Assuming A and C are the only possibilities for grouping.
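One round of grouping can be sketched as follows, assuming candidates are named node sets and that two candidates may be combined when they share no operations and no dataflow edge connects them. The `fits` callback stands in for the accelerator legality check and is hypothetical; repeating the round on its own output would give the iterative version.

```python
from itertools import combinations

def group_candidates(candidates, edges, fits=lambda nodes: True):
    """Combine pairs of disjoint, disconnected candidates into new candidates.

    candidates: {name: frozenset of node ids}
    edges:      iterable of (producer, consumer) node pairs
    fits:       hypothetical legality check for the combined node set
    """
    grouped = {}
    ordered = sorted(candidates.items(), key=lambda item: item[0])
    for (name_a, a), (name_b, b) in combinations(ordered, 2):
        if a & b:
            continue  # overlapping candidates cannot execute together
        linked = any((u in a and v in b) or (u in b and v in a) for u, v in edges)
        if linked:
            continue  # connected pairs are already covered by enumeration
        if fits(a | b):
            grouped[name_a + name_b] = a | b
    return grouped
```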
15
Dealing with Non-uniform Latency

Op     W [0,8]   W [9,16]   W [17,24]   W [25,32]   Average Latency
ADD    100%      0%         0%          0%          1
OR     0%        50%        0%          50%         3
AND    0%        50%        50%         0%          2.5

Subgraph cost: 3, benefit: 0
[Figure: execution timeline for A, B, and C alternating between 8-bit and 24-bit data; average latency = 2]
More than 94% do not change width.
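The average latencies on this slide are consistent with a width-weighted sum in which data in the i-th 8-bit range costs i passes through the narrow CCA. The sketch below reproduces them. The percentages are read from the slide's table as far as they can be recovered, and taking the slowest operation as the subgraph cost, with one cycle per operation on the baseline processor, is an assumption that happens to match the slide's cost of 3 and benefit of 0.

```python
def average_latency(width_profile):
    """Expected passes: bucket i (0-based) holds widths that need i + 1 passes."""
    return sum(share * (i + 1) for i, share in enumerate(width_profile))

# Fraction of executions per width range: [0,8], [9,16], [17,24], [25,32]
profiles = {
    "ADD": [1.0, 0.0, 0.0, 0.0],
    "OR": [0.0, 0.5, 0.0, 0.5],
    "AND": [0.0, 0.5, 0.5, 0.0],
}
latencies = {op: average_latency(p) for op, p in profiles.items()}

# Assumption: the subgraph is as slow as its slowest operation.
subgraph_cost = max(latencies.values())
# Assumption: each operation takes one cycle when run on the baseline processor.
benefit = len(profiles) - subgraph_cost
```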
16
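Step 4: Unate Covering
[Table: covering matrix with one row per operation (Op ID 1 to 8, annotated with operand widths such as 8, 24, and 32 bits) and one column per candidate subgraph (A, B, C, AC, D, E, F, G, H, … N); a 1 marks each operation that a candidate covers, and the bottom two rows give each candidate's cost and benefit]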
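Unate covering asks for a set of columns that covers every row at least once at minimum total cost. A minimal brute-force sketch is shown below. The candidate names, coverage sets, and costs are hypothetical, and a production compiler would use a dedicated covering solver instead of exhaustive search.

```python
from itertools import combinations

def min_cost_cover(ops, candidates):
    """Exhaustively find the cheapest set of candidates covering every op.

    candidates: {name: (set of ops covered, cost)}
    Returns (total_cost, tuple_of_names), or None if no cover exists.
    """
    required = set(ops)
    names = sorted(candidates)
    best = None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(candidates[n][0] for n in combo))
            if covered >= required:
                cost = sum(candidates[n][1] for n in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best

# Hypothetical instance: three operations, five candidate subgraphs.
example = {
    "A": ({1, 2}, 3),
    "B": ({3}, 1),
    "C": ({1, 2, 3}, 5),
    "D": ({2, 3}, 2),
    "E": ({1}, 1),
}
solution = min_cost_cover({1, 2, 3}, example)
```

Here D and E together cover all three operations at cost 3, beating A plus B (cost 4) and C alone (cost 5).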
17
Experimental Evaluation
ARM port of Trimaran compiler system
Processor model
–ARM926EJ-S
–Single issue, in-order execution, 5-stage pipeline
–I/D caches: 16k, 64-way
Hardware simulation: SimpleScalar 4.0
18
Comparison of Different CCAs
The 16-bit and 8-bit CCAs are 7% and 9% better than the 32-bit CCA, respectively.
Assumed clock speed: 1/(3.3 ns) ≈ 300 MHz
19
Comparison of Different Algorithms
Previous work: Greedy
10% worse than data-unaware
20
Conclusion
Programmable hardware accelerator
Width-aware CCA: optimizes for the common case.
–64% faster clock
–4.2x smaller
Data-centric compilation: deals with the non-uniform latency of the CCA.
–Average 6.5%, max 12% better than the data-unaware algorithm.
21
Questions?
For more information: http://cccp.eecs.umich.edu/
22
Data-Centric FEU
23
Operation of Narrow CCA
[(0x1D + 0x0C) + (0x20 OR 0x08)]
[Figure: step-by-step evaluation of the expression on the narrow CCA with inputs A = 0x1D, B = 0x0C, C = 0x20, D = 0x08, passing through ADD, OR, and ADD functional units]
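The example expression can be evaluated eight bits at a time, in the spirit of the narrow CCA. The sketch below is an assumption-laden software model, not the hardware's actual control logic: it keeps one carry per adder between passes and stops as soon as no operand bits or carries remain. The function name `narrow_eval` is hypothetical.

```python
MASK = 0xFF  # one 8-bit slice

def narrow_eval(a, b, c, d):
    """Compute (a + b) + (c | d) on 32-bit values, 8 bits per pass.

    Returns (result, passes). Carries out of bit 31 are dropped.
    """
    result, shift, passes = 0, 0, 0
    carry1 = carry2 = 0  # carries of the first and second adder
    while True:
        sa, sb, sc, sd = ((x >> shift) & MASK for x in (a, b, c, d))
        s1 = sa + sb + carry1
        carry1, s1 = s1 >> 8, s1 & MASK
        s2 = s1 + (sc | sd) + carry2
        carry2, s2 = s2 >> 8, s2 & MASK
        result |= s2 << shift
        shift += 8
        passes += 1
        operands_left = any(x >> shift for x in (a, b, c, d))
        if shift >= 32 or not (operands_left or carry1 or carry2):
            return result, passes

value, passes = narrow_eval(0x1D, 0x0C, 0x20, 0x08)
print(hex(value), passes)  # 0x51 after a single pass
```

Inputs such as 0xFF + 0x01 show why carries matter: both operands fit in one byte, yet the carry forces a second pass.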
24
Data-Centric Subgraph Mapping
Enumerate
–All subgraphs
Pruning
–Subgraph isomorphism
Grouping
–Iteratively group disconnected subgraphs
Selection
–Unate covering
Shrink search space to control runtime
[Figure: flow from Enumeration to Pruning to Grouping to Selection]
25
How Good Is the Cost Function?
Almost all of the operands have the same width range throughout the execution.
27
Width Utilization
Full width of the FUs is not always needed.
Replacing FUs with narrower FUs is not a good idea by itself.

Benchmark    Less than 16-bit    Less than 8-bit
Rawcaudio    94%                 52%
Rawdaudio    91%                 60%
Epic         80%                 45%
Unepic       74%                 40%
Cjpeg        76%                 49%
Djpeg        70%                 53%

Benchmark    Larger than 16-bit  Larger than 8-bit
3des         86%                 90%
bitcount     80%                 85%
rijndael     50%                 64%