1 Automatically Generating Custom Instruction Set Extensions
Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors
2 Problem Statement
- There is demand for high-performance, low-power special-purpose systems
  - E.g. cell phones, network routers, PDAs
- One way to achieve these goals is to augment a general purpose processor with Custom Function Units (CFUs)
  - A CFU combines several primitive operations
- We propose an automated method for CFU generation
3 System Overview
4 Example
- Potential CFUs (2-node subgraphs): 1,3  2,4  2,6  3,4  4,5  5,8  6,7  7,8
5 Example
- Potential CFUs (grown to 3-node subgraphs): 1,3  2,4  2,6  …  1,3,4  2,4,5  2,6,7  …
6 Example
- Potential CFUs (grown to 4- and 5-node subgraphs): 1,3  2,4  2,6  …  1,3,4,5  2,4,5,8  2,6,7,8  …  1,3,4,5,8
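The growth step in the example slides can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the example dataflow graph is given as an adjacency list and that each candidate subgraph is grown by adding one connected successor at a time.

```python
# Sketch of candidate-CFU enumeration by growing connected subgraphs.
# Hypothetical DFG matching the example slides: node -> successor nodes.
dfg = {1: [3], 2: [4, 6], 3: [4], 4: [5], 5: [8], 6: [7], 7: [8], 8: []}

def grow(candidates):
    """Extend each candidate subgraph along one outgoing dataflow edge."""
    grown = set()
    for sub in candidates:
        for node in sub:
            for succ in dfg[node]:
                if succ not in sub:
                    grown.add(frozenset(sub | {succ}))
    return grown

# Seed with all 2-node patterns (single edges), then grow one step.
pairs = {frozenset((u, v)) for u in dfg for v in dfg[u]}
triples = grow(pairs)
```

Seeding with single edges reproduces the eight pairs on slide 4, and one growth step yields 3-node patterns such as {1,3,4} and {2,4,5} from slide 5; repeating the step gives the 4- and 5-node patterns on slide 6.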
7 Characterization
- Use the macro library to get information on each potential CFU
  - Latency is the sum of each primitive's latency
  - Area is the sum of each primitive's macrocell area
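The characterization rule above is simple to state in code. The per-primitive latency and area numbers below are made up for illustration; only the summation follows the slide.

```python
# Sketch of CFU characterization from a macro library.
# Latency (ns) and area (macrocells) values here are illustrative only.
macro_lib = {
    "ADD": {"latency": 2.0, "area": 320},
    "AND": {"latency": 0.5, "area": 32},
    "XOR": {"latency": 0.7, "area": 64},
    "LSL": {"latency": 0.4, "area": 48},
}

def characterize(cfu_ops):
    """Per the slide: CFU latency and area are sums over its primitives."""
    latency = sum(macro_lib[op]["latency"] for op in cfu_ops)
    area = sum(macro_lib[op]["area"] for op in cfu_ops)
    return latency, area

lat, area = characterize(["ADD", "XOR", "AND"])
```

Summing latencies implicitly assumes the primitives are chained along the CFU's critical path, which matches how the patterns are grown along dataflow edges.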
8 Issues We Consider
- Performance
  - On critical path
  - Cycles saved
- Cost
  - CFU area
  - Control logic: difficult to measure
  - Decode logic: difficult to measure
  - Register file area: can be amortized
[DFG figure: LD, ADD, AND, ASL, XOR, BR]
9 More Issues to Consider
- IO: number of input and output operands
- Usability: how well can the compiler use the pattern?
[DFG figure: OR, LSL, AND, CMPP]
10 Selection
- Currently use a greedy algorithm
  - Pick the candidate with the best performance gain per unit area first
  - Can yield bad selections
[DFG figure: OR, LSL, AND, CMPP]
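The greedy selection step can be sketched as below. This is our minimal illustration, not the paper's code, and it assumes a fixed area budget; the candidate names and numbers are invented.

```python
# Sketch of greedy CFU selection: repeatedly take the candidate with the
# best cycles-saved / area ratio until the area budget is exhausted.
def select_cfus(candidates, area_budget):
    """candidates: list of (name, cycles_saved, area) tuples."""
    chosen = []
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, gain, area in ranked:
        if area <= area_budget:
            chosen.append(name)
            area_budget -= area
    return chosen

# Invented candidates: ratios are A=2.0, B=3.0, C=1.0, so B is taken first.
picked = select_cfus([("A", 8, 4), ("B", 6, 2), ("C", 5, 5)], area_budget=6)
```

Because the choice is greedy, a high-ratio pattern taken early can consume budget that a better combination of later patterns would have used — the "bad selections" the slide mentions.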
11 Case Study 1: Blowfish
- Speedup: cycles can be compressed down to 2!
- Cost: ~6 adders
- 6 inputs, 2 outputs
- C code this DFG came from:
  r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
[DFG figure: ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD over r65, r70, r76, r81, r891, r91 and constants -1, 16, 255, 256, 2]
12 Case Study 2: ADPCM Decode
- Speedup: cycles can be compressed down to 1
- Cost: ~1.5 adders
- 2 inputs, 2 outputs
- C code this DFG came from:
  d = d & 7;
  if ( d & 4 ) { … }
[DFG figure: AND, CMPP over r16 and constants 7, 4, 0]
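As a rough sketch of what this CFU fuses (ours, not the paper's hardware description), the AND/CMPP pair produces both the masked value and the branch predicate in a single step, matching the 2-output figure above:

```python
# Hypothetical fused AND/CMPP unit for the ADPCM pattern: one operation
# yields both the masked delta (d & 7) and the predicate ((d & 4) != 0).
def cfu_and_cmpp(d):
    masked = d & 7        # AND primitive
    pred = (d & 4) != 0   # CMPP primitive
    return masked, pred

m, p = cfu_and_cmpp(13)   # 13 = 0b1101
```

Fusing a comparison with the arithmetic feeding it is attractive on a predicated VLIW like the baseline machine, since the predicate is available in the same cycle as the data result.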
13 Experimental Setup
- CFU recognition implemented in the Trimaran research infrastructure
- Speedup shown is with CFUs relative to a baseline machine
  - Four-wide VLIW with predication
  - Can issue at most 1 integer, float, memory, and branch instruction per cycle
  - 300 MHz clock
- CFU latency is estimated using standard cells from Synopsys' design library
14 Varying the Number of CFUs
- More CFUs yield more performance
- A weakness in our selection algorithm causes plateaus
15 Varying the Number of Ops
- Bigger CFUs yield better performance
- If they're too big, they can't be used as often and they expose alternate critical paths
16 Related Work
- Many people have done this for code size
  - Bose et al., Liao et al.
- Typically done with traces
  - Arnold et al.; previous paper used a more enumerative discovery algorithm
- We are unique because:
  - Compiler-based approach
  - Novel analysis of CFUs
17 Conclusion and Future Work
- CFUs have the potential to offer big performance gains at small cost
- Future work:
  - Recognize more complex subgraphs: generalized acyclic/cyclic subgraphs
  - Develop our system to automatically synthesize application-tailored coprocessors