Conjoining Soft-Core FPGA Processors

David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a, *), Dean Tullsen (b), Roman Lysecky (c)

(a) Department of Computer Science and Engineering, University of California, Riverside
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
(*) Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.
David Sheldon, UC Riverside 2 of 22
FPGA Soft Core Processors
- Soft-core processor
  - HDL description
  - Flexible implementation: FPGA or ASIC
  - Technology independent
[Figure: one HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
FPGA Soft Core Processors
- Soft-core processors can have configurable options
  - Datapath units
  - Cache
  - Bus architecture
- Current commercial FPGA soft-core processors
  - Xilinx MicroBlaze
  - Altera Nios
[Figure: FPGA containing two μP blocks with cache, FPU, and MAC units]
Conjoinment Overview
- Add necessary units to both processors
- “Conjoining”: conjoin the FPU unit
[Figure: two base microprocessors, one per application (Application 1, Application 2), each with its own FPU; after conjoining, the two share a single conjoined FPU unit]
Conjoinment Background
- Conjoinment proposed for multicore desktop processing (Kumar 2004)
- Reduces size with reasonable performance overhead
  - e.g., cache conjoinment overhead: 1%-13%
[Figure: ICache sharing and DCache sharing]
Outline
- Conjoinment for soft-core FPGA processors
  - Area savings
  - Performance overhead
- Tuning heuristic for two configurable soft-cores with conjoin option
[Figure: size versus performance]
Area Savings
- Significant potential area savings
- Limitations
  - Does not consider multiplexing costs
  - Due to absence of FPGA synthesis tools supporting conjoinment
  - But good potential justifies further investigation
[Figure: base MicroBlaze with multiplier, barrel shifter, divider, and FPU]
[Table: unit size for Multiplier, Barrel Shifter, Divider, FPU; extracted values: %, 4%, 23%, 32%]
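A rough way to see where the potential area saving comes from is to compare two cores that each carry their own copy of a unit against two cores that share one copy. The sketch below assumes the unit size is expressed as a fraction of one base core and, like the slide, ignores multiplexing costs. The function name and the 0.25 input are hypothetical and are not taken from the talk.

```python
def conjoin_area_saving(unit_fraction: float) -> float:
    """Fraction of total two-core area saved by sharing one unit.

    unit_fraction: size of the unit relative to one base core (assumption).
    Multiplexing cost is ignored, matching the limitation stated on the slide.
    """
    separate = 2 * (1.0 + unit_fraction)  # two cores, each with its own unit
    conjoined = 2 * 1.0 + unit_fraction   # two cores, one shared unit
    return (separate - conjoined) / separate


# Hypothetical unit that is a quarter the size of a base core.
saving = conjoin_area_saving(0.25)
print(f"{saving:.1%}")
```

Under this model the saving for a single unit can never reach 50%, because the two base cores themselves are not shared.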
Performance Overhead
- No simulator exists for conjoined processors
  - We developed our own trace-based conjoined processor simulator
- Simulation uses pessimistic performance assumptions
  - Kumar's techniques can improve on these
- Simulator outputs contention information
- Final cycle counts can be compared with the unconjoined case to determine performance overhead
[Figure: app1 and app2 (brev, bitmnp) run through the Xilinx simulator to produce trace1 and trace2, which feed the conjoined simulator; stalls are labeled as access stalls and contention stalls]
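The talk does not describe the simulator's internals, so the following is only a toy model of the idea: replay two traces, charge a fixed access delay whenever a core reaches for the shared unit, and charge a contention stall when the unit is still busy with the other core. The `Op` type, the `simulate` function, first-come-first-served arbitration, and the one-cycle access delay are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Op:
    cycles: int        # cycles the operation takes on an unconjoined core
    uses_shared: bool  # True if the operation needs the conjoined unit


def simulate(trace1, trace2, access_delay=1):
    """Replay two traces against one shared unit (toy model, FCFS arbitration)."""
    traces = [trace1, trace2]
    time = [0, 0]        # local cycle count of each core
    idx = [0, 0]         # next operation of each core
    access = [0, 0]      # cycles lost to the access delay
    contention = [0, 0]  # cycles lost waiting for a busy shared unit
    unit_free_at = 0
    while idx[0] < len(trace1) or idx[1] < len(trace2):
        # Advance the unfinished core that is furthest behind in time.
        ready = [c for c in (0, 1) if idx[c] < len(traces[c])]
        c = min(ready, key=lambda k: time[k])
        op = traces[c][idx[c]]
        if op.uses_shared:
            start = time[c] + access_delay
            access[c] += access_delay
            if start < unit_free_at:  # unit still busy with the other core
                contention[c] += unit_free_at - start
                start = unit_free_at
            unit_free_at = start + op.cycles
            time[c] = unit_free_at
        else:
            time[c] += op.cycles
        idx[c] += 1
    return time, access, contention


# Two hypothetical traces that both hit the shared unit at the same moment.
t1 = [Op(1, False), Op(4, True)]
t2 = [Op(1, False), Op(4, True)]
cycles, access, contention = simulate(t1, t2)
baseline = [sum(op.cycles for op in t) for t in (t1, t2)]
overhead = [(c - b) / b for c, b in zip(cycles, baseline)]
print(cycles, access, contention)  # [6, 10] [1, 1] [0, 4]
```

Comparing `cycles` with `baseline` gives the per-application overhead; the example is deliberately a worst case in which both cores request the unit in the same cycle.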
Performance Overhead
- Speedup: application time on optimally configured processor / avg. app. time on base processor
- Compared configuration with conjoinment versus without
- Performance overhead usually small, averaging just 4.2%
- Overhead caused by access delays and contention for the hardware units
[Figure: speedup with and without conjoinment; surviving labels: brev, bitmnp, 17%, 2.4%]
Tuning Heuristic
- 5 choices per unit
  - e.g., FPU: no unit, 1 only, 2 only, 1 & 2, and conjoined
- 4 units
  - 5^4 = 625 possible configurations
- Simulation: ~30 minutes per configuration
- Need search heuristic to tune
[Figure: Base MicroBlaze 1 and Base MicroBlaze 2 with multiplier, barrel shifter, and divider units, shown with the FPU options: no FPU, FPU 1, FPU 2, FPU conjoined]
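The size of the search space follows directly from the counts on this slide. The short sketch below enumerates it and estimates the cost of exhaustive simulation at 30 minutes per configuration; the unit and choice labels are illustrative names, not identifiers from any tool.

```python
from itertools import product

UNITS = ["multiplier", "barrel_shifter", "divider", "fpu"]
CHOICES = ["none", "core1_only", "core2_only", "both", "conjoined"]
MINUTES_PER_SIMULATION = 30

# One choice per unit, so the space is the 4-fold product of the 5 choices.
configurations = list(product(CHOICES, repeat=len(UNITS)))
total_minutes = len(configurations) * MINUTES_PER_SIMULATION

print(len(configurations))  # 625
print(total_minutes / 60)   # 312.5 (hours)
```

At roughly 312 hours, or about 13 days of simulation for a single pair of applications, exhaustive search is impractical, which is the motivation for a heuristic.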
Map to 0-1 Knapsack Problem
- Creating the model
[Figure: synthesis of the base MicroBlaze and of MicroBlaze with each unit (multiplier, divider, barrel shifter, FPU), combined with the application, gives size and perf per unit; chart labels: perf increment, size increment, and perf/size for FPU, BS, MUL, DIV]
Map to 0-1 Knapsack Problem
- First consider tuning without conjoinment
- Problem of instantiating units within limited FPGA size can be mapped to the 0-1 knapsack problem
  - Add items, each with weight and benefit, to a weight-constrained knapsack such that profit is maximized
- Note: mapping inexact; weights/benefits not strictly additive
[Figure: items (MUL 1, FPU 1, MUL 2, FPU 2), each with a weight and a benefit; the knapsack is the available FPGA area beside the two base MicroBlaze cores]
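For the unconjoined case, the mapping can be illustrated with the textbook dynamic-programming solution to 0-1 knapsack. This is a generic sketch, not the authors' tuning tool: the weights stand for unit sizes, the benefits for performance gains, and every number below is hypothetical. It also treats weights and benefits as additive, which the slide notes is only approximately true.

```python
def knapsack_01(items, capacity):
    """Classic 0-1 knapsack by dynamic programming.

    items: list of (name, weight, benefit) with integer weights.
    Returns (best total benefit, tuple of chosen item names).
    """
    # best[w] = (benefit, chosen names) achievable with total weight <= w
    best = [(0, ())] * (capacity + 1)
    for name, weight, benefit in items:
        # Iterate weights downward so each item is used at most once.
        for w in range(capacity, weight - 1, -1):
            candidate = best[w - weight][0] + benefit
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + (name,))
    return best[capacity]


# Hypothetical sizes and benefits for the units of two cores.
items = [
    ("MUL 1", 3, 4),
    ("FPU 1", 6, 9),
    ("MUL 2", 3, 5),
    ("FPU 2", 6, 6),
]
value, chosen = knapsack_01(items, capacity=12)
print(value, chosen)  # 18 ('MUL 1', 'FPU 1', 'MUL 2')
```

With 12 units of free area, the solver keeps both multipliers and the more beneficial FPU and leaves out the second FPU.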
Disjunctively Constrained Knapsack
- Problem: if conjoined unit included, can't also include standalone unit
- Solution: map to disjunctively constrained 0-1 knapsack
  - Yamada T., “Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem”, 2002
  - Prohibits specific item pairs from being in the knapsack
  - ILP solution, running time is pseudo-polynomial
[Figure: knapsack is the available FPGA area beside the two base MicroBlaze cores; items now include conjoined units (MUL C, FPU C) alongside MUL 1, FPU 1, MUL 2, FPU 2]
Disjunctively Constrained Knapsack
- Conjoined benefits show a small decrease in benefit relative to the unconjoined unit
- Conjoined units provide benefits to both processors
[Figure: same knapsack and items; a standalone item such as MUL 1 has one weight and one benefit, while a conjoined item such as MUL C has one weight and two benefits (benefits 1, benefits 2)]
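The disjunctive constraint can be made concrete with a small exhaustive search that skips any subset containing a forbidden pair. This is not the heuristic from the cited work and it only scales to a handful of items; it is meant to show the problem shape. All weights and benefits are hypothetical. Each conjoined item carries a single benefit equal to the sum of its benefits to the two processors, set slightly below the sum of the two standalone units.

```python
from itertools import combinations


def dc_knapsack(items, conflicts, capacity):
    """Exhaustive disjunctively constrained 0-1 knapsack (small inputs only).

    items: dict name -> (weight, benefit)
    conflicts: pairs of names that may not both be selected
    """
    forbidden = {frozenset(pair) for pair in conflicts}
    best_value, best_set = 0, frozenset()
    names = list(items)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Reject subsets that contain any prohibited pair.
            if any(frozenset(p) in forbidden for p in combinations(subset, 2)):
                continue
            weight = sum(items[n][0] for n in subset)
            value = sum(items[n][1] for n in subset)
            if weight <= capacity and value > best_value:
                best_value, best_set = value, frozenset(subset)
    return best_value, best_set


# Hypothetical numbers; "C" items are conjoined and serve both processors.
items = {
    "MUL 1": (3, 4), "MUL 2": (3, 5), "MUL C": (3, 8),
    "FPU 1": (6, 9), "FPU 2": (6, 6), "FPU C": (6, 14),
}
conflicts = [
    ("MUL C", "MUL 1"), ("MUL C", "MUL 2"),
    ("FPU C", "FPU 1"), ("FPU C", "FPU 2"),
]
value, chosen = dc_knapsack(items, conflicts, capacity=9)
print(value, sorted(chosen))  # 22 ['FPU C', 'MUL C']
```

With only 9 units of area available, the conjoined multiplier and conjoined FPU together beat every combination of standalone units in this example.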
Disjunctively Constrained Knapsack
- Running time
  - Modeling
    - 5 synthesis runs for each processor
    - At most 4 runs of the conjoined simulator
  - Disjunctively constrained 0-1 knapsack
    - NP-complete problem
    - Solved with a heuristic
    - Heuristic takes < 1 min
Results
- Data gathered for the Xilinx MicroBlaze soft-core processor
- 10 EEMBC and Powerstone benchmarks
  - aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
- Obtained results for all possible pairwise conjoinments
- We only show conjoinment data when both applications use the unit
  - To avoid making conjoinment appear better than it is
Results
- Knapsack approach finds near-optimal solutions in most cases
Results
- Knapsack heuristic finds near-optimal solutions in most cases (versus exhaustive search with conjoinment)
  - Runs in seconds
  - One example had sub-optimal results (2.9 times slower)
- Performance overhead due to conjoinment just a few percent on average
Results
- On average, the knapsack approach yields the same size as the exhaustive search with conjoinment
- Average size savings of 16%
Conclusions
- Conjoining two soft-core FPGA processors reduces average size by 16%
  - Performance overhead just a few percent in most cases
- Disjunctively constrained 0-1 knapsack approach finds near-optimal solutions in most cases
  - But could be improved for some examples
- Future work
  - Consider multiplexing size and delay overheads
  - Apply Kumar's advanced conjoining techniques to reduce overheads