Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
Collaborators: David Sheldon (4th yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)
Frank Vahid, UC Riverside 2/57 Outline. Two UCR ICCAD’06 papers: Microblaze customization; Microblaze conjoining (and customization). Current work targeting Microblaze users: “Design of Experiments” paradigm; system-level synthesis for multi-core systems. Related FPGA work: "Warp processing"; standard binaries for FPGAs.
Microblaze Customization (ICCAD paper #1). FPGAs are an increasingly popular software platform. FPGA soft core processor: a microprocessor synthesized onto FPGA fabric. Soft core customization: cores come with configurable parameters; the Xilinx Microblaze comes with several instantiatable units: multiplier, barrel shifter, divider, FPU, or cache. Customization: tuning soft core parameters to a specific application. [Figure: microprocessor with optional units Mul, BS, Div, FPU, I$; customized configurations for App1 and App2.]
Instantiable Unit Speedups. Instantiating units can yield significant speedups. “base”: Microblaze without any optional units instantiated.
Customization Tradeoffs. Data for the aifir EEMBC benchmark on a Xilinx Microblaze synthesized to a Virtex device: 2x performance tradeoff, 4.5x size tradeoff.
“Size” on an FPGA. Defining a circuit’s “size” on an FPGA requires some work, because a design uses different resources: lookup tables (LUTs), embedded multipliers, and embedded block RAM (BRAM). Our solution: define “equivalent LUTs” for multipliers and BRAM, based on the total LUTs, multipliers, and BRAMs in a “full” Microblaze. Later found to closely match Xilinx’s “equivalent gates” concept. (Image courtesy of Xilinx.)
[Table: regular LUTs, 18x18 multipliers, and equivalent LUTs for Barrel Shifter, Divider, Multiplier, Floating Point Unit, Base MB, and Full MB. Surviving entries as extracted: Barrel Shifter2280, Divider1220, Base MB15700.]
Goal: Customize Soft Core to Minimize Application Runtime. With and without a size constraint. Even without a size constraint, must take care because some units reduce clock frequency and thus may slow runtime. [Chart annotation: Slower!]
Goal: Customize Soft Core to Minimize Application Runtime. With and without a size constraint. Even without a size constraint, must take care because some units reduce clock frequency and thus may slow runtime. "Full MB": MB with all units instantiated.
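The caution about clock frequency follows from runtime = cycles / clock frequency. A minimal sketch, assuming made-up cycle counts and frequencies (they are illustrative only, not measurements from the talk):

```python
def runtime_seconds(cycles: int, clock_hz: float) -> float:
    """Runtime is the cycle count divided by the clock frequency."""
    return cycles / clock_hz

# Hypothetical numbers for illustration only.
base = runtime_seconds(cycles=1_000_000, clock_hz=100e6)    # base MB
# A unit that saves 5% of the cycles but costs 10% of the clock frequency.
with_unit = runtime_seconds(cycles=950_000, clock_hz=90e6)

slower = with_unit > base  # the "faster" unit actually lengthens runtime here
```

Under these assumed numbers the configuration with the extra unit runs longer than the base, which is the effect the slide marks as "Slower!".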
Key Problem Related to Core Customization. Problem: synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes. Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of the best one.
Two Solution Approaches. Traditional CAD approach: pre-characterize using synthesis and execution/simulation, create an abstract problem model, and solve using CAD exploration algorithms; used a 0-1 knapsack formulation. Synthesis-in-the-loop approach: run synthesis and execute/simulate the application while exploring; more accurate. [Figure: traditional flow from start to finish: pre-characterize (synthesis and execution/simulation, 5-10 executions), model (typically some form of graph), explore; synthesis-in-the-loop flow from start to finish: explore with synthesis and execution/simulation (5-10 iterations).]
Traditional CAD Approach. Map to the 0-1 knapsack problem. 0-1 knapsack problem: given a set of items, each with a value and a weight, maximize the value of the items in a weight-constrained knapsack. Mapping: Item: instantiatable unit. Value of an item: speedup increment when instantiating the unit, vs. base MB. Weight of an item: equivalent LUTs. Knapsack weight constraint: equivalent LUTs constraint. [Figure: the optional units (Mul, BS, Div, FPU, I$) are the “items” and the microprocessor configuration for App1 is the “knapsack”; chart of speedup increment for Multiplier, Barrel Shifter, Floating Point Unit, Divider, MCH Cache.]
Traditional CAD Approach. [Charts per unit (Multiplier, Barrel Shifter, Floating Point Unit, Divider, MCH Cache): speedup increment; size (equiv LUTs); 1000*Speedup/Size. Different for every application.] Unit’s size determined by synthesizing MB with only that unit instantiated: requires 5 synthesis runs, +1 for base; about 1 hour. Unit’s speedup increment determined by comparing runtime of the application with and without the unit. Well-known knapsack heuristic uses value/weight ratio, applies dynamic programming. Complexity is O(n*W), where n is the number of items and W is the knapsack constraint. Runtime is negligible (seconds).
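The O(n*W) dynamic program mentioned above can be sketched as follows. This is a textbook 0-1 knapsack under stated assumptions, not the tool from the paper; the unit names, speedup increments, and equivalent-LUT weights in `UNITS` are hypothetical placeholders.

```python
from typing import List, Tuple

def knapsack(items: List[Tuple[str, float, int]], capacity: int) -> Tuple[float, List[str]]:
    """0-1 knapsack by dynamic programming over integer weights, O(n*W).

    items: (name, value, weight), where value is a speedup increment
    and weight is a size in equivalent LUTs.
    """
    # best[w] = (total value, chosen names) achievable with total weight <= w
    best = [(0.0, [])] * (capacity + 1)
    for name, value, weight in items:
        new_best = list(best)
        for w in range(weight, capacity + 1):
            cand_value = best[w - weight][0] + value
            if cand_value > new_best[w][0]:
                new_best[w] = (cand_value, best[w - weight][1] + [name])
        best = new_best
    return best[capacity]

# Hypothetical per-application characterization (not data from the talk).
UNITS = [
    ("mul", 0.40, 300),
    ("bs", 0.30, 230),
    ("div", 0.05, 120),
    ("fpu", 0.10, 1500),
    ("cache", 0.50, 900),
]

value, chosen = knapsack(UNITS, capacity=1500)
```

With these assumed numbers the selection is the multiplier, barrel shifter, and cache. Note that the model simply adds speedup increments, which is the simplification the next slide questions.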
Problem with Traditional CAD Approach. Does not consider interactions among units: speedup increments may not be additive for a given application, e.g., Mul 0.4, BS 0.3, but Mul & BS 0.6, not 0.7. Thus, not a perfect mapping to the 0-1 knapsack problem, because item values (speedup increments) don’t add perfectly.
Average pairwise speedup-increment additive inaccuracies for all pairs of benchmarks:
Component        Cache    Floating Point   Divider   Multiplier
Barrel Shifter   5.2 %    1.0 %            0.0 %     10.4 %
Multiplier       6.7 %    1.9 %            26.0 %
Divider          2.9 %    0.0 %
Floating Point   5.1 %
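The Mul/BS example can be worked through numerically. The slides do not give the formula behind the "additive inaccuracy" percentages, so the relative-error definition below is an assumption chosen only to illustrate the idea.

```python
def additive_inaccuracy(individual_increments, combined_increment):
    """Relative gap between the summed individual speedup increments and the
    increment measured with all of the units instantiated together.

    Assumed definition: |sum(individual) - combined| / combined.
    """
    predicted = sum(individual_increments)
    return abs(predicted - combined_increment) / combined_increment

# Numbers from the slide: Mul 0.4, BS 0.3, but Mul & BS together give 0.6.
gap = additive_inaccuracy([0.4, 0.3], 0.6)
```

Under this assumed definition, the additive model overestimates the combined increment by about one sixth for this pair.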
Synthesis-in-the-Loop Approach. View solution space as a tree. Level: instantiate given unit? If it gives speedup and fits, instantiate. Use unit speedup/size to order the tree: the “impact-ordered tree” approach. 5 synthesis runs, +1 for base, to determine units’ speedup/size; then requires 5 more synthesis runs to descend through the tree. [Figure: chart of 1000*Speedup/Size per unit (Multiplier, Barrel Shifter, Floating Point Unit, Divider, MCH Cache); decision tree with Yes/No branches at levels Barrel shifter, Multiplier, Divider, FPU, Cache, with nodes base, base+bs, base+mul, base+bs+mul.]
Synthesis-in-the-Loop Approach. View solution space as a tree, each level a decision for a unit; order levels by unit speedup/size for the application. 11 synthesis runs may take a few hours. To reduce, can consider using a pre-determined order, determined by the soft core vendor based on averages over many benchmarks. [Figure: chart of 1000*Speedup/Size per unit (Multiplier, Barrel Shifter, Floating Point Unit, Divider, MCH Cache); an application-specific impact-ordered tree and a fixed impact-ordered tree with Yes/No branches from base (e.g., base+div); the two level orderings shown are Barrel shifter, Multiplier, Divider, FPU, Cache and Divider, Barrel Shifter, Multiplier, FPU, Cache.]
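The impact-ordered descent can be sketched as below. This is a simplified reading of the slides, not the published tool: `measure` stands in for a real synthesis plus execution/simulation run, and `toy_measure` with all its numbers is a hypothetical model used only to make the sketch runnable.

```python
from typing import Callable, FrozenSet, List, Tuple

Config = FrozenSet[str]
Measure = Callable[[Config], Tuple[float, int]]  # config -> (runtime, size)

def impact_ordered_search(units: List[str], measure: Measure, size_limit: int) -> Config:
    base_runtime, base_size = measure(frozenset())
    # Pre-characterization: one run per unit, to rank units by speedup/size.
    impact = {}
    for u in units:
        runtime, size = measure(frozenset([u]))
        speedup_increment = base_runtime / runtime - 1.0
        impact[u] = speedup_increment / max(size - base_size, 1)
    order = sorted(units, key=lambda u: impact[u], reverse=True)

    # Descend the tree: keep a unit only if it gives speedup and fits.
    chosen: Config = frozenset()
    best_runtime = base_runtime
    for u in order:
        trial = chosen | {u}
        runtime, size = measure(trial)
        if runtime < best_runtime and size <= size_limit:
            chosen, best_runtime = trial, runtime
    return chosen

calls = []  # records every "synthesis run" requested

def toy_measure(config: Config) -> Tuple[float, int]:
    """Hypothetical stand-in for synthesis + simulation."""
    calls.append(config)
    cycles, size, period = 1000.0, 1500, 1.0
    if "mul" in config:
        cycles -= 300
        size += 300
    if "bs" in config:
        cycles -= 200
        size += 230
    if "mul" in config and "bs" in config:
        cycles += 100          # increments are not additive
    if "div" in config:
        size += 120
        period = 1.05          # unused unit that lengthens the clock period
    return cycles * period, size

best = impact_ordered_search(["mul", "bs", "div"], toy_measure, size_limit=2100)
```

For n units the sketch performs 2n+1 measurements, matching the slide's count of 11 runs for 5 units. A fixed, vendor-supplied order would skip the per-unit pre-characterization loop.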
Synthesis-in-the-Loop Approach. Data for fixed impact-ordered tree for 11 EEMBC benchmarks. [Charts per unit (Multiplier, Barrel Shifter, FPU, Divider, Cache): speedup; size (equiv LUTs); speedup/size.]
Customization Results. Fixed tree approach generally best. App-spec tree better for certain apps, but 2x runtime. [Charts of speedup and tool run time (minutes) for five approaches: fixed-order impact-ordered tree, application-specific impact-ordered tree, random impact-ordered tree, exhaustive, and knapsack. Four scenarios: no size constraint, Virtex II; size constraint = 80% of full MB size, Virtex II; size constraint = 80% of application-specific optimal MB configuration (guaranteed to “hurt”), Virtex II; no size constraint, Spartan2 device.] Results are averages for 11 EEMBC benchmarks. (ICCAD'06, David Sheldon et al.)
Conjoined Processors (ICCAD paper #2). Conjoined processors: two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen, ISCA 2004). That work showed little performance overhead for desktop processors. For desktop processors, the only research customer is Intel; for soft core processors, research customers are every soft core user. How much size savings and performance overhead for conjoined Microblazes? [Figure: unconjoined: Processor 1 and Processor 2, each with its own Multiplier; conjoined: Processor 1 and Processor 2 sharing one Multiplier.]
Conjoined Processors – Size Savings
Conjoined Processors – Performance Overhead. We created a trace simulator. It reads two instruction traces output by the MB simulator; adds a 1-cycle delay for every access to a conjoined unit (pessimistic assumption about contention detection scheme); looks for simultaneous access of the shared unit, and stalls one MB entirely until the unit becomes available.
Configuration          Cycle Latency
Barrel Shifter         2
Divider                34
Multiplier             3
FPU (Add, Sub, Mul)    6
FPU (Div)              30
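A minimal sketch of such a trace simulator follows, under stated assumptions: traces are lists of operation names, operations not listed in `LATENCY` take one cycle, and the shared unit is granted first-come, first-served with ties going to processor 0. The unit latencies are taken from the table above; everything else is hypothetical and much simpler than a real MB trace.

```python
from typing import Dict, List, Set

LATENCY: Dict[str, int] = {"bs": 2, "mul": 3, "div": 34}  # cycles, from the slide

def simulate(trace_a: List[str], trace_b: List[str], shared: Set[str],
             penalty: int = 1) -> List[int]:
    """Return the finish time (in cycles) of each of the two processors."""
    traces = [trace_a, trace_b]
    time = [0, 0]        # current cycle of each processor
    index = [0, 0]       # next instruction of each processor
    unit_free = {}       # shared unit -> cycle at which it becomes available
    while index[0] < len(traces[0]) or index[1] < len(traces[1]):
        # Advance whichever processor is furthest behind in time.
        ready = [p for p in (0, 1) if index[p] < len(traces[p])]
        p = min(ready, key=lambda q: time[q])
        op = traces[p][index[p]]
        if op in shared:
            # Stall the whole processor until the conjoined unit is free,
            # then pay the unit latency plus the 1-cycle conjoinment penalty.
            start = max(time[p], unit_free.get(op, 0))
            time[p] = start + LATENCY[op] + penalty
            unit_free[op] = time[p]
        else:
            time[p] += LATENCY.get(op, 1)
        index[p] += 1
    return time

trace = ["alu", "bs", "alu"]
unconjoined = simulate(trace, trace, shared=set())
conjoined = simulate(trace, trace, shared={"bs"})
```

In this toy case both processors request the barrel shifter in the same cycle, so one pays only the 1-cycle penalty and the other additionally waits for the unit.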
Conjoined Processors – Performance Overhead. Data shown for benchmarks that benefit (>1.3x speedup) from barrel shifter. Performance overheads are small. [Chart: speedup, conjoined vs. unconjoined, for benchmark pairs (brev),canrdr; brev,(canrdr); (brev),bitmnp; brev,(bitmnp); (brev),brev; brev,(brev); (bitmnp),canrdr; bitmnp,(canrdr); (bitmnp),bitmnp; bitmnp,(bitmnp); (canrdr),canrdr; canrdr,(canrdr).]
Performance overhead for all benchmark pairs
Customization Considering Conjoinment. Developed a 0-1 knapsack approach, “Disjunctively-Constrained Knapsack Solution”, to accommodate conjoinment. [Charts: speedup and size (equiv. LUTs) for knapsack, exhaustive w/ conj., and exhaustive w/o conj., for benchmark pairs BaseFP01, BaseFP01; BaseFP01, bitmnp; BaseFP01, canrdr; bitmnp, bitmnp; canrdr, canrdr; tblook, bitmnp; tblook, canrdr; tblook, tblook; and AVERAGE.] Note: To avoid exaggerating the benefits of conjoinment, data only considers benchmark pairs that significantly use a shared unit. Only 8 pairings shown due to space limits. (ICCAD'06, David Sheldon et al.)
Outline. Two UCR ICCAD’06 papers: Microblaze customization; Microblaze conjoining (and customization). Current work targeting Microblaze users: “Design of Experiments” paradigm; system-level synthesis for multi-core systems. Related FPGA work: "Warp processing"; standard binaries for FPGAs.
Ongoing Work – Design of Experiments Paradigm. "Design of Experiments" (DOE): well-established discipline (>80 yrs) for tuning parameters, for factories, crops, management, etc. Want to set parameter values for best output, but each experiment is costly, so can't try all combinations. Clear mapping of soft core customization to the DOE problem: given parameters and # of possible experiments, a DOE tool generates which experiments to run (parameter values) and analyzes the resulting data. Sound mathematical foundations. Present focus of David Sheldon (4th yr Ph.D.).
Ongoing Work – Design of Experiments Paradigm. Suppose time for 12 experiments. DOE tool generates which 12 experiments to run. User fills in results column (Cycles, Y).
Ongoing Work – Design of Experiments Paradigm. DOE tool analyzes results. Finds most important factors for given application.
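The slides do not say which design the DOE tool generates. One classical 12-run choice is a Plackett-Burman design, sketched below as an assumption, with +1 meaning "unit instantiated" and -1 meaning "not instantiated". The factor names and the synthetic cycle counts are hypothetical.

```python
from typing import Dict, List

# Standard generator row for the 12-run Plackett-Burman design.
PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

def plackett_burman_12() -> List[List[int]]:
    """12 runs x 11 two-level columns: 11 cyclic shifts plus an all-minus run."""
    rows = []
    row = list(PB12_GENERATOR)
    for _ in range(11):
        rows.append(row)
        row = [row[-1]] + row[:-1]   # cyclic shift by one position
    rows.append([-1] * 11)
    return rows

def main_effects(design: List[List[int]], factors: List[str],
                 y: List[float]) -> Dict[str, float]:
    """Main effect = mean response at +1 minus mean response at -1."""
    effects = {}
    for j, name in enumerate(factors):
        high = [yi for row, yi in zip(design, y) if row[j] == +1]
        low = [yi for row, yi in zip(design, y) if row[j] == -1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects

factors = ["mul", "bs", "div", "fpu", "cache"]   # uses the first 5 columns
design = plackett_burman_12()
# Synthetic "Cycles" column: only mul and bs matter in this made-up application.
cycles = [1000 - 150 * row[0] - 50 * row[1] for row in design]
effects = main_effects(design, factors, cycles)
```

Because the columns are balanced and mutually orthogonal, the analysis recovers a large negative effect (fewer cycles) for the multiplier, a smaller one for the barrel shifter, and zero for the other factors in this synthetic data.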
Ongoing Work – Design of Experiments Paradigm. Results for a different application.
Ongoing Work – Design of Experiments Paradigm. Interactions among parameters also automatically determined.
Ongoing Work – System Synthesis. Given N applications, create a customized soft core for each app. Criteria: meet size constraint, minimize total applications' runtime. Other criteria possible (e.g., meet runtime constraint, minimize size). Present focus of Ryan Mannion, 3rd yr Ph.D. [Figure: App1, App2, ..., AppN mapped to customized soft cores (Microblaze, PicoBlaze) with optional units such as Mul, I$, FPU, Div.]
Ongoing Work – System Synthesis. Presently use an Integer Linear Program. Solutions for a large set of Xilinx devices generated in seconds. Graduate student: Ryan Mannion, 3rd yr Ph.D.
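The slides do not show the integer linear program itself. One plausible formulation of "meet size constraint, minimize total applications' runtime" is sketched below; the symbols are assumptions introduced here: $x_{i,c}$ selects core configuration $c$ for application $i$, $t_{i,c}$ is that application's runtime on that configuration, $s_c$ is the configuration's size in equivalent LUTs, and $S$ is the device's size budget.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{i=1}^{N} \sum_{c \in C} t_{i,c}\, x_{i,c} \\
\text{subject to} \quad & \sum_{c \in C} x_{i,c} = 1 \quad \text{for } i = 1, \dots, N
                          && \text{(one configuration per application)} \\
                        & \sum_{i=1}^{N} \sum_{c \in C} s_{c}\, x_{i,c} \le S
                          && \text{(total size fits the device)} \\
                        & x_{i,c} \in \{0, 1\}
\end{align*}
```

Swapping the objective and the size constraint gives the alternative criterion from the previous slide (meet a runtime constraint, minimize size).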
Outline. Two UCR ICCAD’06 papers: Microblaze customization; Microblaze conjoining (and customization). Current work targeting Microblaze users: “Design of Experiments” paradigm; system-level synthesis for multi-core systems. Related FPGA work: Warp processing; standard binaries for FPGAs.
Binary-Level Synthesis. Binary-level FPGA compiler developed (Greg Stitt, Ph.D. UCR 2007). Source-level FPGA compiler provides a limited solution. Binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information. [Figure: sources (C++, Java, asm, M, obj) pass through compiler, assembler, and linker to produce a microprocessor binary; the binary-level FPGA compiler takes the microprocessor binary and produces an FPGA binary and a microprocessor binary.]
Binary Synthesis Competitive with Source Level. Aggressive decompilation recovers most high-level constructs needed for good synthesis, which makes binary-level synthesis competitive with source level. Freescale H264 decoder example, from ISSS/CODES 2005.
Binary Synthesis Enables Dynamic Hardware/Software Partitioning. Called “Warp Processing” (Vahid/Stitt/Lysecky). Direct collaborators: Intel, IBM, and Freescale. [Figure: a chip or board containing a microprocessor, an FPGA, and an on-chip binary-level FPGA compiler; a downloader loads the microprocessor binary, and the on-chip compiler produces an FPGA binary and a microprocessor binary.]
Warp Processing Idea, step 1: Initially, software binary loaded into instruction memory. [Figure: µP with I Mem and D$, FPGA, profiler, and on-chip CAD.]
Software binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Warp Processing Idea, step 2: Microprocessor executes instructions in software binary. [Figure: same system diagram and software binary as in step 1.]
Warp Processing Idea, step 3: Profiler monitors instructions and detects critical regions in binary. [Figure: same system diagram and software binary; the profiler observes the add and beq instructions; critical loop detected.]
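The slide does not describe how the profiler finds critical regions. One common heuristic, used here as an assumption rather than as the actual warp profiler design, is to count taken backward branches, since each one closes a loop iteration. The trace format of (address, taken-branch target or None) is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

def hottest_loop(trace: Iterable[Tuple[int, Optional[int]]]) -> Optional[Tuple[int, int]]:
    """Return (loop_start, loop_end) of the most frequently taken backward branch."""
    counts: Counter = Counter()
    for pc, target in trace:
        if target is not None and target < pc:   # taken backward branch
            counts[(target, pc)] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Execution trace of the slide's binary, with instructions at addresses 0..8:
# 0-1 are the Mov instructions, 2-6 the loop body, 7 the Beq, 8 the Ret.
trace = [(0, None), (1, None)]
for iteration in range(10):
    trace += [(pc, None) for pc in range(2, 7)]
    last = iteration == 9
    trace.append((7, None if last else 2))       # branch back to 'loop' at address 2
trace.append((8, None))

critical = hottest_loop(trace)
```

For the example binary this reports the region from the `loop:` label to the branch instruction, which is the loop the later steps hand to the on-chip CAD.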
Warp Processing Idea, step 4: On-chip CAD reads in critical region. [Figure: same system diagram and software binary; the critical loop passes from the profiler to the on-chip CAD.]
Warp Processing Idea, step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG). [Figure: same system diagram and software binary; the on-chip CAD is labeled Dynamic Part. Module (DPM).]
Decompiled region:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
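To make the decompilation step concrete, the sketch below emulates the instruction sequence and the decompiled form side by side. It is illustrative only: memory is modeled as a hypothetical address-to-value dictionary, and the loop exit follows the decompiled condition (`reg3 < 10`) shown on the slide.

```python
from typing import Dict

def run_binary(mem: Dict[int, int], reg2: int) -> int:
    """Instruction-by-instruction emulation of the software binary."""
    reg3 = 0                     # Mov reg3, 0
    reg4 = 0                     # Mov reg4, 0
    while True:
        reg1 = reg3 << 1         # Shl reg1, reg3, 1
        reg5 = reg2 + reg1       # Add reg5, reg2, reg1
        reg6 = mem[reg5]         # Ld reg6, 0(reg5)
        reg4 = reg4 + reg6       # Add reg4, reg4, reg6
        reg3 = reg3 + 1          # Add reg3, reg3, 1
        if not (reg3 < 10):      # loop branch, per the decompiled condition
            break
    return reg4                  # Ret reg4

def run_decompiled(mem: Dict[int, int], reg2: int) -> int:
    """The recovered high-level construct: sum 10 two-byte-strided elements."""
    return sum(mem[reg2 + (i << 1)] for i in range(10))

# Hypothetical array of 10 elements starting at address 100, stride 2.
memory = {100 + 2 * i: i * i for i in range(10)}
software_result = run_binary(memory, reg2=100)
decompiled_result = run_decompiled(memory, reg2=100)
```

Once the loop is in the second form, the ten loads and additions have no dependence other than the running sum, which is what lets synthesis build a parallel circuit in the next step.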
Warp Processing Idea, step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit. [Figure: same system diagram, software binary, and decompiled region as in step 5.]
Warp Processing Idea, step 7: On-chip CAD maps circuit onto FPGA. [Figure: same system diagram, software binary, and decompiled region; the adder circuit is mapped onto the FPGA's CLBs and SMs.]
Warp Processing Idea, step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more. [Figure: same system diagram; software-only vs. “warped” execution.]
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
Publications: DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. Patent Pending
Warp Processing Challenges. Two key challenges: Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (G. Stitt) Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? (R. Lysecky) [Figure: µP with I$ and D$, FPGA, profiler, and on-chip CAD; the on-chip CAD flow includes profiling & partitioning, decompilation, synthesis, JIT FPGA compilation (std. HW binary to FPGA binary), and a binary updater that produces the updated microprocessor binary.]
Warp Processors: Performance Speedup (Most Frequent Kernel Only). Average kernel speedup of 41, vs. 21 for Virtex-E. The simplicity of the warp configurable logic architecture (WCLA) results in faster HW circuits. [Chart baseline: SW-only execution.]
Warp Processors: Performance Speedup (Overall, Multiple Kernels). Average speedup of 7.4. Energy reduction of 38% - 94%. Assuming 100 MHz ARM, and fabric clocked at rate determined by synthesis. [Chart baseline: SW-only execution.]
Warp Processors: Speedups Compared with Digital Signal Processor
Warp Processors: Speedups for Multi-Threaded Application Benchmarks. Compelling computing advantage of FPGAs: parallelism from bit level up to processor level, and everywhere in between.
FPGA Ubiquity via Obscurity. Warp processing hides FPGA from languages and tools. ANY microprocessor platform extendible with FPGA. Maintains "ecosystem": application, tool, and architecture developers. New platforms with FPGAs appearing. [Figure: applications, tools, and architectures connected through standard binaries: a standard compiler produces a standard binary, and a translator with SW profiling maps it onto a processor plus FPGA; new processor platforms with FPGA evolving.]
FPGA Standard Binaries? A microprocessor binary represents one form of a "standard binary for FPGAs". Missing is explicit concurrency: parallelism, pipelining, queues, etc. As FPGAs appear in more platforms, might a more general FPGA binary evolve? [Figure: the standard-binary ecosystem (applications, tools, architectures; standard compiler, standard binary, translator with SW profiling, processor plus FPGA) alongside a possible counterpart for FPGAs: SystemC?, standard FPGA compiler, standard FPGA binary?] Ecosystem for FPGAs presently sorely missing.
FPGA Standard Binaries? Translator would make best use of existing FPGA resources. Could even add FPGA, like adding memory, to improve performance. Add more FPGA to your PDA to implement a compute-intensive application? [Figure: the same FPGA binary passes through a translator on each platform: low-end PDA with a small FPGA, 100 sec; high-end PDA with a larger FPGA, 1 sec.]
FPGA Standard Binaries. NSF funding received; Xilinx letter of support was helpful. Graduate student: Scott Sirowy, 2nd year Ph.D.
Future Work – Standard Binary. High-level behavior is turned into binaries by a desktop tool and/or human effort: a temporally-oriented binary (e.g., Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4) OR a spatially-oriented binary (drawn as a circuit of operators).
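Future Work – Standard Binary. [Figure, panels (a)-(d): implementations of T = a*b + c*d on inputs a, b, c, d: a circuit with two multipliers and an adder (intermediate R, output T); a two-state version using one multiplier, an adder, and a mux (s1: R = c*d; s2: T = a*b + R); and a circuit-swapping version split into circuits 1 and 2 (s1: swap in ckt 1; s2: ckt 1 (R = c*d); s3: swap in ckt 2, saving R; s4: ckt 2 (T = a*b + R)).]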
Future Work – Standard Binary. [Figure, panels (a)-(c): T = a*b + c*d as a circuit of two multipliers and an adder (inputs a, b, c, d; intermediate R; output T); as the instruction sequence Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4; and as data-driven objects: if object 1 input data ready, execute, generate output data; likewise for object 2 and object 3.]
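The data-driven view ("if object input data ready, execute, generate output data") can be sketched as a small firing loop. This is an illustration under assumptions, not the proposed binary format: the object table, the name `P` for the product a*b, and the polling scheduler are all hypothetical.

```python
import operator
from typing import Dict, List, Tuple

def run_dataflow(inputs: Dict[str, int]) -> Tuple[Dict[str, int], List[str]]:
    """Fire each object once its input data is ready; return values and firing order."""
    # Output name -> (operation, input names). Deliberately listed out of order
    # to show that readiness of data, not listing order, drives execution.
    objects = {
        "T": (operator.add, ("P", "R")),   # T = P + R
        "P": (operator.mul, ("a", "b")),   # P = a*b
        "R": (operator.mul, ("c", "d")),   # R = c*d
    }
    values = dict(inputs)
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for name, (fn, sources) in objects.items():
            if name not in values and all(s in values for s in sources):
                values[name] = fn(*(values[s] for s in sources))
                fired.append(name)
                progress = True
    return values, fired

values, fired = run_dataflow({"a": 2, "b": 3, "c": 4, "d": 5})
```

The same object table could in principle be run sequentially on a microprocessor, as here, or mapped to parallel hardware where both multiplications fire at once, which is the flexibility the slide attributes to a spatially-oriented binary.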
Future Work – Standard Binary. [Figure: a standard binary, which may be a temporally-oriented binary (TB), a spatially-oriented binary (SB), or a hybrid binary (TB + SB), passes through an exploration tool that uses device-specific information, and then through implementation tools that yield a device-specific binary. Implementation options: circuit swapping SB in FPGA; resynthesizing SB to FPGA; recompiling SB to microprocessor; emulating SB on microprocessor; synthesizing TB to FPGA.]
Conclusions. Soft core customization increasingly important to make best use of limited FPGA resources. Good initial automatic customization results. “Design of Experiments” paradigm looks promising. System-level synthesis may yield very useful MB user tool, perhaps web based. Warp processing and standard FPGA binary work can help make FPGAs ubiquitous. Accomplishments made possible by Xilinx donations and interactions. Continued and close collaboration sought.