Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research Frank Vahid Professor Department of Computer Science and Engineering.

Presentation transcript:

Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
Collaborators: David Sheldon (4th-year UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)

Frank Vahid, UC Riverside 2/57
Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs

Microblaze Customization (ICCAD paper #1)
FPGAs are an increasingly popular software platform
FPGA soft core processor
  Microprocessor synthesized onto FPGA fabric
Soft core customization
  Cores come with configurable parameters
  Xilinx Microblaze comes with several instantiable units: multiplier, barrel shifter, divider, FPU, or cache
  Customization: tuning soft core parameters to a specific application
[Figure: microprocessor with optional units (Mul, BS, Div, FPU, I$); App1 core with Mul and I$; App2 core with Mul, FPU, and Div]

Instantiable Unit Speedups
Instantiating units can yield significant speedups
“base”: Microblaze without any optional units instantiated

Customization Tradeoffs
Data for aifir EEMBC benchmark on Xilinx Microblaze synthesized to Virtex device
2x performance tradeoff
4.5x size tradeoff

“Size” on an FPGA
Defining a circuit’s “size” on an FPGA requires some work
Different resources
  Lookup tables (LUTs)
  Embedded multipliers
  Embedded block RAM (BRAM)
Our solution: Define “equivalent LUTs” for multipliers and BRAM
  Based on total LUTs, multipliers, and BRAMs in a “full” Microblaze
  Later found to closely match Xilinx’s “equivalent gates” concept
[Table: regular LUTs, 18x18 multipliers, and equivalent LUTs for barrel shifter, divider, multiplier, floating point unit, base MB, and full MB. Image courtesy of Xilinx.]

Goal: Customize Soft Core to Minimize Application Runtime
With and without size constraint
Even without size constraint, must take care because some units reduce clock frequency and thus may slow runtime
[Figure annotation: Slower!]

Goal: Customize Soft Core to Minimize Application Runtime
With and without size constraint
Even without size constraint, must take care because some units reduce clock frequency and thus may slow runtime
"Full MB": MB with all units instantiated

Key Problem Related to Core Customization
Problem: Synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes
Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of best one

Two Solution Approaches
Traditional CAD approach
  Pre-characterize using synthesis and execution/simulation, create abstract problem model, solve using CAD exploration algorithms
  Used 0-1 knapsack formulation
Synthesis-in-the-loop approach
  Run synthesis and execute/simulate application while exploring
  More accurate
[Figure: traditional flow from start to finish: pre-characterize (synthesis and execution/simulation, 5-10 executions), model (typically some form of graph), explore; synthesis-in-the-loop flow from start to finish: explore with synthesis and execution/simulation (5-10 iterations)]

Traditional CAD Approach
Map to 0-1 knapsack problem
0-1 knapsack problem
  Given set of items, each with value and weight
  Maximize value of items in a weight-constrained knapsack
Mapping
  Item: instantiable unit
  Value of an item: speedup increment when instantiating the unit, vs. base MB
  Weight of an item: equivalent LUTs
  Knapsack weight constraint: equivalent LUTs constraint
[Figure: “items” (Mul, BS, Div, FPU, I$) and a “knapsack” (App1 microprocessor holding Mul and I$); bar chart of speedup increment for multiplier, barrel shifter, floating point unit, divider, and MCH cache]

Traditional CAD Approach
Unit’s size determined by synthesizing MB with only that unit instantiated
  Requires 5 synthesis runs, +1 for base
  About 1 hour
Unit’s speedup increment determined by comparing runtime of application with and without the unit
Well-known knapsack heuristic uses value/weight ratio, applies dynamic programming. Complexity is O(n*W)
  n: number of items, W: knapsack constraint
  Runtime is negligible (seconds)
Different for every application
[Figure: bar charts of speedup increment, size (equiv LUTs), and 1000*Speedup/Size for multiplier, barrel shifter, floating point unit, divider, and MCH cache]
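The knapsack mapping can be sketched in a few lines of Python. This is a minimal textbook 0-1 knapsack dynamic program, not the tool from the paper; the unit names, speedup increments, and sizes below are made-up placeholders, and sizes are assumed to be integers in some coarse unit (for example, hundreds of equivalent LUTs).

```python
def knapsack_01(items, capacity):
    """Classic O(n*W) 0-1 knapsack.

    items: list of (name, value, weight) with integer weights.
    Returns (best_value, tuple_of_chosen_names).
    """
    # best[w] holds the best (value, chosen) found with total weight <= w
    best = [(0.0, ())] * (capacity + 1)
    for name, value, weight in items:
        # Walk weights downward so each item is used at most once
        for w in range(capacity, weight - 1, -1):
            candidate = best[w - weight][0] + value
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + (name,))
    return best[capacity]


# Hypothetical per-unit data: (unit, speedup increment, size)
units = [
    ("mul", 0.40, 3),
    ("bs", 0.30, 2),
    ("div", 0.05, 1),
    ("fpu", 0.50, 9),
    ("cache", 0.35, 5),
]
value, chosen = knapsack_01(units, capacity=10)
print(sorted(chosen))
```

The table has W+1 entries and each item touches each entry once, which is where the O(n*W) bound on the slide comes from.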

Problem with Traditional CAD Approach
Does not consider interactions among units
Speedup increments may not be additive for given application
  e.g., Mul gives 0.4, BS gives 0.3, but Mul & BS together give 0.6, not 0.7
Thus, not a perfect mapping to the 0-1 knapsack problem
  Because item values (speedup increments) don’t add perfectly
Average pairwise speedup-increment additive inaccuracies for all pairs of benchmarks:
Component        Cache    Floating Point   Divider   Multiplier
Barrel Shifter   5.2 %    1.0 %            0.0 %     10.4 %
Multiplier       6.7 %    1.9 %            26.0 %
Divider          2.9 %    0.0 %
Floating Point   5.1 %
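The non-additivity in the Mul/BS example can be checked with simple arithmetic. The sketch below assumes the gap is measured as predicted minus measured combined increment; the slide does not state the exact metric behind the table, so treat both the function name and the relative-gap formula as illustrative.

```python
def additive_gap(increments, measured_combined):
    """Compare the sum of individual speedup increments with the measured
    increment when all those units are instantiated together."""
    predicted = sum(increments)
    gap = predicted - measured_combined
    return predicted, gap, gap / measured_combined


# Numbers from the slide: Mul 0.4, BS 0.3, measured together 0.6
predicted, gap, relative = additive_gap([0.4, 0.3], 0.6)
print(round(predicted, 3), round(gap, 3), round(relative, 3))
```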

Synthesis-in-the-Loop Approach
View solution space as tree
  Level: Instantiate given unit?
  If gives speedup and fits, instantiate
Use unit speedup/size to order tree
  “Impact-ordered tree” approach
5 synthesis runs, +1 for base, to determine units’ speedup/size
Then requires 5 more synthesis runs to descend through tree
[Figure: bar chart of 1000*Speedup/Size per unit; decision tree with levels barrel shifter, multiplier, divider, FPU, cache and yes/no branches (base to base+bs or base; then base+bs+mul or base+bs, and base+mul or base)]

Synthesis-in-the-Loop Approach
View solution space as tree, each level a decision for a unit; order levels by unit speedup/size for application
11 synthesis runs may take a few hours
To reduce, can consider using predetermined order
  Determined by soft core vendor based on averages over many benchmarks
[Figure: bar chart of 1000*Speedup/Size per unit; application-specific and fixed impact-ordered trees with yes/no branches (e.g., base+div or base); level orders shown: barrel shifter, multiplier, divider, FPU, cache; and divider, barrel shifter, multiplier, FPU, cache]
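The descent through an impact-ordered tree can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' tool: `evaluate` is a hypothetical stand-in for one synthesis plus execution/simulation run, `toy_evaluate` is an invented cost model, and the pre-characterization runs that produce the unit ordering are not modeled.

```python
def impact_ordered_descent(units, evaluate, size_limit=None):
    """Walk the tree one level per unit, keeping a unit only if the
    configuration still fits and the runtime improves.

    units: unit names ordered by speedup/size impact, highest first.
    evaluate(config) -> (runtime, size); one call models one synthesis run.
    """
    config = frozenset()
    best_runtime, _ = evaluate(config)  # base configuration
    runs = 1
    for unit in units:
        trial = config | {unit}
        runtime, size = evaluate(trial)
        runs += 1
        fits = size_limit is None or size <= size_limit
        if fits and runtime < best_runtime:
            config, best_runtime = trial, runtime
    return config, best_runtime, runs


def toy_evaluate(config):
    """Invented model: bs and mul help, div only lowers the clock."""
    runtime = 100.0
    if "bs" in config:
        runtime *= 0.7
    if "mul" in config:
        runtime *= 0.8
    if "div" in config:
        runtime *= 1.05
    size = 1000 + sum({"bs": 200, "mul": 300, "div": 150}[u] for u in config)
    return runtime, size


config, runtime, runs = impact_ordered_descent(["bs", "mul", "div"], toy_evaluate, size_limit=1400)
print(sorted(config), runs)
```

Each level costs exactly one evaluation, which matches the slide's count of one extra synthesis run per unit during the descent.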

Synthesis-in-the-Loop Approach
Data for fixed impact-ordered tree for 11 EEMBC benchmarks
[Figure: bar charts of speedup, size (equiv LUTs), and speedup/size for multiplier, barrel shifter, FPU, divider, and cache]

Customization Results
Fixed tree approach generally best
App-spec tree better for certain apps, but 2x runtime
Results are averages for 11 EEMBC benchmarks (ICCAD'06, David Sheldon et al.)
[Figure: speedup and tool run time (minutes) for fixed-order impact-ordered tree, application-specific impact-ordered tree, random impact-ordered tree, exhaustive, and knapsack, in four settings: no size constraint, Virtex II; size constraint = 80% of full MB size, Virtex II; size constraint = 80% of application-specific optimal MB configuration (guaranteed to “hurt”), Virtex II; no size constraint, Spartan2 device]

Conjoined Processors (ICCAD paper #2)
Conjoined processors
  Two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen ISCA 2004)
  Showed little performance overhead for desktop processors
  Only research customer is Intel; for soft core processors, research customers are every soft core user
How much size savings and performance overhead for conjoined Microblazes?
[Figure: unconjoined: Processor 1 and Processor 2, each with its own multiplier; conjoined: Processor 1 and Processor 2 sharing one multiplier]

Conjoined Processors – Size Savings

Conjoined Processors – Performance Overhead
We created a trace simulator
  Reads two instruction traces output by MB simulator
  Adds 1-cycle delay for every access to a conjoined unit (pessimistic assumption about contention detection scheme)
  Looks for simultaneous access of shared unit, stalls one MB entirely until unit becomes available
Configuration            Cycle Latency
Barrel Shifter           2
Divider                  34
Multiplier               3
FPU (Add, Sub, Mul)      6
FPU (Div)                30
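A toy version of such a trace simulator is sketched below. It is an assumption-laden simplification, not the authors' simulator: each trace is just a list of unit names (or `None` for an ordinary instruction, assumed to take 1 cycle), the processor that requests first wins the shared unit, and only three latencies from the table are included.

```python
LATENCY = {"bs": 2, "div": 34, "mul": 3}  # cycle latencies taken from the table


def simulate(trace_a, trace_b, shared):
    """Return the finish cycle of each processor.

    trace_a, trace_b: lists of unit names, or None for a 1-cycle instruction.
    shared: set of unit names conjoined between the two processors.
    """
    traces = [trace_a, trace_b]
    pc = [0, 0]        # next instruction index per processor
    ready = [0, 0]     # cycle at which each processor can issue again
    unit_free = {u: 0 for u in shared}
    while pc[0] < len(trace_a) or pc[1] < len(trace_b):
        # Issue from the processor that is ready earliest (ties go to processor 0)
        candidates = [p for p in (0, 1) if pc[p] < len(traces[p])]
        p = min(candidates, key=lambda q: (ready[q], q))
        op = traces[p][pc[p]]
        start = ready[p]
        if op is None:
            cost = 1
        elif op in shared:
            start = max(start, unit_free[op])  # stall until the shared unit is free
            cost = LATENCY[op] + 1             # pessimistic 1-cycle contention check
            unit_free[op] = start + cost
        else:
            cost = LATENCY[op]
        ready[p] = start + cost
        pc[p] += 1
    return tuple(ready)


a = ["mul", None]
b = ["mul", None]
print(simulate(a, b, shared=set()), simulate(a, b, shared={"mul"}))
```

Comparing the two calls gives the performance overhead of conjoining for that pair of traces.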

Conjoined Processors – Performance Overhead
Data shown for benchmarks that benefit (>1.3x speedup) from barrel shifter
Performance overheads are small
[Figure: speedup, conjoined vs. unconjoined, for the pairs (brev),canrdr; brev,(canrdr); (brev),bitmnp; brev,(bitmnp); (brev),brev; brev,(brev); (bitmnp),canrdr; bitmnp,(canrdr); (bitmnp),bitmnp; bitmnp,(bitmnp); (canrdr),canrdr; canrdr,(canrdr)]

Performance overhead for all benchmark pairs

Customization Considering Conjoinment
Developed 0-1 knapsack approach, “Disjunctively-Constrained Knapsack Solution”, to accommodate conjoinment
Note: To avoid exaggerating the benefits of conjoinment, data only considers benchmark pairs that significantly use a shared unit
Only 8 pairings shown due to space limits (ICCAD'06, David Sheldon et al.)
[Figure: speedup and size (equiv. LUTs) for knapsack, exhaustive w/ conj., and exhaustive w/o conj., for the pairs BaseFP01,BaseFP01; BaseFP01,bitmnp; BaseFP01,canrdr; bitmnp,bitmnp; canrdr,canrdr; tblook,bitmnp; tblook,canrdr; tblook,tblook; and the average]

Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs

Ongoing Work – Design of Experiments Paradigm
"Design of Experiments" (DOE)
  Well-established discipline (>80 yrs) for tuning parameters
  For factories, crops, management, etc.
  Want to set parameter values for best output
  But each experiment costly, so can't try all combinations
Clear mapping of soft core customization to DOE problem
  Given parameters and # of possible experiments
  Generates which experiments to run (parameter values)
  Analyzes resulting data
Sound mathematical foundations
Present focus of David Sheldon (4th-year Ph.D. student)

Ongoing Work – Design of Experiments Paradigm
Suppose time for 12 experiments
DOE tool generates which 12 experiments to run
User fills in results column
[Figure: experiment table with results column "Cycles Y"]

Ongoing Work – Design of Experiments Paradigm
DOE tool analyzes results
Finds most important factors for given application
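One common way a DOE analysis ranks factors is by main effect: the mean response with a factor on minus the mean response with it off. The sketch below illustrates that idea on a small two-level full-factorial design. The slides do not say which design or analysis the DOE tool uses, and the cycle counts here come from an invented formula, so this is only an illustration of the concept.

```python
from itertools import product


def main_effects(factors, runs):
    """runs maps a tuple of 0/1 levels (one per factor) to a measured response.
    Returns the main effect of each factor."""
    effects = {}
    for i, name in enumerate(factors):
        on = [y for levels, y in runs.items() if levels[i] == 1]
        off = [y for levels, y in runs.items() if levels[i] == 0]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects


factors = ["mul", "bs", "cache"]


def fake_cycles(mul, bs, cache):
    """Invented response standing in for a measured cycle count."""
    return 1000 - 300 * mul - 50 * bs - 100 * cache


# Full factorial over three two-level factors: 8 experiments
runs = {levels: fake_cycles(*levels) for levels in product((0, 1), repeat=3)}
effects = main_effects(factors, runs)
ranked = sorted(effects, key=lambda f: abs(effects[f]), reverse=True)
print(ranked)
```

With a fractional design (such as the 12-experiment case on the previous slide) the same calculation applies to whichever runs were performed, provided the design is balanced.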

Ongoing Work – Design of Experiments Paradigm
Results for a different application

Ongoing Work – Design of Experiments Paradigm
Interactions among parameters also automatically determined

Ongoing work – System synthesis
Given N applications
Create customized soft core for each app
Criteria: Meet size constraint, minimize total applications' runtime
  Other criteria possible (e.g., meet runtime constraint, minimize size)
Present focus of Ryan Mannion (3rd-year Ph.D. student)
[Figure: App1, App2, ..., AppN mapped to cores; cores shown include a Microblaze with Mul and I$, a PicoBlaze, and a Microblaze with Mul, FPU, and Div]

Ongoing work – System synthesis
Presently use Integer Linear Program
Solutions for large set of Xilinx devices generated in seconds
Graduate student: Ryan Mannion (3rd-year Ph.D. student)
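The selection problem (one core configuration per application, total size bounded, total runtime minimized) can be illustrated without an ILP solver by brute-force enumeration. The sketch below is not the ILP formulation used in the work; it only makes the objective and constraint concrete, and all labels, runtimes, and sizes are hypothetical.

```python
from itertools import product


def pick_cores(options, size_limit):
    """options: one list per application of (label, runtime, size) candidates.
    Returns (total_runtime, labels) for the best feasible choice, or None."""
    best = None
    for combo in product(*options):
        size = sum(c[2] for c in combo)
        runtime = sum(c[1] for c in combo)
        if size <= size_limit and (best is None or runtime < best[0]):
            best = (runtime, tuple(c[0] for c in combo))
    return best


# Hypothetical candidates for two applications
options = [
    [("base", 100, 10), ("base+mul", 60, 14)],
    [("base", 80, 10), ("base+fpu", 30, 25)],
]
print(pick_cores(options, size_limit=40))
print(pick_cores(options, size_limit=36))
```

Enumeration grows exponentially with the number of applications, which is one reason an ILP or similar formulation is attractive for larger systems.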

Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  Warp processing
  Standard binaries for FPGAs

Binary-Level Synthesis
Binary-level FPGA compiler developed (Greg Stitt, Ph.D. UCR 2007)
Source-level FPGA compiler provides a limited solution
Binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information
[Figure: sources (C++, Java, asm, obj) pass through compiler, assembler, and linker to a microprocessor binary; the binary-level FPGA compiler turns that binary into an FPGA binary and a microprocessor binary]

Binary Synthesis Competitive with Source Level
Aggressive decompilation recovers most high-level constructs needed for good synthesis
Makes binary-level synthesis competitive with source level
Freescale H264 decoder example, from ISSS/CODES 2005

Binary Synthesis Enables Dynamic Hardware/Software Partitioning
Called “Warp Processing” (Vahid/Stitt/Lysecky)
Direct collaborators: Intel, IBM, and Freescale
[Figure: chip or board containing a microprocessor, an FPGA, and an on-chip binary-level FPGA compiler; a downloader supplies the microprocessor binary, and the compiler produces an FPGA binary and a microprocessor binary]

Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
Software binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
[Figure: µP with I Mem and D$, profiler, FPGA, and on-chip CAD]

Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary

Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: profiler observes the add and beq instructions; critical loop detected]
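A software model of loop detection might count taken backward branches and report the most frequent one. This is an assumption about how a profiler could work, offered only as a sketch; the slide does not describe the mechanism, the real profiler is on-chip hardware, and the addresses below are invented.

```python
from collections import Counter


def hottest_loop(branch_trace):
    """branch_trace: iterable of (branch_address, target_address) for taken branches.
    Returns ((target, branch), count) for the most frequent backward branch,
    or None if no backward branch was seen."""
    counts = Counter(
        (target, source) for source, target in branch_trace if target <= source
    )
    if not counts:
        return None
    return counts.most_common(1)[0]


# Hypothetical trace: a short loop taken 9 times, one forward branch,
# and another loop taken twice
trace = [(7, 2)] * 9 + [(20, 30)] + [(40, 35)] * 2
print(hottest_loop(trace))
```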

Warp Processing Idea
Step 4: On-chip CAD reads in critical region

Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
Decompiled CDFG:
  reg3 := 0
  reg4 := 0
  loop:
  reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
[Figure labels: µP, I Mem, D$, profiler, FPGA, Dynamic Part. Module (DPM)]
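To see what decompilation recovers, the register-level CDFG can be modeled next to the high-level code it corresponds to. The sketch below assumes `reg2` holds the base address of an array of 2-byte elements (suggested by the shift by 1) and models memory as a dictionary; the base address and data are made up.

```python
def run_cdfg(mem, reg2):
    """Register-level model of the decompiled loop shown on the slide."""
    reg3, reg4 = 0, 0
    while True:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)]
        reg3 = reg3 + 1
        if not (reg3 < 10):
            break
    return reg4


def high_level(a):
    """What the loop amounts to: a sum over ten array elements."""
    return sum(a[i] for i in range(10))


a = list(range(1, 11))   # hypothetical array contents
base = 0x100             # hypothetical base address
mem = {base + 2 * i: value for i, value in enumerate(a)}
print(run_cdfg(mem, base), high_level(a))
```

Recovering the loop bound and the array access pattern is what lets synthesis unroll or pipeline the loop rather than treat it as opaque register traffic.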

Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit

Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: adder circuit mapped onto FPGA CLB and SM blocks]

Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4
[Figure: software-only vs. “warped” execution]
DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. Patent Pending

Warp Processing Challenges
Two key challenges
  Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (G. Stitt)
  Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? (R. Lysecky)
[Figure: µP with I$ and D$, FPGA, profiler, and on-chip CAD; flow stages labeled binary, decompilation, profiling & partitioning, synthesis, std. HW binary, JIT FPGA compilation, FPGA binary, binary updater, microprocessor binary]

Warp Processors Performance Speedup (Most Frequent Kernel Only)
Average kernel speedup of 41, vs. 21 for Virtex-E
WCLA simplicity results in faster HW circuits
[Figure: speedups relative to SW-only execution]

Warp Processors Performance Speedup (Overall, Multiple Kernels)
Average speedup of 7.4
Energy reduction of 38% - 94%
Assuming 100 MHz ARM, and fabric clocked at rate determined by synthesis
[Figure: speedups relative to SW-only execution]

Warp Processors Speedups Compared with Digital Signal Processor

Warp Processors Speedups for Multi-Threaded Application Benchmarks
Compelling computing advantage of FPGAs: Parallelism from bit level up to processor level, and everywhere in between

FPGA Ubiquity via Obscurity
Warp processing hides FPGA from languages and tools
  ANY microprocessor platform extendible with FPGA
  Maintains "ecosystem": application, tool, and architecture developers
New platforms with FPGAs appearing
New processor platforms with FPGA evolving
[Figure: standard compiler produces a standard binary; the binary passes through SW profiling and a translator onto a processor plus FPGA; standard binaries tie together architectures, applications, and tools]

FPGA Standard Binaries?
Microprocessor binary represents one form of a "standard binary for FPGAs"
  Missing is explicit concurrency
  Parallelism, pipelining, queues, etc.
As FPGAs appear in more platforms, might a more general FPGA binary evolve?
Ecosystem for FPGAs presently sorely missing
[Figure: the standard-binary ecosystem (standard compiler, standard binary, SW profiling, translator, processor plus FPGA; architectures, applications, tools) next to a proposed counterpart: standard FPGA compiler, standard FPGA binary (binary SystemC?), standard FPGA binaries]

FPGA Standard Binaries?
Translator would make best use of existing FPGA resources
Could even add FPGA, like adding memory, to improve performance
  Add more FPGA to your PDA to implement compute-intensive application?
[Figure: the same FPGA binary goes through a translator on a low-end PDA (100 sec) and on a high-end PDA with more FPGA (1 sec)]

FPGA Standard Binaries
NSF funding received; Xilinx letter of support was helpful
Graduate student: Scott Sirowy (2nd-year Ph.D. student)

Future Work – Standard Binary
[Figure: high-level behavior is turned, by desktop tool and/or human effort, into binaries: a temporally-oriented binary (Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4) OR a spatially-oriented binary (multiplier circuit)]

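Future Work – Standard Binary
[Figure, panels (a)-(d): behavior T = a*b + c*d as a circuit of two multipliers (inputs a, b and c, d) and an adder; partition into circuit 1 (R = c*d) and circuit 2 (T = a*b + R); schedule s1: swap in ckt 1, s2: ckt 1 (R = c*d), s3: swap in ckt 2, saving R, s4: ckt 2 (T = a*b + R); two-state version (s1: R = c*d, s2: T = a*b + R) sharing one multiplier and one adder through a mux]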

Future Work – Standard Binary
[Figure, panels (a)-(c): circuit for T = a*b + c*d (two multipliers feeding an adder, intermediate R); instruction sequence Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4; object rules: if object 1 input data ready, execute, generate output data; if object 2 input data ready, execute, generate output data; if object 3 input data ready, execute, generate output data]
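The "if input data ready, execute, generate output data" rule is a dataflow firing rule, and it can be sketched in Python for T = a*b + c*d. This is an illustrative model only; the object encoding and the names `r1`, `r4`, and `T` are choices made here, not a proposed binary format.

```python
import operator


def run_dataflow(objects, inputs):
    """objects: list of (output_name, function, input_names).
    An object fires once all of its inputs are available."""
    values = dict(inputs)
    pending = list(objects)
    order = []
    while pending:
        ready = [o for o in pending if all(name in values for name in o[2])]
        if not ready:
            raise ValueError("no object has its inputs ready")
        # Objects that are ready together could execute in parallel in hardware
        for out, fn, ins in ready:
            values[out] = fn(*(values[name] for name in ins))
            order.append(out)
        pending = [o for o in pending if o[0] not in values]
    return values, order


# The three objects, deliberately listed out of order
objects = [
    ("T", operator.add, ("r1", "r4")),
    ("r1", operator.mul, ("a", "b")),
    ("r4", operator.mul, ("c", "d")),
]
values, order = run_dataflow(objects, {"a": 2, "b": 3, "c": 4, "d": 5})
print(values["T"], order)
```

Because firing depends only on data availability, the listing order of the objects does not matter, which is the property that separates this form from the sequential Mul/Mul/Add listing.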

Future Work – Standard Binary
[Figure: a standard binary (temporally-oriented binary (TB), spatially-oriented binary (SB), or hybrid binary (TB + SB)) and device-specific information feed an exploration tool; implementation tools then produce a device-specific binary by circuit swapping SB in FPGA, resynthesizing SB to FPGA, recompiling SB to microprocessor, emulating SB on microprocessor, or synthesizing TB to FPGA]

Conclusions
Soft core customization increasingly important to make best use of limited FPGA resources
  Good initial automatic customization results
  “Design of Experiments” paradigm looks promising
  System-level synthesis may yield very useful MB user tool, perhaps web based
Warp processing and standard FPGA binary work can help make FPGAs ubiquitous
Accomplishments made possible by Xilinx donations and interactions
Continued and close collaboration sought