The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems Frank Vahid Professor Department of Computer Science and Engineering.

Slides:



Advertisements
Similar presentations
Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.
Advertisements

Digital Design Copyright © 2006 Frank Vahid 1 FPGA Internals: Lookup Tables (LUTs) Basic idea: Memory can implement combinational logic –e.g., 2-address.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
The Warp Processor Dynamic SW/HW Partitioning David Mirabito A presentation based on the published works of Dr. Frank Vahid - Principal Investigator Dr.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.
ENGIN112 L38: Programmable Logic December 5, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 38 Programmable Logic.
Application-Specific Customization of Parameterized FPGA Soft-Core Processors David Sheldon a, Rakesh Kumar b, Roman Lysecky c, Frank Vahid a*, Dean Tullsen.
A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering.
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department.
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
Dynamic FPGA Routing for Just-in-Time Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer Science and Engineering.
Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.
Warp Processors Towards Separating Function and Architecture Frank Vahid Professor Department of Computer Science and Engineering University of California,
Warp Processor: A Dynamically Reconfigurable Coprocessor Frank Vahid Professor Department of Computer Science and Engineering University of California,
Warp Processors (a.k.a. Self-Improving Configurable IC Platforms) Frank Vahid (Task Leader) Department of Computer Science and Engineering University of.
"Standard Binaries for FPGAs" & "eBlocks" Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.
Portability for FPGA Applications—Warp Processing and SystemC Bytecode Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ.
Configurable System-on-Chip: Xilinx EDK
Warp Processing – Towards FPGA Ubiquity Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 08: RC Principles: Software (1/4) Prof. Sherief Reda.
Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits Frank Vahid Professor Department of Computer Science and Engineering University.
Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
"Standard Binaries for FPGAs" & "eBlocks" Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.
Warp Processors Towards Separating Function and Architecture Frank Vahid Professor Department of Computer Science and Engineering University of California,
Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research Frank Vahid Professor Department of Computer Science and Engineering.
UCB November 8, 2001 Krishna V Palem Proceler Inc. Customization Using Variable Instruction Sets Krishna V Palem CTO Proceler Inc.
Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators Greg Stitt Dept. of ECE University of Florida This research was supported in part.
1 Introduction A digital circuit design is just an idea, perhaps drawn on paper We eventually need to implement the circuit on a physical device –How do.
Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
Automatic Tuning of Two-Level Caches to Embedded Applications Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University.
Frank Vahid, UC Riverside 1 Self-Improving Configurable IC Platforms Frank Vahid Associate Professor Dept. of Computer Science and Engineering University.
CS 151 Digital Systems Design Lecture 38 Programmable Logic.
Dynamic Hardware/Software Partitioning: A First Approach Greg Stitt, Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University.
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
Dynamic Hardware/Software Partitioning: A First Approach Authors -Greg Stitt, Roman Lysecky, Frank Vahid Presented By : Aditya Kanawade Guru Sharan 1.
Cisc Complex Instruction Set Computing By Christopher Wong 1.
A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside.
Systolic Architectures: Why is RC fast? Greg Stitt ECE Department University of Florida.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Hardware/Software Co-design Design of Hardware/Software Systems A Class Presentation for VLSI Course by : Akbar Sharifi Based on the work presented in.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
Introduction to Reconfigurable Computing Greg Stitt ECE Department University of Florida.
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
Ted Pedersen – CS 3011 – Chapter 10 1 A brief history of computer architectures CISC – complex instruction set computing –Intel x86, VAX –Evolved from.
Exploiting Parallelism
Codesigned On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also.
WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.
Scott Sirowy, Chen Huang, and Frank Vahid † Department of Computer Science and Engineering University of California, Riverside {ssirowy,chuang,
On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the.
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
1 Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering.
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer.
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.
Introduction to Reconfigurable Computing
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Autonomously Adaptive Computing: Coping with Scalability, Reliability, and Dynamism in Future Generations of Computing Roman Lysecky Department of Electrical.
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
HIGH LEVEL SYNTHESIS.
Dynamic FPGA Routing for Just-in-Time Compilation
Dynamic Hardware/Software Partitioning: A First Approach
Warp Processor: A Dynamically Reconfigurable Coprocessor
Portable SystemC-on-a-Chip
Automatic Tuning of Two-Level Caches to Embedded Applications
Presentation transcript:

The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale Contributing Students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), David Sheldon (3 rd yr PhD), Ryan Mannion (2 nd yr PhD), Scott Sirowy (1 st yr PhD)

Frank Vahid, UC Riverside2/34 Outline FPGAs – The New Software Why they’re great Why they’re not ubiquitous yet Hiding FPGAs from programmers Warp processing Binary decompilation Just-in-time FPGA compilation Towards Standard Binaries for FPGAs

Frank Vahid, UC Riverside3/34 FPGAs FPGA -- Field-Programmable Gate Array Implement circuit by downloading bits N-address memory (“LUT”) implements N-input combinational logic Register-controlled switch matrix (SM) connects LUTs FPGA fabric Thousands of LUTs and SMs, increasingly additional hard core components like multipliers, RAM, etc. CAD tools automatically map desired circuit onto FPGA fabric ab a1a0a1a0 4x2 Memory abab d 1 d 0 F G Implement circuit by downloading particular bits LUT FG 2x2 switch matrix x y a b FPGA SM LUT SM LUT

Frank Vahid, UC Riverside4/34 FPGAs are "Programmable" like Microprocessors – Just Download Bits Processor … … 0010 … Bits loaded into program memory Microprocessor Binaries … Bits loaded into LUTs and SMs FPGA "Binaries" Processor FPGA 0111 … More commonly known as "bitstream" "Software" "Hardware"

Frank Vahid, UC Riverside5/34 FPGA – Why (Sometimes) Better than Microprocessor x = (x >>16) | (x <<16); x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00); x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0); x = ((x >> 2) & 0x ) | ((x << 2) & 0xcccccccc); x = ((x >> 1) & 0x ) | ((x << 1) & 0xaaaaaaaa); C Code for Bit Reversal sll $v1[3],$v0[2],0x10 srl $v0[2],$v0[2],0x10 or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x8 and $v1[3],$v1[3],$t5[13] sll $v0[2],$v0[2],0x8 and $v0[2],$v0[2],$t4[12] or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x4 and $v1[3],$v1[3],$t3[11] sll $v0[2],$v0[2],0x4 and $v0[2],$v0[2],$t2[10]... Binary Compilation Processor Requires between 32 and 128 cycles Circuit for Bit Reversal Bit Reversed X Value Original X Value Processor FPGA Requires only 1 cycle (speedup of 32x to 128x)

Frank Vahid, UC Riverside6/34 for (i=0; i < 128; i++) y[i] += c[i] * x[i].. FPGA: Why (Sometimes) Better than Microprocessor for (i=0; i < 128; i++) y[i] += c[i] * x[i].. ************ C Code for FIR Filter Processor 1000’s of instructions Several thousand cycles Circuit for FIR Filter Processor FPGA ~ 7 cycles Speedup > 100x In general, FPGA better due to circuit's concurrency, from bit-level to task level

Frank Vahid, UC Riverside7/34 Extensive Studies over Past Decade Large speedups on many important applications See ACM/SIGDA Int. Symp. on FPGAs So why aren't FPGAs ubiquitous?

Frank Vahid, UC Riverside8/34 Why FPGAs aren’t Ubiquitous Cost – But improving yearly Power – But improving yearly, and energy benefits too Extra chip – But integration continues Programming methodology Source: Xilinx 1 million system gate FPGA cost

Frank Vahid, UC Riverside9/34 Why FPGAs aren’t Mainstream Cost Power Extra chip Programming methodology Though tremendous progress in past decade Implementation Assembly code Microprocessor binaryFPGA binary Logic equations / FSMs Register transfers Compilation (1960s, 1970s) Assembling, linking (1950s, 1960s) Behavioral synthesis (1990s) RT synthesis (1980s, 1990s) Logic synthesis, physical design (1970s, 1980s) MicroprocessorsFPGA circuits Automated hardware/software partitioning C/C++/JavaC/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C... Application (C/C++/Java/SystemC/Handel-C/Streams-C/…) Downloading

Frank Vahid, UC Riverside10/34 So What’s the Holdup? FPGAs require special compilers Limits adoption – desktop world dominates 100 software writers for every CAD user Millions of compiler seats worldwide, vs. 15,000 CAD seats Can't ignore "ecosystem" from separation of applications, tools, and architectures Just consider history of popular processors Binary Applic. Standard Compiler Binary FPGA Binary Microproc Binary FPGAProc. Includes synthesis, tech. map, place & route Special Compiler Architectures Applications Tools Standard binaries

Frank Vahid, UC Riverside11/34 Outline FPGAs – The New Software Why they’re great Why they’re not ubiquitous yet Hiding FPGAs from programmers Warp processing Binary decompilation Just-in-time FPGA compilation Towards Standard Binaries for FPGAs

Frank Vahid, UC Riverside12/34 Can we Hide FPGAs from Programmers and Standard Tools? Example Radically different x86 architectures hidden from programmers and tools All execute standard x86 binaries On-chip tools dynamically translate binary to particular architecture Idea: Hide FPGA from programmers and tools Download standard binary Have on-chip tools dynamically translate binary (portions) to FPGA We call this Warp Processing Binary SW Profiling Standard Compiler Binary Traditional partitioning done here RISC architecture Translator VLIW architecture Translator FPGAProc. Translator

Frank Vahid, UC Riverside13/34 µPµP FPGA On-chip CAD Warp Processing Idea Profiler Initially, software binary loaded into instruction memory 1 I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary

Frank Vahid, UC Riverside14/34 µPµP FPGA On-chip CAD Warp Processing Idea Profiler I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Microprocessor executes instructions in software binary 2 µPµP

Frank Vahid, UC Riverside15/34 µPµP FPGA On-chip CAD Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Profiler monitors instructions and detects critical regions in binary 3 Profiler add beq Critical Loop Detected

Frank Vahid, UC Riverside16/34 µPµP FPGA On-chip CAD Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD reads in critical region 4 Profiler On-chip CAD

Frank Vahid, UC Riverside17/34 µPµP FPGA Dynamic Part. Module (DPM) Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD decompiles critical region into control data flow graph (CDFG) 5 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0

Frank Vahid, UC Riverside18/34 µPµP FPGA Dynamic Part. Module (DPM) Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit 6 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 :=

Frank Vahid, UC Riverside19/34 µPµP FPGA Dynamic Part. Module (DPM) Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD maps circuit onto FPGA 7 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := FPGA CLB SM ++

Frank Vahid, UC Riverside20/34 µPµP FPGA Dynamic Part. Module (DPM) Warp Processing Idea Profiler µPµP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary8 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := CLB SM ++ FPGA On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more Mov reg3, 0 Mov reg4, 0 loop: // instructions that interact with FPGA Ret reg4 FPGA Software-only “Warped”

Frank Vahid, UC Riverside21/34 Warp Processing Challenges Two key challenges Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? µPµP I$ D$ FPGA Profiler On-chip CAD Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr Binary Std. HW Binary JIT FPGA compilation

Frank Vahid, UC Riverside22/34 Decompilation If we don't decompile High-level information (e.g., loops, arrays) lost during compilation Direct translation of assembly to circuit – big overhead Need to recover high-level information Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone

Frank Vahid, UC Riverside23/34 Decompilation Solution – Recover high-level information from binary: decompilation Adapted extensive previous work (for different purposes) Developed new decompilation methods also Ph.D. work of Greg Stitt (Ph.D. UCR 2006) Numerous publications: Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } loop: reg1 := reg3 << 1 reg5 := reg2 + reg1 reg6 := mem[reg5 + 0] reg4 := reg4 + reg6 reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Control/Data Flow Graph Creation Original C Code Corresponding Assembly loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Data Flow Analysis long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; } Function Recovery long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; } Control Structure Recovery long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; } Array Recovery Almost Identical Representations Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation

Frank Vahid, UC Riverside24/34 Decompilation Results vs. C Compared with synthesis from C Synthesis after decompilation often quite similar Almost identical performance, small area overhead FPGA 2005

Frank Vahid, UC Riverside25/34 Decompilation Results on Optimized H.264 In-depth Study with Freescale Used highly-optimized benchmark Results: Binary approach competitive Speedups compared to ARM9 software Binary: 2.48, C: 2.53 Decompilation recovered nearly all high-level information needed for partitioning and synthesis

Frank Vahid, UC Riverside26/34 Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal Interesting discovery during H264 study – C style limited speedup Orthogonal to binary vs. C issue – coding style hurt both Developed simple coding guidelines Rewritten software: 20 minutes, and only ~3% slower than original New speedups: Binary: 6.55, C: 6.56 Binary still competitive with C Following guidelines not required, but helps any approach targeting FPGAs

Frank Vahid, UC Riverside27/34 Warp Processing Challenges Two key challenges Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? µPµP I$ D$ FPGA Profiler On-chip CAD Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr Binary Std. HW Binary JIT FPGA compilation

Frank Vahid, UC Riverside28/34 Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA e.g., Our router (ROCR) 10x faster and 20x less memory than popular VPR tool, at cost of 30% longer critical path. Similar results for synth & placement Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona) Numerous publications: JIT FPGA Compilation DAC’04 Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation 60 MB 9.1 s Xilinx ISE 3.6MB 1.4s Riverside JIT FPGA tools on a 75MHz ARM7 3.6MB 0.2 s Riverside JIT FPGA tools

Frank Vahid, UC Riverside29/34 Overall Warp Processing Results Performance Speedup (Most Frequent Kernel Only) Average kernel speedup of 41, vs. 21 for Virtex-E SW Only Execution Simpler FPGA fabric yields faster HW circuits Currently prototyping our simpler FPGA fabric with Intel, scheduled for Q3 shuttle Overall application speedup average is 7.4

Frank Vahid, UC Riverside30/34 Outline FPGAs – The New Software Why they’re great Why they’re not ubiquitous yet Hiding FPGAs from programmers Warp processing Binary decompilation Just-in-time FPGA compilation Towards Standard Binaries for FPGAs

Frank Vahid, UC Riverside31/34 FPGA Ubiquity via Obscurity Warp processing hides FPGA from languages and tools ANY microprocessor platform extendible with FPGA Maintains "ecosystem": application, tool, and architecture developers New platforms with FPGAs appearing FPGAProc. Translator Binary SW Profiling Standard Compiler Binary Standard Binary Architectures Applications Tools Standard binaries New processor platforms with FPGA evolving

Frank Vahid, UC Riverside32/34 FPGA Standard Binaries? Microprocessor binary represents one form of a "standard binary for FPGAs" Missing is explicit concurrency Parallelism, pipelining, queues, etc. As FPGAs appear in more platforms, might a more general FPGA binary evolve? FPGAProc. Translator Binary SW Profiling Standard Compiler Binary Standard Binary Architectures Applications Tools Standard binaries Binary SystemC? Standard FPGA Compiler Binary Standard FPGA binary? Standard FPGA binaries Ecosystem for FPGAs presently sorely missing

Frank Vahid, UC Riverside33/34 FPGA Standard Binaries? Translator makes best use of existing FPGA resources Can even add FPGA, like adding memory, to improve performance Add more FPGA to your PDA to implement compute-intensive application? Binary FPGAProc. Translator FPGA ************ Binary FPGA Binary Translator FPGA Low-end PDA 100 sec Translator FPGA High-end PDA 1 sec

Frank Vahid, UC Riverside34/34 Summary FPGAs may be the new software Hiding FPGA via warp processing is feasible Decompilation can recover high-level constructs to yield speedups competitive with source-level JIT FPGA compilation can be made sufficiently lean Future: Standard binaries for FPGAs? Extensive work to be done Publications can be found at: