Self-Improving Computer Chips – Warp Processing

Presentation transcript:

Self-Improving Computer Chips – Warp Processing. Frank Vahid, Dept. of CS&E, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine. Contributing Ph.D. students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current), ______???__________. This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation, Intel, Freescale, and IBM.

Frank Vahid, UC Riverside 2/45 Self-Improving Cars?

Frank Vahid, UC Riverside 3/45 Self-Improving Chips? Moore’s Law: 2x capacity growth / 18 months.

Frank Vahid, UC Riverside 4/45 Extra Capacity → Multicore. “Heterogeneous Multicore” (Kumar/Tullsen).

Frank Vahid, UC Riverside 5/45 Extra Capacity → FPGAs. Xilinx, Altera, …; Cray, SGI; Mitrionics; AMD Opteron; Intel QuickAssist; IBM Cell (research). [Figures: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Cray XD1 (source: FPGA Journal, Apr ’05).] What are “FPGAs”?

Frank Vahid, UC Riverside 6/45 FPGAs “101” (A Quick Intro). FPGA: Field-Programmable Gate Array. (Why that name?) A circuit is implemented by downloading bits. A memory with N address inputs (a lookup table, or “LUT”) implements N-input combinational logic. A register-controlled switch matrix (SM) connects LUTs. The FPGA fabric consists of thousands of LUTs and SMs, plus multipliers, RAM, etc. CAD tools automatically map a circuit onto the FPGA fabric. [Figure: a 4x2 memory with address inputs a1, a0 (driven by a, b) and data outputs d1, d0 acting as a LUT for functions F and G; a 2x2 switch matrix with signals x, y, a, b; an FPGA fabric of alternating SMs and LUTs.]
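The LUT idea on this slide can be illustrated with a small C sketch. It models the slide's 4x2 memory as a four-entry array addressed by the inputs a and b, so that evaluating the logic is just a memory read. The half-adder truth table (F = a AND b, G = a XOR b) and the names Lut2x2, lut_program_half_adder, and lut_read are illustrative assumptions, not taken from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input, 2-output LUT modeled as a 4x2 memory:
   four words, each holding two output bits (F in bit 1, G in bit 0). */
typedef struct {
    uint8_t word[4];
} Lut2x2;

/* "Program" the LUT by filling in the truth tables for F and G.
   F = a AND b and G = a XOR b are chosen only as an example (assumption). */
Lut2x2 lut_program_half_adder(void) {
    Lut2x2 lut;
    for (unsigned addr = 0; addr < 4; addr++) {
        unsigned a = (addr >> 1) & 1u;
        unsigned b = addr & 1u;
        unsigned f = a & b;
        unsigned g = a ^ b;
        lut.word[addr] = (uint8_t)((f << 1) | g);
    }
    return lut;
}

/* Evaluating the logic is a memory read: the inputs form the address. */
unsigned lut_read(const Lut2x2 *lut, unsigned a, unsigned b) {
    assert(a <= 1u && b <= 1u);
    return lut->word[(a << 1) | b];
}
```

Reprogramming the same structure with a different truth table changes the function it computes, which is the sense in which a circuit is "downloaded" as bits.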

Frank Vahid, UC Riverside 7/45 Circuits on FPGAs Can Execute Fast
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compilation produces a binary for the processor:
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
Processor: requires between 32 and 128 cycles. FPGA: requires <1 cycle. [Figure: circuit for bit reversal, connecting the original X value to the bit-reversed X value.]
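For reference, the bit-reversal code on this slide can be written as a complete C function. This is a sketch: the function names are hypothetical, and the masks 0x33333333 and 0x55555555 are the standard constants for this swap technique, which are not legible in the transcript. A bit-by-bit loop is included only as a cross-check.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the 32 bits of x by swapping progressively smaller groups:
   halves, bytes, nibbles, bit pairs, then single bits. */
uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per iteration (32 iterations). */
uint32_t reverse_bits32_naive(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

Both versions cost the processor a sequence of shift and mask instructions, whereas a circuit only needs to route each input bit to its mirrored output position.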

Frank Vahid, UC Riverside 8/45 Circuits on FPGAs Can Execute Fast
C code for FIR filter:
    for (i=0; i < 128; i++)
        y += c[i] * x[i];
Processor: 1000’s of instructions, several thousand cycles. FPGA circuit for FIR filter: ~7 cycles, speedup > 100x; pipelined, > 500x. Circuit parallelism/pipelining can yield big speedups. [Figure: circuit for FIR filter with a row of parallel multipliers.]
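The contrast between the sequential loop and a parallel circuit can be sketched in C. The first function is the slide's multiply-accumulate loop. The second evaluates the same sum in the order a hardware adder tree would: all 128 products first, then pairwise additions level by level. The function names, the 16-bit sample type, and the adder-tree structure are assumptions for illustration. The tree has log2(128) = 7 levels, which is one plausible reading of the slide's "~7 cycles", though the slide does not say how that figure was derived.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 128

/* Sequential form, as on the slide: one multiply-accumulate per iteration. */
int64_t fir_dot(const int16_t c[FIR_TAPS], const int16_t x[FIR_TAPS]) {
    int64_t y = 0;
    for (size_t i = 0; i < FIR_TAPS; i++) {
        y += (int64_t)c[i] * x[i];
    }
    return y;
}

/* Same result computed in adder-tree order (software model of a circuit).
   *levels receives the number of addition levels in the tree. */
int64_t fir_dot_tree(const int16_t c[FIR_TAPS], const int16_t x[FIR_TAPS],
                     int *levels) {
    int64_t partial[FIR_TAPS];
    assert(levels != NULL);

    /* In hardware, these 128 multiplications could all happen at once. */
    for (size_t i = 0; i < FIR_TAPS; i++) {
        partial[i] = (int64_t)c[i] * x[i];
    }

    /* Each pass models one level of the tree: 64, 32, ..., 1 adders. */
    *levels = 0;
    for (size_t width = FIR_TAPS / 2; width >= 1; width /= 2) {
        for (size_t i = 0; i < width; i++) {
            partial[i] += partial[i + width];
        }
        (*levels)++;
    }
    return partial[0];
}
```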

Frank Vahid, UC Riverside 9/45 Circuits on FPGAs Can Execute Fast. Large speedups on many important applications (see Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …).

Frank Vahid, UC Riverside 10/45 Circuits on FPGAs are Software. “Circuits” are often called “hardware”. From a 1958 article: “Today the “software” comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its “hardware” of tubes, transistors, wires, tapes, and the like.” “Software” does not equal “instructions”. Software is simply the “bits”. Bits may represent instructions, circuits, …

Frank Vahid, UC Riverside 11/45 Circuits on FPGAs are Software. Microprocessor binaries (instructions): bits (… 0010 …) loaded into the program memory of a processor; usually called "software". FPGA "binaries" (circuits), more commonly known as "bitstream": bits (… 0111 …) loaded into the LUTs and SMs of an FPGA; usually called "hardware".

Frank Vahid, UC Riverside 12/45 Circuits on FPGAs are Software. [Figure: article in IEEE Computer, Sep 2007.]

Frank Vahid, UC Riverside 13/45 New FPGA Compilers Make the New Software Even More Familiar. Several research compilers: DeFacto (USC), ROCCC (Najjar, UCR). Commercial products appearing in recent years: CriticalBlue. [Figure: tool flow with the labels C/C++/Java, binary, profiling, FPGA compiler, microprocessor binary, HDL, synthesis, and FPGA binary (bitstream).]

Frank Vahid, UC Riverside 14/45 The New Software – Circuits on FPGAs – May Be Worth Paying Attention To. Multi-billion dollar growing industry. Increasingly found in embedded system products: medical devices, base stations, set-top boxes, etc. Recent announcements (e.g., Intel) → FPGAs about to “take off”? History repeats itself? “…1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell’s original application was for an object that wouldn’t work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long-distance through the telegraph, and early phones had poor transmission quality and were limited in range. …”

Frank Vahid, UC Riverside 15/45 Binary Translation. Extensive binary translation in modern microprocessors for performance: JIT compilers / dynamic translation, e.g., Java JIT compilers and Transmeta Crusoe “code morphing” (x86 binary translated to a VLIW binary for a VLIW µP). Inspired by the binary translators of the early 2000s, we began the “Warp processing” project in 2002: dynamically translate a µP binary to circuits on FPGAs.

Frank Vahid, UC Riverside 16/45 Warp Processing, step 1: Initially, the software binary is loaded into instruction memory. [Figure: µP with I Mem and D$, profiler, FPGA, and on-chip CAD.]
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Frank Vahid, UC Riverside 17/45 Warp Processing, step 2: The microprocessor executes the instructions in the software binary.

Frank Vahid, UC Riverside 18/45 Warp Processing, step 3: The profiler monitors instructions and detects critical regions in the binary. [Figure: the profiler observes the add and beq instructions of the loop; critical loop detected.]
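The slide does not describe how the profiler decides that a loop is critical. One common approach, shown here purely as an assumption, is to count taken backward branches per target address and flag a loop once its count crosses a threshold. The names LoopProfiler, profiler_init, and profiler_observe_branch, the table size, and the threshold value are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */
#define HOT_THRESHOLD 8u   /* hypothetical "critical" threshold */

typedef struct {
    uint32_t target[PROFILE_SLOTS];  /* loop start addresses seen so far */
    uint32_t count[PROFILE_SLOTS];   /* taken backward branches per target */
    int used;
} LoopProfiler;

void profiler_init(LoopProfiler *p) {
    assert(p != NULL);
    p->used = 0;
}

/* Record one taken branch from pc to target.
   Returns 1 exactly when a backward branch makes its loop cross the
   threshold, and 0 otherwise. */
int profiler_observe_branch(LoopProfiler *p, uint32_t pc, uint32_t target) {
    assert(p != NULL);
    if (target >= pc) {
        return 0;  /* forward branch: not a loop back edge */
    }
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return p->count[i] == HOT_THRESHOLD;
        }
    }
    if (p->used < PROFILE_SLOTS) {  /* new loop; ignored if the table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    return 0;
}
```

In this model, feeding the profiler the back edge of the ten-iteration loop from the slides would report the loop as critical on its eighth taken branch.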

Frank Vahid, UC Riverside 19/45 Warp Processing, step 4: On-chip CAD reads in the critical region.

Frank Vahid, UC Riverside 20/45 Warp Processing, step 5: On-chip CAD (the Dynamic Partitioning Module, DPM) decompiles the critical region into a control data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al., ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07). It recovers loops, arrays, subroutines, etc., which are needed to synthesize good circuits.

Frank Vahid, UC Riverside 21/45 Warp Processing, step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

Frank Vahid, UC Riverside 22/45 Warp Processing, step 7: On-chip CAD maps the circuit onto the FPGA (CLBs and SMs). Lean place&route/FPGA → 10x faster CAD (Lysecky et al., DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06). Multi-core chips: use 1 powerful core for CAD.

Frank Vahid, UC Riverside 23/45 Warp Processing, step 8: On-chip CAD replaces instructions in the binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more.
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
>10x speedups for some apps (“warped” vs. software-only). Warp speed, Scotty.

Frank Vahid, UC Riverside 24/45 Warp Processing Challenges. Two key challenges: (1) Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (2) Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? [Figure: µP with I$ and D$, profiler, FPGA, and on-chip CAD; warp tool flow with the stages decompilation, profiling & partitioning, synthesis, JIT FPGA compilation, and binary updater, producing a microprocessor binary and a standard circuit (FPGA) binary.]

Frank Vahid, UC Riverside 25/45 Challenge: Decompilation. If we don't decompile: high-level information (e.g., loops, arrays) is lost during compilation, and direct translation of assembly to a circuit has big overhead. We need to recover high-level information. [Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone.]

Frank Vahid, UC Riverside 26/45 Decompilation. Solution: recover high-level information from the binary (branches, loops, arrays, subroutines, …) by decompilation. Adapted extensive previous work (for different purposes) and developed new methods (e.g., “reroll” loops). Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville). Numerous publications.
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
Control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
Array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
The original C code and the array-recovery result are almost identical representations.
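To make the claim of near-identical representations concrete, the sketch below places the original function next to a runnable version of the control-structure-recovery stage. It assumes a flat byte-addressed memory passed in as a parameter (the slide's `mem` is implicit), a 16-bit `short` (which the shift by 1 suggests), and an accumulator initialized to zero as in the assembly. The names f_original and f_recovered are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Original source-level function: sum ten 16-bit values. */
long f_original(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Recovered form before array recovery: memory is a flat byte array,
   reg2 is the base address, and the index is scaled by (reg3 << 1). */
long f_recovered(const unsigned char *mem, long reg2) {
    long reg4 = 0;
    assert(sizeof(short) == 2);  /* assumption: 16-bit short */
    for (long reg3 = 0; reg3 < 10; reg3++) {
        int16_t value;
        /* Models the 16-bit load "Ld reg6, 0(reg5)". */
        memcpy(&value, &mem[reg2 + (reg3 << 1)], sizeof value);
        reg4 += value;
    }
    return reg4;
}
```

Array recovery then amounts to recognizing that the scaled index over a fixed base is an array of 16-bit elements, which turns the second function back into the first.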

Frank Vahid, UC Riverside 27/45 Decompilation Results vs. C. Competitive with synthesis from C.

Frank Vahid, UC Riverside 28/45 Decompilation Results on Optimized H.264. In-depth study with Freescale. Again, competitive with synthesis from C.

Frank Vahid, UC Riverside 29/45 Decompilation is Effective Even with High Compiler-Optimization Levels. Do compiler optimizations generate binaries that are harder to decompile effectively? (Surprisingly) we found the opposite: optimized code is even better. [Figure: average speedup of 10 examples.]

Frank Vahid, UC Riverside 30/45 Warp Processing Challenges. Two key challenges: (1) Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (2) Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

Frank Vahid, UC Riverside 31/45 Challenge: JIT Compile to FPGA. Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA. E.g., our router (ROCR) is 10x faster and uses 20x less memory, at the cost of a 30% longer critical path (DAC’04). Similar results for synthesis & placement. Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona), EDAA Outstanding Dissertation Award. Numerous publications.
Tool memory and run time:
    Xilinx ISE: 60 MB, 9.1 s
    Riverside JIT FPGA tools: 3.6 MB, 0.2 s
    Riverside JIT FPGA tools on a 75 MHz ARM7: 3.6 MB, 1.4 s

Frank Vahid, UC Riverside 32/45 Warp Processing Results. Performance speedup (most frequent kernel only) vs. ARM-only execution on a 200 MHz ARM: average kernel speedup of 41. Overall application speedup average is 7.4.

Frank Vahid, UC Riverside 33/45 Recent Work: Thread Warping (CODES/ISSS, Oct ’07, Austria, Best Paper Cand.). Multi-core platforms → multi-threaded apps.
    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }
The OS schedules threads onto available µPs; remaining threads are added to a queue. The OS invokes on-chip CAD tools to create accelerators for f(). The OS then schedules threads onto accelerators (possibly dozens), in addition to µPs. Thread warping: use one core to create accelerators for waiting threads. Very large speedups possible: parallelism at bit, arithmetic, and now thread level too. [Figure: compiler producing a binary for multiple µPs with an OS, on-chip CAD, an accelerator library, and an FPGA holding f() accelerators.]

Frank Vahid, UC Riverside 34/45 Thread Warping Tools. Fairly complex framework. Uses pthread library (POSIX); mutex/semaphore for synchronization. [Figure: on-chip CAD flow with the labels thread queue, thread functions, thread counts, queue analysis, accelerator library ("Not in library?"), accelerator synthesis (decompilation, hw/sw partitioning, memory access synchronization, high-level synthesis, binary updater), netlist, thread group table, updated binary, place&route, bitfile, accelerator instantiation ("Accelerators synthesized?"), schedulable resource list, FPGA, and µP.]

Frank Vahid, UC Riverside 35/45 Memory Access Synchronization (MAS). Must deal with the widely known memory bottleneck problem: FPGAs are great, but often we can’t get data to them fast enough. Data for dozens of threads can create a bottleneck.
    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }
    void f( int a[], int val ) {
      int result;
      for (i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      ....
    }
Threaded programs exhibit a unique feature: multiple threads often access the same data (here, the same array). Solution: fetch data once and broadcast it to multiple threads (MAS). [Figure: RAM and DMA feeding accelerators a() and b() on the FPGA.]

Frank Vahid, UC Riverside 36/45 Memory Access Synchronization (MAS).
1) Identify thread groups: loops that create threads.
    for (i = 0; i < 100; i++) {
      thread_create( f, a, i );
    }
2) Identify constant memory addresses in the thread function, using def-use analysis of the parameters to the thread function.
    void f( int a[], int val ) {
      int result;
      for (i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      ....
    }
Def-use: a is constant for all threads, so the addresses of a[0-9] are constant for the thread group.
3) Synthesis creates a “combined” memory access. Execution is synchronized by the OS (enable signal from the OS).
Data is fetched once and delivered to the entire group. Before MAS: 1000 memory accesses. After MAS: 100 memory accesses. [Figure: RAM and DMA delivering A[0-9] to a group of f() accelerators.]

Frank Vahid, UC Riverside 37/45 Memory Access Synchronization (MAS). Also detects overlapping memory regions (“windows”): each thread accesses different addresses, but the addresses may overlap.
    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }
    void f( int a[], int i ) {
      int result;
      result += a[i]+a[i+1]+a[i+2]+a[i+3];
      ....
    }
Synthesis creates an extended “smart buffer” [Guo/Najjar FPGA04], which caches reused data and delivers windows to threads. Data (A[0-103]) is streamed from RAM via DMA to the smart buffer, and the buffer delivers a window (A[0-3], A[1-4], …, A[6-9], …) to each thread. W/O smart buffer: 400 memory accesses. With smart buffer: 104 memory accesses.

Frank Vahid, UC Riverside 38/45 Speedups from Thread Warping. Chose benchmarks with extensive parallelism. Compared to a 4-ARM device: average 130x speedup. But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size = ~36 ARM11s): 11x faster than a 64-core system. Simulation is pessimistic; actual results are likely better.

Frank Vahid, UC Riverside 39/45 Warp Scenarios. Warping takes time – when is it useful? Long-running applications: scientific computing, etc.; single-execution speedup. Recurring applications (save FPGA configurations): common in embedded systems; might view the first execution as a (long) boot phase. [Figure: timelines for long-running and recurring applications, showing µP execution, on-chip CAD time, and the speedup once the FPGA is used.]

Frank Vahid, UC Riverside 40/45 Why Dynamic? Static is good, but hiding the FPGA opens the technique to all sw platforms: standard languages/tools/binaries. Static compiling to FPGAs: specialized language, specialized compiler, producing a binary and a netlist for the µP and FPGA. Dynamic compiling to FPGAs: any language, any compiler, producing a binary; on-chip CAD targets the FPGA. Can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, … Can add FPGA without changing binaries, like expanding memory or adding processors to a multiprocessor. Custom interconnections, tuned processors, …

Frank Vahid, UC Riverside 41/45 Dynamic Enables Expandable Logic Concept. Expandable RAM: the system detects RAM during start and improves performance invisibly. Expandable logic: warp tools detect the amount of FPGA and invisibly adapt the application to use less/more hardware. [Figure: µP with cache, profiler, warp tools, DMA, expandable RAM, and expandable FPGA logic.]

Frank Vahid, UC Riverside 42/45 Dynamic Enables Expandable Logic. Large speedups: 14x to 400x (on scientific apps). Different apps require different amounts of FPGA. Expandable logic allows customization of a single platform: the user selects the required amount of FPGA, with no need to recompile/synthesize.

Frank Vahid, UC Riverside 43/45 Dynamic Enables Custom Communication. NoC (Network on a Chip) provides communication between multiple cores. Problem: the best topology is application dependent. [Figure: App1 and App2 on bus and mesh topologies.]

Frank Vahid, UC Riverside 44/45 Dynamic Enables Custom Communication. NoC (Network on a Chip) provides communication between multiple cores. Problem: the best topology is application dependent. Warp processing can dynamically choose the topology. [Figure: App1 and App2 on bus and mesh topologies connecting µPs and FPGA.]

Frank Vahid, UC Riverside 45/45 Summary. Software is no longer just "instructions": the sw elephant has a (new) tail, FPGA circuits, alongside microprocessor instructions. Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …). Patent granted Oct 2007, licensed by Intel, IBM, Freescale (via SRC). Extensive future work…