Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville Scotty Sirowy (current) David Sheldon (current) ______???__________ This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation, Intel, Freescale, and IBM Frank Vahid Dept. of CS&E University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine
Frank Vahid, UC Riverside 2/45 Self-Improving Cars?
Frank Vahid, UC Riverside 3/45 Self-Improving Chips? Moore’s Law 2x capacity growth / 18 months
Frank Vahid, UC Riverside 4/45 Extra Capacity Multicore “Heterogeneous Multicore” – Kumar/Tullsen
Frank Vahid, UC Riverside 5/45 Extra Capacity FPGAs Xilinx Virtex II Pro. Source: XilinxAltera Excalibur. Source: Altera Cray XD1. Source: FPGA journal, Apr’05 Xilinx, Altera, … Cray, SGI Mitrionics AMD Opteron Intel QuickAssist IBM Cell (research) What are “FPGAs”??
Frank Vahid, UC Riverside 6/45 FPGAs “101” (A Quick Intro) FPGA -- Field-Programmable Gate Array Implement circuit by downloading bits N-address memory (“LUT”) implements N-input combinational logic Register-controlled switch matrix (SM) connects LUTs FPGA fabric Thousands of LUTs and SMs, plus multipliers, RAM, etc. CAD tools automatically map circuit onto FPGA fabric (Why that name?) ab a1a0a1a0 4x2 Memory abab d 1 d 0 F G LUT FG 2x2 switch matrix x y a b FPGA SM LUT SM LUT
Frank Vahid, UC Riverside 7/45 Circuits on FPGAs Can Execute Fast x = (x >>16) | (x <<16); x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00); x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0); x = ((x >> 2) & 0x ) | ((x << 2) & 0xcccccccc); x = ((x >> 1) & 0x ) | ((x << 1) & 0xaaaaaaaa); C Code for Bit Reversal sll $v1[3],$v0[2],0x10 srl $v0[2],$v0[2],0x10 or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x8 and $v1[3],$v1[3],$t5[13] sll $v0[2],$v0[2],0x8 and $v0[2],$v0[2],$t4[12] or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x4 and $v1[3],$v1[3],$t3[11] sll $v0[2],$v0[2],0x4 and $v0[2],$v0[2],$t2[10]... Binary Compilation Processor Requires between 32 and 128 cycles Circuit for Bit Reversal Bit Reversed X Value Original X Value Processor FPGA Requires <1 cycle
Frank Vahid, UC Riverside 8/45 for (i=0; i < 128; i++) y[i] += c[i] * x[i].. Circuits on FPGAs Can Execute Fast for (i=0; i < 128; i++) y += c[i] * x[i].. ************ C Code for FIR Filter Processor 1000’s of instructions Several thousand cycles Circuit for FIR Filter Processor FPGA ~ 7 cycles Speedup > 100x Pipelined -- >500x Circuit parallelism/pipelining can yield big speedups
Frank Vahid, UC Riverside 9/45 Circuits on FPGAs Can Execute Fast Large speedups on many important applications Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …
Frank Vahid, UC Riverside 10/45 Circuits on FPGAs are Software “Circuits” often called “hardware” Previously same 1958 article – “Today the “software” comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its “hardware” of tubes, transistors, wires, tapes, and the like.” “Software” does not equal “instructions” Software is simply the “bits” Bits may represents instructions, circuits, …
Frank Vahid, UC Riverside 11/45 Circuits on FPGAs are Software Processor … … 0010 … Bits loaded into program memory Microprocessor Binaries (Instructions) … Bits loaded into LUTs and SMs FPGA "Binaries“ (Circuits) Processor FPGA 0111 … More commonly known as "bitstream" "Software" "Hardware"
Frank Vahid, UC Riverside 12/45 Circuits on FPGAs are Software Sep 2007 IEEE Computer
Frank Vahid, UC Riverside 13/45 New FPGA Compilers Make the New Software Even More Familiar Several research compilers DeFacto (USC) ROCCC (Najjar, UCR) Commercial products appearing in recent years CriticalBlue Binary C, C++, Java Profiling FPGA Compiler Binary Micro- processor FPGA Binary HDL Binary Bitstream Synthesis
Frank Vahid, UC Riverside 14/45 The New Software – Circuits on FPGAs – May Be Worth Paying Attention To Multi-billion dollar growing industry Increasingly found in embedded system products – medical devices, base stations, set-top boxes, etc. Recent announcements (e.g, Intel) FPGAs about to “take off”?? …1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell’s original application was for an object that wouldn’t work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long- distance through the telegraph, and early phones had poor transmission quality and were limited in range. … History repeats itself?
Frank Vahid, UC Riverside 15/45 Binary Translation VLIW µP JIT Compilers / Dynamic Translation Extensive binary translation in modern microprocessors x86 Binary VLIW Binary FPGA µP Binary Inspired by binary translators of early 2000s, began “Warp processing” project in 2002 – dynamically translate binary to circuits on FPGAs Performance e.g., Java JIT compilers; Transmeta Crusoe “code morphing” JIT Compiler / Binary “Translation”
Frank Vahid, UC Riverside 16/45 µP FPGA On-chip CAD Warp Processing Profiler Initially, software binary loaded into instruction memory 1 I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary
Frank Vahid, UC Riverside 17/45 µP FPGA On-chip CAD Warp Processing Profiler I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Microprocessor executes instructions in software binary 2 µP
Frank Vahid, UC Riverside 18/45 µP FPGA On-chip CAD Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Profiler monitors instructions and detects critical regions in binary 3 Profiler add beq Critical Loop Detected
Frank Vahid, UC Riverside 19/45 µP FPGA On-chip CAD Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD reads in critical region 4 Profiler On-chip CAD
Frank Vahid, UC Riverside 20/45 µP FPGA Dynamic Part. Module (DPM) Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD decompiles critical region into control data flow graph (CDFG) 5 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Decompilation surprisingly effective at recovering high-level program structures Stitt et al ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07 Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits
Frank Vahid, UC Riverside 21/45 µP FPGA Dynamic Part. Module (DPM) Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit 6 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 :=
Frank Vahid, UC Riverside 22/45 µP FPGA Dynamic Part. Module (DPM) Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD maps circuit onto FPGA 7 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := CLB SM ++ FPGA Lean place&route/FPGA 10x faster CAD (Lysecky et al DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06) Multi-core chips – use 1 powerful core for CAD
Frank Vahid, UC Riverside 23/45 µP FPGA Dynamic Part. Module (DPM) Warp Processing Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary8 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := CLB SM ++ FPGA On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more Mov reg3, 0 Mov reg4, 0 loop: // instructions that interact with FPGA Ret reg4 FPGA Software-only “Warped” >10x speedups for some apps Warp speed, Scotty
Frank Vahid, UC Riverside 24/45 Warp Processing Challenges Two key challenges Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? µPµP I$ D$ FPGA Profiler On-chip CAD Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr Binary Std. Ckt. Binary JIT FPGA compilation
Frank Vahid, UC Riverside 25/45 Challenge: Decompilation If we don't decompile High-level information (e.g., loops, arrays) lost during compilation Direct translation of assembly to circuit – big overhead Need to recover high-level information Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone
Frank Vahid, UC Riverside 26/45 Decompilation Solution – Recover high-level information from binary (branches, loops, arrays, subroutines, …): Decompilation Adapted extensive previous work (for different purposes) Developed new methods (e.g., “reroll” loops) Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville) Numerous publications: Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } loop: reg1 := reg3 << 1 reg5 := reg2 + reg1 reg6 := mem[reg5 + 0] reg4 := reg4 + reg6 reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Control/Data Flow Graph Creation Original C Code Corresponding Assembly loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Data Flow Analysis long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; } Function Recovery long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; } Control Structure Recovery long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; } Array Recovery Almost Identical Representations Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation
Frank Vahid, UC Riverside 27/45 Decompilation Results vs. C Competivive with synthesis from C
Frank Vahid, UC Riverside 28/45 Decompilation Results on Optimized H.264 In-depth Study with Freescale Again, competitive with synthesis from C
Frank Vahid, UC Riverside 29/45 Decompilation is Effective Even with High Compiler-Optimization Levels Average Speedup of 10 Examples Do compiler optimizations generate binaries harder to effectively decompile? (Surprisingly) found opposite – optimized code even better
Frank Vahid, UC Riverside 30/45 Warp Processing Challenges Two key challenges Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? µPµP I$ D$ FPGA Profiler On-chip CAD Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr Binary Std. HW Binary JIT FPGA compilation
Frank Vahid, UC Riverside 31/45 Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA e.g., Our router (ROCR) 10x faster and 20x less memory, at cost of 30% longer critical path. Similar results for synth & placement Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona) Numerous publications: -- EDAA Outstanding Dissertation Awardhttp:// Challenge: JIT Compile to FPGA DAC’04 Binary Decompilation Binary FPGA binary Synthesis Profiling & partitioning Binary Updater Binary Micropr. Binary Std. HW Binary JIT FPGA compilation 60 MB 9.1 s Xilinx ISE 3.6MB 1.4s Riverside JIT FPGA tools on a 75MHz ARM7 3.6MB 0.2 s Riverside JIT FPGA tools
Frank Vahid, UC Riverside 32/45 Warp Processing Results Performance Speedup (Most Frequent Kernel Only) Average kernel speedup of 41 ARM-Only Execution Overall application speedup average is 7.4 Vs. 200 MHz ARM
Frank Vahid, UC Riverside 33/45 µP Recent Work: Thread Warping (CODES/ISSS Oct 07 Austria, Best Paper Cand.) FPGA µP OS µP f() Compiler Binary for (i = 0; i < 10; i++) { thread_create( f, i ); } f() µP On-chip CAD Acc. Lib f() OS schedules threads onto available µPs Remaining threads added to queue OS invokes on-chip CAD tools to create accelerators for f() OS schedules threads onto accelerators (possibly dozens), in addition to µPs Thread warping: use one core to create accelerator for waiting threads Very large speedups possible – parallelism at bit, arithmetic, and now thread level too Performance Multi-core platforms multi- threaded apps
Frank Vahid, UC Riverside 34/45 Decompilation Memory Access Synchronization High-level Synthesis Thread Functions Netlist Binary Updater Updated Binary Hw/Sw Partitioning Hw Sw Thread Group Table Thread Warping Tools Fairly complex framework Uses pthread library (POSIX) Mutex/semaphore for synchronization Accelerator Instantiation Thread Queue Thread Functions Thread Counts Accelerator Synthesis Accelerator Library FPGA Not In Library? Done Accelerators Synthesized? Queue Analysis false true Updated Binary Schedulable Resource List Place&Route Thread Group Table Netlist Bitfile On-chip CAD FPGA µPµP Accelerator Synthesis Memory Access Synchronization
Frank Vahid, UC Riverside 35/45 Must deal with widely known memory bottleneck problem FPGAs great, but often can’t get data to them fast enough void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; }.... } Memory Access Synchronization (MAS) Same array FPGA b()a() RAM Data for dozens of threads can create bottleneck for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); } DMA Threaded programs exhibit unique feature: Multiple threads often access same data Solution: Fetch data once, broadcast to multiple threads (MAS) ….
Frank Vahid, UC Riverside 36/45 Memory Access Synchronization (MAS) 1) Identify thread groups – loops that create threads for (i = 0; i < 100; i++) { thread_create( f, a, i ); } void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; }.... } Thread Group Def-Use: a is constant for all threads Addresses of a[0-9] are constant for thread group f() ……………… f() DMA RAM A[0-9] Before MAS: 1000 memory accesses After MAS: 100 memory accesses Data fetched once, delivered to entire group 2) Identify constant memory addresses in thread function Def-use analysis of parameters to thread function 3) Synthesis creates a “combined” memory access Execution synchronized by OS enable (from OS)
Frank Vahid, UC Riverside 37/45 Memory Access Synchronization (MAS) Also detects overlapping memory regions – “windows” void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3];.... } for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); } a[0]a[1]a[2]a[3]a[4]a[5] ……… f() ……………… f() DMA RAM A[0-103] A[0-3] A[1-4] A[6-9] Data streamed to “smart buffer” Smart Buffer Buffer delivers window to each thread W/O smart buffer: 400 memory accesses With smart buffer: 104 memory accesses Synthesis creates extended “smart buffer” [Guo/Najjar FPGA04] Caches reused data, delivers windows to threads Each thread accesses different addresses – but addresses may overlap enable
Frank Vahid, UC Riverside 38/45 Speedups from Thread Warping Chose benchmarks with extensive parallelism Compared to 4-ARM device Average 130x speedup 11x faster than 64-core system Simulation pessimistic, actual results likely better But, FPGA uses additional area So we also compare to systems with 8 to 64 ARM11 uPs – FPGA size = ~36 ARM11s
Frank Vahid, UC Riverside 39/45 Warp Scenarios µP Time µP (1 st execution) Time On-chip CAD µP FPGA Speedup Long Running Applications Recurring Applications Long-running applications Scientific computing, etc. Recurring applications (save FPGA configurations) Common in embedded systems Might view as (long) boot phase On-chip CAD Single-execution speedup FPGA Warping takes time – when useful?
Frank Vahid, UC Riverside 40/45 Why Dynamic? Static good, but hiding FPGA opens technique to all sw platforms Standard languages/tools/binaries On-chip CAD FPGA µPµP Any Compiler FPGA µPµP Specialized Compiler Binary Netlist Binary Specialized Language Any Language Static Compiling to FPGAs Dynamic Compiling to FPGAs Can adapt to changing workloads Smaller & more accelerators, fewer & large accelerators, … Can add FPGA without changing binaries – like expanding memory, or adding processors to multiprocessor Custom interconnections, tuned processors, …
Frank Vahid, UC Riverside 41/45 µP Cache Dynamic Enables Expandable Logic Concept RAM Expandable RAM uP Performance Profiler µP Cache Warp Tools DMA FPGA RAM Expandable RAM – System detects RAM during start, improves performance invisibly Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware. Expandable Logic
Frank Vahid, UC Riverside 42/45 Dynamic Enables Expandable Logic Large speedups – 14x to 400x (on scientific apps) Different apps require different amounts of FPGA Expandable logic allows customization of single platform User selects required amount of FPGA No need to recompile/synthesize
Frank Vahid, UC Riverside 43/45 Dynamic enables Custom Communication µP NoC – Network on a Chip provides communication between multiple cores Problem: Best topology is application dependent Bus Mesh Bus Mesh App1 App2
Frank Vahid, UC Riverside 44/45 Dynamic enables Custom Communication FPGA NoC – Network on a Chip provides communication between multiple cores Problem: Best topology is application dependent Bus Mesh Bus Mesh App1 App2 µP Warp processing can dynamically choose topology FPGA µP FPGA µP
Frank Vahid, UC Riverside 45/45 Software is no longer just "instructions" The sw elephant has a (new) tail – FPGA circuits Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …) Patent granted Oct 2007, licensed by Intel, IBM, Freescale (via SRC) Extensive future work… Microprocessor instructions FPGA circuits Summary