
1 Self-Improving Computer Chips – Warp Processing
Frank Vahid, Dept. of CS&E, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. Students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx

2 Frank Vahid, UC Riverside 2/52 Self-Improving Cars?

3 Self-Improving Chips? Moore's Law: 2x capacity growth every 18 months

4 Extra Capacity → Multicore. "Heterogeneous Multicore" – Kumar/Tullsen

5 FPGA Coprocessing Entering Mainstream
Xilinx Virtex II Pro (source: Xilinx). SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs). Xilinx Virtex V (source: Xilinx). AMD Opteron socket plug-ins.
Players: Xilinx, Altera, …; Cray, SGI; Mitrionics; AMD Opteron; Intel QuickAssist; IBM Cell (research)

6 FPGAs "101" (A Quick Intro)
FPGA -- Field-Programmable Gate Array: implement a circuit by downloading bits.
A 2^N-word memory ("LUT") implements any N-input combinational logic function. Register-controlled switch matrices (SMs) connect LUTs.
FPGA fabric: thousands of LUTs and SMs, plus multipliers, RAM, etc. CAD tools automatically map a circuit onto the FPGA fabric. (Hence the name.)
[Figure: a 4x2 memory acting as a two-input LUT computing outputs F and G; a 2x2 switch matrix connecting wires x and y; an FPGA fabric tiling LUTs and SMs.]
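The LUT idea above can be sketched in a few lines of C: a LUT is nothing more than a small memory whose address lines are the logic inputs. The truth-table contents below (AND and XOR) are illustrative choices, not the ones in the slide's figure.

```c
#include <assert.h>

/* A K-input LUT is a 2^K-entry memory: the inputs form the address, the
   stored bits are the outputs. Here a 4-entry x 2-bit memory implements
   two functions of inputs (a,b): bit 1 = F = a AND b, bit 0 = G = a XOR b.
   (Illustrative sketch; real FPGA LUTs typically have 4-6 inputs.) */
static const unsigned char lut[4] = {
    /* addr 00 */ 0x0,   /* F=0, G=0 */
    /* addr 01 */ 0x1,   /* F=0, G=1 */
    /* addr 10 */ 0x1,   /* F=0, G=1 */
    /* addr 11 */ 0x2,   /* F=1, G=0 */
};

/* "Evaluate" the logic by reading the memory at address (a,b). */
unsigned lut_read(unsigned a, unsigned b) {
    return lut[(a << 1) | b];
}
```

Downloading different bits into `lut` changes the implemented function without changing any wiring, which is exactly why a bitstream can reprogram an FPGA.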

7 Circuits on FPGAs Can Execute Fast
C Code for Bit Reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled for a processor, this becomes a long sll/srl/and/or instruction sequence (sll $v1, $v0, 0x10; srl $v0, $v0, 0x10; or $v0, $v1, $v0; …) requiring between 32 and 128 cycles. The circuit for bit reversal is just wiring from each bit of the original X value to its mirrored position in the bit-reversed X value; on an FPGA it requires <1 cycle.
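The slide's five statements, wrapped into a runnable function (function name and test values are mine; the body is the slide's code):

```c
#include <assert.h>
#include <stdint.h>

/* Bit reversal from the slide: each statement swaps progressively
   smaller groups of bits (16, then 8, 4, 2, 1). */
uint32_t reverse_bits(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

Every one of these shifts and masks costs instructions on a processor; in a circuit, bit i simply connects to wire 31-i.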

8 Circuits on FPGAs Can Execute Fast
C Code for FIR Filter:
for (i=0; i < 128; i++)
  y += c[i] * x[i];
Processor: thousands of instructions, several thousand cycles. Circuit for the FIR filter (parallel multipliers feeding an adder tree): ~7 cycles on an FPGA, speedup > 100x; pipelined, >500x. Circuit parallelism/pipelining can yield big speedups.
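The slide's loop as a self-contained function. On a processor the multiply-accumulates run serially; on the FPGA all 128 multiplies fire at once and the products reduce through an adder tree of depth log2(128) = 7, which is where the ~7-cycle figure comes from. The demo values are my own.

```c
#include <assert.h>

#define FIR_N 128

/* Serial FIR sum, as a processor executes it: one multiply-accumulate
   per iteration. An FPGA evaluates the same dataflow in parallel. */
long fir(const int c[FIR_N], const int x[FIR_N]) {
    long y = 0;
    for (int i = 0; i < FIR_N; i++)
        y += (long)c[i] * x[i];
    return y;
}

/* Tiny self-check: c = all 1s, x = 0..127 -> 127*128/2 = 8128. */
long fir_demo(void) {
    int c[FIR_N], x[FIR_N];
    for (int i = 0; i < FIR_N; i++) { c[i] = 1; x[i] = i; }
    return fir(c, x);
}
```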

9 Circuits on FPGAs Can Execute Fast Large speedups on many important applications Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …

10 Background: System Synthesis, Hardware/Software Partitioning
SpecSyn – 1989-1994 (Gajski et al., UC Irvine): synthesize executable specifications like VHDL or SpecCharts (now SpecC) to microprocessors and custom ASIC circuits. FPGAs were just invented and had very little capacity.
~2000: Dynamic Software Optimization/Translation. Binary translation (x86 binary → VLIW binary on a VLIW µP) for performance; e.g., HP's Dynamo, Java JIT compilers, Transmeta Crusoe "code morphing."

11 Circuits on FPGAs are Software
Microprocessor binaries (instructions): bits loaded into program memory – "software."
FPGA "binaries" (circuits): bits loaded into LUTs and SMs, more commonly known as a "bitstream" – "hardware."

12 Circuits on FPGAs are Software
"Circuits" are often called "hardware." But from the same 1958 article: "Today the 'software' comprising the carefully planned interpretive routines, compilers, and other aspects of automatic programming are at least as important to the modern electronic calculator as its 'hardware' of tubes, transistors, wires, tapes, and the like."
"Software" does not equal "instructions." Software is simply the "bits" – and bits may represent instructions, circuits, …

13 Circuits on FPGAs are Software Sep 2007 IEEE Computer

14 New FPGA Compilers Make the New Software Even More Familiar
Several research compilers: DeFacto (USC), ROCCC (Najjar, UCR). Commercial products appearing in recent years: CriticalBlue.
Flow: C, C++, Java → profiling → FPGA compiler → microprocessor binary + HDL → synthesis → bitstream.

15 The New Software – Circuits on FPGAs – May Be Worth Paying Attention To
Multi-billion dollar growing industry. Increasingly found in embedded system products – medical devices, base stations, set-top boxes, etc. Recent announcements (e.g., Intel) – FPGAs about to "take off"??
History repeats itself? "…1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell's original application was for an object that wouldn't work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long-distance through the telegraph, and early phones had poor transmission quality and were limited in range. …" (http://www.telcomhistory.org/)

16 JIT Compilers / Dynamic Translation
Extensive binary translation in modern microprocessors (x86 binary → VLIW binary on a VLIW µP) for performance; e.g., Java JIT compilers, Transmeta Crusoe "code morphing."
Inspired by the binary translators of the early 2000s, we began the "Warp processing" project in 2002 – dynamically translate a binary to circuits on FPGAs (µP binary → FPGA).

17 Warp Processing. Step 1: Initially, the software binary is loaded into instruction memory.
Software Binary:
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
(Architecture: µP with I-Mem and D$, FPGA, profiler, on-chip CAD.)

18 Warp Processing. Step 2: The microprocessor executes the instructions in the software binary (same binary as above).

19 Warp Processing. Step 3: The profiler monitors instructions and detects critical regions in the binary (critical loop detected at the add/beq backward branch).

20 Warp Processing. Step 4: The on-chip CAD reads in the critical region.

21 Warp Processing. Step 5: The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG):
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Decompilation is surprisingly effective at recovering high-level program structures – loops, arrays, subroutines, etc. – needed to synthesize good circuits. Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07.

22 Warp Processing. Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

23 Warp Processing. Step 7: The on-chip CAD maps the circuit onto the FPGA. Lean place & route on a CAD-oriented FPGA → 10x faster CAD (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06). On multi-core chips, use one powerful core for CAD.

24 Warp Processing. Step 8: The on-chip CAD replaces instructions in the binary to use the hardware – the loop body becomes instructions that interact with the FPGA – causing performance and energy to "warp" by an order of magnitude or more. >10x speedups for some apps vs. software-only. Warp speed, Scotty.

25 Warp Processing Challenges
Two key challenges: Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
(Tool flow: binary → decompilation → profiling & partitioning → synthesis → JIT FPGA compilation → binary updater → updated microprocessor binary + FPGA binary.)

26 Challenge: Decompilation
If we don't decompile, the high-level information (e.g., loops, arrays) lost during compilation stays lost, and direct translation of assembly to a circuit carries big overhead. We need to recover the high-level information.
[Chart: overhead of a microprocessor/FPGA solution WITHOUT decompilation, vs. the microprocessor alone.]

27 Decompilation
Solution – recover high-level information from the binary (branches, loops, arrays, subroutines, …): decompilation. Adapted extensive previous work (done for different purposes) and developed new methods (e.g., "reroll" loops). Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville); numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Original C code:
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) { accum += a[i]; }
  return accum;
}
Corresponding assembly: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4.
Control/data flow graph creation, then data flow analysis, yields:
loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4
Function recovery, control structure recovery, then array recovery yield:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; }
  return reg4;
}
Almost identical representations.
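The slide's claim is that the recovered function is behaviorally identical to the original. A quick check of that claim, using the slide's two functions side by side (I initialize `accum` to 0, which the slide's source elides, and the test data is my own):

```c
#include <assert.h>

/* Original source, from the slide (accum initialized here). */
long f_original(short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* The function as recovered by decompilation, register names kept. */
long f_recovered(short array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++)
        reg4 += array[reg3];
    return reg4;
}

/* 0 iff the two functions agree on this input. */
long decomp_check(void) {
    short a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    return f_original(a) - f_recovered(a);
}
```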

28 Decompilation Results vs. C: competitive with synthesis from C.

29 Decompilation Results on Optimized H.264: in-depth study with Freescale. Again, competitive with synthesis from C.

30 Decompilation is Effective Even with High Compiler-Optimization Levels
Do compiler optimizations generate binaries that are harder to decompile effectively? (Surprisingly) we found the opposite – optimized code did even better. [Chart: average speedup of 10 examples.]

31 Warp Processing Challenges (revisited)
Two key challenges: decompiling binaries to recover enough high-level constructs for fast FPGA circuits, and JIT compiling to FPGAs using limited on-chip compute resources.

32 Challenge: JIT Compile to FPGA
Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA. e.g., our router (ROCR) is 10x faster and uses 20x less memory, at the cost of a 30% longer critical path; similar results for synthesis & placement (DAC'04). Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona) – EDAA Outstanding Dissertation Award. Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Tool comparison: Xilinx ISE – 60 MB, 9.1 s; Riverside JIT FPGA tools on a 75 MHz ARM7 – 3.6 MB, 1.4 s; Riverside JIT FPGA tools – 3.6 MB, 0.2 s.

33 Warp Processing Results
Performance speedup (most frequent kernel only): average kernel speedup of 41x vs. a 200 MHz ARM (ARM-only execution). Overall application speedup averages 7.4x.

34 Recent Work: Thread Warping (CODES/ISSS Oct 07, Austria, Best Paper Nom.)
Multi-core platforms → multi-threaded apps. Example: for (i = 0; i < 10; i++) { thread_create( f, i ); }
The OS schedules threads onto available µPs; remaining threads are added to a queue. The OS invokes the on-chip CAD tools to create accelerators for f() (using an accelerator library), then schedules threads onto those accelerators (possibly dozens), in addition to the µPs.
Thread warping: use one core to create accelerators for waiting threads. Very large speedups possible – parallelism at the bit, arithmetic, and now thread level too.

35 Thread Warping Tools
Developed framework; uses the pthread library (POSIX), with mutexes/semaphores for synchronization.
On-chip CAD flow: queue analysis (thread queue, thread functions, thread counts) → accelerator instantiation against the accelerator library; if accelerators are not in the library, accelerator synthesis (decompilation, memory access synchronization, high-level synthesis → netlist; hw/sw partitioning; binary updater → updated binary, thread group table) → place & route with the schedulable resource list → bitfile → FPGA; done when the accelerators are synthesized.

36 Memory Access Synchronization (MAS)
Must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough. Example:
for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); }
void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } … }
DMA traffic for dozens of threads reading the same array from RAM can create a bottleneck. But threaded programs exhibit a unique feature: multiple threads often access the same data. Solution: fetch the data once and broadcast it to multiple threads (MAS).

37 Memory Access Synchronization (MAS)
1) Identify thread groups – loops that create threads:
for (i = 0; i < 100; i++) { thread_create( f, a, i ); }
2) Identify constant memory addresses in the thread function, via def-use analysis of its parameters: here a is constant for all threads, so the addresses of a[0-9] are constant for the thread group.
3) Synthesis creates a "combined" memory access: data is fetched once (DMA from RAM) and delivered to the entire group, with execution synchronized by an OS enable signal.
Before MAS: 1000 memory accesses. After MAS: 100 memory accesses.
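The traffic reduction can be sketched by counting reads in the slide's scenario: 100 threads each reading a[0..9]. This is a simplified model (each distinct word fetched exactly once and broadcast); the slide's "after" figure of 100 reflects details of the real system's scheduling that aren't modeled here.

```c
#include <assert.h>

enum { MAS_THREADS = 100, MAS_WORDS = 10 };

/* Without MAS: every thread issues its own reads of a[0..9]. */
int accesses_without_mas(void) {
    return MAS_THREADS * MAS_WORDS;           /* 100 * 10 = 1000 */
}

/* With MAS (idealized): each word is fetched from RAM once, then
   broadcast to the whole thread group. */
int accesses_with_mas(void) {
    int fetched[MAS_WORDS] = {0};
    int count = 0;
    for (int t = 0; t < MAS_THREADS; t++)
        for (int w = 0; w < MAS_WORDS; w++)
            if (!fetched[w]) {                /* fetch once, reuse after */
                fetched[w] = 1;
                count++;
            }
    return count;
}
```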

38 Memory Access Synchronization (MAS)
MAS also detects overlapping memory regions – "windows." Example: thread i computes a[i]+a[i+1]+a[i+2]+a[i+3], so each thread accesses different addresses, but the addresses overlap:
void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; … }
for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); }
Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04]: data is streamed (A[0-103]) to the buffer, which caches reused data and delivers each window (A[0-3], A[1-4], …) to its thread.
Without the smart buffer: 400 memory accesses. With the smart buffer: 104 memory accesses.
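The window overlap can likewise be counted. In this idealized model each word is fetched once and reused by every window that overlaps it; the 100 length-4 windows touch a[0..102], i.e. 103 distinct words (the slide streams A[0-103], hence its 104).

```c
#include <assert.h>

enum { WIN_THREADS = 100, WIN = 4 };

/* Without the smart buffer: each thread re-reads its 4-word window. */
int window_accesses_plain(void) {
    return WIN_THREADS * WIN;                     /* 100 * 4 = 400 */
}

/* With the smart buffer (idealized): the array is streamed, so each
   word crosses the memory interface once and is reused for every
   overlapping window. */
int window_accesses_buffered(void) {
    int fetched[WIN_THREADS + WIN - 1] = {0};     /* words a[0..102] */
    int count = 0;
    for (int t = 0; t < WIN_THREADS; t++)
        for (int k = 0; k < WIN; k++)
            if (!fetched[t + k]) {
                fetched[t + k] = 1;
                count++;
            }
    return count;
}
```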

39 Speedups from Thread Warping
Chose benchmarks with extensive parallelism. Compared to a 4-ARM device: average 130x speedup. But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size = ~36 ARM11s): 11x faster than the 64-core system. Simulation is pessimistic; actual results are likely better.

40 Warp Scenarios
Warping takes time – when is it useful? Long-running applications (scientific computing, etc.): the on-chip CAD time is amortized within a single execution. Recurring applications (save the FPGA configurations), common in embedded systems: the first execution pays the CAD cost – might view it as a (long) boot phase – and subsequent runs get the FPGA speedup.

41 Why Dynamic?
Static compiling to FPGAs is good, but hiding the FPGA opens the technique to all software platforms: standard languages/tools/binaries, any language and any compiler – versus a specialized language and specialized compiler producing a binary plus netlist. Dynamic compiling can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, … Can add FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor. Custom interconnections, tuned processors, …

42 Dynamic Enables Expandable Logic
Concept: like expandable RAM – the system detects RAM at startup and improves performance invisibly – expandable logic lets the warp tools detect the amount of FPGA present and invisibly adapt the application to use less or more hardware.

43 Dynamic Enables Expandable Logic
Large speedups – 14x to 400x (on scientific apps). Different apps require different amounts of FPGA; expandable logic allows customization of a single platform: the user selects the required amount of FPGA, with no need to recompile/synthesize. Recent (Spring 2008) results vs. a 3.2 GHz Intel Xeon: 2x-8x speedups. Platform: Nallatech H101-PCIXM FPGA accelerator board with a Virtex IV LX100 FPGA; FPGA I/O memories are 8 MB SRAMs; the board connects to the host processor over a PCI-X bus.

44 Ongoing Work: Dynamic Coprocessor Management (CASES'08)
Multiple possible applications a1, a2, …, each with a pre-designed FPGA coprocessor c1, c2, …; optional use of the coprocessor provides speedup (app runtime on the CPU alone vs. reconfiguration time plus runtime with the coprocessor). The size of the FPGA is limited – how to manage the coprocessors? Loading c2 would require removing c1 or c3. Is it worthwhile? That depends on the pattern of future instances of a1, a2, a3, so we must make an "online" decision per app instance.

45 The Ski-Rental Problem
Greedy (always load) doesn't consider past apps, which may predict the future. A solution idea comes from the "ski rental problem," a popular online-algorithms setting: you decide to take up skiing – should you rent skis each trip, or buy? The popular online solution: rent until the cumulative rental cost equals the cost of buying, then buy. This guarantees you never pay more than 2x the cost of buying.
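The break-even rule can be sketched directly (rental cost 1 per trip and the function name are my choices): rent while total rental spending is below the purchase price, buy at the break-even point. However many trips actually happen, total cost never exceeds 2x the optimal offline cost.

```c
#include <assert.h>

/* Online ski-rental: rent (cost 1/trip) until rentals would reach the
   buy price, then buy. Returns total money spent for a season of
   `trips` ski trips, with purchase price `buy_price`. */
int ski_rental_cost(int buy_price, int trips) {
    int spent = 0;
    for (int trip = 1; trip <= trips; trip++) {
        if (spent + 1 == buy_price) {  /* break-even reached: buy */
            spent += buy_price;        /* total = 2*buy_price - 1 */
            break;
        }
        spent += 1;                    /* rent one more trip */
    }
    return spent;
}
```

With buy price 5: skiing 3 times costs 3 (same as optimal); skiing forever costs 4 rentals + 5 = 9, under the 2x bound of 10.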

46 Cumulative Benefit Heuristic
Maintain a cumulative time benefit for each coprocessor. Benefit of coprocessor i per instance: tpi - tci (runtime on the processor minus runtime with the coprocessor). Update: cbenefit(i) = cbenefit(i) + (tpi - tci) – the time coprocessor i would have saved up to this point had it always been used for app i. Only consider loading coprocessor i if cbenefit(i) > loading_time(i); this resists loading coprocessors that run infrequently or give little speedup.
Example (loading time 200 for all coprocessors), queue = a1 a2 a3 a1 a2 a3 a1:
tpi: a1=200, a2=100, a3=50; tci: a1=10, a2=20, a3=25; benefit tpi-tci: 190, 80, 25.
Cumulative benefit: c1: 190, 380, 570; c2: 80, 160; c3: 25, 50.
Loads: first a1: 190 !> 200, no load; second a1: 380 > 200, load c1; a3: 25 !> 200, no load; later a1's: already loaded.
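The bookkeeping above is tiny in code. This sketch replays the slide's queue with the slide's tp/tc values and loading time of 200 (the replay helpers are mine):

```c
#include <assert.h>

enum { APPS = 3, LOAD_TIME = 200 };
static const int tp[APPS] = {200, 100, 50};  /* runtime on CPU alone */
static const int tc[APPS] = { 10,  20, 25};  /* runtime with coprocessor */

/* App i ran without its coprocessor: credit the time it would have saved. */
void run_app(int i, long cbenefit[APPS]) {
    cbenefit[i] += tp[i] - tc[i];
}

/* Consider loading coprocessor i only once cbenefit(i) > loading time. */
int should_load(int i, const long cbenefit[APPS]) {
    return cbenefit[i] > LOAD_TIME;
}

/* Replay the slide's queue <a1,a2,a3,a1,a2,a3,a1>: c1's cumulative
   benefit steps through 190, 380, 570 as on the slide. */
long replay_c1(void) {
    long cb[APPS] = {0, 0, 0};
    int q[7] = {0, 1, 2, 0, 1, 2, 0};
    for (int k = 0; k < 7; k++) run_app(q[k], cb);
    return cb[0];
}

/* Not loadable after the first a1 (190 !> 200), loadable after the
   second (380 > 200). */
int load_flips_after_second_a1(void) {
    long cb[APPS] = {0, 0, 0};
    run_app(0, cb);
    int before = should_load(0, cb);
    run_app(0, cb);
    return !before && should_load(0, cb);
}
```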

47 Cumulative Benefit Heuristic – Replacing Coprocessors
Replacement policy: replace a subset CP of resident coprocessors only if cbenefit(i) - loading_time(i) > cbenefit(CP). Intuition: don't replace higher-benefit coprocessors with lower-benefit ones.
Example (loading time 200): resident coprocessors c1 (cbenefit 950) and c3 (cbenefit 320); candidate c2 has cbenefit 225. Since 225 > 200, c2 can be considered for loading – but 225 - 200 = 25 is not > 320, so DON'T load.
A greedy heuristic maintains a sorted cumulative-benefit list; time complexity is O(n).

48 Adjustment for Temporal Locality
Real application sequences exhibit temporal locality, so extend the heuristic to "fade" the cumulative benefit values: multiply each by a factor f, 0 <= f <= 1, at each step (e.g., f = 0.8: c1: 950 → 760; c2: 200 → 160 → 128, with +80 added when a2 runs; …). Define f proportional to the reconfiguration time: with a small reconfig time, reconfigure more freely and pay less attention to the past (small f). (Per-instance values as before: tpi 200, 100, 50; tci 10, 20, 25; benefit 190, 80, 25.)

49 Experiments
Our online ACBenefit algorithm gets better results than previous online algorithms on RAW, Random, Biased, and Periodic app sequences (comparing each app sequence's total runtime, including FPGA reconfig time). Avg FPGA speedup: 10x; avg coprocessor gate count: 48,000; FPGA size set to 60,000.

50 More Dynamic Configuration: Configurable Cache Example
One physical cache can be dynamically reconfigured into 18 different caches [Zhang/Vahid/Najjar, ISCA 2003, ISVLSI 2003, TECS 2005]. The four-way set-associative base cache (ways W1-W4) supports: way shutdown (gated-Vdd control shuts down two ways), way concatenation (two-way set-associative or direct-mapped operation), and line concatenation (four physical 16-byte lines filled from off-chip memory when the line size is 32 bytes). 40% average savings.

51 Highly-Configurable Platforms
Dynamic tuning of configurable components also helps: microprocessor voltage/frequency, register-file size, branch predictor; L1 and L2 cache total size, associativity, line size; memory encoding schemes. Dynamically tuning these components to match the currently executing application can significantly reduce power (and even improve performance).

52 Summary
Software is no longer just "instructions": the software elephant has a (new) tail – FPGA circuits alongside microprocessor instructions. Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …). Patent granted Oct 2007; licensed by Intel, IBM, and Freescale (via SRC). Extensive future work: online CAD algorithms, online architectures and algorithms, …

53 Warp Processors: CAD-Oriented FPGA
Solution: develop a custom CAD-oriented FPGA (the WCLA) – careful simultaneous design of the FPGA and CAD, with FPGA features evaluated for their impact on CAD and architecture features added for SW kernels. This enables fast, lean JIT FPGA compilation tools: on the order of a second or less and about 1 MB per tool stage (3.6 MB and ~10 s overall on-chip, per the figure).
(Flow: binary → decompilation → partitioning → RT synthesis → JIT FPGA compilation → HW bitstream; binary updater → updated binary; µP with I$, D$, WCLA, profiler, DPM.)

54 Warp Processors: Warp Configurable Logic Architecture (WCLA)
Need a fast, efficient coprocessor interface; analyzed digital signal processors (DSPs) and existing coprocessors. Data address generators (DADG) and loop control hardware (LCH) provide fast loop execution and support memory accesses with regular access patterns. An integrated 32-bit multiplier-accumulator (MAC): multiplies are frequently found within critical SW kernels, and fast single-cycle multipliers are large and require many interconnections. (A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

55 Warp Processors – WCLA Configurable Logic Fabric
Hundreds of commercial and research FPGA fabrics exist, most designed to balance circuit density and speed. We analyzed FPGA features to determine their impact on CAD and designed our configurable logic fabric (CLF) in conjunction with the JIT FPGA compilation tools: an array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs), with each CLB directly connected to an SM. Along with the SM design, this allows for lean JIT routing. (DATE'04)

56 Warp Processors – WCLA Combinational Logic Block
Incorporates two 3-input, 2-output LUTs – equivalent to four 3-input LUTs with fixed internal routing – allowing good-quality circuits while reducing JIT technology-mapping complexity. Provides routing resources between adjacent CLBs to support carry chains, reducing the number of nets to route. FPGAs favor flexibility/density (large CLBs, various internal routing resources); the WCLA favors simplicity (limited internal routing, reducing on-chip CAD complexity). (DATE'04)

57 Warp Processors – WCLA Switch Matrix
All nets are routed using only a single pair of channels throughout the configurable logic fabric; each short channel (0-3) is associated with a single long channel (0L-3L). Designed for fast, lean JIT FPGA routing. FPGAs favor flexibility/speed (large routing resources, various routing options); the WCLA favors simplicity, allowing a fast, lean routing algorithm. (DATE'04)

58 Warp Processors: JIT FPGA Compilation
Flow: binary → decompilation → partitioning → RT synthesis → JIT FPGA compilation (logic synthesis → technology mapping/packing → placement → routing) → HW bitstream; the binary updater produces the updated (std. HW) binary.

59 Warp Processors: ROCM – Riverside On-Chip Minimizer
Two-level logic minimization tool combining approaches from Espresso-II [Brayton et al., 1984] [Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]. Uses a single expand phase instead of multiple expand/reduce/irredundant iterations over the on-set, off-set, and dc-set, and eliminates the need to compute the off-set, reducing memory usage. On average only 2% larger than the optimal solution. (On-Chip Logic Minimization, DAC'03; A Codesigned On-Chip Logic Minimizer, CODES+ISSS'03)

60 Warp Processors – Results: Execution Time and Memory Requirements (Logic Synthesis)
Desktop CAD: logic synthesis ~1 min, technology mapping ~1 min, placement 1-2 mins, routing 2-30 mins, with memory footprints of 10-60 MB. ROCM logic synthesis: ~1 s, ~1 MB. (DAC'03; CODES+ISSS'03)

61 Warp Processors: ROCTM – Riverside On-Chip Technology Mapper
Technology mapping/packing: decompose the hardware circuit into a DAG whose nodes are basic 2-input logic gates (AND, OR, XOR, etc.); then apply a hierarchical bottom-up graph clustering algorithm – a breadth-first traversal combining nodes into single-output LUTs, combining LUTs with common inputs into the final 2-output LUTs, and packing pairs of LUTs where one LUT's output feeds the other. (Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; DATE'04)

62 Warp Processors – Results: Execution Time and Memory Requirements (Technology Mapping)
ROCTM technology mapping: <1 s, ~0.5 MB – vs. ~1 min and tens of MB (10-60 MB) for desktop tools. (DAC'03; DATE'04)

63 Warp Processors: ROCPLACE – Riverside On-Chip Placer
Placement: a dependency-based positional placement algorithm – identify the critical path and place critical nodes in the center of the CLF, then use dependencies between the remaining CLBs to determine their placement, attempting to use adjacent-CLB routing whenever possible. (DAC'03; DATE'04)

64 Warp Processors – Results: Execution Time and Memory Requirements (Placement)
ROCPLACE placement: <1 s, ~1 MB – vs. 1-2 mins and tens of MB (10-60 MB) for desktop tools. (DAC'03; DATE'04)

65 Warp Processors: Routing
FPGA routing finds a path within the FPGA to connect the source and sinks of each net in the hardware circuit. Pathfinder [Ebeling et al., 1995] introduced negotiated congestion: during each routing iteration, route nets using shortest paths, allowing overuse (congestion) of resources; if congestion exists (illegal routing), update the cost of the congested resources, rip up all routes, and reroute all nets. VPR [Betz et al., 1997] increased performance over Pathfinder, with routability-driven (use the fewest tracks possible) and timing-driven (optimize circuit speed) modes. Many of these techniques are used in commercial FPGA CAD tools.
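Negotiated congestion can be sketched on a toy problem (the resources, base costs, and two-option nets below are invented for illustration, not Pathfinder's actual data structures): two nets both prefer a cheap shared wire A with capacity 1; A's congestion history cost rises each iteration until one net accepts its private detour.

```c
#include <assert.h>

enum { R_A = 0, R_DET1 = 1, R_DET2 = 2, NRES = 3 };

/* Pathfinder-style negotiated congestion on two nets. choice[n] gets the
   resource net n finally uses; returns the iteration that converged, or
   -1 if no legal routing was found. */
int route_negotiated(int choice[2]) {
    double base[NRES] = {1.0, 3.0, 3.0};   /* A cheap, detours costly */
    double hist[NRES] = {0.0, 0.0, 0.0};   /* history congestion cost */
    int cap[NRES] = {1, 1, 1};
    int option[2][2] = {{R_A, R_DET1}, {R_A, R_DET2}};

    for (int iter = 0; iter < 20; iter++) {
        int use[NRES] = {0, 0, 0};
        for (int n = 0; n < 2; n++) {
            double c[2];
            for (int k = 0; k < 2; k++) {
                int r = option[n][k];
                /* cost = base * (1 + history) * (1 + present use);
                   overuse is allowed, it just gets expensive */
                c[k] = base[r] * (1.0 + hist[r]) * (1.0 + use[r]);
            }
            choice[n] = (c[0] <= c[1]) ? option[n][0] : option[n][1];
            use[choice[n]]++;
        }
        int congested = 0;
        for (int r = 0; r < NRES; r++)
            if (use[r] > cap[r]) { hist[r] += 1.0; congested = 1; }
        if (!congested) return iter;       /* legal routing: done */
    }
    return -1;
}

/* Demo: converges with no resource overused, and not both nets on A. */
int route_demo(void) {
    int ch[2];
    int it = route_negotiated(ch);
    return (it >= 0 && !(ch[0] == R_A && ch[1] == R_A)) ? it : -1;
}
```

In the first iteration both nets pick A (overuse); the history update makes A look expensive enough that the second net moves to its detour in the next iteration, which is the "negotiation" in a nutshell.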

66 Warp Processors: ROCR – Riverside On-Chip Router
ROCR's resource graph: nodes correspond to SMs, edges to the channels between SMs, with each edge's capacity equal to the number of wires in its channel. It requires much less memory than VPR because the resource graph is smaller, and produces circuits with a critical path 10% shorter than VPR (routability-driven). Flow: route; if illegal, rip up and reroute; else done. (Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC'04)

