
Presentation transcript:

Warp Processor: A Dynamically Reconfigurable Coprocessor
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine.
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Motorola/Freescale.
Contributing Ph.D. students: Roman Lysecky (2005, now asst. prof. at U. Arizona), Greg Stitt (Ph.D. 2006), Kris Miller (MS 2007), David Sheldon (3rd-yr PhD), Scott Sirowy (1st-yr PhD).

Frank Vahid, UC Riverside. Slide 2/36. Outline
Intro and Background: Warp Processors
Work in progress under SRC 3-yr grant
1. Parallelized-computation memory access
2. Deriving high-level constructs from binaries
3. Case studies
4. Using commercial FPGA fabrics
5. Application-specific FPGA
Other ongoing related work: Configurable cache tuning

Slide 3/36. Intro: Partitioning to FPGA
A custom ASIC coprocessor is known to speed up sw kernels, with energy advantages too (e.g., Henkel'98, Rabaey'98, Stitt/Vahid'04).
Power savings are possible even on an FPGA (Stitt/Vahid IEEE D&T'02, IEEE TECS'04).
Con: more silicon (~10x), less power savings.
Pro: platform fully programmable, mass-produced.
[Figure: an application mapped to Proc. plus ASIC, and to Proc. plus FPGA.]

Slide 4/36. Intro: FPGA vs. ASIC Coprocessor: FPGA Surprisingly Competitive
FPGA: 34% savings versus the ASIC's 48% (Stitt/Vahid IEEE D&T'02, IEEE TECS'04).
70% energy savings and 5.4x speedup vs. a 200 MHz MIPS (Stitt/Vahid DATE'05).

Slide 5/36. FPGA: Why (Sometimes) Better than Software
C code for bit reversal:
  x = (x >> 16) | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary after compilation (excerpt):
  sll $v1[3],$v0[2],0x10
  srl $v0[2],$v0[2],0x10
  or  $v0[2],$v1[3],$v0[2]
  srl $v1[3],$v0[2],0x8
  and $v1[3],$v1[3],$t5[13]
  sll $v0[2],$v0[2],0x8
  and $v0[2],$v0[2],$t4[12]
  or  $v0[2],$v1[3],$v0[2]
  srl $v1[3],$v0[2],0x4
  and $v1[3],$v1[3],$t3[11]
  sll $v0[2],$v0[2],0x4
  and $v0[2],$v0[2],$t2[10]
  ...
Processor: requires between 32 and 128 cycles.
FPGA: the hardware for bit reversal connects each bit of the original X value to its mirrored position in the bit-reversed X value, so it requires only 1 cycle (speedup of 32x to 128x).
Other big reason: concurrency.
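As a hedged illustration of the bit-reversal slide, the sketch below packages the five mask-and-shift steps as a C function and checks them against a naive one-bit-per-iteration loop. The function names are hypothetical, and the masks 0x33333333 and 0x55555555 are assumed to be the complements of the 0xcccccccc and 0xaaaaaaaa masks that survive in the transcript.

```c
#include <stdint.h>

/* Mask-and-shift reversal: swap halves, then bytes, nibbles,
   bit pairs, and finally adjacent bits. */
uint32_t reverse_bits32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per iteration from the low end
   of x to the low end of r, shifting r left each time. */
uint32_t reverse_bits32_loop(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

Both versions cost the processor many instructions, while a circuit needs no logic at all for this function: output bit i is simply wired to input bit 31 - i.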

Slide 6/36. Warp Processing: Dynamic Partitioning of Sw Kernels to FPGA
1. Initially execute the application in software only.
2. Profile the application to determine critical regions.
3. Partition critical regions to hardware.
4. Program the configurable logic and update the software binary.
5. The partitioned application executes faster with lower energy consumption.
[Figure: µP with I$ and D$, Profiler, Dynamic Partitioning Module (DPM), FPGA.]

Slide 7/36. Warp Processors: Dynamic Partitioning
Advantages vs. compile-time partitioning:
No special compilers.
Completely transparent.
Separates function and architecture for architectures having FPGAs.
Avoids the complexities of supporting different FPGAs.
Potentially brings FPGA advantages to ALL software.
[Figure: SW, Profiling, Standard Compiler, Binary, and CAD Tools, with the label "Traditional partitioning done here"; several Proc./FPGA platforms, each with its own Profiling and CAD Tools.]

Slide 8/36. Warp Processing Steps (On-Chip CAD)
[Figure: on-chip CAD flow inside the DPM. Stages named: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), Binary Updater. Artifacts named: Binary, Std. HW Binary, HW Bitstream, Updated Binary. Platform: µP, I$, D$, WCLA (FPGA), Profiler, DPM (CAD).]

Slide 9/36. Warp Processing: Partitioning
Applications spend much of their time in a small amount of code (the 90-10 rule).
Observed a 75-4 rule for MediaBench and NetBench.
Potentially large performance and energy benefits from implementing critical regions in hardware.
Use profiling results to identify critical regions.
[Figure: µP, I$, D$, WCLA (FPGA), Profiler, DPM (CAD).]
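The slide does not say how the profiler finds critical regions. As one plausible sketch, the code below counts taken backward branches, on the assumption that hot regions are loops, and reports the most frequently re-entered loop head. The table size, type names, and function names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target;   /* address a backward branch jumps to (loop head) */
    uint32_t count;    /* number of times that branch was taken */
} LoopCounter;

typedef struct {
    LoopCounter slot[PROFILE_SLOTS];
    size_t used;       /* number of slots in use; start at 0 */
} Profile;

/* Record one taken branch; only backward branches (loops) are counted.
   New loop heads are dropped once the table is full. */
void profile_branch(Profile *p, uint32_t pc, uint32_t target)
{
    if (target >= pc)
        return;                        /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].target == target) {
            p->slot[i].count++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {
        p->slot[p->used].target = target;
        p->slot[p->used].count = 1;
        p->used++;
    }
}

/* Return the loop head with the highest count, or 0 if none recorded. */
uint32_t hottest_loop(const Profile *p)
{
    uint32_t best_target = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].count > best_count) {
            best_count = p->slot[i].count;
            best_target = p->slot[i].target;
        }
    }
    return best_target;
}
```

A caller would set `used` to 0, feed every taken branch to `profile_branch`, and hand the address returned by `hottest_loop` to the partitioner as the first candidate region.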

Slide 10/36. Warp Processing: Decompilation
Synthesis from a binary has a challenge: high-level information (e.g., loops, arrays) is lost during compilation.
Solution: recover high-level information through decompilation.
Original C code:
  long f( short a[10] ) { long accum = 0; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; }
Corresponding assembly:
  Mov reg3, 0
  Mov reg4, 0
  loop: Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Control/data flow graph creation:
  reg3 := 0
  reg4 := 0
  loop: reg1 := reg3 << 1
  reg5 := reg2 + reg1
  reg6 := mem[reg5 + 0]
  reg4 := reg4 + reg6
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
Data flow analysis:
  reg3 := 0
  reg4 := 0
  loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
Function recovery:
  long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + (reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; }
Control structure recovery:
  long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; }
Array recovery:
  long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; }
The recovered code and the original C code are almost identical representations.
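To make the "almost identical representations" claim concrete, here is a small sketch that runs the low-level form (byte-addressed memory, index scaled with a shift) next to the form obtained after control structure and array recovery. The function names and the use of `int16_t` for the 16-bit loads are assumptions for illustration; the deck's `mem[...]` notation is modeled with a byte pointer.

```c
#include <stdint.h>
#include <string.h>

/* Low-level form: byte-addressed memory, element index scaled by 2
   with (reg3 << 1), mirroring the loop after data flow analysis. */
long sum_lowlevel(const unsigned char *mem, long reg2)
{
    long reg3 = 0, reg4 = 0;
loop:
    {
        int16_t v;
        memcpy(&v, mem + reg2 + (reg3 << 1), sizeof v);  /* 16-bit load */
        reg4 = reg4 + v;
    }
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
}

/* Recovered form: a counted for loop over a 10-element array. */
long sum_recovered(const int16_t array[10])
{
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++)
        reg4 += array[reg3];
    return reg4;
}
```

Calling both on the same ten-element array, with `reg2` set to 0 and the array passed as bytes to the low-level version, should return the same sum. The point of the recovery steps is that a synthesis tool sees a counted loop over an array rather than raw address arithmetic.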

Slide 11/36. Warp Processing: Decompilation
Earlier study (FPGA 2005): synthesis after decompilation is often quite similar to synthesis from the original source, with almost identical performance and a small area overhead.

Slide 12/36. Warp Processing: RT Synthesis
Maps decompiled DFG operations to hw library components: adders, comparators, multiplexors, shifters.
Creates a Boolean expression for each output bit in the dataflow graph:
  r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
  r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...
  ...
[Figure: a 32-bit adder computing r4 from r1 and r2, and a 32-bit comparator (<) producing r5.]
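The per-bit expressions on the RT synthesis slide can be checked in software. The sketch below evaluates a 32-bit add using only the bitwise sum and carry equations; the general carry term is the standard full-adder carry, which the slide truncates after carry[0], and the function name is hypothetical.

```c
#include <stdint.h>

/* Ripple-carry addition built only from per-bit Boolean expressions.
   For bit 0 the carry-in is 0, so the equations reduce to
   r4[0] = r1[0] xor r2[0] and carry[0] = r1[0] and r2[0]. */
uint32_t add_bitwise(uint32_t r1, uint32_t r2)
{
    uint32_t r4 = 0;
    uint32_t carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t a = (r1 >> i) & 1u;
        uint32_t b = (r2 >> i) & 1u;
        uint32_t sum = (a ^ b) ^ carry;            /* r4[i] */
        carry = (a & b) | (carry & (a ^ b));       /* carry[i] */
        r4 |= sum << i;
    }
    return r4;
}
```

Each loop iteration corresponds to one pair of Boolean expressions that the later logic synthesis and technology mapping stages would minimize and pack into LUTs.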

Slide 13/36. Warp Processing: JIT FPGA Compilation
Existing FPGAs require complex CAD tools.
FPGAs are designed to handle large arbitrary circuits, ASIC prototyping, etc.
The tools require long execution times and large memory usage, so they are not suitable for dynamic on-chip execution.
[Figure: traditional tool flow. Log. Syn. 1 min, Tech. Map 1 min, Place 1-2 mins, Route 2-30 mins; memory labels 50 MB, 60 MB, 10 MB, 10 MB.]
Solution: develop a custom CAD-oriented FPGA (WCLA: Warp Configurable Logic Architecture).
Careful simultaneous design of FPGA and CAD.
FPGA features evaluated for impact on CAD.
Add architecture features for SW kernels.
Enables development of fast, lean JIT FPGA compilation tools.
[Figure: JIT tool flow. Time labels 1 s, < 1 s, < 1 s, 10 s; memory labels 0.5 MB, 1 MB, 1 MB, 3.6 MB.]

Slide 14/36. Warp Configurable Logic Architecture (WCLA) (DATE'04)
Data address generators (DADG) and loop control hardware (LCH): provide fast loop execution and support memory accesses with a regular access pattern.
Integrated 32-bit multiplier-accumulator (MAC): frequently found within critical SW kernels.
[Figure: ARM with I$ and D$, Profiler, DPM, and the WCLA containing DADG & LCH, registers Reg0 to Reg2, the 32-bit MAC, and the configurable logic fabric.]

Slide 15/36. Warp Configurable Logic Architecture (WCLA) (DATE'04)
CAD-specialized configurable logic fabric.
Simplified switch matrices: directly connected to the adjacent CLB; all nets are routed using only a single pair of channels; allows for efficient routing.
Simplified CLBs: two 3-input, 2-output LUTs; each CLB is connected to the adjacent CLB to simplify routing of carry chains.
Currently being prototyped by Intel (scheduled for 2006 Q3 shuttle).
[Figure: switch matrix with channels labeled 0 to 3 and 0L to 3L; CLB with LUTs, inputs a to f, outputs o1 to o4, and links to adjacent CLBs.]

Slide 16/36. Warp Processing: Logic Synthesis
ROCM: Riverside On-Chip Minimizer, a two-level minimization tool.
Combination of approaches from Espresso-II [Brayton, et al., 1984][Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979].
Single expand phase instead of multiple iterations.
Eliminates the need to compute the off-set, which reduces memory usage.
On average only 2% larger than the optimal solution.
References: On-Chip Logic Minimization, DAC'03; A Codesigned On-Chip Logic Minimizer, CODES+ISSS'03.
[Figure: Expand, Reduce, Irredundant steps operating on the dc-set, on-set, and off-set.]

Slide 17/36. Warp Processing: Technology Mapping
ROCTM: technology mapping/packing.
Decompose the hardware circuit into a DAG whose nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.).
Hierarchical bottom-up graph clustering algorithm.
Breadth-first traversal combining nodes to form single-output LUTs.
Combine LUTs with common inputs to form final 2-output LUTs.
Pack LUTs in which the output from one LUT is an input to a second LUT.
References: Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04.
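The clustering step can be pictured with a deliberately simplified sketch. It is not ROCTM itself: it is a greedy bottom-up pass, written for illustration, that absorbs 2-input gates into a single-output LUT as long as the cluster needs at most three distinct inputs, and otherwise cuts the net so that the fanin becomes the output of its own LUT. The `Node` layout, the 64-node limit, and the function names are assumptions; forming 2-output LUTs and packing are left out.

```c
#include <stdint.h>

#define LUT_INPUTS 3   /* the WCLA slide describes 3-input LUTs */

typedef struct {
    int in0, in1;      /* fanin node indices; -1 marks a primary input */
} Node;

static int popcount64(uint64_t v)
{
    int n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

/* Nodes must be in topological order (fanins before fanouts), n must be
   at most 64, and `output` must be a gate. On return is_root[i] is 1 for
   every gate whose output leaves a LUT. Returns the number of LUTs. */
int map_to_luts(const Node *nodes, int n, int output, int *is_root)
{
    uint64_t support[64];   /* signals feeding the cluster rooted at i */
    for (int i = 0; i < n; i++) {
        is_root[i] = 0;
        if (nodes[i].in0 < 0) {                /* primary input */
            support[i] = 1ull << i;
            continue;
        }
        int a = nodes[i].in0, b = nodes[i].in1;
        uint64_t cut_a = 1ull << a, cut_b = 1ull << b;
        uint64_t both = support[a] | support[b];
        uint64_t only_a = support[a] | cut_b;
        uint64_t only_b = cut_a | support[b];
        if (popcount64(both) <= LUT_INPUTS)
            support[i] = both;                 /* absorb both fanins */
        else if (popcount64(only_a) <= LUT_INPUTS)
            support[i] = only_a;               /* absorb a, cut at b */
        else if (popcount64(only_b) <= LUT_INPUTS)
            support[i] = only_b;               /* absorb b, cut at a */
        else
            support[i] = cut_a | cut_b;        /* cut at both fanins */
    }
    /* Walk back from the output: every gate named in a root's support
       is itself the output of another LUT. */
    int luts = 0;
    is_root[output] = 1;
    for (int i = n - 1; i >= 0; i--) {
        if (!is_root[i])
            continue;
        luts++;
        for (int j = 0; j < i; j++)
            if (((support[i] >> j) & 1u) && nodes[j].in0 >= 0)
                is_root[j] = 1;
    }
    return luts;
}
```

For a circuit with four inputs feeding two gates whose outputs feed a third gate, this pass absorbs one first-level gate into the output LUT and cuts at the other, giving two LUTs.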

Slide 18/36. Warp Processing: Placement
ROCPLACE: placement.
Dependency-based positional placement algorithm.
Identify the critical path, placing critical nodes in the center of the CLF.
Use dependencies between the remaining CLBs to determine placement.
Attempt to use adjacent-CLB routing whenever possible.
References: Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04.

Slide 19/36. Warp Processing: Routing
ROCR: Riverside On-Chip Router.
Requires much less memory than VPR because its resource graph is smaller.
10x faster execution time than VPR (timing driven).
Produces circuits with a critical path 10% shorter than VPR (routability driven).
Reference: Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC'04.

Slide 20/36. Experiments with Warp Processing
Warp processor: ARM/MIPS plus our fabric. Riverside on-chip CAD tools map the critical region to the configurable fabric. Synthesis and JIT FPGA compilation require less than 2 seconds on a lean embedded processor.
Traditional HW/SW partitioning: ARM/MIPS plus a Xilinx Virtex-E FPGA. Software manually partitioned using VHDL. VHDL synthesized using Xilinx ISE 4.1.
[Figure: ARM with I$, D$, WCLA, Profiler, and DPM, versus ARM with I$, D$, and a Xilinx Virtex-E FPGA.]

Slide 21/36. Warp Processors: Performance Speedup (Most Frequent Kernel Only)
Average kernel speedup of 41, vs. 21 for Virtex-E, relative to SW-only execution.
WCLA simplicity results in faster HW circuits.

Slide 22/36. Warp Processors: Performance Speedup (Overall, Multiple Kernels)
Average speedup of 7.4 relative to SW-only execution.
Energy reduction of 38% to 94%.
Assuming a 100 MHz ARM, and the fabric clocked at the rate determined by synthesis.
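The gap between per-kernel and overall speedup follows from Amdahl's law: if a fraction f of the software-only run time is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below encodes that relation; the function name and any fractions plugged in are illustrative and are not measurements from the deck.

```c
/* Amdahl's law: overall speedup when a fraction f (0..1) of the
   original run time is accelerated by a factor s. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, under the hypothetical assumption that kernels account for 90% of run time and run 41 times faster in hardware, `overall_speedup(0.9, 41.0)` comes out near 8, because the remaining 10% still executes at software speed.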

Slide 23/36. Warp Processors: Results, Execution Time and Memory Requirements
Xilinx ISE: 9.1 s, 60 MB.
DPM (CAD): 0.2 s, 3.6 MB.
DPM (CAD) on a 75 MHz ARM7: 1.4 s, 3.6 MB.

Slide 24/36. Outline (same outline as slide 2/36).

Slide 25/36. 1. Parallelized-Computation Memory Access
Problem: kernel computation can be parallelized, but may hit a memory bottleneck.
Solution: use more advanced memory/compilation methods.
Parallelism can't be exploited if data isn't available.
[Figure: parallel units computing A[i]c[i], A[i+1]c[i+1], ... to produce B[i], B[i+1], B[i+2].]

Slide 26/36. 1. Parallelized-Computation Memory Access
Method 1: distribute data among FPGA block RAMs, which are concurrently accessible, so memory accesses are parallelized.
[Figure: the A[i]c[i] and A[i+1]c[i+1] units fed from main memory, versus the same units fed from block RAMs.]
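The slide does not say how data is distributed among the block RAMs. One common scheme, assumed here purely for illustration, is cyclic interleaving: element i goes to bank i mod B, so consecutive elements sit in different banks and can be read in the same cycle. The sketch models how many memory cycles a group of consecutive reads needs when each bank serves one access per cycle; all names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t bank;     /* which block RAM holds the element */
    size_t offset;   /* position of the element inside that bank */
} BankAddr;

/* Cyclic interleaving of an array across `banks` memories. */
BankAddr bank_address(size_t index, size_t banks)
{
    BankAddr a = { index % banks, index / banks };
    return a;
}

/* Cycles needed to fetch `width` consecutive elements starting at
   `start`, assuming each bank serves one access per cycle. The
   busiest bank determines the total. */
size_t fetch_cycles(size_t start, size_t width, size_t banks)
{
    size_t worst = 0;
    for (size_t b = 0; b < banks; b++) {
        size_t hits = 0;
        for (size_t i = start; i < start + width; i++)
            if (bank_address(i, banks).bank == b)
                hits++;
        if (hits > worst)
            worst = hits;
    }
    return worst;
}
```

With a single memory, fetching A[i] and A[i+1] takes two cycles; with two banks the same pair takes one, which is the effect the slide's figure depicts.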

Slide 27/36. 1. Parallelized-Computation Memory Access
Method 2: Smart Buffers (Najjar'2004).
Memory structure optimized for the application's access patterns.
Takes advantage of data reuse.
Speedups of 2x to 10x compared to hw without smart buffers.
[Figure: 1st, 2nd, and 3rd iteration windows sliding over A[0] ... A[8]; a controller moves data from RAM into the smart buffer that feeds the datapath, and elements no longer needed are marked "Killed".]
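The data-reuse idea behind smart buffers can be shown with a small software model. This is only a sketch of a sliding window, not the smart buffer design from the cited work: the buffer keeps the elements shared by consecutive iteration windows, drops ("kills") the oldest one, and reads just one new element from RAM per iteration. The window size and all names are hypothetical.

```c
#include <stddef.h>

#define WINDOW 3   /* hypothetical window size */

typedef struct {
    const int *ram;       /* backing memory */
    size_t next;          /* index of the next element to fetch */
    int window[WINDOW];   /* elements currently held in the buffer */
    size_t reads;         /* RAM reads performed so far */
} SmartBuffer;

/* Fill the buffer with the first window (WINDOW reads). */
void sb_init(SmartBuffer *sb, const int *ram)
{
    sb->ram = ram;
    sb->next = 0;
    sb->reads = 0;
    for (size_t i = 0; i < WINDOW; i++) {
        sb->window[i] = ram[sb->next++];
        sb->reads++;
    }
}

/* Slide the window by one element: kill the oldest, fetch one new.
   The caller must not advance past the end of the RAM array. */
void sb_advance(SmartBuffer *sb)
{
    for (size_t i = 0; i + 1 < WINDOW; i++)
        sb->window[i] = sb->window[i + 1];
    sb->window[WINDOW - 1] = sb->ram[sb->next++];
    sb->reads++;
}

/* Stand-in for the datapath: combine the elements in the window. */
int sb_window_sum(const SmartBuffer *sb)
{
    int sum = 0;
    for (size_t i = 0; i < WINDOW; i++)
        sum += sb->window[i];
    return sum;
}
```

Processing three overlapping windows this way costs five RAM reads, compared with nine if every window were fetched from scratch.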

Slide 28/36. 2. Deriving high-level constructs from binaries
Problem: some binary features are unsuitable for synthesis, such as loops unrolled by an optimizing compiler, or pointers.
Previous decompilation techniques didn't consider these; the features are OK for sw-to-sw translation, but not for synthesis.
Solution: new decompilation techniques. Convert pointers to arrays, reroll loops, and others.
Original loop:
  for (int i=0; i < 3; i++) accum += a[i];
After loop unrolling by the compiler:
  Ld reg2, 100(0)
  Add reg1, reg1, reg2
  Ld reg2, 100(1)
  Add reg1, reg1, reg2
  Ld reg2, 100(2)
  Add reg1, reg1, reg2
After loop rerolling:
  for (int i=0; i<3;i++) reg1 += array[i];
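A minimal sketch of the rerolling test, under strong simplifying assumptions: the repeated instruction groups are taken to have been matched already as identical apart from one memory offset, and the only remaining question is whether those offsets advance by a constant stride. The `Reroll` type and `detect_reroll` function are hypothetical and are not the decompiler's actual interface.

```c
#include <stddef.h>

typedef struct {
    int rerollable;   /* 1 if the offsets fit a single counted loop */
    size_t trips;     /* loop iteration count */
    long start;       /* offset used by the first iteration */
    long stride;      /* offset increment per iteration */
} Reroll;

/* offsets[k] is the memory offset used by the k-th copy of the
   unrolled loop body. */
Reroll detect_reroll(const long *offsets, size_t n)
{
    Reroll r = { 0, 0, 0, 0 };
    if (n < 2)
        return r;                      /* nothing to reroll */
    long stride = offsets[1] - offsets[0];
    for (size_t i = 2; i < n; i++)
        if (offsets[i] - offsets[i - 1] != stride)
            return r;                  /* irregular: keep unrolled */
    r.rerollable = 1;
    r.trips = n;
    r.start = offsets[0];
    r.stride = stride;
    return r;
}
```

For the slide's example, the three loads use indices 0, 1, and 2, so the check reports a three-trip loop with stride 1, which is what allows the body to be rewritten as a loop over `array[i]`.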

Slide 29/36. 2. Deriving high-level constructs from binaries
Recent study of decompilation robustness in the presence of compiler optimizations and different instruction sets (ICCAD'05, DATE'04).
Energy savings of 77%/76%/87% for MIPS/ARM/Microblaze.

Slide 30/36. 3. Case Studies
Compare warp processing (binary level) versus compiler-based (C level) partitioning for real examples.
H.264 study (w/ Freescale): highly-optimized proprietary C code. Results of the 2-month study: competitive.
Also learned that simple C-coding guidelines improve synthesis, whether done from binary or source; presently developing guidelines.
More examples: IBM (server), others...

Slide 31/36. 4. Using Commercial FPGA Fabrics
Can warp processing utilize commercial FPGAs?
Approach 1: "Virtual FPGA": map our fabric onto a commercial FPGA fabric. Collaboration with Xilinx.
Initial results: 6x performance overhead, 100x area overhead. The main problem is routing.
Investigating better methods (one-to-one mapping).
[Figure: uP with warp fabric, uP with commercial FPGA, and uP with the warp fabric implemented as a "Virtual FPGA" on the commercial fabric.]

Slide 32/36. 5. Application-Specific FPGA
Commercial FPGAs are intended for ASIC prototyping: a huge range of possible designs, and the generality causes loss of efficiency.
Propose to investigate app-spec FPGAs.
Put on an ASIC next to custom circuits and a microprocessor.
FPGA tuned to a particular circuit, but still general and reprogrammable; supports late changes, modifications to standards, etc.
Customize CLB size, # of inputs, routing resources.
Coarse-grained components: multiply-accumulate, RAM, etc.
Use a retargetable CAD tool.
Expected results: smaller, faster FPGAs.
[Figure: a general fabric of CLBs versus a DSP-tuned fabric mixing CLBs with MAC and RAM blocks.]

Slide 33/36. 5. Application-Specific FPGA
Initial results: performance improvements up to 300%, area reductions up to 90%.

Slide 34/36. Outline (same outline as slide 2/36).

Slide 35/36. Configurable Cache Tuning
Developed: a runtime configurable cache (ISCA 2003) and configuration heuristics (DATE 2004, ISLPED 2005), giving 60% memory-access energy savings.
Present focus: dynamic tuning.
Ways, line size, and total size are configurable.
[Figure: cache ways (Way 1 to Way 4) in different configurations, ISLPED 2005.]
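The deck names configuration heuristics without describing them. As a generic illustration only, and not the published heuristic, the sketch below tunes one cache parameter at a time (total size, then line size, then ways), stepping through candidate values until energy stops improving. The candidate values, the `EnergyFn` callback, and `toy_energy` are all hypothetical; a real tuner would measure energy on the running application.

```c
typedef struct {
    int size_kb;      /* total cache size */
    int line_bytes;   /* line size */
    int ways;         /* associativity */
} CacheConfig;

typedef double (*EnergyFn)(CacheConfig);

static double absd(double v) { return v < 0 ? -v : v; }

/* Hypothetical energy model with its minimum at 4 KB, 32 B, 2 ways. */
double toy_energy(CacheConfig c)
{
    return absd(c.size_kb - 4) + absd(c.line_bytes - 32) / 16.0
         + absd(c.ways - 2);
}

/* Step one field through its candidates in order; stop at the first
   value that fails to lower energy. Returns evaluations performed. */
static int tune_param(CacheConfig *cfg, int *field,
                      const int *values, int n, EnergyFn energy)
{
    int evals = 0;
    *field = values[0];
    double best = energy(*cfg);
    evals++;
    for (int i = 1; i < n; i++) {
        int prev = *field;
        *field = values[i];
        double e = energy(*cfg);
        evals++;
        if (e < best) {
            best = e;
        } else {
            *field = prev;             /* no gain: keep previous value */
            break;
        }
    }
    return evals;
}

/* Greedy search: size first, then line size, then ways. */
CacheConfig tune_cache(EnergyFn energy, int *evaluations)
{
    static const int sizes[] = { 2, 4, 8 };
    static const int lines[] = { 16, 32, 64 };
    static const int ways[]  = { 1, 2, 4 };
    CacheConfig cfg = { sizes[0], lines[0], ways[0] };
    int evals = 0;
    evals += tune_param(&cfg, &cfg.size_kb, sizes, 3, energy);
    evals += tune_param(&cfg, &cfg.line_bytes, lines, 3, energy);
    evals += tune_param(&cfg, &cfg.ways, ways, 3, energy);
    if (evaluations)
        *evaluations = evals;
    return cfg;
}
```

With the toy model, `tune_cache(toy_energy, &n)` settles on 4 KB, 32-byte lines, and 2 ways after 9 evaluations, whereas exhaustive search over the same candidates would need 27. A greedy search like this can miss the optimum when the parameters interact strongly.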

Slide 36/36. Summary
Basic warp technology: developed (NSF, and 1-year CSR grants from SRC); uses binary synthesis and FPGAs.
Conclusion: feasible technology, much potential.
Ongoing work (SRC): improve and validate the effectiveness of binary synthesis; examine FPGA implementation issues.
Extensive future work to develop robust warp technology.