Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis


Greg Stitt
Department of Electrical and Computer Engineering, University of Florida

Introduction
Improved performance enables new applications.
Past decade: MP3 players, portable game consoles, cell phones, etc.
Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

Introduction
FPGAs (Field Programmable Gate Arrays) implement custom circuits.
FPGAs are capable of large performance improvements over a microprocessor (uP): 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
But FPGAs are not mainstream.
Warp processing goal: bring FPGAs into the mainstream by making FPGAs “invisible”.

Introduction – Hardware/Software Partitioning
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93][Gupta, DeMicheli 97][Vahid, Gajski 94][Eles et al. 97][Sangiovanni-Vincentelli 94].
C code for FIR filter:
    for (i=0; i < 16; i++) y[i] += c[i] * x[i]
    ..
    for (i=0; i < 128; i++) y[i] += c[i] * x[i]
    ..
The compiler maps the code to the processor, where the critical loop takes ~1000 cycles.
The designer creates custom hardware for the loop using a hardware description language (HDL); on the FPGA the loop takes ~10 cycles.
Speedup = 1000 cycles / 10 cycles = 100x
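
The 100x figure is a ratio of cycle counts for the loop alone. The sketch below reproduces that ratio and, as an added assumption, applies Amdahl's law to show how the loop speedup would translate to a whole application. The function names and the 90% loop fraction are hypothetical and do not come from the slides.

```python
def loop_speedup(software_cycles: float, hardware_cycles: float) -> float:
    """Speedup of one region: cycles in software divided by cycles in hardware."""
    if hardware_cycles <= 0:
        raise ValueError("hardware_cycles must be positive")
    return software_cycles / hardware_cycles


def overall_speedup(loop_fraction: float, region_speedup: float) -> float:
    """Amdahl's law: only the fraction of time spent in the loop is accelerated."""
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / region_speedup)


# Approximate cycle counts quoted on the slide.
fir = loop_speedup(1000, 10)
print(fir)  # 100.0

# Hypothetical: the loop accounts for 90% of the original run time.
print(overall_speedup(0.9, fir))
```

The second number is much smaller than 100x, which is why partitioning targets the regions where the program spends most of its time.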

Introduction – High-level Synthesis
Problem: Describing a circuit using HDL is time consuming and difficult.
Solution: High-level synthesis creates the circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92].
Allows developers to use a higher-level specification.
Potentially enables synthesis for software developers.
(Figure: tool flow with high-level code, hw/sw partitioning, compiler, high-level synthesis, libraries/object code, and linker, producing software for the uP and a bitstream for the FPGA.)


Introduction – High-level Synthesis (example)
High-level synthesis turns a loop such as
    for (i=0; i < 16; i++) y[i] += c[i] * x[i]
into a circuit of multipliers and adders.

Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
    Key techniques for synthesis from binaries
    Decompilation
Current and Future Directions
    Multi-threaded Warp Processing
    Custom Communication

Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers.
Requires a specialized language: SystemC, NapaC, HandelC, …
Requires a specialized compiler: Spark, ROCCC, CatapultC, …
The result is a non-standard software tool flow.
Software developers are reluctant to change tools.
Limited commercial success.

Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”.
Two requirements:
    Standard software tool flow: perform compilation before synthesis.
    Hide the synthesis tool: move synthesis on chip.
Similar to dynamic binary translation [Transmeta], but translates to hardware.

Warp Processing – “Invisible” Synthesis
A warp processor looks like a standard uP but invisibly synthesizes hardware.

Warp Processing – “Invisible” Synthesis
Advantages:
    Supports all languages, compilers, and IDEs (for example C, C++, Java, Matlab; gcc, g++, javac, keil).
    Supports synthesis of assembly code.
    Supports synthesis of library code.
    Also enables dynamic optimizations.

Warp Processing Background: Basic Idea
Architecture: µP, instruction memory (I Mem), data cache (D$), profiler, FPGA, and on-chip CAD.
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Basic Idea, step 2: The microprocessor executes the instructions in the software binary.

Basic Idea, step 3: The profiler monitors instructions and detects critical regions in the binary. As the add/beq pair that closes the loop executes over and over, the profiler reports that a critical loop has been detected.
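
The slides do not say how the profiler decides that a loop is critical. One plausible approach, sketched below purely as an assumption, is to count taken backward branches and report the most frequent one. The trace format, the `threshold` value, and the instruction indices (0 for the first `Mov`, 7 for the `Beq`, so that the `-5` offset lands on the `Shl` at index 2) are all hypothetical.

```python
from collections import Counter


def find_critical_loop(branch_trace, threshold=8):
    """Return (loop_start, loop_end) for the hottest backward branch, or None.

    branch_trace is an iterable of (branch_pc, target_pc) pairs, one per
    taken branch. A branch whose target precedes it closes a loop.
    """
    counts = Counter()
    for branch_pc, target_pc in branch_trace:
        if target_pc < branch_pc:
            counts[(target_pc, branch_pc)] += 1
    if not counts:
        return None
    loop, hits = counts.most_common(1)[0]
    return loop if hits >= threshold else None


# Ten iterations of the example loop take the branch at index 7 nine times.
trace = [(7, 2)] * 9
print(find_critical_loop(trace))  # (2, 7)
```

A counter per branch keeps the bookkeeping small, which matters if the profiler has to run on chip next to the processor.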

Basic Idea, step 4: The on-chip CAD reads in the critical region.

Basic Idea, step 5: The on-chip CAD (the dynamic partitioning module, DPM) converts the critical region into a control/data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Basic Idea, step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

Basic Idea, step 7: The on-chip CAD maps the circuit onto the FPGA, which is built from configurable logic blocks (CLBs) and switch matrices (SMs).

Basic Idea, step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more.
“Warped” binary, replacing the software-only binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4

Expandable Logic
Expandable RAM: the system detects RAM during start and improves performance invisibly.
Expandable logic: warp tools detect the amount of FPGA and invisibly adapt the application to use less or more hardware.
(Figure: platform with µPs, caches, RAM, DMA, profiler, warp tools, and expandable FPGA, next to a performance chart.)

Expandable Logic
Allows for customization of platforms: the user can select FPGAs based on the applications used.
(Figure: application performance chart; portable gaming shows unacceptable performance.)

Expandable Logic
The user can customize FPGAs to the desired amount of performance.
The performance improvement is invisible: it doesn’t require a new binary from the developer.

Expandable Logic
(Figure: application performance chart; a web browser with no FPGA shows acceptable performance.)
The platform designer doesn’t have to decide on a fixed amount of FPGA.
The user doesn’t have to pay for FPGA that isn’t needed.

Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations.
Develop extremely efficient on-chip CAD tools:
    Requires efficient synthesis.
    Requires a specialized FPGA and physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona.
On-chip CAD flow: binary synthesis turns the binary into hardware (HW) and an updated binary; JIT FPGA compilation then performs logic optimization, technology mapping, and placement and routing.

Warp Processing Background: On-Chip CAD
Flow: synthesis (manually performed), RT synthesis, logic optimization, technology mapping, place, route.
Xilinx ISE: 9.1 s, 60 MB
On-chip CAD: 0.2 s, 3.6 MB (46x improvement)
On-chip CAD on a 75 MHz ARM7: only 1.4 s
30% performance penalty

Warp Processing: Initial Results – Embedded Applications
Average speedup of 6.3x, achieved completely transparently.
Also, energy savings of 66%.


Binary Synthesis
Warp processors perform synthesis from a software binary: “binary synthesis”.
Hardware can be > 10x to 100x faster than the processor.
Problem: the binary carries no high-level information (no high-level constructs such as arrays and loops), but synthesis needs high-level constructs. Without them: > 10x slowdown.
Can we recover high-level information for synthesis, and so make binary synthesis (and warp processing) competitive with high-level synthesis?
Example: the compiler turns
    for (i=0; i < 128; i++) y[i] += c[i] * x[i]
    ..
into
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5

Decompilation
We realized that decompilation recovers high-level information.
But it is generally used for binary translation or source-code recovery, and may not be suitable for synthesis.
We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]: DisC, dcc, Boomerang, Mocha, SourceAgain.
We determined the relevant techniques and adapted existing techniques for synthesis.

Decompilation – Control/Data Flow Graph Recovery
Recovery of the control/data flow graph (CDFG), the format used by synthesis.
Difficult because of indirect jumps: control flow cannot always be analyzed statically.
But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00].
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
After control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
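
The data-flow half of this step amounts to rewriting each instruction as a register transfer. The sketch below does that for the straight-line instructions of the slide's pseudo-assembly. It is an illustration only: the `lift` function is hypothetical, it covers just the opcodes that appear in the example, and it leaves out branches because recovering control flow is the harder part described above.

```python
def lift(instruction: str) -> str:
    """Translate one pseudo-assembly instruction into a register transfer."""
    opcode, _, rest = instruction.strip().partition(" ")
    args = [arg.strip() for arg in rest.split(",")]
    opcode = opcode.lower()
    if opcode == "mov":
        return f"{args[0]} := {args[1]}"
    if opcode == "shl":
        return f"{args[0]} := {args[1]} << {args[2]}"
    if opcode == "add":
        return f"{args[0]} := {args[1]} + {args[2]}"
    if opcode == "ld":
        # Memory operands are written as offset(base), for example 0(reg5).
        offset, base = args[1].rstrip(")").split("(")
        return f"{args[0]} := mem[{base} + {offset}]"
    if opcode == "ret":
        return f"ret {args[0]}"
    raise ValueError(f"unsupported opcode: {opcode}")


loop_body = [
    "Shl reg1, reg3, 1",
    "Add reg5, reg2, reg1",
    "Ld reg6, 0(reg5)",
    "Add reg4, reg4, reg6",
    "Add reg3, reg3, 1",
]
for line in loop_body:
    print(lift(line))
```

The printed transfers match the loop body of the graph shown on the slide, one statement per instruction.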

Decompilation – Data Flow Analysis
Original purpose: remove temporary registers.
Area overhead: 130%.
Need new techniques for binary synthesis.
After data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
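
Removing temporary registers can be pictured as forward substitution: the expression assigned to a temporary is inlined into its later uses and the temporary's own assignment is dropped. The sketch below is a simplified illustration under stated assumptions, not the authors' analysis. The `forward_substitute` name and the `live_out` set (registers whose values must survive) are hypothetical, and the sketch refuses to inline an expression after one of its operands has been overwritten.

```python
import re

REGISTER = re.compile(r"\breg\d+\b")


def forward_substitute(statements, live_out):
    """Inline temporaries into later uses; keep only writes to live_out."""
    pending = {}  # temporary -> expression to inline, or None once stale
    result = []
    for dest, expr in statements:
        def inline(match):
            name = match.group(0)
            if name not in pending:
                return name
            if pending[name] is None:
                raise ValueError(f"{name} was invalidated before its use")
            return pending[name]

        expr = REGISTER.sub(inline, expr)
        # Writing dest makes every pending expression that reads dest stale.
        for temp, value in pending.items():
            if value is not None and dest in REGISTER.findall(value):
                pending[temp] = None
        if dest in live_out:
            pending.pop(dest, None)
            result.append((dest, expr))
        else:
            pending[dest] = f"({expr})"
    return result


body = [
    ("reg1", "reg3 << 1"),
    ("reg5", "reg2 + reg1"),
    ("reg6", "mem[reg5 + 0]"),
    ("reg4", "reg4 + reg6"),
    ("reg3", "reg3 + 1"),
]
for dest, expr in forward_substitute(body, live_out={"reg3", "reg4"}):
    print(f"{dest} := {expr}")
```

Five statements collapse into two, with the same shape as the slide's result apart from redundant parentheses and the `+ 0` offset, which a later simplification pass could remove.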

Decompilation – Data Flow Analysis
Strength reduction of compare-with-zero instructions:
    Sub reg3, reg4, reg5
    Bz reg3, -5
The subtractor is not needed and wastes area; the optimized DFG branches on reg4 = reg5 directly.
Operator size reduction:
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
The load byte makes reg4 an 8-bit value and the constant 16 fits in 5 bits, so the 32-bit adder with 32-bit reg4, reg5, and reg3 is replaced in the optimized DFG by an 8-bit adder producing an 8-bit reg3.
Area overhead reduced to 10%.

Decompilation – Function Recovery
Recover parameters and return values.
Def-use analysis of prologue/epilogue.
100% success rate.
After function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

Decompilation – Control Structure Recovery
Recover loops and if statements.
Uses interval analysis techniques [Cifuentes 94].
100% success rate.
After control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
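
The slide names interval analysis as the technique used. As a simpler stand-in, and explicitly not the same algorithm, the sketch below finds loops by looking for back edges during a depth-first walk of the control flow graph. In reducible graphs, an edge that returns to a block still on the walk's stack closes a loop. The graph encoding and block names are hypothetical.

```python
def find_back_edges(cfg, entry):
    """Return (source, target) edges that jump back to a block still being visited."""
    back_edges = []
    state = {}  # block -> "active" while on the DFS stack, "done" afterwards

    def visit(block):
        state[block] = "active"
        for successor in cfg.get(block, []):
            if state.get(successor) == "active":
                back_edges.append((block, successor))
            elif successor not in state:
                visit(successor)
        state[block] = "done"

    visit(entry)
    return back_edges


# Control flow of the example function: the loop block branches to itself or exits.
cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
print(find_back_edges(cfg, "entry"))  # [('loop', 'loop')]
```

Each back edge identifies a loop header (its target), after which the loop condition and body can be rewritten as a `for` or `while` statement.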

Decompilation – Array Recovery
Detect linear memory patterns and row-major ordering calculations.
~95% success rate [Stitt, Guo, Najjar, Vahid 05][Cifuentes 00].
After array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
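
A linear memory pattern means the loop touches addresses of the form base + i * stride. The sketch below checks a list of addresses for that shape and reports the base, element size, and length. It is a hypothetical illustration: the `recover_array` name and the base address `0x1000` standing in for `reg2` are assumptions, and real array recovery works on address expressions rather than on recorded addresses.

```python
def recover_array(addresses):
    """Return (base, element_size, length) if addresses are evenly spaced, else None."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    if stride <= 0:
        return None
    for previous, current in zip(addresses, addresses[1:]):
        if current - previous != stride:
            return None
    return addresses[0], stride, len(addresses)


# The example loop reads mem[reg2 + (reg3 << 1)] for reg3 = 0..9.
reg2 = 0x1000  # hypothetical base address
accesses = [reg2 + (reg3 << 1) for reg3 in range(10)]
print(recover_array(accesses))  # (4096, 2, 10)
```

An element size of 2 bytes and a length of 10 is what lets the decompiler declare the parameter as `short array[10]`.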

Comparison of Decompiled Code and Original Code
The decompiled code is almost identical to the original code; the only difference is the variable names.
Binary synthesis is competitive with high-level synthesis.
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Decompiled code:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

Binary Synthesis Tool Flow
Initially, high-level source is compiled and linked with libraries/object code to form a binary.
Binary synthesis stages: profiling, decompilation, hw/sw estimation, hw/sw partitioning, synthesis, and binary updater.
Decompilation recovers the high-level information needed for synthesis.
Synthesis produces hardware netlists and a bitstream for the FPGA.
The binary updater modifies the binary to use the synthesized hardware, producing the updated binary for the uP.
Binary synthesis tool: ~30,000 lines of C code.

Binary Synthesis is Competitive with High-Level Synthesis
Small difference in speedup: binary speedup 8x, high-level speedup 8.2x.
High-level synthesis is only 2.5% better.
Commercial products are beginning to appear: Critical Blue, Binachip.

Binary Synthesis with Software Compiler Optimizations
But those binaries were generated with few optimizations.
The SW compiler optimizes the binary for software, and optimizations for software may hurt hardware: hardware synthesized from an optimized binary may be inefficient.
Need new decompilation techniques.

Loop Rerolling
Problem: Loop unrolling may cause inefficient hardware.
    Longer synthesis times: with super-linear heuristics, unrolling 100 times => synthesis time is 100^2 times longer.
    Larger area requirements.
    Unrolling by the compiler is unlikely to match unrolling by synthesis.
    Loop structure is needed for advanced synthesis techniques.
Non-unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
Unrolled loop (beginning):
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
(Figure: synthesis execution times.)
Solution: We introduce loop rerolling to undo loop unrolling.

Loop Rerolling – Identifying Unrolled Loops
Idea: identify consecutively repeating instruction sequences.
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i]=b[i]+1;
    y=x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;
Binary, with each instruction mapped to a character:
    Add r3, r3, 1   => B
    Ld r0, b(0)     => A
    Add r1, r0, 1   => B
    St a(0), r1     => C
    Ld r0, b(1)     => A
    Add r1, r0, 1   => B
    St a(1), r1     => C
    Mov r4, r3      => D
String representation: BABCABCD
Build a suffix tree of the string [Ukkonen 95] and find consecutive repeating substrings: adjacent nodes with the same substring.
Result: an unrolled loop with 2 unrolled iterations; each iteration = abc (Ld, Add, St).
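
The search for consecutive repeating substrings can be illustrated without a suffix tree. The sketch below swaps in a brute-force scan, which is slower than the suffix-tree approach named on the slide but finds the same repeat in a short string. The function name is hypothetical.

```python
def find_unrolled_loop(text):
    """Return (start, unit, repeats) for the longest run of back-to-back copies, or None."""
    best = None
    length = len(text)
    for start in range(length):
        for size in range(1, (length - start) // 2 + 1):
            unit = text[start:start + size]
            repeats = 1
            # Count how many copies of unit follow one another from start.
            while text.startswith(unit, start + repeats * size):
                repeats += 1
            covered = repeats * size
            if repeats >= 2 and (best is None or covered > best[2] * len(best[1])):
                best = (start, unit, repeats)
    return best


print(find_unrolled_loop("BABCABCD"))  # (1, 'ABC', 2)
```

The result says that two copies of `ABC` (Ld, Add, St) begin at the second instruction, which is the unrolled loop body.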

Loop Rerolling
Unrolled loop identification finds the repeated iterations in the binary:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
1) Determine the relationship of the constants.
2) Replace the constants with an induction variable expression:
    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
3) Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
      array1[i]=array2[i]+1;
    reg4=reg3;
Average speedup of 1.6x.

Strength Promotion
Problem: Strength reduction may cause inefficient hardware.
Identify strength-reduced subgraphs and replace them with multiplication.
However, some of the strength reduction was beneficial, so synthesis reapplies strength reduction to get the optimal DFG.
Strength promotion lets synthesis decide on strength reduction, not the software compiler.
(Figure: data flow graphs computing A[i] from B[i], B[i+1], B[i+2], and B[i+3], in which shift-and-add subgraphs are replaced one by one with multiplications by the constants 10, 18, 34, and 66.)
Average speedup of 1.5x.
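
Strength promotion reverses a compiler rewrite in which a multiplication by a constant was turned into shifts and adds. The sketch below shows the arithmetic behind it. The shift amounts are an assumption read off the figure's labels (shifts by 3, 4, 5, and 6, each paired with a shift by 1), and the `promote` function is hypothetical.

```python
def promote(shift_amounts):
    """Return the constant c such that the sum of (x << k) over all k equals c * x."""
    return sum(1 << k for k in shift_amounts)


# Shift pairs assumed from the figure, one pair per B[] operand.
for shifts in ([3, 1], [4, 1], [5, 1], [6, 1]):
    print(shifts, "->", promote(shifts))

# Spot check: the shift-and-add form and the multiplication agree.
x = 7
assert (x << 3) + (x << 1) == promote([3, 1]) * x
```

The four constants come out as 10, 18, 34, and 66, the multipliers that appear in the promoted graph. Once the multiplications are explicit, synthesis is free to pick whichever form is cheapest in hardware.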

Multiple ISA/Optimization Results
What about aggressive software compiler optimizations? They may obscure the binary, making decompilation impossible.
What about different instruction sets? Side effects may degrade hardware performance.
Results (speedup chart):
    Speedups similar between ARM and MIPS; the complex instructions of ARM didn’t hurt synthesis.
    Speedups similar on ARM for –O1 and –O3 optimizations.
    Speedups similar on MIPS for –O1 and –O3 optimizations.
    MicroBlaze speedups much larger: MicroBlaze is a slower microprocessor, and –O3 optimizations were very beneficial to hardware.

High-level vs. Binary Synthesis: Proprietary H.264 Decoder
High-level synthesis vs. binary synthesis, in collaboration with Freescale Semiconductor.
H.264 decoder: MPEG-4 Part 10 Advanced Video Coding (AVC).
3x smaller than MPEG-2, with better quality.

High-level vs. Binary Synthesis: Proprietary H.264 Decoder
Binary synthesis was competitive with high-level synthesis: high-level speedup 6.56x, binary speedup 6.55x.


Thread Warping – Overview
Architectural trend: include more cores on chip. Result: more multi-threaded applications.
Example: function a( ) creates 10 threads of b( ):
    for (i=0; i < 10; i++)
      createThread( b );
The OS can only schedule 2 threads on the µPs; the remaining 8 threads are placed in the thread queue.
Warp tools create custom accelerators for b( ) on the warp FPGA.
The OS schedules 4 threads to custom accelerators: 3x more thread parallelism.
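
The 3x figure follows from counting execution slots: 2 threads on processors plus 4 on accelerators, against 2 on processors alone. The sketch below works through that count and, as an added assumption, estimates how many scheduling rounds the 10 threads need if every thread takes the same time and each slot runs one thread at a time. The function name is hypothetical.

```python
import math


def scheduling_rounds(num_threads: int, num_slots: int) -> int:
    """Rounds needed when each slot runs one equal-length thread per round."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return math.ceil(num_threads / num_slots)


threads = 10
processor_slots = 2
accelerator_slots = 4

gain = (processor_slots + accelerator_slots) / processor_slots
print(gain)  # 3.0
print(scheduling_rounds(threads, processor_slots))  # 5
print(scheduling_rounds(threads, processor_slots + accelerator_slots))  # 2
```

This counts parallelism only. It ignores that an accelerator may also run each thread faster than a processor does.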

Thread Warping – Overview
The profiler detects a performance-critical loop in b( ).
Warp tools create larger/faster accelerators.
Potentially > 100x speedup.

Thread Warping – Results
Comparison of thread warping (TW) and multi-core.
Simulated multi-cores ranging from 4 to 64 cores; thread warping uses 4 cores + FPGA.
Thread warping is 120x faster than the 4-uP (ARM) system.

Warp Processing – Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar].
Problem: The best topology is application dependent.
(Figure: performance of App1 and App2 on bus and mesh topologies.)

Warp Processing – Custom Communication
Warp processing can dynamically choose the topology: 2x to 100x improvement.
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”.

Summary
Any language, any compiler, standard binary: the developer is unaware of FPGA/synthesis.
Decompilation makes binary synthesis possible.
Warp processing: on-chip CAD performs binary synthesis and JIT FPGA compilation, producing hardware and an updated binary.
Expandable logic.
Warp processing invisibly achieves > 100x speedups.

References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
Hardware/Software Partitioning of Software Binaries. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
Binary Synthesis. G. Stitt, F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt, F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx