Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida

2/55 Introduction Improved performance enables new applications. Past decade: MP3 players, portable game consoles, cell phones, etc. Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

3/55 Introduction FPGAs (Field-Programmable Gate Arrays) implement custom circuits, yielding 10x, 100x, even 1000x speedups for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], … But FPGAs are not mainstream. Warp Processing goal: bring FPGAs into the mainstream by making FPGAs "invisible". (Chart: uP vs. FPGA performance – FPGAs are capable of large performance improvements.)

4/55 Introduction – Hardware/Software Partitioning Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93][Gupta, DeMicheli 97][Vahid, Gajski 94][Eles et al. 97][Sangiovanni-Vincentelli 94]. Example: the C code for an FIR filter, for (i=0; i < 128; i++) y[i] += c[i] * x[i], takes ~1000 cycles per pass on the processor. The designer creates custom hardware for the loop using a hardware description language (HDL); the hardware version takes ~10 cycles. Speedup = 1000 cycles / 10 cycles = 100x.
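To make the partitioning concrete, here is a minimal C sketch of the FIR loop in software and a hypothetical partitioned version; the FirAccel register layout and the fir_hw() helper are assumptions for illustration, not an actual warp-processor interface.

#include <stdint.h>

enum { N = 128 };

/* Software-only FIR accumulation: the loop the partitioner targets
 * (~1000 cycles on the processor in the slide's example). */
void fir_sw(int32_t y[N], const int32_t c[N], const int32_t x[N]) {
    for (int i = 0; i < N; i++)
        y[i] += c[i] * x[i];
}

/* Hypothetical register block of a memory-mapped loop accelerator;
 * the layout is illustrative only. */
typedef struct {
    volatile uint32_t src_c, src_x, dst_y, len, start, done;
} FirAccel;

/* Partitioned version: the processor configures the accelerator and
 * waits while the custom circuit performs the loop. */
void fir_hw(FirAccel *acc, int32_t y[N], const int32_t c[N], const int32_t x[N]) {
    acc->src_c = (uint32_t)(uintptr_t)c;
    acc->src_x = (uint32_t)(uintptr_t)x;
    acc->dst_y = (uint32_t)(uintptr_t)y;
    acc->len   = N;
    acc->start = 1;
    while (!acc->done)
        ;   /* spin until the circuit signals completion */
}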

5/55 Introduction – High-level Synthesis Problem: describing a circuit using an HDL is time-consuming and difficult. Solution: high-level synthesis – create the circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]. This allows developers to use a higher-level specification and potentially enables synthesis for software developers. (Tool-flow diagram: high-level code is hw/sw partitioned; the software path goes through the compiler and linker with libraries/object code to an updated binary for the uP, while the hardware path goes through high-level synthesis to a bitstream for the FPGA.)

7/55 Introduction – High-level Synthesis The key point of the example: high-level synthesis converts the loop for (i=0; i < 16; i++) y[i] += c[i] * x[i] directly into a custom circuit, with no manual HDL design.

8/55 Outline: Introduction; Warp Processing Overview; Enabling Technology – Binary Synthesis (key techniques for synthesis from binaries, decompilation); Current and Future Directions (multi-threaded warp processing, custom communication).

9/55 Problems with High-Level Synthesis Problem: high-level synthesis is unattractive to software developers. It requires a specialized language (SystemC, NapaC, HandelC, …) and a specialized compiler (Spark, ROCCC, CatapultC, …), resulting in a non-standard software tool flow. Commercial success has been limited because software developers are reluctant to change tools.

10/55 Warp Processing – “Invisible” Synthesis Solution: make synthesis “invisible”. Two requirements: (1) keep the standard software tool flow by moving compilation before synthesis; (2) hide the synthesis tool by moving synthesis on chip. This is similar to dynamic binary translation [Transmeta], but translating to hardware: decompilation and synthesis operate on the software binary produced by any standard compiler and linker.

11/55 Warp Processing – “Invisible” Synthesis A warp processor looks like a standard uP but invisibly synthesizes hardware: the developer follows the standard software tool flow, and the on-chip tools handle decompilation and synthesis behind the scenes.

12/55 Warp Processing – “Invisible” Synthesis Advantages: supports all languages (C, C++, Java, Matlab, …), compilers (gcc, g++, javac, Keil, …), and IDEs; supports synthesis of assembly code and of library code; also enables dynamic optimizations.

13/55 Warp Processing Background: Basic Idea (step 1) Initially, the software binary is loaded into instruction memory. Example software binary: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4. (Platform: uP with instruction memory and data cache, profiler, FPGA, and on-chip CAD.)

14/55 Warp Processing Background: Basic Idea (step 2) The microprocessor executes the instructions in the software binary.

15/55 Warp Processing Background: Basic Idea (step 3) The profiler monitors the executed instructions and detects critical regions in the binary – here, a critical loop is detected at the add/beq back edge.
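As a rough software analogy (an assumption for clarity, not the actual non-intrusive profiler hardware), the profiler can be pictured as a small table of counters indexed by the targets of taken backward branches; a target whose count crosses a threshold marks a candidate critical loop.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE    256
#define HOT_THRESHOLD 1000

/* Direct-mapped table of branch-target counters. */
static uint32_t targets[TABLE_SIZE];
static uint32_t counts[TABLE_SIZE];

/* Conceptually called on every taken backward branch; returns 1 once
 * the target has become "hot" (a likely critical loop). */
int profile_backward_branch(uint32_t target_addr) {
    unsigned idx = (target_addr >> 2) % TABLE_SIZE;  /* word-aligned hash */
    if (targets[idx] != target_addr) {               /* evict on conflict */
        targets[idx] = target_addr;
        counts[idx] = 0;
    }
    return ++counts[idx] >= HOT_THRESHOLD;
}

int main(void) {
    for (int i = 0; i < 1500; i++)                   /* loop back edge taken 1500 times */
        if (profile_backward_branch(0x1040u)) {
            printf("critical loop detected at 0x%x\n", 0x1040u);
            break;
        }
    return 0;
}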

16/55 Warp Processing Background: Basic Idea (step 4) The on-chip CAD reads in the critical region.

17/55 Warp Processing Background: Basic Idea (step 5) The on-chip CAD converts the critical region into a control/data flow graph (CDFG): reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.
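One possible (assumed, much simplified) in-memory shape for the recovered CDFG is a graph of basic blocks holding three-address statements; the slides do not specify the on-chip representation, so the C types below are purely illustrative.

/* Simplified three-address statement: dst := src1 op src2. */
typedef enum { OP_MOV, OP_ADD, OP_SHL, OP_LOAD, OP_BRANCH_LT, OP_RET } Opcode;

typedef struct {
    Opcode op;
    int dst;          /* destination register           */
    int src1, src2;   /* source registers or immediates */
} Stmt;

/* Basic block: straight-line statements plus control-flow edges, e.g.
 * the loop body reg4 := reg4 + mem[reg2 + (reg3 << 1)] ... with a
 * taken edge back to itself and a fall-through to the return. */
typedef struct BasicBlock {
    Stmt *stmts;
    int num_stmts;
    struct BasicBlock *taken;      /* successor when the branch is taken */
    struct BasicBlock *fallthru;   /* successor otherwise                */
} BasicBlock;

/* The CDFG handed to synthesis is the entry block of this graph. */
typedef struct { BasicBlock *entry; } Cdfg;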

18/55 Warp Processing Background: Basic Idea (step 6) The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

19/55 Warp Processing Background: Basic Idea (step 7) The on-chip CAD maps the circuit onto the FPGA (configurable logic blocks, switch matrices, and routing).

20/55 Warp Processing Background: Basic Idea (step 8) The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more. The software-only binary becomes the “warped” binary: Mov reg3, 0; Mov reg4, 0; loop: // instructions that interact with FPGA; Ret reg4.
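A toy sketch of the binary-update step (the instruction encoding and addresses are invented for illustration, not a real ISA): the first word of the critical region is overwritten with a jump to a stub that starts the FPGA and waits for it.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define OP_JUMP 0xE0u   /* toy opcode in the top byte, target in the rest */

/* Overwrite the loop entry in (a copy of) instruction memory with a
 * jump to an FPGA-invoking stub, which is how the updated binary ends
 * up containing only "instructions that interact with FPGA". */
void patch_region(uint32_t *imem, uint32_t region_start_word, uint32_t stub_word) {
    imem[region_start_word] = ((uint32_t)OP_JUMP << 24) | (stub_word & 0x00FFFFFFu);
}

int main(void) {
    uint32_t imem[16];
    memset(imem, 0, sizeof imem);
    patch_region(imem, 2, 12);   /* loop entry at word 2, stub at word 12 */
    printf("word 2 now holds 0x%08x\n", (unsigned)imem[2]);
    return 0;
}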

21/55 Expandable Logic and Expandable RAM Expandable RAM – the system detects additional RAM at startup and improves performance invisibly. Expandable Logic – the warp tools detect the amount of FPGA present and invisibly adapt the application to use less or more hardware. (Diagram: uP, cache, profiler, warp tools, DMA, FPGA, and expandable RAM/logic banks.)

22/55 Expandable Logic Allows customization of platforms: the user can select FPGAs based on the applications they use. (Example: a portable gaming application whose performance is unacceptable without additional FPGA.)

23/55 Expandable Logic The user can add FPGA capacity until the desired performance is reached. The performance improvement is invisible – it doesn’t require a new binary from the developer.

24/55 Expandable Logic (Example: a web browser whose performance is already acceptable with no FPGA.) The platform designer doesn’t have to decide on a fixed amount of FPGA, and the user doesn’t have to pay for FPGA that isn’t needed.

25/55 Warp Processing Background: Basic Technology Challenge: CAD tools normally require powerful workstations. Warp processing therefore needs extremely efficient on-chip CAD tools: efficient synthesis plus specialized FPGA and physical design tools for JIT FPGA compilation [Lysecky FCCM05/DAC04] (University of Arizona). (JIT FPGA compilation flow: binary → HW synthesis → logic optimization → technology mapping → placement & routing → updated binary.)

26/55 Warp Processing Background: On-Chip CAD Xilinx ISE (manually performed): 60 MB of memory, 9.1 s tool time. On-chip CAD: 3.6 MB and 0.2 s – roughly a 46x improvement – and only 1.4 s even on a 75 MHz ARM7, at the cost of about a 30% performance penalty in the resulting circuits. (Flow stages: RT synthesis, logic optimization, technology mapping, placement, routing.)

27/55 Warp Processing: Initial Results – Embedded Applications Average speedup of 6.3x, achieved completely transparently; also, energy savings of 66%.

28/55 Outline: Introduction; Warp Processing Overview; Enabling Technology – Binary Synthesis (key techniques for synthesis from binaries, decompilation); Current and Future Directions (multi-threaded warp processing, custom communication).

29/55 Binary Synthesis Warp processors perform synthesis from the software binary – “binary synthesis”. Problem: the binary has no high-level information; the compiler has lowered the C loop for (i=0; i < 128; i++) y[i] += c[i] * x[i] to instructions such as Addi r1, r0, 0; Ld r3, 256(r1); Ld r4, 512(r1); Subi r2, r1, 128; Jnz r2, -5 – no arrays, loops, etc. Synthesis needs high-level constructs; without them the synthesized hardware can suffer a >10x slowdown, even though hardware can otherwise be 10x to 100x faster than the processor. Can we recover high-level information for synthesis and make binary synthesis (and warp processing) competitive with high-level synthesis?

30/55 Decompilation We realized decompilation recovers high-level information, but it is generally used for binary translation or source-code recovery and may not be suitable for synthesis. We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01] (DisC, dcc, Boomerang, Mocha, SourceAgain), determined the relevant techniques, and adapted them for synthesis.

31/55 Decompilation – Control/Data Flow Graph Recovery The first step recovers the control/data flow graph (CDFG), the format used by synthesis. This is difficult because indirect jumps prevent purely static control-flow analysis, but heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]. Example: the original C code long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } compiles to the assembly Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4, from which CDFG creation recovers reg3 := 0; reg4 := 0; loop: reg1 := reg3 << 1; reg5 := reg2 + reg1; reg6 := mem[reg5 + 0]; reg4 := reg4 + reg6; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.

32/55 Decompilation – Data Flow Analysis Data flow analysis was originally used to remove temporary registers; applied alone, the synthesized hardware still carried a 130% area overhead, so new techniques are needed for binary synthesis. For the example, data flow analysis collapses the loop body to reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.

33/55 Decompilation – Data Flow Analysis Two new techniques: (1) Strength reduction of compare-with-zero instructions – the pair Sub reg3, reg4, reg5; Bz reg3, -5 would synthesize a 32-bit subtractor whose result feeds a reg3 = 0 test, which is not needed and wastes area; the optimized DFG compares reg4 = reg5 directly to drive the branch. (2) Operator size reduction – for Lb reg4, 0(reg1); Mvi reg5, 16; Add reg3, reg4, reg5, the byte load and the 5-bit constant mean only an 8-bit adder is needed instead of a 32-bit one. Together these reduce the area overhead to 10%.
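The operator-size reduction step can be illustrated with a small range calculation (an assumed simplification, not the exact warp analysis): propagate each operand's maximum value forward and size the adder to the result.

#include <stdint.h>
#include <stdio.h>

/* Number of bits needed to represent an unsigned upper bound. */
static int bits_needed(uint64_t max_value) {
    int bits = 1;
    while (max_value >>= 1)
        bits++;
    return bits;
}

int main(void) {
    /* From the slide's example: Lb loads at most 255 and the constant
     * is 16, so the sum never exceeds 271 and fits in 9 bits; an 8-bit
     * adder (with carry out) suffices instead of a 32-bit one. */
    uint64_t lb_max = 255;   /* Lb reg4, 0(reg1) */
    uint64_t imm    = 16;    /* Mvi reg5, 16     */
    printf("result fits in %d bits (vs. 32 in the binary)\n",
           bits_needed(lb_max + imm));
    return 0;
}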

34/55 Decompilation – Function Recovery Recovers parameters and return values via def-use analysis of the prologue/epilogue; 100% success rate. For the example, function recovery yields: long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + (reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; }

35/55 Decompilation – Control Structure Recovery Recovers loops and if statements using interval analysis techniques [Cifuentes 94]; 100% success rate. For the example, control structure recovery yields: long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; }

36/55 Decompilation – Array Recovery Detects linear memory access patterns and row-major ordering calculations; ~95% success rate [Stitt, Guo, Najjar, Vahid 05][Cifuentes 00]. For the example, array recovery yields: long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; }
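A simplified (assumed) version of the pattern check behind array recovery: if the addresses a load touches across iterations fit addr = base + i*stride, the access can be rewritten as an array reference such as array[reg3].

#include <stdint.h>
#include <stdio.h>

/* Return 1 if the observed addresses follow base + i*stride and report
 * the recovered base and stride. */
int recover_linear_array(const uint32_t *addrs, int n, uint32_t *base, int32_t *stride) {
    if (n < 2)
        return 0;
    int32_t s = (int32_t)(addrs[1] - addrs[0]);
    for (int i = 2; i < n; i++)
        if ((int32_t)(addrs[i] - addrs[i - 1]) != s)
            return 0;
    *base = addrs[0];
    *stride = s;
    return 1;
}

int main(void) {
    /* Addresses touched by Ld reg6, 0(reg5) for reg3 = 0..4 with 2-byte
     * (short) elements: a linear pattern, so an array is recovered. */
    uint32_t addrs[] = { 0x2000, 0x2002, 0x2004, 0x2006, 0x2008 };
    uint32_t base;
    int32_t stride;
    if (recover_linear_array(addrs, 5, &base, &stride))
        printf("array base 0x%x, element stride %d bytes\n", (unsigned)base, (int)stride);
    return 0;
}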

37/55 Comparison of Decompiled Code and Original Code The decompiled code is almost identical to the original code – the only difference is variable names – so binary synthesis can be competitive with high-level synthesis. Original: long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } Decompiled: long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; }

38/55 Binary Synthesis Tool Flow Initially, high-level source is compiled and linked (with libraries/object code) to form a binary. Profiling identifies critical regions; decompilation recovers the high-level information needed for synthesis; hw/sw estimation and partitioning choose what goes to hardware; synthesis produces hardware netlists and a bitstream for the FPGA; and a binary updater modifies the binary to use the synthesized hardware on the uP. The tool flow is roughly 30,000 lines of C code.

39/55 Binary Synthesis is Competitive with High-Level Synthesis Binary speedup: 8x; high-level speedup: 8.2x – high-level synthesis is only 2.5% better, a small difference. Commercial binary-synthesis products are beginning to appear (Critical Blue, Binachip).

40/55 Binary Synthesis with Software Compiler Optimizations But those binaries were generated with few optimizations, and optimizations intended for software may hurt hardware: the binary is optimized for software execution, so hardware synthesized from an optimized binary may be inefficient. New decompilation techniques are needed.

41/55 Loop Rerolling Problem: loop unrolling in the binary may cause inefficient hardware – longer synthesis times (the heuristics are super-linear, so unrolling a loop 100 times makes synthesis take far more than 100 times longer) and larger area requirements; unrolling chosen by the compiler is unlikely to match the unrolling synthesis would choose, and the loop structure is needed for advanced synthesis techniques. Solution: we introduce loop rerolling to undo loop unrolling. (Chart: synthesis execution times; example of the non-unrolled loop body Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5 versus the same body copied out several times in the unrolled loop.)

42/55 Loop Rerolling – Identifying Unrolled Loops Idea: identify consecutively repeating instruction sequences. Original C code: x = x + 1; for (i=0; i < 2; i++) a[i]=b[i]+1; y=x;. The compiler unrolls this to x = x + 1; a[0] = b[0]+1; a[1] = b[1]+1; y = x;, which appears in the binary as Add r3, r3, 1; Ld r0, b(0); Add r1, r0, 1; St a(0), r1; Ld r0, b(1); Add r1, r0, 1; St a(1), r1; Mov r4, r3. Mapping each instruction to a letter gives the string BABCABCD; a suffix tree [Ukkonen 95] finds the consecutive repeating substring abc (Ld, Add, St) occurring twice – the two unrolled iterations.
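The slides find the repeating body with a suffix tree [Ukkonen 95]; the sketch below (a deliberate simplification) uses a quadratic scan over the instruction string to find the longest substring repeated back to back, which is enough to reroll the BABCABCD example.

#include <stdio.h>
#include <string.h>

/* Find the longest substring that is immediately repeated (a candidate
 * unrolled loop body occurring in consecutive iterations). */
void find_unrolled_body(const char *s, int *best_start, int *best_len) {
    int n = (int)strlen(s);
    *best_start = -1;
    *best_len = 0;
    for (int len = 1; len <= n / 2; len++)
        for (int start = 0; start + 2 * len <= n; start++)
            if (memcmp(s + start, s + start + len, (size_t)len) == 0 && len > *best_len) {
                *best_start = start;
                *best_len = len;
            }
}

int main(void) {
    /* B A B C A B C D: Add, Ld, Add, St, Ld, Add, St, Mov from the slide. */
    const char *code = "BABCABCD";
    int start, len;
    find_unrolled_body(code, &start, &len);
    printf("unrolled body \"%.*s\" at position %d, repeated twice\n",
           len, code + start, start);
    return 0;
}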

43/55 Loop Rerolling Steps: 1) unrolled loop identification, determining the relationship of the constants across iterations; 2) replacing the constants with an induction-variable expression, giving the rerolled binary Add r3, r3, 1; i=0; loop: Ld r0, b(i); Add r1, r0, 1; St a(i), r1; Bne i, 2, loop; Mov r4, r3; 3) decompiling the rerolled code to reg3 = reg3 + 1; for (i=0; i < 2; i++) array1[i]=array2[i]+1; reg4=reg3; – which matches the original x = x + 1; for (i=0; i < 2; i++) a[i]=b[i]+1; y=x;. Average speedup of 1.6x.

44/55 Strength Promotion Problem: compiler strength reduction (replacing multiplications by constants with shifts and adds) may cause inefficient hardware, although some of the strength reduction was beneficial. Strength promotion identifies strength-reduced subgraphs in the DFG and replaces each with the equivalent multiplication (in the example, shift-and-add chains over B[i] become B[i]*10, B[i]*18, B[i]*34, and B[i]*66 feeding A[i]); synthesis can then reapply strength reduction where that yields the optimal DFG, so the synthesis tool – not the software compiler – decides. Average speedup of 1.5x.
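Strength promotion can be sketched (a simplified assumption, not the exact algorithm) as folding a chain of shifts of the same operand back into a single multiply by an accumulated constant, which synthesis may later re-reduce if that gives a better circuit.

#include <stdint.h>
#include <stdio.h>

/* Given a sum of (x << shifts[i]) terms over the same operand x,
 * return the constant k such that the chain equals x * k. */
static uint32_t promoted_constant(const int *shifts, int n) {
    uint32_t k = 0;
    for (int i = 0; i < n; i++)
        k += 1u << shifts[i];   /* x << s contributes 2^s to k */
    return k;
}

int main(void) {
    /* (x << 1) + (x << 3) is the strength-reduced form of x * 10. */
    int shifts[] = { 1, 3 };
    printf("promotes to x * %u\n", (unsigned)promoted_constant(shifts, 2));
    return 0;
}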

45/55 Multiple ISA/Optimization Results What about aggressive software compiler optimizations – could they obscure the binary and make decompilation impossible? What about different instruction sets – could their side effects degrade hardware performance? Results: speedups were similar on MIPS for –O1 and –O3, similar on ARM for –O1 and –O3, and similar between ARM and MIPS (the complex instructions of ARM didn’t hurt synthesis). MicroBlaze speedups were much larger because MicroBlaze is a slower microprocessor and the –O3 optimizations were very beneficial to hardware.

46/55 High-level vs. Binary Synthesis: Proprietary H.264 Decoder High-level synthesis vs. binary synthesis, in collaboration with Freescale Semiconductor. The H.264 decoder (MPEG-4 Part 10, Advanced Video Coding, AVC) produces streams 3x smaller than MPEG-2 with better quality.

47/55 High-level vs. Binary Synthesis: Proprietary H.264 Decoder Binary synthesis was competitive with high-level synthesis: high-level speedup 6.56x, binary speedup 6.55x.

48/55 Outline: Introduction; Warp Processing Overview; Enabling Technology – Binary Synthesis (key techniques for synthesis from binaries, decompilation); Current and Future Directions (multi-threaded warp processing, custom communication).

49/55 Thread Warping – Overview Architectural trend: include more cores on chip; result: more multi-threaded applications. Example: function a( ) executes for (i=0; i < 10; i++) createThread( b );. With two processors the OS can only schedule 2 of the threads, so the remaining 8 are placed in the thread queue. The warp tools create custom accelerators for b( ) on the FPGA, and the OS schedules 4 more threads onto those accelerators – 3x more thread parallelism.
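The thread-creation pattern on the slide can be written as ordinary POSIX threads; this is just the software side (with b( ) as a plain function), since the accelerator mapping the warp tools would perform is not something portable C can show.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 10

/* Stand-in for the slide's b( ); on a warp platform, some instances
 * would run on synthesized accelerators instead of processor cores. */
static void *b(void *arg) {
    printf("thread %ld running b()\n", (long)arg);
    return NULL;
}

/* Mirrors the slide's a( ): create ten b( ) threads and wait for them. */
int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, b, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}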

50/55 Thread Warping – Overview The profiler then detects the performance-critical loop within b( ), and the warp tools create larger/faster accelerators for it – potentially >100x speedup.

51/55 Thread Warping – Results Comparison of thread warping (TW) and multi-core: simulated multi-cores ranging from 4 to 64 cores versus thread warping with 4 cores + FPGA. Thread warping is 120x faster than the 4-uP (ARM) system.

52/55 Warp Processing – Custom Communication A NoC (network-on-chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]. Problem: the best topology is application dependent – e.g., App1 may perform best on a bus while App2 performs best on a mesh.

53/55 Warp Processing – Custom Communication Warp processing can dynamically choose the topology in the FPGA – a 2x to 100x improvement. Collaboration with Rakesh Kumar, University of Illinois at Urbana-Champaign (“Amoebic Computing”).

54/55 Summary Warp processing invisibly achieves >100x speedups: any language and any compiler produce a standard binary, and the developer is unaware of the FPGA and synthesis. On-chip profiling, decompilation, binary synthesis, and JIT FPGA compilation make this possible, while expandable logic lets performance scale with the amount of FPGA present.

55/55 References Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent pending. 1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002. 2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3. 3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES). 4. Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC). 5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005. 6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. 7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005. 8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003. Supported by NSF, SRC, Intel, IBM, Xilinx.