Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.

Presentation transcript:

Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida

2/26 Introduction
Improved performance enables new applications
  Past decade - MP3 players, portable game consoles, cell phones, etc.
  Future architectures - Speech/image recognition, self-guiding cars, computational biology, etc.

3/26 Introduction
FPGAs (Field Programmable Gate Arrays) – Implement custom circuits
  Speedups of 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
  But FPGAs are not mainstream
Warp Processing Goal: Bring FPGAs into the mainstream
  Make FPGAs “invisible”
[Figure: performance of uP vs. FPGA; FPGAs are capable of large performance improvements]

4/26 Introduction – Hardware/Software Partitioning
C Code for FIR Filter:
  for (i=0; i < 128; i++) y[i] += c[i] * x[i]
  ..
  for (i=0; i < 16; i++) y[i] += c[i] * x[i]
  ..
Compiler, Processor: ~1000 cycles
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]
Designer creates custom hardware using a hardware description language (HDL)
Processor + FPGA: hardware for loop ~10 cycles
Speedup = 1000 cycles / 10 cycles = 100x
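The speedup arithmetic on this slide can be checked with a small C sketch. The function names `fir_mac` and `speedup` are hypothetical, chosen only for illustration; the loop body is the one shown for the FIR filter, and the cycle counts are whatever the caller supplies.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate loop from the slide: y[i] += c[i] * x[i].
 * fir_mac is a hypothetical name used only in this sketch. */
void fir_mac(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += c[i] * x[i];
}

/* Speedup of a region = software cycles / hardware cycles. */
double speedup(double sw_cycles, double hw_cycles)
{
    assert(hw_cycles > 0.0);   /* a zero-cycle region has no defined speedup */
    return sw_cycles / hw_cycles;
}
```

Calling `speedup(1000.0, 10.0)` reproduces the slide's 100x figure. Note that this is the speedup of the partitioned loop alone, not of the whole application.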

5/26 Introduction – High-level Synthesis
Problem: Describing circuit using HDL is time consuming/difficult
Solution: High-level synthesis
  Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use higher-level specification
  Potentially, enables synthesis for software developers
[Figure: tool flow. Labels: High-level Code, Decompilation, Hw/Sw Partitioning, Compiler, High-level Synthesis, Libraries/Object Code, Linker, Updated Binary, Bitstream, Software, Hardware, uP, FPGA]

6/26 Introduction – High-level Synthesis
Problem: Describing circuit using HDL is time consuming/difficult
Solution: High-level synthesis
  Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use higher-level specification
  Potentially, enables synthesis for software developers
Example input: for (i=0; i < 16; i++) y[i] += c[i] * x[i]
[Figure: the loop passed through Decompilation and High-level Synthesis to a multiplier circuit]

7/26 Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers
  Requires specialized language: SystemC, NapaC, HandelC, …
  Requires specialized compiler: Spark, ROCCC, CatapultC, …
  Limited commercial success
  Software developers reluctant to change tools
[Figure: non-standard software tool flow. Labels: Specialized Language, Specialized Compiler, High-level Code, Decompilation, Synthesis, Libraries/Object Code, Linker, Updated Binary, Bitstream, Software, Hardware, uP, FPGA]

8/26 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 Requirements
  Standard software tool flow
    Perform compilation before synthesis
  Hide synthesis tool
    Move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translate to hw
[Figure: standard software tool flow with compilation moved before synthesis. Labels: High-level Code, Compiler, Software Binary, Decompilation, Synthesis, Libraries/Object Code, Linker, Updated Binary, Bitstream, Software, Hardware, uP, FPGA]

9/26 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 Requirements
  Standard software tool flow
    Perform compilation before synthesis
  Hide synthesis tool
    Move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translate to hw
Warp processor looks like standard uP but invisibly synthesizes hardware
[Figure: same tool flow as slide 8, with decompilation and synthesis hidden inside the warp processor]

10/26 Warp Processing – “Invisible” Synthesis
Advantages
  Supports all languages, compilers, IDEs (e.g., C, C++, Java, Matlab; gcc, g++, javac, keil)
  Supports synthesis of assembly code
  Supports synthesis of library code
  Also, enables dynamic optimizations
Warp processor looks like standard uP but invisibly synthesizes hardware
[Figure: same tool flow as slide 8]

11/26 Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into instruction memory
Software Binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD]

12/26 Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD; software binary as on slide 11]

13/26 Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
  Profiler observes add, beq: Critical Loop Detected
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD; software binary as on slide 11]
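One way such a profiler could find critical loops is to count taken backward branches per branch target. The following is a minimal sketch under that assumption; the slide does not specify the detection mechanism, and the names `profiler_t`, `profiler_branch`, `profiler_hottest`, and the table size `PROF_SLOTS` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PROF_SLOTS 16              /* hypothetical table size */

typedef struct {
    uint32_t target[PROF_SLOTS];   /* candidate loop-head addresses */
    uint32_t count[PROF_SLOTS];    /* taken backward branches per target */
    int      used;                 /* number of occupied slots */
} profiler_t;

void profiler_init(profiler_t *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Record one taken branch from address pc to address target.
 * Assumption: only backward branches (target < pc) close a loop. */
void profiler_branch(profiler_t *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc)
        return;                    /* forward branch: ignored */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROF_SLOTS) {    /* new loop head; dropped if table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Store the most frequently reached loop head in *addr.
 * Returns 1 if a loop was seen, 0 otherwise. */
int profiler_hottest(const profiler_t *p, uint32_t *addr)
{
    assert(p != NULL && addr != NULL);
    int best = -1;
    for (int i = 0; i < p->used; i++)
        if (best < 0 || p->count[i] > p->count[best])
            best = i;
    if (best < 0)
        return 0;
    *addr = p->target[best];
    return 1;
}
```

In this sketch the address returned by `profiler_hottest` would be the region handed to the on-chip CAD tools. A real profiler would also need a replacement policy for a full table, which is omitted here.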

14/26 Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD; software binary as on slide 11]

15/26 Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts critical region into control data flow graph (CDFG)
Software Binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Decompiled CDFG:
  reg3 := 0
  reg4 := 0
  loop:
  reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD (Dynamic Part. Module, DPM)]
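To see that the decompiled form on this slide preserves the behavior of the software binary, both can be written in C and compared. This is only a sketch: `run_binary` and `run_decompiled` are hypothetical names, memory is modeled as an array indexed directly by address (the load width is not stated on the slide), and the loop-exit test follows the decompiled condition `reg3 < 10`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction-by-instruction model of the slide's software binary.
 * mem is indexed directly by address (an assumption of this sketch). */
int32_t run_binary(const int32_t *mem, int32_t reg2)
{
    int32_t reg1, reg3, reg4, reg5, reg6;
    assert(mem != NULL);
    reg3 = 0;                    /* Mov reg3, 0 */
    reg4 = 0;                    /* Mov reg4, 0 */
    do {
        reg1 = reg3 << 1;        /* Shl reg1, reg3, 1 */
        reg5 = reg2 + reg1;      /* Add reg5, reg2, reg1 */
        reg6 = mem[reg5];        /* Ld  reg6, 0(reg5) */
        reg4 = reg4 + reg6;      /* Add reg4, reg4, reg6 */
        reg3 = reg3 + 1;         /* Add reg3, reg3, 1 */
    } while (reg3 < 10);         /* loop branch, per the decompiled condition */
    return reg4;                 /* Ret reg4 */
}

/* Decompiled form: temporaries reg1, reg5, reg6 are folded into one expression. */
int32_t run_decompiled(const int32_t *mem, int32_t reg2)
{
    int32_t reg4 = 0;
    assert(mem != NULL);
    for (int32_t reg3 = 0; reg3 < 10; reg3++)
        reg4 += mem[reg2 + (reg3 << 1)];
    return reg4;
}
```

The decompiled version exposes the loop as an accumulation over ten strided memory reads, which is the structure a synthesis tool can turn into a parallel circuit.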

16/26 Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD (Dynamic Part. Module, DPM); software binary as on slide 11 and decompiled CDFG as on slide 15]

17/26 Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD (Dynamic Part. Module, DPM); adder circuit mapped onto the FPGA's CLBs and switch matrices (SM); software binary as on slide 11 and decompiled CDFG as on slide 15]

18/26 Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary:
  Mov reg3, 0
  Mov reg4, 0
  loop: // instructions that interact with FPGA
  Ret reg4
[Figure: warp processor with µP, I Mem, D$, FPGA, Profiler, On-chip CAD (Dynamic Part. Module, DPM); comparison of Software-only vs. “Warped” execution]

19/26 Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations
Develop extremely efficient on-chip CAD tools
  Requires efficient synthesis
  Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona
On-chip CAD flow: Binary, HW Synthesis, Logic Optimization, Technology Mapping, Placement & Routing, Updated Binary
JIT FPGA compilation: 46x improvement, 30% perf. penalty
[Figure: warp processor with uP, I$, D$, FPGA, Profiler, On-chip CAD]

20/26 Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x
  Achieved completely transparently
Also, energy savings of 66%

21/26 Thread Warping - Overview
Multi-core platforms → multi-threaded apps
  for (i = 0; i < 10; i++) { thread_create( f, i ); }
OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f()
OS schedules threads onto accelerators (possibly dozens), in addition to µPs
Thread warping: use one core to create accelerator for waiting threads
Very large speedups possible – parallelism at bit, arithmetic, and now thread level too
[Figure: multi-core chip with several µPs, OS, On-chip CAD, accelerator library (Acc. Lib), and FPGA holding f() accelerators; Compiler produces the Binary]
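The scheduling order described on this slide (µPs first, then accelerators, then the queue) can be illustrated with a small C sketch. It is a simplified counting model only, not the actual OS scheduler; `placement_t`, `place_threads`, and the example core and accelerator counts are hypothetical.

```c
#include <assert.h>

/* Where a batch of identical threads ends up in this simplified model. */
typedef struct {
    int on_cores;    /* threads running on microprocessors */
    int on_accels;   /* threads running on FPGA accelerators */
    int queued;      /* threads still waiting */
} placement_t;

/* Fill available cores first, then available accelerators; queue the rest. */
placement_t place_threads(int n_threads, int n_cores, int n_accels)
{
    placement_t p;
    int rest;

    assert(n_threads >= 0 && n_cores >= 0 && n_accels >= 0);
    p.on_cores  = n_threads < n_cores ? n_threads : n_cores;
    rest        = n_threads - p.on_cores;
    p.on_accels = rest < n_accels ? rest : n_accels;
    p.queued    = rest - p.on_accels;
    return p;
}
```

With the slide's ten threads and, say, four cores and no accelerators yet, six threads wait in the queue. Once the on-chip CAD tools have created accelerators for f(), calling the function again with a nonzero accelerator count moves waiting threads onto the FPGA.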

22/26 Speedup from Thread Warping
Average 130x speedup
But, FPGA uses additional area
  So we also compare to systems with 8 to 64 ARM11 uPs – FPGA size = ~36 ARM11s
11x faster than 64-core system
Simulation pessimistic, actual results likely better

23/26 Dynamic enables Custom Communication
NoC – Network on a Chip provides communication between multiple cores
Problem: Best topology is application dependent
[Figure: µPs connected by Bus and Mesh topologies, compared for App1 and App2]

24/26 Dynamic enables Custom Communication
NoC – Network on a Chip provides communication between multiple cores
Problem: Best topology is application dependent
Warp processing can dynamically choose topology
[Figure: µPs connected through FPGA-based communication; Bus and Mesh topologies compared for App1 and App2]

25/26 Summary
Warp processors
  Achieve performance advantages of FPGAs without any extra effort
“Invisible” synthesis
  Allows designers to use existing tools/languages
  Enables dynamic hardware optimization
Thread warping
  Dynamic synthesis of thread accelerators for multi-cores
Custom communication
  Warp processing can adapt communication topology to needs of application or a particular workload

26/26 References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent pending.
Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002.
Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3.
Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC).
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005.
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005.
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005.
Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003.