Conjoining Soft-Core FPGA Processors: David Sheldon, Rakesh Kumar, Frank Vahid, Dean Tullsen, Roman Lysecky

Presentation transcript:

Conjoining Soft-Core FPGA Processors
David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a, *), Dean Tullsen (b), Roman Lysecky (c)
(a) Department of Computer Science and Engineering, University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

David Sheldon, UC Riverside, slide 2 of 22: FPGA Soft Core Processors. A soft-core processor is an HDL description. Its implementation is flexible (FPGA or ASIC) and technology independent. [Figure: one HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC.]

Slide 3 of 22: FPGA Soft Core Processors. Soft-core processors can have configurable options: datapath units, cache, and bus architecture. Current commercial FPGA soft-core processors: Xilinx MicroBlaze, Altera Nios. [Figure: an FPGA containing a μP with optional cache, FPU, and MAC.]

Slide 4 of 22: Conjoinment Overview. Application 1 and Application 2 each run on a base microprocessor. One option is to add the necessary units to both processors. The other is "conjoining": conjoin the FPU unit. [Figure: two base microprocessors, each with its own FPU, next to two base microprocessors attached to one conjoined FPU unit.]

Slide 5 of 22: Conjoinment Background. Conjoinment was proposed for multicore desktop processing (Kumar 2004). It reduces size with reasonable performance overhead, e.g., cache conjoinment overhead: 1%-13%. [Figure: ICache sharing and DCache sharing.]

Slide 6 of 22: Outline. Conjoinment for soft-core FPGA processors: area savings and performance overhead. Tuning heuristic for two configurable soft-cores with conjoin option. [Figure: size versus performance.]

Slide 7 of 22: Area Savings. Significant potential area savings. Limitations: the estimate does not consider multiplexing costs, due to the absence of FPGA synthesis tools supporting conjoinment. But the good potential justifies further investigation. [Figure: base MicroBlaze with Multiplier, Barrel Shifter, Divider, and FPU. Table of unit size: Multiplier %, Barrel Shifter 4%, Divider 23%, FPU 32%.]

Slide 9 of 22: Performance Overhead. No simulator exists for conjoined processors, so we developed our own trace-based conjoined processor simulator. The simulation uses pessimistic performance assumptions; Kumar's techniques can improve on them. The simulator outputs contention information. Final cycles can be compared to the unconjoined case to determine performance overhead. [Figure: app1 and app2 (brev, bitmnp) run through the Xilinx simulator to produce trace1 and trace2, which feed the conjoined simulator; stalls are labeled as access stalls and contention stalls.]
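The trace-based approach on this slide can be illustrated with a minimal sketch. It is not the authors' simulator: the two-operation trace format (`"alu"`, `"fpu"`), the non-pipelined shared unit, and the `unit_latency` and `access_delay` values are all assumptions chosen only to show how access stalls and contention stalls might be counted and how the final cycles compare to the unconjoined case.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    cycles: list             # final cycle count per processor
    access_stalls: list      # cycles lost to the extra access delay
    contention_stalls: list  # cycles spent waiting for the shared unit


def simulate_conjoined(traces, unit_latency=4, access_delay=1):
    """Replay per-processor traces against one shared (conjoined) unit.

    Each trace is a list of ops: "alu" takes 1 cycle, "fpu" uses the shared
    unit. Trace format and latencies are hypothetical.
    """
    n = len(traces)
    time = [0] * n        # current cycle of each processor
    pc = [0] * n          # next trace index of each processor
    access = [0] * n
    contention = [0] * n
    unit_free_at = 0      # cycle at which the shared unit becomes idle

    while any(pc[p] < len(traces[p]) for p in range(n)):
        # Advance the processor that is furthest behind (ties: lowest index),
        # so requests reach the shared unit in time order.
        p = min((q for q in range(n) if pc[q] < len(traces[q])),
                key=lambda q: (time[q], q))
        op = traces[p][pc[p]]
        pc[p] += 1
        if op == "fpu":
            start = time[p] + access_delay
            access[p] += access_delay
            if unit_free_at > start:
                contention[p] += unit_free_at - start
                start = unit_free_at
            unit_free_at = start + unit_latency
            time[p] = unit_free_at
        else:
            time[p] += 1
    return SimResult(time, access, contention)


def unconjoined_cycles(trace, unit_latency=4):
    """Cycle count when the processor owns a private unit (no stalls)."""
    return sum(unit_latency if op == "fpu" else 1 for op in trace)


def overhead(conjoined, unconjoined):
    """Relative performance overhead of conjoinment."""
    return (conjoined - unconjoined) / unconjoined


result = simulate_conjoined([["fpu", "alu"], ["fpu", "alu"]])
print(result)
print(overhead(result.cycles[0], unconjoined_cycles(["fpu", "alu"])))
```

In this toy run both processors request the unit at the same time, so the second one waits for the first; real traces with sparser unit usage would show far less contention.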

Slide 10 of 22: Performance Overhead. Speedup: application time on optimally configured processor / avg. app. time on base processor. We compared the configuration with conjoinment versus without. Performance overhead is usually small and averaged just 4.2%. The overhead is caused by access delays and contention for the hardware units. [Figure labels: brev, bitmnp; 17%, 2.4%.]

Slide 12 of 22: Tuning Heuristic. There are 5 choices per unit, e.g., for the FPU: no unit, 1 only, 2 only, 1 & 2, and conjoined. With 4 units this gives 5^4 = 625 possible configurations. Simulation takes ~30 minutes per configuration, so a search heuristic is needed to tune. [Figure: Base MicroBlaze 1 and Base MicroBlaze 2 with Multiplier, Barrel Shifter, Divider, and FPU options: no FPU, FPU 1, FPU 2, FPU conjoined.]
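The size of this search space can be checked with a short enumeration. The unit and choice labels below are illustrative names for the options listed on the slide, and the 30-minute figure is the slide's approximate simulation time.

```python
from itertools import product

# Illustrative labels for the four units and the five choices per unit.
UNITS = ["multiplier", "barrel_shifter", "divider", "fpu"]
CHOICES = ["none", "proc1_only", "proc2_only", "both", "conjoined"]


def all_configurations():
    """Enumerate every assignment of one choice to each unit."""
    return [dict(zip(UNITS, combo))
            for combo in product(CHOICES, repeat=len(UNITS))]


configs = all_configurations()
minutes_per_config = 30  # approximate simulation time from the slide
total_hours = len(configs) * minutes_per_config / 60

print(len(configs))   # 625
print(total_hours)    # 312.5
```

At roughly 30 minutes each, exhaustively simulating all 625 configurations would take about 312.5 hours, or around 13 days, which is what motivates a search heuristic.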

Slide 13 of 22: Map to 0-1 Knapsack Problem. Creating the model. [Figure: synthesis and application runs of the base MicroBlaze and of the MicroBlaze with each unit (Multiplier, Divider, Barrel Shifter, FPU) give a size increment, a perf increment, and a perf/size value per unit.]

Slide 14 of 22: Map to 0-1 Knapsack Problem. First consider tuning without conjoinment. The problem of instantiating units within a limited FPGA size can be mapped to the 0-1 knapsack problem: add items, each with a weight and a benefit, to a weight-constrained knapsack such that profit is maximized. Note: the mapping is inexact, because weights and benefits are not strictly additive. [Figure: items MUL 1, FPU 1, MUL 2, FPU 2 with weights and benefits; the knapsack is the available FPGA area next to the two base MicroBlaze processors.]
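For reference, the textbook 0-1 knapsack formulation named on this slide can be solved with a standard dynamic program. The sketch below is that generic algorithm, not the authors' tool; the item names follow the slide's figure, while the integer weights (size), benefits (performance), and capacity are made-up numbers for illustration.

```python
def knapsack_01(items, capacity):
    """Classic 0-1 knapsack by dynamic programming.

    items: list of (name, weight, benefit) with integer weights.
    Returns (best total benefit, names of chosen items).
    """
    n = len(items)
    # table[i][cap] = best benefit using the first i items within capacity cap
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, weight, benefit) in enumerate(items, start=1):
        for cap in range(capacity + 1):
            table[i][cap] = table[i - 1][cap]            # skip item i
            if weight <= cap:                             # or take item i
                table[i][cap] = max(table[i][cap],
                                    table[i - 1][cap - weight] + benefit)

    # Walk back through the table to recover which items were taken.
    chosen = []
    cap = capacity
    for i in range(n, 0, -1):
        if table[i][cap] != table[i - 1][cap]:
            name, weight, _ = items[i - 1]
            chosen.append(name)
            cap -= weight
    return table[n][capacity], list(reversed(chosen))


# Hypothetical sizes and performance benefits, not measured data.
items = [("MUL 1", 3, 10), ("FPU 1", 8, 30), ("MUL 2", 3, 12), ("FPU 2", 8, 5)]
best, chosen = knapsack_01(items, capacity=14)
print(best, chosen)
```

The dynamic program assumes weights and benefits add up exactly, which is the assumption the slide flags as only approximately true for synthesized hardware units.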

Slide 15 of 22: Disjunctively Constrained Knapsack. Problem: if a conjoined unit is included, the corresponding standalone unit cannot also be included. Solution: map to the disjunctively constrained 0-1 knapsack problem (Yamada T., "Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem", 2002), which prohibits specific item pairs from being in the knapsack. ILP solution; running time is pseudo-polynomial. [Figure: items MUL 1, FPU 1, MUL 2, FPU 2, MUL C, FPU C; the knapsack is the available FPGA area next to the two base MicroBlaze processors.]
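The pairwise exclusion can be sketched as follows. This is not the algorithm from the cited paper or the authors' heuristic: it shows a brute-force exact search (practical only for a handful of items) next to a simple greedy benefit/weight heuristic, both assumed here for illustration. All weights, benefits, and the capacity are hypothetical, and each conjoined item's benefit is simplified to a single combined number for both processors.

```python
from itertools import combinations


def feasible(subset, weights, conflicts, capacity):
    """A subset is feasible if it fits and contains no forbidden pair."""
    if sum(weights[i] for i in subset) > capacity:
        return False
    return not any(a in subset and b in subset for a, b in conflicts)


def solve_exhaustive(weights, benefits, conflicts, capacity):
    """Exact answer by trying every subset (fine for a dozen items)."""
    names = list(weights)
    best, best_set = 0, frozenset()
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            subset = frozenset(combo)
            if feasible(subset, weights, conflicts, capacity):
                value = sum(benefits[i] for i in subset)
                if value > best:
                    best, best_set = value, subset
    return best, best_set


def solve_greedy(weights, benefits, conflicts, capacity):
    """Heuristic: add items by descending benefit/weight while feasible."""
    chosen = set()
    order = sorted(weights, key=lambda i: benefits[i] / weights[i], reverse=True)
    for item in order:
        if feasible(chosen | {item}, weights, conflicts, capacity):
            chosen.add(item)
    return sum(benefits[i] for i in chosen), frozenset(chosen)


# Hypothetical numbers: "C" items are conjoined units shared by both cores.
weights = {"MUL 1": 3, "MUL 2": 3, "MUL C": 3, "FPU 1": 8, "FPU 2": 8, "FPU C": 8}
benefits = {"MUL 1": 10, "MUL 2": 12, "MUL C": 20, "FPU 1": 30, "FPU 2": 20, "FPU C": 46}
conflicts = [("MUL C", "MUL 1"), ("MUL C", "MUL 2"),
             ("FPU C", "FPU 1"), ("FPU C", "FPU 2")]

print(solve_exhaustive(weights, benefits, conflicts, capacity=12))
print(solve_greedy(weights, benefits, conflicts, capacity=12))
```

On this small instance the greedy pass and the exhaustive search agree, but a greedy heuristic carries no optimality guarantee in general.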

Slide 16 of 22: Disjunctively Constrained Knapsack. Conjoined units provide benefits to both processors. The conjoined benefit shows a small decrease from the benefit of the unconjoined unit. [Figure: items MUL 1, FPU 1, MUL 2, FPU 2 with weights and benefits, and conjoined items MUL C, FPU C with weights, benefits 1, and benefits 2; the knapsack is the available FPGA area next to the two base MicroBlaze processors.]

Slide 17 of 22: Disjunctively Constrained Knapsack, Running Time. Modeling: 5 synthesis runs for each processor, and at most 4 runs of the conjoined simulator. The disjunctively constrained 0-1 knapsack is an NP-complete problem, solved with a heuristic. The heuristic takes < 1 min.

Slide 18 of 22: Results. Data was gathered for the Xilinx MicroBlaze soft-core processor. 10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk. We obtained results for all possible pairwise conjoinments. We only show conjoinment data when both applications use the unit, to avoid making conjoinment appear better than it is.

Slide 19 of 22: Results. The knapsack approach finds near-optimal results in most cases.

Slide 20 of 22: Results. The knapsack heuristic finds near-optimal results in most cases (versus exhaustive search with conjoinment). It runs in seconds. One example had sub-optimal results (2.9 times slower). Performance overhead due to conjoinment is just a few percent on average.

Slide 21 of 22: Results. On average, the knapsack approach yields the same size as the exhaustive search with conjoinment. Average size savings of 16%.

Slide 22 of 22: Conclusions. Conjoining two soft-core FPGA processors reduces average size by 16%. Performance overhead is just a few percent in most cases. The disjunctively constrained 0-1 knapsack approach finds near-optimal results in most cases, but could be improved for some examples. Future work: consider multiplexing size and delay overheads, and apply Kumar's advanced conjoining techniques to reduce overheads.