
Architecture and Compilation for Reconfigurable Processors Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang Computer Science Department UCLA Nov 22, 2004

Outline
• Motivation
• Application-specific instruction set compilation
• Register file data bandwidth problem
• Architecture extension – shadow registers
• Shadow register binding
• Conclusions

Reconfigurable Processor Platform
• Reconfigurable processor (RP) core + programmable fabric
  – RP core supports: basic instruction set + customized instructions
• Programmable fabric implements the customized instructions
• Either runtime reconfigurable or pre-synthesized
• Example: Nios / Nios II from Altera
  – Stratix version supported by the Nios 3.0 system
  – 5 extended instruction formats
  – Up to 2048 instructions for each format
[Figure: reconfigurable processor core and programmable fabric connected by the CPU bus]

Motivational Example
Basic instruction set (*: 2 clock cycles, +: 1 clock cycle):
  t1 = a * b;
  t2 = b * 2;
  t3 = c * 5;
  t4 = t1 + t2;
  t5 = t2 + t3;
  t6 = t5 + t4;
  Execution time: 9 clock cycles
Extended instruction set I ∪ {extop1, extop2}:
  t1 = extop1(a, b, 2);
  t2 = extop2(b, c, 2, 5);
  t3 = t1 + t2;
  Execution time: 5 clock cycles
Speedup: 1.8
[Figure: data flow graph of the three multiplies and three adds, with the cones covered by extop1 and extop2]
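A quick back-of-the-envelope check of the numbers on this slide, taking the stated latencies (multiply = 2 cycles, add = 1 cycle) and the reported 5-cycle extended schedule as given:

```python
# Software schedule: three multiplies (t1..t3) and three adds (t4..t6),
# using the slide's latencies of 2 cycles per multiply and 1 per add.
mul_cycles, add_cycles = 2, 1
sw_cycles = 3 * mul_cycles + 3 * add_cycles   # 9 cycles

# Extended schedule as reported on the slide: extop1, extop2, one final add.
hw_cycles = 5

speedup = sw_cycles / hw_cycles               # 9 / 5 = 1.8
print(speedup)
```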

Problem Statement
Given:
• Application program in CDFG G(V, E)
• A processor with basic instruction set I
• Pattern constraints:
  I. Number of inputs no more than N_in
  II. 1 output
  III. Total area no more than A
Objective:
• Generate a pattern library P
• Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

Proposed ASIP Compilation Flow
• Extended instruction candidate generation
  – Satisfying the I/O constraints
• Extended instruction selection
  – Select a subset to maximize the potential speedup while satisfying the resource constraint
• Code generation
  – Graph covering
  – Minimize the total execution time
[Figure: flow from the C implementation and ASIP constraints through CDFG generation, pattern generation and selection, and application mapping to the pattern library, mapped CDFG, compilation, and simulation]

Step 1. Pattern Enumeration
• Each pattern is an N_in-feasible cone
• Cut enumeration is used to enumerate all the N_in-feasible cones [Cong et al., FPGA'99]
• Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not N_in-feasible
• 3-feasible cones of the running example:
  – n1: {a, b}
  – n2: {b, 2}
  – n3: {c, 5}
  – n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
[Figure: data flow graph with internal nodes n1–n6]
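The merge-and-discard idea can be sketched in a few lines of Python, assuming the DFG is given as a fanin map (node → list of fan-in nodes, with primary inputs having empty fanins). The encoding and helper names here are illustrative, not from the talk:

```python
from itertools import product

def topo_order(fanins):
    """Nodes of the DAG in topological order (fan-ins before fan-outs)."""
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for f in fanins[n]:
                visit(f)
            order.append(n)
    for n in fanins:
        visit(n)
    return order

def enumerate_cuts(fanins, n_in):
    """For each node, all N_in-feasible cuts (input sets of feasible cones)."""
    cuts = {}
    for node in topo_order(fanins):
        node_cuts = {frozenset([node])}            # trivial cut
        if fanins[node]:
            # Merge one cut per fan-in; discard merges exceeding n_in inputs.
            for combo in product(*(cuts[i] for i in fanins[node])):
                merged = frozenset().union(*combo)
                if len(merged) <= n_in:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts
```

On the slide's example (n1 = a*b, n2 = b*2, n4 = n1 + n2), calling `enumerate_cuts` with N_in = 3 reproduces the four nontrivial cuts listed for n4, plus the trivial cut {n4}.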

Step 2. Pattern Selection
• Basic idea: simultaneously consider speedup, occurrence frequency and area
• Speedup
  – Tsw(p) = total execution time with basic instructions
  – Thw(p) = length of the critical path of scheduled p
  – Speedup(p) = Tsw(p) / Thw(p)
  – Example: a multiply feeding an add has Tsw = 3, Thw = 2, Speedup = 1.5
• Occurrence
  – Some pattern instances may be isomorphic
  – Graph isomorphism test [Nauty package]; for small subgraphs the test is very fast
• Gain(p) = Speedup(p) × Occurrence(p)
• Selection under the area constraint can be formulated as a 0-1 knapsack problem
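The selection step maps directly onto the textbook 0-1 knapsack DP, with Gain(p) as the value and area as the weight. This sketch assumes integer areas; the pattern names and (gain, area) tuples used to exercise it are invented for illustration:

```python
def select_patterns(patterns, area_budget):
    """patterns: list of (name, gain, area) with integer areas.
    Returns (best total gain, chosen names) within the area budget."""
    dp = [(0.0, [])] * (area_budget + 1)    # dp[a]: best using area <= a
    for name, gain, area in patterns:
        # Iterate areas downward so each pattern is selected at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = dp[a - area][0] + gain
            if candidate > dp[a][0]:
                dp[a] = (candidate, dp[a - area][1] + [name])
    return dp[area_budget]
```

For example, with hypothetical candidates p1 (gain 6, area 3), p2 (gain 4, area 4) and p3 (gain 6, area 2) under an area budget of 6, the DP picks p1 and p3 for a total gain of 12.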

Step 3. Application Mapping
• Assume execution on an in-order, single-issue processor
• Cover each node in G(V, E) with the extended instruction set so as to minimize the execution time
  – Trivial pattern: software execution time
  – Nontrivial pattern: hardware execution time
  – Total execution time = sum of the execution times of the pattern instances after mapping
• Theorem: the application mapping problem is equivalent to the library-based minimum-area technology mapping problem
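Given the stated equivalence to library-based technology mapping, the covering can be sketched as the usual dynamic program over a topological order: each node picks the matched cut that minimizes its own cost plus the cost of producing the cut's inputs. The fanin map and `cost_of_cut` callback are an illustrative encoding, and, like tree-based technology mapping, this DP is exact on trees but only a heuristic on DAGs with reconvergent fanout:

```python
def map_cdfg(topo_nodes, fanins, cuts, cost_of_cut):
    """Minimum-cost cover of a DFG. topo_nodes: nodes in topological order;
    cuts[n]: candidate input sets for implementing n (a trivial pattern is
    the node's own operation, a nontrivial one an extended instruction)."""
    best = {}
    for node in topo_nodes:
        if not fanins[node]:
            best[node] = 0                 # primary inputs cost nothing
        else:
            best[node] = min(cost_of_cut(node, cut) +
                             sum(best[i] for i in cut)
                             for cut in cuts[node])
    return best
```

On a toy graph m = a*b, s = m + c with a 3-input custom instruction covering both operations in 2 cycles, the cover absorbs the 2-cycle multiply and reaches s in 2 cycles instead of 3.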

Speedup and Resource Overhead on Nios
[Table: estimated vs. measured Nios speedup and resource overhead (logic elements, memory, DSP blocks) for fft_br, iir, fir, pr, dir and mcm, with the number of extended instructions per benchmark and averages over all six]

Simulation Environment
• SimpleScalar v3.0
• Benchmarks from the MediaBench suite
• Machine configuration
  – Single-issue in-order processor (ARM-like)
  – DL1: 8KB, 4-way, 1 cycle
  – IL1: 8KB, direct-mapped, 1 cycle
  – Unified L2: 256KB, 4-way, 8 cycles
  – Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult
  – Reconfigurable unit: latency equal to the critical path latency of the collapsed instructions

Pattern Distribution
Most of the patterns have fewer than 7 nodes inside.

Ideal Speedup under Different Input Size Constraints

Outline
• Motivation
• Application-specific instruction set compilation
• Register file data bandwidth problem
• Architecture extension – shadow registers
• Shadow register binding
• Conclusions

Register File Bandwidth Problem
• Most of the speedup comes from clusters with more than two inputs
• Embedded processors provide a 2-port register file
• Extra cycles are needed to transfer data for extended instructions with more than 2 inputs
• Speedup drops due to the communication overhead

Speedup Drop with Different Input Constraints
• Each move operation takes one cycle
• 46% speedup drop on average

Outline
• Motivation
• Application-specific instruction set compilation
• Register file data bandwidth problem
• Architecture extension – shadow registers
• Shadow register binding
• Conclusions

Architecture Extensions
• Existing solutions
  – Dedicated data link
    · Avoids potential resource contention through the bus
    · Needs extra cycles for communication
    · Employed in MicroBlaze from Xilinx
  – Multiport register file
    · Low utilization when executing basic instructions
    · Area and power grow cubically
  – Register file replication
    · Predetermined one-to-one correspondence
    · Resource waste in terms of area and power
    · Limits compiler optimization

Our Approach – Shadow Registers
• Core registers are augmented by an extra set of shadow registers
  – Conditionally written
  – Used only by the custom logic

Shadow Registers
• Controlling the shadow registers
  [Table: instruction encoding with subword and shadow-register ID fields selecting between "forward the result" and "skip"]
• Advantages and limitations
  – Cost-efficient for a small number of shadow registers
  – Only a few control signals need to be added
  – Opportunity for compiler optimization
  – Requires extra control bits

Outline
• Motivation
• Application-specific instruction set compilation
• Register file data bandwidth problem
• Architecture extension – shadow registers
• Shadow register binding
• Conclusions

Internal Representation
• 2-level CDFG representation
  – 1st level: control flow graph
  – 2nd level: data flow graph
• Computation node: latency & scheduled time slot
• Data edge: lifetime
• Variable: lifetime
Example:
  i1 = …;
  i2 = ext1(…, i1, …);
  i3 = …;
  i4 = ext2(…, i1, …);
  i5 = ext3(…, i3, …);
  i6 = ext4(…, i3, …);
  Lifetime of e1 = [2, 2]; lifetime of e2 = [2, 4]; lifetime of i1 = [2, 4]
[Figure: scheduled data flow graph with data edges e1–e4]

Observation
• 2-port register file, 3-input extended instructions
• Without shadow registers: 4 additional moves
  i1 = …;
  i2 = ext1(…, i1, …);
  i3 = …;
  i4 = ext2(…, i1, …);
  i5 = ext3(…, i3, …);
  i6 = ext4(…, i3, …);
• Binding for 1 shadow register
  – Binding 1: either i1 or i3 in the shadow register saves 2 moves
  – Binding 2: saves 3 moves
[Figure: the same scheduled DFG annotated with the two bindings]

Register Binding
• Which operands should be bound?
  – Each input could be a candidate
  – Binding different candidates leads to different savings
  – It is unaffordable to try all the combinations

One Shadow Register Binding Problem
• Problem formulation
  – Given: a scheduled DFG and one shadow register
  – Objective: bind variables to the shadow register so as to minimize the number of moves

Algorithm for Binding One Shadow Register
• Weighted compatibility graph
  – Vertex: data edge in the DFG
  – Weight: number of moves saved if the value is kept in the shadow register
  – Edge: lifetimes don't overlap
• Theorem
  – The binding problem is equivalent to finding a maximum weighted chain in the compatibility graph
  – It can be solved optimally in O(|V'| + |E'|) time
• The algorithm extends to K shadow registers
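Since a chain in the compatibility graph is a set of pairwise non-overlapping lifetimes, the maximum weighted chain can also be computed with the classic weighted-interval-scheduling recurrence — an O(n log n) variant that avoids building the compatibility graph explicitly. The sketch treats lifetimes as closed intervals (touching endpoints conflict), and the interval data used to exercise it is illustrative rather than the slides' exact example:

```python
import bisect

def bind_one_register(candidates):
    """candidates: list of (name, start, end, saves).
    Returns (max total saves, chosen names) over pairwise
    non-overlapping lifetimes, i.e. a chain in the compatibility graph."""
    candidates = sorted(candidates, key=lambda v: v[2])   # by lifetime end
    ends = [v[2] for v in candidates]
    best = [(0, [])]        # best[i]: optimum over the first i candidates
    for i, (name, start, end, saves) in enumerate(candidates):
        # All candidates ending strictly before `start` are compatible.
        j = bisect.bisect_left(ends, start)
        take = (best[j][0] + saves, best[j][1] + [name])
        best.append(max(best[i], take, key=lambda t: t[0]))
    return best[-1]
```

For instance, with hypothetical lifetimes e1 = [1, 2] (saves 2), e2 = [2, 4] (saves 1), e3 = [3, 5] (saves 3) and e4 = [6, 7] (saves 2), the best chain keeps e1, e3 and e4 in the shadow register, saving 7 moves.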

Experimental Results (1)
Speedup under different numbers of shadow registers for 3-input extended instructions

Experimental Results (2)
Speedup under different numbers of shadow registers for 4-input extended instructions

Conclusions
• Proposed and developed a complete compilation flow
• Observed and quantitatively analyzed the register file data bandwidth problem
• Proposed a novel architecture extension and an efficient register binding algorithm

Thank You