Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support. Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis.

Presentation transcript:

1 Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support
Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, Thessaloniki, Greece

2 Outline
- Introduction
- Target Architecture Overview
- Partial Predicated Execution Enhancement
- Virtual Opcode Enhancement
- Development Framework
- Experimental Results
- Conclusions

3 Introduction
- Characteristics of modern embedded applications
  - Diversity of algorithms
  - Rapid evolution of standards
  - High performance demands
- To amortize cost over high production volumes, embedded systems must:
  - Exhibit high levels of flexibility => fast time-to-market
  - Exhibit high levels of adaptability => increased reusability
- An appealing option => couple reconfigurable hardware (RH) to a typical processor
  - Processor => bulk of the flexibility
  - RH => adaptation to the target application
- Support by a development framework that hides RH-related issues
  - Maintain flexibility
  - Continue to target software-oriented users

4 Target Architecture
- Reconfigurable Instruction Set Processor (RISP)
- Core processor
  - 32-bit single-issue RISC
  - 5 pipeline stages
- Reconfigurable Functional Unit (RFU)
  - 1-D array of coarse-grain processing elements (PEs)
- An interface that tightly couples the RFU to the core
  - Explicit communication

5 Target Architecture - ISA
- Re=‘0’ => standard instruction set
  - Flexibility to execute any program
- Re=‘1’ => reconfigurable instruction set extensions
  - Offer the adaptation to the target application
- Three types of reconfigurable instructions
  - Complex computational operations
  - Complex addressing modes
  - Complex control flow operations
- (Figure: 32-bit instruction word format)

6 Target Architecture - RFU
- 1-D array of coarse-grain PEs
- Executes reconfigurable instructions
  - Multiple-Input-Single-Output (MISO) clusters of primitive operations
- Un-registered output
  - Chain of operations in the same clock cycle
- Registered output
  - Chain of pipelined operations
- Floating PEs => can operate in both core pipeline stages on demand
  - Better utilization of the available hardware

7 Target Architecture – Configuration
- Local configuration memory
  - Multi-context
  - No overhead to select a context
- Array of coarse-grain PEs => small number of configuration bits per instruction

8 Target Architecture – Synthesis Results
- A hardware model (VHDL) was designed
- Synthesis results with STM 0.13 um
  - Reasonable area overhead
  - No overhead to the core critical path

Configuration                            Value
Granularity                              32 bits (16x16 multiplier)
Number of Processing Elements            8
Processing Elements Functionality        ALU, Shifter, Multiplier
Configuration Contexts                   16 words of 134 bits
Local Memory Size                        8 constants of 32 bits
Number of Provided Local Operands        4

Component                                Area (mm^2)
Processor Core                           0.134
RFU Processing Layer                     0.186
RFU Interconnection Layer                0.125
RFU Configuration Layer                  0.137
RFU Total                                0.448

9 Enhancement with Partial Predicated Execution
- Predication
  - Eliminates branches from an instruction stream
  - Conditional execution of an instruction
  - Utilized to expose instruction-level parallelism
- Our approach => partial predicated execution to eliminate the branch in an “if-then-else” statement
  - Large clusters of operations => increased performance

10 Support of Partial Predicated Execution
- The available output network can be utilized
- Extensions
  - Two configuration bits
  - Two multiplexers
  - Hardwired connections to PEs
- Selection of the RFU output
  - Controlled by configuration bits => no predication
  - Controlled by comparison result => predicated execution
  - Comparison => implemented in a PE
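The effect of if-conversion with partial predication can be sketched in software. The snippet below is only an illustration, not the RFU hardware: both arms of an if-then-else are computed and a comparison result selects one of them, which mirrors the multiplexer-based output selection described on this slide. The function names (`select`, `sad_branching`, `sad_predicated`) and the sum-of-absolute-differences example are assumptions chosen for the sketch.

```python
def sad_branching(a, b):
    """Sum of absolute differences with an explicit if-then-else per element."""
    total = 0
    for x, y in zip(a, b):
        if x >= y:
            total += x - y
        else:
            total += y - x
    return total


def select(pred, then_val, else_val):
    """Models an output multiplexer: the comparison result picks one value."""
    return then_val if pred else else_val


def sad_predicated(a, b):
    """Same computation after if-conversion: both arms computed, one selected."""
    total = 0
    for x, y in zip(a, b):
        pred = x >= y      # comparison (implemented in a PE on the slide)
        then_val = x - y   # "then" arm, always computed
        else_val = y - x   # "else" arm, always computed
        total += select(pred, then_val, else_val)
    return total
```

Because the loop body of `sad_predicated` contains no control-flow split, the compare, the two subtractions, the selection and the accumulation form one straight-line group of operations, which is what allows a larger cluster to be considered for a single reconfigurable instruction.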

11 Enhancement with Virtual Opcode
- Explicit communication between core and RFU
  - Opcode explosion problem
- Proposed solution => “virtual” opcode
  - Virtual opcode = natural opcode + code region
  - Overhead => configuration memory size
  - Coarse grain => small configuration size => 136 bits per instruction
- In general, a virtual opcode can be implemented by flushing and reloading the whole local memory
  - Large performance overhead
  - Applicable for different applications

12 Support of Virtual Opcode
- Local configuration memory => extended with an extra level of contexts
  - First level = K contexts of locally available reconfigurable instructions
  - Second level = L copies of the first level for different code regions
- For each code region, only one of the L contexts is active
- The same natural opcode in a different region context forms a virtual opcode
- Partitioning into regions and issuing of context activations are performed by the compiler
- One-cycle overhead to activate a context
- Configuration memory size = K * L * (configuration bits per instruction)
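A minimal software model may make the two-level lookup easier to follow. It is a sketch under stated assumptions: the class `ConfigMemory`, its method names, and the 4 x 2 organization in the usage lines are hypothetical, and the default of 136 bits per instruction is the figure quoted on slide 11.

```python
class ConfigMemory:
    """Toy model of a two-level configuration memory (all names hypothetical)."""

    def __init__(self, instructions_per_region, regions, bits_per_instr=136):
        self.instructions_per_region = instructions_per_region
        self.regions = regions
        self.bits_per_instr = bits_per_instr
        self.active_region = 0
        # contexts[region][natural_opcode] holds one instruction configuration
        self.contexts = [[None] * instructions_per_region for _ in range(regions)]

    def size_bits(self):
        """Memory size = instructions per region * regions * bits per instruction."""
        return self.instructions_per_region * self.regions * self.bits_per_instr

    def load(self, region, natural_opcode, config):
        self.contexts[region][natural_opcode] = config

    def activate(self, region):
        """Make one region context active; returns the overhead in cycles."""
        self.active_region = region
        return 1

    def decode(self, natural_opcode):
        """Virtual opcode = natural opcode interpreted in the active region."""
        return self.contexts[self.active_region][natural_opcode]


mem = ConfigMemory(instructions_per_region=4, regions=2)
mem.load(0, 1, "config_a")   # natural opcode 1 in region 0
mem.load(1, 1, "config_b")   # same natural opcode, different region
print(mem.size_bits())       # 4 * 2 * 136 = 1088 bits
```

After `mem.activate(0)`, `mem.decode(1)` yields `"config_a"`; after `mem.activate(1)`, the same natural opcode yields `"config_b"`. The core therefore needs opcode space for only one region's worth of instructions.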

13 Development Framework
- Automated framework for the development of applications on the architecture
- Transparent incorporation of the reconfigurable instruction set extensions
- Based on the SUIF/MachSUIF compiler infrastructure

14 Dev. Framework – Front End / Profiling
- Application source code is translated into a CDFG (SUIFvm operations)
- Machine-independent optimizations are performed
  - If-conversion for partial predicated execution can be applied
- The CDFG is instrumented with profiling annotations, translated to equivalent C code, then compiled and executed on the host
- Profiling information is collected
  - Execution frequency of regions

15 Dev. Framework – Instruction Generation
- First step = pattern generation
  - In-house tool for the identification of MISO clusters of operations, based on the MaxMISO algorithm
- Second step = mapping of MISO clusters onto the RFU
  1. Place the SUIFvm nodes in PEs / route the 1-D array
  2. Analyze paths and set the output of a PE (reg./unreg.) to minimize delay
  3. Report candidate instruction semantics
- (Figure labels: Candidate1, Candidate2, PE1, PE2, PE3)
- Example report for Candidate2:
  src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4
  {
    region: func1 – dfg1
    PE1: sub, output: reg
    PE2: neg, output: un-reg
    …
    edg1: in1-PE1, in2-PE1 …
    …
    latency: 1 cycle
    type: comp
    static gain: 2
  }
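The pattern-generation step can be illustrated with a simplified MaxMISO-style partitioning. This is a sketch of the general idea, not the authors' in-house tool. Starting from an operation that has not been clustered yet, it grows the cluster backwards and absorbs a producer only when every consumer of that producer is already inside the cluster, so each cluster keeps a single output. The data-flow graph encoding (a dict from each operation to its operands) and the function name `max_miso_clusters` are assumptions.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def max_miso_clusters(dfg):
    """Partition a data-flow graph into maximal single-output clusters.

    dfg maps each operation name to the list of its operands; operands that
    are not keys of dfg are treated as external inputs.
    Returns a list of (root_operation, set_of_operations) pairs.
    """
    ops = set(dfg)
    consumers = {op: set() for op in ops}
    for op, operands in dfg.items():
        for src in operands:
            if src in ops:
                consumers[src].add(op)

    # Producers come before consumers in this order.
    order = [n for n in TopologicalSorter(dfg).static_order() if n in ops]

    assigned = set()
    clusters = []
    for root in reversed(order):          # visit outputs first
        if root in assigned:
            continue
        cluster = {root}
        changed = True
        while changed:                    # grow backwards to a fixed point
            changed = False
            for node in list(cluster):
                for src in dfg[node]:
                    if (src in ops and src not in cluster
                            and src not in assigned
                            and consumers[src] <= cluster):
                        cluster.add(src)
                        changed = True
        assigned |= cluster
        clusters.append((root, cluster))
    return clusters
```

For a graph such as `{"sub": ["a", "b"], "neg": ["sub"], "mul": ["sub", "c"], "add": ["neg", "mul"]}`, all four operations end up in one cluster rooted at `add`, because the fan-out of `sub` reconverges inside the cluster. A value consumed by two separate outputs would instead start a cluster of its own.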

16 Dev. Framework – Instruction Selection (1/2)
- No virtual opcode
- Consider the whole application space
- Perform pair-wise graph isomorphism to identify identical candidate instructions
- Calculate the dynamic gain offered by each candidate
  - Dynamic gain = static gain x execution frequency
- Rank candidate instructions based on dynamic gain
- Select the best L instructions
  - L is defined by the number of supported instructions
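The ranking step above can be sketched in a few lines. This is an illustration under assumptions, not the framework's code: the `Candidate` record is hypothetical, and a plain `pattern` string stands in for the result of the graph-isomorphism check (candidates with the same pattern are treated as identical and their frequencies are summed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    pattern: str       # stand-in for an isomorphism class of the cluster
    static_gain: int   # cycles saved per execution
    frequency: int     # execution count from profiling

    @property
    def dynamic_gain(self):
        return self.static_gain * self.frequency


def merge_identical(candidates):
    """Fold candidates with the same pattern into one, summing frequencies."""
    merged = {}
    for c in candidates:
        if c.pattern in merged:
            prev = merged[c.pattern]
            merged[c.pattern] = Candidate(
                prev.pattern, prev.static_gain, prev.frequency + c.frequency)
        else:
            merged[c.pattern] = c
    return list(merged.values())


def select_instructions(candidates, limit):
    """Rank by dynamic gain and keep the best `limit` candidates."""
    ranked = sorted(merge_identical(candidates),
                    key=lambda c: c.dynamic_gain, reverse=True)
    return ranked[:limit]
```

With hypothetical inputs `Candidate("sub-neg", 2, 1000)`, `Candidate("mul-add", 3, 100)`, `Candidate("sub-neg", 2, 500)` and `Candidate("shl-add", 1, 2000)`, the two `sub-neg` occurrences merge to a dynamic gain of 3000, so a limit of 2 selects `sub-neg` and `shl-add`.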

17 Dev. Framework – Instruction Selection (2/2)
- With virtual opcode enabled
- Partition the application code into regions
  - Currently only procedures are supported as regions
- Perform graph isomorphism per region
- Calculate the dynamic gain offered by each candidate for each region
- Calculate the overhead of activating the region contexts
- Rank regions and candidate instructions based on dynamic gain
- Select the best K regions and the best L instructions from each region
  - L, K are defined by the supported contexts and instructions per context
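One plausible reading of the region-based selection is sketched below. The data layout, the function name, and the rule of discarding regions whose net gain is not positive are assumptions. The slides only state that the activation overhead is calculated and that regions and candidates are ranked by dynamic gain.

```python
def select_with_virtual_opcode(region_candidates, activations,
                               max_regions, max_instr, activation_cycles=1):
    """Pick the best regions and the best instructions inside each region.

    region_candidates: dict region -> list of (name, static_gain, frequency)
    activations: dict region -> number of context activations for that region
    Returns a list of (region, instruction_names, net_gain), best first.
    """
    scored = []
    for region, cands in region_candidates.items():
        # Best instructions of this region by dynamic gain = static * frequency.
        ranked = sorted(cands, key=lambda c: c[1] * c[2], reverse=True)[:max_instr]
        gain = sum(static * freq for _, static, freq in ranked)
        # Each activation of the region context costs `activation_cycles`.
        overhead = activations.get(region, 0) * activation_cycles
        scored.append((gain - overhead, region, [name for name, _, _ in ranked]))

    scored.sort(key=lambda item: item[0], reverse=True)
    # Assumption: a region that does not pay for its activations is dropped.
    return [(region, names, net)
            for net, region, names in scored[:max_regions] if net > 0]
```

Here `max_regions` and `max_instr` play the roles of the two context limits (K and L on the slide), and `activation_cycles=1` reflects the one-cycle activation overhead stated on slide 12.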

18 Experimental Results
- Demonstrate the performance improvements offered by the proposed architecture
- Evaluate the efficiency of the enhancements
- A complete MPEG-2 encoding application is used
  - Source code from the MediaBench benchmark suite
  - Input data => a video sequence of 12 frames with a resolution of 144x176 pixels

19 Exp. Results – Speedup Analysis
- Speedup analysis for the most time-consuming functions of the MPEG-2 encoder
- Accelerating only critical regions => small overall speedup (Amdahl's law)
- Our approach accelerates the whole application space => overall speedup is preserved
- (Table: Instr. Count (10^6, no RFU) and Speedup (incremental) per function; rows: SAD, dist, fullsearch, bdist, putbits, fdct, quant, idctcol, dct, pred_comp, iquant, add_pred, bdist, idctrow, putnonintrablk, sub_pred, Overall; numeric values missing from the transcript)
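The Amdahl argument on this slide can be checked with a small calculation. The fractions and local speedups below are illustrative numbers only, not values from the presentation, and `overall_speedup` is a hypothetical helper.

```python
def overall_speedup(accelerated):
    """Amdahl-style overall speedup.

    accelerated: list of (fraction_of_baseline_time, local_speedup) pairs;
    whatever fraction is not listed runs at its original speed.
    """
    covered = sum(fraction for fraction, _ in accelerated)
    untouched = 1.0 - covered
    new_time = untouched + sum(fraction / speedup
                               for fraction, speedup in accelerated)
    return 1.0 / new_time


# Only one critical region (50% of the time) is accelerated 4x.
print(round(overall_speedup([(0.5, 4.0)]), 2))

# The same region plus the rest of the application, at smaller local speedups.
print(round(overall_speedup([(0.5, 4.0), (0.3, 3.0), (0.2, 2.0)]), 2))
```

In the first case the untouched half of the run time caps the overall speedup at about 1.6 even though the kernel itself runs 4x faster. In the second case moderate speedups spread over the whole application give roughly 3.08 overall, which is the effect the slide attributes to accelerating the whole application space.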

20 Exp. Results – Evaluation of Predication
- Example of four instructions derived using if-conversion and partial predicated execution
- These instructions implement the SAD function
- Significant performance improvements are offered
- (Table: SAD Speedup and Overall Speedup for "No predic." and "Predic."; only the value 1.7 for "No predic." survives in the transcript)

21 Exp. Results – Evaluation of Virtual Opcode
- Virtual opcode can be used to preserve speedups for architectures with limited opcode space
- Reasonable overhead for the local configuration memory size
- Finer partitioning of regions could lead to better results
- (Table: Memory Organization (inst. x cont.), Speedup, Memory Size (KB); four constrained organizations whose values are missing from the transcript, plus Unconstr. with speedup 3.1 and memory size "-")

22 Conclusions
- Two enhancements to a previously proposed RISP architecture have been proposed
  - Partial predicated execution => increases performance
  - Virtual opcode => relaxes opcode space pressure
- An automated development framework has been presented
  - Hides the reconfigurable hardware from the user
  - Supports the two enhancements
- The efficiency of the RISP and its enhancements has been demonstrated using an MPEG-2 encoding application
- Future research
  - Support full predication for further performance improvements
  - Support finer partitioning of regions for better utilization of virtual opcode

23 Thank you! Questions?