1 Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece E-mail: nivas@physics.auth.gr Aristotle University of Thessaloniki

2 Outline Introduction Target Architecture Overview Partial Predicated Execution Enhancement Virtual Opcode Enhancement Development Framework Experimental Results Conclusions

3 Introduction Characteristics of modern embedded applications Diversity of algorithms Rapid evolution of standards High performance demands To amortize cost over high production volumes embedded systems must: Exhibit high levels of flexibility => fast Time-to-Market Exhibit high levels of adaptability => increased reusability An appealing option => couple a reconfigurable hardware (RH) to a typical processor Processor => bulk of the flexibility RH => adaptation to the target application Support by a development framework that hides RH related issues Maintain flexibility Continue to target software-oriented group of users

4 Target Architecture Reconfigurable Instruction Set Processor (RISP) Core processor 32-bit single issue RISC 5 pipeline stages Reconfigurable Functional Unit (RFU) 1-D array of coarse-grain processing elements (PEs) An interface that tightly couples the RFU to the core Explicit communication

5 Target Architecture - ISA Re=‘0’ => Standard Instruction Set Flexibility to execute any program Re=‘1’ => Reconfigurable Instruction Set Extensions Offers the adaptation to the target application Three types of Reconfigurable Instructions Complex computational operations Complex addressing modes Complex control flow operations 32-Bit Instruction Word Format

6 Target Architecture - RFU 1-D Array of coarse-grain PEs Executes Reconfigurable Instructions Multiple-Input-Single-Output (MISO) clusters of primitive operations Un-registered output Chain of operations in the same clock cycle Registered output Chain of pipelined operations Floating PEs => Can operate in both core pipeline stages on demand Better utilization of the available hardware

7 Target Architecture – Configuration Local configuration memory Multi-context No overhead to select a context Array of coarse-grain PEs => Small number of configuration bit-stream per instruction

8 Target Architecture – Synthesis Results A hardware model (VHDL) was designed Synthesis results with STM 0.13 µm Reasonable area overhead No overhead to core critical path

Configuration                        Value
Granularity                          32 bits (16x16 multiplier)
Number of Processing Elements        8
Processing Elements Functionality    ALU, Shifter, Multiplier
Configuration Contexts               16 words of 134 bits
Local Memory Size                    8 constants of 32 bits
Number of Provided Local Operands    4

Component                    Area (mm²)
Processor Core               0.134
RFU Processing Layer         0.186
RFU Interconnection Layer    0.125
RFU Configuration Layer      0.137
RFU Total                    0.448

9 Enhancement with Partial Predicated Execution Predication Eliminate branches from an instruction stream Conditional execution of an instruction Utilized to expose Instruction Level Parallelism Our approach => partial predicated execution to eliminate the branch in an “if-then-else” statement Large clusters of operations => increased performance
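As a rough illustration of the if-conversion idea on this slide, the sketch below contrasts a branching absolute difference with an if-converted version that computes both arms and then selects one. The function names are hypothetical, and the conditional expression only stands in for the hardware selection; this is not the authors' RFU implementation.

```python
def abs_diff_branch(a: int, b: int) -> int:
    """Original form: an if-then-else that would compile to a conditional branch."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_predicated(a: int, b: int) -> int:
    """If-converted form: both arms are computed, then one result is selected."""
    then_val = a - b   # "then" arm, always computed
    else_val = b - a   # "else" arm, always computed
    cond = a > b       # comparison that produces the predicate
    # The select below stands in for a multiplexer driven by the comparison.
    return then_val if cond else else_val


def sad(xs, ys):
    """Sum of absolute differences built from the branch-free kernel."""
    return sum(abs_diff_predicated(x, y) for x, y in zip(xs, ys))
```

With the branch removed, the subtractions, the comparison and the selection form one straight-line cluster, which is the kind of larger cluster of operations the slide refers to.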

10 Support of Partial Predicated Execution The available output network can be utilized Extensions Two configuration bits Two multiplexers Hardwired connections to PEs Selection of the RFU output Controlled by configuration bits => no predication Controlled by comparison result => predicated execution Comparison => implemented in a PE

11 Enhancement with Virtual Opcode Explicit communication between Core and RFU Opcode explosion problem Proposed solution => “Virtual” opcode Virtual opcode = Natural opcode + code region Overhead => Configuration memory size Coarse grain => Small configuration size => 136 bits per instr. In general, a virtual opcode can be implemented by flushing and reloading the whole local memory Large performance overhead Applicable for different applications

12 Support of Virtual Opcode Local Configuration memory => extended with extra level of contexts First level = K contexts of locally available reconfigurable instructions Second level = L copies of the first level for different code regions For each code region only one of L contexts is active The same natural opcode in different region context forms a virtual opcode Partitioning of regions and issue of activation performed by the compiler One cycle overhead to activate a context Configuration memory size = K*L*Conf. Bits per Instr.
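The size formula on this slide can be checked with a small calculation. The sketch below is only an illustration: the function names are hypothetical, the 136 bits per instruction is the figure quoted for the coarse-grain configuration, and treating 1 KB as 1024 bytes is an assumption.

```python
def config_memory_bits(k: int, l: int, bits_per_instr: int) -> int:
    """Configuration memory size = K * L * configuration bits per instruction."""
    return k * l * bits_per_instr


def config_memory_kb(k: int, l: int, bits_per_instr: int) -> float:
    """Same size expressed in KB, assuming 8 bits per byte and 1024 bytes per KB."""
    return config_memory_bits(k, l, bits_per_instr) / 8 / 1024


# Example: 16 instructions per context, 12 region contexts, 136 bits each.
bits = config_memory_bits(16, 12, 136)
kb = config_memory_kb(16, 12, 136)
```

Because the size grows with the product K*L, doubling either the instructions per context or the number of region contexts doubles the local configuration memory.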

13 Development Framework Automated framework for the development of applications in the architecture Transparent incorporation of the reconfigurable instructions set extensions Based on the SUIF/MachSUIF compiler infrastructure

14 Dev. Framework – Front End / Profiling Application source code translated into a CDFG (SUIFvm operations) Perform machine-independent optimizations If-conversion for partial predicated execution can be applied CDFG instrumented with profiling annotations, translated to equivalent C code, compiled and executed on the host Profiling information is collected Region execution frequencies

15 Dev. Framework – Instruction Generation First step = Pattern Generation In-house tool for the identification of MISO clusters of operations based on the MaxMISO algorithm Second step = Mapping of MISO in the RFU
1. Place the SUIFvm nodes in PEs / Route the 1-D array
2. Analyze paths and set the output of a PE (reg./unreg.) to minimize delay
3. Report candidate instruction semantics
[Figure: Candidate1 and Candidate2; PE1, PE2, PE3]
Candidate2
src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4
{
region: func1 – dfg1
PE1: sub, output: reg
PE2: neg, output: un-reg
………………………………………
edg1: in1-PE1, in2-PE1………….
……………………………………….
latency: 1 cycle
type: comp
static gain: 2
}
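The pattern-generation step can be pictured with a small clustering sketch in the spirit of MaxMISO: starting from each not-yet-clustered node in reverse topological order, a cluster absorbs every predecessor whose consumers all lie inside the cluster, so each cluster keeps a single output. This is not the authors' in-house tool; the graph encoding (`preds`, mapping each operation to the operations feeding it) and the node names are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def max_miso_clusters(preds):
    """Partition a data-flow graph into maximal single-output clusters.

    preds maps every operation node to the list of operation nodes feeding it.
    External inputs (registers, constants) are simply not listed.
    """
    # Derive the consumer (successor) sets from the predecessor lists.
    succs = {n: set() for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].add(n)

    # Topological order puts producers first; walk it backwards.
    order = list(TopologicalSorter(preds).static_order())
    assigned = set()
    clusters = []
    for root in reversed(order):
        if root in assigned:
            continue
        cluster = {root}
        grew = True
        while grew:  # grow until no further predecessor can be absorbed
            grew = False
            for n in list(cluster):
                for p in preds[n]:
                    # Absorb p only if every consumer of p is already inside.
                    if p not in cluster and p not in assigned and succs[p] <= cluster:
                        cluster.add(p)
                        grew = True
        assigned |= cluster
        clusters.append(cluster)
    return clusters


# Hypothetical graph: add = neg(sub(..)) + shl(..), mul also consumes shl.
example = {"sub": [], "neg": ["sub"], "shl": [], "add": ["neg", "shl"], "mul": ["shl"]}
clusters = max_miso_clusters(example)
```

In the example, `shl` stays in its own cluster because its result is consumed by both `add` and `mul`, so absorbing it into either would give that cluster a second output.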

16 Dev. Framework – Instruction Selection (1/2) No Virtual opcode Consider the whole application space Perform pair-wise graph isomorphism to identify identical candidate instructions Calculate dynamic gain offered by each candidate Dynamic = Static x Frequency Rank candidate instructions based on dynamic gain Select best L instructions L defined by the number of supported instructions
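The ranking rule on this slide (dynamic gain = static gain x execution frequency, then keep the best L) can be sketched as follows. The `Candidate` type, its field names and the sample values are hypothetical, and the graph-isomorphism step that merges identical candidates is assumed to have been done beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    static_gain: int   # cycles saved per execution
    frequency: int     # execution count from profiling

    @property
    def dynamic_gain(self) -> int:
        # Dynamic = Static x Frequency
        return self.static_gain * self.frequency


def select_instructions(candidates, limit):
    """Rank candidates by dynamic gain and keep the best `limit` of them."""
    ranked = sorted(candidates, key=lambda c: c.dynamic_gain, reverse=True)
    return ranked[:limit]


# Made-up candidates: a large static gain does not win if it rarely executes.
cands = [Candidate("a", 2, 100), Candidate("b", 5, 10), Candidate("c", 1, 500)]
best = select_instructions(cands, 2)
```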

17 Dev. Framework – Instruction Selection (2/2) With Virtual opcode enabled Partition application code into regions Currently supporting only procedures Perform Graph isomorphism per region Calculate dynamic gain offered by each candidate for each region Calculate overhead to set active the region contexts Rank regions and candidate instructions based on dynamic gain Select best K regions and best L instructions from each region L, K defined by the supported contexts and instructions per context
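A possible shape for the region-aware selection is sketched below, following the naming on this slide (best K regions, best L instructions per region). The data layout, the function name, and the cost model (a fixed number of cycles charged per region activation, subtracted from the region's summed dynamic gain) are assumptions for illustration, not the framework's actual algorithm.

```python
def select_with_virtual_opcode(regions, k, l, activation_cycles=1):
    """Pick the best k regions and, inside each, the best l candidates.

    regions maps a region name to a dict with:
      "activations": how many times the region context must be activated
      "candidates":  list of (name, static_gain, frequency) tuples
    """
    scored = []
    for region, info in regions.items():
        # Best l candidates of this region by dynamic gain (static x frequency).
        best = sorted(info["candidates"], key=lambda c: c[1] * c[2], reverse=True)[:l]
        gain = sum(s * f for _, s, f in best)
        # Charge the assumed cost of activating the region's context.
        net = gain - activation_cycles * info["activations"]
        scored.append((net, region, [name for name, _, _ in best]))

    scored.sort(key=lambda t: t[0], reverse=True)
    # Keep only regions whose net gain is positive.
    return {region: names for net, region, names in scored[:k] if net > 0}


# Made-up profile data for three procedures.
profile = {
    "dist1": {"activations": 100, "candidates": [("i0", 3, 1000), ("i1", 2, 500), ("i2", 1, 10)]},
    "fdct": {"activations": 50, "candidates": [("j0", 2, 200)]},
    "init": {"activations": 10, "candidates": [("k0", 1, 5)]},
}
chosen = select_with_virtual_opcode(profile, k=2, l=2)
```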

18 Experimental Results Prove the performance improvements offered by the proposed architecture Evaluate the efficiency of the enhancements A complete MPEG-2 encoding application is used Source code from MediaBench benchmark suite Input data => a video sequence consisting of 12 frames with resolution of 144x176 pixels

19 Exp. Results – SpeedUp Analysis Speedup analysis for the most time-consuming functions of the MPEG-2 encoder Accelerate only critical regions => small overall speedup (Amdahl) Our approach accelerates the whole application space => overall speedup is preserved

Function          Instr. Count (10^6) (No RFU)    SpeedUp    SpeedUp (Incremental)
SAD               589.0                           6.6        1.5
dist1             1206.0                          3.4        2.3
fullsearch        73.5                            2.0        2.5
bdist1            18.0                            2.0        2.5
putbits           16.3                            2.3        2.6
fdct              15.6                            2.3        2.6
quant             13.1                            2.6        2.7
idctcol           11.4                            2.4        2.7
dct               10.4                            2.3        2.7
pred_comp         10.1                            1.9        2.7
iquant            9.9                             1.8        2.8
add_pred          8.0                             2.0        2.8
bdist2            7.3                             1.8        2.8
idctrow           7.0                             2.2        2.8
putnonintrablk    6.9                             1.8        2.8
sub_pred          6.6                             1.8        2.9
Overall           1448.7                          -          2.9
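The Amdahl remark can be made concrete with the SAD figures from this slide (589.0 of 1448.7 million instructions, local speedup 6.6). The sketch below uses instruction counts as a stand-in for execution time, which is an assumption, and the function name is hypothetical.

```python
def overall_speedup(total: float, accelerated: float, local_speedup: float) -> float:
    """Amdahl's law: only the accelerated part of the work shrinks."""
    remaining = total - accelerated           # work that runs at the old speed
    sped_up = accelerated / local_speedup     # work that runs faster
    return total / (remaining + sped_up)


# Accelerating SAD alone, using the counts reported on the slide.
sad_only = overall_speedup(1448.7, 589.0, 6.6)
```

Under these assumptions `sad_only` comes out at roughly 1.5, in line with the first incremental value on the slide: even a 6.6x gain on the hottest kernel leaves the application bounded by the code that was not accelerated.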

20 Exp. Results – Evaluation of predication Example of four instructions derived using if-conversion and partial predicated execution These instructions implement the SAD function Significant performance improvements are offered

Columns: SAD Speedup, Overall Speedup
No predic.    1.7
Predic.       6.6    2.9

21 Exp. Results – Evaluation of Virtual Opcode Virtual opcode can be used to preserve speedups for architectures with limited opcode space Reasonable overhead for the local configuration memory size Finer partitioning of regions could lead to better results

Memory Organization (inst. x cont.)    Speedup    Memory Size (KB)
4x8                                    1.7        0.5
8x8                                    2.0        1.1
16x12                                  2.8        3.2
32x12                                  3.0        8.7
Unconstr.                              3.1        -

22 Conclusions Two enhancements to a previously proposed RISP architecture have been proposed Partial predicated execution => increases performance Virtual opcode => relaxes opcode space pressure An automated development framework has been presented Hides the reconfigurable hardware from the user Supports the two enhancements The efficiency of the RISP and its enhancements has been demonstrated using an MPEG-2 encoding application Future research Support full predication for further performance improvements Support finer partitioning of regions for better utilization of virtual opcode

23 Thank you !!! Questions ??
