Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor Hamid Noori †, Farhad Mehdipour ‡, Kazuaki Murakami †, Koji.

Slides:

Advertisements

Similar presentations

Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.

Advertisements

Computer Architecture Instruction-Level Parallel Processors

University of Michigan Electrical Engineering and Computer Science 1 Application-Specific Processing on a General Purpose Core via Transparent Instruction.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

The University of Adelaide, School of Computer Science

Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.

UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Computer Architecture Computer Architecture Processing of control transfer instructions, part I Ola Flygt Växjö University

Parallell Processing Systems1 Chapter 4 Vector Processors.

Computer Organization and Architecture

From Sequences of Dependent Instructions to Functions An Approach for Improving Performance without ILP or Speculation Ben Rudzyn.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

Processor Technology and Architecture

1 Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Nikolaos Vassiliadis, George Theodoridis and.

Chapter XI Reduced Instruction Set Computing (RISC) CS 147 Li-Chuan Fang.

1 COMP541 Sequencing – III (Sequencing a Computer) Montek Singh April 9, 2007.

Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.

Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.

Multiscalar processors

Chapter 12 CPU Structure and Function. Example Register Organizations.

DLX Instruction Format

Enhancing Embedded Processors with Specific Instruction Set Extensions for Network Applications A. Chormoviti, N. Vassiliadis, G. Theodoridis, S. Nikolaidis.

University of Michigan Electrical Engineering and Computer Science Data-centric Subgraph Mapping for Narrow Computation Accelerators Amir Hormati, Nathan.

Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.

5-Stage Pipelining Fetch Instruction (FI) Fetch Operand (FO) Decode Instruction (DI) Write Operand (WO) Execution Instruction (EI) S3S3 S4S4 S1S1 S2S2.

RISC:Reduced Instruction Set Computing. Overview What is RISC architecture? How did RISC evolve? How does RISC use instruction pipelining? How does RISC.

Chapter 5 Basic Processing Unit

RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.

1 Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives Lin, Hai Fei, Yunsi ACM/IEEE.

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.

A Graph Based Algorithm for Data Path Optimization in Custom Processors J. Trajkovic, M. Reshadi, B. Gorjiara, D. Gajski Center for Embedded Computer Systems.

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University,

An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit Farhad Mehdipour †, Hamid.

1 COMP541 Multicycle MIPS Montek Singh Apr 4, 2012.

COMP541 Multicycle MIPS Montek Singh Apr 8, 2015.

A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS.

1 Control Unit Operation and Microprogramming Chap 16 & 17 of CO&A Dr. Farag.

Overview of Super-Harvard Architecture (SHARC) Daniel GlickDaniel Glick – May 15, 2002 for V (Dewar)

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.

Pipelining and Parallelism Mark Staveley

Chapter One Introduction to Pipelined Processors

Kyushu University Las Vegas, June 2007 The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems.

Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Improving Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection Hamid Noori †, Maziar Goudarzi ‡, Koji Inoue ‡, and Kazuaki.

COMP541 Multicycle MIPS Montek Singh Mar 25, 2010.

High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University,

Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.

Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.

An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.

ADAPTIVE CACHE-LINE SIZE MANAGEMENT ON 3D INTEGRATED MICROPROCESSORS Takatsugu Ono, Koji Inoue and Kazuaki Murakami Kyushu University, Japan ISOCC 2009.

Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.

Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.

CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Hamid Noori*, Farhad Mehdipour†, Norifumi Yoshimastu‡,

Improving Program Efficiency by Packing Instructions Into Registers

COMP4211 : Advance Computer Architecture

Computer Organization “Central” Processing Unit (CPU)

Detailed Analysis of MiBench benchmark suite

Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization Nathan Clark, Manjunath Kudlur, Hyunchul Park,

Topic 5: Processor Architecture Implementation Methodology

Topic 5: Processor Architecture

Presentation transcript:

Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor Hamid Noori †, Farhad Mehdipour ‡, Kazuaki Murakami †, Koji Inoue †, and Maziar Goudarzi † † Kyushu University, Japan ‡ AmirKabir University of Technology, Iran

FranceKyushu University Outline  Introduction  Overview of ADEXOR Architecture  Generating Multi-Exit Custom Instructions (MECIs)  Proposing an Architecture for the CRFU  Experimental Results  Conclusions and Future Work

FranceKyushu University Introduction  Motivation Increase in manufacturing and NRE costs Increase in design and verification costs due to more complexity Shorter time-to-market More required flexibility due to evolution of standards, user requirements, supporting multiple applications and etc  Our proposed approach Generating custom instruction after chip-fabrication Proposing multi-exit custom instructions Proposing a reconfigurable functional unit with conditional execution using a quantitative approach

FranceKyushu University General Overview of ADEXOR Architecture

FranceKyushu University General Overview of ADEXOR Architecture  Phases Design phase Configuration phase Normal phase

FranceKyushu University Generating Multi-Exit Custom Instructions  Multi-Exit Custom Instructions Include hot directions of branch instructions Include single entry but multiple exits

FranceKyushu University Generating Multi-Exit Custom Instructions  Finding the largest sequence of instructions in the CFG  Checking the anti-dependence and flow-dependence and moving executable instructions to the head and tail(s) of MECIs  Rewriting the object code where instructions are going to be moved. Detecting subgraphs in a HIS branch exit 1 LS exit 2 S1 branch exit 1 mtc1 branch exit 1 Overwriting the entry node with mtc1 instr Moving instructions after checking data dependency (binary rewriting) exit 2 S2 S1 S2 S1 LS

FranceKyushu University Generating Multi-Exit Custom Instructions  Tool Chain

FranceKyushu University CRFU Architecture: A Quantitative Approach  22 programs of MiBench were chosen  Simplescalar toolset was utilized for simulation  CRFU is a matrix of FUs No of Inputs No of Outputs No of FUs Connections Location of Inputs & Outputs  Some definitions: Considering frequency and weight in measurement  CI Execution Frequency  Weight (To equal number of executed instructions)  Average = for all CIs (ΣFreq*Weight) Rejection rate: Percentage of MECIs that could not be mapped on the CRFU Mapping rate: Percentage of MECIs that could be mapped on the CRFU

FranceKyushu University Inputs/Outputs

FranceKyushu University Functional Units

FranceKyushu University Width/Depth

FranceKyushu University CRFU Architecture

FranceKyushu University Supporting Conditional Execution Selector-Mux

FranceKyushu University Synthesis result  Synopsys tools  Hitachi 0.18 μm  Area: 2.1 mm 2  Configuration bits: 615 bits  Delay Depth of DFG of MECI Delay (ns)

FranceKyushu University Experiment setup  22 applications of Mibench  Simplescalr Issue4-way L1- I cache32K, 2 way, 1 cycle latency L1- D cache32K, 4 way, 1 cycle latency Unified L21M, 6 cycle latency Execution units4 integer, 4 floating point RUU size & Fetch queue size64 Branch predictorbimodal Branch prediction table size2048 Extra branch misprediction latency3

FranceKyushu University Speedup CIs & MECIs

FranceKyushu University Effect of clock frequency of speedup

FranceKyushu University Conclusion  Our experimental results show that by extending custom instructions over multiple HBBs the average speedup increases by 46% compared to the custom instructions which are limited to only one HBB. This is achieved in return for 83% more hardware and 20% more configuration bits. Utilizing connections with different length are helpful for supporting larger custom instructions with the available number of FUs.

FranceKyushu University Future work  Energy evaluation of the ADEXOR  Exploring the design space of CRFU architecture (To study the effect of number of inputs, outputs, FU on the speedup, area and power)

FranceKyushu University Thank you for your attention

FranceKyushu University Introduction (1/2)  Efficiency and flexibility in embedded system design Both are critical Both are conflicting design goals Custom hardwired feature for more efficiency (performance and energy) Programmable feature for more flexibility  Design challenges for future embedded systems More required efficiency (performance and energy) for future embedded systems Higher manufacturing and NRE cost of new nanometer-scale technologies Higher cost and risk in development Shorter time-to-market

FranceKyushu University Introduction (2/2)  Custom instructions Effective technique for improving efficiency Custom functional units are required (manufacturing, NRE and design cost)  Our proposed approach Generating custom instruction after chip- fabrication Proposing multi-exit custom instructions Replacing custom functional units with a reconfigurable functional unit with conditional execution

FranceKyushu University Generating Multi-Exit Custom Instructions  Motivating example adpcm loop B1 S1 B2 S2 B3 B4 S3 B5 S4 B6 S5 J1 B7 S6 J2 B8 B9 S7 B10 S8 S9 B11 J3 B12 S10

FranceKyushu University Generating Multi-Exit Custom Instructions  Generating Hot Instruction Sequence 1. MECIs should not cross loop boundaries 2. Sorting loop from innermost to outermost 3. Reading HBBs and linking them to generate Hot Instruction Sequence 4. Sort other remaining HBBs in ascending order considering their start address Function MAKE_HIS (objfile, HIS, start_addr) 1 if (HBB with start_addr is not included in previous MECIs) then read_add_HBB2HIS (objfile, HBB(start_addr), HIS) else return; 2 switch last_instruction(HBB) 3 case (indirect jump, return or call): return; 4 case (direct jump): MAKE_HIS(objfile, HIS, target address of jump); 5 case (branch): 5-1 if (it is hot backward) then return; 5-2 elsif (not-taken direction is hot) then MAKE_HIS(objfile, HIS, target address of not-taken direction) else return; 5-3 if (taken direction is hot) then MAKE_HIS(objfile, HIS, target address of taken direction) else return; 6 default: return;

FranceKyushu University Generating Multi-Exit Custom Instructions  MECIs include fixed point instructions except multiply, divide and load. At most on store and five branches.  A MECI can have at most four exit points branch with only one hot direction indirect jump and return call hot backward branch an instruction where its next instruction is non- executable.  If both directions of a branch are hot, both corresponding HBBs are added.

FranceKyushu University Executing MECIs on the CRFU

FranceKyushu University Proposing an Architecture for the CRFU  Design Methodology

FranceKyushu University Mapping Tool

FranceKyushu University Integrated Framework (1/2)  Integrated Framework Performs an integrated temporal partitioning and mapping process Takes rejected CIs as input Partitions them to appropriate mappable CIs Adds nodes to the current partition while architectural constraints are satisfied The ASAP level of nodes represents their order to execute according to their dependencies

FranceKyushu University Integrated Framework (2/2)  Incremental HTTP The node with the highest ASAP level is selected and moved to the subsequent partition.  Nodes selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.

FranceKyushu University Supporting Conditional Execution

FranceKyushu University Execution time of configuration phase Application Exec. time (Seconds) Application Exec. time (Seconds) adpcm225gsm461 bitcounts331 lame526 blowfish94 patricia84 basicmath34 qsort233 cjpeg75 rijndael68 crc132 sha29 dijkstra101 stringsearch3 djpeg9 susan122 fft36Average150.8

FranceKyushu University Effect of connections on mapping rate  By deleting the connections with length more than one, 24.2% of MECIs can not be mapped.

FranceKyushu University General Overview of ADEXOR Architecture  Main components Base processor  4-issue in-order RISC processor Reconfigurable functional units (CRFU)  Coarse grain  Based on matrix of functional units (FUs)  Multi-cycle  Parallel with other functional units of the base processor  Read/write from/to register file  Functions and connections are controlled by configuration bits Configuration memory  To keep the configuration data of CRFU for multi-exit custom instructions (MECIs) Counters  Control read/write ports of register file and select between CRFU and processor functional units

FranceKyushu University Speedup CIs & MECIs  The number of inputs, outputs and FUs are the same  simpler connections and FUs and does not support conditional execution.  Area: 1.15 mm 2  Delay for a CI with a critical length of five is 7.66 ns.  Each CI configuration needs 512 bits.  The average number of instructions included in CIs (one HBB) is 6.39 instructions and for MECIs is 7.85 instructions.

FranceKyushu University MECIs vs. CI

FranceKyushu University Distribution of functions row1row2ro3row4row5 and23110 Or12111 Xor21110 Nor10000 Add/sub54321 Shift43221 Compare22221

FranceKyushu University Control Bits & Immediate Data  375 bits are needed as Control Bits for Multiplexers Functional Units  240 bits are needed for Immediates  Each CI configuration needs ( = 615 bits)