Exploiting Forwarding to Improve Data Bandwidth of Instruction-Set Extensions. Ramkumar Jayaseelan, Haibin Liu, Tulika Mitra, School of Computing, National University of Singapore.

Presentation transcript:

Exploiting Forwarding to Improve Data Bandwidth of Instruction-Set Extensions
Ramkumar Jayaseelan, Haibin Liu, Tulika Mitra
School of Computing, National University of Singapore
{ramkumar, liuhb, tulika}@comp.nus.edu.sg
Presented by Alex Oumantsev

Exploiting Forwarding to Improve Data Bandwidth of Instruction-Set Extensions
- Introduce the material
- Related Work
- Proposed Architecture
- Compilation Toolchain
- Experimental Evaluation
- Conclusion

Application-Specific Instruction-Set Extensions (Custom Instructions)
- Extend the instruction-set architecture
- Balance performance and time-to-market
- Frequently used computation patterns
- Custom functional units: parallelization and chaining of operations
- Processor support (RISC-style): Altera Nios-II, Tensilica Xtensa

Base Processor – Custom Instruction Mismatch
- RISC-style base processor: fixed-length instructions, two input operands per instruction
- Custom instructions: complex, multiple inputs per operation

Number of Inputs per Custom Instruction

Data Forwarding
- Present on a typical RISC processor
- Also known as register bypassing
- Supplies data to a functional unit from a pipeline buffer
- Resolves data hazards between instructions
- Can supply input operands for custom instructions using the existing logic

Related Work
- Design space exploration
- Data bandwidth
  - Nios-II: internal register files; extra cycles wasted on explicit MOV
  - Xilinx MicroBlaze: Fast Simplex Link with put and get instructions
- Relaxing register file port constraints
- Fixed-length instruction problem

Proposed Architecture
- MIPS-like 5-stage pipeline

Data Forwarding
- The CUST instruction draws 2 inputs from the forwarding paths
- Able to take up to 4 inputs in total
- Modification: do not read from the register file in ID if the operand is forwarded
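A minimal sketch of the operand-gathering idea on this slide, assuming two register-file read ports plus one value from each of two forwarding latches. The function name `gather_operands` and the parameters `ex_mem` and `mem_wb` are hypothetical names chosen for illustration; they do not come from the presentation.

```python
def gather_operands(reg_file, reg_srcs, ex_mem=None, mem_wb=None):
    """Collect up to four inputs for a custom instruction.

    reg_file: list of register values (models the register file)
    reg_srcs: at most two register numbers, read through the two read ports
    ex_mem, mem_wb: values taken from the forwarding latches, or None when
                    the custom instruction does not use that path (assumed model)
    """
    if len(reg_srcs) > 2:
        raise ValueError("only two register read ports are available")
    # Operands read the normal way, in the ID stage.
    operands = [reg_file[r] for r in reg_srcs]
    # Operands taken from forwarding; no register read is issued for these.
    for forwarded in (ex_mem, mem_wb):
        if forwarded is not None:
            operands.append(forwarded)
    return operands


regs = [0] * 32
regs[3], regs[4] = 7, 9
inputs = gather_operands(regs, [3, 4], ex_mem=11, mem_wb=13)
```

Here `inputs` holds four values: two from the register file and two from forwarding, without adding read ports.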

Instruction Encoding
- Transparent to regular instructions
- Minimize the number of bits for operands
- Nios-II example: use the 11 bits of the OPX field
  - OPD defines the operands that come from forwarding
  - COP specifies the custom instruction
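The encoding idea can be sketched as plain bit packing. The slide states only that the 11-bit OPX field carries both OPD and COP; the split used here (4 bits for OPD, 7 bits for COP) and the helper names `encode_opx` and `decode_opx` are assumptions made for illustration, not the layout from the presentation.

```python
OPX_BITS = 11   # from the slide: 11 bits of the OPX field are used
OPD_BITS = 4    # ASSUMPTION: width of the forwarding-operand selector
COP_BITS = OPX_BITS - OPD_BITS  # ASSUMPTION: remaining 7 bits select the custom op


def encode_opx(opd, cop):
    """Pack OPD (upper bits) and COP (lower bits) into one 11-bit value."""
    if not 0 <= opd < (1 << OPD_BITS):
        raise ValueError("OPD does not fit in its field")
    if not 0 <= cop < (1 << COP_BITS):
        raise ValueError("COP does not fit in its field")
    return (opd << COP_BITS) | cop


def decode_opx(opx):
    """Split an 11-bit OPX value back into (OPD, COP)."""
    return opx >> COP_BITS, opx & ((1 << COP_BITS) - 1)


opx = encode_opx(0b1010, 5)
fields = decode_opx(opx)
```

Because only bits inside the existing field are reinterpreted, regular instructions keep their encoding, which is the transparency property the slide asks for.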

Predictable Forwarding
- Results of the two prior instructions can be used
- Problems with multicycle instructions and cache misses: they create bubbles in the pipeline, so forwarding cannot be relied on
- Modification: send the stall signal to all stages, which pauses the pipeline until ready
- No need for NOP instructions
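A toy model of why stalling every stage keeps forwarding predictable. It assumes a 5-slot pipeline list ordered IF, ID, EX, MEM, WB and a multicycle instruction that occupies EX for one extra cycle; `step`, `forwardable`, and the instruction names are hypothetical and only illustrate the behaviour described on the slide.

```python
def step(pipe, fetch, ex_busy, freeze_all):
    """Return the pipeline contents after one clock cycle.

    pipe: [IF, ID, EX, MEM, WB] entries (instruction names, None = bubble)
    ex_busy: True while the instruction in EX needs another cycle
    freeze_all: True models the modified design that stalls every stage
    """
    if not ex_busy:
        return [fetch] + pipe[:4]          # normal advance
    if freeze_all:
        return list(pipe)                  # every stage holds its instruction
    # Conventional stall: IF/ID/EX hold, a bubble enters MEM, MEM drains to WB.
    return pipe[:3] + [None, pipe[3]]


def forwardable(pipe):
    """Instructions whose results sit in the two forwarding latches (MEM, WB)."""
    return {instr for instr in pipe[3:] if instr is not None}


# Program order: add, mul (multicycle), cust. cust wants both results forwarded.
start = ["next", "cust", "mul", "add", "old"]

conventional = step(step(start, "n2", True, False), "n2", False, False)
frozen = step(step(start, "n2", True, True), "n2", False, True)
```

In the conventional run the result of `add` has drained out of the pipeline by the time `cust` reaches EX, while in the frozen run both `add` and `mul` are still in the forwarding latches.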

Multicycle Delays

Cache Miss Delays

Compilation Toolchain
- Compiler cooperation needed
- Determine whether an operand can be forwarded
- Encode the custom instruction correctly
- Schedule to maximize forwarding

Compilation Toolchain
- IR Scheduling
- Pattern identification: identify all possible patterns for custom instructions
- Pattern selection: heuristic, with pattern priority = speedup * frequency
- Instruction scheduling: find the optimal schedule with forwarding
- Forwarding check and MOV insertion: insert a MOV from register x to register x if needed
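Two of the phases above can be sketched in a few lines. This is a simplified illustration, not the authors' compiler: the dictionary keys, the `(opcode, dest_reg)` schedule format, and the function names `rank_patterns` and `insert_movs` are assumptions. The MOV step assumes that a `mov x, x` placed right before the custom instruction is enough to put register x on a forwarding path.

```python
def rank_patterns(patterns):
    """Order candidate patterns by priority = speedup * frequency, highest first."""
    return sorted(patterns, key=lambda p: p["speedup"] * p["frequency"], reverse=True)


def insert_movs(schedule, cust_index, fwd_regs):
    """Return a copy of `schedule` where every register in `fwd_regs` is written
    by one of the two instructions right before the custom instruction.

    schedule: list of (opcode, dest_reg) tuples in issue order
    cust_index: position of the custom instruction in `schedule`
    fwd_regs: registers the custom instruction expects from forwarding
    """
    if len(set(fwd_regs)) > 2:
        raise ValueError("at most two operands can come from forwarding")
    before = list(schedule[:cust_index])
    rest = list(schedule[cust_index:])
    while True:
        window = [dest for _, dest in before[-2:]]   # the two prior results
        missing = [r for r in fwd_regs if r not in window]
        if not missing:
            return before + rest
        before.append(("mov", missing[0]))           # mov x, x re-produces x


ranked = rank_patterns([
    {"name": "a", "speedup": 2, "frequency": 10},
    {"name": "b", "speedup": 5, "frequency": 6},
    {"name": "c", "speedup": 1, "frequency": 3},
])
schedule = [("add", "r1"), ("sub", "r2"), ("xor", "r3"), ("cust", "r4")]
patched = insert_movs(schedule, 3, ["r1", "r3"])
```

A scheduler that already places the producers directly before the custom instruction needs no MOV at all, which is why the scheduling phase runs before this check.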

Experimental Evaluation
- SimpleScalar tool set used
- Constraint of at most 4 inputs and one output
- Selected benchmarks

Speedup
- Speedup = (CycleOrigin / CycleEx - 1) * 100
- Ideal: 4 read ports from the register file
- Forwarding: the discussed solution (may have MOV)
- MOV: the solution implemented on Nios-II (forces MOV)
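A short worked example of the speedup formula. The cycle counts are made-up numbers chosen only to show the arithmetic; they are not results from the presentation, and `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(cycle_origin, cycle_ex):
    """Speedup = (CycleOrigin / CycleEx - 1) * 100, as defined on the slide.

    cycle_origin: cycles on the base processor
    cycle_ex: cycles with the instruction-set extension
    """
    if cycle_ex <= 0:
        raise ValueError("cycle count must be positive")
    return (cycle_origin / cycle_ex - 1) * 100


# Illustrative numbers only: 1500 cycles originally, 1000 with custom instructions.
example = speedup_percent(1500, 1000)   # 50.0, i.e. a 50% speedup
```

A value of 0 means the extension gave no benefit, and a negative value would mean the extended version ran slower.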

Energy Consumption
- Energy used by the registers
- Ideal: 4 read ports from the register file
- Forwarding: the discussed solution (may have MOV)
- MOV: the solution implemented on Nios-II (forces MOV)

Conclusion
- Compiler modification
- Minor pipeline modification
- Data forwarding used for MISO custom instructions
- Overcomes limited register ports
- Compatible instruction encoding
- Near-ideal speedup