From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation
Ben Rudzyn

Disclaimer
The approach presented here comes from a paper of the same title:
– Written by Sami Yehia and Olivier Temam of Paris XI University
– Presented at the 31st Annual International Symposium on Computer Architecture (ISCA '04)
While this presentation is my own work, the methodologies, experimental results, and graphs come from that paper.

Outline
– Background
– Instruction collapsing
– Potential performance improvements
– Limitations of the approach
– Implementation
– Improvements to the approach
– Summary

Background
Current processor trends rely heavily on pipelining and ILP exploitation
On-chip space is devoted to these techniques rather than to physical computing resources (i.e. FUs)
Better improvements rely on software/hardware co-exploitation, but STILL target ILP
The paper proposes a new approach that exploits circuit-level parallelism rather than instruction-level parallelism

Instruction collapsing
Take a sequential set of dependent instructions
– ILP exploitation is useless on such a chain
Implement it as a Function instead
– The chain can be collapsed into a combinational two-level sum of products (ORs of ANDs) or a LUT, as sketched below
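A toy sketch of the collapsing idea in Python (the word width and the particular three-instruction chain are my own illustration, not from the paper): a chain of dependent instructions is a pure function of its live-in registers, so it can be tabulated once and then replaced by a single lookup.

```python
WIDTH = 4                      # 4-bit words keep the table small
MASK = (1 << WIDTH) - 1

def chain(a, b):
    # Three dependent instructions; no ILP to extract here.
    t1 = (a + b) & MASK
    t2 = (t1 << 1) & MASK
    return t2 ^ a

# Collapse: precompute a LUT over all combinations of the live-ins.
LUT = [chain(a, b) for b in range(1 << WIDTH) for a in range(1 << WIDTH)]

def collapsed(a, b):
    return LUT[(b << WIDTH) | a]   # one lookup replaces the whole chain

assert all(collapsed(a, b) == chain(a, b)
           for a in range(1 << WIDTH) for b in range(1 << WIDTH))
```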

Instruction collapsing
Exploits CLP at the cost of redundant operations
Cost: two 64-bit inputs → a 2^128-entry truth table!
One solution: implement the function as a set of n 1-bit operators with multiple carry propagation, as in the sketch below
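A minimal sketch of the per-bit decomposition (addition chosen as the example operation): instead of one 2^128-entry table for a 64-bit add, keep a tiny truth table per bit position, with the carry as an extra input propagating between the 1-bit operators.

```python
# Full-adder truth table, keyed by (a_i, b_i, carry_in);
# the value is (sum_bit, carry_out).
FULL_ADDER = {
    (a, b, c): (a ^ b ^ c, (a & b) | (c & (a ^ b)))
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}

def add_per_bit(x, y, width=64):
    result, carry = 0, 0
    for i in range(width):
        s, carry = FULL_ADDER[(x >> i) & 1, (y >> i) & 1, carry]
        result |= s << i
    return result

assert add_per_bit(2**63 + 5, 7) == 2**63 + 12
```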

Progress
– Background
– Instruction collapsing
– Potential performance improvements
– Limitations of the approach
– Implementation
– Improvements to the approach
– Summary

Potential performance improvement
Potential speedup is determined by the number of collapsible dependent instructions
– Need to identify all disjoint DFGs in the program trace
Speedup is the average height of all DFGs (see the sketch below)
– Average of 1.5 for instruction traces with a 1024-instruction window
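A minimal sketch of why average DFG height approximates the speedup (the graph encoding and instruction names are my own, not the paper's): if a disjoint DFG collapses into a single one-cycle function, the cycles saved correspond to its height, i.e. its longest chain of dependent instructions.

```python
import functools

deps = {                      # instruction -> instructions it reads
    "i1": [], "i2": [],
    "i3": ["i1", "i2"],
    "i4": ["i3"],
    "i5": ["i4"],
}

@functools.lru_cache(maxsize=None)
def height(node):
    return 1 + max((height(d) for d in deps[node]), default=0)

print(max(height(n) for n in deps))   # 4: a 4-deep chain becomes one step
```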

Limitations of the approach
Number of physical inputs (registers + carries)
– Hardware operator size is fixed
Load instructions
– Cannot be combined with dependent instructions
– Still semi-collapsible
– 24.4% of instructions on average
Non-collapsible instructions
– E.g. syscalls, FP divide
– 15% of instructions on average
Result: only integer add/sub, constant shifts, bit operations/manipulations and conditional branches are considered

Limitations of the approach
Significant-bit carries
Height limitation
– Consider only those DFGs whose height is greater than the Function unit latency
– Allows better utilisation of all FUs

Progress
– Background
– Instruction collapsing
– Potential performance improvements
– Limitations of the approach
– Implementation
– Improvements to the approach
– Summary

Implementation
4 main components:
– DFGT: Data Flow Graph Table
– POT: Producing Output Table
– FGE: Function Generation Engine
– FRT: Function Repository Table

Implementation
An output flag is set to indicate which instructions are outputs of the DFG
Use the POT to keep track of data dependencies
– Each entry holds an index into the DFGT, pointing to the instruction that produces the result for that register
– The combination of the POT and DFGT is similar to a ROB, except that it is built offline
Once an instruction is loaded into the DFGT, it is composed with the functions producing its source operands, creating a more complex function (see the sketch below)
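A minimal sketch of the bookkeeping just described (the data structures are my own simplification, not the paper's exact layout): the DFGT records one entry per instruction, and the POT maps each architectural register to the DFGT entry that currently produces it, so a new instruction can be linked to the functions producing its sources.

```python
dfgt = []   # list of {"op", "srcs", "output"} entries
pot = {}    # register name -> index into dfgt (absent = live-in)

def insert(op, dest, srcs):
    entry = {
        "op": op,
        # For each source, link to its producer if one is in flight.
        "srcs": [(r, pot.get(r)) for r in srcs],
        "output": True,          # provisionally an output of the DFG
    }
    dfgt.append(entry)
    for _, producer in entry["srcs"]:
        if producer is not None:
            dfgt[producer]["output"] = False  # consumed internally
    pot[dest] = len(dfgt) - 1

insert("add", "r3", ["r1", "r2"])
insert("shl", "r4", ["r3"])      # composed with the add producing r3
```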

Implementation
FGE (Function Generation Engine)
– Handles three types of inputs
– If an operand is the result of a previous instruction, the function producing that operand is sent as a truth table
FRT (Function Repository Table)
– Stores the result of each function as a 64-bit truth table (6 inputs per function)
– One truth table for EACH bit of the result word
– Each entry in the DFGT contains an index to the corresponding function results in the FRT
– Also stores the number of inputs of each truth table
(the 64-bit encoding is sketched below)
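A sketch of this encoding (the helper names are mine, not the paper's): a function of up to 6 one-bit inputs has 2^6 = 64 rows, so its entire truth table fits in a single 64-bit word, with bit k holding the output for input combination k.

```python
def truth_table(fn, n_inputs):
    """Pack fn's outputs over all 2^n_inputs combinations into one int."""
    table = 0
    for k in range(1 << n_inputs):
        bits = [(k >> i) & 1 for i in range(n_inputs)]
        table |= (fn(*bits) & 1) << k
    return table

def evaluate(table, *bits):
    """One table lookup replaces the whole computation."""
    k = sum(b << i for i, b in enumerate(bits))
    return (table >> k) & 1

# Example: a 3-input majority function packed into the low 8 bits.
maj = truth_table(lambda a, b, c: (a & b) | (b & c) | (a & c), 3)
assert maj == 0b11101000
assert evaluate(maj, 1, 1, 0) == 1
```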

[Figure: per-bit truth table with inputs r9, r10, Ci (carry-in) and outputs F, Co (carry-out)]

Implementation
How does the FGE create a new truth table from the previous ones?
– For each combination of the inputs, it looks up the truth tables of the operands
– It uses those results to index the truth table of the operation itself (stored in an additional library of operations)
– The library also indicates whether additional variables (i.e. carries) must be introduced
The final function truth table is stored back into the FRT and linked through the DFGT again (a sketch of the composition step follows)
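A sketch of that composition step, using the packed-table encoding from above (and ignoring the carry-introduction case): given the truth tables of the two operands as functions of the same n live-in bits, and the 2-input table of the operation, enumerate the live-in combinations and build the collapsed function's table directly.

```python
def compose(op_table, left_table, right_table, n_inputs):
    out = 0
    for k in range(1 << n_inputs):
        a = (left_table >> k) & 1     # operand 1 under input combo k
        b = (right_table >> k) & 1    # operand 2 under input combo k
        out |= ((op_table >> (a | (b << 1))) & 1) << k
    return out

# XOR's 2-input table: rows (a,b) = 00,10,01,11 -> 0,1,1,0 = 0b0110.
XOR = 0b0110
# With 2 live-ins x0, x1: left operand is x0, right operand is x1.
IDENT_X0 = 0b1010        # outputs x0 over k = 0..3
IDENT_X1 = 0b1100        # outputs x1 over k = 0..3
assert compose(XOR, IDENT_X0, IDENT_X1, 2) == 0b0110  # table of x0 ^ x1
```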

Hardware implementation
From truth table to reconfigurable Function Unit
Function unit advantages:
– Combinational logic only
– Single row
– No complex interconnections
Disadvantages:
– Significant number of inputs → large logic blocks

Hardware implementation
Major issue: the overhead of dynamically building DFGs and functions on the fly
– Assembling large traces
– Trace → DFG → Function truth table → Macro
rePLay framework (not going into details)
– Speeds up branch resolution
– Effect of Function delay (reconfiguration and processing)

Progress
– Background
– Instruction collapsing
– Potential performance improvements
– Limitations of the approach
– Implementation
– Improvements to the approach
– Summary

Improvements to the approach
– Increase the number of inputs?
– Increase the trace window for frames?
– Alleviate load cuts through address prediction?
– Combine Functions with existing ILP techniques: best of both worlds

Summary
Exploits circuit-level parallelism
Collapses dependent instructions into two-level combinational circuits
Works independently of ILP
– Targets a different set of instructions