1/14 A Result Forwarding Unit for a Synthesisable Asynchronous Processor Luis Tarazona and Doug Edwards Advanced Processor Technologies Group School of.

Presentation transcript:

1/14 A Result Forwarding Unit for a Synthesisable Asynchronous Processor Luis Tarazona and Doug Edwards Advanced Processor Technologies Group School of Computer Science

2/14 Result Forwarding
Method to reduce the performance penalty of inter-instruction data dependencies
Can even be used to allow out-of-order execution
Hard to implement in asynchronous processors
Earlier proposed solutions to resolve data dependencies in asynchronous processors:
– Register locking (AMULET1)
– Last-result register (AMULET2)
– Asynchronous ROB (AMULET3)
– Counterflow pipelines
Full-custom solutions!

3/14 Potential Benefits

4/14 Synthesisable Result Forwarding Unit
Synthesisable description advantages:
– Faster development
– Design-space exploration
– Technology mapping transparency
The description serves to:
– Evaluate the capabilities of the Balsa language to describe performance-demanding systems
– Highlight performance-oriented description techniques

5/14 The Target Processor: nanoSpa
Experimental new SPA specification
Same 3-stage SPA pipeline architecture
Main target: performance
No support yet for:
– Thumb instructions
– Interrupts
– Memory aborts
– Coprocessors

6/14 Related Work: AMULET3 ROB
D.A. Gilbert & J.D. Garside, 1997
Asynchronous reorder buffer that provides forwarding and precise exception handling
Implemented in single-rail
Five-process reference model for the synthesisable FU
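
The reorder-buffer idea on this slide can be sketched behaviourally. The Python below is only an illustration under stated assumptions: the class name, method names and cell layout are hypothetical and are not taken from the AMULET3 design or from the Balsa description. It shows cells allocated in program order, results that may arrive out of order, and writeout that retires strictly the oldest cell, which is what makes precise exception handling possible.

```python
class ReorderBuffer:
    """Behavioural sketch of a circular reorder buffer (all names hypothetical)."""

    def __init__(self, size=4):
        self.size = size
        self.cells = [None] * size  # each cell: None or a dict(dest, value, done)
        self.head = 0               # oldest allocated cell, next to be written out
        self.tail = 0               # next free cell
        self.count = 0              # number of allocated cells

    def allocate(self, dest):
        """Reserve a cell in program order; return its tag, or None if full."""
        if self.count == self.size:
            return None
        tag = self.tail
        self.cells[tag] = {"dest": dest, "value": None, "done": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return tag

    def arrive(self, tag, value):
        """Record a result; arrivals may happen out of program order."""
        cell = self.cells[tag]
        cell["value"] = value
        cell["done"] = True

    def writeout(self, regfile):
        """Retire the oldest cell, but only once its result has arrived."""
        if self.count == 0:
            return False
        cell = self.cells[self.head]
        if not cell["done"]:
            return False
        regfile[cell["dest"]] = cell["value"]
        self.cells[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return True
```

In this sketch a younger result that arrives early simply waits in its cell: `writeout` returns `False` until the oldest cell is complete, so the register file is always updated in program order.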

7/14 nanoFU Architecture
Parameterised queue sizes: 4, 5, 6 and 8
Dual-rail, performance-oriented description style
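
For readers unfamiliar with dual-rail signalling, the following Python sketch shows the usual encoding of data on two wires per bit. The transcript does not say which dual-rail protocol the nanoFU uses, so the return-to-zero convention below (one rail high for a valid bit, both low for the spacer) and all function names are assumptions made for illustration.

```python
NULL = (0, 0)  # spacer state: neither rail is high, no data present


def encode_bit(bit):
    """Return (true_rail, false_rail) for one data bit."""
    return (1, 0) if bit else (0, 1)


def encode_word(value, width):
    """Encode an unsigned integer as dual-rail pairs, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def is_complete(word):
    """Completion detection: every bit pair has exactly one rail high."""
    return all(t + f == 1 for t, f in word)


def decode_word(word):
    """Decode a complete codeword; reject incomplete or illegal ones."""
    if not is_complete(word):
        raise ValueError("codeword incomplete or illegal")
    return sum(t << i for i, (t, f) in enumerate(word))
```

Because validity is carried by the data wires themselves, a reader that samples a word while it is still changing can observe a mixture of valid and spacer bits, which `is_complete` reports as not yet complete.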

8/14 Implementation Issues
Synchronisation between processes:
– Use data tokens instead of sync channels to increase performance
– Speculative buffer reads to decouple arrival and forwarding
– Buffer cell locking to decouple forwarding and allocation
– Drawbacks: power and area penalty

9/14 Implementation Issues
CAM implementation based on comparators – relatively simple but still slow
Register bank operation:
– Potential hazards in dual-rail if speculatively reading while writing
– Register read must wait for Lookup to provide “default” forwarding value
– Number of tokens in pipeline guarantees that writeout never conflicts with reading
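
A comparator-based CAM lookup of the kind mentioned on this slide can be sketched as follows. This is a behavioural illustration only: the function name, the tuple layout of a cell and the fallback to a register-bank value when nothing matches are assumptions, not details confirmed by the transcript.

```python
def lookup(cells, reg, regfile):
    """Search the forwarding queue for the newest result destined for `reg`.

    cells   -- list of (valid, dest, value) tuples, oldest first (hypothetical layout)
    reg     -- register number being read
    regfile -- mapping of register number to its committed value
    Returns (value, forwarded), where forwarded is True on a queue hit.
    """
    # One equality comparator per cell; in hardware these would work in parallel.
    hits = [valid and dest == reg for valid, dest, _ in cells]
    # Priority selection: the newest matching cell holds the most recent value.
    for index in range(len(cells) - 1, -1, -1):
        if hits[index]:
            return cells[index][2], True
    # No cell matched: fall back to the register bank as the default value.
    return regfile[reg], False
```

The cost pointed out on the slide is visible in the sketch: every lookup needs one comparison per cell plus a priority selection among the matches, so the logic grows with the queue size.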

10/14 Simulation Results Pre-layout, transistor-level simulations, 180nm technology

11/14 Balsa Limitations: Highlights
Need for:
– Efficient ways of describing and synthesising associative arrays
– Deadlock-safe implementation that allows concurrent writes and reads of variables (for speculative reading)
– Signal-level manipulation to avoid excessive synchronisation
Some peephole optimisations (next talk)

12/14 Conclusions

13/14 Future Work
Extend the nanoSpa pipeline by including a memory stage, and evaluate the performance of the forwarding unit within this architecture
Implement and explore the effects of the suggested optimisations and components

14/14 Thank you very much! Questions?
Acknowledgement
Thanks to Luis Plana, Andrew, Charlie and Will for their suggestions and comments. This work and PhD are supported by EPSRC and UoM School of Computer Science scholarships.