1/26 Performance-oriented Optimisation of Balsa Dual-Rail Circuits
Luis Tarazona
Advanced Processor Technologies Group, School of Computer Science

2/26 Agenda
Some examples of performance-oriented coding
Current Balsa optimisations:
1. Eliminating redundant FalseVariable components
2. Concurrent RTZ Fetch component
3. Conditional parallel/sequencer component: ParSeq
4. Write-after-read loop unrolling
5. CaseFetch with “default”

3/26 Viterbi Decoder BMU description
BMU – Branch Metric Unit
Algorithm:
Read inputs a, c
Calculate b = 7 - a, d = 7 - c
Calculate ac = a + c, ad = a + d, bd = b + d, bc = b + c
Normalise by subtracting min(ac, ad, bd, bc) from ac, ad, bd, bc
Output {ac, ad, bd, bc}
Can be pipelined
Can be unrolled
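As a behavioural illustration only, the BMU algorithm above can be sketched in software (a Python model of the data flow, not the Balsa description; the function name `bmu` is my own):

```python
def bmu(a, c):
    """Behavioural sketch of the Branch Metric Unit algorithm:
    compute the four branch metrics and normalise them."""
    b = 7 - a                         # b = 7 - a
    d = 7 - c                         # d = 7 - c
    ac, ad = a + c, a + d             # the four metric sums
    bd, bc = b + d, b + c
    m = min(ac, ad, bd, bc)           # normalisation offset
    return (ac - m, ad - m, bd - m, bc - m)
```

In hardware, each of these steps can become a pipeline stage, and the four sums can be computed in unrolled (parallel) datapaths, as the optimised description on the next slides does.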

4/26 Viterbi Decoder BMU HC Data + control Control

5/26 Optimised BMU description
Pipeline & unroll operations
Pipelined multicast

6/26 Optimised BMU HC
Data + control / Control
Small scattered control trees
Pipelined “data-driven” structure

7/26 Viterbi Decoder PMU Description
PMU – Path Metric Unit
Find global winner
Normalise state
Algorithm: Let MemState be an array of values, then
Global winner = the unique MemState[i] equal to zero; otherwise no GW is found
Normalise the values by subtracting min(all MemState) from every MemState[i]
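Again as a behavioural illustration (a Python sketch of the algorithm stated above, not the Balsa description; `pmu_step` is my own name):

```python
def pmu_step(mem_state):
    """Behavioural sketch of the PMU algorithm: find the global winner
    (the unique entry equal to zero) and normalise the state."""
    zeros = [i for i, v in enumerate(mem_state) if v == 0]
    gw = zeros[0] if len(zeros) == 1 else None   # unique zero, else no GW
    m = min(mem_state)                           # normalisation offset
    normalised = [v - m for v in mem_state]
    return gw, normalised
```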

8/26 PMU HC Data + control Control

9/26 Optimised PMU description

10/26 Optimised PMU HC
Data + control / Control
Small scattered control trees

11/26 Agenda
Some examples of performance-oriented coding
Current Balsa optimisations:
1. Eliminating redundant FalseVariable components
2. Concurrent RTZ Fetch component
3. Conditional parallel/sequencer component: ParSeq
4. Write-after-read loop unrolling
5. CaseFetch with “default”

12/26 Eliminating redundant FVs
Targets active input control
Single-access, single read-port FalseVariable or eagerFalseVariable HCs
i -> then CMD end

13/26 Eliminating redundant FVs - Example a, b -> then o <- a + b end

14/26 Eliminating redundant FVs - Example
Latency and area reduction
Preserves external behaviour
a, b -> then o <- a + b end

15/26 Concurrent RTZ Fetch Wires-only dual-rail Fetch and its STG

16/26 Concurrent RTZ Fetch New concurrent RTZ Fetch and its STG

17/26 The ParSeq
Acts conditionally as a Concur (parallel) or as a Sequencer HC
Few opportunities to apply it in the available design examples – perhaps because the component did not exist when they were written?
Interesting increase in performance, though.
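The ParSeq’s behaviour can be sketched in software as follows (a Python model using threads to stand in for concurrent handshake activations; this is an illustration of the component’s contract, not its handshake-circuit implementation, and `par_seq` is my own name):

```python
import threading

def par_seq(parallel, tasks):
    """Behavioural sketch of the ParSeq component: activate the given
    commands concurrently (like a Concur HC) or one after another
    (like a Sequencer HC), selected by a condition."""
    if parallel:
        threads = [threading.Thread(target=t) for t in tasks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()          # the activation completes when all are done
    else:
        for t in tasks:
            t()               # strictly sequential activation
```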

18/26 Handshake circuit implementation ||

19/26 Optimised ParSeq Schematics

20/26 Some Simulation Results
Pre-layout, transistor-level simulations, 180 nm technology

21/26 Write-after-read Loop unrolling
WAR hazards prevent the use of a T-element-based Seq

22/26 Write-after-read Loop unrolling
First-read unrolling reorders the operations inside the loop, allowing the use of a T-element-based sequencer
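The dependency reordering behind first-read unrolling can be illustrated in software (a Python sketch of the transformation’s effect on a read-then-write loop, not the circuit-level transformation itself; both function names and the `x + 1` body are my own, chosen only to make the WAR dependency concrete):

```python
def war_loop(x, n):
    """Original shape: each iteration reads a variable and then writes
    it back, so consecutive iterations have a write-after-read hazard."""
    for _ in range(n):
        r = x          # read
        x = r + 1      # write to the same variable (WAR across iterations)
    return x

def first_read_unrolled(x, n):
    """Transformed shape: the first read is peeled out of the loop, so
    inside the loop body the write now precedes the next read."""
    if n == 0:
        return x
    r = x              # first read, hoisted out of the loop
    for _ in range(n):
        x = r + 1      # write...
        r = x          # ...then read, removing the cross-iteration WAR
    return x
```

Both versions compute the same result; only the ordering of the read and write inside the loop body changes.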

23/26 Write-after-read Loop unrolling Automatic unrolling if unbounded WAR loop detected

24/26 Write-after-read Loop unrolling
Performance gain depends on data width
Upper-bound performance gains (transistor-level simulation results, 180 nm)
In real examples (32-bit processor units), speed-ups of around 4-6% for 32-bit data widths

25/26 CaseFetch with “default”
A “default” signal decides whether the selected input or a default value is sent to the output
Used to override possibly spurious data (e.g. when reading a variable/channel speculatively while it is being written) when using a DI encoding
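Behaviourally, the component acts as a multiplexer with an override, which can be sketched as follows (a Python illustration of the selection behaviour only, not the dual-rail implementation; the name `case_fetch` and the parameters are my own):

```python
def case_fetch(inputs, index, default, default_value=0):
    """Behavioural sketch of CaseFetch with "default": forward the
    selected input, or substitute a safe default value when the selected
    data may be spurious (e.g. a speculative read during a write)."""
    return default_value if default else inputs[index]
```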

26/26 Conclusions and Future Work
Conclusions:
Coding style heavily influences performance and control-optimisation opportunities
– Large speed-ups at low area cost (2x-20x speed-up for a 10-25% area overhead)
– Pipeline-style descriptions normally have small control trees, which leave less room for control resynthesis
– Datapath optimisations are needed to increase performance further
The presented optimisations and components improve the performance of highly optimised code by ~10%
Future work:
Incorporate the optimisations into the Balsa design flow
ParSeq: as a language construct or as a peephole optimisation?
Add a performance-oriented coding-style guide/examples to the Balsa manual