Soha Hassoun, Tufts University, Medford, MA. Thanks to: Carl Ebeling, University of Washington, Seattle, WA. Fine Grain Incremental Rescheduling Via Architectural Retiming.

Presentation transcript:

Fine Grain Incremental Rescheduling Via Architectural Retiming. Soha Hassoun, Tufts University, Medford, MA. Thanks to: Carl Ebeling, University of Washington, Seattle, WA.

Example [figure: RAM with Write Address, Read Address, and Offset]. Problem: the clock period is too large.

Pipelining [figure: RAM with Write Address, Read Address, and Offset]. Problem: consecutive dependent operations.

Performance Bottleneck: latency-constrained paths (Latency = n). Approach: apply architectural retiming at the RT level.

Architectural Retiming. Problem: too much work, too little time. [Figure: signal y_k, a pipeline register, and a negative register N, with signals D and C marked.] The negative register is realized by precomputation or by prediction.

Outline. Precomputation: incremental rescheduling without resource constraints. Prediction: incremental rescheduling with resource constraints. Results.

Precomputation Function [figure: blocks g, f, h; signals y_k, x'_i, x_i; negative register N with signals C and D].

$$D^{t} = C^{t+1} = f(\ldots, x_i^{t+1}, \ldots)$$

$$x_i^{t+1} = {x'_i}^{t} = g(\ldots, y_k^{t}, \ldots)$$

$$D^{t} = f(\ldots, g(\ldots, y_k^{t}, \ldots), \ldots) = f'(\ldots, y_k^{t}, \ldots)$$
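The substitution above can be checked on a toy model. The following is a minimal sketch, not the author's tool: the concrete functions `g` and `f`, the modulus 16, and the helper `simulate` are all hypothetical choices made only to show that D at time t equals C at time t+1 once f' is formed as the composition of f and g.

```python
from typing import List, Tuple

def g(y: int) -> int:
    """Hypothetical logic feeding the register x_i (produces x'_i)."""
    return (3 * y + 1) % 16

def f(x: int) -> int:
    """Hypothetical logic computing C from the register output x_i."""
    return (x * x) % 16

def f_prime(y: int) -> int:
    """Precomputation function: f'(y_k) = f(g(y_k))."""
    return f(g(y))

def simulate(ys: List[int], x_init: int = 0) -> Tuple[List[int], List[int]]:
    """Return per-cycle traces of C (original) and D (precomputed)."""
    x = x_init              # register x_i
    c_trace, d_trace = [], []
    for y in ys:
        c_trace.append(f(x))        # C^t = f(x_i^t)
        d_trace.append(f_prime(y))  # D^t = f'(y_k^t)
        x = g(y)                    # x_i^{t+1} = x'_i^t = g(y_k^t)
    return c_trace, d_trace

C, D = simulate([5, 2, 7, 1, 4])
# D runs one cycle ahead of C: D^t == C^{t+1}
assert D[:-1] == C[1:]
```

In this sketch f' has a longer combinational path than f alone, which is why the transformation goes hand in hand with rescheduling.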

Incremental Rescheduling [figure: blocks g, f, h, f'; signal y_k; negative register N]. Original schedule: time n: g; time n+1: f, h. Schedule with f': time n: f'; time n+1: h.

Precomputing With Register Arrays [figure: register array with Write Address, Write Data, Read Address, Read Data; output Out; negative register N; signal F].

$$F^{t} = \mathit{Out}^{t+1} = \mathit{Array}^{t+1}[\mathit{ReadAddress}^{t+1}]$$

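Synthesizing Bypass Paths [figure: register array with Write Address, Write Data, Precomputed Read Address, and Read Data, plus an equality comparator (= ?); shown next to the original array with Write Address, Read Address, Write Data, Read Data].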
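A bypass path of this kind can be sketched in a few lines. This is an illustrative model under stated assumptions, not the synthesized circuit: it assumes one write per cycle that takes effect at the clock edge, and that the next cycle's read address is already available as the precomputed read address. The names `simulate_array`, `writes`, and `read_addrs` are hypothetical.

```python
from typing import List, Tuple

def simulate_array(
    writes: List[Tuple[int, int]],  # (write address, write data) per cycle
    read_addrs: List[int],          # read address per cycle
    size: int = 4,
) -> Tuple[List[int], List[int]]:
    """Return traces of Out (original read) and F (precomputed read)."""
    array = [0] * size
    out_trace, f_trace = [], []
    for t, (wa, wd) in enumerate(writes):
        # Original read: Out^t = Array^t[ReadAddress^t]
        out_trace.append(array[read_addrs[t]])
        if t + 1 < len(read_addrs):
            ra_next = read_addrs[t + 1]  # precomputed read address
            # Bypass: if this cycle's write hits the address read next
            # cycle, forward the write data instead of the stale entry.
            f_trace.append(wd if wa == ra_next else array[ra_next])
        array[wa] = wd                   # write lands at the clock edge
    return out_trace, f_trace

out, F = simulate_array([(0, 7), (1, 9), (1, 3), (2, 5)], [0, 0, 1, 3])
# F^t == Out^{t+1}
assert F == out[1:]
```

The comparison `wa == ra_next` plays the role of the equality comparator in the figure; without it, F would return the stale array contents whenever a write and the following read target the same entry.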

Precomputing RAM Output [figure: RAM and negative register N].

Prediction [figure: blocks g_i and f; signal Z; negative register N with signals C and D]. What if we can't precompute, precomputation needs too many additional resources, or performance is unsatisfactory? Predict C one cycle before its arrival.

Schedule with Mispredictions [figure: schedule over cycles t-1, t, t+1 for signals C (values c1, c2) and H (values h1, h2) through registers R1 and R2. The negative register supplies predicted values c1*, c2*; a verify step checks c1* =? c1 and c2* =? c2.]
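The predict-then-verify schedule can be mimicked with a small cycle-counting model. This is a rough sketch under assumptions the slides do not state: a misprediction is assumed to cost exactly one extra cycle, the predictor is a hypothetical "repeat the last value" rule, and `h` is a placeholder for the downstream logic.

```python
from typing import Callable, List, Tuple

def predict_last(history: List[int]) -> int:
    """Hypothetical predictor: repeat the last observed value (0 at start)."""
    return history[-1] if history else 0

def h(c: int) -> int:
    """Placeholder for the logic H that consumes C."""
    return 2 * c + 1

def run_with_prediction(
    actual: List[int],
    predict: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Return committed H values and the total cycle count."""
    results, cycles = [], 0
    history: List[int] = []
    for c in actual:
        c_star = predict(history)  # negative register output c*
        h_value = h(c_star)        # speculative work using c*
        cycles += 1
        if c_star != c:            # verify: c* =? c
            h_value = h(c)         # nullify and redo with the real value
            cycles += 1            # assumed one-cycle penalty
        results.append(h_value)
        history.append(c)
    return results, cycles

actual = [1, 1, 1, 0, 0, 1]
results, cycles = run_with_prediction(actual, predict_last)
assert results == [h(c) for c in actual]  # functionality is preserved
assert cycles == 9                        # 6 values + 3 mispredictions
```

Functionality is preserved regardless of prediction accuracy; only the cycle count depends on how often the verify step fails.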

Synthesis Issues in Prediction. Negative register as predicting FSM: use signal transition probabilities; incorporate don’t-care conditions. Nullifying mispredictions. Two correction strategies: As-Soon-As-Possible restoration and As-Late-As-Possible correction. Add handshaking signals to coordinate with the interface.
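One way to picture a predictor built from signal transition probabilities is a table that maps the current value to its most frequently observed successor. The slides do not say how the predicting FSM is actually synthesized, so the sketch below is only an assumed illustration; `build_predictor` and the sample trace are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List

def build_predictor(trace: List[int]) -> Dict[int, int]:
    """Map each observed value to its most frequent next value."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        counts[current][following] += 1
    # For each current value, keep the successor with the highest count.
    return {cur: ctr.most_common(1)[0][0] for cur, ctr in counts.items()}

trace = [0, 0, 0, 1, 0, 0, 1, 1, 0]
table = build_predictor(trace)
# From 0: next is 0 three times, 1 twice. From 1: next is 0 twice, 1 once.
assert table == {0: 0, 1: 0}
```

Values that never occur in the trace have no table entry, which is where don't-care conditions would give a synthesis tool freedom to pick whatever is cheapest.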

Related Work. Precomputation: bypass synthesis; lookahead [Kogge ‘81, …..]. Prediction / speculative execution: most likely path, arbitrarily deep [Holtmann & Ernst ‘93,’95]; pre-execution [Radivojevic & Brewer ‘94]; possibly multiple paths and arbitrarily deep [Lakshminarayana et al. ‘98]; percolation scheduling [Potasman et al. ‘90].

Results

Architectural Retiming. Improves throughput while preserving functionality and sometimes latency. Bridges the gap between HLS and logic optimizations. Unifies several sequential optimizations: bypass synthesis, lookahead transformation, branch prediction, and fine-grain cross-register optimizations.

Ph.D. Forum at DAC ‘99. Goal: increase interaction between academia and industry. Format: students present work at a poster session at DAC; researchers give feedback. Who’s eligible? Students within 1 or 2 years of finishing their Ph.D. thesis.

The End

Precomputing in Single-Register Cycles [figure: original circuit with A and B; negative register N inserted; transformed circuit with A' and B']. Lookahead: A(n) is a function of B(n-2) [Kogge, ‘81], [Parhi & Messerschmitt, ‘89].
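The lookahead idea can be illustrated on a generic first-order recurrence. This sketch is not the circuit from the slide: the recurrence a(n) = c*a(n-1) + u(n), the coefficient `c`, and both function names are assumptions chosen to show how a value can be rewritten to depend on state two steps back instead of one.

```python
from typing import List

def direct(u: List[int], c: int, a0: int = 0) -> List[int]:
    """a(n) = c * a(n-1) + u(n): each value depends on the previous one."""
    a, out = a0, []
    for x in u:
        a = c * a + x
        out.append(a)
    return out

def lookahead(u: List[int], c: int, a0: int = 0) -> List[int]:
    """a(n) = c^2 * a(n-2) + c * u(n-1) + u(n): depends on state two back."""
    out: List[int] = []
    for n, x in enumerate(u):
        if n == 0:
            out.append(c * a0 + x)                 # no history yet
        elif n == 1:
            out.append(c * c * a0 + c * u[0] + x)  # a(-1) is a0
        else:
            out.append(c * c * out[n - 2] + c * u[n - 1] + x)
    return out

u = [3, 1, 4, 1, 5, 9]
assert lookahead(u, c=2) == direct(u, c=2)
```

The rewritten form does more arithmetic per value, but the loop-carried dependence now spans two cycles, which leaves room for an extra register in the cycle.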

Speculative Execution: Scope and Depth [figure: nodes labeled c1 through c6].