Introduction to VLSI Programming Lecture 8: High Performance (DLX)

Presentation transcript:

Introduction to VLSI Programming Lecture 8: High Performance (DLX) (course 2IN30) Prof. dr. ir. Kees van Berkel, Dr. Johan Lukkien

Time table 2005 (date: class | lab hours, subject)
  Aug. 30: 2 | 0 hours, intro; VLSI
  Sep. 6: 3 | 0 hours, handshake circuits
  Sep. 13: handshake circuits assignment
  Sep. 20: Tangram
  Sep. 27: no lecture
  Oct. 4:
  Oct. 11: 1 | 2 hours, demo, fifos, registers | deadline assignment
  Oct. 18: design cases;
  Oct. 25: DLX introduction
  Nov. 1: low-cost DLX
  Nov. 8: high-speed DLX
  Dec. 13: deadline final report
12/31/2018 Kees van Berkel

Lecture 8
Outline:
  Recapitulation of Lecture 7
  VLSI programming for high performance:
    parallelism: expressions, commands, loops, pipelining
    pipelining the DLX
  Lab work: improve performance of Tangram DLX by introducing pipelining

DLX instruction formats
Bit fields: 31-26, 25-21, 20-16, 15-11, 10-0
  R-type (reg-reg ALU operations): Opcode, rs1, rd, rs2, function
  I-type (loads, stores, conditional branch, ..): Opcode, rs1, rd, Immediate
  J-type (jump, jump and link, trap, return from exception): Opcode, offset
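
The bit boundaries on this slide can be turned into a small field extractor. The following is a minimal Python sketch, not part of the course material; the function name `decode_fields` and the neutral key names `bits_20_16` and `bits_15_11` are assumptions made here, because the transcript does not pin down which of rd and rs2 occupies which middle field in the R-type format. The 16-bit immediate and 26-bit offset widths follow from the listed boundaries.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word at the boundaries 31-26, 25-21, 20-16, 15-11, 10-0.

    All candidate fields are returned; which ones are meaningful depends on
    whether the opcode selects an R-type, I-type or J-type instruction.
    """
    word &= 0xFFFFFFFF                        # keep only the low 32 bits
    return {
        "opcode":     (word >> 26) & 0x3F,    # bits 31..26, present in every format
        "rs1":        (word >> 21) & 0x1F,    # bits 25..21 (R-type and I-type)
        "bits_20_16": (word >> 16) & 0x1F,    # second register field
        "bits_15_11": (word >> 11) & 0x1F,    # third register field (R-type only)
        "function":   word & 0x7FF,           # bits 10..0 (R-type)
        "immediate":  word & 0xFFFF,          # bits 15..0 (I-type)
        "offset":     word & 0x3FFFFFF,       # bits 25..0 (J-type)
    }


# Hypothetical I-type word: opcode 35, rs1 = 1, second register = 2, immediate = 16.
example = (35 << 26) | (1 << 21) | (2 << 16) | 16
fields = decode_fields(example)
```

Returning every candidate field keeps the sketch independent of any particular opcode table; a real decoder would first inspect `opcode` and then read only the fields of the selected format.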

Example instructions

DLX interface, state
[Figure: DLX CPU with state pc and register file Reg (r0, r1, r2, ... r31); connected to Instruction memory (address, instruction) and Mem (Data memory: address, data, r/w); inputs clock and interrupt.]

VLSI programming for …
  Low costs: introduce resource sharing.
  Low delay (high throughput): introduce parallelism.
  Low energy (low power): reduce activity; …

VLSI programming for high performance
  Keep it simple!!
  Make the analysis; focus on bottlenecks
  Introduce parallelism: expressions, commands, loops, pipelining
  Enable parallelism, by reducing dependencies such as resource sharing

Expression-level parallelism
Examples:
  balancing: (v+w)+(x+y) is faster than v+w+x+y
  substitution: z:=g(f(x)) is faster than y:= f(x) ; z:= g(y)
  carry-select adder
  carry-save multiplier
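
The balancing example can be made concrete by counting adders on the critical path. This is a rough Python sketch under two assumptions that are not stated on the slide: every two-input adder has the same unit delay, and v+w+x+y is evaluated left to right as ((v+w)+x)+y. The function names are illustrative only.

```python
def chain_depth(n_operands: int) -> int:
    """Adders on the critical path when the sum is evaluated strictly left to right."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands: int) -> int:
    """Adders on the critical path of a balanced tree of two-input adders.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    """
    return (n_operands - 1).bit_length() if n_operands > 1 else 0


# Four operands: v+w+x+y versus (v+w)+(x+y).
chain = chain_depth(4)        # three adders in series
balanced = balanced_depth(4)  # two adder levels
```

Both forms use the same number of adders; only the longest path through them differs, which is why balancing reduces delay without adding hardware in this simple model.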

Command level parallelism
If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2.
(dependencies: data, sharing, synchronization)
This reduces the computation time, unless ordering is enforced through external synchronization. With τ(S) the computation time of S:
  τ(S1 ; S2) = τ(;) + τ(S1) + τ(S2)
  τ(S1 || S2) = τ(||) + max(τ(S1), τ(S2))
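
The two timing rules on this slide can be tried out with concrete numbers. The sketch below is a plain Python restatement of those rules; the function names and the default overhead of zero for the `;` and `||` operators are assumptions chosen for illustration.

```python
def seq_time(t_s1: float, t_s2: float, t_seq_op: float = 0.0) -> float:
    """Time of S1 ; S2: operator overhead plus both commands one after the other."""
    return t_seq_op + t_s1 + t_s2


def par_time(t_s1: float, t_s2: float, t_par_op: float = 0.0) -> float:
    """Time of S1 || S2: operator overhead plus the slower of the two commands."""
    return t_par_op + max(t_s1, t_s2)


# Hypothetical delays: S1 takes 3 units, S2 takes 5, each operator costs 1.
sequential = seq_time(3, 5, t_seq_op=1)   # 1 + 3 + 5
parallel = par_time(3, 5, t_par_op=1)     # 1 + max(3, 5)
```

The gain is largest when S1 and S2 take about equal time; if one command dominates, the parallel composition saves little.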

Exposure of cmd-level parallelism
Let *[S] be a shorthand for forever do S od.
Assume S0 must precede S1 and S1 must precede S2.
How to speed up *[ S0 ; S1 ; S2 ]?
  *[ S0 ; S1 ; S2 ]
= { loop unfolding }
  S0 ; *[ S1 ; S2 ; S0 ]
= { S0 does not depend on S2 }
  S0 ; *[ S1 ; (S2 || S0) ]

wagging
  *[a?x ; b!f(x)]
= { loop unrolling, renaming }
  *[a?x ; b!f(x) ; a?y ; b!f(y)]
= { loop folding }
  a?x ; *[b!f(x) ; a?y ; b!f(y) ; a?x]
≤ { increases slack by 1 }
  a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)]

Parallel reads from REG file
Let RF be a register file. Then x:= RF[i] ; y:= RF[j] cannot be parallelized. (Register files have a single read port.)
Parallel read actions can be realized by doubling the register file:
  << RF[i] , RG[i] >> := << z , z >>  { write }
and
  << x , y >> := << RF[i] , RG[j] >>  { read }
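
The doubling trick can be modelled in a few lines of Python. This is only a behavioural sketch: the class name `DoubledRegisterFile`, the method names, and the default size of 32 entries are assumptions, and real concurrency of the two reads is not modelled, only the fact that each read uses its own copy.

```python
class DoubledRegisterFile:
    """Two identical copies, RF and RG, so two reads never share a read port."""

    def __init__(self, size: int = 32):
        self.rf = [0] * size   # copy used for the first operand
        self.rg = [0] * size   # copy used for the second operand

    def write(self, i: int, z: int) -> None:
        # << RF[i] , RG[i] >> := << z , z >>  : every write goes to both copies
        self.rf[i] = z
        self.rg[i] = z

    def read2(self, i: int, j: int) -> tuple:
        # << x , y >> := << RF[i] , RG[j] >>  : one read per copy
        return self.rf[i], self.rg[j]


regs = DoubledRegisterFile()
regs.write(3, 7)
regs.write(5, 9)
x, y = regs.read2(3, 5)
```

The price is twice the storage and two writes per update, which is the usual trade of area for speed that this lecture is about.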

Pipelining in Tangram
Compare three programs:
  P0: *[ a?x0 ; b!f2(f1(f0(x0))) ]
  P1: *[ a?x0 ; x1:= f0(x0) ; x2:= f1(x1) ; b!f2(x2) ]
  P2: *[ a?x0 ; a1!f0(x0) ] || *[ a1?x1 ; a2!f1(x1) ] || *[ a2?x2 ; b!f2(x2) ]

Pipelining in Tangram (cont'd)
  Output sequence b is identical for P0, P1, and P2.
  P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
  P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
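
The claim that P0 and P2 produce the same output sequence on b can be checked with a small software model. The Python sketch below is an approximation, not Tangram semantics: each stage is a thread, each channel is a `queue.Queue(maxsize=1)` (which buffers one value, whereas a handshake channel has no slack), and the `DONE` sentinel, `stage`, `run_p0` and `run_p2` are names invented here so that the otherwise endless loops can terminate. It illustrates ordering only; Python threads will not show the throughput gain.

```python
import queue
import threading

DONE = object()  # sentinel that shuts the pipeline down (not part of the original programs)


def stage(f, inp, out):
    """Model of *[ inp?x ; out!f(x) ] that stops when it receives DONE."""
    while True:
        x = inp.get()
        if x is DONE:
            out.put(DONE)
            return
        out.put(f(x))


def run_p0(f0, f1, f2, inputs):
    """Model of P0: one loop computing f2(f1(f0(x0))) per input."""
    return [f2(f1(f0(x))) for x in inputs]


def run_p2(f0, f1, f2, inputs):
    """Model of P2: three stages connected by channels a, a1, a2, b."""
    a, a1, a2, b = (queue.Queue(maxsize=1) for _ in range(4))
    workers = [threading.Thread(target=stage, args=args)
               for args in ((f0, a, a1), (f1, a1, a2), (f2, a2, b))]

    def feed():
        for x in inputs:
            a.put(x)
        a.put(DONE)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()
    results = []
    while True:                      # the environment reading channel b
        y = b.get()
        if y is DONE:
            break
        results.append(y)
    for w in workers:
        w.join()
    return results


def f0(x): return x + 1
def f1(x): return x * 2
def f2(x): return x - 3


out_p0 = run_p0(f0, f1, f2, range(5))
out_p2 = run_p2(f0, f1, f2, range(5))
```

Because each queue is first-in first-out and each stage handles one value at a time, values cannot overtake each other, so the order on b matches the order on a in both models.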

DLX: 5-step sequential execution
[Figure: datapath in five steps IF, ID, EX, MM, WB; blocks Instr. mem, Reg, Mem; registers pc, npc, ir, A, B, Imm, aluo, cond, lmd; an increment by 4 and a zero test (0?).]

DLX: pipelined execution
[Figure: pipeline timing diagram. Horizontal axis: Time → [in clock cycles] 1 2 3 4 5 6 7 8 ...; other axis: Program execution → [instructions]; each instruction passes through IF ID EX MM WB.]
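
The timing diagram can be regenerated and its cycle counts checked with a short Python sketch. It assumes an idealized pipeline: one stage per clock cycle, one new instruction entering per cycle, and no stalls or hazards. None of these assumptions comes from the transcript, and the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]


def sequential_cycles(n_instructions: int) -> int:
    """Cycles when each instruction finishes all five steps before the next starts."""
    return n_instructions * len(STAGES)


def pipelined_cycles(n_instructions: int) -> int:
    """Cycles in the ideal pipeline: fill time plus one cycle per further instruction."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


def diagram(n_instructions: int) -> list:
    """One text row per instruction, shifted one cycle (three characters) per row."""
    return ["   " * i + " ".join(STAGES) for i in range(n_instructions)]


for row in diagram(4):
    print(row)
```

In this model the speedup for a long instruction stream approaches the number of stages; hazards and branches, the topic of the next lecture, reduce it in practice.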

DLX: pipelined execution
[Figure: pipelined datapath with stages Instruction Fetch, Inst. Decode, EXecute, Memory, Write Back; blocks pc, Instr. mem, Reg, Mem; an increment by 4 and a zero test (0?).]

Lab work
Assignment 5:
  Create a 2-stage pipelined dlx2.tg
  Throughput must exceed 5 MIPS (benchmark = GCD).
  Design a reduced-cost version dlx2s.tg
Note: use of shared variables is not allowed. Let command S1 || S2 be part of your DLX. When S1 has write access to variable x, S2 may neither read nor write x (and vice versa).
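
The rule in the note can be written as a check on read and write sets. The Python sketch below is a hypothetical helper, not a lab tool: the function name and the idea of listing each command's variables by hand are assumptions, and the variable names in the example are borrowed from the datapath slide purely for illustration.

```python
def may_run_in_parallel(reads1, writes1, reads2, writes2) -> bool:
    """True if S1 || S2 respects the rule: a variable written by one command
    is neither read nor written by the other."""
    w1, w2 = set(writes1), set(writes2)
    s1_disturbs_s2 = w1 & (set(reads2) | w2)
    s2_disturbs_s1 = w2 & (set(reads1) | w1)
    return not s1_disturbs_s2 and not s2_disturbs_s1


# Both commands read ir and write different variables: allowed.
ok = may_run_in_parallel({"ir"}, {"pc"}, {"ir"}, {"aluo"})

# S1 writes ir while S2 reads it: not allowed.
clash = may_run_in_parallel({"pc"}, {"ir"}, {"ir"}, {"aluo"})
```

Shared reads are harmless under this rule; only a write on one side combined with any access on the other side forbids the parallel composition.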

Next week: lecture 9
Outline:
  Pipelining the DLX, using branch-delay slots.
  Lab work: Assignment 6 (3-stage DLX)

DLX system organization
[Figure: system_dlx(…) contains dlx(…), rom(…) and ram(…) within the system boundary; channels ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM; file: gcd.bin; files: RAMin, RAMout.]

dlx0.ht
#include types.ht
& dlx0 : export proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ).
  begin
    …
    RF: ram array U5 of S30
  end

system_dlx0.ht
#include "dlx0.ht"
& dlx0 : proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ). import
& env_dlx4 : main proc (
  & ROMfile? chan word
  & RAMinfile? chan S30
  & RAMfile! chan S30   /* <<address,data>> */
  ).
  begin
    next slide
  end

system_dlx0.ht : main body
begin
  & ROMaddr : chan adtype
  & ROMdata : chan word
  & RAMaddr : chan rwadtype
  & datatoRAM : chan S30
  & datafromRAM: chan S30
  …
  & ROMinterface : proc() . begin .. end
  & RAMinterface : proc() . begin .. end
|
  initialise() ;
  ROMinterface() || RAMinterface() ||
  dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
end

script
htcomp -B system_dlx0
htsim -limit 1000 system_dlx0 gcd.bin RAMin RAMout
htview system_dlx0