Hardware Microarchitecture Lecture-1 <Ch. 16,17,18,19>


1 Hardware Microarchitecture Lecture-1 <Ch. 16,17,18,19>
ELE-580i PRESENTATION-I 04/01/2003 Canturk ISCI

2 ROUTER ARCHITECTURE
Router: registers, switches, functional units, control logic
- Implements routing & flow control
- Pipelined
- Uses credits to track buffer space: flits flow downstream, credits flow upstream; together they constitute the credit loop

3 ROUTER Diagram: Virtual Channel Router
Datapath: input units | switch | output units
Control: router (route computation), VC allocator, switch allocator
- Input unit: a state vector (fields G, R, O, P, C) and a flit buffer, one of each per VC
- Output unit: latches outgoing flits; state vector (fields G, I, C)
- Switch: connects inputs to outputs according to the SA
- VC allocator (VCA): arbitrates output-channel requests from each input packet; runs once per packet!
- Switch allocator (SA): arbitrates output-port requests from the input ports; runs once per flit
- Router: determines the output ports for packets

4 VC State Fields
Input virtual channel (one state vector per VC):
- G: global state (I, R, V, A, or C)
- R: route; the output port for the packet
- O: the output VC of port R for the packet
- P: pointers; flit head and tail pointers
- C: credit count; number of credits for output VC R.O
Output virtual channel (one state vector per VC):
- G: global state (I, A, or C)
- I: input VC; the input port.VC forwarding to this output VC
- C: credit count; number of free buffers at the downstream router

5 How it works
1) Packet → input controller:
   - Router → output port (e.g., P3)
   - VC allocator → output VC (e.g., P3.VC2)
   → Route determined
2) Each flit → input controller → SA → timeslot over the switch
   → Flit forwarded to the output unit
3) Each flit → output unit → drives the downstream physical channel
   → Flit transferred

6 Router Pipeline: RC | VA | SA | ST | TX
- Route Compute (RC): determine the output port for the packet header
- VC Allocate (VA): assign a VC on that port, if one is available
- Switch Allocate (SA): schedule the switch state according to the output-port requests
- Switch Traverse (ST): the input drives the switch toward its output port
- Transmit (TX): transmit the flit over the downstream channel
RC and VA run only for the header; the output channel is assigned to the whole packet.
SA, ST, and TX run for every flit; flits from different packets compete continuously.
Flits are transmitted sequentially so they can be routed at the next hop.
A rough sketch of this control flow follows.
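The stage functions below are stubs I have invented purely to show the per-packet vs. per-flit split; only the control flow mirrors the slide:

```c
#include <stdio.h>

static void route_compute(void)   { puts("RC"); }  /* pick output port     */
static void vc_allocate(void)     { puts("VA"); }  /* reserve an output VC */
static void switch_allocate(void) { puts("SA"); }  /* win a switch slot    */
static void switch_traverse(void) { puts("ST"); }  /* cross the switch     */
static void transmit(void)        { puts("TX"); }  /* drive the channel    */

/* RC and VA run once per packet (head flit only); SA, ST, TX run per flit. */
static void process_flit(int is_head) {
    if (is_head) { route_compute(); vc_allocate(); }
    switch_allocate();
    switch_traverse();
    transmit();
}

int main(void) {
    process_flit(1);   /* head flit: RC VA SA ST TX */
    process_flit(0);   /* body flit:       SA ST TX */
    return 0;
}
```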

7 Pipeline Walkthrough
(0) <start>: packet arrives at input port P4; the header selects VC3; the packet is stored in P4.VC3
    P4.VC3 (input VC): G=I | R=x | O=x | P=x | C=x
(1) <RC>: packet header → router → output port selected: P3
    P4.VC3: G=R | R=x | O=x | P=<head>,<tail??> | C=x
(2) <VA>: P3 → VCA → output VC allocated on port P3: VC2
    P4.VC3: G=V | R=P3 | O=x | P=<head>,<tail??> | C=x
    P3.VC2 (output VC): G=I | I=x | C=x

8 …Pipeline Walkthrough
(3) <SA>: packet-level processing is complete; switch allocation/traversal and transmit now proceed flit by flit
    P4.VC3 (input VC): G=A | R=P3 | O=VC2 | P=<head>,<tail??> | C=#
    P3.VC2 (output VC): G=A | I=P4.VC3 | C=#
    When the head flit wins the switch: move the pointers, decrement P4.VC3.C, and send a credit upstream to declare the freed buffer space
(4) <ST>: head flit arrives at the output VC
(5) <TX>: head flit transmitted downstream
(6) <Tail in SA>: packet done
(7) <Release resources>:
    P4.VC3: G=I (or R if a new packet is already waiting) | R=x | O=x | P=x | C=x
    P3.VC2: G=I | I=x | C=x

9 Pipeline Stalls
Packet stalls:
- P1) Input VC busy stall
- P2) Routing stall
- P3) VC allocation stall
Flit stalls:
- F1) Switch allocation stall
- F2) Buffer empty stall
- F3) Credit stall
Credit return cycles: pipeline (4) + round trip (4) + CT (1) + CU (1) + next SA (1) = 11
Timeline: after SA, the flit takes ST | W1 | W2 | RC | VA | SA, at which point the downstream router generates a new credit; CT | W1 | W2 carry the credit back upstream; CU increments the credit count; the next SA can then use it.
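The 11-cycle figure also sizes the buffers: to keep a VC busy, the downstream buffer must cover the whole credit round trip (the general form F ≥ t_crt · b / L_f; I am assuming one flit per cycle here). A sketch of the arithmetic:

```c
#include <stdio.h>

int main(void) {
    /* Components of the credit loop, per the slide's timeline. */
    int pipeline   = 4;   /* downstream pipe stages (RC VA SA ST)     */
    int round_trip = 4;   /* wire delay down and back (W1, W2, twice) */
    int ct         = 1;   /* credit transmit                          */
    int cu         = 1;   /* credit update (credit++)                 */
    int next_sa    = 1;   /* next switch allocation                   */
    int t_crt = pipeline + round_trip + ct + cu + next_sa;

    /* At one flit per cycle, >= t_crt flit cells per VC avoid credit stalls. */
    printf("credit loop = %d cycles -> need >= %d flit buffers\n", t_crt, t_crt);
    return 0;
}
```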

10 Channel Reallocation
1) Conservative:
- Wait until the credit for the tail flit is received from downstream before reallocating the output VC
2) Aggressive, single global state:
- Reallocate the output VC when the tail passes SA (same as a VA stall)
- Reallocate the downstream input VC when the tail passes SA (same as an input VC busy stall)

11 …Channel Reallocation
2) Aggressive, double global state:
- Reallocate the output VC when the tail passes SA (same as a VA stall)
- Eliminates the input VC busy stall
- Needs 2 input VC state vectors at the downstream router:
  For A: G=A | R=Px | O=VCx | P=<head A>,<tail A> | C=#
  For B: G=R | R=x | O=x | P=<head B>,<tail??> | C=x

12 Speculation and Lookahead
Reduce latency by reducing pipeline stages → speculation (and lookahead)
Speculative virtual channel allocation: do VA and SA concurrently
- If the VC set from RC spans more than one port, speculate on the port as well
- (Figures contrast successful and unsuccessful speculation)
Lookahead: do the route computation for node i at node i-1
- Start at VA at each node; overlap NRC and VA

13 Flit and Credit Format
Two ways to distinguish credits from flits:
- Piggybacking: include a credit field on each flit; no type field required
- Define types: e.g., 10 → start credit, 11 → start flit, 0x → idle
Flit format:
- Head flit: VC | Type (Credit) | Route info | Payload | CRC
- Body flit: VC | Type (Credit) | Payload | CRC
Credit format:
- Credit: VC | Type | Check
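As an illustration, the formats might pack into structures like these. Field widths and the 8-byte payload are assumptions; a real design would pack these bits into channel phits:

```c
#include <stdint.h>

typedef struct {                /* head or body flit */
    uint8_t  vc;                /* target virtual channel                  */
    uint8_t  type;              /* head/body, plus piggybacked credit bits */
    uint16_t route_info;        /* meaningful on head flits only           */
    uint8_t  payload[8];        /* data                                    */
    uint16_t crc;               /* error check over the flit               */
} flit_t;

typedef struct {                /* stand-alone credit */
    uint8_t vc;                 /* VC being credited           */
    uint8_t type;               /* marks this as a credit      */
    uint8_t check;              /* error check over the credit */
} credit_t;
```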

14 ROUTER COMPONENTS
Datapath:
- Input buffer: holds waiting flits
- Switch: routes flits from inputs to outputs
- Output unit: sends flits downstream
Control:
- Arbiter: grants access to shared resources
- Allocator: allocates VCs to packets and switch time to flits

15 Input Buffer
Smooths flit traffic; holds flits awaiting:
- VCs
- Switch bandwidth or channel bandwidth
Organization:
- Centralized
- Partitioned by physical channel
- Partitioned by VC

16 Centralized Input Buffer
A single memory shared across the entire router (I: node degree):
- No separate switch, but inputs must be multiplexed into the memory and the memory output demultiplexed to the output ports
Pro: flexibility in allocating memory space
Cons:
- High memory bandwidth requirement: 2I (I writes and I reads per flit time)
- Flit deserialization/reserialization latency: I flits must be gathered from the VCs before writing to memory

17 Partitioned Input Buffers
One buffer per physical input port:
- Memory bandwidth per buffer: 2 (1 read, 1 write)
- Buffers are shared across the VCs of a port but not across ports → less flexibility
One buffer per VC:
- Enables switch input speedup (at the cost of a bigger switch)
- Too fine a granularity → inefficient memory usage
Intermediate solutions: e.g., one memory for the even VCs and one for the odd VCs

18 Input Buffer Data Structures
Data structures are needed to:
- Track flit/packet locations in memory
- Manage the available free memory
- Allocate multiple VCs
- Prevent blocking
Two common types:
- Circular buffers: static and simpler, but inefficient memory usage
- Linked lists: dynamic and more complex, but fairer memory usage
Nomenclature: buffer (flit buffer) = the entire structure; cell (flit cell) = storage for a single flit

19 Circular Buffer
First and Last pointers are FIXED: they specify the memory boundary for a VC.
Head and Tail specify the current content boundary:
- A flit is added at the tail → tail incremented (modulo)
- Tail = Head → buffer full
- A flit is removed from the head → head incremented (modulo)
- Head = Tail → buffer empty
Choose the size N as a power of 2 so the low-order log2(N) bits implement the circular increment (like a cache line index and byte offset).
(Figure: flits a, b, c, d removed; g, h, i, j added.)
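A minimal C sketch of one VC's circular buffer. Note that the slide's Tail = Head "full" test is the same condition as its Head = Tail "empty" test, so this sketch disambiguates by keeping one cell unused; a count register is the usual hardware alternative:

```c
#include <stdint.h>
#include <stdbool.h>

#define N 8                           /* cells per VC; a power of two */

typedef struct {
    uint32_t cells[N];
    unsigned head, tail;              /* current content boundary */
} circ_buf_t;

static bool cb_full(const circ_buf_t *b)  { return ((b->tail + 1) & (N - 1)) == b->head; }
static bool cb_empty(const circ_buf_t *b) { return b->head == b->tail; }

static bool cb_add(circ_buf_t *b, uint32_t flit) {      /* add at tail */
    if (cb_full(b)) return false;
    b->cells[b->tail] = flit;
    b->tail = (b->tail + 1) & (N - 1);  /* low log2(N) bits wrap for free */
    return true;
}

static bool cb_remove(circ_buf_t *b, uint32_t *flit) {  /* remove at head */
    if (cb_empty(b)) return false;
    *flit = b->cells[b->head];
    b->head = (b->head + 1) & (N - 1);
    return true;
}
```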

20 Linked List Buffer
- Each cell has a pointer field naming the next cell
- Head and Tail point to the first and last cells; NULL for an empty buffer
- Free list: a linked list of free cells; Free points to its head
- Counter registers: the number of allocated cells for each buffer, and the number of cells in the free list
- Bit errors are more severe than in a circular buffer: a corrupted pointer breaks the list
(Figure: add cell e; remove cell a.)
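A sketch of the linked-list scheme, with array indices standing in for the hardware pointer fields. CELLS, ll_add, and ll_remove are names I have chosen; a vc_list_t starts out as {NIL, NIL, 0}:

```c
#include <stdint.h>

#define CELLS 16
#define NIL   (-1)

static uint32_t flit_mem[CELLS];      /* shared flit storage             */
static int      next_ptr[CELLS];      /* per-cell next-pointer field     */
static int      free_head = NIL;      /* Free: head of the free list     */
static int      free_count = 0;       /* counter: cells in the free list */

typedef struct { int head, tail, count; } vc_list_t;  /* one per VC */

static void mem_init(void) {          /* thread all cells onto the free list */
    for (int i = 0; i < CELLS - 1; i++) next_ptr[i] = i + 1;
    next_ptr[CELLS - 1] = NIL;
    free_head  = 0;
    free_count = CELLS;
}

static int ll_add(vc_list_t *b, uint32_t flit) {
    if (free_head == NIL) return -1;              /* out of cells */
    int c = free_head;                            /* pop a free cell */
    free_head = next_ptr[c]; free_count--;
    flit_mem[c] = flit; next_ptr[c] = NIL;
    if (b->tail == NIL) b->head = c; else next_ptr[b->tail] = c;
    b->tail = c; b->count++;
    return 0;
}

static int ll_remove(vc_list_t *b, uint32_t *flit) {
    if (b->head == NIL) return -1;                /* buffer empty */
    int c = b->head;
    *flit = flit_mem[c];
    b->head = next_ptr[c];
    if (b->head == NIL) b->tail = NIL;
    next_ptr[c] = free_head; free_head = c; free_count++;  /* recycle cell */
    b->count--;
    return 0;
}
```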

21 Buffer Memory Allocation
Prevent a greedy VC from flooding all of memory and blocking the others:
- Add a count register to each input VC state vector to track its allocated cells
- Keep an additional counter for the free list
Simple policy: reserve 1 cell for each VC
- Add a flit to buffer_VCi if: buffer_VCi is empty, or #(free list) > #(empty VCs)
Detailed policy: sliding limit allocator (r: reserved cells per buffer; f: fraction of the free space a VC may use)
- Add a flit to buffer_VCi if: |buffer_VCi| < r, or |buffer_VCi| < f·#(free list) + r
- f = r = 1 → same behavior as the simple policy
Both checks are sketched below.
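A minimal C rendering of the two admission policies; function names are mine, and the counts come from the counter registers described above:

```c
#include <stdbool.h>

/* Simple policy: accept if this VC is empty, or enough free cells
   remain to keep one in reserve for every currently empty VC. */
bool simple_ok(int vc_count, int free_cells, int empty_vcs) {
    return vc_count == 0 || free_cells > empty_vcs;
}

/* Sliding limit: r cells reserved per buffer, plus a fraction f of
   the free list. With f = r = 1 this behaves like the simple policy. */
bool sliding_ok(int vc_count, int free_cells, int r, double f) {
    return vc_count < r || vc_count < f * free_cells + r;
}
```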

22 SWITCH
The core of the router: directs packets/flits to their destinations.
Speedup: provided switch BW / minimum switch BW required for full throughput on all inputs and outputs of the router.
Adding speedup simplifies allocation and yields higher throughput and lower latency.
Realizations:
- Bus switch
- Crossbar
- Network switch

23 Bus Switches
Switches in time: an input port accumulates P phits of a flit, arbitrates for the bus, and transmits the P phits over the bus to any output unit (P: number of input switch ports).
- e.g., P = 3 <fig. 17.5 shows P = 4>
- Feasible only if flits are longer than P phits (preferably an integer multiple of P)
- Fragmentation loss if the phits per flit are not a multiple of P

24 Bus timing diagram
(Figure: bus timing diagram; an annotation marks where the transfer could actually start.)

25 Bus Pros & Cons
Pros:
- Simple switch allocation
- The input port owning the bus can reach every output port → multicast made easy
Cons:
- Wasted port bandwidth: with port BW b, router BW = Pb; bus BW = Pb; input deserializer BW = Pb; output serializer BW = Pb. Available internal BW = P × Pb = P²b, yet used bus BW = Pb (speedup = 1)
- Increased latency: worst case 2P phit times <see fig. 17.6, bus timing diagram>; varies from P+1 to 2P
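A worked instance of the bandwidth accounting above, taking P = 4 as an assumed example:

```latex
\text{available internal BW} = P \times Pb = P^2 b = 16b, \qquad
\text{used bus BW} = Pb = 4b
\;\Rightarrow\; \text{utilization} = \frac{Pb}{P^2 b} = \frac{1}{P} = 25\%
```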

26 Xbar Switches
Primary issue: speedup
1. k×k → no speedup <fig. 17.10(a)>
2. sk×k → input speedup = s <fig. 17.10(b)>
3. k×sk → output speedup = s <fig. 17.11(a)>
4. sk×sk → speedup = s <fig. 17.11(b)>
(Speedup simplifies allocation)

27 Xbar Throughput
Input speedup, e.g., random separable allocator, input speedup s, uniform traffic:
- Throughput = P{at least one of the sk flits is destined for a given output}
  = 1 − P{none of the sk inputs chooses the given output}
  = 1 − [(k−1)/k]^(sk)
- s = k → throughput ≈ 100% (the formula gives slightly less than 100%, so the claim does not verify exactly; see the check below)
Output speedup:
- Needs a reverse allocator; more complicated for the same gain
Overall speedup (both input and output):
- Can achieve > 100% throughput, but cannot sustain it: the output buffers would grow without bound, and the input buffers would need to start with an infinite supply of flits
Input speedup s_i with output speedup s_o (s_i > s_o):
- Behaves like input speedup s_i/s_o combined with overall speedup s_o (throughput expression given in the slide's figure)
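Evaluating the formula numerically shows why s = k does not literally verify: the throughput approaches but never reaches 1. A quick check, with k = 8 as an assumed example:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    int k = 8;                                   /* k x k crossbar */
    for (int s = 1; s <= k; s++) {
        double thruput = 1.0 - pow((k - 1.0) / k, (double)s * k);
        printf("s=%d  throughput=%.4f\n", s, thruput);
    }
    /* s=1 gives ~0.656; s=k=8 gives ~0.9998: close to, not exactly, 100%. */
    return 0;
}
```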

28 Network Switches
A network of smaller switches:
- Reduces the number of crosspoints, localizes logic, and shortens wires
- Requires complex control or intermediate buffering
- Not very profitable!
Example: a 7×7 switch built from three 3×3 switches → 3 × 9 = 27 crosspoints instead of 7 × 7 = 49

29 OUTPUT UNIT
Essentially a FIFO to match the switch speed.
- If the switch output speedup = 1: merely latches flits to the downstream channel
- No need to partition across VCs
- Provides backpressure to the SA to prevent buffer overflow: the SA should block traffic headed to a choking output buffer

30 ARBITER
- Resolves multiple requests for a single resource (N → 1)
- Building block for allocators (N1 → N2)
- Communication and timing: request/grant handshake (figures show 1-cycle and 4-cycle grants, with and without holding the grant)

31 Arbiter Types
- Fixed priority: r0 > r1 > r2 > …
- Variable (iterative) priority: rotate the priorities; build a carry chain with a hot 1 inserted from the priority inputs
  e.g., r1 > r2 > … > r0 → (p0, p1, p2, …, pn) = 010…0
- Matrix: implements a least-recently-served (LRS) scheme using a triangular array; M[r][c] = 1 → requester r has priority over requester c
- Queueing: first come, first served <the bank/STA Travel style>
  - Ticket counter: hands the current ticket number to each requester; increments with each ticket
  - Served counter: holds the number of the requester currently served; increments for the next customer
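A behavioral sketch of the matrix (LRS) arbiter: M[r][c] = 1 means requester r beats requester c, and the winner's row is cleared while its column is set, demoting it to least-recently-served. Starting from an all-zero matrix (an assumption), ties fall to the lower index until the matrix is trained:

```c
#include <stdbool.h>

#define NREQ 4

static bool M[NREQ][NREQ];            /* priority matrix */

int matrix_arbitrate(const bool req[NREQ]) {
    for (int r = 0; r < NREQ; r++) {
        if (!req[r]) continue;
        bool wins = true;             /* r wins if no requester outranks it */
        for (int c = 0; c < NREQ; c++)
            if (c != r && req[c] && M[c][r]) { wins = false; break; }
        if (!wins) continue;
        for (int c = 0; c < NREQ; c++) {
            if (c == r) continue;
            M[r][c] = false;          /* r now outranks nobody */
            M[c][r] = true;           /* everyone outranks r   */
        }
        return r;                     /* grant goes to r */
    }
    return -1;                        /* no request */
}
```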

32 ALLOCATOR
Provides a matching: multiple resources to multiple requesters
- e.g., the switch allocator: every cycle, match input ports to output ports; 1 flit leaves each input port and 1 flit goes to each output port
n×m allocator:
- r_ij: requester i wants access to resource j
- g_ij: requester i is granted access to resource j
- Request & grant matrices
Allocation rules:
- g_ij ⇒ r_ij: grant only if requested
- g_ij ⇒ no other g_ik: only 1 grant per requester
- g_ij ⇒ no other g_kj: only 1 grant per resource
(A maximal matching cannot be extended by adding a grant; a maximum matching has the largest possible number of grants.)

33 Allocation Problem
Can be represented as finding a maximum-matching grant matrix; equivalently, a maximum matching in a bipartite graph (request matrix R → grant matrix G).
Exact algorithms:
- Augmenting path method: always finds a maximum matching, but not feasible within a router's time budget
Faster heuristics:
- Separable allocators: two stages of arbitration, across inputs and across outputs, in either order (input-first or output-first)

34 4x3 Input-first Separable Allocator
(Figure: each of the four inputs has a 3-input arbiter selecting one of its requests; the survivors feed three 4-input output arbiters, one per resource.)
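A behavioral sketch of that structure, not the gate-level design: four input arbiters each keep one request, then three output arbiters each grant one survivor. Fixed priorities are used here for brevity; a real allocator would rotate them for fairness:

```c
#include <stdio.h>
#include <stdbool.h>

#define NIN  4
#define NOUT 3

void separable_allocate(const bool R[NIN][NOUT], bool G[NIN][NOUT]) {
    bool X[NIN][NOUT] = {{false}};

    /* Stage 1: each input's arbiter forwards at most one request. */
    for (int i = 0; i < NIN; i++)
        for (int j = 0; j < NOUT; j++)
            if (R[i][j]) { X[i][j] = true; break; }

    /* Stage 2: each output's arbiter grants at most one survivor. */
    for (int j = 0; j < NOUT; j++) {
        bool granted = false;
        for (int i = 0; i < NIN; i++) {
            G[i][j] = X[i][j] && !granted;
            granted = granted || G[i][j];
        }
    }
}

int main(void) {
    bool R[NIN][NOUT] = { {1,1,0}, {1,0,0}, {0,1,1}, {0,0,1} };
    bool G[NIN][NOUT];
    separable_allocate(R, G);
    for (int i = 0; i < NIN; i++)
        for (int j = 0; j < NOUT; j++)
            if (G[i][j]) printf("input %d -> output %d\n", i, j);
    return 0;   /* grants: 0->0, 2->1, 3->2; input 1 loses at output 0 */
}
```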

