Hardware Microarchitecture
Lecture 1 (Ch. 16, 17, 18, 19)
ELE-580i PRESENTATION-I
04/01/2003
Canturk ISCI
ROUTER ARCHITECTURE
Router is built from:
- Registers, switches, functional units, control logic
Implements:
- Routing & flow control
- Pipelined
- Uses credits to account for downstream buffer space
  - Flits travel downstream, credits travel upstream: together they constitute the credit loop (see the sketch below)
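A minimal sketch of the credit loop, assuming a hypothetical single-VC channel; the class and method names are illustrative, not from the lecture.

```python
# Minimal credit-loop sketch: the upstream side spends a credit per flit
# sent; the downstream side returns a credit whenever a buffer cell frees up.
class CreditLoop:
    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # upstream starts with one credit per cell
        self.downstream_buffer = []

    def send_flit(self, flit):
        """Upstream: forwarding a flit consumes a credit."""
        if self.credits == 0:
            return False                 # credit stall: downstream may be full
        self.credits -= 1
        self.downstream_buffer.append(flit)
        return True

    def drain_flit(self):
        """Downstream: freeing a buffer cell sends a credit back upstream."""
        if self.downstream_buffer:
            self.downstream_buffer.pop(0)
            self.credits += 1            # credit arrives on the upstream wire
```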
ROUTER Diagram (Virtual Channel Router)
Datapath: Input Units | Switch | Output Units
Control: Router (route computation), VC Allocator, Switch Allocator
- Input unit: a state vector (for each VC) & a flit buffer (for each VC)
  - State vector fields: G, R, O, P, C
- Output unit: latches outgoing flits
  - State vector fields: G, I, C
- Switch: connects inputs to outputs according to the SA
- VCA: arbitrates output channel requests from each input packet; done once per packet!
- SA: arbitrates output port requests from input ports; done for each flit
- Router: determines output ports for packets
VC State Fields
Input virtual channel (one state vector per input VC):
- G (Global state): I, R, V, A, C
- R (Route): output port for the packet
- O (Output VC): output VC of port R for the packet
- P (Pointers): flit head and tail pointers
- C (Credit count): number of credits for output VC R.O
Output virtual channel (one state vector per output VC):
- G (Global state): I, A, C
- I (Input VC): input port.VC forwarding to this output VC
- C (Credit count): number of free buffers at the downstream node
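A sketch of the two state vectors as plain data structures; the field names follow the slide, while the types and defaults are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class G(Enum):
    """Global state field: input VCs use all five values,
    output VCs use only I, A, C."""
    I = "idle"
    R = "routing"
    V = "waiting_for_vc"
    A = "active"
    C = "credit_stall"

@dataclass
class InputVCState:
    g: G = G.I
    r: Optional[int] = None              # R: output port chosen for the packet
    o: Optional[int] = None              # O: output VC of port r
    p: Tuple[int, int] = (0, 0)          # P: flit head and tail pointers
    c: int = 0                           # C: credits for output VC r.o

@dataclass
class OutputVCState:
    g: G = G.I
    i: Optional[Tuple[int, int]] = None  # I: (input port, VC) feeding this VC
    c: int = 0                           # C: free buffers at the downstream node
```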
How it works
1) Per packet (input controller):
   - Router determines the output port (e.g., P3)
   - VCA allocates an output VC (e.g., P3.VC2)
   - Route is now determined
2) Per flit (input controller):
   - SA grants a timeslot over the switch
   - Flit is forwarded to the output unit
3) Per flit (output unit):
   - Drives the downstream physical channel
   - Flit is transferred
Router Pipeline: RC | VA | SA | ST | TX
- Route Compute (RC): determine the output port for the packet header
- VC Allocate (VA): assign a VC from that port, if one is available
- Switch Allocate (SA): schedule the switch state according to output port requests
- Switch Traverse (ST): the input drives the switch toward the output port
- Transmit (TX): transmit the flit over the downstream channel
RC and VA run only for the header flit; the output channel is assigned for the whole packet.
SA, ST, and TX run for all flits; flits from different packets compete continuously.
Flits are transmitted sequentially so they can be routed at the next hop (see the stall-free schedule sketched below).
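A toy schedule builder for the stall-free case, assuming one flit enters SA per cycle; the stage and flit labels are illustrative.

```python
# Ideal (stall-free) pipeline occupancy: the head flit passes all five
# stages; body and tail flits skip RC and VA and queue up behind SA.
HEAD_STAGES = ["RC", "VA", "SA", "ST", "TX"]
BODY_STAGES = ["SA", "ST", "TX"]

def schedule(flits):
    """Map each flit (e.g. ["head", "body", "tail"]) to {cycle: stage}."""
    table = {}
    for i, kind in enumerate(flits):
        stages = HEAD_STAGES if kind == "head" else BODY_STAGES
        start = i if kind == "head" else i + 2   # align with the head's SA slot
        table[f"{kind}{i}"] = {start + c: s for c, s in enumerate(stages)}
    return table

# schedule(["head", "body", "tail"]) puts the head in SA at cycle 2,
# the body flit in SA at cycle 3, and the tail in SA at cycle 4.
```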
Pipeline Walkthrough
(0) <start>
    P4.VC3 (input VC): G=I | R=x | O=x | P=x | C=x
    - Packet arrives at input port P4; the header names VC3
    - Packet is stored in P4.VC3
(1) <RC>
    P4.VC3: G=R | R=x | O=x | P=<head>,<tail??> | C=x
    - Router selects the output port for the packet header: P3
(2) <VA>
    P4.VC3: G=V | R=P3 | O=x | P=<head>,<tail??> | C=x
    P3.VC2 (output VC): G=I | I=x | C=x
    - VCA allocates a VC on output port P3: VC2
…Pipeline Walkthrough
(3) <SA>
    P4.VC3 (input VC): G=A | R=P3 | O=VC2 | P=<head>,<tail??> | C=#
    P3.VC2 (output VC): G=A | I=P4.VC3 | C=#
    - Per-packet processing is complete; flit-by-flit switch allocation/traversal & transmit begin
    - Head flit allocated on the switch: move the pointers, decrement P4.VC3.Credit
    - Send a credit to the upstream node to declare the freed buffer space
(4) <ST> Head flit arrives at the output VC
(5) <TX> Head flit transmitted downstream
(6) <Tail in SA> Packet done
(7) <Release resources>
    P4.VC3: G=I, or R if a new packet is already waiting | R=x | O=x | P=x | C=x
    P3.VC2: G=I | I=x | C=x
Pipeline Stalls
Packet stalls:
- P1) Input VC busy stall
- P2) Routing stall
- P3) VC allocation stall
Flit stalls:
- F1) Switch allocation stall
- F2) Buffer empty stall
- F3) Credit stall
Credit return timeline:
- SA, ST: flit leaves; credit count decremented
- ST | W1 | W2 | RC | VA | SA: flit crosses the channel and the downstream pipeline
- CT | W1 | W2: the new credit from downstream crosses back and reaches upstream
- CU: credit++ at the upstream node
- SA: the next switch allocation can use the returned credit
Credit return cycles: pipeline(4) + round trip(4) + CT(1) + CU(1) + next SA(1) = 11
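The 11-cycle figure has a sizing consequence the slide leaves implicit; a standard rule from credit-based flow control (stated here as an addition, not from the slide) is that each VC needs enough flit buffers to cover the credit return time, or credit stalls throttle the channel:

```latex
% Per-VC buffering needed to avoid credit stalls at full rate
% (t_crt: credit return time, b: channel bandwidth, L_f: flit length):
F \ge \frac{t_{crt} \cdot b}{L_f}
% At one flit per cycle, the 11-cycle t_crt above requires F >= 11 cells.
```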
Channel Reallocation
1) Conservative:
- Wait until the credit for the tail flit is received from downstream before reallocating the output VC
2) Aggressive, single global state:
- Reallocate the output VC when the tail passes SA (same as a VA stall)
- Reallocate the downstream input VC when the tail passes SA (same as an input VC busy stall)
…Channel Reallocation
2) Aggressive, double global state:
- Reallocate the output VC when the tail passes SA (same as a VA stall)
- Eliminates the input VC busy stall
- Needs 2 input VC state vectors at the downstream node:
  For A: G=A | R=Px | O=VCx | P=<head A>,<tail A> | C=#
  For B: G=R | R=x | O=x | P=<head B>,<tail??> | C=x
Speculation and Lookahead
Reduce latency by reducing pipeline stages.
Speculation:
- Speculate on virtual channel allocation: do VA and SA concurrently
- If the VC set from RC spans more than one port, speculate on the port as well
- The speculation may be successful or unsuccessful (the figure shows both cases)
Lookahead:
- Do the route computation for node i at node i-1
- Start at VA at each node, overlapping NRC (next-hop route compute) with VA
Flit and Credit Format
Two ways to distinguish credits from flits:
1) Piggybacking: include a credit field on each flit; no type field required
2) Define types, e.g., 10 starts a credit, 11 starts a flit, 0x is idle
Flit format:
- Head flit: VC | Type (Credit) | Route info | Payload | CRC
- Body flit: VC | Type (Credit) | Payload | CRC
Credit format: VC | Type | Check
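A sketch of the typed encoding from option 2, with a piggybacked credit field; all field widths here are assumptions for illustration, not values from the lecture.

```python
# Typed flit/credit encoding sketch; widths (vc_bits etc.) are assumed.
TYPE_IDLE, TYPE_CREDIT, TYPE_FLIT = 0b00, 0b10, 0b11

def pack_credit(vc, check, vc_bits=4, check_bits=4):
    """Credit word: | type=10 | VC | check |"""
    return (TYPE_CREDIT << (vc_bits + check_bits)) | (vc << check_bits) | check

def pack_body_flit(vc, credit, payload, crc,
                   vc_bits=4, credit_bits=4, payload_bits=32, crc_bits=8):
    """Body flit word: | type=11 | VC | piggybacked credit | payload | CRC |"""
    word = (TYPE_FLIT << vc_bits) | vc
    word = (word << credit_bits) | credit
    word = (word << payload_bits) | payload
    return (word << crc_bits) | crc
```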
ROUTER COMPONENTS
Datapath:
- Input buffer: holds waiting flits
- Switch: routes flits from inputs to outputs
- Output unit: sends flits to the downstream node
Control:
- Arbiter: grants access to shared resources
- Allocator: allocates VCs to packets and switch time to flits
Input Buffer
Smooths out flit traffic; holds flits awaiting:
- VCs
- Switch bandwidth or channel bandwidth
Organization:
- Centralized
- Partitioned into physical channels
- Partitioned into VCs
Centralized Input Buffer
A single memory combined across the entire router.
No separate switch is needed, but:
- Inputs must be multiplexed into the memory
- The memory output must be demultiplexed to the output ports
PRO: flexibility in allocating memory space
CONs:
- High memory bandwidth requirement: 2I (write I inputs and read I outputs per flit time, where I is the node degree)
- Flit deserialization/reserialization latency: need to gather I flits from the VCs before writing to memory
Partitioned Input Buffers
One buffer per physical input port:
- Each memory's bandwidth: 2 (1 read, 1 write)
- Buffers are shared across the VCs of a given port, but not across ports
- Less flexibility
One buffer per VC:
- Enables switch input speedup, but obviously needs a bigger switch
- Too fine a granularity: inefficient memory usage
Intermediate solutions: e.g., Mem[even VCs] and Mem[odd VCs]
Input Buffer Data Structures
Data structures are required to:
- Track flit/packet locations in memory
- Manage the available free memory
- Allocate multiple VCs
- Prevent blocking
Two common types:
- Circular buffers: static, simpler, yet inefficient memory usage
- Linked lists: dynamic, more complex, but fairer memory usage
Nomenclature:
- Buffer (flit buffer): the entire structure
- Cell (flit cell): storage for a single flit
Circular Buffer
- First and Last pointers are FIXED: they specify the memory boundary for a VC
- Head and Tail specify the current content boundary
- A flit is added at the tail; the tail is incremented (modulo the size); Tail = Head means buffer full
- A flit is removed from the head; the head is incremented (modulo the size); Head = Tail means buffer empty
- Choose the size N a power of 2 so that the low log2(N) bits implement the circular increment, like a cache line index & byte offset (see the sketch below)
(Figure: a, b, c, d removed; g, h, i, j added)
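A minimal sketch of the power-of-2 trick; the extra wrap bit used to tell full from empty is an assumption of this sketch (the slide's Head = Tail tests alone cannot distinguish the two cases).

```python
class CircularFlitBuffer:
    """Power-of-2 circular buffer: the low bits of head/tail index the
    cells; keeping the counters wider than log2(n) disambiguates full
    from empty."""
    def __init__(self, n):
        assert n & (n - 1) == 0, "size must be a power of 2"
        self.n, self.cells = n, [None] * n
        self.head = self.tail = 0          # counters wider than log2(n)

    def full(self):
        return self.tail - self.head == self.n

    def empty(self):
        return self.tail == self.head

    def push(self, flit):                  # add a flit at the tail
        assert not self.full()
        self.cells[self.tail & (self.n - 1)] = flit   # LSBs do the wrap
        self.tail += 1

    def pop(self):                         # remove a flit from the head
        assert not self.empty()
        flit = self.cells[self.head & (self.n - 1)]
        self.head += 1
        return flit
```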
Linked List Buffer
- Each cell has a pointer field to the next cell
- Head and Tail point to the first and last cells; NULL for empty buffers
- Free list: a linked list of free cells; Free points to the head of this list
- Counter registers: the count of allocated cells for each buffer, and the count of cells in the free list
- Bit errors have a more severe effect than in a circular buffer (a corrupted pointer can lose or cross-link cells)
(Figure: add cell e; remove cell a)
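A sketch of one VC's linked-list buffer over a shared cell memory, with cell indices standing in for addresses; in a real router many buffers share the memory and free list, which this single-buffer sketch simplifies.

```python
class LinkedListBuffer:
    """One flit buffer plus the shared free list. `nxt` is the per-cell
    pointer field; None stands in for NULL."""
    def __init__(self, num_cells):
        self.data = [None] * num_cells
        self.nxt = list(range(1, num_cells)) + [None]
        self.free = 0                      # Free: head of the free list
        self.free_count = num_cells        # counter register for the free list
        self.head = self.tail = None       # this buffer's first and last cells
        self.count = 0                     # counter register: allocated cells

    def push(self, flit):
        assert self.free is not None, "no free cells"
        cell, self.free = self.free, self.nxt[self.free]  # pop the free list
        self.free_count -= 1
        self.data[cell], self.nxt[cell] = flit, None
        if self.tail is None:
            self.head = cell               # buffer was empty
        else:
            self.nxt[self.tail] = cell     # link after the old tail
        self.tail, self.count = cell, self.count + 1

    def pop(self):
        assert self.head is not None, "buffer empty"
        cell = self.head
        flit, self.head = self.data[cell], self.nxt[cell]
        if self.head is None:
            self.tail = None
        self.nxt[cell], self.free = self.free, cell   # return cell to free list
        self.free_count += 1
        self.count -= 1
        return flit
```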
Buffer Memory Allocation
Prevent a greedy VC from flooding all of memory and blocking the others!
- Add a count register to each input VC state vector to keep the number of allocated cells
- Keep an additional counter for the free list
Simple policy: reserve 1 cell for each VC.
- Add a flit to buffer_VCi if: buffer_VCi is empty, or #(free list) > #(empty VCs)
Detailed policy: sliding limit allocator (r: # of reserved cells per buffer, f: fraction of the free space a buffer may use), sketched below.
- Add a flit to buffer_VCi if: |buffer_VCi| < r, or |buffer_VCi| < f * #(free list) + r
- f = r = 1 is the same as the simple policy
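The sliding-limit test as a predicate; the function and parameter names are illustrative.

```python
def may_accept(buf_len, free_count, r=1, f=1.0):
    """Sliding limit allocator: accept a flit into a VC's buffer if the
    buffer is under its reservation r, or under its share f of the free
    list plus r. The slide notes f = r = 1 matches the simple policy."""
    return buf_len < r or buf_len < f * free_count + r
```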
SWITCH
The core of the router: directs packets/flits to their destinations.
Speedup: (provided switch BW) / (minimum switch BW required for full throughput on all inputs and outputs of the router)
Adding speedup simplifies allocation and yields higher throughput and lower latency.
Realizations:
- Bus switch
- Crossbar
- Network switch
Bus Switches
Switch in time: an input port accumulates P phits of a flit, arbitrates for the bus, and transmits the P phits over the bus to any output unit (e.g., P=3; fig. 17.5 shows P=4)
- Feasible only if flits have more phits than P (preferably an integer multiple of P)
- Fragmentation loss if the phits per flit are not a multiple of P
(P: number of input switch ports)
Bus Timing Diagram
(Figure: bus timing diagram; the annotation "Could actually start here!" marks where a transmission could begin earlier. P: number of input switch ports.)
Bus Pros & Cons
Pros:
- Simple switch allocation
- The input port owning the bus can access all output ports: multicast made easy
Cons:
- Wasted port bandwidth. With port BW b: router BW = Pb, bus BW = Pb, input deserializer BW = Pb, output serializer BW = Pb. The available internal BW is P x Pb = P^2 b, but the used bus BW is only Pb (speedup = 1)
- Increased latency: 2P phit times in the worst case; can vary from P+1 to 2P (see fig. 17.6, bus timing diagram)
(P: number of input switch ports)
Xbar Switches
Primary issue: speedup.
1. k x k: no speedup (fig. 17.10a)
2. sk x k: input speedup s (fig. 17.10b)
3. k x sk: output speedup s (fig. 17.11a)
4. sk x sk: overall speedup s (fig. 17.11b)
(Speedup simplifies allocation.)
Xbar Throughput
Input speedup: for a random separable allocator, input speedup s, and uniform traffic:
  Throughput = P{at least one of the sk flits is destined for a given output}
             = 1 - P{none of the sk inputs choose the given output}
             = 1 - ((k-1)/k)^(sk)
  For s = k this gives nearly, but not exactly, 100% throughput (the slide's 100% claim does not quite verify; see the check below).
Output speedup: needs a reverse allocator; more complicated for the same gain.
Overall speedup (both input and output): can achieve > 100% throughput, but cannot sustain it, since the output buffers would expand to infinity and the input buffers would need to start out filled with an infinite number of flits.
Input speedup s_i and output speedup s_o (s_i > s_o): similar to an input speedup of s_i/s_o combined with an overall speedup of s_o.
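A quick numeric check of the formula above; k = 8 is an arbitrary example size.

```python
# Throughput = 1 - ((k-1)/k)**(s*k) for a k x k crossbar with input
# speedup s, uniform traffic, random separable allocator.
k = 8
for s in (1, 2, 3, k):
    print(s, 1 - ((k - 1) / k) ** (s * k))
# s=1 -> ~0.656, s=2 -> ~0.882, s=3 -> ~0.959, s=k=8 -> ~0.9998:
# close to, but not exactly, 100% at s = k, matching the slide's remark.
```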
Network Switches
A network of smaller switches:
- Reduces the number of crosspoints
- Localizes logic
- Reduces wire lengths
- But requires complex control or intermediate buffering; not very profitable!
Example: a 7x7 switch built from three 3x3 switches: 3 x 9 = 27 crosspoints instead of 7 x 7 = 49
OUTPUT UNIT
Essentially a FIFO to match the switch speed.
- If the switch output speedup is 1: merely latches the flits to the downstream node; no need to partition across VCs
- Provides backpressure to the SA to prevent buffer overflow: the SA should block traffic headed to a choking output buffer
ARBITER
- Resolves multiple requests for a single resource (N : 1)
- Building block for allocators (N1 : N2)
Communication and timing (figure): request/grant waveforms contrasting a requester that is not holding vs. holding the resource, e.g., a 1-cycle grant vs. a 4-cycle grant held until release.
Arbiter Types
- Fixed priority: r0 > r1 > r2 > ...
- Variable (iterative) priority: rotate the priorities; build a carry chain with a hot 1 inserted from the priority inputs. E.g., r1 > r2 > ... > r0 corresponds to (p0, p1, p2, ..., pn) = 010...0
- Matrix: implements an LRS (least recently served) scheme; uses a triangular array where M(r,c) = 1 means RQr > RQc
- Queueing: first come, first served (the bank/STA Travel style)
  - Ticket counter: gives the current ticket to a requester; increments with each ticket
  - Served counter: stores the currently served requester's number; increments for the next customer
(The first two types are sketched below.)
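Behavioral sketches of the first two types; in hardware these are combinational carry chains, so the loops below are only a functional model.

```python
def fixed_priority(requests):
    """Fixed-priority arbiter: grant the lowest-numbered requester
    (r0 > r1 > r2 > ...)."""
    for i, r in enumerate(requests):
        if r:
            return i
    return None                      # no requests, no grant

def rotating_priority(requests, top):
    """Variable (iterative) priority: the hot 1 at position `top` marks the
    highest-priority input; the carry chain searches circularly from there."""
    n = len(requests)
    for off in range(n):
        i = (top + off) % n
        if requests[i]:
            return i
    return None

# Round-robin usage: after a grant to winner w, set top = (w + 1) % n.
```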
ALLOCATOR
Provides a matching between multiple requesters and multiple resources: an n x m allocator.
E.g., a switch allocator: every cycle, match input ports to output ports so that 1 flit leaves each input port and 1 flit goes to each output port.
- r_ij: requester i wants access to resource j
- g_ij: requester i is granted access to resource j
Allocation rules on the request & grant matrices:
- g_ij => r_ij: grant only if requested
- g_ij => no other g_ik: only 1 grant for each requester (input)
- g_ij => no other g_kj: only 1 grant for each resource (output)
(Figure contrasts a maximal matching, which cannot be extended, with a maximum matching, which has the largest possible size.)
Allocation Problem
Can be represented as finding the maximum-matching grant matrix; equivalently, a maximum matching in a bipartite graph.
Exact algorithms:
- Augmenting path method: always finds a maximum matching, but not feasible within the time budget
Faster heuristics:
- Separable allocators: 2 sets of arbitration, across inputs & across outputs, in either order: input-first OR output-first
4x3 Input-first Separable Allocator
(Figure: a 4x3 input-first separable allocator; each input's 3-input arbiter selects one of that input's requests, and each output's 4-input arbiter issues the grant.)
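A behavioral sketch of the input-first scheme for the 4x3 case; the rotating-priority arbiters and the matrix representation are assumptions of this sketch, not details from the slide.

```python
def _rr(req, top):
    """Rotating-priority arbiter: first asserted request at or after `top`."""
    n = len(req)
    return next(((top + k) % n for k in range(n) if req[(top + k) % n]), None)

def input_first_separable(R, in_prio, out_prio):
    """Input-first separable allocation for a request matrix R[n][m]:
    stage 1 arbitrates across each input's requests (one per row survives),
    stage 2 arbitrates across the surviving inputs for each output."""
    n, m = len(R), len(R[0])
    X = [[0] * m for _ in range(n)]
    for i in range(n):                     # stage 1: per-input arbiters
        j = _rr(R[i], in_prio[i])
        if j is not None:
            X[i][j] = 1
    G = [[0] * m for _ in range(n)]
    for j in range(m):                     # stage 2: per-output arbiters
        i = _rr([X[r][j] for r in range(n)], out_prio[j])
        if i is not None:
            G[i][j] = 1
    return G

# Example, 4 requesters x 3 resources:
R = [[1, 1, 0],
     [0, 1, 0],
     [1, 0, 1],
     [0, 0, 1]]
G = input_first_separable(R, in_prio=[0, 0, 0, 0], out_prio=[0, 0, 0])
# Grants (0,0), (1,1), (3,2); input 2 loses this cycle and retries.
```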