Hardware Microarchitecture Lecture-1 <Ch. 16, 17, 18, 19>


Hardware Microarchitecture Lecture-1 <Ch. 16, 17, 18, 19> ELE-580i Presentation-I, 04/01/2003, Canturk ISCI

ROUTER ARCHITECTURE
A router is built from registers, switches, functional units, and control logic.
It implements routing and flow control, and is pipelined.
Credits are used to track downstream buffer space:
- Flits flow downstream
- Credits flow upstream
- Together they constitute the credit loop

Router Diagram (Virtual-Channel Router)
Datapath: Input Units | Switch | Output Units
Control: Router (route computation), VC Allocator (VCA), Switch Allocator (SA)
Input unit: state vector (one per VC) and flit buffer (per VC); state vector fields: G, R, O, P, C
Output unit: latches outgoing flits; state vector fields: G, I, C
Switch: connects inputs to outputs according to the switch allocator
Router: determines the output port for each packet
VCA: arbitrates output-VC requests from input packets; done once per packet
SA: arbitrates output-port requests from input ports; done for every flit

VC State Fields
Input virtual channel state (one vector per VC):
- G: global state, one of I (idle), R (routing), V (waiting for an output VC), A (active), C (waiting for credits)
- R: route, the output port assigned to the packet
- O: output VC, the VC of port R assigned to the packet
- P: pointers, the head and tail pointers into the flit buffer
- C: credit count, the number of credits available for output VC R.O
Output virtual channel state (one vector per VC):
- G: global state, one of I, A, C
- I: input VC, the input port.VC currently forwarding to this output VC
- C: credit count, the number of free buffers at the downstream node
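
As a rough illustration only (not from the slides), the two state vectors could be modeled as plain records; the field names G, R, O, P, C and G, I, C follow the slide, while the types and the example values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class InputVCState:
    """Per-input-VC state vector (fields follow the slide: G, R, O, P, C)."""
    G: str = "I"                         # global state: I, R, V, A, or C
    R: Optional[int] = None              # output port chosen for the current packet
    O: Optional[int] = None              # output VC chosen on port R
    head: int = 0                        # P: head pointer into the flit buffer
    tail: int = 0                        # P: tail pointer into the flit buffer
    C: int = 0                           # credits available for output VC R.O

@dataclass
class OutputVCState:
    """Per-output-VC state vector (fields follow the slide: G, I, C)."""
    G: str = "I"                         # global state: I, A, or C
    I: Optional[Tuple[int, int]] = None  # (input port, input VC) feeding this output VC
    C: int = 0                           # free buffers at the downstream router

# Example: the walkthrough's input VC P4.VC3 once it is active and bound to P3.VC2.
p4_vc3 = InputVCState(G="A", R=3, O=2, C=4)
print(p4_vc3)
```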

How it works
1) Packet → input controller → route determined:
   the router selects the output port (e.g., P3) and the VCA assigns an output VC (e.g., P3.VC2)
2) Each flit → input controller → SA grants a timeslot over the switch → flit forwarded to the output unit
3) Each flit → output unit → drives the downstream physical channel → flit transferred

Router Pipeline: RC | VA | SA | ST | TX
- Route Compute (RC): determine the output port for the packet header
- VC Allocate (VA): assign a VC on that port, if one is available
- Switch Allocate (SA): schedule the switch state according to the output-port requests
- Switch Traverse (ST): the input drives the switch toward its output port
- Transmit (TX): transmit the flit over the downstream channel
RC and VA run only for the head flit; the output channel is assigned to the whole packet.
SA, ST, and TX run for every flit; flits from different packets compete for the switch continuously.
Flits are transmitted in order so they can be routed at the next hop.
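
The stage sequence can be sketched as a tiny per-flit schedule; this is a simplified illustration (one packet, no stalls, one flit issued per cycle), not the book's implementation.

```python
STAGES = ["RC", "VA", "SA", "ST", "TX"]

def stages_for(kind):
    """Head flits pass through all five stages; body and tail flits skip RC and VA,
    because the route and output VC are assigned once for the whole packet."""
    return STAGES if kind == "head" else STAGES[2:]

def walk(flit_kinds):
    """Print the stage each flit occupies per cycle in an ideal, stall-free pipeline:
    the head starts RC at cycle 0, and flit i enters SA at cycle i + 2."""
    for i, kind in enumerate(flit_kinds):
        start = 0 if kind == "head" else i + 2   # later flits follow one cycle apart in SA
        for cycle, stage in enumerate(stages_for(kind), start=start):
            print(f"cycle {cycle}: flit {i} ({kind}) in {stage}")

walk(["head", "body", "body", "tail"])
```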

Pipeline Walkthrough
(0) Start: packet arrives at input port P4, the header selects VC3, and the packet is stored in P4.VC3
    P4.VC3 (input VC): G=I | R=x | O=x | P=x | C=x
(1) RC: packet header → router → output port P3 selected
    P4.VC3: G=R | R=x | O=x | P=<head>,<tail?> | C=x
(2) VA: VCA allocates a VC on output port P3: VC2
    P4.VC3: G=V | R=P3 | O=x | P=<head>,<tail?> | C=x
    P3.VC2 (output VC): G=I | I=x | C=x

…Pipeline Walkthrough
(3) SA: packet-level processing is complete; flit-by-flit switch allocation, traversal, and transmit begin
    P4.VC3 (input VC): G=A | R=P3 | O=VC2 | P=<head>,<tail?> | C=#
    P3.VC2 (output VC): G=A | I=P4.VC3 | C=#
    When the head flit is granted the switch: advance the pointers, decrement P4.VC3's credit count,
    and send a credit to the upstream node to advertise the freed buffer space
(4) ST: head flit arrives at the output VC
(5) TX: head flit transmitted downstream
(6) Tail flit in SA: packet done
(7) Release resources:
    P4.VC3: G=I (or R if a new packet is already waiting) | R=x | O=x | P=x | C=x
    P3.VC2: G=I | I=x | C=x

Pipeline Stalls
Packet stalls:
- P1) Input VC busy stall
- P2) Routing stall
- P3) VC allocation stall
Flit stalls:
- F1) Switch allocation stall
- F2) Buffer empty stall
- F3) Credit stall
Credit return latency: pipeline(4) + round trip(4) + CT(1) + CU(1) + next SA(1) = 11 cycles.
Credit loop timeline (credit decremented at SA, then):
  ST | W1 | W2 | RC | VA | SA → new credit generated downstream
  CT | W1 | W2 → credit reaches the upstream node
  CU → credit count incremented
  SA → the next switch allocation can reuse the buffer
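
A minimal sketch of the credit counter kept by the upstream output VC; the 11-cycle loop above only determines how deep the downstream buffer must be to avoid credit stalls, and the class and method names here are assumptions.

```python
class CreditChannel:
    """Credit-based flow control for one output VC and its downstream input VC.
    The slide's credit loop, from one SA use of a buffer to the next, is 11 cycles:
    pipeline(4) + round trip(4) + CT(1) + CU(1) + next SA(1)."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # free cells at the downstream input VC

    def can_send(self):
        return self.credits > 0          # otherwise: credit stall (F3)

    def send_flit(self):
        assert self.can_send()
        self.credits -= 1                # decremented when the flit wins SA

    def credit_returned(self):
        self.credits += 1                # CU: downstream freed a cell, credit came back

ch = CreditChannel(buffer_depth=4)
ch.send_flit(); ch.send_flit()
print(ch.credits)        # 2 credits left
ch.credit_returned()
print(ch.can_send())     # True
```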

Channel Reallocation
1) Conservative: wait until the credit for the tail flit is received from downstream before reallocating the output VC
2) Aggressive, single global state:
   - Reallocate the output VC when the tail passes SA (same timing as the VA stall)
   - Reallocate the downstream input VC when the tail passes SA (same timing as the input-VC-busy stall)

…Channel Reallocation
2) Aggressive, double global state:
   - Reallocate the output VC when the tail passes SA (same as the VA stall)
   - Eliminates the input-VC-busy stall
   - Needs two input-VC state vectors at the downstream node:
     For packet A: G=A | R=Px | O=VCx | P=<head A>,<tail A> | C=#
     For packet B: G=R | R=x | O=x | P=<head B>,<tail?> | C=x

Speculation and Lookahead
Reduce latency by shortening the pipeline through speculation (and lookahead).
Speculative VC allocation: perform VA and SA concurrently; if the VC set produced by RC spans more than one port, speculate on the port choice as well (the speculation may be successful or unsuccessful).
Lookahead: compute the route for node i at node i-1, so each node starts at VA, overlapping next-hop route compute (NRC) with VA.

Flit and Credit Format
Two ways to distinguish credits from flits:
1) Piggybacked credit: include a credit field on each flit; no type field required
2) Defined types: e.g., 10 → start of credit, 11 → start of flit, 0x → idle
Flit format:
- Head flit: VC | Type (credit) | Route info | Payload | CRC
- Body flit: VC | Type (credit) | Payload | CRC
Credit format: VC | Type | Check
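
To make the typed encoding concrete, here is a sketch that packs flits and credits into words; the field widths are arbitrary assumptions, and only the field order and the 10/11/0x type values come from the slide (the CRC/check fields are omitted).

```python
# Assumed field widths in bits, purely illustrative:
VC_BITS, TYPE_BITS, ROUTE_BITS, PAYLOAD_BITS = 2, 2, 8, 16

TYPE_CREDIT, TYPE_FLIT = 0b10, 0b11    # slide's example: 10 -> credit, 11 -> flit, 0x -> idle

def pack_head_flit(vc, route, payload):
    """Pack a head flit as VC | type | route info | payload (CRC omitted)."""
    word = (vc << TYPE_BITS) | TYPE_FLIT
    word = (word << ROUTE_BITS) | (route & ((1 << ROUTE_BITS) - 1))
    word = (word << PAYLOAD_BITS) | (payload & ((1 << PAYLOAD_BITS) - 1))
    return word

def pack_credit(vc):
    """Pack a credit as VC | type (check field omitted)."""
    return (vc << TYPE_BITS) | TYPE_CREDIT

print(hex(pack_head_flit(vc=1, route=0x2A, payload=0xBEEF)))
print(bin(pack_credit(vc=3)))
```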

ROUTER COMPONENTS
Datapath:
- Input buffer: holds waiting flits
- Switch: routes flits from inputs to outputs
- Output unit: sends flits downstream
Control:
- Arbiter: grants access to a shared resource
- Allocator: allocates VCs to packets and switch time slots to flits

Input Buffer
Smooths out flit traffic; holds flits awaiting VCs, switch bandwidth, or channel bandwidth.
Organization options:
- Centralized
- Partitioned per physical channel
- Partitioned per VC

Centralized Input Buffer
A single memory shared across the entire router. No separate switch is needed, but the inputs must be multiplexed into the memory and the memory output demultiplexed to the output ports.
Pro:
- Flexibility in allocating memory space
Cons:
- High memory bandwidth requirement: 2I accesses per flit time (write I inputs, read I outputs), where I is the node degree
- Flit deserialization/reserialization latency: I flits must be gathered from the VCs before each memory write

Partitioned Input Buffers
One buffer per physical input port:
- Each memory needs only 2 accesses per flit time (1 read, 1 write)
- The buffer is shared across the VCs of that port but not across ports: less flexibility
One buffer per VC:
- Enables switch input speedup, at the cost of a bigger switch
- Granularity is too fine: inefficient memory usage
Intermediate solutions: e.g., one memory for the even VCs and another for the odd VCs

Input Buffer Data Structures
Data structures are needed to track flit and packet locations in memory, manage the free memory, allocate multiple VCs, and prevent blocking.
Two common types:
- Circular buffers: static and simpler, but inefficient memory usage
- Linked lists: dynamic and more complex, but fairer memory usage
Nomenclature:
- Buffer (flit buffer): the entire structure
- Cell (flit cell): storage for a single flit

Circular Buffer
First and Last pointers are fixed: they specify the memory region assigned to a VC.
Head and Tail pointers specify the current contents:
- A flit is added at the tail; the tail is incremented (modulo the region size); Tail = Head → buffer full
- A flit is removed from the head; the head is incremented (modulo the region size); Head = Tail → buffer empty
Choose the size N as a power of 2 so that the low-order log2(N) bits implement the circular increment, much like a cache line index and byte offset.
(Figure example: flits a, b, c, d removed; g, h, i, j added.)
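
A runnable sketch of one VC's circular buffer; note that it keeps a cell count to tell the full and empty cases apart (the Head = Tail test alone cannot distinguish them), and the class and method names are assumptions.

```python
class CircularFlitBuffer:
    """Fixed region of N cells for one VC; head and tail wrap modulo N.
    N is a power of 2 so the low log2(N) bits implement the circular increment."""

    def __init__(self, n_cells):
        assert n_cells & (n_cells - 1) == 0, "use a power-of-2 size"
        self.cells = [None] * n_cells
        self.mask = n_cells - 1      # LSBs do the wrap, like a cache index/offset
        self.head = 0                # next cell to remove
        self.tail = 0                # next free cell to fill
        self.count = 0               # distinguishes full from empty when head == tail

    def add(self, flit):
        assert self.count < len(self.cells), "buffer full"
        self.cells[self.tail] = flit
        self.tail = (self.tail + 1) & self.mask
        self.count += 1

    def remove(self):
        assert self.count > 0, "buffer empty"
        flit, self.cells[self.head] = self.cells[self.head], None
        self.head = (self.head + 1) & self.mask
        self.count -= 1
        return flit

buf = CircularFlitBuffer(8)
for f in "abcd":
    buf.add(f)
print(buf.remove(), buf.remove())   # a b
```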

Linked List Buffer
Each cell has a pointer field to the next cell; Head and Tail registers point to the first and last cells (NULL for an empty buffer).
Free list: a linked list of the free cells, with a Free register pointing to its head.
Counter registers track the number of allocated cells in each buffer and the number of cells in the free list.
Bit errors in the pointers have a more severe effect than in a circular buffer.
(Figure example: adding cell e, removing cell a.)
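
A sketch of the linked-list organization with a shared free list; array indices stand in for the next-cell pointers, and all names here are assumptions.

```python
class LinkedListBuffers:
    """A pool of cells shared by several VC buffers.  Each cell stores a flit and
    the index of the next cell; a free list links the unused cells together."""

    def __init__(self, n_cells, n_vcs):
        self.flit = [None] * n_cells
        self.next = list(range(1, n_cells)) + [None]   # chain every cell onto the free list
        self.free = 0                                   # head of the free list
        self.free_count = n_cells
        self.head = [None] * n_vcs
        self.tail = [None] * n_vcs
        self.count = [0] * n_vcs                        # allocated cells per VC buffer

    def add(self, vc, flit):
        assert self.free is not None, "no free cells"
        cell, self.free = self.free, self.next[self.free]   # pop a cell off the free list
        self.free_count -= 1
        self.flit[cell], self.next[cell] = flit, None
        if self.head[vc] is None:
            self.head[vc] = cell                 # buffer was empty
        else:
            self.next[self.tail[vc]] = cell      # append after the old tail
        self.tail[vc] = cell
        self.count[vc] += 1

    def remove(self, vc):
        cell = self.head[vc]
        assert cell is not None, "VC buffer empty"
        self.head[vc] = self.next[cell]
        if self.head[vc] is None:
            self.tail[vc] = None
        flit = self.flit[cell]
        self.next[cell], self.free = self.free, cell         # push the cell back on the free list
        self.free_count += 1
        self.count[vc] -= 1
        return flit

pool = LinkedListBuffers(n_cells=8, n_vcs=2)
pool.add(0, "a"); pool.add(1, "x"); pool.add(0, "b")
print(pool.remove(0), pool.remove(0))   # a b
```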

Buffer Memory Allocation
Prevent a greedy VC from flooding all of memory and blocking the others.
Add a count register to each input-VC state vector to track its number of allocated cells, plus an additional counter for the free list.
Simple policy (reserve 1 cell for each VC):
  Add a flit to buffer_VCi if buffer_VCi is empty, or #(free list) > #(empty VCs)
Detailed policy, the sliding limit allocator (r: reserved cells per buffer, f: fraction of the free space a buffer may use):
  Add a flit to buffer_VCi if |buffer_VCi| < r, or r ≤ |buffer_VCi| < f·#(free list) + r
  With f = r = 1 this reduces to the simple policy.
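
Both admission policies reduce to a single predicate; a sketch under the slide's definitions, with r and f as parameters (the function and argument names are assumptions).

```python
def may_add_flit(vc_count, free_count, r=1, f=1.0):
    """Sliding-limit allocator check for buffer VCi.

    vc_count   -- cells already allocated to this VC's buffer, |buffer_VCi|
    free_count -- cells currently on the free list
    r          -- cells reserved per buffer
    f          -- fraction of the free space one buffer may use beyond its reserve
    """
    if vc_count < r:                        # a VC may always use its reserved cells
        return True
    return vc_count < f * free_count + r    # sliding limit on additional cells

def may_add_flit_simple(vc_count, free_count, empty_vcs):
    """Simple policy (the f = r = 1 case): add if this buffer is empty
    or the free cells outnumber the empty VCs, so their reserves stay intact."""
    return vc_count == 0 or free_count > empty_vcs

print(may_add_flit(vc_count=5, free_count=8, r=1, f=0.5))   # 5 < 0.5*8 + 1 is False
```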

SWITCH
The core of the router: directs packets and flits to their destinations.
Speedup = (provided switch bandwidth) / (minimum switch bandwidth required for full throughput on all router inputs and outputs).
Adding speedup simplifies allocation and yields higher throughput and lower latency.
Realizations: bus switch, crossbar, network switch.

Bus Switches
Switching in time: an input port accumulates P phits of a flit, arbitrates for the bus, and transmits the P phits over the bus to any output unit (e.g., P=3; fig. 17.5 shows P=4). P is the number of switch input ports.
Feasible only if flits contain more than P phits (preferably an integer multiple of P).
Fragmentation loss occurs when the number of phits per flit is not a multiple of P.

Bus Timing Diagram
(Figure: bus timing diagram; the annotation notes that transmission could actually start earlier than shown. P is the number of switch input ports.)

Bus Pros & Cons
Pros:
- Simple switch allocation
- The input port that owns the bus can reach all output ports, so multicast is easy
Cons:
- Wasted port bandwidth: with port bandwidth b, router bandwidth = Pb, bus bandwidth = Pb, and each input deserializer and output serializer also runs at Pb, so the available internal bandwidth is P × Pb = P²b while only Pb of bus bandwidth is used (speedup = 1)
- Increased latency: from P+1 to 2P phit times, 2P in the worst case (see fig. 17.6, the bus timing diagram)
(P: number of switch input ports)

Xbar Switches
Primary design issue: speedup.
1. k×k: no speedup (fig. 17.10(a))
2. sk×k: input speedup of s (fig. 17.10(b))
3. k×sk: output speedup of s (fig. 17.11(a))
4. sk×sk: overall speedup of s (fig. 17.11(b))
(Speedup simplifies allocation.)

Xbar Throughput
Input speedup. Example: random separable allocator, input speedup s, uniform traffic:
  Throughput = P{at least one of the sk flits is destined for a given output}
             = 1 - P{none of the sk inputs chooses the given output}
             = 1 - ((k-1)/k)^(sk)
  With s = k the throughput is claimed to reach 100%, although that does not follow from the formula above.
Output speedup: requires implementing the allocator in the reverse direction; more complicated for the same gain.
Overall speedup (both input and output): can momentarily achieve more than 100% throughput, but cannot sustain it, since the output buffers would grow without bound and the input buffers would have to start out filled with an infinite number of flits.
Input speedup si combined with output speedup so (si > so): behaves like an input speedup of si/so together with an overall speedup of so.
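
The input-speedup expression is easy to check numerically; a quick sketch evaluating Throughput = 1 - ((k-1)/k)^(sk) for a few speedups.

```python
def xbar_throughput(k, s):
    """Expected throughput of a k x k crossbar with input speedup s under a
    random separable allocator and uniform traffic: 1 - ((k-1)/k)**(s*k)."""
    return 1.0 - ((k - 1) / k) ** (s * k)

for s in (1, 2, 3, 4):
    print(f"k=4, s={s}: throughput = {xbar_throughput(4, s):.3f}")
# k=4, s=1 gives about 0.684; even s = k = 4 gives about 0.990, not exactly 100%,
# which matches the slide's remark that the 100% claim does not verify.
```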

Network Switches
A network of smaller switches:
- Reduces the number of crosspoints, localizes the logic, and shortens the wires
- Requires complex control or intermediate buffering, so it is not very profitable
Example: a 7×7 switch built from three 3×3 switches uses 3 × 9 = 27 crosspoints instead of 7 × 7 = 49.

OUTPUT UNIT
Essentially a FIFO that matches the switch speed to the channel.
If the switch output speedup is 1, it merely latches flits onto the downstream channel.
There is no need to partition it across VCs.
It provides backpressure to the SA to prevent buffer overflow: the SA should block traffic headed for a congested output buffer.

ARBITER
Resolves multiple requests for a single resource (N → 1).
Arbiters are the building blocks of allocators (N1 → N2).
Communication and timing: (figure: request/grant/hold signaling, contrasting the not-holding and holding cases and 1-cycle versus 4-cycle grants.)

Arbiter Types
- Fixed priority: r0 > r1 > r2 > …
- Variable (iterative) priority: rotate the priorities; build a carry chain with a hot '1' inserted by the priority inputs, e.g., the order r1 > r2 > … > r0 corresponds to (p0, p1, p2, …, pn) = 010…0
- Matrix: implements a least-recently-served (LRS) scheme with a triangular array, where M(r,c) = 1 means requester r has priority over requester c
- Queueing: first come, first served, in the style of a bank or travel-agency ticket system; a ticket counter hands the current ticket number to each requester and increments with each ticket, while a served counter stores the number currently being served and increments for the next customer
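
A sketch of the matrix (least-recently-served) arbiter; the convention M[r][c] = 1 meaning requester r has priority over requester c follows the slide, while the surrounding implementation is assumed.

```python
class MatrixArbiter:
    """Least-recently-served arbiter.  M[r][c] == 1 means requester r currently has
    priority over requester c.  After a grant, the winner loses priority against
    everyone, making it the least recently served."""

    def __init__(self, n):
        self.n = n
        # Initial priority order r0 > r1 > r2 > ... (upper triangle set).
        self.M = [[1 if c > r else 0 for c in range(n)] for r in range(n)]

    def arbitrate(self, requests):
        """requests: list of booleans; returns the granted index or None."""
        for r in range(self.n):
            if not requests[r]:
                continue
            # r wins if it has priority over every other active requester.
            if all(self.M[r][c] for c in range(self.n) if c != r and requests[c]):
                for c in range(self.n):          # update: everyone now beats r
                    if c != r:
                        self.M[r][c] = 0
                        self.M[c][r] = 1
                return r
        return None

arb = MatrixArbiter(3)
print(arb.arbitrate([True, True, False]))   # 0 (highest initial priority)
print(arb.arbitrate([True, True, False]))   # 1 (requester 0 was just served)
```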

ALLOCATOR
Provides a matching between multiple requesters and multiple resources: an n×m allocator.
Example, the switch allocator: every cycle, match input ports to output ports such that at most 1 flit leaves each input port and at most 1 flit arrives at each output port.
Request and grant matrices: r_ij means requester i wants access to resource j; g_ij means requester i is granted access to resource j.
Allocation rules:
- g_ij ⇒ r_ij: grant only if requested
- g_ij ⇒ no other g_ik: only 1 grant per requester (input)
- g_ij ⇒ no other g_kj: only 1 grant per resource (output)
A resulting matching may be maximal (no further grant can be added) or maximum (the largest possible number of grants).

Allocation Problem
Can be represented as finding a maximum-matching grant matrix, which is also a maximum matching in a bipartite graph.
Exact algorithms, such as the augmenting-path method, always find a maximum matching but are not feasible within a router's time budget.
Faster heuristics, the separable allocators: two sets of arbitration, one across the inputs and one across the outputs, performed in either order (input-first or output-first).

4×3 Input-first Separable Allocator
(Figure: logic diagram of a 4×3 input-first separable allocator.)
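
A sketch of an input-first separable allocator with the figure's 4×3 shape: each input first keeps one of its requests, then each output grants one of the surviving requests. Simple fixed-priority arbiters are assumed in both stages; an iterative or matrix arbiter could be substituted.

```python
def fixed_priority_arbiter(requests):
    """Return the index of the first asserted request, or None."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None

def separable_allocate_input_first(R):
    """Input-first separable allocation on an n x m request matrix R.
    Stage 1: each input (row) keeps at most one of its requests.
    Stage 2: each output (column) grants at most one surviving request.
    Returns a grant matrix G obeying the three allocation rules."""
    n, m = len(R), len(R[0])

    # Input stage: one m:1 arbiter per input.
    X = [[0] * m for _ in range(n)]
    for i in range(n):
        j = fixed_priority_arbiter(R[i])
        if j is not None:
            X[i][j] = 1

    # Output stage: one n:1 arbiter per output.
    G = [[0] * m for _ in range(n)]
    for j in range(m):
        winner = fixed_priority_arbiter([X[k][j] for k in range(n)])
        if winner is not None:
            G[winner][j] = 1
    return G

# 4 requesters (inputs) x 3 resources (outputs), as in the figure.
R = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]
for row in separable_allocate_input_first(R):
    print(row)
```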