
1 Network-on-Chip (1/2). Ben Abdallah, Abderazek, The University of Aizu. E-mail: benab@u-aizu.ac.jp. KUST University, March 2011

2 Part 1: Application Requirements; Network on Chip: A Paradigm Shift in VLSI; Critical Problems Addressed by NoC; Traffic Abstractions; Data Abstraction; Network Delay Modeling

3 Application Requirements. Signal processing: hard real time, very regular load, high quality; typically on DSPs. Media processing: hard real time, irregular load, high quality; SoC/media processors. Multimedia: soft real time, irregular load, limited quality; PC/desktop. Very challenging!

4 What does the Internet need? High processing power, wire-speed support, programmability and scalability, especially for network applications: a huge and increasing amount of packets, plus routing, packet classification, encryption, QoS, new applications and protocols, etc. General-purpose RISC is not capable enough; ASICs are large, expensive to develop, and not flexible. Hence the MCSoC approach.

5 Example - Network Processor (NP): the IBM PowerNP.

6 Example - Network Processor (NP). NPs can be applied in various network layers and applications: traditional apps (forwarding, classification); advanced apps (transcoding, URL-based switching, security, etc.); and new apps.

7 Telecommunication Systems and NoC Paradigm. The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoCs): network processors, multimedia hubs, and base-band telecom circuits. These applications have tight time-to-market and performance constraints.

8 Telecommunication Systems and NoC Paradigm  Telecommunication multicore SoC is composed of 4 kinds of components:  Software tasks  Processors executing software  Specific hardware cores, and  Global on-chip communication network 8

9 Telecommunication Systems and NoC Paradigm  Telecommunication multicore SoC is composed of 4 kinds of components: 1.Software tasks, 2.Processors executing software, 3.Specific hardware cores, and 4.Global on-chip communication network 9 This is the most challenging part

10 Technology & Architecture Trends  Technology trends:  Vast transistor budgets  Relatively poor interconnect scaling  Need to manage complexity and power  Build flexible designs (multi-/general- purpose)  Architectural trends:  Go parallel ! 10

11 Wire Delay vs. Logic Delay

Operation                                    Delay (0.13 um)   Delay (0.05 um)
32-bit ALU operation                         650 ps            250 ps
32-bit register read                         325 ps            125 ps
Read 32 bits from 8 KB RAM                   780 ps            300 ps
Transfer 32 bits across chip (10 mm)         1400 ps           2300 ps
Transfer 32 bits across chip (20 mm)         2800 ps           4600 ps

The ratio of global on-chip communication delay to operation delay is about 2:1 at 0.13 um, growing to about 9:1 in 2010. Ref: W. J. Dally, HPCA panel presentation, 2002.

12  Information transfer is inherently unreliable at the electrical level, due to:  Timing errors  Cross-talk  Electro-magnetic interference (EMI)  Soft errors  The problem will get increasingly worse as technology scales down Communication Reliability 12

13 Evolution of on-chip Communication 13

14 Traditional SoC Nightmare: a variety of dedicated interfaces, design and verification complexity, unpredictable performance, many underutilized wires. (Figure: a bus-based SoC with CPU, DSP, DMA and peripherals connected over a CPU bus, a peripheral bus, a bridge, and ad-hoc control signals.)

15 Network on Chip: A Paradigm Shift in VLSI. From dedicated signal wires to a shared network: computing modules connected through network switches over point-to-point links. (Figure: modules attached to a grid of switches.)

16 Network on Chip: A Paradigm Shift in VLSI. (Figure: an example NoC in which CPU, DSP, accelerator, MPEG, coprocessor, DMA, DRAM, Ethernet and I/O blocks attach through network interfaces (NI) to a fabric of switches.)

17 NoC Essentials: communication by packets of bits; routing of packets through several hops, via switches; efficient sharing of wires; parallelism.

18 NoC Operation Example: 1. CPU request. 2. Packetization and transmission. 3. Routing. 4. Receipt and unpacketization (AHB, OCP, ... pinout). 5. Device response (if needed). 6. Packetization and transmission. 7. Routing. 8. Receipt and unpacketization. (Figure: a CPU and an I/O device attached through network interfaces to a chain of switches.)
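
A minimal Python sketch of steps 2 and 4 (packetization at the source network interface, unpacketization at the destination); the field names and packet layout are illustrative assumptions, not the actual AHB/OCP mapping used in the lecture:

```python
# Illustrative NI packetization/unpacketization (steps 2 and 4 above).
# Field names and the dict-based "packet" are assumptions for illustration.

def packetize(src, dst, addr, data=None):
    """Wrap a CPU read/write request into a simple packet (header + payload)."""
    header = {"src": src, "dst": dst, "type": "write" if data is not None else "read"}
    payload = {"addr": addr, "data": data}
    return {"header": header, "payload": payload}

def unpacketize(packet):
    """Recover the original request at the destination network interface."""
    return packet["header"]["type"], packet["payload"]["addr"], packet["payload"]["data"]

# CPU at node 0 issues a read of address 0x1000 to the I/O device at node 5.
pkt = packetize(src=0, dst=5, addr=0x1000)
print(unpacketize(pkt))   # ('read', 4096, None)
```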

19 Characteristics of a Paradigm Shift  Solves a critical problem  Step-up in abstraction  Design is affected:  Design becomes more restricted  New tools  The changes enable higher complexity and capacity  Jump in design productivity 19


21 Origins of the NoC Concept. The idea was discussed in the 1990s, but concrete research arrived in the new millennium. Some well-known early publications: Guerrier and Greiner (2000), "A generic architecture for on-chip packet-switched interconnections"; Hemani et al. (2000), "Network on chip: An architecture for billion transistor era"; Dally and Towles (2001), "Route packets, not wires: on-chip interconnection networks"; Wingard (2001), "MicroNetwork-based integration of SoCs"; Rijpkema, Goossens and Wielage (2001), "A router architecture for networks on silicon"; De Micheli and Benini (2002), "Networks on chip: A new paradigm for systems on chip design".

22 Don't we already know how to design interconnection networks? Many network topologies, router designs and much theory have already been developed for high-end supercomputers and telecom switches. Yes, and we'll cover some of this material, but the on-chip trade-offs lead to very different designs!

23 Critical problems addressed by NoC: 1) the global interconnect design problem (delay, power, noise, scalability, reliability); 2) the system integration productivity problem; 3) chip multiprocessors (the key to power-efficient computing).

24 1(a): NoC and global wire delay. Long wire delay is dominated by resistance; add repeaters; repeaters become latches (with clock frequency scaling); latches evolve into NoC routers.

25 1(b): Wire design for NoC. NoC links are regular and point-to-point (no fanout tree); they can use a transmission-line layout with a well-defined current return path, and can be optimized for noise, speed and power (low swing, current mode, ...).

26 1(c): NoC Scalability. For the same performance, compare the wire area and power: NoC O(n); point-to-point O(n^2 √n), O(n √n); simple bus O(n^3 √n), O(n √n); segmented bus O(n^2 √n), O(n √n).

27 1(d): NoC and Communication Reliability: fault tolerance and error correction. (Figure: routers connected through "micro-modem" link interfaces providing error correction, synchronization, ISI reduction, parallel-to-serial conversion and modulation, with input buffering.) A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, "Micro-modem - reliability solution for NoC communications", ICECS 2004.

28 1(e): NoC and GALS  Modules in NoC System use different clocks  May use different voltages  NoC can take care of synchronization  NoC design may be asynchronous  No waste of power when the links and routers are idle 28

29 2: NoC and Engineering Productivity  NoC eliminates ad-hoc global wire engineering  NoC separates computation from communication  NoC supports modularity and reuse of cores  NoC is a platform for system integration, debugging and testing 29

30 3: NoC and Multicore. Uniprocessors cannot provide power-efficient performance growth: interconnect dominates dynamic power, global wire delay doesn't scale, and instruction-level parallelism is limited. (Figure: uniprocessor dynamic power broken down into gate, interconnect and diffusion components; Magen et al., SLIP 2004.)

31 3: NoC and Multicore. Power-efficiency requires many parallel local computations: chip multiprocessors (multicore) and thread-level parallelism (TLP). (Figure: uniprocessor performance plotted against die area or power.)

32 3: NoC and Multicore  Uniprocessors cannot provide Power-efficient performance growth  Interconnect dominates dynamic power  Global wire delay doesn’t scale  Instruction-level parallelism is limited  Power-efficiency requires many parallel local computations  Chip Multi Processors (CMP)  Thread-Level Parallelism (TLP)  Network is a natural choice for CMP! 32

33 3: NoC and Multicore  Uniprocessors cannot provide Power-efficient performance growth  Interconnect dominates dynamic power  Global wire delay doesn’t scale  Instruction-level parallelism is limited  Power-efficiency requires many parallel local computations  Chip Multi Processors (CMP)  Thread-Level Parallelism (TLP)  Network is a natural choice for CMP! 33 Network is a natural choice for Multicore

34 Why is now the time for NoC? The difficulty of DSM wire design, productivity pressure, and multicore.

35 Traffic Abstractions. Traffic models are generally captured from actual traces of functional simulation; a statistical distribution is often assumed for messages. (Figure: a communication graph of twelve PEs, PE1 through PE12.)

36 Data Abstractions 36

37 Layers of Abstraction in Network Modeling. Software layers: application, OS. Network and transport layers: network topology (e.g., crossbar, ring, mesh, torus, fat tree, ...); switching (circuit/packet switching (SAF, VCT), wormhole); addressing (logical/physical, source/destination, flow, transaction); routing (static/dynamic, distributed/source, deadlock avoidance); quality of service (e.g., guaranteed-throughput, best-effort); congestion control and end-to-end flow control.

38 Layers of Abstraction in Network Modeling  Data link layer  Flow control (handshake)  Handling of contention  Correction of transmission errors  Physical layer  Wires, drivers, receivers, repeaters, signaling, circuits,.. 38

39 How to Select an Architecture? The architecture choice depends on system needs. (Figure: architectures such as ASIC, FPGA, ASSP and CMP/multicore arranged by reconfiguration rate (at design time, at boot time, during run time) and flexibility, from single-application to general-purpose or embedded systems.)

40 How to Select an Architecture? The architecture choice depends on system needs; there is a large range of solutions! (Figure: the same flexibility vs. reconfiguration-rate chart as on the previous slide.)

41 Example: OASIS NoC. A. Ben Abdallah and M. Sowa, "Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization", TJASSST, Dec. 4-9, 2006.

42 Perspective 1: NoC vs. Bus.
NoC: aggregate bandwidth grows; link speed is unaffected by N; concurrent spatial reuse; pipelining is built in; distributed arbitration; separate abstraction layers. However: no performance guarantee; extra delay in routers; area and power overhead; modules need an NI; unfamiliar methodology.
Bus: bandwidth is shared; speed goes down as N grows; no concurrency; pipelining is tough; central arbitration; no layers of abstraction (communication and computation are coupled). However: fairly simple and familiar.

43 Perspective 2: NoC vs. Off-chip Networks.
Off-chip networks: cost is in the links; latency is tolerable; traffic and applications are unknown and change at runtime; adherence to networking standards.
NoC: sensitive to cost (area, power); wires are relatively cheap; latency is critical; traffic may be known a priori; design-time specialization; custom NoCs are possible.

44 VLSI CAD Problems  Application mapping  Floorplanning / placement  Routing  Buffer sizing  Timing closure  Simulation  Testing 44

45 VLSI CAD Problems in NoC  Application mapping (map tasks to cores)  Floorplanning / placement (within the network)  Routing (of messages)  Buffer sizing (size of FIFO queues in the routers)  Simulation (Network simulation, traffic/delay/power modeling)  Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control,……) 45

46 Typical NoC Design Flow: place modules, then determine routing and adjust link capacities.

47 Timing Closure in NoC. Flow: define inter-module traffic, place modules, increase link capacities, then check whether QoS is satisfied; if not, iterate; if yes, finish. Too low a capacity results in poor QoS; too high a capacity wastes area; uniform link capacities are a waste in an ASIP system.
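
A Python sketch of this iterative loop, assuming a placeholder QoS check and a placeholder 10% capacity step (neither value comes from the lecture):

```python
# Sketch of the timing-closure loop: capacities are increased only on links
# that miss their QoS target, so the final capacities are non-uniform.
# The traffic model, QoS check and 1.1x step are placeholders.

def close_timing(link_capacity, find_violations, step=1.1, max_iters=100):
    """link_capacity: dict link -> capacity.
    find_violations(link_capacity) -> set of links whose QoS target is missed."""
    for _ in range(max_iters):
        violating = find_violations(link_capacity)
        if not violating:
            return link_capacity              # QoS satisfied: finish
        for link in violating:                # widen only the overloaded links
            link_capacity[link] *= step
    raise RuntimeError("QoS not satisfied within the iteration budget")

# Toy usage: a link violates QoS until its capacity covers its offered load.
load = {"A->B": 3.0, "B->C": 1.0}
caps = {"A->B": 1.0, "B->C": 1.0}
print(close_timing(caps, lambda c: {l for l in c if load[l] > c[l]}))
```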

48 NoC Design Requirements: a high-performance interconnect (throughput, latency, power, area); complex functionality (performance again): support for virtual channels, QoS, synchronization, reliability, high throughput and low latency.

49 Part 2 Routing algorithms Flow control schemes Clocking Schemes QoS Putting it all together 49

50 NoC Routing Algorithms are responsible for correctly and efficiently routing packets or circuits from source to destination. They must prevent deadlock, livelock, and starvation. The choice of a routing algorithm depends on: minimizing the power required for routing; minimizing logic and routing tables.

51 Routing Algorithm Classifications. Three different criteria: Where are the routing decisions taken? Source routing or distributed routing. How is a path defined? Static (deterministic) or adaptive. Path length: minimal or non-minimal.

52 NoC Routing-Table  The Routing Table determines for each PE the route via which it will send packets to other PEs  The routing table directly influences traffic in the NoC  Here we can also distinguish between 2 methods:  Static routing  Dynamic (adaptive) routing 52

53 Static Routing. The routing table is constant. The route is embedded in the packet header, and the routers simply forward the packet in the direction indicated by the header. The routers are passive in their addressing of packets (simple routers).
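
A minimal Python sketch of this "passive router" idea: the whole route is written into the header at the source and each router just pops the next output port. The port names and packet structure are assumptions for illustration:

```python
# Source routing sketch: the header carries the full list of output ports,
# so routers need no tables. Port names ("E", "N", ...) are illustrative.

def make_packet(route, payload):
    return {"route": list(route), "payload": payload}   # e.g. ["E", "E", "N"]

def router_forward(packet):
    """A passive router: take the next hop from the header, no table lookup."""
    if not packet["route"]:
        return "LOCAL", packet          # route exhausted: deliver to local PE
    return packet["route"].pop(0), packet

pkt = make_packet(["E", "E", "N"], payload="hello")
print(router_forward(pkt)[0])   # 'E'
```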

54 Dynamic Routing  The routing table can change dynamically during operation  Logically, a route is changed when it becomes slow due to other traffic  Possibly out-of-order arrival of packets.  Usually requires more virtual channels 54

55 Dynamic Routing. More resources are needed to monitor the state of the network. (Figure: an adaptive routing scheme steering a packet around a blocked link.)

56 Routing Algorithm Requirements. The routing algorithm must ensure freedom from deadlocks (e.g., a cyclic dependency among channel resources). It must also ensure freedom from livelocks and starvation.
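
One standard way to guarantee deadlock freedom on a 2D mesh (a textbook technique, not something specific to this lecture) is dimension-ordered XY routing; a Python sketch follows, with illustrative coordinates and port names:

```python
# Dimension-ordered (XY) routing on a 2D mesh: route fully in X, then in Y.
# Because Y-to-X turns never occur, no cyclic channel dependency (and hence
# no routing deadlock) can form.

def xy_next_port(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "LOCAL"

# Route from (0, 0) to (2, 1): E, E, N, then deliver locally.
hop, pos, dst = None, (0, 0), (2, 1)
while hop != "LOCAL":
    hop = xy_next_port(pos, dst)
    print(hop)
    if hop == "E":   pos = (pos[0] + 1, pos[1])
    elif hop == "W": pos = (pos[0] - 1, pos[1])
    elif hop == "N": pos = (pos[0], pos[1] + 1)
    elif hop == "S": pos = (pos[0], pos[1] - 1)
```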

57 Flow Control Schemes - STALL/GO. A low-overhead scheme requiring only two control wires: one going forward and signaling data availability, the other going backward and signaling either a condition of buffers filled (STALL) or of buffers free (GO).

58 Flow Control Schemes - STALL/GO. Repeaters are two-entry FIFOs; no retransmission is allowed. (Figure: sender-to-receiver pipeline stages with FLIT, REQ and STALL signals at each hop.)
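
A behavioural Python sketch of one STALL/GO repeater stage with a two-entry FIFO, as on this slide; the class interface and signal names are assumptions:

```python
# STALL/GO repeater stage: the forward wire says "data available", the
# backward wire says STALL when the two-entry FIFO is full.
from collections import deque

class StallGoStage:
    def __init__(self, depth=2):
        self.fifo = deque(maxlen=depth)

    @property
    def stall(self):               # backward wire: buffers full
        return len(self.fifo) == self.fifo.maxlen

    def push(self, flit, valid):   # forward wire: data availability
        if valid and not self.stall:
            self.fifo.append(flit)
            return True            # flit accepted
        return False               # sender must hold the flit (no retransmission)

stage = StallGoStage()
print(stage.push("flit0", True), stage.push("flit1", True), stage.push("flit2", True))
# True True False  -> the third flit is stalled until the stage drains
```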

59 Flow Control Schemes - T-Error  More aggressive scheme that can detect faults  by making use of a second delayed clock at every buffer stage  Delayed clock re-samples input data to detect any inconsistencies 59

60 Flow Control Schemes - ACK/NACK  When flits are sent on a link, a local copy is kept in a buffer by sender  When ACK received by sender, it deletes copy of flit from its local buffer  When NACK is received, sender rewinds its output queue and starts resending flits, starting from the corrupted one 60

61 Flow Control Schemes - ACK/NACK. Repeaters are pure registers; the buffering and retransmission logic sit in the sender. (Figure: sender-to-receiver pipeline stages with FLIT, REQ and ACK signals at each hop.)

62 Example: ACK/NACK Flow Control. (Figure: a timeline showing transmission, ACK and buffering, memory deallocation on ACK, NACK propagation, and go-back-N retransmission.)
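
A behavioural Python sketch of the sender side of ACK/NACK with go-back-N, as described on the previous slides; the bookkeeping and method names are assumptions for illustration:

```python
# ACK/NACK sender: a local copy of every in-flight flit is kept. ACK releases
# the oldest copy; NACK rewinds the output queue and resends from there on.

class AckNackSender:
    def __init__(self, flits):
        self.queue = list(flits)   # flits not yet acknowledged (local copies)
        self.next = 0              # index of the next flit to transmit

    def transmit(self):
        if self.next < len(self.queue):
            flit = self.queue[self.next]
            self.next += 1
            return flit

    def on_ack(self):              # oldest in-flight flit arrived correctly
        self.queue.pop(0)
        self.next -= 1

    def on_nack(self):             # corruption detected: go back N
        self.next = 0              # resend everything still unacknowledged

s = AckNackSender(["f0", "f1", "f2"])
s.transmit(); s.transmit()   # f0, f1 in flight
s.on_nack()                  # f0 was corrupted
print(s.transmit())          # 'f0' is resent
```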

63 Clocking Schemes. Fully synchronous: a single global clock is distributed to synchronize the entire chip; hard to achieve in practice due to process variations and clock skew. Mesochronous: local clocks are derived from a global clock; not sensitive to clock skew. Plesiochronous: clock signals are produced locally. Asynchronous: clocks do not have to be present at all.

64 Clocking Schemes. Mesochronous: local clocks are derived from a global clock; not sensitive to clock skew. (Figure: PE, NI and switch connected through synchronizers (SYNC), with a CMU.)

65 Quality of Service (QoS) refers to the level of commitment for packet delivery. Three basic categories: Best effort (BE): only correctness and completion of communication are guaranteed; usually packet switched. Guaranteed service (GS): makes a tangible guarantee on performance, in addition to the basic guarantees of correctness and completion; usually (virtual) circuit switched. Differentiated service: prioritizes communication according to different categories; NoC switches employ priority-based scheduling and allocation policies.

66 Putting it all Together 66

67 Packet Format 67 Message, Packet and Flit Formats

68 Size: Phit, Flit, Packet  There are no fixed rules for the size of phits, flits and packets  Message: arbitrarily long  Packets: restricted maximum length  Typical values  Phits: 1 bit to 64 bits  Flits: 16 bits to 512 bits  Packets: 128 bits to 1024 bits 68
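
A small Python sketch of the packet/flit/phit hierarchy; the 128-bit packet, 32-bit flit and 8-bit phit below are just one legal choice inside the typical ranges above, not values prescribed by the lecture:

```python
# A packet is split into flits (flow-control units), and each flit crosses a
# link as one or more phits (physical transfer units). Sizes are illustrative.

def split(bits, unit):
    return [bits[i:i + unit] for i in range(0, len(bits), unit)]

packet = "1011" * 32                   # a 128-bit packet (as a bit string)
flits  = split(packet, 32)             # 4 flits of 32 bits
phits  = [split(f, 8) for f in flits]  # each flit sent as 4 phits of 8 bits
print(len(flits), len(phits[0]))       # 4 4
```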

69 NoC Queuing Schemes. (Figure: an input-queued crossbar with four inputs and four outputs, illustrating the head-of-line (HOL) blocking problem of input queuing.)

70 Flow Control Schemes - Stop & Go. (Figure: transmitter and receiver with a stop wire and a packet-enable wire; the receiver asserts stop above a stop threshold as its buffer fills and releases it at a go threshold as the buffer drains.) Minimum buffer size = flit size x (R_overhead + S_overhead + 2 x link delay), where R_overhead is the time required to issue the stop signal at the receiving router and S_overhead is the time required to stop sending flits once the stop signal is received.
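
A worked example of this minimum-buffer-size formula, with assumed (purely illustrative) values for the flit size, overheads and link delay:

```python
# Worked example of: min buffer = flit size x (R_overhead + S_overhead + 2 x link delay).
# All numbers below are assumptions chosen only to illustrate the arithmetic.

flit_size  = 32   # bits per flit
r_overhead = 2    # cycles to issue the stop signal at the receiving router
s_overhead = 2    # cycles to actually stop sending once stop is seen
link_delay = 3    # cycles of wire/pipeline delay on the link

min_buffer = flit_size * (r_overhead + s_overhead + 2 * link_delay)
print(min_buffer)  # 32 * (2 + 2 + 6) = 320 bits of buffering
```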

71 Flow Control Schemes - Credit-Based (CB). (Figure: the transmitter decrements a credit each time a flit is transferred; the credit is incremented when the receiver releases a buffer slot.) CB makes the best use of the channel buffers and can be implemented regardless of the link length and of the sender and receiver overhead.
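
A behavioural Python sketch of credit-based flow control; the buffer depth and class interface are assumptions for illustration:

```python
# Credit-based flow control: the sender holds one credit per free buffer slot
# at the receiver, spends a credit per flit sent, and gets credits back as the
# receiver drains its buffer.

class CreditLink:
    def __init__(self, receiver_depth=4):
        self.credits = receiver_depth        # free slots known to the sender
        self.rx_buffer = []

    def send(self, flit):
        if self.credits == 0:
            return False                     # must wait for a credit
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def receiver_drain(self):
        if self.rx_buffer:
            self.rx_buffer.pop(0)
            self.credits += 1                # credit returned to the sender

link = CreditLink(receiver_depth=2)
print(link.send("f0"), link.send("f1"), link.send("f2"))  # True True False
link.receiver_drain()
print(link.send("f2"))                                    # True
```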

72 Queue and Buffer Design. The effective bandwidth of the data link layer is influenced by the traffic pattern and the queue size. Queue buffers consume most of the area and power among all NoC building blocks. (Figure: a conventional shift-register queue in which packets shift past intermediate empty bubbles from input to output.)

73 Network Interface. Different interfaces must be connected to the network. The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol.

74 Network Interface. (Figure: a core with an Open Core Protocol (OCP) interface attached through a network interface that speaks the network protocol; packets consist of header, payload and tail flits.) Transmit: access the routing tables, assemble packets, split them into flits. Receive: synchronize, drop the routing information.

75 Bidirectional Link: 2 x 32-bit data links; asynchronous, credit-based flow control; easy floorplan routing and timing in a DSM process. (Figure: a 36-bit link carrying a 4-bit tag and 32-bit data.)

76 Router Design 76

77 Router Design with Crosspoint. (Figure: a 4-port switch fabric with input queues (IQ), optional output queues (OQ), and an output-port scheduler exchanging request/grant signals; the datapath is phit-1:0 bits wide and the grant is 4 bits.)

78 Scheduler Design. (Figure: a round-robin arbiter built from a programmable priority encoder (PPE) with n request inputs and n grant outputs, plus a simple priority encoder, a thermometer encoder, and any-grant logic; round-robin algorithm circuits with two priority encoders.)
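
A behavioural Python sketch of a round-robin arbiter with a rotating priority pointer; it models the scheduling function only, not the PPE/thermometer-encoder circuit structure on the slide:

```python
# Round-robin arbiter: the input just after the last grant gets the highest
# priority, so every requester is eventually served (no starvation).

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.pointer = 0                       # highest-priority input

    def grant(self, req):
        """req: list of n booleans. Returns the granted index, or None."""
        for i in range(self.n):
            idx = (self.pointer + i) % self.n
            if req[idx]:
                self.pointer = (idx + 1) % self.n   # rotate priority past winner
                return idx
        return None                            # no request: anyGnt = 0

arb = RoundRobinArbiter(4)
print(arb.grant([1, 1, 0, 1]))  # 0
print(arb.grant([1, 1, 0, 1]))  # 1
print(arb.grant([1, 1, 0, 1]))  # 3
```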

79 Phit Size Determination. The phit size is the bit width of a link and determines the switch area. (Figure: without SER/DES, phit size = packet size and the switch operates at f_NORM; with a SER/DES ratio SERR, phit size = packet size / SERR and the link operates at SERR x f_NORM.)
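
A worked example of this serialization trade-off; the packet size, SER/DES ratio and base frequency below are assumed numbers, not values from the lecture:

```python
# Serializing a packet onto a narrower phit shrinks the switch datapath but
# requires a proportionally faster link clock.

packet_size = 64        # bits
f_norm      = 500e6     # Hz, switch operating frequency without SER/DES (assumed)
serr        = 4         # serialization ratio (assumed)

phit_no_serdes = packet_size              # 64-bit links, running at f_norm
phit_serdes    = packet_size // serr      # 16-bit links ...
f_link         = serr * f_norm            # ... clocked 4x faster

print(phit_no_serdes, phit_serdes, f_link)   # 64 16 2000000000.0
```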

80 Part 3 Network Topology Network-on-Chip topology Direct topology Indirect topology Irregular topology 80

81 NoC Topology: the connection map between PEs. Adopted from large-scale networks and parallel computing. Topology classifications: direct topologies and indirect topologies.

82 Network Topology. A good topology makes it possible to fulfill the requirements of the traffic at reasonable cost. A network topology can be compared with a network of roads.

83 Direct Topology. Each switch (SW) is connected to a single PE, and each PE to only a single SW. As the number of nodes in the system increases, the total bandwidth also increases.

84 Direct Topology - Mesh. The 2D mesh is the most popular: all links have the same length, which eases physical design, and area grows linearly with the number of nodes. (Figure: a 4x4 mesh.)

85 Direct Topology - Torus and Folded Torus. A torus is similar to a regular mesh but has an excessive-delay problem due to its long end-around connections. A folded torus overcomes this long-link limitation of the 2-D torus, and its links all have the same size.

86 Direct Topology - Octagon. Messages sent between any two nodes require at most two hops. More octagons can be tiled together to accommodate larger designs.

87 Indirect Topology: a set of PEs is connected to each switch. Fat tree: nodes are connected only to the leaves of the tree; there are more links near the root, where the bandwidth requirements are higher.

88 Indirect Topology - k-ary n-fly Butterfly Network. A blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs. Example: a 2-ary 3-fly butterfly network.

89 Indirect Topology - (m, n, r) Symmetric Clos Network: a 3-stage network in which each stage is made up of a number of crossbar switches.

90 Indirect topology - Benes Network 90 Example: (2, 2, 4) re-arrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches.

91 Irregular Topology - Customized. Customized for an application; usually a mix of shared bus, direct, and indirect network topologies. Example 1: a reduced mesh. Example 2: a cluster-based hybrid topology.

92 Example 1: Partially Irregular 2D-Mesh Topology. Contains oversized, rectangularly shaped PEs.

93 Example 2: Irregular Mesh 93 This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC

94 How to Select a Topology? The application decides the topology type. With a few tens of PEs, star or mesh topologies are recommended; with 100 or more PEs, hierarchical star or mesh topologies are recommended. Some topologies are better for certain designs than others.

