1
Network-on-Chip (1/2) Ben Abdallah, Abderazek The University of Aizu E-mail: benab@u-aizu.ac.jp 1 KUST University, March 2011
2
Part 1: Application Requirements; Network on Chip: A Paradigm Shift in VLSI; Critical Problems Addressed by NoC; Traffic Abstractions; Data Abstraction; Network Delay Modeling
3
Application Requirements. Signal processing: hard real time, very regular load, high quality; typically on DSPs. Media processing: hard real time, irregular load, high quality; SoC/media processors. Multimedia: soft real time, irregular load, limited quality; PC/desktop. Very challenging!
4
What Does the Internet Need? High processing power, wire-speed support, programmability, and scalability, especially for network applications: an ever-growing volume of packets, plus routing, packet classification, encryption, QoS, new applications and protocols, etc. General-purpose RISC processors are not capable enough; ASICs are large, expensive to develop, and not flexible. Hence the interest in MCSoC.
5
Example - Network Processor (NP) 5 IBM PowerNP
6
Example - Network Processor (NP). NPs can be applied in various network layers and applications. Traditional apps: forwarding, classification. Advanced apps: transcoding, URL-based switching, security, etc. New apps.
7
Telecommunication Systems and NoC Paradigm. The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoCs): network processors, multimedia hubs, and base-band telecom circuits. These applications have tight time-to-market and performance constraints.
8
Telecommunication Systems and NoC Paradigm. A telecommunication multicore SoC is composed of 4 kinds of components: software tasks, processors executing software, specific hardware cores, and a global on-chip communication network.
9
Telecommunication Systems and NoC Paradigm. A telecommunication multicore SoC is composed of 4 kinds of components: 1. software tasks, 2. processors executing software, 3. specific hardware cores, and 4. a global on-chip communication network. The last one, the on-chip communication network, is the most challenging part.
10
Technology & Architecture Trends. Technology trends: vast transistor budgets; relatively poor interconnect scaling; need to manage complexity and power; build flexible designs (multi-/general-purpose). Architectural trends: go parallel!
11
Wire Delay vs. Logic Delay
Operation | Delay (0.13 µm) | Delay (0.05 µm)
32-bit ALU operation | 650 ps | 250 ps
32-bit register read | 325 ps | 125 ps
Read 32 bits from 8 KB RAM | 780 ps | 300 ps
Transfer 32 bits across chip (10 mm) | 1400 ps | 2300 ps
Transfer 32 bits across chip (20 mm) | 2800 ps | 4600 ps
Ratio of global on-chip communication delay to operation delay: 2:1 today, 9:1 in 2010. Ref: W. J. Dally, HPCA panel presentation, 2002.
12
Communication Reliability. Information transfer is inherently unreliable at the electrical level, due to: timing errors, cross-talk, electromagnetic interference (EMI), and soft errors. The problem will get increasingly worse as technology scales down.
13
Evolution of on-chip Communication 13
14
Traditional SoC Nightmare. Variety of dedicated interfaces, design and verification complexity, unpredictable performance, many underutilized wires. (Figure: bus-based SoC with CPU, DSP, DMA, and I/O blocks connected via a CPU bus, a peripheral bus, a bridge, and control signals.)
15
Network on Chip: A Paradigm Shift in VLSI. (Figure: from dedicated signal wires to a shared network of point-to-point links, network switches, and computing modules.)
16
Network on Chip: A Paradigm Shift in VLSI. (Figure: an example NoC in which CPU, DSP, DMA, MPEG, accelerator, coprocessor, DRAM, Ethernet, and I/O blocks are connected through network interfaces (NI) and switches.)
17
NoC Essentials. Communication by packets of bits; routing of packets through several hops, via switches; efficient sharing of wires; parallelism.
18
NoC Operation Example. 1. CPU request; 2. packetization and transmission; 3. routing; 4. receipt and unpacketization (AHB, OCP, ... pinout); 5. device response (if needed); 6. packetization and transmission; 7. routing; 8. receipt and unpacketization. (Figure: CPU and I/O device connected through network interfaces and switches.)
19
Characteristics of a Paradigm Shift. Solves a critical problem; step-up in abstraction; design is affected (design becomes more restricted, new tools); the changes enable higher complexity and capacity; jump in design productivity.
21
Origins of the NoC Concept. The idea was talked about in the 1990s, but actual research came in the new millennium. Some well-known early publications: Guerrier and Greiner (2000), "A generic architecture for on-chip packet-switched interconnections"; Hemani et al. (2000), "Network on chip: An architecture for billion transistor era"; Dally and Towles (2001), "Route packets, not wires: on-chip interconnection networks"; Wingard (2001), "MicroNetwork-based integration of SoCs"; Rijpkema, Goossens and Wielage (2001), "A router architecture for networks on silicon"; De Micheli and Benini (2002), "Networks on chip: A new paradigm for systems on chip design".
22
Don't we already know how to design interconnection networks? Many network topologies, router designs, and much theory have already been developed for high-end supercomputers and telecom switches. Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs!
23
Critical problems addressed by NoC: 1) the global interconnect design problem (delay, power, noise, scalability, reliability); 2) the system integration productivity problem; 3) chip multiprocessors (the key to power-efficient computing).
24
1(a): NoC and Global Wire Delay. Long-wire delay is dominated by resistance, so repeaters are added; repeaters become latches (with clock frequency scaling); latches evolve into NoC routers.
25
1(b): Wire Design for NoC. NoC links are regular and point-to-point (no fan-out tree); they can use a transmission-line layout with a well-defined current return path and can be optimized for noise/speed/power (low swing, current mode, ...).
26
1(c): NoC Scalability. For the same performance, compare wire area and power: NoC: O(n); point-to-point: O(n^2 √n) and O(n √n); simple bus: O(n^3 √n) and O(n √n); segmented bus: O(n^2 √n) and O(n √n).
27
1(d): NoC and Communication Reliability. Fault tolerance and error correction are handled in the link interface: error correction, synchronization, ISI reduction, parallel-to-serial conversion, modulation, input buffering. (Figure: routers connected through micro-modem (UMODEM) link interfaces.) A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, "Micro-modem – reliability solution for NOC communications", ICECS 2004.
28
1(e): NoC and GALS. Modules in a NoC system use different clocks and may use different voltages; the NoC can take care of synchronization; the NoC design may be asynchronous, so no power is wasted when the links and routers are idle.
29
2: NoC and Engineering Productivity NoC eliminates ad-hoc global wire engineering NoC separates computation from communication NoC supports modularity and reuse of cores NoC is a platform for system integration, debugging and testing 29
30
3: NoC and Multicore. Uniprocessors cannot provide power-efficient performance growth: interconnect dominates dynamic power, global wire delay doesn't scale, and instruction-level parallelism is limited. (Chart: uniprocessor dynamic power broken down into gate, interconnect, and diffusion; Magen et al., SLIP 2004.)
31
3: NoC and Multicore. Power-efficiency requires many parallel local computations: chip multiprocessors (multicore) and thread-level parallelism (TLP). (Chart: uniprocessor performance vs. die area (or power).)
32
3: NoC and Multicore. Uniprocessors cannot provide power-efficient performance growth: interconnect dominates dynamic power, global wire delay doesn't scale, and instruction-level parallelism is limited. Power-efficiency requires many parallel local computations: chip multiprocessors (CMP) and thread-level parallelism (TLP). A network is a natural choice for CMP!
34
Why Now is the time for NoC? 34 Difficulty of DSM wire design Productivity pressure Multicore
35
Traffic Abstractions. Traffic models are generally captured from actual traces of functional simulation; a statistical distribution is often assumed for messages. (Figure: traffic graph among PE1-PE12.)
36
Data Abstractions 36
37
Layers of Abstraction in Network Modeling. Software layers: application, OS. Network & transport layers: network topology (e.g., crossbar, ring, mesh, torus, fat tree, ...); switching (circuit/packet switching (SAF, VCT), wormhole); addressing (logical/physical, source/destination, flow, transaction); routing (static/dynamic, distributed/source, deadlock avoidance); quality of service (e.g., guaranteed-throughput, best-effort); congestion control, end-to-end flow control.
38
Layers of Abstraction in Network Modeling. Data link layer: flow control (handshake), handling of contention, correction of transmission errors. Physical layer: wires, drivers, receivers, repeaters, signaling, circuits, ...
39
How to Select an Architecture? The architecture choice depends on system needs. (Chart: flexibility vs. reconfiguration rate for ASIC, ASSP, FPGA, and CMP/multicore: reconfiguration at design time, at boot time, or during run time; from single-application to general-purpose or embedded systems.)
40
How to Select an Architecture? The architecture choice depends on system needs: a large range of solutions! (Same chart as the previous slide.)
41
Example: OASIS NoC. A. Ben Abdallah and M. Sowa, "Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization", TJASSST, Dec. 4-9, 2006.
42
Perspective 1: NoC vs. Bus.
NoC: aggregate bandwidth grows; link speed unaffected by N; concurrent spatial reuse; pipelining is built-in; distributed arbitration; separate abstraction layers. However: no performance guarantee; extra delay in routers; area and power overhead?; modules need an NI; unfamiliar methodology.
Bus: bandwidth is shared; speed goes down as N grows; no concurrency; pipelining is tough; central arbitration; no layers of abstraction (communication and computation are coupled). However: fairly simple and familiar.
43
Perspective 2: NoC vs. Off-Chip Networks.
Off-chip networks: cost is in the links; latency is tolerable; traffic/applications are unknown and change at runtime; adherence to networking standards.
NoC: sensitive to cost (area, power); wires are relatively cheap; latency is critical; traffic may be known a priori; design-time specialization; custom NoCs are possible.
44
VLSI CAD Problems Application mapping Floorplanning / placement Routing Buffer sizing Timing closure Simulation Testing 44
45
VLSI CAD Problems in NoC Application mapping (map tasks to cores) Floorplanning / placement (within the network) Routing (of messages) Buffer sizing (size of FIFO queues in the routers) Simulation (Network simulation, traffic/delay/power modeling) Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control,……) 45
46
Typical NoC Design Flow. Place modules; determine routing and adjust link capacities.
47
Timing Closure in NoC. Flow: define inter-module traffic, place modules, increase link capacities, then check whether QoS is satisfied; if not, iterate, otherwise finish. Too low a link capacity results in poor QoS; too high a capacity wastes area. Uniform link capacities are a waste in an ASIP system. A minimal sketch of this iterative loop follows.
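Below is a minimal, illustrative sketch of that capacity-tuning loop, not the actual CAD flow: the link loads, latency targets, step size, and the simple queueing-style delay model are all assumptions made up for the example.

```python
# Illustrative sketch: iteratively grow the capacity of overloaded links until
# every link meets its latency target (the "QoS satisfied?" check in the flow).

def link_delay(load, capacity):
    """Queueing-style delay estimate; diverges as load approaches capacity."""
    return float('inf') if load >= capacity else 1.0 / (capacity - load)

def close_timing(link_loads, latency_targets, step=1.0, max_iters=100):
    # link_loads: {link: offered load}, latency_targets: {link: max delay}
    capacities = {l: load + step for l, load in link_loads.items()}  # initial guess
    for _ in range(max_iters):
        violating = [l for l in link_loads
                     if link_delay(link_loads[l], capacities[l]) > latency_targets[l]]
        if not violating:          # QoS satisfied on every link -> finish
            return capacities
        for l in violating:        # otherwise grow only the violating links
            capacities[l] += step
    raise RuntimeError("did not converge; revisit placement or traffic spec")

print(close_timing({"A->B": 4.0, "B->C": 1.0}, {"A->B": 0.5, "B->C": 0.5}))
```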
48
NoC Design Requirements. High-performance interconnect: throughput, latency, power, area. Complex functionality (performance again): support for virtual channels, QoS, synchronization, reliability, high throughput, low latency.
49
Part 2: Routing Algorithms; Flow Control Schemes; Clocking Schemes; QoS; Putting It All Together
50
NoC Routing Algorithms. Responsible for correctly and efficiently routing packets or circuits from source to destination. They must prevent deadlock, livelock, and starvation. The choice of a routing algorithm also depends on minimizing the power required for routing and minimizing logic and routing tables.
51
Routing Algorithm Classifications. Three different criteria: where the routing decisions are taken (source routing vs. distributed routing); how a path is defined (static/deterministic vs. adaptive); and the path length (minimal vs. non-minimal).
52
NoC Routing Table. The routing table determines, for each PE, the route via which it will send packets to other PEs; it directly influences traffic in the NoC. Here we can also distinguish between 2 methods: static routing and dynamic (adaptive) routing.
53
Static Routing. The routing table is constant. The route is embedded in the packet header, and the routers simply forward the packet in the direction indicated by the header. The routers are passive in their addressing of packets (simple routers), as in the sketch below.
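Below is a minimal sketch of the static/source-routing behavior described above, assuming the route is carried in the header as a list of output-port directions; the field names and port labels are illustrative, not any particular NoC's format.

```python
# The full route is embedded in the packet header; each "passive" router simply
# pops the next direction and forwards the packet on that port.

from collections import namedtuple

Packet = namedtuple("Packet", ["route", "payload"])  # route: list of directions

def router_forward(packet):
    """Consume the head of the embedded route; return (output port, packet)."""
    next_port, remaining = packet.route[0], packet.route[1:]
    return next_port, Packet(remaining, packet.payload)

# The source NI computes the whole path once, e.g. from a per-PE routing table.
pkt = Packet(route=["E", "E", "N", "local"], payload="read 0x40")
while pkt.route:
    port, pkt = router_forward(pkt)
    print("forward to port", port)
```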
54
Dynamic Routing. The routing table can change dynamically during operation; typically, a route is changed when it becomes slow due to other traffic. This can cause out-of-order arrival of packets and usually requires more virtual channels.
55
Dynamic Routing. More resources are needed to monitor the state of the network. (Figure: adaptive routing scheme; a packet is routed around a blocked link.)
56
Routing Algorithm Requirements. The routing algorithm must ensure freedom from deadlock (e.g., the cyclic channel dependency shown in the figure), and it must also ensure freedom from livelock and starvation.
57
Flow Control Schemes - STALL/GO. A low-overhead scheme that requires only two control wires: one going forward and signaling data availability, the other going backward and signaling either a condition of buffers filled (STALL) or of buffers free (GO).
58
Flow Control Schemes - STALL/GO. Repeaters are two-entry FIFOs; no retransmission is allowed. (Figure: sender S to receiver R through repeater stages, each carrying FLIT, REQ, and STALL signals.) A minimal behavioral sketch of one repeater stage follows.
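Below is a minimal behavioral sketch (not a cycle-accurate or RTL model) of one STALL/GO repeater stage, assuming a two-entry FIFO that raises STALL upstream when full and forwards a flit whenever the downstream stage signals GO; all names are illustrative.

```python
from collections import deque

class StallGoStage:
    def __init__(self):
        self.fifo = deque()              # two-entry repeater buffer

    def stall(self):                     # backward wire: True = STALL, False = GO
        return len(self.fifo) >= 2

    def push(self, flit):                # upstream drives a flit only when not stalled
        assert not self.stall()
        self.fifo.append(flit)

    def cycle(self, downstream_go):      # forward one flit per cycle if allowed
        if self.fifo and downstream_go:
            return self.fifo.popleft()
        return None

stage = StallGoStage()
for flit in ["H", "B0", "B1", "T"]:
    while stage.stall():                 # sender waits while STALL is asserted
        print("drained:", stage.cycle(downstream_go=True))
    stage.push(flit)
while (f := stage.cycle(True)) is not None:
    print("drained:", f)
```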
59
Flow Control Schemes - T-Error. A more aggressive scheme that can detect faults by making use of a second, delayed clock at every buffer stage; the delayed clock re-samples the input data to detect any inconsistencies.
60
Flow Control Schemes - ACK/NACK. When flits are sent on a link, a local copy is kept in a buffer by the sender. When an ACK is received by the sender, it deletes the copy of the flit from its local buffer. When a NACK is received, the sender rewinds its output queue and starts resending flits, starting from the corrupted one.
61
Flow Control Schemes - ACK/NACK. Repeaters are pure registers; the buffering and retransmission logic is in the sender. (Figure: sender S to receiver R through repeater stages, each carrying FLIT, REQ, and ACK signals.)
62
Example: ACK/NACK Flow Control. (Figure: transmission with ACK and buffering, ACK/NACK propagation, memory deallocation on ACK, and go-back-N retransmission on NACK.) A minimal go-back-N sketch follows.
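Below is a minimal go-back-N sketch of the ACK/NACK behavior, assuming a small in-flight window and a randomly lossy link as a stand-in for a corrupted transfer; the window size, error rate, and function names are illustrative.

```python
import random

WINDOW = 3  # flits that may be in flight before the first ACK/NACK returns

def link_ok(flit, error_rate=0.25):
    """Return True (ACK) if the flit arrives intact, False (NACK) otherwise."""
    return random.random() > error_rate

def send_go_back_n(flits):
    base = 0                                # oldest unacknowledged flit
    delivered = []
    while base < len(flits):
        window = flits[base:base + WINDOW]  # transmit speculatively, keep copies
        for flit in window:
            if link_ok(flit):
                delivered.append(flit)      # ACK: sender may deallocate its copy
                base += 1
            else:
                print("NACK on", flit, "-> go back and resend from it")
                break                       # later flits in the window are discarded
    return delivered

random.seed(1)
print(send_go_back_n(["H", "B0", "B1", "B2", "T"]))
```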
63
Clocking Schemes. Fully synchronous: a single global clock is distributed to synchronize the entire chip; hard to achieve in practice, due to process variations and clock skew. Mesochronous: local clocks are derived from a global clock; not sensitive to clock skew. Plesiochronous: clock signals are produced locally. Asynchronous: clocks do not have to be present at all.
64
Clocking Schemes - Mesochronous. Local clocks are derived from a global clock; not sensitive to clock skew. (Figure: PE, NI, and switch connected through synchronizers (SYNC), with a clock management unit (CMU).)
65
Quality of Service (QoS). QoS refers to the level of commitment for packet delivery. Three basic categories. Best effort (BE): only correctness and completion of communication are guaranteed; usually packet switched. Guaranteed service (GS): makes a tangible guarantee on performance, in addition to the basic guarantees of correctness and completion; usually (virtual) circuit switched. Differentiated service: prioritizes communication according to different categories; NoC switches employ priority-based scheduling and allocation policies.
66
Putting it all Together 66
67
Packet Format 67 Message, Packet and Flit Formats
68
Size: Phit, Flit, Packet. There are no fixed rules for the sizes of phits, flits, and packets. Messages can be arbitrarily long; packets have a restricted maximum length. Typical values: phits, 1 bit to 64 bits; flits, 16 bits to 512 bits; packets, 128 bits to 1024 bits. A small worked example follows.
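As a worked example, assume one particular choice within the typical ranges above (the specific numbers are illustrative only):

```python
# Split one packet into flits and phits under assumed sizes.
packet_bits = 512     # one packet
flit_bits   = 64      # link-level flow-control unit
phit_bits   = 16      # physical link width (bits transferred per cycle)

flits_per_packet  = packet_bits // flit_bits   # 8 flits
phits_per_flit    = flit_bits // phit_bits     # 4 phits (cycles) per flit
cycles_per_packet = flits_per_packet * phits_per_flit

print(flits_per_packet, phits_per_flit, cycles_per_packet)  # 8 4 32
```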
69
NoC Queuing Schemes. (Figure: an input-queued 4x4 crossbar illustrating the head-of-line (HOL) blocking problem in input queuing.)
70
Flow Control Schemes - Stop & Go. (Figure: transmitter and receiver with a stop/go signal, go and stop thresholds, and buffer occupancy.) Minimum buffer size = flit size x (R_overhead + S_overhead + 2 x link delay), where R_overhead is the time required to issue the stop signal at the receiving router and S_overhead is the time required to stop sending flits as soon as the stop signal is received. A small worked example follows.
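A worked instance of the minimum-buffer formula, with assumed timing values expressed in cycles (the numbers are illustrative only):

```python
flit_size  = 64   # bits per flit
r_overhead = 1    # cycles for the receiver to issue STOP
s_overhead = 1    # cycles for the sender to actually stop after seeing STOP
link_delay = 2    # cycles of wire/pipeline delay, paid in both directions

min_buffer_bits = flit_size * (r_overhead + s_overhead + 2 * link_delay)
print(min_buffer_bits, "bits =", min_buffer_bits // flit_size, "flit slots")  # 384 bits = 6 flits
```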
71
Flow Control Schemes - Credit-Based (CB). A flit is transferred only when the sender holds a credit: the credit count is decremented when a flit is sent and incremented when the receiver releases a buffer slot. CB makes the best use of the channel buffers and can be implemented regardless of the link length and the sender/receiver overhead. (Figure: transmitter and receiver with a credit return path.) A minimal sketch follows.
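Below is a minimal behavioral sketch of credit-based flow control, assuming a two-slot downstream buffer; the buffer depth, flit names, and function names are illustrative.

```python
from collections import deque

BUFFER_SLOTS = 2
credits = BUFFER_SLOTS            # sender starts with one credit per slot
rx_buffer = deque()

def send(flit):
    global credits
    if credits == 0:
        return False              # no credit -> sender must wait
    credits -= 1                  # consume a credit when the flit leaves
    rx_buffer.append(flit)
    return True

def receiver_drain():
    global credits
    if rx_buffer:
        flit = rx_buffer.popleft()
        credits += 1              # credit returned when the slot is freed
        print("consumed", flit)

for flit in ["H", "B0", "B1", "T"]:
    while not send(flit):         # stall until a credit comes back
        receiver_drain()
receiver_drain(); receiver_drain()
```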
72
Queue and Buffer Design. The effective bandwidth of the data link layer is influenced by the traffic pattern and the queue size. Buffers consume most of the area and power among all NoC building blocks. (Figure: conventional shift-register queue, with intermediate empty bubbles between the incoming and outgoing packets.)
73
Network Interface. Different interfaces must be connected to the network. The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol.
74
Network Interface. (Figure: a core connected through the Open Core Protocol (OCP) to the network interface, which speaks the network protocol; packets consist of header, payload, and tail flits.) Transmit side: access routing tables, assemble packets, split into flits. Receive side: synchronize, drop the routing information. A minimal sketch follows.
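Below is an illustrative network-interface sketch; the flit field layout, routing-table contents, and function names are assumptions made up for the example, not the OCP or any specific NI format.

```python
def ni_transmit(dest, payload_words, route_table):
    route = route_table[dest]                         # "access routing tables"
    header = {"type": "H", "route": route}            # routing info rides in the header
    body = [{"type": "B", "data": w} for w in payload_words]
    tail = {"type": "T"}
    return [header] + body + [tail]                   # flits sent into the NoC

def ni_receive(flits):
    # drop the header's routing information, keep only the payload
    return [f["data"] for f in flits if f["type"] == "B"]

route_table = {"DRAM": ["E", "E", "N"]}               # hypothetical source routes
flits = ni_transmit("DRAM", [0xDEAD, 0xBEEF], route_table)
print(ni_receive(flits))                              # [57005, 48879]
```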
75
Bidirectional Link. 2 x 32-bit data links; asynchronous, credit-based flow control; easy floorplan routing and timing in a DSM process. (Figure: 36-bit link format with 4-bit tag and 32-bit data fields.)
76
Router Design 76
77
Router Design with Crosspoint. (Figure: input queues feeding a switch fabric, optional output queues, and an output-port scheduler exchanging request/grant signals; data paths are one phit wide, with a 4-bit grant.)
78
Scheduler Design. (Figure: round-robin arbiter circuits with two priority levels, built from a programmable priority encoder (PPE), a simple priority encoder, and a thermometer encoder, producing an n-bit grant and an anyGnt signal.) A minimal behavioral sketch follows.
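Below is a minimal behavioral sketch of a round-robin arbiter (a software model of the policy, not the priority-encoder circuit itself); the request patterns are illustrative.

```python
def round_robin_arbiter(requests, pointer):
    """requests: list of bools; returns (granted index or None, new pointer)."""
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            return idx, (idx + 1) % n   # pointer moves just past the winner
    return None, pointer                # no request -> anyGnt would be 0

ptr = 0
for req in ([True, False, True, True],
            [True, False, True, True],
            [False, False, False, True]):
    grant, ptr = round_robin_arbiter(req, ptr)
    print("grant:", grant)
# grants 0, then 2, then 3: each winner loses priority on the next cycle
```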
79
Phit Size Determination. The phit size is the bit width of a link and determines the switch area. Without serialization: phit size = packet size, operating frequency = f_NORM. With a SER/DES of ratio SERR: phit size = packet size / SERR, operating frequency = SERR x f_NORM. (Figure: PE and NI connected to switch buffers, with and without SER/DES.) A small worked example follows.
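A small worked example of the serialization trade-off, with assumed packet size, base frequency, and SERR values (illustrative only): narrowing the phit keeps the bandwidth constant but requires a proportionally faster link.

```python
packet_bits = 128
f_norm_mhz  = 500

for serr in (1, 2, 4):                 # SERR = 1 means no SER/DES
    phit_bits = packet_bits // serr
    link_freq = serr * f_norm_mhz
    bandwidth = phit_bits * link_freq  # Mbit/s, identical in every case
    print(f"SERR={serr}: phit={phit_bits} bits, f={link_freq} MHz, BW={bandwidth}")
```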
80
Part 3: Network-on-Chip Topology; Direct Topology; Indirect Topology; Irregular Topology
81
NoC Topology: the connection map between PEs. Adopted from large-scale networks and parallel computing. Topology classifications: direct topologies, indirect topologies.
82
Network Topology. A good topology makes it possible to fulfill the traffic requirements at reasonable cost. A network topology can be compared to a network of roads.
83
Direct Topology. Each switch (SW) is connected to a single PE, and each PE is connected to only a single SW. As the number of nodes in the system increases, the total bandwidth also increases.
84
Direct Topology - Mesh. The 2D mesh is the most popular: all links have the same length, which eases physical design, and the area grows linearly with the number of nodes. (Figure: 4x4 mesh.) A small hop-count example follows.
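A small illustrative calculation for a k x k 2D mesh, assuming (x, y) node coordinates: the minimal (XY) hop count between two nodes and the network diameter.

```python
def xy_hops(src, dst):
    """Minimal hop count between two mesh nodes (XY / dimension-ordered path)."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])

k = 4                                      # 4x4 mesh as in the figure
diameter = xy_hops((0, 0), (k - 1, k - 1)) # longest minimal path: corner to corner
print("hops (1,1)->(3,2):", xy_hops((1, 1), (3, 2)))  # 3
print("diameter of 4x4 mesh:", diameter)              # 6
```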
85
Direct Topology - Torus and Folded Torus. The torus is similar to a regular mesh but suffers an excessive-delay problem due to the long end-around connections. The folded torus overcomes this long-link limitation, and all of its links have the same length.
86
Direct Topology - Octagon. Messages sent between any 2 nodes require at most two hops. More octagons can be tiled together to accommodate larger designs.
87
Indirect Topology. A set of PEs is connected to each switch. Fat tree: nodes are connected only to the leaves of the tree; there are more links near the root, where the bandwidth requirements are higher.
88
Indirect Topology - k-ary n-fly Butterfly Network. A blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs. Example: a 2-ary 3-fly butterfly network.
89
Indirect Topology - (m, n, r) Symmetric Clos Network. A 3-stage network in which each stage is made up of a number of crossbar switches.
90
Indirect topology - Benes Network 90 Example: (2, 2, 4) re-arrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches.
91
Irregular Topology - Customized. Customized for an application; usually a mix of shared bus, direct, and indirect network topologies. Example 1: reduced mesh. Example 2: cluster-based hybrid topology.
92
Example 1: Partially Irregular 2D-Mesh Topology. Contains oversized, rectangular PEs.
93
Example 2: Irregular Mesh 93 This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC
94
How to Select a Topology? The application decides the topology type. If the number of PEs is a few tens, star or mesh topologies are recommended; if it is 100 or more, hierarchical star or mesh topologies are recommended. Some topologies are better for certain designs than others.