1 Analysis of Worst-case Delay Bounds for Best-effort Communication in Wormhole Networks on Chip Yue Qian 1, Zhonghai Lu 2, Wenhua Dou 1 1 School of Computer.

Presentation transcript:

1 Analysis of Worst-case Delay Bounds for Best-effort Communication in Wormhole Networks on Chip Yue Qian 1, Zhonghai Lu 2, Wenhua Dou 1 1 School of Computer Science, National University of Defense Technology, China 2 Dept. of Electronic, Computer and Software Systems, Royal Institute of Technology (KTH), Sweden The 3rd ACM/IEEE International Symposium on Networks-on-Chip May 10-13, 2009, San Diego, CA

2 Outline
- Introduction
- Resource Sharing in Wormhole Networks
- Analysis of Resource Sharing
- The Delay Bound Analysis Technique
- A Delay-Bound Analysis Example
- Experimental Results
- Conclusions

4 Introduction (1/4)
The provision of Quality-of-Service (QoS) has been a major concern for Networks-on-Chip (NoC).
- Routing packets in resource-sharing networks creates contention and thus leads to unpredictable performance.
A packet-switched network may provide best-effort (BE) and guaranteed services to satisfy different QoS requirements. Compared to guaranteed service, BE networks make good use of the shared network resources and achieve good average performance.

5 Introduction (2/4)
The worst-case performance is extremely hard to predict in BE networks.
- Network contention for shared resources (buffers and links) includes not only direct but also indirect contention;
- Identifying the worst case is nontrivial;
- The cyclic dependency between flit delivery and credit generation in wormhole networks with credit-based flow control further complicates the problem.
A simulation-based approach can offer the highest accuracy but can be very time-consuming. In contrast, a formal-analysis-based method is much more efficient.

6 Introduction (3/4)
In general queuing networks, network calculus provides the means to reason deterministically about timing properties of traffic flows.
- Based on the powerful abstractions of arrival curves for traffic flows and service curves for network elements (routers, servers), it allows computing the worst-case delay and backlog bounds.
- Systematic accounts of network calculus can be found in books [1][2].
Figure 1. Arrival curve and service curve (axes: time, data; labels: b, r, T, R, delay bound, backlog bound).
[1] C. Chang, "Performance Guarantees in Communication Networks," Springer-Verlag.
[2] J.-Y. Le Boudec and P. Thiran, "Network Calculus: A Theory of Deterministic Queuing Systems for the Internet," Springer-Verlag, vol. 2050.
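To make the two bounds concrete, here is a minimal sketch for the common special case of an affine arrival curve and a latency-rate service curve. It assumes that the figure labels b and r are the burst and rate of the arrival curve and that T and R are the latency and rate of the service curve; the class and function names are illustrative and do not come from the slides. Under these assumptions the standard network calculus results are a delay bound of T + b/R and a backlog bound of b + r·T, valid when r ≤ R.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineArrival:
    """Arrival curve alpha(t) = b + r*t (assumed reading of labels b, r)."""
    b: float  # burst, in flits
    r: float  # sustained rate, in flits/cycle


@dataclass(frozen=True)
class LatencyRateService:
    """Service curve beta(t) = R * max(0, t - T) (assumed reading of R, T)."""
    R: float  # service rate, in flits/cycle
    T: float  # latency, in cycles


def delay_bound(alpha: AffineArrival, beta: LatencyRateService) -> float:
    """Maximum horizontal distance between alpha and beta."""
    if alpha.r > beta.R:
        raise ValueError("unbounded: arrival rate exceeds service rate")
    return beta.T + alpha.b / beta.R


def backlog_bound(alpha: AffineArrival, beta: LatencyRateService) -> float:
    """Maximum vertical distance between alpha and beta."""
    if alpha.r > beta.R:
        raise ValueError("unbounded: arrival rate exceeds service rate")
    return alpha.b + alpha.r * beta.T


# Illustrative numbers only, not taken from the presentation.
flow = AffineArrival(b=4.0, r=0.25)
router = LatencyRateService(R=1.0, T=5.0)
print(delay_bound(flow, router))    # 9.0
print(backlog_bound(flow, router))  # 5.25
```

For general curve shapes the two distances have to be computed from the curves themselves; the closed forms above hold only for this affine and latency-rate pairing.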

7 Introduction (4/4)
In this paper, based on network calculus, we aim to derive the worst-case delay bounds for individual flows in on-chip networks.
- We first analyze the resource sharing in routers, and then build analysis models for different resource sharing components. Based on these models, we can derive the equivalent service curve a router provides to an individual flow.
- To consider the contention a flow may experience along its routing path, we classify and analyze flow interference patterns. Such interferences can be captured in a contention tree model. Based on this model, we can derive the equivalent service curve the tandem of routers provides to an individual flow.
- With a flow's arrival curve known and its equivalent service curve obtained, we can compute the flow's delay bound.

9 The Wormhole Network
Portion of a wormhole network with two nodes.
- A node contains a core and a router, which are connected via a network interface (NI);
- The router contains one crossbar switch, one buffer per inport, and a credit counter for flow control;
- At the link level, the routers perform credit-based flow control;
- There is a one-to-one correspondence between flits and credits: delivering one flit requires one credit, and forwarding one flit generates one credit.
Figure 2. Portion of a wormhole network
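The one-to-one correspondence between flits and credits can be illustrated with a small sketch. It assumes that the credit counter starts at the downstream buffer size and that credits return instantly, which ignores any credit propagation delay; the class `CreditLink` and its method names are hypothetical and serve only as an illustration.

```python
from collections import deque


class CreditLink:
    """Upstream credit counter plus the downstream input buffer it guards."""

    def __init__(self, buffer_size: int):
        self.credits = buffer_size      # one credit per free buffer slot
        self.downstream_buffer = deque()

    def try_send(self, flit) -> bool:
        """Delivering one flit requires (and consumes) one credit."""
        if self.credits == 0:
            return False                # upstream router must stall
        self.credits -= 1
        self.downstream_buffer.append(flit)
        return True

    def forward_one(self):
        """Forwarding one flit downstream frees a slot and generates one credit."""
        if not self.downstream_buffer:
            return None
        flit = self.downstream_buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_size=2)
print(link.try_send("a"), link.try_send("b"), link.try_send("c"))  # True True False
print(link.forward_one())                                          # a
print(link.try_send("c"))                                          # True
```

The stall on the third send is the feedback effect that makes the network closed-loop: the upstream sending rate depends on how fast the downstream router drains its buffer.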

10 Assumptions
A flow is an infinite stream of unicast traffic (packets) sent from a source node to a destination node.
- A flow is denoted as f_i; an aggregate flow is the composition of two flows f_i and f_j.
The network performs deterministic routing, which does not adapt the traffic path to the network congestion state but is cheap to implement in hardware.
- This means that the path of a flow is statically determined.
While serving multiple flows, the routers employ weighted round-robin scheduling to share the link bandwidth. The switches use the FIFO discipline to serve packets in buffers.

11 Three Types of Resource Sharing
Control sharing (flow control sharing)
- Routers share and use the status of buffers in the downstream routers to determine when packets are allowed to be forwarded.
Link sharing
- Multiple flows from different buffers share the same outport and thus the output link bandwidth.
Buffer sharing
- The flows of an aggregate flow, which is to be split, share a buffer.
Figure 3. (a) Link sharing; (b) Buffer sharing

13 Analysis of Credit-based Flow Control (1/2)
- Due to the cyclic dependency between flit delivery and credit generation, we cannot directly apply network calculus analysis techniques, because they are generally applicable to forward networks (networks without feedback control).
- We therefore virtualize the functionality of flow control as a network element, the flow controller, which provides service to traffic flows.
- This enables us to derive its service curve and transform the closed-loop network into an open-loop one.
We consider a traffic flow f passing through adjacent routers and construct an analytical model with the network elements depicted in Figure 4(a).
Figure 4. The flow control analytical model for flow f traversing adjacent routers.

14 Analysis of Credit-based Flow Control (2/2)
We give a theorem to derive the service curves of flow controller 1 and router 1. After obtaining these service curves, we can transform the closed-loop model into the forward model depicted in Figure 4(b), where the cyclic dependency caused by the feedback control is resolved (“eliminated”).
Figure 4. The flow control analytical model

15 Analysis of Link Sharing
Without loss of generality, we consider two flows f1 and f2 sharing one output link. The router they traverse is abstracted as the combination of a switch plus a flow controller, depicted in Figure 5(a), and guarantees a service curve to the two flows together. Since the router performs weighted round-robin scheduling, the flows are served according to their configured weights. The equivalent service curves both flows receive are illustrated in Figure 5(b).
Figure 5. (a) Two flows f1 and f2 share one output link; (b) The equivalent service curves guaranteed by the router.
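As an illustration of how weighted round-robin divides an output link between flows, the sketch below serves up to `weights[flow]` flits from each flow's queue per round. It is a simplified flit-level scheduler written for this transcript; the function name, the per-round semantics, and the example weights are assumptions, not the router design analyzed in the paper.

```python
from collections import deque


def weighted_round_robin(queues, weights, max_flits):
    """Return the service order as (flow, flit) pairs.

    queues:  dict mapping flow name -> deque of flits waiting for the link
    weights: dict mapping flow name -> flits the flow may send per round
    """
    order = []
    while len(order) < max_flits and any(queues.values()):
        for flow, queue in queues.items():
            for _ in range(weights[flow]):
                if not queue or len(order) >= max_flits:
                    break
                order.append((flow, queue.popleft()))
    return order


queues = {"f1": deque(range(4)), "f2": deque(range(4))}
weights = {"f1": 2, "f2": 1}  # hypothetical weights
print(weighted_round_robin(queues, weights, max_flits=6))
# [('f1', 0), ('f1', 1), ('f2', 0), ('f1', 2), ('f1', 3), ('f2', 1)]
```

With both queues backlogged, f1 receives two of every three link slots and f2 one of every three. This proportional share, together with the waiting time for the other flow's turn, is what an equivalent per-flow service curve has to capture.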

16 Analysis of Buffer Sharing
As drawn in Figure 6(a), an aggregate flow sharing the same input buffer is to be split to different outports. We first obtain the service curve the router provides to the aggregate flow. The equivalent service curve for an individual flow also depends on the arrival curves of its contention flows at the ingress of the buffer.
- For an individual flow, the equivalent service curve is derived by a function that computes it from the router's service curve and the arrival curves of the contention flows.
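The exact function used on this slide is not preserved in the transcript. As a hedged illustration, the sketch below uses a standard FIFO multiplexing result from the network calculus literature: if a FIFO node offers the aggregate a latency-rate service curve with rate R and latency T, and the contention flow has an affine arrival curve with burst b2 and rate r2 < R, then the tagged flow is guaranteed a latency-rate service curve with rate R - r2 and latency T + b2/R. This is one member of a family of valid leftover curves and may differ from the authors' function; all names here are illustrative.

```python
def fifo_leftover_service(R: float, T: float, b2: float, r2: float):
    """Equivalent latency-rate service curve for the tagged flow.

    Assumes FIFO service, aggregate service curve R * max(0, t - T),
    and a contention flow bounded by the affine curve b2 + r2 * t.
    Returns (rate, latency) of the leftover latency-rate curve.
    """
    if r2 >= R:
        raise ValueError("contention flow leaves no residual rate")
    theta = T + b2 / R          # time to clear the contention burst
    return R - r2, theta


# Illustrative numbers only.
rate, latency = fifo_leftover_service(R=1.0, T=5.0, b2=2.0, r2=0.25)
print(rate, latency)  # 0.75 7.0
```

The dependence on b2 and r2 shows why, for buffer sharing, the service an individual flow receives cannot be separated from the arrival curves of its contention flows.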

18 The Buffer-Sharing Analysis Model
A router serves flows while performing the three kinds of sharing concurrently. Combining the three models, we can obtain a simplified analysis model, which “eliminates” the feedback and link contention and keeps only the buffer sharing. This model is called the buffer-sharing analysis model/network.
- For buffer sharing, the equivalent service curve for each individual flow also depends on the arrival curves of its contention flows, and cannot be separated in general.
- This simplification can be viewed as a transformation procedure.
The transformation can be generalized as four steps:
- (1) Build an initial analysis model taking into account flow control, link sharing and buffer sharing;
- (2) Based on the model in step (1), “eliminate” (resolve) flow control;
- (3) Based on the model in step (2), “eliminate” link sharing;
- (4) Based on the model in step (3), derive a buffer-sharing analysis model.

19 Interference Patterns and Analytical Models
In a buffer-sharing analysis network, flow contention scenarios are diverse and complicated.
- We call the flow whose delay bound we derive the tagged flow, and the other flows sharing resources with it contention (or interfering) flows.
- A tagged flow directly contends with interfering flows. Also, interfering flows may contend with each other and then contend with the tagged flow again.
To decompose a complex contention scenario, we identify three basic contention (interference) patterns, namely Nested, Parallel and Crossed. We analyze the three scenarios and derive their analytical models, focusing on the derivation of the equivalent service curve the tandem provides to the tagged flow.
Figure 7. The three basic contention patterns for a tagged flow.

20 The General Analysis Procedure
Step 1: Construct a buffer-sharing analysis network that resolves the feedback control and link sharing contentions using the transformation steps.
Step 2: Given a tagged flow, construct its contention tree [3] to model the buffer sharing contentions produced by interfering flows in the buffer-sharing analysis network.
- Step 2.1: Let the tandem traversed by the tagged flow be the trunk;
- Step 2.2: Let the tandems traversed by the interfering flows before reaching a trunk node be the branches. A branch may also have its own sub-branches.
Step 3: Scan the contention tree and compute all the output arrival curves of flows traversing the branches, using the basic interference analytical models iteratively.
Step 4: Compute the equivalent service curve for the tagged flow and derive its delay bound.
[3] Y. Qian, Z. Lu, and W. Dou. Analysis of communication delay bounds for network on chips. In Proceedings of 14th Asia and South Pacific Design Automation Conference, Jan
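The scan in Step 3 has to resolve every branch before the node it feeds, because the output arrival curve of a branch is an input to the node where it joins. A post-order depth-first traversal gives such an order. The sketch below shows only the traversal; the `Node` class and the tree shape are hypothetical placeholders, not the contention tree of any figure in this presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tandem segment in a contention tree (hypothetical structure)."""
    name: str
    branches: list = field(default_factory=list)  # segments feeding this one


def scan_order(node: Node) -> list:
    """Post-order DFS: every branch is listed before the node it joins."""
    order = []
    for branch in node.branches:
        order.extend(scan_order(branch))
    order.append(node.name)
    return order


# Hypothetical tree: a trunk with two branches, one of which has a sub-branch.
tree = Node("trunk", [Node("branch A", [Node("sub-branch A1")]), Node("branch B")])
print(scan_order(tree))
# ['sub-branch A1', 'branch A', 'branch B', 'trunk']
```

In a full analysis, each visit would apply one of the basic interference models to turn the arrival curves at that segment into output arrival curves for the next one.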

22 An Example
Figure 8 shows a network with 16 nodes. There are 3 flows, f1, f2 and f3.
- f1 is from MIPS1 to RAM1, f2 from MIPS2 to RAM2 and f3 from MIPS3 to RAM3.
We derive the delay bound for f1. Thus f1 is the tagged flow and f2 and f3 are contention flows. In the following, we detail the analysis steps.
Figure 8. A 4×4 mesh NoC.

23 Step 1: Build a buffer-sharing analysis network
The initial closed-loop analysis network is shown in Figure 9(a). This network can be simplified into a forward buffer-sharing analysis network, as depicted in Figure 9(b).
Figure 9. (a) An initial closed-loop analysis network; (b) A buffer-sharing analysis network.

24 Step 2: Construct a contention tree
We build a contention tree for f1 as drawn in Figure 10. It shows how flows pass routers, and how they contend for shared buffers.
- At router R7, f1 and f2 share buffer B7;
- At router R15, f1 shares buffer B15 with f3;
- At router R10, two contention flows f2 and f3 share buffer B10.
Figure 10. Contention tree for tagged flow f1.

25 Step 3 & 4
Step 3: Compute output arrival curves of branch flows.
- To derive the equivalent service curve for trunk flow f1, we scan the contention tree using a depth-first search scheme.
Step 4: Compute the delay bound.
- After all arrival curves of flows injected into the trunk are obtained, we can compute the trunk service curve for f1.
- The delay bound for f1 is then the maximum horizontal distance between its arrival curve and the trunk service curve.
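The trunk service curve formula itself is not preserved in this transcript. As a hedged sketch of Step 4, assume every trunk router already offers the tagged flow an equivalent latency-rate service curve (rate, latency) and the tagged flow has an affine arrival curve with burst b and rate r. The standard concatenation property then gives a trunk curve with the minimum rate and the summed latency, and the maximum horizontal distance has a closed form. The function names and numbers are illustrative.

```python
def concatenate(servers):
    """Min-plus convolution of latency-rate curves given as (rate, latency)."""
    rate = min(R for R, _ in servers)
    latency = sum(T for _, T in servers)
    return rate, latency


def tandem_delay_bound(b: float, r: float, servers) -> float:
    """Horizontal distance between b + r*t and the concatenated trunk curve."""
    rate, latency = concatenate(servers)
    if r > rate:
        raise ValueError("unbounded: arrival rate exceeds trunk rate")
    return latency + b / rate


# Two hypothetical per-router equivalent service curves for the tagged flow.
trunk = [(1.0, 5.0), (0.75, 7.0)]
print(concatenate(trunk))                       # (0.75, 12.0)
print(round(tandem_delay_bound(4.0, 0.25, trunk), 3))
```

Computing the bound on the concatenated curve counts the burst term b/rate once for the whole trunk instead of once per router, which is why the per-router curves are combined before taking the horizontal distance.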

26 Closed-Form Formulas
Assuming affine arrival curves for flows and latency-rate service curves for routers, we can obtain closed-form formulas for the delay bound calculation.
- Each flow has an affine arrival curve;
- Each switch has a latency-rate service curve;
- Each router has the same buffer size;
- Each flow has an equal weight for link sharing.
The least upper delay bound for flow f1 is derived in closed form for two cases (Case 1 and Case 2), which are computed analogously.

28 Simulation Setup
We use a simulation platform in the open source simulation environment SoCLib [4], as shown in Figure 11, to collect application traces and to simulate their delays in on-chip networks.
- We run three embedded multimedia programs simultaneously on the platform: an MP3 audio decoder on MIPS1, a JPEG decoder on MIPS2 and an MPEG2 video decoder on MIPS3, generating three flows, f1, f2 and f3, respectively.
- We analyzed all three application traces and derived their affine arrival curves.
- Routers are uniform, with a per-link service rate C of 1 flit/cycle, a delay of 5 cycles to process head flits (T=5), and one cycle to switch a flit.
- The routers use a fair (equal) weight for each flow f_i (i=1,2,3) in the round-robin link scheduling.
- The buffer size varies from 3 to 6 flits.
- We also synthesize three traffic flows according to the affine arrival curves derived from the real traces and run them on the same experimental platform. We compare the simulated results of the real traces with those of their corresponding synthetic traffic flows.
Figure 11. The simulation platform.
[4] SoCLib simulation environment. On-line, available at
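The slides do not say how the affine arrival curves were derived from the traces. One common way, shown here purely as an assumption, is to fix a sustained rate r and compute the smallest burst b such that the number of flits injected in any window of n cycles is at most b + r·n. With per-cycle injection counts this is a maximum-subarray computation over the values (flits in cycle k) minus r. The function name and the trace are made up for illustration.

```python
def min_burst(flits_per_cycle, r: float) -> float:
    """Smallest b such that every window of n cycles carries <= b + r*n flits.

    Assumes a discrete-time trace where flits_per_cycle[k] is the number of
    flits injected in cycle k (an assumed trace format, not SoCLib's).
    """
    best = 0.0      # an empty trace needs no burst allowance
    current = 0.0   # largest excess of any window ending at the current cycle
    for flits in flits_per_cycle:
        current = max(flits - r, current + flits - r)
        best = max(best, current)
    return best


trace = [2, 0, 0, 2, 2, 0]          # made-up injection counts
print(min_burst(trace, r=0.5))      # 3.5
```

Choosing a larger r yields a smaller b and the other way round; which pair gives the tightest delay bound depends on the service curve the flow meets.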

29 Analysis and Simulation Results
We consider f1, f2 and f3 as the tagged flow in turn and derive their delay bounds using the proposed analytical approach. We can observe from Table 2:
- In all cases, calculated delay bound > simulated delay for synthetic traffic > simulated delay for real traffic.
- The calculated delay bounds are fairly tight.
- As the flow control buffer size increases, the delay bounds and corresponding maximum observed delays decrease until an optimal buffer size is reached. “B=5” is optimal in this example.

31 Conclusions
In this work, we present a network-calculus-based analysis method to compute the worst-case delay bounds for individual flows in best-effort wormhole networks with credit-based flow control. Our simulation results with both real on-chip multimedia traces and synthetic traffic validate the correctness and tightness of the analysis results. We conclude that our technique can be used to efficiently compute per-flow worst-case delay bounds, which are extremely difficult to cover even by exhaustive simulations. Our method is topology independent, and thus can be applied to various networks with a regular or irregular topology.

32 Future Work
We have considered wormhole networks where a router contains only one virtual channel per port. We shall extend our analysis to general wormhole networks where a router has multiple virtual channels per port.
- The analysis technique remains the same. However, we need to take into account the allocation of virtual channels in our analysis due to the existence of multiple virtual channels.
We will also extend our framework to consider other link sharing algorithms. Furthermore, we will automate the analysis procedure.

33 Any Questions? Thank you very much!