Enabling Class of Service for CIOQ Switches with Maximal Weighted Algorithms Thursday, October 08, 2015 Feng Wang Siu Hong Yuen
2 Contents 1. Motivation WFQ on OQ switches can provide service for different classes. Can we find maximal weight matching algorithms to provide service for different classes for CIOQ switches? 2. Bandwidth Metric 3. Simulation Environment 4. Algorithms used and their results 5. Intuition behind the result 6. Further work 7. Conclusion
3 Motivation We know that by using WFQ, we can provide service for different classes based on the priorities of the classes for OQ switches. However, OQ switches are impractical to implement because of the high memory bandwidth and fabric switch bandwidth required.
4 Motivation It has been shown that, with a speedup of 2 and a stable marriage algorithm, CIOQ switches can emulate OQ switches. Can we find maximal matching algorithms that let a CIOQ switch with a speedup of S provide the same per-class service as an OQ switch with WFQ?
5 Contents 1. Motivation 2. Metric used WFQ as an ideal algorithm Using bandwidth as a quantitative metric 3. Simulation Environment 4. Algorithms used and their results 5. Intuition behind the result 6. Further work 7. Conclusion
6 Metric used We used the WFQ algorithm implemented on OQ switches as the ideal algorithm to provide service for multiple classes. Thus, in order to measure the effectiveness of our algorithms, we need a quantitative metric to compare our algorithms against the WFQ algorithm.
7 Metric used The bandwidth metric measures whether the distribution of bandwidth that our algorithm produces is similar to that of WFQ. During a time period T, we observe the distribution of packets departing from the OQ (using WFQ) and the CIOQ (using our algorithm). Denote the number of class k packets departed from output port j of the OQ as $X_{jk}$, and the number of class k packets departed from output port j of the CIOQ as $Y_{jk}$.
8 Metric used For output port j: bandwidth used by class k for the OQ, $x_{jk} = X_{jk} / T$; bandwidth used by class k for the CIOQ, $y_{jk} = Y_{jk} / T$. Bandwidth metric we use: BDiff, which ranges from 0 to 1. The closer BDiff is to 0, the closer we are to emulating WFQ for OQ switches. T is chosen as the time taken for the WFQ algorithm to finish one round-robin cycle.
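The per-class bandwidths above can be computed directly from departure counts. The sketch below does that, and also shows one possible way to turn the two distributions into a number between 0 and 1. That distance (half the L1 distance between normalised per-class shares) is an assumption chosen only because it has the stated range; it is not taken from the slides and may differ from the BDiff actually used. The function names are illustrative.

```python
def bandwidth(counts, period):
    """Per-class bandwidth at one output port: packets departed divided by T."""
    return [c / period for c in counts]


def bdiff_assumed(oq_counts, cioq_counts):
    """ASSUMED distance, not the authors' formula.

    Half the L1 distance between the per-class shares of the OQ and the
    CIOQ switch. It is 0 when the shares match and 1 when they are disjoint.
    """
    total_x = sum(oq_counts)
    total_y = sum(cioq_counts)
    if total_x == 0 and total_y == 0:
        return 0.0  # nothing departed from either switch
    if total_x == 0 or total_y == 0:
        return 1.0  # one switch served traffic, the other served none
    return 0.5 * sum(
        abs(x / total_x - y / total_y) for x, y in zip(oq_counts, cioq_counts)
    )
```

For example, `bandwidth([5, 2, 2, 1], 10)` gives the shares of one output port over a 10-slot period, and `bdiff_assumed` compares the OQ counts with the CIOQ counts for the same port and period.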
9 Contents 1. Motivation 2. Metric used 3. Simulation Environment Simulator Switch configuration Traffic Sampling 4. Algorithms used and their results 5. Intuition behind the result 6. Further work 7. Conclusion
10 Simulation Environment Simulator: SIM v2.35. Switch: 8x8, 4 classes of service with weights 5:2:2:1. Traffic models: Bernoulli iid uniform; Bernoulli iid nonuniform (overloaded traffic); bursty uniform; bursty nonuniform (overloaded traffic). The same input traffic trace is used for the OQ and CIOQ switches. The distribution of packets at port 0 is sampled every 10 time slots.
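To see why a 10-slot sampling period fits these weights, one can sketch the service order of one cycle. This assumes (it is not stated in the slides) that WFQ with all four classes backlogged behaves like a weighted round-robin that serves each class as many packets per cycle as its weight. `WEIGHTS` and `wrr_cycle` are illustrative names.

```python
WEIGHTS = [5, 2, 2, 1]  # class weights from the switch configuration


def wrr_cycle(weights):
    """Return the class served in each slot of one weighted round-robin cycle.

    Assumes every class is backlogged and one packet departs per slot.
    """
    order = []
    for cls, weight in enumerate(weights):
        order.extend([cls] * weight)
    return order


cycle = wrr_cycle(WEIGHTS)
cycle_length = len(cycle)  # sum of the weights
```

Under this assumption the cycle length equals the sum of the weights, which matches the 10-slot sampling period.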
11 Contents 1. Motivation 2. Metric used 3. Simulation Environment 4. Algorithms used and their results algo0 to algo4 5. Intuition behind the result 6. Further work 7. Conclusion
12 Algorithms We came up with 5 maximal weight matching algorithms that attempt to provide service for multiple classes. They are based on request-grant-accept phases similar to iSLIP. Each $\mathrm{VOQ}_{ij}$ is split into P sub-queues, and each sub-queue stores the packets of one class.
13 algo0 algo0 is the most basic of the 5 algorithms, and the subsequent algorithms build on it. algo0 is a variation of PIM with support for different priorities. Request: For each output j that input i has a packet for, it requests that output with weight = 1. Grant: If output j receives any requests, it determines the request with the largest weight (all the same in this case). If multiple requests share the largest weight, ties are broken randomly. Accept: If input i receives any grants, it determines the grant with the largest weight (all the same in this case). If multiple grants share the largest weight, ties are broken randomly.
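The per-class split of each VOQ can be pictured with a small data structure. This is a minimal sketch; the class and function names are hypothetical, and packets are represented by arbitrary Python objects.

```python
from collections import deque


class ClassVOQ:
    """One virtual output queue, split into one FIFO sub-queue per class."""

    def __init__(self, num_classes):
        self.subqueues = [deque() for _ in range(num_classes)]

    def enqueue(self, cls, packet):
        self.subqueues[cls].append(packet)

    def dequeue(self, cls):
        # Remove and return the head-of-line packet of the given class.
        return self.subqueues[cls].popleft()

    def backlogged_classes(self):
        # Classes that currently hold at least one packet.
        return [k for k, q in enumerate(self.subqueues) if q]


def make_voqs(num_ports, num_classes):
    """voqs[i][j] holds the packets waiting at input i for output j."""
    return [
        [ClassVOQ(num_classes) for _ in range(num_ports)]
        for _ in range(num_ports)
    ]
```

For the 8x8 switch with 4 classes used in the simulations, `make_voqs(8, 4)` would create 64 VOQs with 4 sub-queues each.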
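One request-grant-accept round of algo0 can be sketched as follows. This is an illustration under assumptions: a single round only, `has_packet[i][j]` is a hypothetical boolean matrix saying whether input i holds any packet for output j, and the choice of which class to send once a pair is matched is left out.

```python
import random


def algo0_iteration(has_packet, rng):
    """Run one request-grant-accept round and return {input: output}."""
    n = len(has_packet)

    # Request: every input requests each output it has a packet for (weight 1).
    requests = {j: [i for i in range(n) if has_packet[i][j]] for j in range(n)}

    # Grant: all weights are equal, so each output picks a requester at random.
    grants = {}  # input -> list of outputs that granted it
    for j, inputs in requests.items():
        if inputs:
            chosen = rng.choice(inputs)
            grants.setdefault(chosen, []).append(j)

    # Accept: each input picks one of its grants at random.
    return {i: rng.choice(outputs) for i, outputs in grants.items()}
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when the same traffic trace is replayed.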
14 algo1 algo0 does not differentiate between requests, i.e. all requests are treated equally. algo1 improves on that by associating a weight with each request. For each $\mathrm{VOQ}_{ij}$, we calculate $W_{ij}^{k}$ = (weight of class k) × (amount of time the class k packet has waited at the HoL). Then we take the maximum $W_{ij}^{k}$ over all classes k for this VOQ and assign it as $W_{ij}$, the weight of the request from input i to output j.
15 algo1 The rest of the algorithm is the same as algo0. Request: For each output j that input i has a packet for, it requests that output with weight = $W_{ij}$. Grant: If output j receives any requests, it determines the request with the largest weight. If multiple requests share the largest weight, ties are broken randomly. Accept: If input i receives any grants, it determines the grant with the largest weight. If multiple grants share the largest weight, ties are broken randomly.
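The request weight can be computed per VOQ as in the sketch below. The representation is an assumption: `hol_wait[k]` is the time the head-of-line packet of class k has waited, or `None` when that class has no packet queued.

```python
def request_weight(class_weights, hol_wait):
    """Return the request weight for one VOQ: the max over classes of
    (class weight) * (HoL waiting time), or None if the VOQ is empty."""
    candidates = [
        weight * waited
        for weight, waited in zip(class_weights, hol_wait)
        if waited is not None
    ]
    return max(candidates) if candidates else None


# Classes weighted 5:2:2:1; class 1 has no packet queued.
example = request_weight([5, 2, 2, 1], [1, None, 4, 9])  # max(5, 8, 9)
```

In this example the lowest-weight class wins because its packet has waited longest, which shows how waiting time lets low-weight classes eventually be served.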
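The weighted grant and accept phases can be sketched as follows, assuming a hypothetical matrix `W` where `W[i][j]` is the request weight from input i to output j, or `None` when input i has nothing for output j. Only a single round is shown.

```python
import random


def pick_heaviest(weight_by_port, rng):
    """Return the port with the largest weight, breaking ties at random."""
    best = max(weight_by_port.values())
    tied = [port for port, w in weight_by_port.items() if w == best]
    return rng.choice(tied)


def algo1_iteration(W, rng):
    """Run one request-grant-accept round and return {input: output}."""
    n = len(W)
    grants = {}  # input -> {output: weight of the granted request}
    for j in range(n):
        # Requests received by output j, keyed by input.
        reqs = {i: W[i][j] for i in range(n) if W[i][j] is not None}
        if reqs:
            i = pick_heaviest(reqs, rng)
            grants.setdefault(i, {})[j] = W[i][j]
    # Each input accepts its heaviest grant.
    return {i: pick_heaviest(g, rng) for i, g in grants.items()}
```

With `W = [[9, None], [3, 4]]`, output 0 grants input 0 (weight 9 beats 3) and output 1 grants input 1, so both pairs are matched in one round.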
16 algo2 and algo3 For algo0 and algo1, ties during the grant and accept phases are broken randomly. This does not take into account which request was granted or accepted previously. algo2 and algo3 improve on algo0 and algo1 by remembering previous matches in a way similar to iSLIP. For each output, we keep a pointer to the last input whose grant was accepted, for every priority. For each input, we keep a pointer to the last accepted output, for every priority.
17 algo2 algo2 is algo0 with the pointer enhancement. Request: For each output j that input i has a packet for, it requests that output with weight = 1. Grant: If output j receives any requests, it determines the request with the largest weight (all the same in this case). Ties are broken first by priority. If multiple requests have the same priority, we do the following: the input last granted for this priority is least preferred, and we grant the input that is most preferred (in the round-robin sense). Accept: If input i receives any grants, it determines the grant with the largest weight (all the same in this case). If there are ties, we first randomly select a priority to accept. If there are multiple grants with this priority, we do the following: the output last accepted for this priority is least preferred, and we accept the output that is most preferred (in the round-robin sense).
18 algo3 algo3 is algo1 with the pointer enhancement. Request: For each output j that input i has a packet for, it requests that output with weight = $W_{ij}$. Grant: If output j receives any requests, it determines the request with the largest weight. Ties are broken first by priority. If multiple requests have the same priority, we do the following: the input last granted for this priority is least preferred, and we grant the input that is most preferred (in the round-robin sense). Accept: If input i receives any grants, it determines the grant with the largest weight. If there are ties, we first randomly select a priority to accept. If there are multiple grants with this priority, we do the following: the output last accepted for this priority is least preferred, and we accept the output that is most preferred (in the round-robin sense).
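The pointer-based tie-break shared by algo2 and algo3 can be sketched as a small helper. This is an illustration, not the simulator code: the class name is hypothetical, and the initial pointer value (chosen so that port 0 is preferred first) is an assumption.

```python
class PriorityPointers:
    """One round-robin pointer per priority, as kept by each port."""

    def __init__(self, num_ports, num_priorities):
        self.num_ports = num_ports
        # Assumption: start as if the last port was served, so port 0 is
        # the most preferred port at the beginning.
        self.last = [num_ports - 1] * num_priorities

    def pick(self, priority, candidates):
        """Pick the candidate closest after the last served port in cyclic
        order; the last served port itself is the least preferred."""
        last = self.last[priority]
        return min(candidates, key=lambda p: (p - last - 1) % self.num_ports)

    def record_accept(self, priority, port):
        # Called only when the match was accepted, in the spirit of iSLIP.
        self.last[priority] = port
```

Each port would hold one such object and consult it only among the candidates that are tied on weight and priority.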
19 algo4 algo2 and algo3 rotate the pointer for the preferred input port to grant and the preferred output port to accept. Instead of having a pointer that rotates regularly, algo4 rotates the preference according to the weight of each class. It attempts to rotate the pointer in a way similar to WFQ, where the pointer stays at a particular preference for as long as the schedule determined by WFQ dictates.
20 algo4 Request: For each output j that input i has a packet for, it requests that output with a bitmap showing which priorities have a packet. Grant: Output j maintains a preferred priority, which is updated in a way similar to WFQ for each accepted grant. Assume the preferred priority for output j is k. Output j checks which of the received requests have a packet with priority k. If multiple inputs have priority k packets, ties are broken randomly. If no input has priority k packets, output j updates its preferred priority to the next one. Accept: Input i also maintains a preferred priority, which is updated in a way similar to WFQ for each accepted grant. Assume the preferred priority for input i is k. If input i receives any grants, it finds the grants with priority k. If multiple grants have priority k, ties are broken randomly. If no grant has priority k, the preferred priority is updated to the next one.
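The grant phase of algo4 can be sketched as below. Several details are assumptions rather than facts from the slides: the "WFQ-like" update is modelled as a credit counter that stays on a priority for as many accepted grants as the class weight, and an output that finds no request for its preferred priority retries with the next priority in the same time slot. All names are hypothetical.

```python
import random


class PreferredPriority:
    """Preferred priority of one port, rotated according to class weights."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.current = 0
        self.credit = self.weights[0]

    def advance(self):
        # Move to the next priority and reload its credit.
        self.current = (self.current + 1) % len(self.weights)
        self.credit = self.weights[self.current]

    def on_accepted(self):
        # One accepted grant consumes one unit of the current credit.
        self.credit -= 1
        if self.credit == 0:
            self.advance()


def algo4_grant(request_bitmaps, pref, rng):
    """request_bitmaps maps each requesting input to the set of priorities
    it has packets for. Returns (input, priority) or None."""
    for _ in range(len(pref.weights)):
        tied = [i for i, prios in request_bitmaps.items() if pref.current in prios]
        if tied:
            return rng.choice(tied), pref.current
        pref.advance()  # nobody has this priority: try the next one
    return None
```

The accept phase would mirror this logic on the input side, with the grants received by an input taking the place of the requests.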
21 Result: Bernoulli iid uniform
22 Result: Bernoulli iid nonuniform
23 Result: Bursty uniform
24 Result: Bursty nonuniform
25 Results: In most cases, algo1 is better than algo0, and algo3 is better than algo2. algo3 is not always better than algo1. algo4 is not always better than algo1 and algo3. As speedup increases, the results of the different algorithms converge. For speedup > 4, BDiff = 0.
26 Contents 1. Motivation 2. Bandwidth Metric 3. Simulation Environment 4. Algorithms used and their results 5. Intuition behind the result Weight information is helpful Size of matching is not helpful WFQ on both input and output side is not helpful Speedup for BDiff = 0 6. Further work 7. Conclusion
27 Intuition behind the result Adding weight information to the algorithms helps the scheduler make better decisions when serving different classes. Compared with algo0 and algo1, algo2 and algo3 increase the size of the matching because they desynchronize the grants to different ports. However, we observed that algo2 and algo3 did not improve the BDiff metric, so a larger matching does not help in serving different classes. Implementing WFQ on both the input and output ports to select grants and accepts does not lead to better decisions. Intuition: WFQ on the output side may help make better decisions, but we should perhaps use other criteria to break ties on the input side.
28 Intuition behind the result BDiff = 0 for speedup > 4, and 4 is the number of classes in our test, so perhaps BDiff = 0 whenever speedup > number of classes. However, in a couple of tests with 5 classes, BDiff = 0 still held for speedup > 4.
29 Contents 1. Motivation 2. Bandwidth Metric 3. Simulation Environment 4. Algorithms used and their results 5. Intuition behind the result 6. Further work Latency metric Existence of a constant Speedup S for BDiff = 0? 7. Conclusion
30 Further work Besides the bandwidth allocated to different classes of service, latency is another metric of how good an algorithm is. Define a latency metric as how close the per-class packet latency is to that of the OQ switch, and measure it for the different algorithms. Investigate further whether there exists a constant speedup S at which a CIOQ switch can emulate OQ WFQ in terms of the service rate for different classes. More theoretical analysis is needed.
31 Conclusion We defined a metric to evaluate the capability of algorithms to provide class of service, and measured it for different algorithms. The results suggest that weight information in selecting grants and accepts is helpful at smaller speedups. As speedup increases, the difference between the algorithms becomes negligible, so there is a trade-off between algorithm simplicity and speedup. Among all the algorithms we tried, algo1 is good enough to provide a good service rate for different classes. algo3 and algo4 do not improve on algo1. It is possible to find a maximal matching algorithm that, with a certain speedup, lets a CIOQ switch emulate OQ WFQ in terms of the service rate for different classes.