MOZART: Temporal Coordination of Measurement (SOSR '16)
Xuemei Liu, Meral Shirazipour, Minlan Yu, Ying Zhang

Thanks for the introduction. Today I am going to present our work … This is a collaboration among the University of Southern California, Ericsson, and HP Labs.
Measurement in data centers
  Motivating examples of measurement
    Fault diagnosis: capture root causes of failures.
    Traffic engineering: capture statistics for big flows.
    Attack detection: capture signatures of attacks.
  Essence of measurement
    Capture data related to events.

Measurement is important in data centers. Here are some examples. The first purpose is fault diagnosis: if the event of … happens, we need to collect enough logs to diagnose the root cause. The second purpose is traffic engineering: if traffic becomes unevenly distributed in the data center, we need to collect the flow volume on all paths to analyze which flows are not distributed evenly. The third purpose is attack detection: if an attack happens in the network, we need to find out which servers are compromised and are attacking other servers. The essence of measurement is, in fact, to capture data related to events.
Different views/abilities of devices
  Hosts
    View: per source/destination traffic
    Abilities: end-to-end loss, latency, etc.
  Switches
    View: per-link traffic
    Abilities: per-link volume, latency, etc.

We observe that different …. The hosts … the switches …
No-coordination of measurement
  Too much reporting overhead.
  Limited resources may be consumed by flows not related to the event.
  We propose temporal coordination of measurement.
[Figure label: Controller]

Due to the different views…, the controller needs to collect all flow data from all devices, aggregate the data, analyze what events are happening in the network, and find the root cause of those events. This is the traditional way of doing measurement in data centers. It requires no coordination between devices, but measuring without coordination has two big problems. The first problem is that, …, as devices need to … The second problem is that …, as the devices need to report all … To solve these two problems, we propose …. I will go through some examples that compare no-coordination measurement with temporal coordination measurement, and show how temporal coordination solves these two problems …
Example 1: loss detection (no coordination)
  Packet loss affects performance. Operators want to locate the loss.
  Sender: measure & report loss of all flows.
  Switches: measure & report flow volume of all flows.
[Figure: traffic flow through switches S0, S1, S2]

The first example is single-path loss. Packet loss affects … Operators … [only talk about loss in switches] In this example, traffic goes through one path, and suppose high loss is happening along this path. Today, without coordination, the sender needs to … switches … This is really bad: the memory in switches is small, yet each switch needs to measure the volume of all flows passing through it.
Example 1: loss detection (coordination)
  Packet loss affects performance. Operators want to locate the loss.
  Sender: detect high loss for some flows.
  Switches: measure & report flow volume of only lossy flows.
  The sender needs to coordinate the lossy flows with the switches.
[Figure: traffic flow and selected flows through switches S0, S1, S2]

However, if the sender can detect which flows are suffering high loss, and send … to the switches. The switches …. Temporal coordination happens between the sender and the switches, and the sender … with switches. [not much benefit; leave for later]
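
To make the division of labor in Example 1 concrete, here is a minimal sketch. It is an illustration under assumptions, not the talk's implementation: the function names, the 5% loss threshold, and the idea of locating loss by comparing per-switch packet counts of a selected flow are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def select_lossy_flows(sent: Dict[str, int], acked: Dict[str, int],
                       threshold: float = 0.05) -> Set[str]:
    """Sender side (selector): pick flows whose end-to-end loss rate
    exceeds `threshold` (an assumed value)."""
    selected = set()
    for flow, n_sent in sent.items():
        if n_sent == 0:
            continue
        loss_rate = 1.0 - acked.get(flow, 0) / n_sent
        if loss_rate > threshold:
            selected.add(flow)
    return selected


def locate_loss(path: List[str],
                counts: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Given per-switch packet counts for one selected flow, return the
    first pair of adjacent switches between which the count drops."""
    for upstream, downstream in zip(path, path[1:]):
        if counts[downstream] < counts[upstream]:
            return (upstream, downstream)
    return None


# Only f1 is lossy, so switches only need to keep counters for f1.
selected = select_lossy_flows({"f1": 1000, "f2": 1000},
                              {"f1": 800, "f2": 995})
lossy_hop = locate_loss(["S0", "S1", "S2"],
                        {"S0": 1000, "S1": 1000, "S2": 800})
```

The point of the sketch is the saving in switch memory: the switches track one flow instead of every flow on the path.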
Example 2: port scan (no coordination)
  Compromised servers scan for vulnerable servers.
  Ingress switches: count & report number of destinations for all senders.
[Figure: a compromised server sends traffic through switches S0 and S1 to ports 123, 456, and 789]

The second example is … Suppose there is a compromised …, and it is trying to attack other servers by accessing random ports. To detect the compromised server without coordination …, the ingress switches need to count …., and report it to the controller. The controller can then decide which servers are compromised.
Example 2: port scan (coordination)
  Compromised servers scan for vulnerable servers.
  Egress switch: detect senders with unwanted traffic sent to secure ports.
  Ingress switch: count & report number of destinations for detected senders.
  The egress switch coordinates candidate compromised senders with the ingress switch.
[Figure: a compromised server sends traffic through switches S0 and S1 to ports 123, 456, and 789; one destination is an HTTP server (port 80); traffic flow and selected flows are marked]

However, with coordination, some interesting things happen. For example, suppose the rightmost server is an HTTP…. The egress switches … then it can tell the ingress …, then the ingress switches just need to … Temporal coordination happens between the egress switch and ingress …, and egress switch …
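
The egress/ingress split in Example 2 can be sketched as follows. This is a hypothetical illustration: the class names, the `open_ports` map, and the direct `select` call between switches (standing in for whatever notification mechanism is used) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class EgressSelector:
    """Egress switch: flags senders that hit ports the destination
    does not serve (e.g. anything but 80 on an HTTP server)."""

    def __init__(self, open_ports: Dict[str, Set[int]]):
        self.open_ports = open_ports  # destination -> allowed ports

    def observe(self, src: str, dst: str, port: int) -> Optional[str]:
        if port not in self.open_ports.get(dst, set()):
            return src  # candidate compromised sender
        return None


class IngressMonitor:
    """Ingress switch: counts distinct destinations, but only for
    senders that a selector has flagged."""

    def __init__(self) -> None:
        self.selected: Set[str] = set()
        self.dests: Dict[str, Set[str]] = defaultdict(set)

    def select(self, src: str) -> None:
        self.selected.add(src)

    def observe(self, src: str, dst: str) -> None:
        if src in self.selected:
            self.dests[src].add(dst)

    def report(self) -> Dict[str, int]:
        return {src: len(d) for src, d in self.dests.items()}


egress = EgressSelector({"web": {80}})
ingress = IngressMonitor()
for src, dst, port in [("h1", "web", 80), ("h2", "web", 123)]:
    suspect = egress.observe(src, dst, port)
    if suspect is not None:
        ingress.select(suspect)
for src, dst in [("h2", "a"), ("h2", "b"), ("h2", "c"), ("h1", "a")]:
    ingress.observe(src, dst)
scan_report = ingress.report()
```

Note that destinations contacted before a sender is flagged are not counted in this sketch, which is the gap the two-mode design later in the talk is meant to narrow.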
Example 3: ECMP flow (no coordination)
  Facebook reported congestion caused by unbalanced ECMP traffic distribution.
  Switches: measure & report volume of all flows.
[Figure: traffic flow across switches S0, S1, S2]

The third example is ECMP flow. In 2014, Facebook reported congestion… Even reconfiguring the hashing algorithm that selects ECMP paths could not solve the problem, and the problem kept happening for more than two years. Here we discuss a simplified version of the problem, … suppose the traffic of some large flows is not distributed evenly among the 3 paths. In no-coordination measurement, …
Example 3: ECMP flow (coordination)
  Facebook reported congestion caused by unbalanced ECMP traffic distribution.
  Detect elephant flows.
  Measure & report volume of elephant flows.
  Switches coordinate elephant flows with each other.
[Figure: traffic flow and selected flows across switches S0, S1, S2]
MOZART: MOnitor flowZ At the Right Time

In order to support temporal coordination, we propose MOZART, which stands for MOnitor flowZ At the Right Time.
MOZART framework
  MOZART controller: configure.
  Selectors: detect events.
  Monitors: capture data related to events; report data of selected flows.
[Figure: a MOZART controller above several selector and monitor modules, with selected flows passed from selectors to monitors]

[comment] 1. Selector, monitor, and their roles; 1.1. the controller configures and collects data. 2.1. coordination algorithms; 2.2. coordination between selectors and monitors; 2.3. placement algorithms.
MOZART design challenges
  Coordination measurement
  Placement of MOZART tasks

Due to time limitations, … We have coordination measurement for one task, and a placement algorithm for many tasks.
MOZART design challenges Coordination measurement Placement of tasks
Strawman coordination
[Figure: timeline. f1 in selector: … f1 satisfies the event, f1 is selected. f1 in monitor: … Legend: normal packet.]

The monitor counts the number of packets.
[comment] What to measure? What is the metric? What is missing? How much is missing? Be more concrete.
Strawman coordination
  Traffic before selection is not captured.
[Figure: timeline. f1 in selector: … f1 satisfies the event, f1 is selected. f1 in monitor: … … Legend: normal packet, captured packet.]

Also, by the time a flow is detected, its traffic may already have passed the monitor.
Two-mode coordination
  Normal mode: sampling.
  Event mode: starts once f1 is selected.
  Traffic before selection has a chance to be captured.
[Figure: timeline split into normal mode and event mode. f1 in selector: … f1 satisfies the event, f1 is selected. f1 in monitor: … … Legend: normal packet, captured packet, sampled packet.]

[end] talk about there are several
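
The difference between the strawman and the two-mode design can be sketched with a toy monitor. This is a minimal sketch under assumptions: the class name, the per-packet sampling probability, and counting packets (rather than bytes) are illustrative choices, not details confirmed by the slides.

```python
import random
from typing import Dict, Set


class TwoModeMonitor:
    """Counts packets per flow. Before a flow is selected (normal mode)
    packets are only sampled; after selection (event mode) every packet
    of that flow is captured."""

    def __init__(self, sample_prob: float, seed: int = 0):
        self.sample_prob = sample_prob  # assumed per-packet probability
        self.rng = random.Random(seed)
        self.selected: Set[str] = set()
        self.counts: Dict[str, int] = {}

    def select(self, flow: str) -> None:
        """Called when a selector reports that `flow` satisfies the event."""
        self.selected.add(flow)

    def on_packet(self, flow: str) -> None:
        if flow in self.selected:
            # Event mode: capture every packet of a selected flow.
            self.counts[flow] = self.counts.get(flow, 0) + 1
        elif self.rng.random() < self.sample_prob:
            # Normal mode: keep a sampled fraction of the packets.
            self.counts[flow] = self.counts.get(flow, 0) + 1


def replay(monitor: TwoModeMonitor, before: int, after: int) -> int:
    """Send `before` packets of f1, select f1, then send `after` more."""
    for _ in range(before):
        monitor.on_packet("f1")
    monitor.select("f1")
    for _ in range(after):
        monitor.on_packet("f1")
    return monitor.counts.get("f1", 0)


strawman_count = replay(TwoModeMonitor(sample_prob=0.0), before=10, after=5)
sampled_count = replay(TwoModeMonitor(sample_prob=1.0), before=10, after=5)
```

With `sample_prob=0.0` the monitor behaves like the strawman and misses everything sent before selection; any positive probability gives that earlier traffic a chance to be captured.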
Memory management in monitors
  Selected flows and non-selected flows coexist in the hash table.
  Devices have limited memory.
  Collisions may happen in the hash table.
[Figure: selected flow f7 arriving at the table]

  Flow ID:          f1     f2    f3
  Selected flow?:   1      1
  Flow statistics:  10240  2048  500

[comment] The two modes give us a good benefit, so how do we support them in monitors? Because devices have limited memory for monitoring, the hash table used to store flow statistics is limited in size. When two flows try to occupy the same entry, a collision happens. Thus, we design a memory management mechanism that makes better use of the limited memory.
Memory management in monitors
[Figure: selected flow f7 has taken the entry previously held by f3]

  Flow ID:          f1     f2    f7
  Selected flow?:   1      1     1
  Flow statistics:  10240  2048  1024
Memory management in monitors
  More memory is allocated to selected flows.
[Figure: non-selected flows f5 and f6 and selected flow f7 arriving at the table]

  Flow ID:          f1     f2    f7
  Selected flow?:   1      1     1
  Flow statistics:  10240  2048  1024

[comment] Walk through the table and its rows: f7 => f3 first, then f5 => f1, then f6 => f2.
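
The table walkthrough above (f7 takes f3's entry, while f5 and f6 fail to displace f1 and f2) is consistent with a collision rule that favors selected flows. The sketch below is one such rule and is an assumption, not the mechanism from the talk; the class name, the single-slot-per-index layout, and the `index_of` parameter are hypothetical.

```python
import zlib
from typing import Callable, List, Optional


class MonitorTable:
    """Fixed-size hash table of flow statistics with one entry per slot.
    Assumed collision rule: a selected flow may evict a non-selected
    one; every other collision drops the newcomer."""

    def __init__(self, size: int,
                 index_of: Optional[Callable[[str], int]] = None):
        self.size = size
        # Each slot is None or [flow_id, selected, statistic].
        self.slots: List[Optional[list]] = [None] * size
        self.index_of = index_of or (
            lambda flow: zlib.crc32(flow.encode()) % size)

    def update(self, flow: str, value: int, selected: bool) -> bool:
        """Add `value` to the flow's statistic. Returns False when the
        flow loses a collision and is not stored."""
        i = self.index_of(flow)
        slot = self.slots[i]
        if slot is None:
            self.slots[i] = [flow, selected, value]
            return True
        if slot[0] == flow:
            slot[1] = slot[1] or selected
            slot[2] += value
            return True
        if selected and not slot[1]:
            self.slots[i] = [flow, selected, value]  # evict non-selected
            return True
        return False


# Hypothetical slot assignment that reproduces the collisions on the slide.
slot_map = {"f1": 0, "f5": 0, "f2": 1, "f6": 1, "f3": 2, "f7": 2}
table = MonitorTable(3, index_of=slot_map.__getitem__)
table.update("f1", 10240, selected=True)
table.update("f2", 2048, selected=True)
table.update("f3", 500, selected=False)
f7_stored = table.update("f7", 1024, selected=True)   # evicts f3
f5_stored = table.update("f5", 300, selected=False)   # loses to f1
f6_stored = table.update("f6", 300, selected=False)   # loses to f2
```

After this replay the table holds f1, f2, and f7 with statistics 10240, 2048, and 1024, matching the final state shown on the slide.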
MOZART design challenges
  Coordination measurement
  Placement of MOZART tasks

So far we have introduced how to support one measurement task in MOZART. But in reality there are many tasks to run, and we need to manage all of these tasks in the network.
Placement of MOZART tasks
  Many candidate MOZART tasks to run
    Operators want to detect many events.
  Device resource constraints
    Switches: limited memory; hosts: limited CPU.
    Measurement can use only leftover resources.
  Latency constraint within one MOZART task
    Timely communication is critical.
    Latency between selectors and monitors should be small.

[this slide] Say tens of Mbs at most in switches. There are many … However, there are resource constraints in devices … Also, we notice that within one MOZART task, there is a latency constraint between selectors and monitors. [explain] Mea
Placement of MOZART tasks
  Strawman algorithm
    Maximize Allocated Modules (MAM).
  Challenges
    One task: selectors and monitors should all be placed.
    Multiple tasks: joint placement to maximize the number of running tasks.
  MOZART: binary integer linear programming
    Objective: maximize the number of tasks to run.
    Subject to resource and latency constraints.

[comment] A strawman placement algorithm is …. There are two constraints. The first is devices …. The second is that we need timely communication between selectors and monitors, so the latency between them should be small. In MOZART, we designed a binary integer linear programming solution. The objective is …., and we also meet the resource constraints in …, and the time constraints in each task.
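
The objective above (run as many tasks as possible, subject to device resources and a latency bound) can be illustrated with a toy search. This is a simplified sketch, not the formulation from the talk: it assumes each task's modules are already pinned to devices, uses one binary choice per task, and replaces an integer programming solver with exhaustive search, which is only feasible for a handful of tasks. All names and numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# task name -> (list of (device, memory needed), selector-monitor latency in ms)
Task = Tuple[List[Tuple[str, int]], float]


def place_tasks(tasks: Dict[str, Task], capacity: Dict[str, int],
                max_latency: float) -> List[str]:
    """Return a largest set of tasks whose modules all fit in device
    memory and whose latency is within `max_latency`."""
    eligible = [name for name, (_, latency) in tasks.items()
                if latency <= max_latency]
    # Try the largest subsets first so the first feasible one is optimal.
    for k in range(len(eligible), 0, -1):
        for subset in combinations(eligible, k):
            used: Dict[str, int] = {}
            for name in subset:
                for device, memory in tasks[name][0]:
                    used[device] = used.get(device, 0) + memory
            if all(used[d] <= capacity.get(d, 0) for d in used):
                return list(subset)
    return []


capacity = {"s0": 10, "s1": 10}
tasks: Dict[str, Task] = {
    "A": ([("s0", 6), ("s1", 6)], 50.0),
    "B": ([("s0", 6)], 20.0),
    "C": ([("s1", 6)], 30.0),
    "D": ([("s0", 1)], 400.0),
}
bounded = place_tasks(tasks, capacity, max_latency=250.0)
unbounded = place_tasks(tasks, capacity, max_latency=float("inf"))
```

In this toy instance, admitting task A first would block both B and C, which is why the tasks have to be considered jointly rather than one at a time.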
Evaluation setup
  Topology & traffic
    B4 topology (12 switches, 12 hosts).
    Implemented in Mininet. Switches run Open vSwitch.
    2-hour CAIDA trace.
  Compared algorithms
    No coordination: only Sample and Hold (SH) in monitors.
    Coordination: selectors send selected flows; SH in monitors.

[go to content directly] Now I will talk about our evaluation. First, let's discuss the evaluation setup. We set up the B4 topology in Mininet, which contains 12 switches and 12 data centers. The switches run Open vSwitch, and we add our coordination and measurement features in the user mode of Open vSwitch. IGNORE: A shortest-path routing algorithm is used to forward packets in switches, and ECMP is applied if there are multiple equal-cost paths. Another point to mention is that we use 2 hours … . IGNORE: Multiple tasks … For the sampling technique in normal mode, we use SH. SH is an efficient algorithm for capturing large flows … We compare with a no-coordination algorithm, which just runs SH in monitors.
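
For readers unfamiliar with Sample and Hold, the following is a minimal sketch of the general technique: a flow without a table entry is sampled with a probability that grows with packet size, and once an entry exists every later packet of that flow is counted. The function name, the per-byte probability parameter, and the unbounded dictionary are simplifications; this is not the code used in the evaluation.

```python
import random
from typing import Dict, Iterable, Tuple


def sample_and_hold(packets: Iterable[Tuple[str, int]], byte_prob: float,
                    seed: int = 0) -> Dict[str, int]:
    """packets: (flow_id, size_in_bytes) pairs in arrival order.
    byte_prob: probability of sampling each byte of an untracked flow."""
    rng = random.Random(seed)
    table: Dict[str, int] = {}
    for flow, size in packets:
        if flow in table:
            # Hold: the flow already has an entry, so count every byte.
            table[flow] += size
        else:
            # Sample: a packet of `size` bytes is picked if any byte is.
            packet_prob = 1.0 - (1.0 - byte_prob) ** size
            if rng.random() < packet_prob:
                table[flow] = size
    return table


trace = [("big", 1500)] * 100 + [("small", 40)]
exact = sample_and_hold(trace, byte_prob=1.0)
empty = sample_and_hold(trace, byte_prob=0.0)
```

Large flows send many bytes, so they are very likely to be picked up early and then counted almost completely, while small flows mostly never occupy an entry.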
Example: loss detection
  Selector: detects high loss for some flows.
  Monitors: measure flow volume of lossy flows.
[Figure: traffic flow through switches S0, S1, S2, each with a monitor; selected flows are sent from the selector]
MOZART achieves high accuracy
[Figure: ratio of selected flows not captured (y-axis) vs. memory size in each monitor for measurement (x-axis); annotated values: 15% and 1.3%]

[comments] Explain the x axis and y axis, and show the numbers. Show the example first, and say the monitor is in the switches. [add 2M bytes points.] Emphasize the reduction from 15% to 1.3%. [end] We also ran other examples in the testbed, and the results for MOZART are similar.
MOZART supports more tasks

  Algorithm                        Tasks assigned (%)   Avg. latency (ms)
  Maximize Allocated Modules       77%                  94
  MOZART (latency <= infinite)     100%                 110
  MOZART (latency <= 250 ms)       98%                  64

[comment:] wre
[comment]: We fix the memory size in devices, and try to allocate ? tasks in the network. Talk about the setup if there is enough time. Compare MOZART with MAM, put MAM first, saying more tasks, but larger latency. If we add more
Conclusion
  Temporal coordination is important
    Collect data related to events.
    Different views/abilities of devices.
  MOZART design highlights
    Coordination algorithms.
    Placement algorithm for maximizing tasks to run.
  Benefits
    High measurement accuracy.
    Support more tasks.
    Meet memory constraints in devices.

In order to support temporal coordination, MOZART has three design highlights.
Communication between selectors and monitors
  Same path
    Tag following packets of selected flows.
  Reverse path
    Tag reverse packets of selected flows.
  Different path
    Send explicit packets.

[comment] merge 18&19. Add the tradeoff with the previous slide.

The second challenge is the communication between selectors and monitors. The first point is that communication is necessary: selectors and monitors can be located in different devices, and the selectors need to notify monitors which flows are selected. The second point is that timely coordination is necessary. We know from the first challenge that part of the traffic might already have passed by before a flow is selected. Thus, we need to notify monitors about the selected flows as early as possible, so that less traffic goes uncaptured in monitors. The third point is that the coordination overhead should be small as well. One of the benefits of our architecture is that we can reduce the reporting overhead from devices to the controller, as they only need to report statistics for selected flows. To reduce the overall overhead, we need to avoid introducing too much temporal coordination overhead into the network.
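
The three cases on this slide can be read as a simple decision rule. The sketch below encodes one interpretation and is an assumption: "same path" is taken to mean the monitor sits downstream of the selector on the flow's forward path, and "reverse path" that it sits downstream on the path of the returning traffic. The function name and return strings are hypothetical.

```python
from typing import List


def _downstream(path: List[str], first: str, second: str) -> bool:
    """True if `second` appears after `first` on `path`."""
    return (first in path and second in path
            and path.index(second) > path.index(first))


def choose_notification(selector: str, monitor: str,
                        forward_path: List[str],
                        reverse_path: List[str]) -> str:
    """Pick how a selector tells a monitor that a flow is selected."""
    if _downstream(forward_path, selector, monitor):
        # Later packets of the flow pass the selector, then the monitor.
        return "tag_following_packets"
    if _downstream(reverse_path, selector, monitor):
        # Returning packets pass the selector, then the monitor.
        return "tag_reverse_packets"
    # No traffic of this flow carries the tag from selector to monitor.
    return "send_explicit_packet"


forward = ["h0", "S0", "S1", "h1"]
reverse = list(reversed(forward))
loss_case = choose_notification("h0", "S1", forward, reverse)
scan_case = choose_notification("S1", "S0", forward, reverse)
other_case = choose_notification("S1", "S9", forward, reverse)
```

Tagging existing packets adds no extra packets to the network, which fits the goal of keeping coordination overhead small; explicit packets are the fallback when no flow traffic links the two devices.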