1
Enabling Flow-level Latency Measurements across Routers in Data Centers
Parmjeet Singh, Myungjin Lee, Sagar Kumar, Ramana Rao Kompella
2
Latency-critical applications in data centers
- Guaranteeing low end-to-end latency is important
  - Web search (e.g., Google's instant search service)
  - Retail advertising
  - Recommendation systems
  - High-frequency trading in financial data centers
- Operators want to troubleshoot latency anomalies
  - End-host latencies can be monitored locally
  - Detection, diagnosis, and localization across the network: routers/switches have no native support for latency measurements
3
Prior solutions
- Lossy Difference Aggregator (LDA), Kompella et al. [SIGCOMM '09]
  - Aggregate latency statistics
- Reference Latency Interpolation (RLI), Lee et al. [SIGCOMM '10]
  - Per-flow latency measurements
  - More suitable here due to its more fine-grained measurements
4
Deployment scenario of RLI
- Upgrading all switches/routers in a data center network
  - Pros: provides the finest granularity of latency-anomaly localization
  - Cons: significant deployment cost; possible downtime of entire production data centers
- In this work, we consider partial deployment of RLI
  - Our approach: RLI across Routers (RLIR)
5
Overview of RLI architecture
- Goal: per-flow latency statistics between interfaces
- Problem setting
  - Storing a timestamp for every packet at ingress and egress is infeasible due to high storage and communication cost
  - Regular packets do not carry timestamps
[Figure: a router with ingress interface I and egress interface E]
6
Overview of RLI architecture
- Premise of RLI: delay locality
- Approach
  1) The injector sends reference packets at regular intervals
  2) Each reference packet carries its ingress timestamp
  3) Linear interpolation: the latency estimator computes per-packet latency estimates (a sketch follows below)
  4) Per-flow estimates are obtained by aggregating per-packet estimates
[Figure: reference packets (R) and regular packets (L) plotted as delay vs. time; a linear interpolation line through the reference-packet delays gives the interpolated delay of each regular packet]
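A minimal sketch of the interpolation step in Python (helper names and the two-point form are illustrative assumptions, not the paper's implementation):

```python
# Hedged sketch of RLI-style linear interpolation between the two
# reference packets that bracket a regular packet's arrival.

def interpolate_delay(t_pkt, ref_before, ref_after):
    """Estimate a regular packet's delay at ingress time t_pkt.

    ref_before / ref_after: (ingress_time, measured_delay) tuples for the
    reference packets sent just before and just after the regular packet.
    """
    t1, d1 = ref_before
    t2, d2 = ref_after
    if t2 == t1:                        # degenerate: nothing to interpolate
        return d1
    frac = (t_pkt - t1) / (t2 - t1)     # position along the interpolation line
    return d1 + frac * (d2 - d1)

# Example: references at t=0.0 s (delay 2.0 ms) and t=1.0 s (delay 4.0 ms);
# a packet entering at t=0.25 s gets an estimated delay of 2.5 ms.
print(interpolate_delay(0.25, (0.0, 2.0), (1.0, 4.0)))  # -> 2.5
```

Per-flow estimates (step 4) then follow by aggregating these per-packet estimates, e.g., summing them per flow and dividing by the flow's packet count.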
7
Full vs. partial deployment
- Full deployment: 16 RLI sender-receiver pairs
- Partial deployment: 4 RLI senders + 2 RLI receivers
  - 81.25% deployment cost reduction
[Figure: a six-switch topology (Switches 1-6) with RLI senders (reference packet injectors) and RLI receivers (latency estimators) placed only on a subset of the switches]
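One way to reproduce the 81.25% figure (an assumption about the cost model, which the slide leaves implicit): if full deployment places both a sender and a receiver at each of the 16 points (32 modules) while partial deployment places 4 + 2 = 6, then 1 - 6/32 = 0.8125, i.e., an 81.25% reduction.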
8
Case 1: presence of cross traffic
- Issue: inaccurate link-utilization estimation at the sender leads to a high reference-packet injection rate (a sketch of this dependence follows below)
- Approach: do not actively address the issue
  - Evaluation shows little impact on the packet loss rate
  - Details in the paper
[Figure: the six-switch topology; cross traffic shares a bottleneck link, and Switch 1 estimates link utilization]
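For context, a hedged sketch of why utilization estimation matters (the function, parameters, and adaptation rule below are illustrative assumptions, not RLI's actual rule): the sender scales its injection rate with the spare capacity it believes the link has, so underestimating utilization inflates the rate.

```python
# Hedged sketch: injection rate as a function of estimated utilization.
# Only illustrates the dependence that makes misestimation matter.

def injection_rate(estimated_util, link_pps, max_fraction=0.1):
    """Reference packets per second for a link carrying link_pps packets/s.

    estimated_util: estimated link utilization in [0, 1].
    max_fraction: cap on the share of link capacity used for references.
    """
    spare = max(0.0, 1.0 - estimated_util)      # believed free capacity
    return link_pps * min(max_fraction, spare)  # inject into the spare share

# Unseen cross traffic: the sender estimates 0.2 utilization when the
# bottleneck is really at 0.95, so it injects twice what the true spare
# capacity would allow.
print(injection_rate(0.2, link_pps=1_000_000))   # -> 100000.0 (capped)
print(injection_rate(0.95, link_pps=1_000_000))  # -> 50000.0
```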
9
Case 2: RLI sender side
- Issue: traffic may take different routes at an intermediate switch
- Approach: the sender sends reference packets to all receivers
[Figure: the six-switch topology; a sender's traffic diverges at an intermediate switch toward different receivers]
10
Case 3: RLI receiver side
- Issue: hard to associate reference packets with regular packets that traversed the same path
- Approaches
  - Packet marking: requires native support from routers
  - Reverse ECMP computation: "reverse engineer" intermediate routes using the ECMP hash function (see the sketch below)
  - IP prefix matching: applicable in limited situations
[Figure: the six-switch topology; packets from multiple upstream paths arrive at one receiver]
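A minimal sketch of the reverse-ECMP idea (the hash below is illustrative; real switches use vendor-specific hash functions that the receiver would have to know or learn):

```python
import zlib

def ecmp_next_hop(five_tuple, num_paths):
    """Map a flow's 5-tuple to one of num_paths equal-cost next hops,
    imitating the intermediate switch's ECMP decision."""
    key = "|".join(str(f) for f in five_tuple).encode()
    return zlib.crc32(key) % num_paths   # illustrative hash, not a vendor's

# The receiver replays the hash for each arriving regular packet to infer
# which upstream path it took, then matches it with the reference packets
# whose 5-tuples map to the same path.
flow = ("10.0.1.5", "10.0.2.7", 12345, 80, "tcp")
print(ecmp_next_hop(flow, num_paths=2))  # 0 or 1: inferred upstream switch
```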
11
Deployment example in a fat-tree topology
[Figure: a fat-tree with RLI senders (reference packet injectors) and RLI receivers (latency estimators); IP prefix matching suffices at some receivers, while others need reverse ECMP computation or IP prefix matching; a prefix-matching sketch follows]
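A hedged sketch of IP-prefix matching (the prefixes and the prefix-to-sender mapping are hypothetical): in a fat-tree, host addresses are typically assigned per pod/edge switch, so a packet's source prefix can identify which sender's reference stream it shares a path with.

```python
import ipaddress

# Hypothetical prefix-to-sender mapping, assumed known to the receiver.
PREFIX_TO_SENDER = {
    ipaddress.ip_network("10.1.0.0/16"): "sender-at-pod-1",
    ipaddress.ip_network("10.2.0.0/16"): "sender-at-pod-2",
}

def match_sender(src_ip):
    """Infer which sender's reference packets a regular packet belongs with."""
    addr = ipaddress.ip_address(src_ip)
    for prefix, sender in PREFIX_TO_SENDER.items():
        if addr in prefix:
            return sender
    return None  # ambiguous: fall back to reverse ECMP computation

print(match_sender("10.1.3.4"))  # -> sender-at-pod-1
```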
12
Evaluation
- Simulation setup
  - Trace: regular traffic (22.4M pkts) + cross traffic (70M pkts)
  - Simulator: see figure below
- Results: accuracy of per-flow latency estimates
[Figure: simulation setup; a packet trace is split by a traffic divider into regular traffic through Switch 1 and Switch 2, a cross-traffic injector adds cross traffic, the RLI sender injects reference packets at a 10% or 1% injection rate, and the RLI receiver estimates per-flow latencies]
13
Accuracy of per-flow latency estimates
[Figure: CDFs of the relative error of per-flow latency estimates at 93% bottleneck link utilization, with curves for 10% and 1% reference-packet injection rates; marked values include 67% on the CDF axis and relative errors of 1.2%, 4.5%, 18%, and 31%]
14
Summary
- Low-latency applications run in data centers; localizing latency anomalies is important
- RLI provides flow-level latency statistics, but full deployment (i.e., on all routers/switches) is expensive
- We proposed a solution enabling partial deployment of RLI
  - Without significant loss in localization granularity (i.e., localization to every other router)
15
Thank you! Questions?