
1 High-Fidelity Latency Measurements in Low-Latency Networks
Ramana Rao Kompella, Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)

3 Low Latency Applications
- Many important data center applications require low end-to-end latencies (microseconds)
  - High Performance Computing – loses parallelism
  - Cluster Computing, Storage – lose performance
  - Automated Trading – loses arbitrage opportunities
- Cloud applications
  - Recommendation systems, social collaboration
  - All-up SLAs of 200 ms [AlizadehSigcomm10]
  - The SLA must cover backend computation time, leaving little budget for network latencies

4 Latency Measurements are Needed
(Figure: a data center path from ToR switches through edge and core routers; an end-to-end latency of 1 ms is observed, but which router causes the problem?)
- At every router, high-fidelity measurements are critical to localize root causes
- Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis
- Measurement within a router is necessary

5 Vision: Knowledge Plane
(Figure: a knowledge plane sits above the data center network; latency measurements are pushed to it or pulled by it, and a query interface serves queries and responses for SLA diagnosis, routing/traffic engineering, and scheduling/job placement.)

6 Contributions Thus Far…
- Aggregate Latency Estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated Latency Estimation (per-flow latency measurements at every hop)
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Delay Sketching – (under review at Sigcomm 2011)
- Scalable Query Interface (per-packet latency measurements)
  - MAPLE – (under review at Sigcomm 2011)

7 1) PER-FLOW MEASUREMENTS WITH REFERENCE LATENCY INTERPOLATION [SIGCOMM 2010]

8 Obtaining Fine-Grained Measurements
- Native router support: SNMP, NetFlow
  - No latency measurements
- Active probes and tomography
  - Too many probes required (~10,000 per second), wasting bandwidth
- Expensive high-fidelity measurement boxes
  - The London Stock Exchange uses Corvil boxes
  - Cannot place them ubiquitously
- Recent work: LDA [Kompella09Sigcomm]
  - Computes average latency/variance accurately within a switch
  - A good start, but may not be sufficient to diagnose flow-specific problems

9 From Aggregates to Per-Flow
(Figure: delay over time at a switch queue; over an averaging interval, some flows see large delays and others small ones, so the aggregate average hides the difference.)
- Observation: there are significant differences in average latency across flows at a router
- Goal of this paper: how to obtain per-flow latency measurements in a scalable fashion?

10 Measurement Model
(Figure: a router with ingress interface I and egress interface E.)
- Assumption: time synchronization between router interfaces
- Constraint: cannot modify regular packets to carry timestamps
  - That would require intrusive changes to the router forwarding path

11 Naïve Approach
- For each flow key:
  - Store timestamps for each packet at I and E
  - After a flow stops sending, I sends its packet timestamps to E
  - E computes individual packet delays, then aggregates average latency, variance, etc. per flow (a sketch follows)
(Figure: egress timestamps minus ingress timestamps give per-packet delays; averaging per flow yields, e.g., avg. delay = 22/2 = 11 for one flow and 32/2 = 16 for another.)
- Problem: high communication costs
  - At 10 Gbps, a few million packets per second
  - Sampling reduces communication, but also reduces accuracy
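To make the naïve scheme concrete, here is a minimal sketch (hypothetical helper names; in a real router the timestamps would be shipped between line cards, not kept in one process):

```python
from collections import defaultdict

# Naive per-flow latency: keep every packet timestamp at both interfaces,
# then difference them pairwise once the flow ends.
ingress_ts = defaultdict(list)   # flow key -> ingress timestamps
egress_ts = defaultdict(list)    # flow key -> egress timestamps

def record(side, flow, ts):
    (ingress_ts if side == "I" else egress_ts)[flow].append(ts)

def flow_stats(flow):
    delays = [e - i for i, e in zip(ingress_ts[flow], egress_ts[flow])]
    mean = sum(delays) / len(delays)
    var = sum((d - mean) ** 2 for d in delays) / len(delays)
    return mean, var

# Two packets with delays 10 and 12 -> mean 11, as in the slide's example
record("I", "f1", 13); record("I", "f1", 18)
record("E", "f1", 23); record("E", "f1", 30)
print(flow_stats("f1"))  # (11.0, 1.0)
```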

12 A (Naïve) Extension of LDA
- Maintain LDAs with many counters (packet count, sum of timestamps) for the flows of interest, coordinated between ingress I and egress E to produce per-flow latency
- Problem: (potentially) high communication costs
  - Proportional to the number of flows

13 Key Observation: Delay Locality
(Figure: packet delays over time; packets with delays D1, D2, D3 sit inside windows whose average delays are WD1, WD2, WD3.)
- True mean delay = (D1 + D2 + D3) / 3
- Localized mean delay = (WD1 + WD2 + WD3) / 3
- How close is the localized mean delay to the true mean delay as the window size varies? (A toy version of this check is sketched below.)
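A toy version of the locality check, with a synthetic sawtooth delay process standing in for the paper's router traces (all numbers here are made up for illustration):

```python
import random

# Toy locality check: a short flow's "localized" mean replaces each of its
# packets' delays with the mean delay of *all* traffic inside a window
# around that packet; we compare it against the flow's true mean.
random.seed(1)
all_times = sorted(random.uniform(0, 1.0) for _ in range(20000))
delay = lambda t: 50 + 400 * (t % 0.1) + random.gauss(0, 2)  # sawtooth, in us
all_delays = [delay(t) for t in all_times]

flow_times = sorted(random.uniform(0.0, 0.02) for _ in range(20))  # bursty flow
flow_true = sum(delay(t) for t in flow_times) / len(flow_times)

def localized_mean(window):
    wds = []
    for t in flow_times:
        near = [d for s, d in zip(all_times, all_delays) if abs(s - t) <= window / 2]
        wds.append(sum(near) / len(near))
    return sum(wds) / len(wds)

for w in (0.001, 0.01, 1.0):  # 1 ms, 10 ms, 1 s windows
    print(w, abs(localized_mean(w) - flow_true) / flow_true)  # error grows with window
```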

14 Key Observation: Delay Locality
(Figure: scatter of local mean delay per key vs. true mean delay per key, for data sets from a real router and synthetic queueing models. Smaller windows track the true per-key means more closely: RMSRE = 0.054 for 0.1 ms windows, 0.16 for 10 ms, and 1.72 for 1 s; a single global mean is also shown for comparison.)

15 Exploiting Delay Locality
- Reference packets are injected regularly at the ingress I
  - Special packets carrying an ingress timestamp
  - They provide reference delay values (a substitute for window averages)
  - Used to approximate the latencies of regular packets
(Figure: reference packets sample points along the delay-vs.-time curve.)

16 RLI Architecture
(Figure: between ingress I and egress E, a reference packet R carrying an ingress timestamp is injected into the stream of regular packets 1, 2, 3.)
- Component 1: Reference Packet Generator
  - Injects reference packets regularly
- Component 2: Latency Estimator
  - Estimates packet latencies and updates per-flow statistics
  - Estimates directly at the egress, with no extra state maintained at the ingress side (reduces storage and communication overheads)

17 Component 1: Reference Packet Generator
- Question: when to inject a reference packet?
- Idea 1: 1-in-n: inject one reference packet every n packets
  - Problem: low accuracy under low utilization
- Idea 2: 1-in-τ: inject one reference packet every τ seconds
  - Problem: bad when short-term delay variance is high
- Our approach: dynamic injection based on utilization (see the sketch below)
  - High utilization → low injection rate
  - Low utilization → high injection rate
  - The adaptive scheme works better than fixed-rate schemes
  - Details in the paper
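A minimal sketch of utilization-adaptive pacing; the specific rates and the linear mapping are assumptions for illustration, not the paper's exact injection rule:

```python
# Utilization-adaptive reference-packet pacing: inject frequently when the
# link is lightly loaded and back off when it is busy. The rates below are
# made-up illustration values.
def injection_interval(utilization,
                       min_interval=0.0001,  # 10,000 ref pkts/s when idle
                       max_interval=0.01):   # 100 ref pkts/s when saturated
    u = min(max(utilization, 0.0), 1.0)
    return min_interval + u * (max_interval - min_interval)

next_ref = 0.0
def maybe_inject(now, utilization, send_reference):
    """Call on every packet opportunity; injects when the pacing timer fires."""
    global next_ref
    if now >= next_ref:
        send_reference(now)  # the reference packet carries this ingress timestamp
        next_ref = now + injection_interval(utilization)
```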

18 Component 2: Latency Estimator
- Question 1: how to estimate latencies using reference packets?
- Solution: different estimators are possible (a sketch follows)
  - Use only the delay of the left reference packet (RLI-L)
  - Use linear interpolation between the left and right reference packets (RLI)
  - Other non-linear estimators are possible (e.g., shrinkage)
(Figure: the arrival times and delays of the left and right reference packets are known, but only the arrival time of a regular packet in between; RLI reads the packet's estimated delay off the linear interpolation line, with some error relative to the true delay.)
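The RLI estimator itself is just linear interpolation between the two neighboring reference delays; a sketch:

```python
# RLI's linear-interpolation estimator: a regular packet arriving at time t
# between reference packets (t_l, d_l) and (t_r, d_r) gets the delay read
# off the line joining the two reference delays. RLI-L would return d_l.
def rli_estimate(t, t_l, d_l, t_r, d_r):
    if t_r == t_l:
        return d_l
    frac = (t - t_l) / (t_r - t_l)
    return d_l + frac * (d_r - d_l)

# e.g., references with delays 10us at t=0 and 20us at t=1:
print(rli_estimate(0.25, 0.0, 10.0, 1.0, 20.0))  # 12.5
```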

19 Component 2: Latency Estimator
- Question 2: how to compute per-flow latency statistics?
- Solution: maintain 3 counters per flow at the egress side (sketched below)
  - C1: number of packets
  - C2: sum of packet delays
  - C3: sum of squares of packet delays (for estimating variance)
- When a flow is exported, avg. latency = C2 / C1
- To minimize state, any flow selection strategy can be used to maintain counters for only a subset of flows
(Figure: packets wait in an interpolation buffer until the right reference packet arrives; each packet's estimated delay and squared delay then update its flow's counters, subject to the flow selection stage.)
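A sketch of the per-flow counter updates and export (the flow-selection stage is omitted):

```python
from collections import defaultdict

# Per-flow statistics at the egress: three counters per selected flow.
# C1 = packet count, C2 = sum of estimated delays, C3 = sum of squared delays.
stats = defaultdict(lambda: [0, 0.0, 0.0])

def update_flow(flow, est_delay):
    c = stats[flow]
    c[0] += 1
    c[1] += est_delay
    c[2] += est_delay ** 2

def export_flow(flow):
    c1, c2, c3 = stats.pop(flow)
    mean = c2 / c1
    var = c3 / c1 - mean ** 2   # E[d^2] - E[d]^2
    return mean, var
```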

20 Experimental Setup
- Data sets
  - No public data center traces with timestamps
  - Real router traces with synthetic workloads: WISC
  - Real backbone traces with synthetic queueing: CHIC and SANJ
- Simulation tool: open-source NetFlow software (YAF)
  - Supports the reference packet injection mechanism
  - Simulates a queueing model with the RED active queue management policy
- Experiments with different link utilizations

21 Accuracy under High Link Utilization
(Figure: CDF of the relative error; the median relative error is 10–12%.)

22 Comparison with Other Solutions
(Figure: average relative error vs. utilization, with a packet sampling rate of 0.1%; RLI's error is 1–2 orders of magnitude lower than the alternatives.)

23 Overhead of RLI
- Bandwidth overhead is low
  - Less than 0.2% of link capacity
- Impact on packet loss is small
  - The difference in packet loss with and without RLI is at most 0.001% at around 80% utilization

24 Summary
- A scalable architecture for high-fidelity per-flow latency measurements between router interfaces
- Achieves a median relative error of 10–12%
- Obtains 1–2 orders of magnitude lower relative error than existing solutions
- Measurements are obtained directly at the egress side

25 Contributions Thus Far…
- Aggregate Latency Estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated Latency Estimation
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Virtual LDA – (under review at Sigcomm 2011)
- Scalable Query Interface (per-packet latency measurements)
  - MAPLE – (under review at Sigcomm 2011)

26 2) SCALABLE PER-PACKET LATENCY MEASUREMENT ARCHITECTURE (UNDER REVIEW AT SIGCOMM 2011)

27 MAPLE Motivation
- LDA and RLI are ossified in their aggregation level
  - Not suitable for obtaining arbitrary sub-population statistics
  - A single packet's delay may be important
- Key goal: how to enable a flexible and scalable architecture for packet latencies?

28 MAPLE Architecture
(Figure: at router A, a timestamp unit tags packet P1 with timestamp T1; the packet latency store records (P1, D1); a central monitor queries the query engine with Q(P1) and receives A(P1).)
- Timestamping is not strictly required
  - MAPLE can work with RLI-estimated latencies

29 Packet Latency Store (PLS)
- Challenge: how to store packet latencies in the most efficient manner?
- Naïve idea: hash tables do not scale well
  - At a minimum, require a label (32 bits) + timestamp (32 bits) per packet
  - To avoid collisions, need a large number of hash table entries (~147 bits/pkt for a collision rate of 1%)
- Can we do better?

30 Our Approach
- Idea 1: cluster packets
  - There are typically a few dominant delay values
  - Cluster packets into equivalence classes and associate one delay value with each cluster
  - Choose cluster centers such that the error is small
- Idea 2: provision storage
  - Naïvely, one Bloom filter per cluster (Partitioned Bloom Filter)
  - We propose a new data structure, the Shared-Vector Bloom Filter (SVBF), that is more efficient

31 Selecting Representative Delays
- Approach 1: logarithmic delay selection (see the sketch below)
  - Divide the delay range into logarithmic intervals
  - E.g., 0.1–10,000 μs → 0.1–1 μs, 1–10 μs, …
  - Simple to implement, bounded relative error, but accuracy may not be optimal
- Approach 2: dynamic clustering
  - k-means (k-medians) clustering formulation
  - Minimizes the average absolute error of packet latencies (minimizes total Euclidean distance)
- Approach 3: hybrid clustering
  - Split centers equally between static and dynamic
  - Best of both worlds
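A sketch of logarithmic center selection; taking the geometric mean of each bin's endpoints as its representative is an assumption for illustration:

```python
import math

# Logarithmic center selection: one representative delay per logarithmic
# bin over, e.g., 0.1us .. 10,000us.
def log_centers(lo=0.1, hi=10_000.0, per_decade=1):
    n = int(round(math.log10(hi / lo))) * per_decade
    edges = [lo * (hi / lo) ** (i / n) for i in range(n + 1)]
    return [math.sqrt(a * b) for a, b in zip(edges, edges[1:])]

def nearest_center(delay, centers):
    return min(centers, key=lambda c: abs(c - delay))

print(log_centers())  # one center per decade between 0.1 and 10,000 us
```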

32 K-means
- Goal: determine k centers every measurement cycle
  - Can be formulated as a k-means clustering problem
- Problem 1: running k-means is typically hard
  - The basic algorithm has O(n^(k+1) log n) run time
  - Heuristics (Lloyd's algorithm) are also complicated in practice
- Solution: sampling and streaming algorithms (a toy sketch follows)
  - Use sampling to reduce n to pn
  - Use a streaming k-medians algorithm (approximate but sufficient)
- Problem 2: can't find centers and record membership at the same time
- Solution: pipelined implementation
  - Use the previous interval's centers as an approximation for this interval
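A toy stand-in for the sample-then-cluster pipeline. Note this uses a simple Lloyd-style k-medians on the sampled delays rather than the streaming algorithm of [CharikarSTOC03]:

```python
import random

# Sample delays with probability p, then run a few Lloyd-style k-medians
# iterations in 1-D. Illustrates the pipeline only; the paper's online
# stage is a true streaming algorithm.
def kmedians_1d(values, k, iters=10):
    centers = sorted(random.sample(values, k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[i].append(v)
        centers = [sorted(g)[len(g) // 2] if g else c   # median of each group
                   for g, c in zip(groups, centers)]
    return sorted(centers)

delays = [random.choice((8, 45, 300)) + random.gauss(0, 2) for _ in range(5000)]
sample = [d for d in delays if random.random() < 0.1]   # p = 10% sampling
print(kmedians_1d(sample, k=3))  # roughly [8, 45, 300]
```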

33 Streaming k-Medians [CharikarSTOC03]
(Figure: in hardware, the packet stream is sampled (np packets in the i-th epoch) and inserted into the storage data structure, which is flushed to DRAM/SSD after every epoch for archival; in software, an online clustering stage produces O(k log(np)) centers at the (i+1)-th epoch, and an offline clustering stage reduces them to the k centers used for packets in the (i+2)-th epoch.)

34 Naïve: Partitioned BF (PBF)
- One Bloom filter per cluster center (c1, c2, c3, c4)
- Insertion: match the packet's latency to the closest center in parallel, then set bits in that center's Bloom filter by hashing the packet contents
- Lookup: query all Bloom filters with the packet contents; the filter in which all hashed bits are 1 identifies the center (a sketch of both operations follows)
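A minimal PBF sketch; filter sizes and hash counts are illustrative rather than the paper's provisioning:

```python
import hashlib

# Partitioned Bloom filter: one Bloom filter per cluster center.
M, K = 1 << 16, 9   # bits per filter, hash functions (illustrative)

def hashes(pkt: bytes):
    for i in range(K):
        h = hashlib.sha256(bytes([i]) + pkt).digest()
        yield int.from_bytes(h[:4], "big") % M

class PBF:
    def __init__(self, centers):
        self.centers = centers
        self.filters = [bytearray(M // 8) for _ in centers]

    def insert(self, pkt, latency):
        i = min(range(len(self.centers)),
                key=lambda j: abs(latency - self.centers[j]))
        for b in hashes(pkt):
            self.filters[i][b >> 3] |= 1 << (b & 7)

    def lookup(self, pkt):   # must probe every filter -> K * #centers reads
        hs = list(hashes(pkt))
        return [self.centers[i] for i, f in enumerate(self.filters)
                if all(f[b >> 3] & (1 << (b & 7)) for b in hs)]

pbf = PBF(centers=[10.0, 50.0, 200.0, 1000.0])   # c1..c4, in microseconds
pbf.insert(b"pkt-1", 47.0)
print(pbf.lookup(b"pkt-1"))   # [50.0] (plus possible false-positive matches)
```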

35 Problems with PBF
- Provisioning is hard
  - Cluster sizes are not known a priori
  - Leads to over- or under-estimation of BF sizes
- Lookup complexity is higher
  - The data structure must be re-partitioned every cycle
  - Requires looking up multiple random locations in the bitmap (one per hash function, per filter)

36 Shared-Vector Bloom Filter
- A single bit vector is shared by all centers; each hash function selects a group of consecutive bit positions, one per center
- Insertion: match the packet's latency to the closest center in parallel; locate a bit position by hashing the packet contents, offset it by the id of the matched center, and set that bit to 1
- Lookup: for each hash function, bulk-read the whole group of bits, AND the groups together, and read the matched center id from the position of the surviving 1 bit (a sketch follows)
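A minimal SVBF sketch along the same lines (again with illustrative sizes); note that insertion touches one bit per hash, and lookup reads one contiguous k-bit group per hash:

```python
import hashlib

# Shared-Vector Bloom Filter: one bit vector shared by all centers. Each
# hash picks a *group* of k consecutive bits (k = number of centers);
# insertion sets the bit at offset = matched-center id, and lookup ANDs
# the groups so one bulk read per hash suffices.
K_HASH, GROUPS = 9, 1 << 16   # hash functions, groups (illustrative)

class SVBF:
    def __init__(self, centers):
        self.centers = centers
        self.k = len(centers)
        self.bits = [0] * (GROUPS * self.k)

    def _groups(self, pkt: bytes):
        for i in range(K_HASH):
            h = hashlib.sha256(bytes([i]) + pkt).digest()
            yield (int.from_bytes(h[:4], "big") % GROUPS) * self.k

    def insert(self, pkt, latency):
        c = min(range(self.k), key=lambda j: abs(latency - self.centers[j]))
        for g in self._groups(pkt):
            self.bits[g + c] = 1          # offset within the group = center id

    def lookup(self, pkt):
        alive = (1 << self.k) - 1          # bitmask over all centers
        for g in self._groups(pkt):
            word = 0                       # "bulk read" of the k-bit group
            for j in range(self.k):
                word |= self.bits[g + j] << j
            alive &= word
        return [self.centers[j] for j in range(self.k) if alive >> j & 1]

svbf = SVBF(centers=[10.0, 50.0, 200.0, 1000.0])
svbf.insert(b"pkt-1", 180.0)
print(svbf.lookup(b"pkt-1"))  # [200.0], barring false positives
```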

37 Comparing PBF and SVBF
- PBF
  - (−) Lookup is not easily parallelizable
  - (−) Provisioning is hard since the number of packets per BF is not known a priori
- SVBF
  - (+) A single Bloom filter is used
  - (+) Burst reads at the length of a word
- COMB [Hao10Infocom]
  - (+) A single BF with groups of hash functions
  - (−) More memory usage than SVBF, and burst reads are not possible

38 Comparing Storage Needs
For the same classification failure rate of 1% and 50 centers (k = 50):

Data Structure | # of Hash Functions | Capacity (bits/entry) | Insertion | Lookup           | Note
-------------- | ------------------- | --------------------- | --------- | ---------------- | ----
Hash Table     | 1                   | 147                   | 1         | 1                | Stores only the latency value (no label)
PBF            | 9                   | 12.8                  | 9         | 450              | Provisioning is hard (12.8 bits/entry only if cardinality is known beforehand)
COMB           | 7                   | 12.8                  | 14        | 77               | (alternate combinations exist)
SVBF           | 9                   | 12.8                  | 9         | 27 (burst reads) | Provisioning is easy

39 Tie-Breaking Heuristic
- Bloom filters have false positives
  - Lookups search across all BFs, so multiple BFs may return a match
- The tie-breaking heuristic returns the group with the highest cardinality (sketched below)
  - Store a counter per center recording the number of packets matched to that center (cluster cardinality)
- Works well in practice (especially for skewed distributions)
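A sketch of the tie-breaker layered on the SVBF sketch above; `svbf` is an instance of that class and `counts` is the new per-center cardinality counter:

```python
from collections import Counter

# Tie-breaking: count insertions per center; on a multi-match, return the
# most popular (highest-cardinality) center among the matches.
counts = Counter()

def insert_counted(svbf, pkt, latency):
    c = min(svbf.centers, key=lambda x: abs(latency - x))
    counts[c] += 1
    svbf.insert(pkt, latency)

def lookup_tiebreak(svbf, pkt):
    matches = svbf.lookup(pkt)
    if not matches:
        return None                       # no-match
    return max(matches, key=lambda c: counts[c])
```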

40 Estimation Accuracy
(Figure: CDF of the absolute error (μs) of per-packet latency estimates.)

41 Accuracy of Aggregates
(Figure: CDF of the relative error of aggregate latency estimates.)

42 MAPLE Architecture: 2) Query Engine
(Figure: the central monitor sends a query Q(P1) to router A's query engine and receives the answer A(P1).)

43 Query Interface
- Assumption: the path of a packet is known
  - Possible to determine using forwarding tables
  - In OpenFlow-enabled networks, the controller has this information
- Query answer:
  - A latency estimate
  - A type: (1) match, (2) multi-match, (3) no-match

44 Query Bandwidth
- Query method 1: query using a packet hash
  - Hash over invariant fields in the packet header
  - High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)
- Query method 2: query using flow key and IP identifier
  - Supports range search to reduce query bandwidth overhead
  - Inserts: hash on flow key and IPID
  - Queries: send a flow key plus ranges of contiguous IPIDs (e.g., flow f1 with IPID blocks 1–5 and 20–35), as sketched below
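A sketch of packing a flow's IPIDs into contiguous blocks for the query message, reproducing the f1 example:

```python
# Range-compress a flow's IPIDs into contiguous blocks: querying ranges
# instead of individual IPIDs shrinks the query message.
def ipid_ranges(ipids):
    ids = sorted(set(ipids))
    ranges, start = [], ids[0]
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:               # gap ends the current block
            ranges.append((start, prev))
            start = cur
    ranges.append((start, ids[-1]))
    return ranges

print(ipid_ranges([1, 2, 3, 4, 5] + list(range(20, 36))))  # [(1, 5), (20, 35)]
```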

45 Query Bandwidth Compression
(Figure: CDF of the compression ratio; the median per-flow compression reduces query bandwidth by 90%.)

46 Storage
- OC-192 interface
  - ~5 million packets per second → about 60 Mbit/s written to the latency store
  - Assuming 10% utilization, 6 Mbit/s
- DRAM (16 GB): ~40 minutes of packets
- SSD (256 GB): ~10 hours of packets, enough time for diagnosis
(A back-of-the-envelope check follows.)
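A back-of-the-envelope check of these numbers, assuming the ~12.8 bits/entry cost from the storage comparison (the exact per-packet cost is an assumption here):

```python
# Rough check of the retention arithmetic at full line rate.
pkts_per_sec = 5e6                          # OC-192, small packets
bits_per_pkt = 12.8                         # assumed SVBF cost per entry
rate = pkts_per_sec * bits_per_pkt / 1e6    # ~64 Mbit/s; slide rounds to 60
dram_bits = 16 * 8e9                        # 16 GB
ssd_bits = 256 * 8e9                        # 256 GB
print(rate,
      dram_bits / (rate * 1e6) / 60,        # ~33 minutes of DRAM
      ssd_bits / (rate * 1e6) / 3600)       # ~9 hours of SSD
# matches the slide's "40 minutes" / "10 hours" ballpark
```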

47 Summary
- RLI and LDA are ossified in their aggregation level
- Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations
  - Relies on clustering of dominant delay values
  - Novel SVBF data structure to reduce storage and lookup complexity

48 Conclusion
- Many applications demand low latencies
- Network operators need high-fidelity tools for latency measurement
- Proposed RLI for fine-grained per-flow measurements
- Proposed MAPLE to:
  - Store per-packet latencies in a scalable way
  - Compose latency aggregates across arbitrary sub-populations
- Many other solutions (papers on my web page)

49 Sponsors
- CNS-1054788, NSF CAREER: Towards a Knowledge Plane for Data Center Networks
- CNS-0831647, NSF NECO: Architectural Support for Fault Management
- Cisco Systems: Designing Router Primitives for Monitoring Network Health

