Language-Directed Hardware Design for Network Performance Monitoring

Presentation transcript:

Language-Directed Hardware Design for Network Performance Monitoring Srinivas Narayana, Anirudh Sivaraman, Vikram Nathan, Prateesh Goyal, Venkat Arun, Mohammad Alizadeh, Vimal Jeyakumar, and Changhoon Kim

Example: Who caused a microburst? Queue build-up happens deep in the network. Existing approaches: end-to-end probes, sampling, counters & sketches, mirroring packets. Per-packet information is challenging to handle in software: a 6.4 Tbit/s switch needs ~100M records/s, while a COTS server processes 100K-1M records/s per core.

Switches should be first-class citizens in performance monitoring.

Why monitor from switches? Switches already see the queues and the concurrent connections, but it is infeasible to stream all of that data out for external processing. Can we filter and aggregate performance data on the switches directly?

We want to build “future-proof” hardware: Language-directed hardware design

Performance monitoring use cases → expressive query language → line-rate switch hardware primitives.

Contributions: Marple, a performance query language; a line-rate switch hardware design, with aggregation via a programmable key-value store; and a query compiler.

System overview: Marple queries are fed to the query compiler, which emits switch programs for programmable switches equipped with the key-value store; results flow to collection servers.

Marple: Performance query language

Marple: Performance query language. Stream: for each packet at each queue, S := (switch, qid, hdrs, uid, tin, tout, qsize). The fields give the location (switch, qid), the packet identification (hdrs, uid), the queue entry and exit timestamps (tin, tout), and the queue depth seen by the packet (qsize).

Marple: Performance query language. The stream S := (switch, qid, hdrs, uid, tin, tout, qsize) is manipulated with familiar functional operators: filter, map, zip, and groupby.
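As a hedge, here is a minimal Python sketch (ours, not from the talk) of the stream abstraction and the stateless operators; the helper names pkt_filter/pkt_map and all field values are illustrative assumptions:

from collections import namedtuple

# One record per packet per queue, mirroring S above.
Packet = namedtuple("Packet", ["switch", "qid", "hdrs", "uid", "tin", "tout", "qsize"])

def pkt_filter(stream, pred):
    """filter: keep only packets satisfying the predicate."""
    return (p for p in stream if pred(p))

def pkt_map(stream, fn):
    """map: transform each packet record."""
    return (fn(p) for p in stream)

# Example stream with two packets (times in nanoseconds, made-up values).
S = [
    Packet("sw1", 0, {"proto": "TCP"}, 1, tin=100, tout=2_000_100, qsize=50),
    Packet("sw1", 0, {"proto": "UDP"}, 2, tin=200, tout=900, qsize=3),
]

# High-queue-latency packets (cf. the filter example below) and per-packet latencies.
high_latency = list(pkt_filter(S, lambda p: p.tout - p.tin > 1_000_000))  # > 1 ms
latencies = list(pkt_map(S, lambda p: p.tout - p.tin))
print(high_latency, latencies)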

Example: High queue latency packets. R1 = filter(S, tout - tin > 1 ms)

Example: Per-flow average latency
R1 = filter(S, proto == TCP)
R2 = groupby(R1, 5tuple, ewma)
def ewma([avg], [tin, tout]):   # fold function
    avg = (1-⍺)*avg + ⍺*(tout - tin)
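The fold above is in Marple syntax; a rough, runnable Python equivalent (ours, with an assumed ⍺ = 0.1 and made-up latencies) looks like this:

ALPHA = 0.1  # smoothing factor, written as ⍺ on the slide

def ewma(avg, tin, tout):
    """Fold one packet's queueing latency into the running average."""
    return (1 - ALPHA) * avg + ALPHA * (tout - tin)

# Per-key state, as groupby would maintain it: one avg per 5-tuple.
state = {}
for flow, tin, tout in [("A", 0, 40), ("A", 100, 160), ("B", 0, 10)]:
    state[flow] = ewma(state.get(flow, 0.0), tin, tout)
print(state)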

Example: Microburst diagnosis
def bursty([last_time, nbursts], [tin]):
    if tin - last_time > 800 ms:
        nbursts = nbursts + 1
    last_time = tin
result = groupby(S, 5tuple, bursty)
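An illustrative Python version of the bursty fold (ours); the nanosecond units, the (0, 0) initial state, and the example packet times are assumptions for the sketch:

BURST_GAP_NS = 800_000_000  # the 800 ms threshold from the query, in nanoseconds

def bursty(state, tin):
    """Fold: count a new burst whenever the inter-packet gap exceeds the threshold."""
    last_time, nbursts = state
    if tin - last_time > BURST_GAP_NS:
        nbursts += 1
    return (tin, nbursts)

# groupby keeps one (last_time, nbursts) pair per 5-tuple.
state = {}
for flow, tin in [("A", 0), ("A", 10), ("A", 2_000_000_000), ("B", 0)]:
    state[flow] = bursty(state.get(flow, (0, 0)), tin)
print({k: v[1] for k, v in state.items()})  # bursts counted per flow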

Many performance queries (see paper). Transport protocol diagnoses: fan-in problems (incast and outcast), incidents of reordering and retransmissions, interference from bursty traffic. Flow-level metrics: packet drop rates, queue latency EWMA per connection, incidence and lengths of flowlets. Network-wide questions: route flapping, high end-to-end latencies, locations of persistently long queues, ...

Implementing Marple on switches

Implementing Marple on switches. The packet stream S := (switch, qid, hdrs, uid, tin, tout, qsize) comes from switch telemetry [INT, SOSR '15]; the stateless operators filter, map, and zip compile to match-action rules [RMT, SIGCOMM '13]; the stateful groupby is the remaining challenge, addressed next.

Implementing aggregation
ewma_query = groupby(S, 5tuple, ewma)
def ewma([avg], [tin, tout]):
    avg = (1-⍺)*avg + ⍺*(tout - tin)
Requirements: compute and update values at switch line rate (1 pkt/ns), and scale to millions of aggregation keys (e.g., 5-tuples).

Challenge: Neither SRAM nor DRAM is both fast and dense.

Caching: the illusion of fast and large memory.

Caching: an on-chip cache (SRAM) of key-value pairs, backed by an off-chip backing store (DRAM). In the traditional design, processing a packet means reading the value for its 5-tuple key K, modifying it using the fold (e.g., ewma), and writing back the updated value.

On a cache miss, the switch must request key K from the backing store and wait for the response Vback before it can modify and write back. The modify and write must wait for DRAM, and these non-deterministic latencies stall the packet pipeline.

Instead, we treat cache misses as packets from new flows.

Cache misses as new keys: on a miss for key K, the switch immediately inserts K into the on-chip cache (SRAM) with the fold's default initial value V0 and updates it, without consulting DRAM. If that requires evicting another key K', the evicted pair (K', V'cache) is sent to the off-chip backing store (DRAM), which merges it with the value V'back it already holds for K'. Nothing returns and there is nothing to wait for: packet processing never waits for DRAM, so the switch retains the 1 pkt/ns processing rate.
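Below is a simplified Python sketch (ours, not the hardware design) of this evict-on-miss behavior, using a packet-counter fold so that merging in the backing store is just addition; the capacity, the eviction choice, and the dict-based "DRAM" are all illustrative:

CACHE_CAPACITY = 4
V0 = 0  # the fold's initial value (here: a packet counter starting at zero)

cache = {}          # on-chip SRAM: key -> value
backing_store = {}  # off-chip DRAM: key -> value

def merge(v_cache, v_back):
    """For a counter fold, merging two partial counts is just addition."""
    return v_cache + v_back

def process_packet(key):
    if key in cache:
        cache[key] += 1                      # hit: update in SRAM, no DRAM involved
        return
    if len(cache) >= CACHE_CAPACITY:         # miss with a full cache: evict someone
        victim, v_evicted = cache.popitem()  # eviction choice is arbitrary in this sketch
        backing_store[victim] = merge(v_evicted, backing_store.get(victim, V0))
    cache[key] = V0 + 1                      # treat the miss as a brand-new key; nothing to wait for

for k in ["a", "b", "c", "d", "e", "a", "e"]:
    process_packet(k)
print(cache, backing_store)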

How about value accuracy after evictions? The merge operation should "accumulate" the results of folds across evictions. Notation: g is the fold function, and g([p1, p2, ...]) denotes the action of g over a packet sequence (e.g., an EWMA).

The merge operation: with Vback = g([p1,…,pn]) (the evicted value in the backing store) and Vcache = g([q1,…,qm]) (the value rebuilt in the cache), we need merge(Vcache, Vback) = g([p1,…,pn,q1,…,qm]), i.e., the fold over the entire packet sequence. Example: if g is a counter, merge is just addition!
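A tiny check (ours) of this property for the counter case, where merge is addition:

def count_fold(packets):
    """g: a packet counter."""
    n = 0
    for _ in packets:
        n += 1
    return n

p = list(range(7))   # earlier packets, already folded into the backing store
q = list(range(3))   # later packets, folded in the cache after the eviction

v_back, v_cache = count_fold(p), count_fold(q)
assert v_cache + v_back == count_fold(p + q)  # merge (addition) == fold over everything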

Mergeability: any fold g can be merged by storing the entire packet sequence in the cache... but that's a lot of extra state! Can we merge a given fold with "small" extra state? Small: the extra state size is roughly the size of the state used by the fold itself.

There are useful fold functions that require a large amount of extra state to merge (see the formal result in the paper).

Linear-in-state: mergeable with small extra state. A fold is linear-in-state if its state S is updated per packet as S = A * S + B, where S is the state of the fold function and A and B are functions of a bounded number of packets in the past. Examples: packet and byte counters, EWMA, functions over a window of packets, ...
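A brief sketch (ours, not on the slide) of why linear-in-state folds merge cheaply: if each packet updates the state as S = A*S + B, then after N packets starting from an initial value S0 the state is S = (A1*A2*...*AN)*S0 + C, where C collects the B terms and does not depend on S0. The cache computes Vcache starting from the default V0, while the true continuation should have started from the evicted value Vback; substituting Vback for V0 gives merge(Vcache, Vback) = Vcache + (A1*...*AN)*(Vback - V0). The only extra state the cache must keep is the running product of the A's, updated with one multiply per packet.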

Example: EWMA merge operation. EWMA: S = (1-⍺)*S + ⍺*(function of the current packet). merge(Vcache, Vback) = Vcache + (1-⍺)^N * (Vback - V0), where N is the number of packets processed by the cache since the eviction and V0 is the fold's initial value. The extra state is small: just N (equivalently, (1-⍺)^N).
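A quick numerical check (ours) of this merge formula, assuming ⍺ = 0.1 and a default initial value V0 = 0:

import random

ALPHA = 0.1
V0 = 0.0  # the fold's default initial value

def ewma_fold(init, samples):
    """Fold an EWMA over a sequence of latency samples, starting from init."""
    avg = init
    for x in samples:
        avg = (1 - ALPHA) * avg + ALPHA * x
    return avg

p = [random.uniform(0, 10) for _ in range(50)]  # packets folded before the eviction
q = [random.uniform(0, 10) for _ in range(20)]  # packets folded by the cache afterwards

v_back = ewma_fold(V0, p)    # value sitting in the backing store
v_cache = ewma_fold(V0, q)   # value rebuilt in the cache from V0
N = len(q)                   # packets processed by the cache since the eviction

merged = v_cache + (1 - ALPHA) ** N * (v_back - V0)
assert abs(merged - ewma_fold(V0, p + q)) < 1e-9  # matches the fold over everything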

Microbursts: linear-in-state!
def bursty([last_time, nbursts], [tin]):
    if tin - last_time > 800 ms:
        nbursts = nbursts + 1
    last_time = tin
result = groupby(S, 5tuple, bursty)
nbursts is linear-in-state: S = A * S + B with A = 1 and B = 1 if the gap since the previous packet exceeds the burst threshold (i.e., this packet starts a new burst), 0 otherwise.

Other linear-in-state queries: counting successive TCP packets that are out of order, histograms of flowlet sizes, counting the number of timeouts in a TCP connection, ... 7 of the 10 example queries in our paper are linear-in-state.

Evaluation: Is processing the evictions feasible?

Eviction processing: each cache eviction sends a pair (K', V'cache) from the on-chip cache (SRAM) to the off-chip backing store (DRAM), where it is merged with the stored value V'back.

Eviction processing at the backing store: trace-based evaluation. Traces: "Core14" and "Core16" are core router traces from CAIDA (2014, 2016); "DC" is a university data center trace from [Benson et al., IMC '10]; each has ~100M packets. The query aggregates by 5-tuple (the key); results shown are for a key+value size of 256 bits and an 8-way set-associative LRU cache eviction policy. Eviction ratio: the percentage of incoming packets that result in a cache eviction.
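To make the methodology concrete, here is a scaled-down Python sketch (ours) of measuring an eviction ratio with an 8-way set-associative LRU cache; the synthetic heavy-tailed key distribution merely stands in for the CAIDA and data-center traces, and all sizes are illustrative:

import random
from collections import OrderedDict

NUM_SETS, WAYS = 1024, 8        # scaled-down cache: 8K entries total

def simulate(keys):
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    evictions = 0
    for k in keys:
        s = sets[hash(k) % NUM_SETS]
        if k in s:
            s.move_to_end(k)            # LRU hit: refresh recency
        else:
            if len(s) >= WAYS:
                s.popitem(last=False)   # evict the least recently used entry
                evictions += 1
            s[k] = True
    return evictions / len(keys)

# Synthetic "trace": heavy-tailed flow popularity, loosely imitating a real packet trace.
keys = [int(random.paretovariate(1.2)) for _ in range(1_000_000)]
print("eviction ratio: %.2f%%" % (100 * simulate(keys)))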

Eviction ratio vs. Cache size

Eviction ratio vs. cache size: with 2^18 keys (64 Mbits of cache at 256 bits per entry), the packet eviction ratio is about 4%, a 25X reduction compared with processing every packet in software.

From eviction ratio to eviction rate: consider a 64-port x 100-Gbit/s switch with 256 Mbits of cache memory (about 7.5% of chip area). The resulting eviction rate is about 8M records/s, which roughly 32 cores can absorb (consistent with the 100K-1M records/s per core figure from the motivation).

See more in the paper: more performance query examples, query compilation algorithms, an evaluation of hardware resources for stateful computations, and an implementation with end-to-end walkthroughs on Mininet.

Summary: on-switch aggregation reduces the software data-processing load, and linear-in-state folds give fully accurate per-flow aggregation at line rate. Come see our demo! Contact: alephtwo@csail.mit.edu, http://web.mit.edu/marple