Tail Latency: Networking


The story thus far
Tail latency is bad. Causes:
- Resource contention with background jobs
- Device failure
- Uneven split of data between tasks
- Network congestion for reducers

Ways to address tail latency
- Clone all tasks
- Clone slow tasks
- Copy intermediate data
- Remove/replace frequently failing machines
- Spread out reducers

What is missing from this picture?
Networking: spreading out reducers is not sufficient; the network is crucial.
Studies on Facebook traces show that [Orchestra]:
- in 26% of jobs, shuffle is 50% of the runtime
- in 16% of jobs, shuffle is more than 70% of the runtime
- 42% of tasks spend over 50% of their time writing to HDFS

Another implication of the network: it limits scalability
- The scalability of a Netflix-like recommendation system is bottlenecked by communication
- It did not scale beyond 60 nodes: communication time increased faster than computation time decreased
[Orchestra slides]

What is the impact of the network?
- Assume a 10 ms deadline for tasks [DCTCP]
- Simulate job completion times based on the distributions of task completion times (focus on the 99.9th percentile)
- For 40 tasks, about 4 tasks (14%) miss the deadline; for 400 tasks, about 14 (3%) do
[DeTail slides]
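
This fan-out effect can be illustrated with a toy simulation (a hypothetical sketch, not the actual trace data): draw each task's latency from an assumed heavy-tailed log-normal distribution, and treat a job as finished only when its slowest task finishes. The fraction of jobs missing the deadline then grows with the number of tasks per job.

```python
import random

def miss_fraction(n_tasks, deadline_ms=10.0, trials=2000, seed=1):
    """Fraction of simulated jobs whose slowest task exceeds the deadline.

    Task latencies (in ms) are drawn from an assumed log-normal
    distribution; the parameters are illustrative, not from real traces.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # A job completes when its slowest task completes.
        slowest = max(rng.lognormvariate(0.0, 1.0) for _ in range(n_tasks))
        if slowest > deadline_ms:
            misses += 1
    return misses / trials

# More tasks per job -> a higher chance that at least one task straggles.
frac_40, frac_400 = miss_fraction(40), miss_fraction(400)
```

Even though each individual task rarely blows the deadline, taking the max over hundreds of tasks makes a miss likely, which is why the per-task tail matters so much.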

What causes this variation in network transfer times?
First, consider the types of traffic in the network:
- Background traffic
  - Latency-sensitive short control messages, e.g. heartbeats and job status
  - Large files, e.g. HDFS replication and loading of new data
- Map-reduce jobs
  - Small RPC requests/responses with tight deadlines
  - HDFS reads or writes with tight deadlines

What causes this variation in network transfer times?
- No notion of priority: latency-sensitive and non-latency-sensitive traffic share the network equally
- Uneven load balancing: ECMP doesn't schedule flows evenly across all paths, and treats long and short flows the same
- Bursts of traffic: network buffers reduce loss but introduce latency (the time spent waiting in a buffer is variable), and kernel optimizations introduce burstiness
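
ECMP's unevenness follows from how it picks paths: it hashes the flow 5-tuple, so every packet of a flow is pinned to one path regardless of the flow's size. A minimal sketch (CRC32 as a stand-in for a switch's hash function; the field layout is hypothetical):

```python
import struct
import zlib

def ecmp_path(src_ip, dst_ip, sport, dport, proto, n_paths=4):
    """Pick one of n_paths by hashing the flow 5-tuple.

    Because the hash input is the same for every packet of a flow,
    the whole flow sticks to a single path. Two long flows that
    collide must share that path's bandwidth even if other equal-cost
    paths are idle -- the source of ECMP's uneven load balancing.
    """
    key = struct.pack("!IIHHB", src_ip, dst_ip, sport, dport, proto)
    return zlib.crc32(key) % n_paths

# A flow always maps to the same path, no matter how much it sends.
path = ecmp_path(0x0A000001, 0x0A000002, 40000, 80, 6)
same = ecmp_path(0x0A000001, 0x0A000002, 40000, 80, 6)
```

The hash is oblivious to flow size, which is exactly why "long and short flows are treated the same".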

Ways to eliminate variation and improve tail latency
- Make the network faster (HULL, DeTail, DCTCP): a faster network means a smaller tail
- Optimize how applications use the network (Orchestra, CoFlow): big-data transfers follow specific patterns; optimize those patterns to reduce transfer time
- Make the network aware of deadlines (D3, PDQ): tasks have deadlines, and there is no point doing any work if the deadline can't be met; prioritize flows and schedule them based on their deadlines

Fair sharing or deadline-based sharing?
- Fair sharing (the status quo): everyone plays nice, but some deadlines can be missed
- Deadline-based: deadlines are met, but may require a non-trivial implementation
Two ways to do deadline-based sharing [D3 slides]:
- Earliest deadline first (PDQ)
- Make bandwidth reservations for each flow (D3): flow rate = flow size / flow deadline, where flow size and deadline are known a priori

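
The two approaches can be sketched in a few lines (illustrative numbers; `flows` is a hypothetical list of (name, size in bytes, deadline in seconds) tuples, not from either paper):

```python
def reservation_rate(size_bytes, deadline_s):
    """D3-style reservation: the minimum rate that meets the deadline
    (rate = flow size / flow deadline)."""
    return size_bytes / deadline_s

# PDQ-style earliest-deadline-first: serve flows in deadline order.
flows = [("query", 20_000, 0.002), ("shuffle", 10_000_000, 0.100)]
edf_order = sorted(flows, key=lambda f: f[2])

# A 1 MB flow with a 10 ms deadline needs roughly 100 MB/s reserved.
rate = reservation_rate(1_000_000, 0.010)
```

The reservation approach needs only local arithmetic per flow, while EDF needs a global ordering of deadlines, which is where the coordination cost comes from.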
Issues with deadline-based scheduling
- Implications for jobs without deadlines: starvation? poor completion times?
- Implementation issues:
  - Deadlines must be assigned to flows, not packets
  - Reservation approach: requires a reservation for each flow; big-data flows can be small and have small RTTs, so the control loop must be extremely fast
  - Earliest deadline first: requires coordination between switches and servers (servers specify flow deadlines; switches prioritize flows and determine rates), and may require complex switch mechanisms

How do you make the network faster?
- Throw more hardware at the problem: Fat-Tree, VL2, BCube, Dragonfly
- This increases bandwidth (throughput) but does not necessarily reduce latency

So, how do you reduce latency?
- Trade bandwidth for latency: buffering adds variation (unpredictability), so eliminate network buffering and bursts
- Optimize the network stack
- Use link-level information to detect congestion, and inform the application so it can adapt by using a different path

HULL: trading bandwidth for latency
- Buffering introduces latency; buffers exist to accommodate bursts and to let congestion control achieve good throughput
- Removing buffers means lower throughput for large flows and a network that can't handle bursts, but predictably low latency

Why do bursts exist? A systems review:
- The NIC (network card) informs the OS of received packets via interrupts
- Interrupts consume CPU; one interrupt per packet would overwhelm the CPU
- Optimization: batch packets up before raising an interrupt
- The size of the batch is the size of the burst

Why does congestion control need buffers?
- Congestion control (i.e., TCP) detects the bottleneck link's capacity through packet loss; on a loss, it halves its sending rate
- Buffers keep the network busy, which matters precisely when TCP halves its sending rate
- Essentially, the network must absorb TCP doubling its rate back up for TCP to work well; buffers allow for this doubling

TCP review: the bandwidth-delay product
- Rule of thumb: a single flow needs B ≥ C×RTT of buffering for 100% throughput; with B < C×RTT, throughput falls below 100%
- How much buffering TCP needs for high throughput is known in the literature as the buffer-sizing problem [DCTCP paper]
- So we need a way to reduce the buffering requirement
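
As a worked example of the rule of thumb (the link speed and RTT here are assumed, typical datacenter numbers, not figures from the slides):

```python
# B >= C x RTT of buffering for a single TCP flow to keep the link full.
C_bps = 10e9        # assumed: 10 Gb/s link capacity
RTT_s = 100e-6      # assumed: 100 microsecond datacenter RTT

# Convert capacity to bytes/second, then apply the rule of thumb.
B_bytes = (C_bps / 8) * RTT_s   # roughly 125 KB of buffer per flow
```

Per-flow that sounds small, but switch buffers are shared across many flows and ports, and time spent sitting in that buffer is exactly the latency variation HULL wants to remove.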

Key ideas behind HULL
- Eliminate bursts: add a token bucket (pacer) into the network. The pacer must sit in the network so that pacing happens after the system optimizations that cause bursts.
- Eliminate buffering: send congestion notification messages before the link is fully utilized, making applications believe the link is full while there is still spare capacity.
- TCP's congestion control reacts too coarsely for this; replace it with DCTCP.

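
A minimal sketch of the pacer idea (a software token bucket; HULL's actual pacer is placed in the network path after the NIC's batching, and the class and parameter names here are hypothetical):

```python
class Pacer:
    """Token-bucket pacer: releases traffic at a steady rate so that
    bursts created by interrupt batching are smoothed before they
    reach a switch buffer."""

    def __init__(self, rate_bps, bucket_bytes):
        self.rate = rate_bps / 8.0    # refill rate, in bytes per second
        self.bucket = bucket_bytes    # the largest burst allowed through
        self.tokens = bucket_bytes
        self.last = 0.0

    def try_send(self, pkt_bytes, now):
        # Refill tokens for the elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.bucket, self.tokens + elapsed * self.rate)
        self.last = now
        if pkt_bytes <= self.tokens:
            self.tokens -= pkt_bytes
            return True               # packet may go out now
        return False                  # hold the packet: burst is smoothed

# Two 1500 B packets pass immediately; the third must wait for a refill.
p = Pacer(rate_bps=8e9, bucket_bytes=3000)   # paced at 1 GB/s
assert p.try_send(1500, now=0.0)
assert p.try_send(1500, now=0.0)
assert not p.try_send(1500, now=0.0)
assert p.try_send(1500, now=2e-6)   # ~2 microseconds later, tokens are back
```

The bucket size bounds the burst the network ever sees, which is why pacing lets switch buffers shrink without losing packets.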
Orchestra: managing data transfers in computer clusters
- Group all flows belonging to a stage into a transfer
- Perform inter-transfer coordination
- Optimize at the level of transfers rather than individual flows
[Orchestra slides]

Transfer patterns
- A transfer is the set of all flows transporting data between two stages of a job; it acts as a barrier
- Completion time: the time for the last receiver to finish
(Figure: a job pipeline reading from HDFS, broadcasting to map tasks, shuffling from maps to reduces, and an incast write back to HDFS)

Orchestra components
- Cooperative broadcast (Cornet): infer and utilize topology information
- Weighted Shuffle Scheduling (WSS): assign flow rates to optimize shuffle completion time
- Inter-Transfer Controller (ITC): implement weighted fair sharing between transfers, for end-to-end performance
(Figure: the ITC, with a fair sharing / FIFO / priority policy, coordinates per-transfer controllers (TCs): a shuffle TC choosing between Hadoop shuffle and WSS, and broadcast TCs choosing among HDFS, tree, and Cornet, shown for one shuffle and two broadcasts)
[Orchestra slides]

Cornet: cooperative broadcast
- Broadcast the same data to every receiver: fast, scalable, adaptive to bandwidth, and resilient
- A peer-to-peer mechanism (BitTorrent-style) optimized for cooperative environments
Observations about datacenters, and the Cornet design decisions they drive:
- High-bandwidth, low-latency network → large block size (4-16 MB)
- No selfish or malicious peers → no need for incentives (e.g., tit-for-tat) and no (un)choking
- Everyone stays till the end
- Topology matters → topology-aware broadcast
[Orchestra slides]

Topology-aware Cornet
- Many datacenter networks employ tree topologies
- Each rack should receive exactly one copy of the broadcast, minimizing cross-rack communication
- Topology information reduces cross-rack data transfer
- A mixture of spherical Gaussians is used to infer the network topology
[Orchestra slides]

Shuffle bottlenecks
- A shuffle can bottleneck at a sender, at a receiver, or in the network
- An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer
[Orchestra slides]

Status quo in shuffle
- Example: senders s1-s5 transfer to receivers r1 and r2
- Links to r1 and r2 are full for 3 time units; the link from s3 is full for 2 time units
- Completion time: 5 time units
[Orchestra slides]

Weighted Shuffle Scheduling (WSS)
- Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent (weights 1 and 2 in the example)
- Completion time: 4 time units, up to a 1.5× improvement
[Orchestra slides]
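
The slide's example can be reproduced with a small sketch (sender/receiver names and the demand matrix are reconstructed from the example as best the transcript allows; unit-capacity links; sender-side constraints happen to be satisfied here):

```python
def wss_completion(demands, cap=1.0):
    """Weighted Shuffle Scheduling: on each receiver link, give every
    flow a rate proportional to the data it has to deliver, so all
    flows into a receiver finish together."""
    total_to = {}
    for (snd, rcv), data in demands.items():
        total_to[rcv] = total_to.get(rcv, 0.0) + data
    # rate of flow (snd, rcv) = cap * data / total data headed to rcv
    rates = {f: cap * d / total_to[f[1]] for f, d in demands.items()}
    # The shuffle ends when the most-loaded receiver link drains.
    return max(d / rates[f] for f, d in demands.items())

# Reconstructed example: s1, s2 -> r1 and s4, s5 -> r2 send 1 unit each;
# s3 sends 2 units to each receiver.
demands = {("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
           ("s3", "r2"): 2, ("s4", "r2"): 1, ("s5", "r2"): 1}
completion = wss_completion(demands)   # 4 time units, vs 5 under fair sharing
```

Weighting by demand keeps s3's large flows from being rate-limited to an equal share, which is what stretched the fair-sharing schedule to 5 time units.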

Faster spam classification
- Communication reduced from 42% to 28% of the iteration time
- Overall, a 22% reduction in iteration time
[Orchestra slides]

Summary
- Tail latency in the network: the types of traffic in the network, their implications for jobs, and the causes of tail latency
- HULL: trade bandwidth for latency; penalize huge flows; eliminate bursts and buffering
- Orchestra: optimize transfers instead of individual flows; utilize knowledge of application semantics
http://www.mosharaf.com/