Slide 1: The Synchronous Data Center
Tian Yang, Robert Gifford, Andreas Haeberlen, Linh Thi Xuan Phan
Department of Computer and Information Science, University of Pennsylvania
HotOS XVII (May 14, 2019)
Slide 2: If trains were asynchronous…
- Station clocks would be at most loosely synchronized
- Congestion would appear at unpredictable times
- Station stops could take an arbitrary amount of time
- Trains would have arbitrary delays and would often be lost entirely
Slide 3: The asynchrony assumption
System designers typically assume that:
- Clocks are at most loosely synchronized
- Network latencies are unpredictable
- Packets are often dropped in the network
- We don’t know much about node speeds
This is often a good idea!
- Sometimes we really don’t know (example: a system that spans multiple administrative domains)
- It is a nicely conservative assumption: if the system works in this model, it almost certainly will work on the actual hardware
- It is the “default”, and it is rarely questioned
Slide 4: Asynchrony can be expensive
But: no time bounds can be given on anything, and this makes many things very difficult!
- Example: Congestion control
- Example: Fault detection
- Example: Consistency
- Example: Fighting tail latencies
Slide 5: It doesn’t have to be that way!
The train network is not asynchronous:
- Single administrative domain (like a data center!)
- Carefully scheduled; speeds and timings are (mostly) known
Not all distributed systems are asynchronous, either! Example: cyber-physical systems (CPS)
- Clocks are closely in sync
- Network traffic is scheduled; hard latency bounds are known
- No congestion losses! (And transmission losses are rare)
- Node speeds and execution times are known exactly
CPS are mostly synchronous (out of necessity)!
Slide 6: So what?
Synchrony helps in two ways:
- Hard latency bounds -> we know how long we need to wait!
- The absence of a message at a particular time means something
How does that help us?
- No (surprising) congestion anymore
- Fault detection would be much easier (sketched below)
- Consistency would be easier to get
- Long latency tails would disappear
- Many algorithms become simpler, or even trivial (“boring”)
- Workloads with timing requirements can be supported (example: inflate the airbag when sensors detect a collision)
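To make the fault-detection point concrete, here is a minimal sketch of a timeout-based detector under synchrony. It assumes closely synchronized clocks, a known one-way delay bound (T_MAX) and a fixed heartbeat period; the class and constant names are illustrative and not taken from the paper.

```python
# Minimal sketch: fault detection under a hard latency bound.
# Assumes closely synchronized clocks and a known one-way delay bound T_MAX.

import time

T_MAX = 0.001             # assumed upper bound on one-way message delay (1 ms)
HEARTBEAT_PERIOD = 0.010  # each node sends a heartbeat every 10 ms

class FaultDetector:
    def __init__(self, peers):
        # last_sent[p]: send timestamp carried in p's most recent heartbeat
        # (initialized optimistically at startup)
        self.last_sent = {p: time.time() for p in peers}

    def on_heartbeat(self, peer, send_timestamp):
        self.last_sent[peer] = send_timestamp

    def suspected(self, now=None):
        # Under synchrony, "no heartbeat within PERIOD + T_MAX" is not a guess:
        # the absence of the message at this time means the sender has failed.
        now = now if now is not None else time.time()
        return [p for p, sent in self.last_sent.items()
                if now - sent > HEARTBEAT_PERIOD + T_MAX]
```

With asynchrony, the same timeout would only yield a suspicion that might be wrong; with a hard bound, it becomes a definitive verdict.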
Slide 7: Could a data center be synchronous?
At first glance, absolutely not! Some objections, and counterpoints:
- “The network is shared, so packet delays are unpredictable!” But: Fastpass (SIGCOMM’14)
- “Who knows how long anything takes under Linux?” But: real-time operating systems
- “Clocks can’t be synchronized closely enough!” But: Spanner (OSDI’12)
Our claim: most of the asynchrony in today’s data centers is avoidable!
Slide 8: Outline
Goal: a synchronous data center. How could it be done?
- Network layer
- Synchronized clocks
- Building blocks
- Hardware
- Software
- Scheduling
Slide 9: The How: Network layer
Why is latency so unpredictable? Cross-traffic and queueing!
Inspiration: Fastpass (SIGCOMM’14)
- Machines must ask an ‘arbiter’ for permission before sending
- The arbiter schedules packets (at >2 Tbit/s on eight cores!)
- Result: (almost) no queueing in the network!
- Fastpass makes no attempt to control end-to-end timing, but we see no reason why this couldn’t be added (see the sketch below)
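Below is an illustrative sketch of the arbitration idea. It is not Fastpass’s actual algorithm (Fastpass computes pipelined maximal matchings); it only shows why granting per-timeslot, per-link reservations prevents queues from forming. All names are made up for the example.

```python
# Illustrative sketch of centralized packet arbitration, in the spirit of
# Fastpass: a sender may transmit only in a timeslot granted by the arbiter,
# and the arbiter never double-books a link within a timeslot.

from collections import defaultdict

class Arbiter:
    def __init__(self):
        # busy[t] holds the set of links already reserved in timeslot t
        self.busy = defaultdict(set)

    def request(self, src, dst, earliest_slot):
        """Grant the first timeslot >= earliest_slot in which src's uplink and
        dst's downlink are both free. With at most one packet per link per
        slot, no queue can build up inside the network."""
        t = earliest_slot
        while ("up", src) in self.busy[t] or ("down", dst) in self.busy[t]:
            t += 1
        self.busy[t].add(("up", src))
        self.busy[t].add(("down", dst))
        return t

arbiter = Arbiter()
print(arbiter.request("A", "B", earliest_slot=0))  # -> 0
print(arbiter.request("C", "B", earliest_slot=0))  # -> 1 (B's downlink is busy in slot 0)
```

Because the grant also fixes *when* the packet travels, the same mechanism could, in principle, be extended to control end-to-end timing.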
Slide 10: The How: Synchronized clocks
Why are clocks so hard to synchronize? It is hard in the wide area, or via NTP (with cross-traffic). But it can be done:
- DTP (SIGCOMM’16) achieves nanosecond precision … with some help from the hardware
- Google Spanner (OSDI’12) keeps different data centers to within ~4 ms … with some help from atomic clocks
- Having predictable network latencies should help, too!
[Figure 6(a) from the DTP paper; Figure 6 from the Spanner paper]
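As a small illustration of how bounded clock error is typically exposed to applications, here is a sketch of an uncertainty-interval clock, the idea popularized by Spanner’s TrueTime. The drift rate and sync-error numbers are illustrative assumptions, not values from either paper.

```python
# Minimal sketch of an uncertainty-interval clock: instead of a single
# timestamp, now() returns an interval guaranteed to contain true time.

import time

DRIFT_RATE = 200e-6  # assumed worst-case local oscillator drift (200 ppm)

class IntervalClock:
    def __init__(self, sync_error):
        self.sync_error = sync_error   # error bound at the last synchronization
        self.last_sync = time.time()   # when we last synchronized

    def now(self):
        local = time.time()
        # Uncertainty grows with drift since the last synchronization.
        eps = self.sync_error + DRIFT_RATE * (local - self.last_sync)
        return (local - eps, local + eps)  # true time lies in this interval

# With predictable (bounded) network latencies, sync_error can be kept very
# small, so the intervals stay tight between synchronization rounds.
clock = IntervalClock(sync_error=10e-6)  # e.g., 10 µs right after a sync
earliest, latest = clock.now()
```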
Slide 11: The How: Building blocks
[Diagram; labels include: Async, Tmax, Sync, Ordering, Fault detection]
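One classic building block that a hard delay bound Tmax enables is timestamp ordering: if every message arrives within Tmax, a receiver can deliver messages in timestamp order simply by holding each one until Tmax has elapsed since its send time. A minimal sketch, assuming synchronized clocks and using illustrative names:

```python
# Minimal sketch: total ordering from a hard delay bound T_MAX and synchronized
# clocks. A receiver holds each message until T_MAX after its send timestamp;
# by then no earlier-stamped message can still be in flight, so delivering in
# timestamp order is safe.

import heapq
import time

T_MAX = 0.001  # assumed upper bound on one-way message delay (1 ms)

class OrderedDelivery:
    def __init__(self):
        self.pending = []  # min-heap of (send_timestamp, sender, payload)

    def on_receive(self, send_timestamp, sender, payload):
        heapq.heappush(self.pending, (send_timestamp, sender, payload))

    def deliverable(self, now=None):
        """Pop all messages whose send time is at least T_MAX in the past,
        in timestamp order (sender id breaks ties deterministically)."""
        now = now if now is not None else time.time()
        out = []
        while self.pending and self.pending[0][0] <= now - T_MAX:
            out.append(heapq.heappop(self.pending))
        return out
```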
Slide 12: The How: Software
Why is software timing so unpredictable?
Reason #1: Hardware features (caches, etc.)
- Not as bad as it seems: +/- 2% is possible (TDR, OSDI’14)
- Emerging features, such as Intel’s CAT, should help
- Meltdown/Spectre will probably accelerate this trend
Reason #2: OS structure
- Linux & friends are not designed for timing stability
- Idea from CPS: use elements from RT-OSes (see the sketch below) … but it will require deep structural changes! No small “synchrony patch” for Linux!
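For context, here is a sketch of the real-time knobs Linux already exposes: pinning a task to a dedicated core and moving it into the SCHED_FIFO real-time scheduling class. These reduce jitter for a single task, but, as the slide argues, they are not a substitute for the deeper structural changes an RT-OS-style design would bring. The function name is illustrative; the script requires Linux and typically root privileges.

```python
# Illustrative sketch: per-task real-time knobs available on stock Linux.

import os

def make_time_critical(core=3, priority=80):
    # Pin the calling process to one dedicated core...
    os.sched_setaffinity(0, {core})
    # ...and run it under the fixed-priority real-time scheduler, so it is not
    # preempted by ordinary (CFS) tasks.
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))

if __name__ == "__main__":
    make_time_critical()
    # ... run the latency-sensitive loop here ...
```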
Slide 13: The How: Fault tolerance
What if things (inevitably) break? A failure could disrupt the careful synchronous “choreography”!
- Challenge #1: Telling when something breaks. Actually easier with synchrony!
- Challenge #2: Doing something about it. How to reconfigure while maintaining timing guarantees?
Idea from CPS: use mode-change protocols (sketched below)!
- The system can operate in different “modes”, based on observed faults
- It transitions from one mode to another via precomputed protocols
- Result: timing is maintained during the transition
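A minimal sketch of the mode-change idea follows. Every mode has a precomputed schedule and every anticipated fault maps to a precomputed transition, so no scheduling decisions are made online and timing is preserved across the switch. The mode names, schedules, and fault vocabulary are all made up for illustration.

```python
# Toy mode-change protocol: precomputed modes and transitions only.

PRECOMPUTED = {
    # mode -> schedule (which task runs in which timeslot)
    "normal":        ["sense", "compute", "send", "idle"],
    "node3_failed":  ["sense", "compute_backup", "send", "send_retry"],
    "link_degraded": ["sense", "compute", "send_slow", "idle"],
}

TRANSITIONS = {
    # (current mode, observed fault) -> next mode
    ("normal", "node3_down"):      "node3_failed",
    ("normal", "link_loss"):       "link_degraded",
    ("node3_failed", "link_loss"): "link_degraded",
}

class ModeManager:
    def __init__(self):
        self.mode = "normal"

    def on_fault(self, fault):
        # Switch only along precomputed, pre-analyzed transitions; unanticipated
        # faults would have to fall back to a designated safe mode (not shown).
        self.mode = TRANSITIONS.get((self.mode, fault), self.mode)

    def schedule_for_slot(self, slot):
        table = PRECOMPUTED[self.mode]
        return table[slot % len(table)]
```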
Slide 14: The How: Scheduling
Can you schedule an entire data center? Surprisingly, we are getting pretty good at it!
- Sparrow (SOSP’13) can schedule 100ms tasks on 10,000s of cores
Idea from CPS: compositional scheduling (sketched below)
- Schedule smaller entities (nodes? pods?) in detail
- Abstract and aggregate, then schedule the next-larger entity
- Repeat until the entire system is scheduled
- Dispatching can be done locally; so can most updates
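Here is a toy sketch of the compositional structure: each node admits its tasks locally and exports one aggregate "interface" (here, just a utilization number), and the next level up schedules using only those abstractions. Real compositional scheduling frameworks use richer interfaces (e.g., periodic resource models); this illustrates the hierarchy, not any specific algorithm, and all names are invented for the example.

```python
# Toy sketch of compositional (hierarchical) scheduling.

class Node:
    def __init__(self, name, capacity=1.0):
        self.name, self.capacity, self.tasks = name, capacity, []

    def admit(self, task_utilization):
        # Local admission test: keep total demand within this node's capacity.
        if sum(self.tasks) + task_utilization <= self.capacity:
            self.tasks.append(task_utilization)
            return True
        return False

    def interface(self):
        # Abstraction exported upward: a single aggregate demand.
        return sum(self.tasks)

class Pod:
    def __init__(self, nodes, capacity):
        self.nodes, self.capacity = nodes, capacity

    def schedulable(self):
        # The pod reasons only about node interfaces, never individual tasks.
        return sum(n.interface() for n in self.nodes) <= self.capacity

nodes = [Node("n1"), Node("n2")]
nodes[0].admit(0.4); nodes[1].admit(0.7)
pod = Pod(nodes, capacity=1.5)
print(pod.schedulable())  # True: 0.4 + 0.7 <= 1.5
```

Because each level only sees aggregates, dispatching and most updates stay local, which is what makes the approach plausible at data-center scale.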
Slide 15: Summary
Synchronous data centers seem possible!
- Reasons to be optimistic: Fastpass, DTP, RTOSes, …
There are interesting benefits to be had!
- Asynchrony creates or amplifies challenges like fault detection, congestion control, consistency, tail latencies, load balancing, performance debugging, algorithmic complexity, …
- These problems could become simpler, or go away entirely!
But much work remains to be done!
- Not much existing work on DC-scale synchronous systems
- Can we adapt some ideas from cyber-physical systems?
Questions?