Mayflower: Improving Distributed Filesystem Performance Through SDN/Filesystem Co-Design
Authors: Sajjad Rizvi, Xi Li, Bernard Wong, Fiodar Kazhamiaka, Benjamin Cassell
Presented by: Yihan Li
Outline
- Motivation
- Introduction
- Design Overview
- Replica and Path Selection
- Evaluation
- Conclusion
Motivation
- The network is the performance bottleneck
  - Distributed filesystems are the primary bandwidth consumers
  - Oversubscribed network architectures limit available bandwidth
  - High-performance SSDs shift the bottleneck from storage to the network
- Current distributed filesystems and network control planes are designed independently
  - They use only static network information, e.g., replica selection based on network distance
  - They are not reciprocally involved in each other's decisions
  - Static information captures neither dynamic resource contention nor network congestion
What is Mayflower (I)
- Mayflower is co-designed from the ground up with a Software-Defined Networking (SDN) control plane
- It consists of three main components
  - Dataserver
  - Nameserver
  - Flowserver
- Beyond its own read requests, it can perform path selection for other applications through a public interface
What is Mayflower (II)
- Dataserver: performs reads from and appends to file chunks
- Nameserver: manages the file-to-chunk mapping
- Flowserver: runs alongside the SDN controller
  - Models the path bandwidth of the elephant flows
  - Performs both replica and network path selection
Advantage
- Filesystem and network decisions are made collaboratively by the filesystem and the network control plane
- Mayflower evaluates all possible paths between the client and all of the replica hosts
- It can directly minimize the average request completion time
  - Expected completion time of the pending request
  - Expected increase in completion time of other in-flight requests
- It can determine whether to read concurrently from multiple replica hosts
Design Overview (I)
Five assumptions
- The system stores only a modest number of files
- Most reads are large and sequential, and clients often fetch entire files
- File writes are primarily large sequential appends (random writes are very rare)
- The workloads are heavily read-dominant
- The network is the bottleneck
Design Overview (II)
- Mayflower selects both the replica and the network path for read operations
- It estimates the current network state and makes selections accordingly
- It can work together with existing network managers
- It periodically fetches flow statistics from the edge switches to correct estimation error
- It re-computes the path bandwidth estimates so that completion-time estimates stay accurate
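The periodic re-estimation step can be sketched as follows. This is an illustrative model, not Mayflower's actual code: it assumes the Flowserver sees cumulative per-flow byte counters (as edge switches typically expose) and derives per-flow bandwidth from two successive samples.

```python
# Sketch: refresh per-flow bandwidth estimates from two successive polls of
# edge-switch byte counters. Flow names and counter values are made up.

def estimate_rates(prev, curr, interval_s):
    """prev/curr map flow_id -> cumulative byte counter; returns bytes/sec."""
    rates = {}
    for flow_id, bytes_now in curr.items():
        bytes_before = prev.get(flow_id, 0)  # new flows start from zero
        rates[flow_id] = (bytes_now - bytes_before) / interval_s
    return rates

# Two samples taken 5 seconds apart; f3 appeared between the polls:
prev = {"f1": 1_000_000, "f2": 4_000_000}
curr = {"f1": 6_000_000, "f2": 4_500_000, "f3": 2_500_000}
rates = estimate_rates(prev, curr, 5.0)
# f1: 1,000,000 B/s; f2: 100,000 B/s; f3: 500,000 B/s
```

These refreshed rates feed the path-bandwidth model used by the selection algorithm described later.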
File Read Operation
Design Overview (III)
- Mayflower provides sequential consistency by default
- Mayflower provides linearizability with respect to read and append requests by sending read requests for a file's last chunk to the primary replica host
- The vast majority of chunks can be serviced by any replica host, since most chunks are essentially immutable
- The system delays deletes for T time (the maximum expiration period) to preserve consistency
Replica-Path Selection Algorithm
- Based on the estimated network state
  - Bandwidth estimates
  - Remaining flow size approximations
- Target performance metric: average job completion time
- Must account for the effect on existing flows
  - New flows affect the path selection for already scheduled flows
Problem Statement (I)
- Optimization goal: select the network path that minimizes the completion time of both the new flow and the existing flows
- The algorithm considers
  - The paths of existing flows
  - The capacity of each link
  - The data size of each request
  - The estimated bandwidth shares of existing flows
  - The remaining un-transferred data size of existing flows
Problem Statement (II)
Notation
- G: graph of paths from source to destination
- c_{i,j}: cost of the impact on existing flows on link (i, j)
- b_{i,j}: bottleneck bandwidth on link (i, j)
- d_{i,j}: data flow on link (i, j)
- I_{i,j}: binary indicator for whether link (i, j) is used
- S: super source
- t: sink node
- x: data size of the request
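The two-part cost described on these slides can be sketched numerically. This is a hedged simplification, not Mayflower's exact formulation: it collapses a path to its bottleneck link and assumes equal (max-min) bandwidth sharing, so a path's cost is the new flow's completion time plus the total slowdown it inflicts on flows already there.

```python
# Sketch: cost(path) = completion time of the new request on that path
#                    + increase in completion time of existing flows on it.
# Single-bottleneck-link model with equal sharing; an assumption for clarity.

def path_cost(new_size, link_capacity, existing_flows):
    """existing_flows: remaining sizes (bytes) of flows sharing the bottleneck."""
    n = len(existing_flows) + 1               # flow count after the new flow joins
    share_after = link_capacity / n           # equal share once it joins
    share_before = link_capacity / max(len(existing_flows), 1)
    new_flow_cost = new_size / share_after
    # Each existing flow slows from share_before to share_after:
    impact = sum(rem / share_after - rem / share_before for rem in existing_flows)
    return new_flow_cost + impact

# An idle path beats a loaded one even with the same capacity:
idle = path_cost(100, 10, [])        # 10.0: no one else is hurt
busy = path_cost(100, 10, [50])      # 25.0: 20 for the new flow + 5 of impact
```

The selection algorithm would evaluate this cost for every (replica, path) pair and pick the minimum.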
Replica-Path Selection Process (I)
Replica-Path Selection Process (II) The first portion: estimates the cost of the new flow The second portion: estimates the impact of the flow on existing flows Fp in path p Bandwidth share of the existing flows: max-min fair share calculations Unknown flow size: use an estimate size (average elephant flow size) Slack in updating bandwidth utilization: the bandwidth utilization for the new flow is set to its estimated bandwidth share Existing flows: updated with their new estimated values
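The "max-min fair share" bullet above refers to a standard bandwidth-allocation scheme; a common way to compute it is progressive filling, sketched below. Flow names, link names, and capacities are invented for illustration, and real paths would come from the network model rather than hand-written sets.

```python
# Sketch: max-min fair shares via progressive filling. Repeatedly find the
# most constrained link, fix every flow crossing it at that link's fill
# level, and subtract the allocated bandwidth from the remaining links.

def max_min_shares(flow_links, capacity):
    """flow_links: flow_id -> set of link ids; capacity: link id -> Gbps."""
    remaining = dict(capacity)
    unfixed = {f: set(links) for f, links in flow_links.items()}
    share = {}
    while unfixed:
        def fill_level(link):
            users = sum(1 for links in unfixed.values() if link in links)
            return remaining[link] / users if users else float("inf")
        active_links = {l for links in unfixed.values() for l in links}
        bottleneck = min(active_links, key=fill_level)
        level = fill_level(bottleneck)
        for f in [f for f, links in unfixed.items() if bottleneck in links]:
            share[f] = level
            for l in unfixed[f]:
                remaining[l] -= level
            del unfixed[f]
    return share

# b and c contend on the 6 Gbps link L2 (3 each); a then gets L1's leftover.
shares = max_min_shares(
    {"a": {"L1"}, "b": {"L1", "L2"}, "c": {"L2"}},
    {"L1": 10, "L2": 6},
)
# shares: a -> 7.0, b -> 3.0, c -> 3.0
```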
Replica-Path Selection Process (III)
Replica-Path Selection Process (IV)
- Reading from multiple replicas can reduce the completion time
- The total cost accounts for the size and bandwidth of each sub-flow
Evaluation (I)
Experimental setup
- 13 machines, each with 64 GB RAM, a 200 GB Intel S3700 SSD, and two Intel Xeon E5-2620 processors
- Machines connect to a Mellanox SX6012 switch via 10 Gbps links
- 64 virtual hosts in four pods (each pod with three physical machines)
Traffic Matrix
- Job arrivals follow a Poisson process
- File read popularity follows a Zipf distribution with skewness parameter equal to 1.1
- Client placement relative to the primary replica
  - R: the client is in the same rack as the primary replica
  - P: in another rack, but in the same pod
  - O = 1 − R − P: in a different pod
(Figure: topology with Pods 1–4 and Racks 1–4)
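The workload described above can be reproduced with a small generator. This is a sketch of the stated distributions only; all parameter names and values besides the 1.1 skew are assumptions.

```python
# Sketch: Poisson job arrivals (exponential inter-arrival times) and Zipf
# file popularity with skew 1.1, matching the distributions on this slide.

import random

def generate_jobs(n_jobs, arrival_rate, n_files, skew=1.1, seed=7):
    rng = random.Random(seed)
    # Zipf popularity: P(file of rank k) proportional to 1 / k^skew
    weights = [1.0 / (k ** skew) for k in range(1, n_files + 1)]
    t, jobs = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)     # Poisson process
        file_id = rng.choices(range(n_files), weights=weights)[0]
        jobs.append((t, file_id))
    return jobs

jobs = generate_jobs(1000, arrival_rate=2.0, n_files=50)
```

With skew 1.1, the most popular file attracts a large fraction of the reads, which is what makes replica and path choice matter in the evaluation.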
Evaluation (II)
- Paths to all replicas are partially congested
Evaluation (III)
Evaluation (IV)
- Mayflower is effective at avoiding congestion points
Evaluation (V)
Evaluation (VI)
- Interdependence between the network and the applications: read requests and a background flow
Evaluation (VII)
Evaluation (VIII)
Conclusions
- Mayflower is a distributed filesystem that follows a network/filesystem co-design approach
- It improves read performance through a novel replica and network path selection algorithm
- Evaluation demonstrates the resulting read-performance gains