Performance-Robust Parallel I/O: Virtual Streams
Z. Morley Mao, Noah Treuhaft
CS258, 5/17/99, Professor Culler
Introduction
- Clusters exhibit performance heterogeneity, both static and dynamic, due to both hardware and software
- Consistent peak performance demands adaptive software: building performance-robust parallel software means keeping heterogeneity in mind
- This work explores:
  - what adaptivity is appropriate for I/O-bound parallel programs
  - how to provide that adaptivity
Heterogeneity demands adaptivity
[Figure: cluster nodes, each with a process reading a physical I/O stream from its local disk]
- Physical I/O streams are simple to build and use
- But their performance is highly variable: different drive models, bad blocks, multizone behavior, file layout, competing programs, host bottlenecks
- I/O-bound parallel programs run at the rate of the slowest disk
Virtual Streams
- Performance-robust programs want virtual streams that:
  - eliminate dependence on individual disk behavior
  - continually equalize throughput delivered to processes
[Figure: processes reading through a virtual streams layer that multiplexes over the disks]
Graduated Declustering (GD): a Virtual Streams implementation
- Data is replicated (mirrored) for availability; use the replicas to provide performance availability, too
- A fast network makes remote disk access comparable to local
- Distributed algorithm for adaptivity:
  - each client provides information about its progress
  - each server reacts by scheduling requests to even out progress
[Figure: processes linked through the GD client library to GD servers holding replicas A and B]
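The progress-based scheduling rule can be sketched as follows. This is an illustrative toy (the names and data layout are assumptions, not the original GD code): each server serves the pending request of the client that reports the least progress, which evens out delivered bandwidth.

```python
def pick_next_client(pending):
    """pending maps client id -> (reported_progress, list_of_queued_requests).

    Progress-based GD rule (sketch): among clients with queued
    requests, serve the one whose reported progress is lowest.
    """
    candidates = [(progress, cid)
                  for cid, (progress, queue) in pending.items() if queue]
    if not candidates:
        return None
    return min(candidates)[1]


# Client B lags behind A, so the server schedules B's request first;
# C has no pending requests and is ignored.
pending = {"A": (120, ["read blk 120"]),
           "B": (80, ["read blk 80"]),
           "C": (200, [])}
assert pick_next_client(pending) == "B"
```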
GD in action
- Local decisions yield global behavior
[Figure: Client0-Client3 and Server0-Server3, before and after perturbation; before, each client receives bandwidth B; after one server is perturbed, per-link rates shift among B/4, 3B/8, B/2, 5B/8, and 7B/8 to even out what each client receives]
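The figure's numbers can be checked with idealized arithmetic, a sketch under the assumption of n equally loaded servers, one slowed to a fraction of full disk bandwidth B, with the shortfall spread evenly across all clients:

```python
def per_client_bw(n_servers, perturbed_fraction, bw=1.0):
    """Ideal per-client bandwidth under GD when one of n_servers runs
    at perturbed_fraction of full bandwidth bw and the lost capacity
    is spread evenly across all clients."""
    aggregate = (n_servers - 1 + perturbed_fraction) * bw
    return aggregate / n_servers


# Four servers, one at half speed: every client gets 7B/8, instead of
# one unlucky client dropping to B/2 as with plain physical streams.
assert per_client_bw(4, 0.5) == 0.875
```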
Evaluation of original GD implementation (progress-based)
[Graph: delivered bandwidth under perturbation, annotated with seek overhead]
- Seek overhead due to reading from all replicas
Deficiency of original GD implementation: seek overhead
- Under sequential data access, seeks occur even when there is no perturbation
- Seeks become more significant as disk transfer rates increase
- Need a new algorithm that:
  - reads mostly from a single disk under no perturbation
  - dynamically adjusts to perturbation when necessary
  - achieves both performance adaptivity and minimal overhead
Proposed solution: response-rate-based GD
- The number of requests a client sends to a server is based on that server's response rate
- Servers use request queue lengths to make scheduling decisions
- Uses implicit information; "historyless": no bandwidth information is transmitted between server and client
- Advantage: each client has a primary server
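One way to sketch the historyless client policy (the names and tie-breaking logic are assumptions, not the actual implementation): the client counts outstanding requests per server, prefers its primary, and diverts to a replica only when the primary's queue grows longer, so queue length serves as the implicit response-rate signal.

```python
def choose_server(outstanding, primary):
    """outstanding maps server -> number of unanswered requests.

    Prefer the primary server; divert to the least-loaded replica
    only when the primary has strictly more requests outstanding
    (i.e. is responding more slowly).
    """
    least = min(outstanding, key=lambda s: (outstanding[s], s != primary))
    if outstanding[primary] <= outstanding[least]:
        return primary
    return least


# Unperturbed: queues are even, so the client keeps reading from its
# primary disk and avoids seek overhead on the replicas.
assert choose_server({"primary": 1, "replica": 1}, "primary") == "primary"
# Perturbed: the primary's queue backs up, so requests shift over.
assert choose_server({"primary": 4, "replica": 1}, "primary") == "replica"
```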
Evaluation of response-rate-based GD
[Graph: delivered bandwidth vs. number of disk nodes perturbed, showing reduced seek overhead]
Historyless vs. history-based adaptivity
- History-based (progress-based):
  - adjustment to perturbation occurs gradually over time
  - close to perfect knowledge, if the information is not outdated
  - extra overhead in sending control information
- Historyless (response-rate-based):
  - primary-server designation
  - possible to increase sensitivity to real perturbation by creating "artificial" perturbation
  - considers the varying performance of data consumers
  - takes longer to converge
Stability and convergence
- How long does it take for the system to converge?
  - linear in the number of nodes
  - depends on the last occurrence of perturbation
  - influenced by the style of communication (implicit vs. explicit)
Server request handoff
- If a server finishes all its requests, it contacts other servers holding the same replicas and helps serve their clients (work stealing)
- Server request handoff keeps all disks busy when possible
- Design decision: how many requests to hand off?
  - depends on the bandwidth history of both servers and on the size of the request queue
  - benefit vs. cost tradeoff
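A hedged sketch of one possible handoff rule (the fraction and threshold are illustrative assumptions, not values from this work): an idle server steals about half of a busy peer's queue, but only when the queue is long enough for the benefit to outweigh the handoff cost.

```python
def requests_to_steal(peer_queue_len, fraction=0.5, min_batch=2):
    """Return how many requests an idle server should take from a
    peer's queue: roughly `fraction` of it, or none if the batch
    would be too small to justify the transfer cost."""
    n = int(peer_queue_len * fraction)
    return n if n >= min_batch else 0


assert requests_to_steal(10) == 5   # plenty of work: take half
assert requests_to_steal(3) == 0    # too little to be worth the handoff
```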
Writes
- Identical to reads, except:
  - create incomplete replicas with "holes"
  - track "holes" in metadata
  - afterward, do "hole-filling", both for availability and for performance robustness
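The hole-tracking idea can be sketched as a toy model (the real metadata would be persistent and block-granular details differ): each replica records which blocks it holds, and the hole-filling pass later copies the missing blocks from a complete replica.

```python
class Replica:
    """Toy model of a GD write replica that may be left with holes."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.present = set()            # block numbers written so far

    def write(self, block):
        self.present.add(block)

    def holes(self):
        """Blocks still missing; this is the metadata consulted by
        the later hole-filling pass."""
        return sorted(set(range(self.n_blocks)) - self.present)

    def fill_from(self, other):
        """Hole-filling: copy missing blocks from another replica."""
        for b in self.holes():
            if b in other.present:
                self.write(b)


# A fast writer skips blocks destined for a slow replica...
r = Replica(5)
for b in (0, 2, 3):
    r.write(b)
assert r.holes() == [1, 4]

# ...and a later pass fills them in from a complete copy.
full = Replica(5)
for b in range(5):
    full.write(b)
r.fill_from(full)
assert r.holes() == []
```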
Conclusions
What did we achieve?
- A new load-balancing algorithm: response-rate-based GD
  - delivers equal bandwidth to parallel-program processes in the face of performance heterogeneity
  - demonstrates the stability of the system
  - reduces seek overhead
- Server request handoff
- Writes
- Creates a useful abstraction for streaming I/O in clusters
Future work
- Hot-file replication: get peak bandwidth after perturbation ceases
- Achieve orderly replies
- Multiple-disks abstraction