Reliable I/O on the Grid Douglas Thain and Miron Livny Condor Project University of Wisconsin
Outline A Practical Problem Half-Interactive Jobs Solution: The Grid Console Philosophical Musings A New System: Kangaroo
Problem: “Half-Interactive” Jobs Users want to submit batch jobs to the Grid, but still be able to monitor the output interactively. But, network failures are expected as a matter of course, so keeping the job running takes priority over getting output. Examples: INFN: Collider event simulation and reconstruction with CMS NCSA: Modelling with Gaussian
Existing Tools are not Sufficient Installing a uniform world-wide DFS is not feasible. Even if it were: NFS: disconnect causes delay. AFS: close() can fail?!? Condor: Vanilla: dependent on file system; Standard: disconnect causes rollback. GASS: Staging mode: no incremental output; Append mode: no easy failure recovery.
Solution: The Grid Console Trap reads and writes on stdio and send them via RPCs to be executed at the home site. If the connection is lost, just keep writing to disk but retry the connection periodically. If the connection is re-made, send all spooled data back and then continue operation.
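The spool-and-retry behavior described above can be sketched in a few lines. This is a minimal illustration, not the Grid Console's actual code: the class `SpoolingWriter`, the `send` callable standing in for the RPC to the home site, and the helper `demo_spool` are all hypothetical names.

```python
import os
import tempfile


class SpoolingWriter:
    """Forward output to a remote sink; spool to local disk while disconnected."""

    def __init__(self, send, spool_path):
        self.send = send              # hypothetical RPC stub: raises ConnectionError on failure
        self.spool_path = spool_path  # local spool file used while disconnected
        self.connected = True

    def write(self, data):
        """Accept data without surfacing a network failure to the job."""
        if self.connected:
            try:
                self.send(data)
                return len(data)
            except ConnectionError:
                self.connected = False  # fall through and spool instead
        with open(self.spool_path, "ab") as spool:
            spool.write(data)
        return len(data)

    def retry(self):
        """Call periodically: if the link is back, replay the spool in order."""
        if self.connected:
            return True
        try:
            if os.path.exists(self.spool_path):
                with open(self.spool_path, "rb") as spool:
                    backlog = spool.read()
                if backlog:
                    self.send(backlog)
                os.remove(self.spool_path)
        except ConnectionError:
            return False  # still down; the spool is left untouched
        self.connected = True
        return True


def demo_spool():
    """Simulate a link that drops and comes back (hypothetical test harness)."""
    received, link = [], {"up": True}

    def fake_send(data):
        if not link["up"]:
            raise ConnectionError("home site unreachable")
        received.append(data)

    spool = os.path.join(tempfile.mkdtemp(), "stdout.spool")
    writer = SpoolingWriter(fake_send, spool)
    writer.write(b"a")
    link["up"] = False
    writer.write(b"b")
    writer.write(b"c")
    writer.retry()          # link still down, nothing is lost
    link["up"] = True
    writer.retry()          # spooled data is sent back first
    writer.write(b"d")      # then normal operation continues
    return b"".join(received)
```

In this sketch the replay inside `retry()` is synchronous, so a caller that invokes it from the job's own thread is blocked until the backlog is sent.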
Solution: The Grid Console [Architecture diagram. Sites: Execution Site, Storage Site. Components: APP, BYPASS, GC AGENT, SPOOL DIR, GC SHADOW, FILE SYSTEM. Labels: RPC on TCP; Globus Auth; Stdin, stdout, stderr; Other files; Existing storage system: NFS, AFS, GASS, etc.]
Observations on the Grid Console Interfaces well with existing systems: Applied to vanilla Condor(G) jobs. Works on any dynamically-linked program. Undesired properties: Only applies to standard streams. Job is blocked during recovery mode. Strange property: Disconnected mode might be faster than connected mode! Can we have it both ways?
Philosophical Musings What have we done? Hidden errors: Job is not designed to deal with unusual error conditions: Write -> disconnected? Close -> host not found? Hidden latency: Job is not designed to deal with slow I/O. It assumes that I/O ops are low latency, or at least appear to be. GC could be better at this.
Philosophical Musings, #2 These problems are one and the same: Hiding errors: Retry, report the error to a third party, and use another resource to satisfy the request. Hiding latency: Use another resource to satisfy the request in the background, but if an error occurs, there is no channel to report it. Reliability is not a binary property. A slow link can be just as damaging to throughput as a disconnection.
Philosophical Musings, #3 A traditional OS deals with these same problems when it uses memory to buffer disk operations. Let’s apply the same principle to the Grid: Use memory and disk to satisfy unscheduled I/O operations in the background.
Introducing Kangaroo - A user-level data movement system that ‘hops’ files piecemeal from node to node on the Grid. - A background process that will ‘fight’ for your jobs’ I/O needs. - A ‘damage control’ specialist that will give errors to a third party but never admit failure to the job.
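One way to picture the "never admit failure to the job" idea is a write-behind buffer drained by a background thread. The sketch below is an assumption-laden illustration, not Kangaroo's implementation: `WriteBehind`, `deliver`, `report`, and `demo_write_behind` are hypothetical names, and the buffer lives only in memory, whereas the slides call for memory and disk.

```python
import queue
import threading
import time


class WriteBehind:
    """Accept writes immediately; move them onward in the background."""

    def __init__(self, deliver, report, retry_delay=1.0):
        self.deliver = deliver          # hypothetical: pushes one block toward the storage site
        self.report = report            # hypothetical: tells a third party about an error
        self.retry_delay = retry_delay  # seconds to wait between delivery attempts
        self.pending = queue.Queue()
        self.mover = threading.Thread(target=self._run, daemon=True)
        self.mover.start()

    def write(self, block):
        """Return at once; the job never sees a delivery error."""
        self.pending.put(block)
        return len(block)

    def _run(self):
        while True:
            block = self.pending.get()
            while True:
                try:
                    self.deliver(block)
                    break
                except OSError as err:
                    # Errors go to the third party, not to the job.
                    self.report(f"delivery failed, will retry: {err}")
                    time.sleep(self.retry_delay)
            self.pending.task_done()

    def flush(self):
        """Block until every accepted write has been delivered."""
        self.pending.join()


def demo_write_behind():
    """Deliver two blocks over a link that fails twice (hypothetical harness)."""
    delivered, errors, calls = [], [], {"n": 0}

    def flaky_deliver(block):
        calls["n"] += 1
        if calls["n"] <= 2:
            raise ConnectionError("link down")
        delivered.append(block)

    wb = WriteBehind(flaky_deliver, errors.append, retry_delay=0.0)
    wb.write(b"x")
    wb.write(b"y")
    wb.flush()
    return delivered, len(errors)
```

A single mover thread keeps blocks in the order they were written; a design with several movers would need explicit ordering.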
Our Vision: A Grid File System [Diagram: an App and its Disk, a Data Movement System made of Kangaroo (K) nodes, and several File Systems.]
Kangaroo Prototype We have built a first-try Kangaroo that validates the central ideas of error and latency hiding. Emphasis on high-level reliability and throughput, not on low-level optimizations. First, work to improve writes, but leave room in the design to improve reads.
User Interface Like the GC, attach standard applications with Bypass, a tool for trapping UNIX I/O operations and routing them through new code. Works on any dynamically-linked, unmodified program. Examples:
    setenv LD_PRELOAD pfs_agent.so
    vi kangaroo://coral.cs.wisc.edu/etc/hosts
    gcc gsiftp://ftp/input.c -o kangaroo://host/out
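The URL-style names in the examples can be split into a protocol, a host, and a remote path. The sketch below only illustrates that naming convention with Python's standard `urllib.parse`; it says nothing about how Bypass or `pfs_agent.so` actually parse names, and the function `classify` and the `"local"` tag are hypothetical.

```python
from urllib.parse import urlsplit


def classify(name):
    """Split a file name into (protocol, host, path); plain paths are 'local'."""
    parts = urlsplit(name)
    if parts.scheme == "":
        # No scheme: treat as an ordinary local path (assumption of this sketch).
        return ("local", None, parts.path)
    return (parts.scheme, parts.netloc, parts.path)


if __name__ == "__main__":
    for name in ("kangaroo://coral.cs.wisc.edu/etc/hosts",
                 "gsiftp://ftp/input.c",
                 "/etc/hosts"):
        print(name, "->", classify(name))
```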
Kangaroo Prototype [Architecture diagram. Sites: Execution Site, Storage Site. Components, in the order shown: APP, BYPASS, KANGAROO AGENT, K SERVER, SPOOL DIR, K MOVER, K SERVER, FILE SYSTEM. Arrows: Writes, Reads.]
Microbenchmark: File Transfer Create a large output file at the execution site, and send it to a storage site. Ideal conditions: No competition for cpu, network, or disk bandwidth. Three methods: Stream output directly to target. Stage output to disk, then copy to target. Kangaroo
Macrobenchmark: Image Processing Post-processing of satellite image data: Need to compute various enhancements and produce output for each. Read input image. For I = 1 to N: compute transformation of image; write output image. Example: image size about 5 MB; compute time about 6 sec; I/O-to-CPU ratio 0.91 MB/s.
I/O Models for Image Processing [Timeline diagram comparing three I/O models (Online I/O, Offline I/O, Current Kangaroo) as sequences of INPUT, CPU, and OUTPUT phases, with a PUSH phase.]
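A rough timing model shows why overlapping output with computation helps the image-processing loop. This is a simplified sketch, not a measurement from the benchmark: it assumes output of image I can proceed in the background while image I+1 is computed, ignores spool-disk and contention costs, and uses the hypothetical function names `online_time` and `overlapped_time`. The 6-second compute time comes from the example; the 5-second output time is an assumption (5 MB over a 1 MB/s link).

```python
def online_time(n, compute_s, output_s, input_s=0.0):
    """Each output is written synchronously before the next computation starts."""
    return input_s + n * (compute_s + output_s)


def overlapped_time(n, compute_s, output_s, input_s=0.0):
    """Two-stage pipeline: output of one image overlaps computation of the next."""
    if n == 0:
        return input_s
    # First compute, then n-1 steps paced by the slower stage, then the last output.
    return input_s + compute_s + (n - 1) * max(compute_s, output_s) + output_s


if __name__ == "__main__":
    n, compute_s, output_s = 10, 6.0, 5.0   # output_s is an assumed value
    print("online:    ", online_time(n, compute_s, output_s))      # 110.0
    print("overlapped:", overlapped_time(n, compute_s, output_s))  # 65.0
```

Under these assumptions the overlapped run is bounded by the slower of the two stages, so a faster network stops helping once output time drops below compute time.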
Summary of Results At the micro level, our prototype provides reliability with reasonable performance. At the macro level, I/O overlap gives reliability and speedups (for some applications). Kangaroo allows the application to survive on its real I/O needs: 0.91 MB/s. Without it, there is ‘false pressure’ to provide fast networks.
Research Problems Virtual Memory: A K-node has one input, one output, and a memory/disk buffer. How should we move data to maximize throughput? File System: Existing spool directory is clumsy and inefficient. Need a fs optimized for 1-write, 1-read, 1-delete. Fine-Grained Scheduling: Reads should have priority over writes. This is easy at one node, but multiple nodes?
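The single-node case of "reads before writes" can be sketched with a priority queue. This is an illustration of strict priority at one node only, under assumed names (`NodeScheduler`, `submit`, `next_request`); it does not address the open multi-node question, and strict priority can starve writes if reads never stop arriving.

```python
import heapq
import itertools

READ, WRITE = 0, 1   # lower value is served first


class NodeScheduler:
    """Serve queued reads before queued writes, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker preserves arrival order

    def submit(self, kind, request):
        heapq.heappush(self._heap, (kind, next(self._seq), request))

    def next_request(self):
        """Return (kind, request) for the next operation, or None if idle."""
        if not self._heap:
            return None
        kind, _, request = heapq.heappop(self._heap)
        return kind, request


if __name__ == "__main__":
    sched = NodeScheduler()
    sched.submit(WRITE, "w1")
    sched.submit(WRITE, "w2")
    sched.submit(READ, "r1")
    while (item := sched.next_request()) is not None:
        print(item)
```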
Conclusion The Grid is BYOFS. Error hiding and latency hiding are tightly-knit problems. The solution to both is to overlap I/O and computation. The benefits of high-level overlap can outweigh any low-level inefficiencies.
Conclusion Need more info? Demo time: Wednesday, 9-12 AM Room 3381 CS Questions now?