Explicit Control in a Batch-Aware Distributed File System

John Bent
Computer Sciences Department, University of Wisconsin-Madison
johnbent@cs.wisc.edu
http://www.cs.wisc.edu/condor

Focus of work
› Harnessing and managing remote storage
› Batch-pipelined, I/O-intensive workloads
› Scientific workloads
› Wide-area grid computing

Batch-pipelined workloads
› General properties
  – Large number of processes
  – Process and data dependencies
  – I/O intensive
› Different types of I/O (see the sketch below)
  – Endpoint
  – Batch
  – Pipeline

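To make the taxonomy concrete, the three I/O types can be attached to the files a pipeline touches. The small data model below is purely illustrative: the class names, file names, and sizes are assumptions, not part of BAD-FS, which instead receives this information declaratively through the workflow language shown near the end of the talk.

    # Illustrative data model for a batch-pipelined workload.
    from dataclasses import dataclass
    from enum import Enum

    class IOType(Enum):
        ENDPOINT = "endpoint"   # initial input / final output kept at home
        BATCH    = "batch"      # input shared by every pipeline, read-only
        PIPELINE = "pipeline"   # intermediate data private to one pipeline

    @dataclass
    class DataFile:
        name: str
        size_mb: int
        io_type: IOType

    # One pipeline of the workload: a shared batch dataset, private
    # intermediate data, and a small endpoint result.
    pipeline_files = [
        DataFile("calibration.db", 500, IOType.BATCH),
        DataFile("stage1.tmp",     200, IOType.PIPELINE),
        DataFile("result.out",       5, IOType.ENDPOINT),
    ]
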
Batch-pipelined workloads
(Diagram: endpoint, pipeline, and shared batch-dataset data within a batch-pipelined workload.)

Wide-area grid computing
(Diagram: home storage and remote resources connected across the Internet.)

Cluster-to-cluster (c2c)
› Not quite p2p
  – More organized
  – Less hostile
  – More homogeneity
  – Correlated failures
› Each cluster is autonomous
  – Run and managed by different entities
› An obvious bottleneck is the wide-area connection
(Diagram: home store connected to multiple clusters across the Internet.)
How to manage the flow of data into, within, and out of these clusters?

Current approaches
› Remote I/O (Condor standard universe)
  – Very easy
  – Consistency through serialization
› Prestaging (Condor vanilla universe)
  – Manually intensive
  – Good performance through knowledge
› Distributed file systems (AFS, NFS)
  – Easy to use, uniform name space
  – Impractical in this environment

Pros and cons

              Practical   Easy to use   Leverages workload info
Remote I/O        √            √                   X
Pre-staging       √            X                   √
Trad. DFS         X            √                   X

BAD-FS
› Solution: Batch-Aware Distributed File System
› Leverages workload info with storage control
  – Detailed information about the workload is known
  – Storage layer allows external control
  – External scheduler makes informed storage decisions
› Combining information and control results in
  – Improved performance
  – More robust failure handling
  – Simplified implementation
› BAD-FS: practical √, easy to use √, leverages workload info √

Practical and deployable
› User-level; requires no privilege
› Packaged as a modified Condor system
  – A Condor system which includes BAD-FS
› General; glide-in works everywhere
(Diagram: BAD-FS glided in across multiple remote clusters, including an SGE cluster, connected to the home store over the Internet.)

BAD-FS == Condor++
1) NeST storage management
2) Batch-Aware Distributed File System
3) Expanded Condor submit language (Condor DAGMan++)
4) BAD-FS scheduler
(Diagram: a job queue and BAD-FS scheduler at home storage driving compute nodes, each running a Condor startd with NeST/BAD-FS storage.)

BAD-FS knowledge
› Remote cluster knowledge
  – Storage availability
  – Failure rates
› Workload knowledge
  – Data type (batch, pipeline, or endpoint)
  – Data quantity
  – Job dependencies

Control through lots
› Abstraction that allows external storage control
› Guaranteed storage allocations (see the sketch below)
  – Containers for job I/O
  – e.g. "I need 2 GB of space for at least 24 hours"
› Scheduler
  – Creates lots to cache input data; subsequent jobs can reuse this data
  – Creates lots to buffer output data; destroys pipeline output, copies endpoint output home
  – Configures the workload to access its lots

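The lot abstraction can be pictured as a reservation against a storage server's free space, granted for a stated size and duration and released when the scheduler is done with the data. The following is a minimal sketch under that reading; StorageServer, Lot, and the method names are illustrative assumptions, not the actual BAD-FS/NeST interface.

    # Minimal sketch of guaranteed storage allocations ("lots").
    from dataclasses import dataclass, field

    @dataclass
    class Lot:
        size_mb: int                 # guaranteed capacity
        duration_hours: int          # guaranteed lifetime
        purpose: str                 # "batch-cache", "pipeline-buffer", ...
        files: dict = field(default_factory=dict)

    class StorageServer:
        def __init__(self, capacity_mb):
            self.free_mb = capacity_mb
            self.lots = []

        def allocate_lot(self, size_mb, duration_hours, purpose):
            """Grant a lot only if the guarantee can actually be honored."""
            if size_mb > self.free_mb:
                raise RuntimeError("allocation refused: not enough space")
            lot = Lot(size_mb, duration_hours, purpose)
            self.free_mb -= size_mb
            self.lots.append(lot)
            return lot

        def destroy_lot(self, lot):
            """Reclaim space once the scheduler no longer needs the data."""
            self.free_mb += lot.size_mb
            self.lots.remove(lot)

    # Usage: "I need 2 GB of space for at least 24 hours"
    server = StorageServer(capacity_mb=10_000)
    batch_cache = server.allocate_lot(2_000, 24, "batch-cache")
    pipe_buffer = server.allocate_lot(500, 24, "pipeline-buffer")
    server.destroy_lot(pipe_buffer)   # pipeline data is discarded after use
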
Knowledge plus control
› Enhanced performance
  – I/O scoping
  – Capacity-aware scheduling
› Improved failure handling
  – Cost-benefit replication
› Simplified implementation
  – No cache consistency protocol

I/O scoping
› Technique to minimize wide-area traffic
› Allocate lots to cache batch data
› Allocate lots for pipeline and endpoint data
› Extract endpoint data
› Cleanup
Example (AMANDA): 200 MB pipeline, 500 MB batch, 5 MB endpoint.
Steady-state: only 5 of 705 MB traverse the wide-area link (see the worked example below).

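The steady-state figure above is simple arithmetic: once the 500 MB batch dataset is cached in a lot and the 200 MB of pipeline data stays at the remote cluster, only the 5 MB endpoint output has to cross the wide area. A minimal sketch of that calculation (the constant names are illustrative):

    # Illustrative arithmetic for I/O scoping with the AMANDA figures above.
    # Pipeline and batch data are kept in lots at the remote cluster; only
    # endpoint data crosses the wide-area link in steady state (once the
    # batch dataset is already cached).
    PIPELINE_MB = 200   # private intermediate data, stays in a pipeline lot
    BATCH_MB    = 500   # shared input dataset, cached once in a batch lot
    ENDPOINT_MB = 5     # final output, extracted to home storage

    total_io  = PIPELINE_MB + BATCH_MB + ENDPOINT_MB   # 705 MB of I/O per pipeline
    wide_area = ENDPOINT_MB                            # 5 MB crosses the wide area
    print(f"{wide_area} of {total_io} MB traverse the wide-area link "
          f"({100 * wide_area / total_io:.1f}%)")      # -> 5 of 705 MB (0.7%)
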
Capacity-aware scheduling
› Technique to avoid over-allocations
› Scheduler has knowledge of
  – Storage availability
  – Storage usage within the workload
› Scheduler runs as many jobs as fit (see the sketch below)
› Avoids wasted utilization
› Improves job throughput

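One way to picture capacity-aware scheduling is as a greedy admission test against known storage needs. This is only a sketch under that assumption; the function and parameter names are hypothetical and do not describe the BAD-FS scheduler's actual interface.

    # A minimal sketch of capacity-aware admission, assuming the scheduler
    # knows each ready job's storage footprint and the cluster's free space.
    def schedule(ready_jobs, free_space_mb):
        """Admit jobs only while their lots fit in the remaining space.

        ready_jobs: list of (job_name, required_space_mb), dependency-ready.
        free_space_mb: storage currently available at the target cluster.
        Returns the jobs admitted now; the rest wait for space to free up.
        """
        admitted = []
        for job, needed_mb in ready_jobs:
            if needed_mb <= free_space_mb:
                free_space_mb -= needed_mb   # reserve a lot for this job
                admitted.append(job)
            # else: defer the job instead of over-allocating storage
        return admitted

    # Example: four pipelines needing 300 MB each, but only 1 GB free.
    jobs = [("pipe1", 300), ("pipe2", 300), ("pipe3", 300), ("pipe4", 300)]
    print(schedule(jobs, free_space_mb=1000))   # ['pipe1', 'pipe2', 'pipe3']
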
Improved failure handling
› Scheduler understands data semantics
  – Data is not just a collection of bytes
  – Losing data is not catastrophic
  – Output can be regenerated by rerunning jobs
› Cost-benefit replication
  – Replicates only data whose replication cost is cheaper than the cost to rerun the job (see the sketch below)
› Can improve throughput in a lossy environment

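The replication decision can be written as a simple cost comparison: copy the data only when copying is cheaper than the expected cost of regenerating it after a failure. The sketch below assumes that form of the test; the function and its parameters are illustrative, not the exact BAD-FS cost model.

    # Sketch of a cost-benefit test for replication. Parameters:
    #   copy_time_s    time to copy the output to stable (home) storage
    #   rerun_time_s   time to regenerate the output by rerunning the job(s)
    #   failure_prob   estimated probability the node fails before the data
    #                  is consumed (derived from observed failure rates)
    def should_replicate(copy_time_s, rerun_time_s, failure_prob):
        expected_rerun_cost = failure_prob * rerun_time_s
        return copy_time_s < expected_rerun_cost

    # Cheap-to-recompute pipeline output on a reliable node: do not copy.
    print(should_replicate(copy_time_s=30, rerun_time_s=60, failure_prob=0.05))    # False
    # Expensive-to-recompute output on a flaky node: copy it home.
    print(should_replicate(copy_time_s=30, rerun_time_s=3600, failure_prob=0.20))  # True
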
Simplified implementation
› Data dependencies known
› Scheduler ensures proper ordering
› Build a distributed file system
  – With cooperative caching
  – But without a cache consistency protocol

Real workloads
› AMANDA: astrophysics study of cosmic events such as gamma-ray bursts
› BLAST: biology search for proteins within a genome
› CMS: physics simulation of large particle colliders
› HF: chemistry study of non-relativistic interactions between atomic nuclei and electrons
› IBIS: ecology global-scale simulation of earth's climate, used to study effects of human activity (e.g. global warming)

Real workload experience
› Setup
  – 16 jobs
  – 16 compute nodes
  – Emulated wide-area
› Configuration
  – Remote I/O
  – AFS-like with /tmp
  – BAD-FS
› Result: an order-of-magnitude improvement

BAD Conclusions
› Schedulers can obtain workload knowledge
› Schedulers need storage control
  – Caching
  – Consistency
  – Replication
› Combining this control with knowledge
  – Enhanced performance
  – Improved failure handling
  – Simplified implementation

For more information
› http://www.cs.wisc.edu/condor/publications.html
› Questions?

"Pipeline and Batch Sharing in Grid Workloads," Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
"Explicit Control in a Batch-Aware Distributed File System," John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI '04, 2004.

Why not BAD-scheduler and traditional DFS?
› Practical reasons
  – Deployment
  – Interoperability
› Technical reasons
  – Cooperative caching
  – Data sharing
    – Traditional DFS: assumes sharing is the exception; provisions for arbitrary, unplanned sharing
    – Batch workloads: sharing is the rule; sharing behavior is completely known
  – Data committal
    – Traditional DFS must guess when to commit (AFS uses close, NFS uses 30 seconds)
    – Batch workloads precisely define when

Is capacity awareness important in the real world?
1. Heterogeneity of remote resources
2. Shared disk
3. Workloads are changing; some are very, very large and still growing

User burden
› Additional info needed in a declarative language
› User probably already knows this info, or can readily obtain it
› Typically, this info already exists
  – Scattered across a collection of scripts, Makefiles, etc.
  – BAD-FS improves the current situation by collecting this info into one central location

In the wild

Capacity-aware scheduling evaluation
› Workload
  – 64 synthetic pipelines
  – Varied pipe size
› Environment
  – 16 compute nodes
› Configuration
  – Breadth-first
  – Depth-first
  – BAD-FS
Failures directly correlate with workload throughput.

I/O scoping evaluation
› Workload
  – 64 synthetic pipelines
  – 100 MB of I/O each
  – Varied data mix
› Environment
  – 32 compute nodes
  – Emulated wide-area
› Configuration
  – Remote I/O
  – Cache volumes
  – Scratch volumes
  – BAD-FS
Wide-area traffic directly correlates with workload throughput.

Cost-benefit replication evaluation
› Workload
  – Synthetic pipelines of depth 3
  – Runtime 60 seconds
› Environment
  – Artificially injected failures
› Configuration
  – Always-copy
  – Never-copy
  – BAD-FS
Trade-off: accept overhead in an environment without failures to gain throughput in an environment with failures.

Real workloads
› Workload
  – Real workloads
  – 64 pipelines
› Environment
  – 16 compute nodes
  – Emulated wide-area
› Cold and warm
  – First 16 are cold
  – Subsequent 48 are warm
› Configuration
  – Remote I/O
  – AFS-like
  – BAD-FS

Example workflow language: Condor DAGMan
› Keyword job names a file with execute instructions
› Keywords parent, child express relations
› ... no declaration of data

    job A "instructions.A"
    job B "instructions.B"
    job C "instructions.C"
    job D "instructions.D"
    parent A child B
    parent C child D

(Diagram: DAG with A → B and C → D.)

Adding data primitives to a workflow language
› New keywords for container operations
  – volume: create a container
  – scratch: specify container type
  – mount: how the app addresses the container
  – extract: the desired endpoint output
› User must provide complete, exact I/O information to the scheduler
  – Specify which processes use which data
  – Specify the size of data read and written

Extended workflow language

    job A "instructions.A"
    job B "instructions.B"
    job C "instructions.C"
    job D "instructions.D"
    parent A child B
    parent C child D
    volume B1 ftp://home/data 1GB
    volume P1 scratch 500 MB
    volume P2 scratch 500 MB
    A mount B1 /data
    C mount B1 /data
    A mount P1 /tmp
    B mount P1 /tmp
    C mount P2 /tmp
    D mount P2 /tmp
    extract P1/out ftp://home/out.1
    extract P2/out ftp://home/out.2

(Diagram: pipelines A → B and C → D sharing batch volume B1 (ftp://home/data), with scratch volumes P1 and P2 whose out files are extracted to home as out.1 and out.2.)
