The Virtual Data Toolkit distributed by the Open Science Grid
Richard Jones, University of Connecticut
CAT project meeting, June 24, 2008
The UConn Grendl cluster
- 62 dual-processor nodes, a mix of older and newer CPUs
- 7 TB of shared storage
- Condor job management
- heavy reliance on NFS
- home-built processing workflow package called "openShop"
The UConn Grendl cluster (continued)
- efficient MPI job scheduling using the Condor "parallel universe"
- large datasets staged on large distributed Parallel Virtual File System (PVFS) volumes
  - high throughput
  - low cost: no dedicated file servers
  - reduced CPU-to-data location coupling
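The Condor "parallel universe" mentioned above co-schedules all slots of an MPI job so they start together. A minimal submit description might look like the sketch below; the wrapper-script path and machine count are illustrative assumptions, not details from the talk.

```
universe      = parallel
executable    = /usr/local/bin/mpi_wrapper.sh   # hypothetical wrapper that invokes mpirun
machine_count = 8                               # slots allocated together as one job
output        = mpi_job.out
error         = mpi_job.err
log           = mpi_job.log
queue
```

Submitting this with condor_submit holds the job until all requested machines can be claimed at once, which is what distinguishes the parallel universe from ordinary vanilla-universe scheduling.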
Obstacles to scaling
- NFS: servers x clients = N^2 problem
  - one server down hangs or drags all N clients
  - starts to be a problem with 62 nodes
  - cross-site NFS is an administrative nightmare!
- PVFS: one server down hangs the entire volume
  - poor recovery compared to NFS
  - invasive installation procedure
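The N^2 claim above can be made concrete with a short sketch (illustrative only, not from the talk): if each of N nodes exports storage that every node cross-mounts, the number of client-server mount dependencies grows quadratically, while a single hung server still stalls all N clients.

```python
# Sketch of the NFS cross-mount scaling problem on an N-node cluster.

def mount_dependencies(n_nodes: int) -> int:
    """Each of n nodes mounts exports from all n nodes: n * n dependencies."""
    return n_nodes * n_nodes

def clients_affected_by_one_failure(n_nodes: int) -> int:
    """One hung NFS server stalls every client that mounts it."""
    return n_nodes

print(mount_dependencies(62))               # 3844 mount dependencies at 62 nodes
print(clients_affected_by_one_failure(62))  # all 62 clients hang or drag
```

At 62 nodes that is already nearly 4000 mount relationships to keep healthy, which is why the problem starts to bite at this cluster size.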
CAT project scaling: data
- large base-input datasets: non-volatile, non-replicated, relatively compact
- PWA event lists: volatile, replicated
- complex workflow pattern
- a global management scheme is needed
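One way to picture the two data classes above is a global catalog that tracks replicas per dataset; the sketch below is a hypothetical illustration (class names, fields, and site names are all assumptions), enforcing that non-volatile base inputs stay single-copy while volatile PWA event lists may be replicated freely.

```python
# Hypothetical global-catalog sketch for the two CAT data classes.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    volatile: bool                 # volatile data (PWA event lists) may be replicated
    replicas: list = field(default_factory=list)   # sites holding a copy

    def add_replica(self, site: str) -> None:
        """Register a copy at a site, honoring the replication policy."""
        if self.volatile or not self.replicas:
            self.replicas.append(site)
        else:
            raise ValueError(f"{self.name}: non-replicated dataset already placed")

base = Dataset("base-input", volatile=False)
base.add_replica("uconn")          # single authoritative copy

events = Dataset("pwa-events", volatile=True)
events.add_replica("uconn")
events.add_replica("cmu")          # replication allowed for volatile data
print(events.replicas)             # ['uconn', 'cmu']
```

A real management scheme would also track versions and staleness of the volatile replicas, which is where the "complex workflow pattern" makes a global view necessary.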
CAT project scaling: processor
- co-scheduling of CPU resource allocation across clusters
- network latency
- allocation persistence not tied to client location
- access independent of local userid
- global resource monitoring required
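The co-scheduling requirement above can be sketched as an all-or-nothing check against a global resource monitor: a cross-cluster job starts only when every participating site can supply its share of CPUs at the same time. The function and site names below are made up for illustration.

```python
# Toy co-scheduling check driven by a (hypothetical) global resource monitor.

def can_coschedule(free_cpus: dict, request: dict) -> bool:
    """True only if every requested site can allocate its CPUs simultaneously."""
    return all(free_cpus.get(site, 0) >= need for site, need in request.items())

monitor = {"uconn": 16, "cmu": 4}                       # snapshot of free CPUs
print(can_coschedule(monitor, {"uconn": 8, "cmu": 8}))  # False: cmu is short
print(can_coschedule(monitor, {"uconn": 8, "cmu": 4}))  # True: both sites fit
```

This is why global resource monitoring is listed as a requirement: without a current view of free CPUs at every site, the all-or-nothing decision cannot be made.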