CamGrid
Mark Calleja, Cambridge eScience Centre
What is it?
Ten like-minded groups and departments, each running their own Condor pool(s), which federate their resources (12 pools in total).
Coordinated by the Cambridge eScience Centre (CeSC), but with no overall control.
Has been running for ~2.5 years, with 70+ users.
Currently ~950 processors/cores available: all Linux (various distributions), mostly x86_64, running 24/7.
Mostly Dell PowerEdge 1950 machines (like the HPCF), with four cores and 8 GB RAM each.
Around 2M CPU hours delivered to date.
Some details
Pools run the latest stable version of Condor (currently 6.8.6).
All machines get an (extra) IP address in a CUDN-only routable range for Condor.
Each pool sets its own policies, but these must be visible to other users of CamGrid.
Currently we see vanilla, standard and parallel (MPI) universe jobs.
Users get accounts on a machine in their local pool; jobs are then distributed around the grid via Condor's flocking mechanism (see the sketch below).
MPI jobs on single SMP machines have proved very useful.
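As an illustration of the flocking setup, a condor_config fragment along these lines could let a submit host in one pool send jobs to, and accept jobs from, other CamGrid pools. The host names are hypothetical and the exact settings should be checked against the Condor 6.8 manual; this is a sketch, not CamGrid's actual configuration.

  ## Flocking sketch for a submit host in one pool (hypothetical host names)

  # Central managers of other pools we are willing to send jobs to,
  # tried in the order listed when the local pool cannot run a job.
  FLOCK_TO = condor.poolB.cam.ac.uk, condor.poolC.cam.ac.uk

  # Submit (schedd) hosts in other pools allowed to flock jobs to us.
  FLOCK_FROM = submit.poolB.cam.ac.uk, submit.poolC.cam.ac.uk

  # Flocked submitters also need write access to our daemons.
  HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)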
NTE of Ag3[Co(CN)6] with SMP/MPI sweep
Monitoring Tools
A number of web-based tools are provided to monitor the state of the grid and of jobs.
CamGrid is based on trust, so we must make sure that machines are fairly configured.
The university gave us £450k (~$950k) to buy new hardware; we need to ensure that it's online as promised.
CamGrid's file viewer
The standard universe uses RPCs to echo I/O operations back to the submit host. What about other universes? How can I check the health of my long-running simulation?
We've provided our own facility, which involves an agent installed on each execute node and accessed via a web interface.
It works with vanilla and parallel (MPI) jobs.
It requires local sysadmins to install and run it.
CamGrid's file viewer
Checkpointable vanilla universe
The standard universe is fine if you can link against Condor's libraries (Pete Keller: this is getting harder).
We are investigating the BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux.
BLCR uses kernel resources, and can thus restore resources that user-level libraries cannot; it is supported by some flavours of MPI (late LAM, Open MPI).
The idea was to use Parrot's user-space filesystem to wrap a vanilla job and save the job's state on a chirp server; however, Parrot currently breaks some BLCR functionality.
A sketch of a BLCR-wrapped vanilla job appears below.
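As an illustration of the BLCR idea, a vanilla-universe submit file might look something like the sketch below. The executable, wrapper and file names are hypothetical; the wrapper script would start the application under BLCR's cr_run and take periodic checkpoints with cr_checkpoint, so that an evicted job can be resumed with cr_restart from the returned sandbox.

  # Hypothetical submit file for a BLCR-wrapped vanilla job
  universe        = vanilla
  executable      = blcr_wrapper.sh           # wrapper: runs the app under cr_run
  arguments       = my_simulation input.dat   # hypothetical application and input
  transfer_input_files = my_simulation, input.dat

  # Return the sandbox (including any BLCR context file) on eviction,
  # so a restarted job can resume from its last checkpoint.
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT

  output = sim.out
  error  = sim.err
  log    = sim.log
  queue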
What doesn't work so well…
Each pool is run by local sysadmin(s), but these are of variable quality/commitment.
We've set up mailing lists for users and sysadmins, but they are hardly ever used (people don't want to advertise their ignorance?).
Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…
Don't get me started on the merger with UCS's central resource (~400 nodes).
But generally we're happy bunnies
"CamGrid was an invaluable tool, allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week." -- Dr. Ben Allanach
"CamGrid was essential in order for us to be able to run the different codes in real time." -- Prof. Fernando Quevedo
"I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication." -- Dr. Karen Lipkow
Current issues
Protecting resources on execute nodes: Condor seems lax at this, e.g. memory and disk space (an illustrative policy sketch follows this list).
Increasingly interested in VMs (i.e. Xen). Some pools run it, but not in a concerted way (what are the effects on SMP MPI jobs?).
Green issues: will we be forced to buy WoL cards in the near future?
Altruistic computing: a recent wave of interest in BOINC/backfill jobs for medical research, protein folding, etc., but who runs the jobs? What about the audit trail?
How do we interact with outsiders? Ideally keep it to Condor (some Globus; we have toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
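To illustrate the resource-protection point, a startd policy along the lines below could preempt jobs that outgrow their slot's memory or disk. The thresholds and expressions are only a sketch of one possible approach, not what CamGrid pools actually run; note that ImageSize, DiskUsage and Disk are reported in KiB while Memory is in MiB.

  # Hypothetical per-slot policy to guard memory and disk on an execute node
  MEMORY_EXCEEDED = (ImageSize > Memory * 1024)
  DISK_EXCEEDED   = (DiskUsage > Disk)

  # Preempt jobs that exceed the slot's resources, keeping any existing policy.
  PREEMPT = ($(PREEMPT)) || $(MEMORY_EXCEEDED) || $(DISK_EXCEEDED)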
Finally…
CamGrid:
Contact:
Questions?