CamGrid
Mark Calleja, Cambridge eScience Centre
What is it?
Ten like-minded groups and departments, each running their own Condor pool(s), which federate their resources (12 pools in total).
Coordinated by the Cambridge eScience Centre (CeSC), but with no overall control.
Has been running for ~2.5 years, with 70+ users.
Currently ~950 processors/cores available: all Linux (various distributions), mostly x86_64, running 24/7.
Mostly Dell PowerEdge 1950 machines (like the HPCF), with four cores and 8 GB RAM each.
Around 2M CPU hours delivered to date.
Some details
Pools run the latest stable version of Condor (currently 6.8.6).
All machines get an (extra) IP address in a CUDN-only routable range for Condor.
Each pool sets its own policies, but these must be visible to other users of CamGrid.
Currently we see vanilla, standard and parallel (MPI) universe jobs.
Users get accounts on a machine in their local pool; jobs are then distributed around the grid via Condor's flocking mechanism (see the sketch below).
MPI jobs on single SMP machines have proved very useful.
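As an illustration of the flocking setup, a condor_config fragment along these lines could let a submit host in one pool send jobs to, and accept jobs from, other CamGrid pools. The host names are hypothetical and the exact settings should be checked against the Condor 6.8 manual; this is a sketch, not CamGrid's actual configuration.

  ## Flocking sketch for a submit host in one pool (hypothetical host names)

  # Central managers of other pools we are willing to send jobs to,
  # tried in the order listed when the local pool cannot run a job.
  FLOCK_TO = condor.poolB.cam.ac.uk, condor.poolC.cam.ac.uk

  # Submit (schedd) hosts in other pools allowed to flock jobs to us.
  FLOCK_FROM = submit.poolB.cam.ac.uk, submit.poolC.cam.ac.uk

  # Flocked submitters also need write access to our daemons.
  HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)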
NTE of Ag3[Co(CN)6] with SMP/MPI sweep
Monitoring Tools
A number of web-based tools are provided to monitor the state of the grid and of jobs.
CamGrid is based on trust, so we must make sure that machines are fairly configured.
The university gave us £450k (~$950k) to buy new hardware; we need to ensure that it's online as promised.
CamGrid's file viewer
The standard universe uses RPCs to echo I/O operations back to the submit host. What about other universes? How can I check the health of my long-running simulation?
We've provided our own facility, which involves an agent installed on each execute node and accessed via a web interface.
It works with vanilla and parallel (MPI) jobs.
It requires local sysadmins to install and run it.
CamGrid's file viewer
Checkpointable vanilla universe
The standard universe is fine if you can link against Condor's libraries (Pete Keller: this is getting harder).
We are investigating the BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux.
BLCR uses kernel resources, and can thus restore resources that user-level libraries cannot; it is supported by some flavours of MPI (late LAM, Open MPI).
The idea was to use Parrot's user-space filesystem to wrap a vanilla job and save the job's state on a chirp server; however, Parrot currently breaks some BLCR functionality.
A sketch of a BLCR-wrapped vanilla job appears below.
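As an illustration of the BLCR idea, a vanilla-universe submit file might look something like the sketch below. The executable, wrapper and file names are hypothetical; the wrapper script would start the application under BLCR's cr_run and take periodic checkpoints with cr_checkpoint, so that an evicted job can be resumed with cr_restart from the returned sandbox.

  # Hypothetical submit file for a BLCR-wrapped vanilla job
  universe        = vanilla
  executable      = blcr_wrapper.sh           # wrapper: runs the app under cr_run
  arguments       = my_simulation input.dat   # hypothetical application and input
  transfer_input_files = my_simulation, input.dat

  # Return the sandbox (including any BLCR context file) on eviction,
  # so a restarted job can resume from its last checkpoint.
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT

  output = sim.out
  error  = sim.err
  log    = sim.log
  queue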
What doesn't work so well…
Each pool is run by local sysadmin(s), but these are of variable quality/commitment.
We've set up mailing lists for users and sysadmins, but they are hardly ever used (people don't want to advertise their ignorance?).
Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…
Don't get me started on the merger with UCS's central resource (~400 nodes).
But generally we're happy bunnies
"CamGrid was an invaluable tool, allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week." -- Dr. Ben Allanach
"CamGrid was essential in order for us to be able to run the different codes in real time." -- Prof. Fernando Quevedo
"I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication." -- Dr. Karen Lipkow
Current issues
Protecting resources on execute nodes: Condor seems lax at this, e.g. memory and disk space (an illustrative policy sketch follows this list).
Increasingly interested in VMs (i.e. Xen). Some pools run it, but not in a concerted way (what are the effects on SMP MPI jobs?).
Green issues: will we be forced to buy WoL cards in the near future?
Altruistic computing: a recent wave of interest in BOINC/backfill jobs for medical research, protein folding, etc., but who runs the jobs? What about the audit trail?
How do we interact with outsiders? Ideally keep it to Condor (some Globus; we have toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
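To illustrate the resource-protection point, a startd policy along the lines below could preempt jobs that outgrow their slot's memory or disk. The thresholds and expressions are only a sketch of one possible approach, not what CamGrid pools actually run; note that ImageSize, DiskUsage and Disk are reported in KiB while Memory is in MiB.

  # Hypothetical per-slot policy to guard memory and disk on an execute node
  MEMORY_EXCEEDED = (ImageSize > Memory * 1024)
  DISK_EXCEEDED   = (DiskUsage > Disk)

  # Preempt jobs that exceed the slot's resources, keeping any existing policy.
  PREEMPT = ($(PREEMPT)) || $(MEMORY_EXCEEDED) || $(DISK_EXCEEDED)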
Finally…
CamGrid:
Contact:
Questions?