Introduction to CamGrid Mark Calleja, Cambridge eScience Centre (www.escience.cam.ac.uk)

Why grids?
– The idea comes from electricity grids: you don't care which power station your kettle is using.
– There are lots of underutilised resources around; the trick is to access them transparently.
– Not all resources need to be HPC systems with large amounts of shared memory and fast interconnects: many research problems are "embarrassingly parallel", e.g. phase space sampling.
– We'd like to be able to use "anything": dedicated servers or desktops.

What is Condor?
– Condor turns collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility.
– Machines in a Condor pool can submit and/or service jobs in the pool.
– It is highly configurable: each machine's owner controls how, when and whose jobs may run (see the configuration sketch below).
– Condor provides several useful mechanisms, such as:
  – process checkpoint / restart / migration
  – MPI support (with some effort)
  – failure resilience
  – workflow support
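
The "how/when/whose" policy is expressed as ClassAd expressions in each machine's Condor configuration. A minimal desktop-style sketch (the thresholds are illustrative assumptions, not an actual CamGrid policy):

  # condor_config.local on an execute machine
  START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)   # only start jobs on an idle, lightly loaded desktop
  SUSPEND  = (KeyboardIdle < 60)                           # suspend the job when the owner returns
  CONTINUE = (KeyboardIdle > 5 * 60)                       # resume once the machine has been idle again
  PREEMPT  = False                                         # never evict the job from this machine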

Getting Started: Submitting Jobs to Condor
– Choose a "universe" for your job (i.e. the sort of environment the job will run in): vanilla, standard, Java, parallel (MPI)…
– Make your job "batch-ready": it must be able to run unattended in the background, with no interactive input (stdin), windows or GUI.
– Create a submit description file.
– Run condor_submit on your submit description file (as sketched below).
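
A minimal command-line sketch (the submit file name my_job.sub is hypothetical):

  condor_submit my_job.sub   # hand the submit description file to your local scheduler
  condor_q                   # watch your jobs progress through the queue
  condor_status              # see the machines in the pool and their current states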

A Submit Description File

  # Example condor_submit input file
  # (Lines beginning with # are comments)
  Universe     = vanilla
  Executable   = job.$$(OpSys).$$(Arch)              # $$() substitutes the matched machine's OS/architecture
  InitialDir   = /home/mark/condor/run_$(Process)    # $(Process) runs from 0 to 99 here: one directory per job
  Input        = job.stdin
  Output       = job.stdout
  Error        = job.stderr
  Arguments    = arg1 arg2
  Requirements = Arch == "X86_64" && OpSys == "Linux"
  Rank         = KFlops                              # among matching machines, prefer the faster ones
  Queue 100                                          # queue 100 jobs from this one file

DAGMan – Condor's workflow manager
– DAGMan (the Directed Acyclic Graph Manager) lets you specify the dependencies between your Condor jobs so that it can manage them automatically for you, e.g. "don't run job B until job A has completed successfully."
– Complicated workflows can be built up (DAGs can be embedded within DAGs).
– Failed nodes can be retried automatically (see the sketch below).
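
A minimal sketch of a DAG input file, assuming two jobs described by hypothetical submit files a.sub and b.sub:

  # twostep.dag
  JOB A a.sub
  JOB B b.sub
  PARENT A CHILD B      # B may only start after A has completed successfully
  RETRY B 3             # if node B fails, retry it up to three times

Submitting it with condor_submit_dag twostep.dag runs DAGMan itself as a Condor job, which then submits and monitors A and B for you.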

Condor Flocking
– Condor first tries to run a submitted job in its local pool, but a submit machine can be configured to try sending jobs on to other pools: "flocking".
– The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.
– This is how CamGrid works: each group/department maintains its own pool and flocks with the others (see the sketch below).
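
Flocking is configured through a pair of condor_config settings; a minimal sketch with hypothetical host names:

  # On a submit machine: the central managers of the remote pools to try, in order
  FLOCK_TO   = condor.other-dept.cam.ac.uk, condor.third-dept.cam.ac.uk
  # On the receiving pool's machines: the remote submit hosts allowed to flock in
  FLOCK_FROM = submit.other-dept.cam.ac.uk

(The usual Condor authorisation settings also have to permit the remote hosts.)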

CamGrid
– Started in January 2005 by five groups; now up to eleven groups running 13 pools. (The UCS has its own, separate Condor facility known as "PWF Condor".)
– Each group sets up and runs its own pool and flocks to/from the other pools: a decentralised, federated model.
– Strengths:
  – no single point of failure
  – sysadmin tasks are shared out
– Weaknesses:
  – debugging is complicated, especially networking issues
  – many Linux variants, which can cause library problems

Participating departments/groups
– Cambridge eScience Centre
– Dept. of Earth Science (2)
– High Energy Physics
– School of Biological Sciences
– National Institute for Environmental eScience (2)
– Chemical Informatics
– Semiconductors
– Astrophysics
– Dept. of Oncology
– Dept. of Materials Science and Metallurgy
– Biological and Soft Systems

Local details (1)
– CamGrid uses a set of RFC 1918 ("CUDN-only") IP addresses, so each machine needs to be given an (extra) address in this space.
– A CamGrid Management Committee, with members drawn from the participating groups, maps out policy.
– Currently ~1,000 cores/processors, mostly 4-core Dell 1950s (8 GB memory) like the HPCF. Aside: SMP/MPI works very nicely!
– Pretty much all Linux, and mostly 64-bit.
– Administrators decide how to configure their own pool (see the sketch below), e.g.:
  – extra priority for local users
  – renicing Condor jobs
  – only running jobs at certain times
  – a preemption policy
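
A minimal sketch of the kind of per-pool configuration knobs involved (the values and time window are illustrative assumptions):

  JOB_RENICE_INCREMENT = 10                            # renice Condor jobs so the owner's own processes win
  START   = (ClockMin < 8*60) || (ClockMin >= 18*60)   # only start jobs outside 08:00-18:00
  PREEMPT = (KeyboardIdle < 60)                        # a simple preemption policy: evict when the owner is back
  # Extra priority for local users is typically expressed through the startd's RANK expression.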

Local details (2)
– It is the responsibility of individual pools to authenticate their local submitters. You need to trust root on the remote machine, especially for the standard universe.
– There's no shared file system across CamGrid, but Parrot (from the Condor project) is a nice user-space file-system tool for Linux: it lets a job mount a remote data source as if it were a local file system (à la NFS).
– Firewalls: a submit host must be able to communicate with every possible execute node, but Condor can be confined to a well-defined port range (see the sketch below).
– Two mailing lists have been set up: one for users (92 currently registered) and the other for sysadmins.
– There is a nice web-based utility for viewing job files in real time on execute hosts.
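
The port range is set in the Condor configuration; a minimal sketch (the range shown is an illustrative assumption, not CamGrid's actual choice):

  # Confine Condor's network traffic to a fixed port range,
  # so the corresponding holes can be opened in departmental firewalls
  LOWPORT  = 9600
  HIGHPORT = 9700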

41 refereed publications to date (Science, Phys. Rev. Lett., PLOS, …).

(Poster-style slide: "USERS", "YOUR GRID", "GOD SAVE THE GRID")

How you can help us help you
– Press-gang local resources: why aren't those laptops/desktops on CamGrid?
– When applying for grants, please ask for funds to put towards computational resources (~£10k?).
– Publications, publications, publications! Please remember to mention CamGrid and to let me know about accepted articles.
– Evangelise locally, especially to the hierarchy.
– Tell us what you'd like to see (centralised storage, etc.).

We can archive your digital assets: papers, articles, preprints, eprints, documents, reports, books, working papers, scholarly conference papers, web pages, manuscripts, PhD and digital theses, learning objects, research data, statistics, images, photos, TIFF bitstreams, audio, video, multimedia, text, XML, PDF, source code… Elin Stangeland, Repository Manager

Take home message
– It works: CamGrid has cranked out 386 years of CPU usage since Feb '06 (go back that far and you reach King James I and the Jamestown massacre).
– Those who put the effort in and get over the initial learning curve are very happy with it:
  – "Without CamGrid this research would simply not be feasible." – Prof. Bill Amos (population geneticist)
  – "We acknowledge CamGrid for invaluable help." – Prof. Fernando Quevedo (theoretical physicist)
– It needs no outlay for new hardware, and the middleware is free (and open source).
– This is a grass-roots initiative: you need to help recruit more/newer resources.

Links CamGrid: Condor: Questions?