CamGrid: Experiences in constructing a university-wide, Condor-based grid at the University of Cambridge
Mark Calleja and Bruce Beckles, University of Cambridge

Project Aims
To develop a University-wide "grid":
- To utilise as many CPUs as possible across the University for high throughput computing (HTC) jobs.
- To utilise the University's centrally-provided computing resources à la UCL Condor.
- To allow flexible participation of resource owners in a safe, controlled manner, on their own terms (where this is feasible).

Project Overview
- Condor will be deployed across the University on as many CPUs as possible. Initially concentrating on the Linux platform, but eventually expanding to take in the Windows and MacOS X platforms; also some IRIX and Solaris machines.
- Condor will provide the basic job management and 'micro-scheduling' required for CamGrid. More sophisticated meta-scheduling may be required and will be developed in-house if necessary.
- A choice of different deployment architectures will be offered to participating stakeholders.
- As far as possible, we will seek not to impose any constraints on resource owners' management and configuration of their own resources.
- As far as possible, underlying divisions of the resources in CamGrid should be transparent to the end-user – users should not need to know which machines they are allowed to run jobs on, etc.

Why Condor?
- Proven itself capable of managing such large-scale deployments (UCL Condor).
- Flexible system which allows different deployment architectures.
- Gives the resource owner complete control over who can use their resources and the policies for resource use (see the sketch below).
- Local experience with deploying and using Condor.
- Interfaces with the Globus Toolkit.
- Development team actively developing interfaces to other "grid toolkits".
- Already used within the e-Science and grid communities, and increasingly within the wider scientific academic community.
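As an illustration of the owner-control point above, an execute host's startd policy can be expressed directly in its Condor configuration. The following is a minimal sketch with illustrative thresholds, not CamGrid's actual policy; START, SUSPEND, PREEMPT and KILL are Condor's standard startd policy expressions, and KeyboardIdle and LoadAvg are standard machine ClassAd attributes.

    # Only start jobs when the machine looks idle, and preempt a running job
    # if the machine's owner comes back. Thresholds are illustrative only.
    START   = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
    SUSPEND = False
    PREEMPT = (KeyboardIdle < 60)
    KILL    = False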

Condor Overview (1): A typical Condor pool
[Diagram of a typical Condor pool, comprising:]
- A Central Manager, which monitors the status of the execute hosts and matches jobs from submit hosts to appropriate execute hosts.
- Submit hosts and execute hosts (some machines are both submit and execute hosts).
- An optional checkpoint server, on which checkpoint files from jobs that checkpoint are stored.

Condor Overview (2): A typical Condor job
[Diagram of a typical Condor job's life cycle:]
- Each Execute Host tells the Central Manager about itself; the Central Manager tells it when to accept a job from a Submit Host.
- The Submit Host tells the Central Manager about a job; the Central Manager tells it to which Execute Host it should send the job.
- The Submit Host sends the job to the Execute Host, and the Execute Host sends the results back to the Submit Host.
- On each machine the Condor daemons normally listen on ports 9614 and 9618. On the Submit Host a condor_shadow process is spawned for the job; all of the job's system calls are performed as remote procedure calls back to the Submit Host.
- On the Execute Host, Condor spawns the user's job (the user's executable code linked against the Condor libraries) and signals it when to abort, suspend, or checkpoint; the checkpoint file (if any) is saved to disk.
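For context, a job like the one in the diagram is described to Condor in a submit description file and handed to condor_submit. The following is a minimal sketch; the executable and file names are hypothetical. The standard universe is the one in which the user's code is relinked against the Condor libraries, so that system calls are forwarded to the submit host and checkpoints can be taken.

    # Minimal submit description file (illustrative names only)
    universe   = standard
    executable = my_simulation
    arguments  = input.dat
    output     = my_simulation.out
    error      = my_simulation.err
    log        = my_simulation.log
    queue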

The University Computing Service (UCS)
Currently, the largest resource owner in CamGrid is the University Computing Service (UCS), which provides central computing facilities for the University as a whole. Particular requirements for participation in CamGrid:
- Must interface smoothly with existing UCS systems and infrastructure.
- Must comply with UCS security policies.
- Must not make existing UCS systems any less secure or less stable.
This gives rise to a particular requirement: CamGrid 'middleware' on UCS systems must follow the principle of least privilege, i.e. always run with the least security privilege necessary to achieve the current task. Condor, when run in its default configuration (with the condor_master daemon running as root), violates this principle.

Two Deployment Environments
The other current participants in CamGrid are mainly small research groups within Departments (though some of these can bring Department-wide resources into CamGrid). In general they have more relaxed security requirements regarding Condor daemons running on their individual resources. However, they have varying and complex requirements concerning who can use their resources, whose jobs get priority, etc. Thus it became clear that two different styles of Condor deployment would be necessary for CamGrid:
- Environment 1: UCS-owned and/or UCS-managed systems.
- Environment 2: Condor pools managed by anyone other than the UCS (research groups, etc.).

Environment 1
Consists of two separate classes of machine:
- Machines owned by the UCS and provided for the use of members of the University in 'public' computer rooms, teaching rooms, etc. – the Personal Workstation Facility (PWF).
- Machines managed by the UCS on behalf of Departments and Colleges – the Managed Cluster Service (MCS).

The Personal Workstation Facility (PWF)
Centrally managed workstations set up to provide a consistent environment for end-users:
- PCs running Windows XP
- Dual-boot PCs running Windows XP/SuSE Linux 9.0
- Apple Macintosh computers running MacOS X v10.3
Not behind a firewall; public ('globally visible') IP addresses. Of the order of 600 such workstations.

The Managed Cluster Service (MCS)
A service offered by the UCS to Departments and Colleges of the University:
- The institution buys equipment of a given specification.
- The UCS installs the workstations as a PWF in the institution, and remotely manages them on behalf of the institution in conjunction with the institution's local support staff.
- Institutions can have their MCS customised (installations of particular software applications, restrictions on who can use the machines, etc.) if they wish.
This service has proved extremely popular and has significantly increased the number of 'PWF-style' workstations in the University. Currently planning to deploy Condor as part of CamGrid (appropriately customised) to agreeable institutions who already participate in the MCS (at no additional cost).

Environment 1 architecture (1)
- Will consist of a mixture of UCS-owned PWF workstations and MCS workstations belonging to willing institutions.
- Authentication will be based on existing user authentication for the PWF. (PWF accounts are freely available to all current members of the University.)
- UCS-owned workstations will accept jobs from any current member of the University with a PWF account.
- MCS workstations belonging to other institutions will accept jobs according to the institution's policies (e.g. only jobs from members of the Department, priority to jobs from members of the Department, etc.); a sketch of such a policy is given below.
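A hedged sketch of what such an institutional policy could look like in the Condor configuration of an MCS execute host. The usernames are invented, and the real policy would be whatever the owning institution specifies; Owner is the standard job ClassAd attribute naming the submitting user.

    # Only accept jobs from the institution's own users, and prefer one of them
    # over the other (usernames are purely illustrative).
    START = (TARGET.Owner == "spqr1") || (TARGET.Owner == "abc123")
    RANK  = (TARGET.Owner == "spqr1")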

Environment 1 architecture (2)
- Will use a dedicated Linux machine as the central manager.
- Initially only a single (Linux) submit host (more added as necessary to cope with more execute hosts) and 1TB of short-term storage will be provided. Access to this submit host will initially be via SSH only.
- All daemon-daemon communication will be authenticated via Kerberos (see the sketch below).
- Initially only run jobs on PWF Linux workstations.
- No checkpoint server will be provided (maybe later if there is sufficient demand).
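A sketch of the kind of security configuration this implies, assuming the Kerberos realm and per-host principals are already in place (as they are on UCS machines). The macro names are those used by Condor's security layer; the exact settings would depend on the Condor version deployed.

    # Require all daemon-to-daemon traffic to be authenticated with Kerberos
    # and integrity-checked.
    SEC_DAEMON_AUTHENTICATION         = REQUIRED
    SEC_DAEMON_AUTHENTICATION_METHODS = KERBEROS
    SEC_DAEMON_INTEGRITY              = REQUIRED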

Environment 1 architecture (3)
[Diagram of the Environment 1 pool: a Central Manager; a Submit Host (SSH access only, with 1TB of dedicated short-term storage); and PWF Linux machines as Execute Hosts. Each Execute Host tells the Central Manager about itself, and the Central Manager tells each Execute Host when to accept a job from the Submit Host. The Submit Host tells the Central Manager about a job, and the Central Manager tells it to which Execute Host it should send the job; the Submit Host sends the job to the Execute Host, and the Execute Host returns the results to the Submit Host. All daemon-daemon (i.e. machine-machine) communication in the pool is authenticated via Kerberos, and all machines make use of the UCS Kerberos infrastructure (UCS Kerberos domain controller).]

Environment 1: Security considerations
As provided, Condor cannot run securely under UNIX/Linux on submit and execute hosts unless it is started as root. (Condor on a dedicated central manager, as is being used here, does not require root privilege.) As previously noted, this is not acceptable in this environment. The solution is not to run Condor as root, and instead to provide secure mechanisms that enable it to make use of elevated privilege where necessary. How do we do this?

Least privilege: Execute host (1)
On execute hosts, Condor makes use of two privileges of root:
- The ability to switch to any user context (the CAP_SETUID capability).
- The ability to send signals to any process (the CAP_KILL capability).
Note that if you have the ability to switch to any user context you effectively have the ability to send signals to any process. Condor provides a facility (USER_JOB_WRAPPER) to pass the Condor job to a "wrapper" script, which is then responsible for executing the job. GNU userv allows one process to invoke another (potentially in a different user context) in a secure fashion when only limited trust exists between them.

Least privilege: Execute host (2)
Putting these together we arrive at the following solution (see the configuration sketch below):
- No Condor daemon or process runs as root.
- Condor passes the job to our wrapper script.
- Our script installs a signal handler to intercept Condor's signals to the job.
- Our script fork()'s a userv process which runs the job in a different (also non-privileged) user context.
- On receipt of a signal from Condor, our script calls userv to send the signal to the job.
- userv uses pipes to pass the job's standard output and standard error back to our script (and so to Condor).
- A cron job (using Condor's STARTD_CRON mechanism) runs once a minute to make sure that if there is no job executing then there are no processes running in our dedicated "Condor job" user context.
Voilà!
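A hedged sketch of the execute-host configuration this implies. The wrapper and reaper paths are hypothetical, and the STARTD_CRON macro names shown follow the convention of later Condor releases, so the exact spelling may differ in the version actually deployed.

    # Hand every job to our (hypothetical) userv-based wrapper instead of
    # executing it directly.
    USER_JOB_WRAPPER = /usr/local/condor/libexec/userv_job_wrapper

    # Run a "reaper" once a minute: if no Condor job is executing, kill any
    # processes left behind in the dedicated "Condor job" account.
    STARTD_CRON_JOBLIST           = REAPER
    STARTD_CRON_REAPER_EXECUTABLE = /usr/local/condor/libexec/reap_job_account
    STARTD_CRON_REAPER_PERIOD     = 1m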

Least privilege: Execute host (3)
What does this gain us?
- If there is a vulnerability in Condor then the entire machine is not compromised as a result.
- We do not have any Condor processes, whose real user ID is root, listening to the network.
- Much greater control over the job's execution environment – could run the job chroot'd if desired.
- If a Condor job leaves behind any processes running after job completion they will be killed – normally, Condor only does this properly if specially configured (EXECUTE_LOGIN_IS_DEDICATED).

Least privilege: Execute host (4)
What do we lose?
- Can no longer suspend jobs (cannot catch SIGSTOP) – not a problem, as we only want jobs to either run or be evicted.
- We now need to handle passing the job its environment variables, setting resource limits and scheduling priority, etc., which Condor would normally handle.
- Condor can no longer correctly distinguish between the load Condor processes and Condor jobs are putting on the machine, and the load generated by other processes.
- There may be problems with jobs run in Condor's Standard universe (testing required).

Least privilege: Submit host
This has yet to be implemented, but in principle it would work as follows:
- No Condor daemon runs as root.
- Condor is told to call a "wrapper" script or program instead of condor_shadow when a job starts on an execute host (just change the filenames specified in the SHADOW and SHADOW_PVM macros – see the sketch below).
- The "wrapper" installs a signal handler and then fork()'s userv to call condor_shadow in the context of the user who submitted the job.
- userv allows us to pass arbitrary file descriptors to the process we've invoked, and we may need to use this to handle the network sockets normally inherited by condor_shadow.
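A minimal sketch of the configuration change this refers to, assuming a hypothetical wrapper installed alongside Condor; everything else about the wrapper's behaviour is as described above.

    # Point Condor at the (hypothetical) shadow wrapper instead of the real
    # condor_shadow binary; the wrapper re-invokes condor_shadow via userv
    # in the submitting user's context.
    SHADOW     = /usr/local/condor/libexec/shadow_wrapper
    SHADOW_PVM = /usr/local/condor/libexec/shadow_wrapper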

Least privilege: A better way
Under Linux we use the dynamically linked version of Condor, so, using LD_PRELOAD, we can intercept the function calls Condor makes to signal processes (and also to change user context). With a specially written library we could then catch such calls and handle them via userv, thus transparently allowing Condor to function as intended. Maybe…!? (We intend to investigate such a solution as a long-term strategy for the UCS deployment of Condor.)

Environment 2
Aims to federate any willing Condor pools across the University into one flock. Must allow for many complications, e.g.:
1) Department firewalls.
2) "Private" IP addresses, local to departments/colleges.
3) Diverse stakeholder needs/requirements/worries.
4) Licensing issues: "Can I pre-stage an executable on someone else's machine?"
Initially we've started with just a small number of departments, and have considered two different approaches to solving 1) and 2): one that uses a VPN and another that utilises university-wide "private" IP addresses.
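For reference, pools are joined into a flock with Condor's flocking mechanism, configured roughly as in the following sketch; the hostnames are hypothetical.

    # On a submit host in department A: try department B's pool when the
    # local pool cannot run the job.
    FLOCK_TO   = condor-cm.dept-b.cam.ac.uk

    # On department B's pool: the remote submit hosts allowed to flock in
    # (the stock configuration references this macro from its HOSTALLOW_*
    # settings).
    FLOCK_FROM = condor-submit.dept-a.cam.ac.uk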

A VPN approach
Uses secnet:
- Each departmental pool configures a VPN gateway, and all other machines in the pool are given a second IP address.
- All Condor-specific traffic between pools goes via these gateways, so minimal firewall modifications are required. It also means that only the gateway needs a "globally" visible IP address.
- Inter-gateway traffic is encrypted.
- Initial tests with three small pools worked well, though the setup hasn't been stress-tested.
- However, there are unanswered security issues, since we're effectively bypassing any resident firewall.

Environment 2 VPN test bed architecture
[Diagram of the VPN test bed, spanning three sites – the Department of Earth Sciences, CeSC and NIeES – with each site running a VPN gateway (GW); the pools' central managers (CM) and other hosts (rbru, cartman, esmerelda, tempo, tiger, ooh) communicate through these gateways.]

University-wide IP addresses
- Give each department a dedicated subnet on a range of university-routable addresses.
- Each machine in a department's Condor pool is given one of these addresses in addition to its original one (à la VPN).
- Condor traffic is now routed via the conventional gateway.
- Department firewalls must be configured to allow through any external traffic arising from flocking activities (see the sketch below).
- We've only just started testing this model: it involves more work for our Computer Officer, but he can sleep easier.
- Performance appears comparable to the VPN approach, but it has involved more work to set up.
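A sketch of the per-machine Condor settings this model suggests; the address and port range are illustrative, not CamGrid's actual values. NETWORK_INTERFACE binds Condor to the machine's university-routable address, and LOWPORT/HIGHPORT confine Condor's dynamically assigned ports to a range the departmental firewall can be told to allow.

    # Use the university-routable address for all Condor traffic
    # (illustrative address).
    NETWORK_INTERFACE = 172.20.42.17

    # Keep Condor's dynamically assigned ports inside a known range so the
    # departmental firewall only needs to open this range (together with the
    # daemons' well-known ports).
    LOWPORT  = 9600
    HIGHPORT = 9700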

A Proposal for a Federated CamGrid (1)
It is not yet clear how Environment 1 and Environment 2 will interact, and a number of possibilities are under active investigation. One proposal is to make use of the UCS-provided submit node(s) as a central submission point for CamGrid:
- Makes firewall administration significantly easier (à la VPN).
- Makes the underlying divisions between groups of resources (e.g. those in Environment 1 and those in Environment 2) transparent to end-users.
- Makes it easy for resource owners who do not wish to set up their own Condor pools to join CamGrid, whilst still supporting those with their own Condor pools.
- Preserves the autonomy of resource owners, who can still have separate arrangements between their Condor pools if desired (e.g. for cross-Departmental research projects).
- Since, under Condor, the policies that determine what jobs can run on a resource are configured and enforced on that resource itself, resource owners still maintain complete control of their resources.

A Proposal for a Federated CamGrid (2)
[Diagram of the proposed federation: a central Submit Host (SSH access only) feeding jobs to the PWF and to pools labelled A and C in Department One and Department Two. Machines in A and C must have public IP addresses or University-wide private IP addresses; PWF machines have public IP addresses. Each departmental firewall only needs to allow traffic from one host through, regardless of the number of machines participating in CamGrid, and it can apply some of its rules to filter traffic from that host if desired.]