Building a secure Condor® pool in an open academic environment
Bruce Beckles, University of Cambridge Computing Service

Condor pool characteristics
- Large number (~1000) of similar or identical workstations
- Workstations are centrally managed
- The primary purpose of the workstations is not running Condor jobs
- Workstations are public-access machines, i.e. available to all members of the institution

Fundamental requirements
A Condor service in this environment must be:
- Stable: must not make the machines any less stable
- Low impact: must be unnoticeable to ordinary users
- Secure: must not significantly increase the machines' attack surface

Stability
- Only use the current Condor stable series, not the development series
- Extensive testing (months, thousands of test jobs) on a small pool of workstations
- Disable any features of Condor not required by users
- Support only a limited subset of Condor functionality (only the Vanilla and Java universes; see the configuration sketch below)
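By way of illustration, restricting execute nodes to the Vanilla and Java universes can be expressed in the startd policy. This is a minimal sketch, not the configuration used in the talk; the macro name SUPPORTED_UNIVERSES is invented, while the universe numbers (5 = Vanilla, 10 = Java) are standard Condor ClassAd values.

    # Execute-node condor_config.local -- illustrative sketch only
    # JobUniverse numbers: 5 = Vanilla, 10 = Java
    SUPPORTED_UNIVERSES = (TARGET.JobUniverse == 5 || TARGET.JobUniverse == 10)

    # Refuse to start anything else (combined with the rest of the start policy)
    START = $(SUPPORTED_UNIVERSES)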

Low impact
- Gather usage statistics for the target workstations and only allow Condor to run during periods when they would normally be idle
- Will not run jobs if a user is logged in:
  - Custom ClassAd attribute with the number of users logged in (see the policy sketch below)
- Any user activity aggressively preempts the Condor job
  - Issue under standard Linux 2.6 kernels: USB mouse and keyboard activity is not detected
- Control the Condor job's environment and sterilise the environment after job completion
  - Handles jobs that use up all available disk space, don't clean up after themselves, etc.
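A sketch of what such a policy might look like. The attribute name LoggedInUsers and the script count_users.sh are hypothetical, and the STARTD_CRON syntax shown is the later HTCondor form (the 6.x series current at the time used a different format), so treat this as illustrative rather than the talk's actual configuration.

    # A site script (hypothetical: count_users.sh) prints "LoggedInUsers = <n>";
    # STARTD_CRON publishes it into the machine ClassAd every 60 seconds.
    STARTD_CRON_JOBLIST = USERCOUNT
    STARTD_CRON_USERCOUNT_EXECUTABLE = /usr/local/sbin/count_users.sh
    STARTD_CRON_USERCOUNT_PERIOD = 60s

    # Only start jobs when nobody is logged in (extends the earlier sketch)
    START = $(SUPPORTED_UNIVERSES) && (LoggedInUsers =?= 0)

    # Preempt as soon as the count is non-zero or unknown (fail closed),
    # and kill rather than wait for a graceful vacate
    SUSPEND = False
    PREEMPT = (LoggedInUsers =!= 0)
    KILL    = $(PREEMPT)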

Security
- What is our threat landscape? What are we worried about?
- How does this relate specifically to Condor?
- Specific security concerns…
- …and how we addressed them

Threat landscape
- Threats internal to the environment are at least as significant as external threats:
  - The largest body of users (students) is untrusted
  - There is no clear separation between use of the machines by trusted and untrusted users
- Access (often wholly or largely unrestricted) to the public Internet is a core requirement:
  - Both for normal use of the machines and for Condor jobs
  - Firewalls are therefore of little help

Specific security concerns (1)
- Reliable identification of machines:
  - IP addresses are useless as identifiers (IP spoofing), so strong authentication is required
- Do not significantly increase the attack surface of the machines:
  - No daemons running as root that listen to the network: privilege separation (see the following talk)
- Control access to the Condor pool:
  - Easiest at the point of job submission
  - Restricted number of centralised submit nodes (see the sketch below)
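One way to express "only the centralised submit nodes may submit" at the configuration level is via the pool's write-level authorisation list. A hedged sketch with hypothetical host names; in the Condor 6.x series the knob is spelled HOSTALLOW_WRITE, in later releases ALLOW_WRITE.

    # Require authentication for every connection (methods configured elsewhere)
    SEC_DEFAULT_AUTHENTICATION = REQUIRED

    # Write access (job submission, claim activation, etc.) restricted to the
    # centralised submit nodes.  Execute nodes also need WRITE access to the
    # collector in order to advertise themselves; omitted here for brevity.
    ALLOW_WRITE = submit01.example.ac.uk, submit02.example.ac.uk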

Specific security concerns (2)
- Controlling the job execution environment:
  - Inspect the job prior to running it on the machine
  - Start the job in a sterile environment; sterilise the environment after the job has run
  - Jobs run under a dedicated unprivileged user account (see the sketch below)
- Restrict access to the Condor commands:
  - Ideally, develop a separate front-end to the Condor system
  - Currently just wrapper scripts around the Condor commands
  - These can be circumvented (in some cases), so we are piloting the service with relatively trusted users
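For the dedicated unprivileged account, later HTCondor releases expose configuration knobs along the following lines. The account name condorjob1 is hypothetical, and the 6.x series current at the time of the talk used differently named settings (e.g. VM<N>_USER), so this is a sketch of the idea rather than the talk's setup.

    # Never run jobs as the submitting user on these execute nodes
    STARTER_ALLOW_RUNAS_OWNER = False

    # Run every job in slot 1 as a dedicated local account (hypothetical name)
    SLOT1_USER = condorjob1

    # Mark the account as dedicated, so Condor can reliably track and clean up
    # every process owned by it when the job exits or is preempted
    DEDICATED_EXECUTE_ACCOUNT_REGEXP = condorjob[0-9]+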

Strong authentication
- Currently only available under UNIX/Linux: Kerberos or GSI
- GSI:
  - Flawed security paradigm (mandates daemons running as root, etc.)
  - Serious usability and scalability issues
- Kerberos:
  - KDCs provide a separate audit trail
  - We plan to use Kerberos elsewhere in the University
- Support for Kerberos under Windows and Mac OS X is being added to Condor; support for GSI is not (functional GSI libraries are not available)
- Bug in the Kerberos support in the stable series of Condor: we backported a patch from the development series to fix it
- Kerberos has proved surprisingly easy to deploy and administer in our setup (see the sketch below)
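A hedged sketch of the corresponding security settings; the map-file path is hypothetical and the exact Kerberos-related knobs (and their behaviour) vary between Condor versions.

    # Authenticate everything, and only accept Kerberos
    SEC_DEFAULT_AUTHENTICATION = REQUIRED
    SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS

    # Optional integrity protection on the authenticated channels
    SEC_DEFAULT_INTEGRITY = REQUIRED

    # Map Kerberos realms onto Condor UID domains (hypothetical path)
    KERBEROS_MAP_FILE = /etc/condor/condor.kmap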

Scalability / Performance
- condor_schedd (job queue management) doesn't scale well:
  - Monolithic process: performs too many different tasks
  - Uses blocking connections in the stable series
- In our experience:
  - Performs very badly above 4,000 jobs; falls over above 10,000 jobs
  - Cannot handle significant numbers of short-running (less than 5 minute) jobs
  - Job overhead is such that jobs need to be about 10 minutes long to be worth running under Condor
- Not much we can do about this:
  - Add more submit nodes as demand on our service rises
  - Educate our users to use the service sensibly (e.g. batch up short-running jobs; see the sketch below)
  - Wrap / replace the Condor commands to encourage sensible behaviour and mitigate some of these problems
  - Lobby the Condor Team to re-design the condor_schedd daemon
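As an illustration of "batching up short-running jobs", a submit description can queue a handful of longer jobs, each of which works through a chunk of short tasks. The wrapper script run_batch.sh and its chunking convention are hypothetical; the point is simply that the schedd sees 20 ten-minute jobs rather than 1,000 thirty-second ones.

    # Illustrative submit description (not from the talk)
    universe   = vanilla
    executable = run_batch.sh          # hypothetical wrapper: runs ~50 short tasks
    arguments  = $(Process)            # which chunk of the task list to process
    output     = batch_$(Process).out
    error      = batch_$(Process).err
    log        = batch.log
    queue 20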

Partitioning the pool
- We require the ability to allow jobs from certain users to run only on certain machines:
  - No sensible way is provided to do this
  - Restriction via lists of users or machines in configuration files / ClassAd attributes is unwieldy and doesn't scale
- Our method (sketched below):
  - Machines are configured to only accept jobs carrying a particular ClassAd attribute
  - The attribute is set automatically by our wrapper scripts based on the user's identity
  - Execute nodes cross-check the user against an independently maintained and distributed (via LDAP) ACL – this prevents users falsifying the ClassAd attribute
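A sketch of the attribute-based part of this scheme. The attribute name CamGridGroup and its value are invented for illustration, and the LDAP-backed ACL cross-check would be done by a site script outside Condor, so only the ClassAd plumbing is shown.

    # Submit side: the wrapper script appends a custom attribute to the job's
    # submit description based on who is submitting, e.g.
    #   +CamGridGroup = "chemistry"

    # Execute side: only start jobs that carry the expected group, on top of
    # the universe and logged-in-user terms from the earlier sketches
    START = $(SUPPORTED_UNIVERSES) && (LoggedInUsers =?= 0) && \
            (TARGET.CamGridGroup =?= "chemistry")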

Architectural overview
- Large number of centrally managed, public-access workstations running Linux
- Jobs only run when no users are logged in
- Centralised submit node(s)
- Wrappers around the Condor commands
- Restricted (but still useful) subset of Condor's functionality
- Machine identity strongly authenticated
- Improved Condor security model:
  - Privilege separation on execute nodes
  - Strict control of the job environment

Conclusion
- Although Condor was not designed for a hostile environment, it can be used relatively securely in such environments (with some caveats, naturally)…
  - …under Linux…
  - …but a lot of development work is required to achieve this…
  - …and it requires the supporting infrastructure of a stable, centrally managed workstation service.
- Improvements to Condor would make this significantly easier: design for a hostile environment. These days, most environments are.