Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL

Background. Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government. BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments. The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.

Background (cont.). BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN. The RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).

Background (cont.). The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF. The RCF/ACF is transforming itself from a local resource into a national and global resource, with growing design and operational complexity and increasing staffing levels to handle the additional responsibilities.

RCF/ACF Structure

Staff Growth at the RCF/ACF

The Pre-Grid Era. Rack-mounted commodity hardware. Self-contained, localized resources, available only to local users. Little interaction with external resources at remote locations. Considerable freedom to set our own usage policies.

The (Near-Term) Future. Resources available globally. Distributed computing architecture. Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth. Constraints on the freedom to set our own policies.

How do we get there? Change in management philosophy. Evolution in hardware requirements. Evolution in software packages. Different security protocol(s). Change in access policy.

Change in Management Philosophy. Automated monitoring and management of servers in large clusters is a must. Remote power management, predictive hardware failure analysis and preventive maintenance are important. High availability is based on a large number of identical servers, not on 24-hour support. Increasingly large clusters are only manageable if the servers are identical → avoid specialized servers.
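
To make the "automated monitoring and management" point concrete, here is a minimal sketch (not the RCF/ACF's actual tooling) of the kind of script this philosophy implies: probe every farm node in parallel and report the failures, instead of logging into machines by hand. The node naming scheme and the SSH-port check are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Minimal farm health sweep: check that every node accepts connections on
the SSH port and report the ones that do not."""

import concurrent.futures
import socket

# Hypothetical node list; a real farm would read this from its inventory.
NODES = ["node%04d.farm.example.gov" % i for i in range(1, 501)]
SSH_PORT = 22


def node_is_up(host, timeout=5.0):
    """Return True if the node accepts a TCP connection on the SSH port."""
    try:
        with socket.create_connection((host, SSH_PORT), timeout=timeout):
            return True
    except OSError:
        return False


def sweep(nodes):
    """Probe all nodes concurrently; return the list of unreachable ones."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        status = list(pool.map(node_is_up, nodes))
    return [host for host, up in zip(nodes, status) if not up]


if __name__ == "__main__":
    down = sweep(NODES)
    print("%d of %d nodes unreachable" % (len(down), len(NODES)))
    for host in down:
        print("  DOWN:", host)
```

In a production farm a sweep like this would feed a monitoring database or alerting system rather than print to a terminal, but the principle is the same: identical servers, checked automatically, replaced rather than nursed.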

Evolution in Hardware Requirements. Early acquisitions emphasized CPU power over local storage capacity. The increasing affordability of local disk storage has changed this philosophy. Hardware is now chosen by the optimal combination of CPU power, storage capacity, server density and price. Buy from high-quality vendors to avoid labor-intensive maintenance issues.
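
As a toy illustration of the "optimal combination" criterion, the sketch below scores a few candidate server configurations by weighting CPU, disk, rack density and price. The weights and specifications are invented for illustration and are not BNL's procurement formula.

```python
"""Toy procurement trade-off: rank invented server configurations by a
weighted combination of CPU, local disk, density and price."""

# (name, CPU benchmark score, local disk in GB, nodes per rack unit, price in USD)
CANDIDATES = [
    ("vendor-A 1U dual-CPU", 1200,  250, 1.0, 3200),
    ("vendor-B 2U big-disk", 1100, 1000, 0.5, 3900),
    ("vendor-C blade",       1150,  120, 2.0, 3500),
]

WEIGHTS = {"cpu": 0.4, "disk": 0.3, "density": 0.1, "price": 0.2}


def score(cpu, disk_gb, density, price):
    """Higher is better: performance terms are rewarded, price is penalized.
    Each term is normalized against an arbitrary reference value."""
    return (WEIGHTS["cpu"] * cpu / 1000.0
            + WEIGHTS["disk"] * disk_gb / 500.0
            + WEIGHTS["density"] * density
            - WEIGHTS["price"] * price / 3000.0)


for name, cpu, disk, density, price in sorted(
        CANDIDATES, key=lambda c: score(*c[1:]), reverse=True):
    print("%-22s score=%.2f" % (name, score(cpu, disk, density, price)))
```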

The Growth of the Linux Farm

Drop in Server Price as a Function of Performance

Drop in Cost of Local Storage

Total Distributed Storage Capacity

Growth of Storage Capacity per Server

Server Reliability

The Factors Driving the Evolution in Software Packages: cost; farm size / scalability; security; external influences / wide acceptance.

Cost. Red Hat Linux → Scientific Linux. LSF → Condor.

Farm Size / Scalability. Home-built batch system for data reconstruction → Condor-based batch system. Home-built monitoring system → Ganglia.
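
For context on what a Condor-based batch system looks like from the user side, here is a hedged sketch of submitting one reconstruction job: write a submit description file and hand it to condor_submit. The executable path, file names and requirements expression are hypothetical; the RCF's actual reconstruction wrapper is more elaborate than this.

```python
"""Sketch of scripted Condor submission for one reconstruction job:
generate a submit description file and queue it with condor_submit."""

import subprocess
import textwrap

# Hypothetical submit description; paths and requirements are placeholders.
SUBMIT_TEMPLATE = textwrap.dedent("""\
    universe     = vanilla
    executable   = /opt/reco/bin/run_reco.sh
    arguments    = {input_file}
    output       = logs/reco_{job_id}.out
    error        = logs/reco_{job_id}.err
    log          = logs/reco.log
    requirements = (OpSys == "LINUX")
    queue
""")


def submit_reco_job(job_id, input_file):
    """Write a submit file for one input file and queue it with Condor."""
    submit_path = "reco_%s.sub" % job_id
    with open(submit_path, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(input_file=input_file, job_id=job_id))
    subprocess.check_call(["condor_submit", submit_path])


if __name__ == "__main__":
    submit_reco_job("000123", "/data/raw/run000123.dat")
```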

Security. Started with NIS/telnet in the 1990s. Cyber-security threats prompted the installation of firewalls and gatekeepers and the migration to ssh → stricter security standards than in the past. On-going change to Kerberos 5 and phase-out of NIS passwords. Testing GSI → limited support for GSI.
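
As a small example of what the Kerberos 5 migration enables on the client side, the sketch below uses the standard MIT Kerberos `klist -s` check (silent, exit status only) to verify that a user holds a valid ticket before acting on their behalf. This is a generic Kerberos idiom, not RCF-specific code.

```python
"""Check for a valid Kerberos 5 ticket before proceeding, using klist -s,
which exits non-zero when the credential cache holds no valid tickets."""

import subprocess
import sys


def has_valid_krb5_ticket():
    """Return True if the caller's credential cache holds a valid ticket."""
    return subprocess.call(["klist", "-s"]) == 0


if __name__ == "__main__":
    if not has_valid_krb5_ticket():
        sys.exit("No valid Kerberos 5 ticket found; run kinit first.")
    print("Kerberos 5 ticket OK, proceeding.")
```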

Security Changes (cont.). Authorization and authentication are controlled by the local site (NIS and Kerberos). Migration to GSI requires a central CA and regional VOs for authentication → the local site performs final authentication before granting access. Accept certificates from multiple CAs? Difficult transition from complete to partial control over security issues.
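
The "local site performs final authentication" step is commonly a grid-mapfile style lookup: a certificate subject (DN) issued by an accepted CA is mapped to a local account, and anything unmapped is rejected. Below is a minimal sketch of that lookup; the sample DNs and account names are invented, and a real mapfile is maintained by the site or generated from VO membership data.

```python
"""Minimal grid-mapfile style authorization: map certificate DNs to local
accounts and refuse anything that is not explicitly listed."""

import shlex

# Invented sample entries in the usual '"subject DN" local_account' form.
SAMPLE_GRID_MAPFILE = '''
"/DC=org/DC=examplegrid/OU=People/CN=Jane Physicist 12345" rhicuser01
"/DC=org/DC=examplegrid/OU=People/CN=John Analyst 67890"   atlasprod
'''


def load_gridmap(text):
    """Parse mapfile lines; shlex keeps the quoted DN as one token."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)
        mapping[dn] = account
    return mapping


def authorize(dn, gridmap):
    """Return the local account for a DN, or None if the site rejects it."""
    return gridmap.get(dn)


if __name__ == "__main__":
    gridmap = load_gridmap(SAMPLE_GRID_MAPFILE)
    print(authorize("/DC=org/DC=examplegrid/OU=People/CN=Jane Physicist 12345",
                    gridmap))
```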

External Influences / Wide Acceptance. Ganglia – used by the RHIC experiments to monitor the RCF and external farms in order to manage their job submission. HRM/dCache – used by other labs. Condor – widely used by the ATLAS community.
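
As an illustration of how an experiment can drive job submission from Ganglia, the sketch below reads the XML that gmond serves to any client connecting on its TCP port (8649 by default) and picks out lightly loaded hosts. The collector host name and the single-metric decision rule are placeholders, not the experiments' actual brokering logic.

```python
"""Query a Ganglia gmond daemon for its XML state dump and list hosts whose
one-minute load average is below a threshold."""

import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "gmond.farm.example.gov"   # hypothetical collector host
GMOND_PORT = 8649                       # default gmond XML port


def fetch_ganglia_xml(host=GMOND_HOST, port=GMOND_PORT):
    """Read the full XML dump that gmond writes to a connecting client."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def lightly_loaded_hosts(xml_bytes, max_load=1.0):
    """Return host names whose load_one metric is below the threshold."""
    hosts = []
    for host in ET.fromstring(xml_bytes).iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                if float(metric.get("VAL")) < max_load:
                    hosts.append(host.get("NAME"))
    return hosts


if __name__ == "__main__":
    xml_bytes = fetch_ganglia_xml()
    print("Candidate hosts:", lightly_loaded_hosts(xml_bytes))
```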

Software Evolution - Summary

Package               Old              New               Date
OS                    Red Hat Linux    Scientific Linux  2004
Batch                 Home-Built/LSF   Condor/LSF        2004/2000
Monitoring            Home-Built       Ganglia           2003
Security              NIS              K5/GSI            2003/2004
Distributed Storage   –                HRM/dCache        2004/?

Ganglia at the RCF/ACF

Condor at the RCF/ACF

Summary. The RCF/ACF is going through a transition from a local facility to a regional (global) facility → many changes. The Linux Farm, built with commodity hardware, is increasingly affordable and reliable. Distributed storage is also increasingly affordable → management software issues.

Summary (cont.). Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices. The facility is in transition on security and access issues. The migration will take longer and be more difficult than generally expected → the change in hardware and software needs to be complemented by a change in management philosophy.