The RHIC Computing Facility at BNL
HEPIX-HEPNT, Vancouver, BC, Canada, October 20, 2003
Ofer Rind, RHIC Computing Facility, Brookhaven National Laboratory

RCF - Overview
- Brookhaven National Lab is a multi-disciplinary DOE research laboratory.
- The RCF was formed in the mid-90's to provide computing infrastructure for the RHIC experiments, and was named the US ATLAS Tier 1 computing center in the late 90's.
- Currently supports both HENP and HEP scientific computing efforts, as well as various general services (backup, web hosting, off-site data transfer, ...).
- 25 FTEs (expanding soon).
- RHIC Run-3 completed in the spring; Run-4 is slated to begin in Dec/Jan.

RCF Structure (diagram slide)

Mass Storage
- 4 StorageTek tape silos managed by HPSS (v4.5).
- Upgraded to STK 9940B drives (200 GB/cartridge) prior to Run-3 (~2 months to migrate the data).
- Total data store of 836 TB (~4500 TB capacity).
- Aggregate bandwidth of up to 700 MB/s; expect 300 MB/s in the next run.
- 9 data movers with 9 TB of disk (future: the array is to be fully replaced after the next run with faster disk).
- Access via pftp and HSI, both integrated with K5 authentication (future: authentication through Globus certificates); a usage sketch follows below.
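As an illustration of the pftp/HSI access path mentioned above, the following is a minimal sketch, not RCF tooling: it assumes the standard hsi client and an existing K5 credential, and the HPSS path, scratch directory, and script structure are hypothetical.

```python
#!/usr/bin/env python3
# Minimal sketch (not RCF tooling): retrieve one file from HPSS with the hsi
# client mentioned above. Assumes an existing K5 credential; the HPSS path and
# local scratch directory are hypothetical placeholders.
import subprocess
import sys

def hpss_get(hpss_path, local_dir="/scratch"):
    """Copy a single file out of HPSS to local disk via hsi and return its local path."""
    local_path = f"{local_dir}/{hpss_path.rsplit('/', 1)[-1]}"
    # hsi accepts a one-shot command on its command line; 'get local : hpss'
    # is the standard single-file retrieval form.
    result = subprocess.run(["hsi", "-q", f"get {local_path} : {hpss_path}"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        sys.stderr.write(result.stderr)
        raise RuntimeError(f"hsi retrieval of {hpss_path} failed")
    return local_path

if __name__ == "__main__":
    # Hypothetical example path; the real RCF namespace is not shown in the slides.
    print(hpss_get("/hpss/example/run3/raw_000123.dat"))
```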

Mass Storage (photo slide)

Centralized Disk Storage
- Large SAN served via NFS:
  - Processed data store plus user home directories and scratch space.
  - 16 Brocade switches and 150 TB of Fibre Channel RAID5 managed by Veritas (MTI & Zzyzx peripherals).
  - 25 Sun servers (E450 & V480) running Solaris 8 (load issues with nfsd and mountd precluded an update to Solaris 9).
  - Can deliver data to the farm at up to 55 MB/s per server.
- RHIC and US ATLAS AFS cells:
  - Software repository plus user home directories.
  - Total of 11 AIX servers, 1.2 TB (RHIC) and 0.5 TB (ATLAS).
  - Transarc on the server side, OpenAFS on the client side.
  - RHIC cell recently renamed (standardized).

Centralized Disk Storage (photo slide: Zzyzx and MTI disk arrays, Sun E450 servers)

The Linux Farm
- 1097 dual Intel CPU VA and IBM rackmounted servers, totaling 918 kSPECint2000.
- Nodes are allocated by experiment and further divided for reconstruction and analysis.
- Typically 1 GB memory plus 1.5 GB swap per node.
- Combination of local SCSI and IDE disk, with aggregate storage of >120 TB available to users.
  - Experiments are starting to make significant use of local disk through custom job schedulers, data repository managers, and rootd (see the sketch below).
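The rootd-based access to local farm disk mentioned in the last bullet could look roughly like the following PyROOT sketch; the node name, file path, and tree name are hypothetical, and the experiments' actual schedulers and repository managers are not shown.

```python
# Illustrative only: reading a file served from a farm node's local disk via
# rootd, using PyROOT. The node name, file path, and tree name are hypothetical.
import ROOT

# rootd exports files over the root:// protocol, so data on another node's
# local disk can be read without an NFS mount of that node.
f = ROOT.TFile.Open("root://farmnode0123.example.org//data/run3/dst_000456.root")
if f and not f.IsZombie():
    tree = f.Get("EventTree")          # hypothetical tree name
    print("entries:", tree.GetEntries() if tree else 0)
    f.Close()
```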

The Linux Farm (photo slide)

The Linux Farm
- Most RHIC nodes recently upgraded to the latest RH8 revision (ATLAS still at RH7.3).
- Installation of a customized image via a Kickstart server.
- Support for networked file systems (NFS, AFS) as well as distributed local data storage.
- Support for open-source and commercial compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel).

Linux Farm - Batch Management
- Central Reconstruction Farm:
  - Up to now, data reconstruction was managed by a locally produced Perl-based batch system.
  - Over the past year, this has been completely rewritten as a Python-based custom frontend to Condor (see the sketch below):
    - Leverages DAGMan functionality to manage job dependencies.
    - The user defines a task using a JDL identical to that of the former system; the Python DAG-builder then creates the job and submits it to the Condor pool.
    - A Tk GUI is provided so users can manage their own jobs.
    - Job progress and file transfer status are monitored via a Python interface to a MySQL backend.
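A minimal sketch of the DAG-builder idea described above (not the RCF frontend itself): the task is expressed as Condor submit files plus dependencies, written out in DAGMan syntax, and handed to condor_submit_dag. The submit-file names and the job graph here are hypothetical.

```python
# Minimal sketch of a Python DAG-builder of the kind described above: turn a
# simple stage-in -> reconstruct -> stage-out task into a Condor DAGMan
# submission. Not the RCF frontend; file names and the job graph are hypothetical.
import subprocess
from pathlib import Path

def build_dag(task_name, jobs, deps, workdir="."):
    """Write <task_name>.dag with one DAGMan JOB line per Condor submit file,
    plus PARENT/CHILD lines for the dependencies."""
    lines = [f"JOB {name} {submit_file}" for name, submit_file in jobs.items()]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in deps]
    dag_path = Path(workdir) / f"{task_name}.dag"
    dag_path.write_text("\n".join(lines) + "\n")
    return dag_path

def submit_dag(dag_path):
    # condor_submit_dag creates a scheduler job that executes the DAG.
    subprocess.run(["condor_submit_dag", str(dag_path)], check=True)

if __name__ == "__main__":
    jobs = {"stagein": "stagein.sub", "reco": "reco.sub", "stageout": "stageout.sub"}
    deps = [("stagein", "reco"), ("reco", "stageout")]
    submit_dag(build_dag("run3_reco_000123", jobs, deps))
```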

Linux Farm - Batch Management
- Central Reconstruction Farm (cont.):
  - The new system solves the scalability problems of the former system.
  - Currently deployed for one experiment, with the others expected to follow prior to Run-4.

Linux Farm - Batch Management
- Central Analysis Farm:
  - LSF 5.1 is licensed on virtually all nodes, allowing use of CRS nodes in between data reconstruction runs (see the sketch below).
    - One master for all RHIC queues, one for ATLAS.
    - Allows efficient use of limited hardware, including moderation of NFS server loads through (voluntary) shared resources.
    - Peak dispatch rates of up to 350K jobs/week and 6K+ jobs/hour.
  - Condor is being deployed and tested as a possible complement or replacement; still nascent, awaiting some features expected in an upcoming release.
  - Both accept jobs through Globus gatekeepers.
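For concreteness, dispatching a single analysis job to LSF might look like the sketch below; the queue name, log pattern, and script are hypothetical placeholders rather than RCF configuration.

```python
# Illustrative sketch: submitting one analysis job to LSF with bsub, the kind of
# dispatch described above. The queue name and script are hypothetical.
import subprocess

def bsub(command, queue, stdout_log, extra_args=()):
    """Submit a single job to LSF and return bsub's stdout (which contains the job ID)."""
    cmd = ["bsub", "-q", queue, "-o", stdout_log, *extra_args, command]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    # Hypothetical example: run a user analysis script on a short analysis queue.
    # %J in the -o pattern is LSF's substitution for the job ID.
    print(bsub("./run_analysis.sh", queue="cas_short", stdout_log="analysis.%J.log"))
```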

Security & Authentication
- Two layers of firewall, with limited network services and interactive access exclusively through secured gateways.
- Conversion to a Kerberos5-based single sign-on paradigm (see the sketch below):
  - Simplify life by consolidating password databases (NIS/Unix, SMB, AFS, Web); SSH gateway authentication gives password-less access inside the facility with automatic AFS token acquisition.
  - RCF status: AFS/K5 fully integrated; dual K5/NIS authentication, with NIS to be eliminated soon.
  - US ATLAS status: "K4"/K5 parallel authentication paths for AFS, with full K5 integration on Nov. 1; NIS passwords already gone.
  - Ongoing work to integrate K5/AFS with LSF, solve credential forwarding issues with multihomed hosts, and implement a Kerberos certificate authority.
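A minimal sketch of what the K5 sign-on plus automatic AFS token acquisition amounts to, using the standard kinit, aklog, and tokens commands; at the facility this happens at the SSH gateway, and the principal/realm below is a hypothetical placeholder.

```python
# Minimal sketch of the K5 single sign-on flow described above: obtain a
# Kerberos 5 ticket, then derive an AFS token from it. The principal and realm
# are hypothetical placeholders.
import subprocess

def k5_sign_on(principal):
    # kinit prompts for the password and obtains the Kerberos 5 TGT.
    subprocess.run(["kinit", principal], check=True)
    # aklog (OpenAFS) converts the K5 credential into an AFS token for the local cell.
    subprocess.run(["aklog"], check=True)
    # tokens lists the AFS tokens now held, confirming the acquisition.
    subprocess.run(["tokens"], check=True)

if __name__ == "__main__":
    k5_sign_on("user@EXAMPLE.ORG")   # hypothetical principal and realm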

US ATLAS Grid Testbed
- (Architecture diagram slide: Grid job requests from the Internet enter through a Globus gatekeeper and job manager in front of the LSF (Condor) pool; storage shown includes HPSS, ~17 TB of disk (~70 MB/s), a mover node, and a GridFTP door (aftpexp00); supporting services include an AFS server (aafs), an information server (giis01), a Globus RLS server, and hosts atlas02 and amds. A submission sketch follows below.)
- Local Grid development is currently focused on monitoring and user management.
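As a hedged illustration of the gatekeeper path in the diagram, the sketch below sends a trivial test job through a Globus (GT2) gatekeeper to the LSF job manager; the gatekeeper contact string is hypothetical and a valid grid proxy (e.g. from grid-proxy-init) is assumed.

```python
# Illustrative sketch (hypothetical gatekeeper contact string): running a test
# job through a Globus gatekeeper to the LSF job manager, as in the testbed
# diagram above. Requires a valid grid proxy.
import subprocess

def grid_test_job(gatekeeper="gatekeeper.example.org/jobmanager-lsf"):
    # globus-job-run (Globus Toolkit 2) executes a simple remote command via GRAM.
    result = subprocess.run(
        ["globus-job-run", gatekeeper, "/bin/hostname"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("test job ran on:", grid_test_job())
```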

Monitoring & Control
- The facility is monitored by a cornucopia of vendor-provided, open-source, and home-grown software. Recently:
  - Ganglia was deployed on the entire farm, as well as on the disk servers.
  - The Python-based "Farm Alert" scripts were changed from SSH push (slow), to multi-threaded SSH pull (still too slow), to TCP/IP push, which finally solved the scalability issues (see the sketch below).
- Cluster management software is a requirement for Linux farm purchases (VACM, xCAT):
  - Console access, power up/down... really came in useful this summer!
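The TCP/IP push approach that finally scaled could be approximated as below; this is a minimal sketch, not the actual Farm Alert code, and the collector host, port, and message format are hypothetical.

```python
# Minimal sketch of the TCP/IP "push" approach described above: each farm node
# pushes its alerts to a central collector instead of the collector polling
# nodes over SSH. Host name, port, and message format are hypothetical.
import json
import socket
import socketserver

COLLECTOR_HOST = "monitor.example.org"   # hypothetical central collector
COLLECTOR_PORT = 9999                    # hypothetical port

def push_alert(node, message, severity="warning"):
    """Run on each farm node: open a short-lived TCP connection and push one alert."""
    payload = json.dumps({"node": node, "severity": severity, "msg": message})
    with socket.create_connection((COLLECTOR_HOST, COLLECTOR_PORT), timeout=5) as s:
        s.sendall(payload.encode() + b"\n")

class AlertHandler(socketserver.StreamRequestHandler):
    """Run on the collector: one handler invocation per pushed alert."""
    def handle(self):
        alert = json.loads(self.rfile.readline())
        print(f"[{alert['severity']}] {alert['node']}: {alert['msg']}")

if __name__ == "__main__":
    # Threaded server so many nodes can push concurrently.
    with socketserver.ThreadingTCPServer(("", COLLECTOR_PORT), AlertHandler) as srv:
        srv.serve_forever()
```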

The Great Blackout of '03 (photo slide)

Future Plans & Initiatives
- Linux farm expansion this winter: addition of >100 2U servers packed with local disk.
- Plans to move beyond the NFS-served SAN to more scalable solutions:
  - Panasas: file system striping at the block level over distributed clients.
  - dCache: potential for managing a distributed disk repository.
- Continuing development of grid services, with increasing implementation by the two large RHIC experiments.
- A very successful RHIC run with a large, high-quality dataset!