
1 The RHIC Computing Facility at BNL
HEPIX-HEPNT, Vancouver, BC, Canada, October 20, 2003
Ofer Rind, RHIC Computing Facility, Brookhaven National Laboratory

2 RCF - Overview
- Brookhaven National Lab is a multi-disciplinary DOE research laboratory
- RCF was formed in the mid-90s to provide computing infrastructure for the RHIC experiments
- Named US Atlas Tier 1 computing center in the late 90s
- Currently supports both HENP and HEP scientific computing efforts, as well as various general services (backup, email, web hosting, off-site data transfer...)
- 25 FTEs (expanding soon)
- RHIC Run-3 completed in the spring; Run-4 is slated to begin in Dec/Jan

3 RCF Structure

4 Mass Storage
- 4 StorageTek tape silos managed by HPSS (v4.5)
- Upgraded to 37 9940B drives (200GB/cartridge) prior to Run-3 (~2 months to migrate the data)
- Total data store of 836TB (~4500TB capacity)
- Aggregate bandwidth of up to 700MB/s; expect 300MB/s in the next run
- 9 data movers with 9TB of disk (Future: array to be fully replaced with faster disk after the next run)
- Access via pftp and HSI, both integrated with K5 authentication (Future: authentication through Globus certificates)
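The capacity figures on this slide hang together; as a back-of-the-envelope sketch (illustrative arithmetic only, not from the slides), the current store corresponds to roughly 4,200 cartridges and under a fifth of the silos' capacity:

```python
# Rough sanity check on the slide's numbers: 836 TB stored,
# ~4500 TB total capacity, 200 GB per 9940B cartridge.

STORE_TB = 836
CAPACITY_TB = 4500
CARTRIDGE_GB = 200

cartridges_used = STORE_TB * 1000 // CARTRIDGE_GB   # TB -> GB, whole cartridges
utilization = STORE_TB / CAPACITY_TB

print(f"~{cartridges_used} cartridges hold the current store")
print(f"silo utilization: {utilization:.0%}")
```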

5 Mass Storage

6 Centralized Disk Storage
Large SAN served via NFS
- Processed data store, plus user home directories and scratch
- 16 Brocade switches and 150TB of Fibre Channel RAID5 managed by Veritas (MTI & Zzyzx peripherals)
- 25 Sun servers (E450 & V480) running Solaris 8 (load issues with nfsd and mountd precluded an update to Solaris 9)
- Can deliver data to the farm at up to 55MB/sec/server
RHIC and USAtlas AFS cells
- Software repository, plus user home directories
- Total of 11 AIX servers: 1.2TB (RHIC) & 0.5TB (Atlas)
- Transarc on the server side, OpenAFS on the client side
- RHIC cell recently renamed (standardized)

7 Centralized Disk Storage [photos: Zzyzx, MTI, E450s]

8 The Linux Farm
- 1097 dual-CPU Intel VA and IBM rackmounted servers, for a total of 918 kSpecInt2000
- Nodes allocated by experiment and further divided for reconstruction & analysis
- Typically 1GB memory + 1.5GB swap
- Combination of local SCSI & IDE disk, with aggregate storage of >120TB available to users
  o Experiments are starting to make significant use of local disk through custom job schedulers, data repository managers and rootd
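The "data repository manager" idea above can be sketched as a catalog mapping files to the farm nodes that hold local copies, so a scheduler can send jobs to the data rather than pulling it over NFS. This is a minimal illustration, not the facility's actual software; all node and file names are hypothetical:

```python
# Minimal sketch of a local-disk data repository manager: a catalog
# of which farm nodes hold which dataset files locally.

from collections import defaultdict

class DiskRepository:
    def __init__(self):
        self._locations = defaultdict(set)   # filename -> set of node names

    def register(self, node, filename):
        """Record that `node` holds a local copy of `filename`."""
        self._locations[filename].add(node)

    def nodes_for(self, filename):
        """Nodes that could run a job on `filename` without NFS traffic."""
        return sorted(self._locations.get(filename, ()))

# Hypothetical usage:
repo = DiskRepository()
repo.register("rcrs0001", "run3/evt_0042.root")
repo.register("rcrs0017", "run3/evt_0042.root")
print(repo.nodes_for("run3/evt_0042.root"))  # -> ['rcrs0001', 'rcrs0017']
```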

9 The Linux Farm

10 The Linux Farm
- Most RHIC nodes recently upgraded to the latest RH8 revision (Atlas still at RH7.3)
- Installation of a customized image via a Kickstart server
- Support for networked file systems (NFS, AFS) as well as distributed local data storage
- Support for open-source and commercial compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel)
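A Kickstart-driven install of the kind described here is configured by a ks.cfg file. The fragment below is purely hypothetical (server name, password hash and partition sizes invented), sketching the RH8-era directives involved; note the 1.5GB swap matching the farm configuration above:

```text
# Hypothetical minimal Kickstart fragment for a farm-node image
# (illustrative only; the facility's actual ks.cfg is not shown).
install
nfs --server=kickstart.example --dir=/export/rh8
lang en_US
keyboard us
rootpw --iscrypted $1$...
clearpart --all --initlabel
part /     --fstype ext3 --size 4096
part swap  --size 1536              # 1.5GB swap, as on the farm nodes
part /data --fstype ext3 --grow     # remaining local disk for data
%packages
@ Base
openafs-client
```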

11 Linux Farm - Batch Management
Central Reconstruction Farm
- Up to now, data reconstruction was managed by a locally produced Perl-based batch system
- Over the past year, this has been completely rewritten as a Python-based custom frontend to Condor
  o Leverages DAGMan functionality to manage job dependencies
  o User defines a task using JDL identical to the former system's; a Python DAG-builder then creates the job and submits it to the Condor pool
  o A Tk GUI is provided so users can manage their own jobs
  o Job progress and file-transfer status are monitored via a Python interface to a MySQL backend
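The core of a DAG-builder like the one described is generating a DAGMan input file (JOB and PARENT/CHILD lines) from a task's dependency list. This is a hedged sketch of that step only; the JDL parsing and submit-file generation of the real system are omitted, and all names are illustrative:

```python
# Sketch: turn a job list and its dependencies into a Condor DAGMan
# input file, which would then be handed to condor_submit_dag.

def build_dag(jobs, deps):
    """jobs: {name: submit_file}; deps: [(parent, child), ...]"""
    lines = [f"JOB {name} {submit}" for name, submit in sorted(jobs.items())]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in deps]
    return "\n".join(lines) + "\n"

# Hypothetical three-stage reconstruction task:
dag = build_dag(
    {"stage_in": "stage_in.sub", "reco": "reco.sub", "stage_out": "stage_out.sub"},
    [("stage_in", "reco"), ("reco", "stage_out")],
)
print(dag)
```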

12 Linux Farm - Batch Management
Central Reconstruction Farm (cont.)
- The new system solves the scalability problems of the former one
- Currently deployed for one experiment, with others expected to follow prior to Run-4

13 Linux Farm - Batch Management
Central Analysis Farm
- LSF 5.1 licensed on virtually all nodes, allowing use of CRS nodes in between data reconstruction runs
  o One master for all RHIC queues, one for Atlas
  o Allows efficient use of limited hardware, including moderation of NFS server loads through (voluntary) shared resources
  o Peak dispatch rates of up to 350K jobs/week and 6K+ jobs/hour
- Condor is being deployed and tested as a possible complement or replacement; still nascent, awaiting some features expected in an upcoming release
- Both accept jobs through Globus gatekeepers
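As a bit of illustrative arithmetic (not from the slides): the hourly peak is roughly three times what the weekly peak would give if spread evenly, underlining how bursty analysis dispatch is:

```python
# Compare the quoted peak rates: 350K jobs/week vs. 6K+ jobs/hour.
avg_per_hour = 350_000 / (7 * 24)        # weekly peak averaged per hour
peak_per_hour = 6_000

print(f"~{avg_per_hour:.0f} jobs/hour if spread evenly over a week")
print(f"hourly peak is ~{peak_per_hour / avg_per_hour:.1f}x that")
```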

14 Security & Authentication
- Two layers of firewall, with limited network services and interactive access exclusively through secured gateways
- Conversion to a Kerberos5-based single-sign-on paradigm
  o Simplifies life by consolidating password databases (NIS/Unix, SMB, email, AFS, Web); SSH gateway authentication gives password-less access inside the facility with automatic AFS token acquisition
  o RCF status: AFS/K5 fully integrated; dual K5/NIS authentication, with NIS to be eliminated soon
  o USAtlas status: "K4"/K5 parallel authentication paths for AFS, with full K5 integration on Nov. 1; NIS passwords are already gone
  o Ongoing work to integrate K5/AFS with LSF, solve credential-forwarding issues with multihomed hosts, and implement a Kerberos certificate authority
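A single-sign-on setup like this is driven by the clients' krb5.conf. The fragment below is hypothetical (realm and hostnames invented, not the facility's actual configuration); the `forwardable` flag is the piece that matters for passing credentials through the SSH gateways:

```ini
# Hypothetical krb5.conf fragment illustrating the K5 single-sign-on
# setup described above (realm and hosts invented).
[libdefaults]
    default_realm = RCF.EXAMPLE.GOV
    forwardable = true          ; so gateways can forward user credentials

[realms]
    RCF.EXAMPLE.GOV = {
        kdc = kdc1.rcf.example.gov
        admin_server = kdc1.rcf.example.gov
    }

[domain_realm]
    .rcf.example.gov = RCF.EXAMPLE.GOV
```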

15 US Atlas Grid Testbed
[Diagram: Grid job requests arrive from the Internet via a Globus client to a Gatekeeper/job manager fronting the LSF (Condor) pool; HPSS and 17TB of disk (70MB/s) sit behind a mover; hosts include atlas02, aafs, amds, aftpexp00 (GridFtp), giis01 (information server), an AFS server, and a Globus RLS server]
- Local Grid development is currently focused on monitoring and user management

16 Monitoring & Control
- Facility monitored by a cornucopia of vendor-provided, open-source and home-grown software... recently:
  o Ganglia was deployed on the entire farm, as well as on the disk servers
  o Python-based "Farm Alert" scripts were changed from SSH push (slow), to multi-threaded SSH pull (still too slow), to TCP/IP push, which finally solved the scalability issues
- Cluster management software is a requirement for Linux farm purchases (VACM, xCAT)
  o Console access, power up/down... really came in useful this summer!
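The "TCP/IP push" scheme that fixed the Farm Alert scalability problem can be sketched in a few lines: each node opens a connection to a central collector and pushes its alert, rather than the collector polling every node over SSH. A minimal, self-contained illustration (host, port and message format are invented, not the real scripts):

```python
# Sketch of TCP/IP push alerting: nodes connect and push, the
# collector just listens. Demo runs both ends in one process.

import socket
import threading

def collector(sock, results):
    """Accept one connection and record the pushed alert."""
    conn, _ = sock.accept()
    with conn:
        results.append(conn.recv(1024).decode())

def push_alert(host, port, message):
    """What a farm node would run when it detects a problem."""
    with socket.create_connection((host, port)) as s:
        s.sendall(message.encode())

server = socket.socket()
server.bind(("127.0.0.1", 0))          # ephemeral port for the demo
server.listen(1)
port = server.getsockname()[1]

received = []
t = threading.Thread(target=collector, args=(server, received))
t.start()
push_alert("127.0.0.1", port, "rcrs0042: /data 95% full")  # hypothetical alert
t.join()
server.close()
print(received)
```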

17 The Great Blackout of '03

18 Future Plans & Initiatives
- Linux farm expansion this winter: addition of >100 2U servers packed with local disk
- Plans to move beyond the NFS-served SAN with more scalable solutions:
  o Panasas: file-system striping at the block level over distributed clients
  o dCache: potential for managing a distributed disk repository
- Continuing development of grid services, with increasing implementation by the two large RHIC experiments
- A very successful RHIC run with a large, high-quality dataset!

