BNL Site Report
Ofer Rind, Brookhaven National Laboratory
Spring HEPiX Meeting, CASPUR, April 3, 2006


(Brief) Facility Overview
The RHIC/ATLAS Computing Facility is operated by the BNL Physics Department to support the scientific computing needs of two large user communities:
– The RCF is the "Tier-0" facility for the four RHIC experiments
– The ACF is the Tier-1 facility for ATLAS in the U.S.
– Both are full-service facilities
>2400 users, 31 FTE
RHIC Run 6 (polarized protons) started March 5th

Mass Storage
Soon to be in full production:
– Two SL8500 tape libraries: 2 x 6.5K tape slots, ~5 PB capacity (see the quick check below)
– LTO-3 drives: 30 x 80 MB/sec; 400 GB/tape (native)
– All Linux movers: 30 RHEL4 machines, each with 7 Gbps Ethernet connectivity and an aggregate 4 Gbps direct-attached connection to DataDirect S2A fibre channel disk
This is in addition to the 4 STK Powderhorn silos already in service (~4 PB, 20K 9940B tapes)
Transition to HPSS 5.1 is complete
– "It's different": a learning curve due to numerous changes
– PFTP: client incompatibilities and cosmetic changes
Improvements to the Oak Ridge Batch System optimizer
– Code fixed to remove a long-standing source of instability (no crashes since)
– New features being designed to improve access control
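As a quick sanity check on the figures above, a minimal Python sketch (all inputs are taken from the bullets; the small excess over the quoted ~5 PB is rounding):

```python
# Back-of-the-envelope check of the SL8500/LTO-3 numbers quoted above.
# All figures come from the slide; the rest is simple arithmetic.

libraries = 2
slots_per_library = 6500          # "2 X 6.5K tape slots"
native_capacity_gb = 400          # LTO-3 native capacity per cartridge

capacity_pb = libraries * slots_per_library * native_capacity_gb / 1e6
print("Nominal tape capacity: %.1f PB" % capacity_pb)        # ~5.2 PB, consistent with "~5 PB"

drives = 30
drive_rate_mb_s = 80              # LTO-3 native transfer rate
print("Aggregate drive bandwidth: %.1f GB/s" % (drives * drive_rate_mb_s / 1000.0))  # ~2.4 GB/s
```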

Centralized Storage
NFS: currently ~220 TB of FC SAN; 37 Solaris 9 servers
– Over the next year, plan to retire ~100 TB of mostly NFS-served storage (MTI, ZZYZX)
AFS: RHIC and USATLAS cells
– Looking at Infortrend disk (SATA + FC front end + RAID6) for an additional 4 TB (raw) per cell
– Future: upgrade to OpenAFS 1.4
Panasas: 20 shelves, 100 TB, heavily used by RHIC

Panasas Issues
Panasas DirectFlow (version 2.3.2)
– High performance and fairly stable, but problematic from an administrative perspective:
– Occasional stuck client-side processes left in uninterruptible sleep (see the sketch below)
– The DirectFlow module causes kernel panics from time to time; a kernel with panfs mounted can always be panicked by running a Nessus scan on the host
– Changes in ActiveScale server configuration (e.g. changing the IP addresses of non-primary director blades), which the company claims are innocuous, can cause clients to hang
Server-side NFS limitations
– NFS mounting was tried and found to be unfeasible as a fallback option with our configuration: heavy NFS traffic causes director blades to crash; Panasas suggests limiting to <100 clients per director blade
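A minimal sketch of the kind of check that can flag the stuck clients described above: it walks /proc and reports processes sitting in uninterruptible sleep (state 'D'). This is illustrative only, not the facility's actual tooling.

```python
#!/usr/bin/env python
# List processes stuck in uninterruptible sleep (state 'D'), the symptom
# described above for hung DirectFlow clients.  Illustrative sketch only.
import os

def d_state_processes():
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                fields = f.read().split()
            # fields: pid, (comm), state, ...  (naive parse; assumes no
            # spaces in the command name, which is fine for a sketch)
            comm, state = fields[1].strip("()"), fields[2]
        except (IOError, IndexError):
            continue  # process exited while we were reading
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print("PID %d (%s) is in uninterruptible sleep" % (pid, comm))
```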

Update on Security
Nessus scanning program implemented as part of the ongoing DOE C&A process
– Constant low-level scanning
– Quarterly scanning is more intensive, so a port exclusion scheme protects sensitive processes
Samhain
– Filesystem integrity checker (akin to Tripwire) with central management of monitored systems (concept illustrated below)
– Currently deployed on all administrative systems
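For context, the core idea behind a Samhain-style checker is a baseline of cryptographic file hashes that is periodically re-verified. The toy Python sketch below illustrates that concept only; it is not Samhain and not the configuration deployed at BNL, and the watched paths and baseline location are placeholders.

```python
#!/usr/bin/env python
# Toy illustration of what a file-integrity checker such as Samhain does:
# record a baseline of file hashes, then report anything that changed.
# Not Samhain itself, and not BNL's actual configuration.
import hashlib, json, os, sys

WATCHED = ["/etc/passwd", "/etc/ssh/sshd_config"]   # example paths only
BASELINE = "/var/tmp/integrity_baseline.json"       # example location only

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot():
    return {p: digest(p) for p in WATCHED if os.path.exists(p)}

if __name__ == "__main__":
    if sys.argv[1:] == ["init"]:
        json.dump(snapshot(), open(BASELINE, "w"))          # record baseline
    else:
        baseline = json.load(open(BASELINE))                # verify against it
        for path, old in baseline.items():
            new = digest(path) if os.path.exists(path) else None
            if new != old:
                print("CHANGED or MISSING: %s" % path)
```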

Linux Farm Hardware
>4000 processors, >3.5 MSI2K
~700 TB of local storage (SATA, SCSI, PATA)
SL 3.0.5 for RHIC, SL 3.0.3 for ATLAS
Evaluated dual-core Opteron and Xeon for the upcoming purchase
– Recently encountered problems with Bonnie++ I/O tests using 64-bit RHEL4 with software RAID+LVM on Opteron
– Xeon (Paxville) gives poor SPECint-per-watt performance

Power & Cooling
Power and cooling are now significant factors in purchasing
Added 240 kW to the facility for '06 upgrades
– Long term: possible site expansion
Liebert XDV vertical top cooling modules to be installed on new racks
CPU and ambient temperature monitoring via dtgraph and custom Python scripts (see the sketch below)
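The monitoring scripts themselves are not shown in the slides; the sketch below illustrates the general approach on a Linux host that exposes thermal zones under /sys. The paths and the alarm threshold are assumptions, not BNL's actual configuration.

```python
#!/usr/bin/env python
# Minimal temperature poll of the kind described above.  Reads whatever
# thermal zones the kernel exposes under /sys; the path and the alarm
# threshold are illustrative assumptions, not BNL's actual setup.
import glob

ALARM_C = 45.0   # hypothetical alert threshold in degrees Celsius

def read_zones():
    readings = {}
    for zone in sorted(glob.glob("/sys/class/thermal/thermal_zone*")):
        try:
            with open(zone + "/temp") as f:
                millideg = int(f.read().strip())
        except (IOError, ValueError):
            continue
        readings[zone.split("/")[-1]] = millideg / 1000.0
    return readings

if __name__ == "__main__":
    for name, celsius in sorted(read_zones().items()):
        flag = "  <-- over threshold" if celsius > ALARM_C else ""
        print("%s: %.1f C%s" % (name, celsius, flag))
```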

Distributed Storage
Two large dCache instances (v1.6.6) deployed in a hybrid server/client model:
– PHENIX: 25 TB disk, 128 servers, >240 TB of data
– ATLAS: 147 TB disk, 330 servers, >150 TB of data
– Two custom HPSS back-end interfaces
– Performance tuning on ATLAS write pools
– Peak transfer rates of >50 TB/day (see the rate conversion below)
Other: large deployments of Xrootd (STAR), rootd, and anatrain (PHENIX)
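For scale, the quoted peak of >50 TB/day corresponds to a sustained network rate of roughly 4.6 Gb/s; a short conversion:

```python
# Convert the quoted peak of >50 TB/day into a sustained network rate.
tb_per_day = 50
bits_per_day = tb_per_day * 1e12 * 8
print("%.1f Gb/s sustained" % (bits_per_day / 86400 / 1e9))   # ~4.6 Gb/s
```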

Batch Computing
All reconstruction and analysis batch systems have been migrated to Condor, except STAR analysis (which still awaits features such as global job-level resource reservation) and some ATLAS distributed analysis; these still use LSF 6.0
Configuration:
– Five Condor (6.6.x) pools under two central managers
– 113 available submit nodes
– One monitoring/CondorView server and one backup central manager
Lots of performance tuning (see the inspection sketch below)
– Autoclustering of jobs for scheduling; timeouts; negotiation cycle; socket cache; collector query forking; etc.
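The slides list the tuning areas but not the exact parameters. As a hedged illustration, the sketch below uses condor_config_val (a standard Condor tool) to print a few HTCondor configuration macros corresponding to those areas; the macro names exist in HTCondor documentation, but whether these were the exact knobs tuned in the 6.6.x pools at BNL is not stated in the slides.

```python
#!/usr/bin/env python
# Inspect a few negotiator/collector settings of the kind mentioned in the
# tuning list above, via condor_config_val.  The macro names are current
# HTCondor knobs covering those areas; the exact parameters and values
# used at BNL are assumptions, not taken from the slides.
import subprocess

MACROS = [
    "NEGOTIATOR_INTERVAL",           # how often a negotiation cycle starts
    "NEGOTIATOR_SOCKET_CACHE_SIZE",  # negotiator's cache of schedd sockets
    "SIGNIFICANT_ATTRIBUTES",        # attributes used to autocluster jobs
    "COLLECTOR_QUERY_WORKERS",       # forked workers answering collector queries
]

for macro in MACROS:
    try:
        value = subprocess.check_output(
            ["condor_config_val", macro], stderr=subprocess.STDOUT
        ).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        value = "(not defined on this host)"
    print("%-30s %s" % (macro, value))
```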

Condor Usage
Use of a heavily modified CondorView client to display historical usage

Condor Flocking
Goal: full utilization of computing resources on the farm
Increasing use of a "general queue" which allows jobs to run on idle resources belonging to other experiments, provided that there are no local resources available to run the job
Currently, such "opportunistic" jobs are immediately evicted if a local job places a claim on the resource
>10K jobs completed so far

Condor Monitoring
Nagios and custom scripts provide live monitoring of critical daemons
Job history from ~100 submit nodes is placed into a central database (see the sketch below)
– This model will be replaced by Quill
– Custom statistics are extracted from the database (e.g. general queue usage, throughput, etc.)
Custom startd, schedd, and "startd cron" ClassAds allow quick viewing of the state of the pool using Condor commands
– Some information is accessible via a web interface
Custom startd ClassAds allow remote and peaceful turn-off of any node
– Not available in Condor itself
– Note that the "condor_off -peaceful" command (v6.8) cannot be canceled; one must wait until running jobs exit
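A rough sketch of the per-submit-node collection step described above, assuming completed-job ClassAds are read from condor_history and pushed into MySQL; the database host, table layout, and choice of attributes are hypothetical, not the facility's actual schema.

```python
#!/usr/bin/env python
# Rough sketch of the per-submit-node collection step: read completed-job
# ClassAds from condor_history and push a few fields into a central MySQL
# database.  Table layout, DB host, and attribute choice are illustrative
# assumptions, not BNL's actual schema.
import subprocess
import MySQLdb  # provided by the MySQL-python package

def history_ads():
    """Yield one attribute->value dict per job from 'condor_history -l'."""
    out = subprocess.check_output(["condor_history", "-l"]).decode()
    for block in out.split("\n\n"):          # jobs are blank-line separated
        ad = {}
        for line in block.splitlines():
            if " = " in line:
                key, value = line.split(" = ", 1)
                ad[key.strip()] = value.strip().strip('"')
        if ad:
            yield ad

def upload(ads, host="condor-db.example.bnl.gov"):   # hypothetical DB host
    db = MySQLdb.connect(host=host, db="condor_history", user="collector")
    cur = db.cursor()
    for ad in ads:
        cur.execute(
            "INSERT INTO job_history (cluster, owner, status) VALUES (%s, %s, %s)",
            (ad.get("ClusterId"), ad.get("Owner"), ad.get("JobStatus")),
        )
    db.commit()

if __name__ == "__main__":
    upload(history_ads())
```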

Nagios Monitoring
Monitoring 1963 hosts, with an average of 7 services checked per host
Originally had one Nagios server (dual 2.4 GHz):
– Tremendous latency: services were reported down many minutes after the fact
– Web interface completely unusable (due to the number of hosts and services), despite a lot of Nagios and system tuning:
– Nagios data written to a ramdisk
– Increased the number of file descriptors and processes allowed
– Monitoring data read from a MySQL database on a separate host
– Web interface replaced with a lightweight interface to the database server
Solution: split the services roughly in half between two Nagios servers
– Latency is now very good
– Events from both servers are logged to one MySQL server
– With two servers there is still room for many more hosts and a handful of additional service checks

Nagios Monitoring (figure-only slide)

ATLAS Tier-1 Activities
OSG 0.4; LCG 2.7 (this week)
ATLAS Panda (Production And Distributed Analysis) used for production since Dec. '05
– Good performance in scaling tests, with low failure rate and manpower requirements
Network upgrade
– 2 x 10 Gigabit LAN and WAN
TeraPaths QoS/MPLS (BNL, UM, FNAL, SLAC, ESnet)
– DOE-supported project to introduce end-to-end QoS networking into data management
– Ongoing intensive development with ESnet
– SC 2005

ATLAS Tier-1 Activities
SC3 Service Phase (Oct-Dec '05)
– Functionality validated for the full production chain to the Tier-1
– Exposed some interoperability problems between BNL dCache and FTS (now fixed)
– Further improvement needed in operation, performance, and monitoring
SC3 Rerun Phase (Jan-Feb '06)
– Achieved performance (disk-disk, disk-tape) and operations benchmarks

ATLAS Tier-1 Activities
SC4 plan:
– Deployment of storage element, grid middleware (LFC, LCG, FTS), and the ATLAS VO box
– April: data throughput phase (disk-disk and disk-tape); goal is T0-to-T1 operational stability
– May: T1-to-T1 data exercise
– June: ATLAS data distribution from T0 to T1 to selected T2s
– July-August: limited distributed data processing, plus analysis
– Remainder of '06: increasing scale of data processing and analysis

Recent p-p Collision in STAR (figure-only slide)