BNL RACF Site Report
HEPiX Spring 2015 – Oxford, UK
William Strecker-Kellogg

RHIC/ATLAS Computing Facility
• Brookhaven National Lab, about 70 miles from NYC; hosts RHIC
• Provides computing services for both RHIC experiments and serves as the US ATLAS Tier-1
  - Around 46k CPU cores of computing
• Supports smaller experiments (LSST, Daya Bay, LBNE, EIC, etc.)
  - <1k CPU cores of computing
(Machine rooms pictured: BCF, CDCE, Sigma-7)

Linux Farm Updates
• RHIC purchased 160 Ivy Bridge servers from Dell
  - 62 kHS06
  - 6.7 PB disk storage (12x 4 TB drives in RAID5)
• USATLAS bought 77 Ivy Bridge nodes
  - 30 kHS06
  - Disk-light, used for scratch space only
• Evaluating Haswell; see Tony Wong's presentation on remote evaluation

Infrastructure Updates
• New CRAC unit providing 30 tons of additional cooling to BCF
• Future plans:
  - Replace aging PDUs and upgrade APC batteries
  - Provide an additional 200 kW of UPS power to CDCE
  - Refurbish 40-year-old air handlers in the basement of BCF
• Formulating a proposal to upgrade Building 725 (old NSLS) into a datacenter
  - Little space left in our rooms; additional space is needed to support NSLS-II and other research activities around the lab

Evaluating Software
• Looking at Docker and batch-system support for containers
  - Useful for supporting non-native (Linux-based) workloads on our dedicated systems (a submission sketch follows below)
• Evaluation of SL7
  - Support in our configuration-management system is ready
  - Already used for new infrastructure (DNS hosts, NTP servers, etc.)
  - No plans to upgrade the farm soon
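As a purely illustrative example of what batch-system container support looks like (not the RACF's actual configuration), the sketch below submits a containerized job through HTCondor's docker universe using the htcondor Python bindings. The choice of HTCondor, the image name, and all paths are assumptions.

```python
# Minimal sketch: submit a containerized job via HTCondor's docker universe
# using the htcondor Python bindings. Assumes a docker-universe-enabled
# HTCondor pool with the bindings installed; image name and paths are
# placeholders, not the RACF's setup.
import htcondor

sub = htcondor.Submit({
    "universe": "docker",
    "docker_image": "centos:7",           # hypothetical non-native OS image
    "executable": "/bin/hostname",        # placeholder payload
    "output": "container_test.$(Cluster).out",
    "error": "container_test.$(Cluster).err",
    "log": "container_test.log",
    "request_cpus": "1",
    "request_memory": "1024",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:         # classic bindings submit pattern
    cluster_id = sub.queue(txn)
print("submitted cluster", cluster_id)
```

The point of the pattern is that the batch system, not the user, invokes Docker on the worker node, so a non-native OS environment can run unmodified on the dedicated SL-based farm.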

HPSS Tape Storage
• Over 60 PB on 53K tapes
• Writing new RHIC data to LTO-6 tapes, 16 drives
  - ATLAS moving to LTO-6 soon
• Migration from lower-capacity cartridges ongoing (LTO-3 → LTO-5)
• Completed testing of dual-copy with mixed T10KD and LTO-6 cartridges
  - Copying of old tapes halted during the RHIC run

HPSS Tape Storage (cont.)
• Performance measures (a back-of-envelope example follows below):
  - New LTO-6 drives get 133 MB/s writes
  - T10KD drives write at 244 MB/s
• New bug discovered while repacking tapes
  - Intensive repacking leads to monitoring issues that have disrupted the production system
• Upgrade planned by the end of 2015 for bugfixes and RHEL7 client support
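To put the per-drive rates in perspective, a back-of-envelope calculation gives the aggregate bandwidth available for new RHIC data. Only the 16-drive count and the two per-drive rates come from the slides; the rest is illustrative arithmetic, and real sustained rates will be lower once mounts, seeks, and repacks are included.

```python
# Back-of-envelope aggregate tape bandwidth from the quoted per-drive rates.
# Only the 16 LTO-6 drives and the two rates come from the slides; sustained
# real-world throughput will be lower (mounts, seeks, repacks).
LTO6_RATE_MBS = 133       # MB/s per LTO-6 drive (quoted)
T10KD_RATE_MBS = 244      # MB/s per T10KD drive (quoted)
N_LTO6_DRIVES = 16        # drives writing new RHIC data (quoted)

aggregate_mbs = N_LTO6_DRIVES * LTO6_RATE_MBS          # ~2128 MB/s
per_day_tb = aggregate_mbs * 86400 / 1e6               # MB/s -> TB/day
print(f"LTO-6 aggregate: {aggregate_mbs} MB/s, roughly {per_day_tb:.0f} TB/day")

# Time to fill one 2.5 TB (native) LTO-6 cartridge at full streaming speed:
hours_per_tape = 2.5e6 / LTO6_RATE_MBS / 3600
print(f"~{hours_per_tape:.1f} h to fill one LTO-6 cartridge per drive")
```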

Central Storage: NFS
• 2 Hitachi HNAS-4100 heads are in production
  - Purchased an additional 2 HNAS-4100 heads
• Phased out 4 Titan-3200s; an additional 2 Titans will remain in production through this year
• Mercury cluster for ATLAS is being phased out
  - Goal: no dependency on local NFS storage for ATLAS jobs

Central Storage: GPFS
• Fully Puppet-deployable on the client
  - Server side still needs separate configuration; could be solved with exported resources
• Updated client and server to a newer release with no problems
• Sustained spikes of up to 24 GB/s observed without problems for the clients
• Had to put a cgroup throttle on xrootd's /dev/sda usage because GPFS commands saw enough latency to drop the node from the cluster (a sketch of such a throttle follows below)
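The throttle mentioned in the last bullet can be pictured with the following minimal sketch of a cgroup-v1 blkio limit. The cgroup name, device major:minor, and bandwidth cap are placeholders, not the values used in production at BNL.

```python
# Minimal sketch of a cgroup-v1 blkio throttle of the kind described above.
# The cgroup path, device major:minor (8:0 is conventionally /dev/sda) and the
# 100 MiB/s caps are illustrative assumptions; needs root on a cgroup-v1 host.
import os
import subprocess

CGROUP = "/sys/fs/cgroup/blkio/xrootd_throttle"   # hypothetical cgroup path
DEVICE = "8:0"                                    # major:minor of /dev/sda
LIMIT_BPS = 100 * 1024 * 1024                     # 100 MiB/s cap

os.makedirs(CGROUP, exist_ok=True)

# Cap read and write bandwidth on the device for members of this cgroup.
for knob in ("blkio.throttle.read_bps_device", "blkio.throttle.write_bps_device"):
    with open(os.path.join(CGROUP, knob), "w") as f:
        f.write(f"{DEVICE} {LIMIT_BPS}\n")

# Move running xrootd processes into the throttled cgroup
# (the kernel expects one PID per write to cgroup.procs).
pids = subprocess.check_output(["pgrep", "-x", "xrootd"]).split()
for pid in pids:
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write(pid.decode())
```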

Distributed Storage Updates
• PHENIX dCache and STAR XrootD now over 7.5 PB each
• "Raw" data on PHENIX dCache expanded to 3 PB
  - Used for tape-free "fast" production runs
  - A portion is used as a repository for special PHENIX jobs running on OSG resources
• Scratch space for PHENIX production moved from dCache into GPFS
  - Performing very well

ATLAS Storage
• dCache upgraded to a newer version
• ATLAS xrootd N2N plugin on the xrootd door
  - Provides ATLAS N2N and redirection to a remote redirector when a file is not found
• Xrootd versions: reverse proxy on 4.x.y (latest git, for bugfixes); forward proxy and FTS also in use
• ATLAS jobs
  - Moving from copy-to-scratch to direct I/O via the xrootd door (a direct-read sketch follows below)
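For a concrete picture of what direct I/O through the xrootd door means on the client side, here is a minimal sketch using the XRootD Python bindings. The door hostname and file path are placeholders, and real ATLAS jobs typically perform such reads through ROOT rather than a hand-written client like this.

```python
# Minimal sketch of direct I/O against an xrootd door with the XRootD Python
# bindings (pyxrootd). Hostname and path are placeholders, not real endpoints.
from XRootD import client
from XRootD.client.flags import OpenFlags

URL = "root://xrootd-door.example.bnl.gov//atlas/rucio/some/dataset/file.root"

f = client.File()
status, _ = f.open(URL, OpenFlags.READ)
if not status.ok:
    raise RuntimeError(status.message)

# Read the first 1 MB directly over the network; no copy-to-scratch step.
status, data = f.read(offset=0, size=1024 * 1024)
print("read", len(data), "bytes, status:", status.message)
f.close()
```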

Ceph Developments
• Storage backend extended to 85 disk arrays
  - 3.7k 1 TB HDDs with 3x replication, for a usable capacity of ~1 PB (a capacity-check sketch follows below)
• Head-node network can now handle 50 GB/s
• Running 2 separate clusters: main ATLAS production (60%) and federated Ceph (40%)
• New cluster components based on recent kernel and Ceph releases
• Tests of CephFS and mounted RBD volumes performed with 70x 1 GbE-attached clients
• Long-term CephFS stability tests (recent kernel and Ceph versions) show stability across a 6-month period
• Please refer to the dedicated talk by Ofer Rind
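As a small illustration of how such a cluster's capacity and pools can be inspected, the sketch below uses the python-rados bindings; the configuration path is an assumed default and this is not the monitoring actually used at BNL.

```python
# Small illustration (not BNL's monitoring): inspect raw usage and pools on a
# Ceph cluster via the python-rados bindings. The conf path is an assumed
# default; figures are raw totals across OSDs, i.e. before 3x replication.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # assumed default path
cluster.connect()

stats = cluster.get_cluster_stats()   # keys: kb, kb_used, kb_avail, num_objects
raw_total_tb = stats["kb"] / 1e9      # KB -> TB (decimal, approximate)
raw_used_tb = stats["kb_used"] / 1e9
print(f"raw: {raw_used_tb:.1f} / {raw_total_tb:.1f} TB used "
      f"(~{raw_total_tb / 3:.1f} TB usable at 3x replication)")
print("pools:", cluster.list_pools())

cluster.shutdown()
```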

S3 Ceph Throughput Test
• Proxy was populated with 202 x 3.6 GB files to run a custom job accessing Ceph via python/Boto (a simplified sketch follows below)
• Test server specifications:
  - 47 GB memory total
  - 16 processors
  - 2x 10 Gb/s LACP network connectivity
  - Data stored on a DS3500 backend storage array
  - 17 SAS disks (15k RPM) configured in a RAID 0 layout
• (Diagram: Client → Proxy → Storage; destination is a BNL Ceph S3 bucket on the local LAN, accessed through 2 RADOS gateways)
• Two copy jobs started sequentially, 10 minutes apart
  - Each job copied 202 x 3.6 GB files
  - 128 parallel transfers allowed per job
  - 242 MB/s maximum network throughput observed
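The copy job itself was python/Boto based; the sketch below shows the general shape such a parallel S3 upload against Ceph RADOS gateways could take with boto 2 and a multiprocessing pool. The gateway host, credentials, bucket name, file list, and the use of multiprocessing are all assumptions; this is not the actual BNL test harness.

```python
# Minimal sketch of a python/Boto (boto 2) parallel copy into an S3 bucket
# served by Ceph RADOS gateways, in the spirit of the test above. Endpoint,
# credentials, bucket and file list are placeholders; the bucket is assumed
# to already exist.
import glob
from multiprocessing import Pool

import boto
import boto.s3.connection

GATEWAY = "radosgw.example.bnl.gov"                   # hypothetical RGW endpoint
BUCKET = "s3-throughput-test"                         # hypothetical bucket
FILES = sorted(glob.glob("/data/testfiles/*.dat"))    # e.g. 202 x 3.6 GB files

def upload(path):
    # One connection per worker process, straight to the RADOS gateway.
    conn = boto.connect_s3(
        aws_access_key_id="ACCESS_KEY",               # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
        host=GATEWAY,
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    key = conn.get_bucket(BUCKET).new_key(path.rsplit("/", 1)[-1])
    key.set_contents_from_filename(path)              # one PUT per file
    return path

if __name__ == "__main__":
    with Pool(processes=128) as pool:                 # 128 transfers in parallel
        for done in pool.imap_unordered(upload, FILES):
            print("copied", done)
```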

Configuration Management
• Puppet
  - Migrated away from Passenger/Apache to the new puppetserver
  - Faster catalog compilation times, fewer errors seen
  - Working on Jenkins automated integration testing
• See William Strecker-Kellogg's talk on Puppet later in the week

OpenStack in Production
• Icehouse cluster
  - 47 hosts (16 CPUs, 32 GB RAM, 750 GB disk each)
• Second equivalent test cluster slated for Juno
• Used internally for ATLAS, plus 3 external tenants:
  - ATLAS Tier-3
  - BNL Biology Group
  - BNL Computer Sciences Group

Amazon Pilot Project
• Since Sept. 2014, a major project to run ATLAS at full scale on EC2
• Compute: 3 regions, ~5k-node test in November
• Storage: EC2-based SE (SRM + s3fs)
• ATLAS application customizations:
  - S3-enabled stage-in/out
  - Event service (checkpointing) to leverage spot pricing (a spot-request sketch follows below)
• See talks later in the week by John Hover (BNL) and Dario Rivera (Amazon)
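Spot pricing is what makes the event-service checkpointing worthwhile, since spot instances can be reclaimed at any time. The sketch below illustrates a basic boto 2 spot request; the region, bid price, AMI, instance type, key pair, and security group are placeholders, and the pilot project's real provisioning layer is not shown here.

```python
# Minimal sketch of requesting EC2 spot capacity with boto 2, illustrating the
# spot mechanism the event service is designed to tolerate. All identifiers
# below are placeholders; credentials come from the environment or boto config.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")    # one of several EC2 regions

requests = conn.request_spot_instances(
    price="0.10",                        # max bid in USD/hour (placeholder)
    image_id="ami-00000000",             # placeholder worker-node AMI
    count=50,                            # placeholder request size
    instance_type="m3.xlarge",           # placeholder instance type
    key_name="racf-pilot",               # placeholder key pair
    security_groups=["atlas-workers"],   # placeholder security group
)

for req in requests:
    print("spot request", req.id, "state:", req.state)
```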

The End
Questions? Comments? Thank You!