NET2

Presentation transcript:

Daily Busiest Hour for the Past 15 Months
(Plot; annotations mark the FY15, FY16+RBT, and Bridge+storage additions.)

We let OSG fill in the gaps during ATLAS production lulls.
(Plot of NET2 running cores.) LIGO computing at one of the US ATLAS Tier 2s peaks at about 700 cores on the day before the LIGO press conference announcing the second black hole merger.

All FY16, RBT, Bridge and Storage funds spent, installed and on-line: 9000 cores, 97k HS06, 5.5 PB.

GPFS Main Storage
Warranty: 2015-08 (3 TB), 2016-02 (4 TB), 2018-01 (6 TB), 2020-12, 2021-08.
290 Gb/s total bandwidth.
6.0 PB total usable storage.
0.5 PB about to retire, 1.0 PB additional in 0.5-1.0 years.

LSM
We have custom Python code written to Marco Mambelli's original specifications (still in effect, we understand). It:
Lets us control cp, scp, direct, <other> for different nodes (BU, Harvard, MOC).
Allows us to control where the LSM traffic goes, e.g. over the 20G dedicated link from BU to HU.
Allows us to control the burstiness of reads/writes to GPFS, which can otherwise cause problems.
Allows us to efficiently compute Adler checksums on the GPFS servers, storing them in extended attributes.
Plan to add S3 and/or librados I/O to NESE.
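As an illustration of the checksum point above, here is a minimal sketch of computing an Adler32 and caching it in an extended attribute; the attribute name and chunk size are assumptions for illustration, not the actual LSM code.

```python
import os
import zlib

# Assumed attribute name for illustration; the real LSM convention may differ.
XATTR_NAME = b"user.checksum.adler32"

def adler32_of(path, chunk_size=64 * 1024 * 1024):
    """Compute the Adler32 checksum of a file, reading in large chunks."""
    value = 1  # Adler32 starts from 1
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

def store_checksum(path):
    """Compute the checksum and cache it as an extended attribute on the file."""
    checksum = f"{adler32_of(path):08x}"
    os.setxattr(path, XATTR_NAME, checksum.encode())
    return checksum

def get_checksum(path):
    """Return the cached checksum if present, otherwise compute and cache it."""
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except OSError:
        return store_checksum(path)
```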

SRM Migration
Plan to migrate main storage to NESE.
Most likely use the object store (probably S3).
FTS transfers either via GridFTP or an FTS S3 endpoint directly.
Convert the LSM to use S3.
Xrootd: TBD with Wei.
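If the LSM is converted to S3, the stage-in/stage-out calls essentially reduce to object GET/PUT; a rough sketch with boto3, where the endpoint URL and bucket name are placeholders rather than the actual NESE configuration.

```python
import boto3

# Placeholder endpoint and bucket; the real NESE object store details would differ.
NESE_ENDPOINT = "https://nese.example.org"
BUCKET = "net2-atlas"

s3 = boto3.client("s3", endpoint_url=NESE_ENDPOINT)

def lsm_get(key, local_path):
    """Stage a file from the object store to local (e.g. worker-node) storage."""
    s3.download_file(BUCKET, key, local_path)

def lsm_put(local_path, key):
    """Upload a local file to the object store."""
    s3.upload_file(local_path, BUCKET, key)
```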

NET2 Current Networking
(Network diagram; recoverable labels:) Boston University Nexus 7000 (32 x 10G); Northern Crossroads ASR (4 x 10G); 20G BU-Harvard link; Cisco N5K; S4810 (48 x 10G + 4 x 40G uplink); 25 x 10G; two S4048 switches; GPFS; 2 x 36 worker nodes with 10G each.

Incoming GridFTP transfers resume at 20G following the switch to new Cisco equipment at NoX.
(Plot of incoming bytes per second.)

Joined LHCONE
Haven't seen >12 Gb/s since... needs testing.
No saturation problems so far.
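One simple way to do the "needs testing" part is a scripted multi-stream throughput check toward a host on the LHCONE path; a sketch assuming iperf3 is installed and a cooperating iperf3 server exists at the hypothetical hostname below.

```python
import json
import subprocess

# Hypothetical test endpoint on the LHCONE path; replace with a real perfSONAR/iperf3 host.
TEST_HOST = "iperf.example.net"

def measure_throughput(host=TEST_HOST, streams=8, seconds=30):
    """Run a multi-stream iperf3 test and return the achieved rate in Gb/s."""
    result = subprocess.run(
        ["iperf3", "-c", host, "-P", str(streams), "-t", str(seconds), "--json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9

if __name__ == "__main__":
    print(f"Achieved {measure_throughput():.1f} Gb/s")
```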

Networking Plan: take maximum advantage of the NESE project.
Network via NESE via Harvard: 100 Gb/s to NoX/Internet2 and LHCONE.
Upgrade to multi x 40 Gb/s WAN, avoiding Cisco gear.
Initially need ~4 x 40G from NET2 to the NESE fabric.
NESE networking is being planned for IPv6. This is connected with re-thinking networking on the MGHPCC floor.

LOCALGROUPDISK It is likely our LOCALGROUPDISK is full of old data that nobody cares about anymore.
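A quick way to put a number on that would be to sum up space by last-modification age under the LOCALGROUPDISK path; a rough sketch in which the mount point and the two-year cutoff are assumptions (in practice this would more likely be driven from Rucio accounting or storage-dump data).

```python
import os
import time

# Assumed mount point and age threshold; not the actual NET2 paths or policy.
LOCALGROUPDISK = "/gpfs/atlaslocalgroupdisk"
CUTOFF_SECONDS = 2 * 365 * 24 * 3600  # "old" = not modified in ~2 years

def old_data_bytes(root=LOCALGROUPDISK, cutoff=CUTOFF_SECONDS):
    """Return (old_bytes, total_bytes) for files under root, judged by mtime age."""
    now = time.time()
    old, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            total += st.st_size
            if now - st.st_mtime > cutoff:
                old += st.st_size
    return old, total

if __name__ == "__main__":
    old, total = old_data_bytes()
    print(f"{old / 1e12:.1f} TB of {total / 1e12:.1f} TB not touched in 2+ years")
```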

HTCONDOR-CE Status: Both BU and HU converted

Plan for SL7 Migration
Wait for someone else to try first.
Test that GPFS still works.
Migrate as before, coordinating with other T2s.
We have done this before, going from RH 3 -> 4 -> 5 -> 6.
We don't anticipate problems.

Current Worker Nodes: 307 nodes, 8076 cores, 82k HS06.

Same for NET2/Harvard: GRAM/LSF to HTCONDOR-CE/SLURM.

One Entire Year of GGUS Tickets!
Aborted attempt to move WAN to new NoX routers.
McGill networking problem in Canadian WAN.
Request to install webdav (not a site issue).
McGill networking problem in Canadian WAN.
McGill networking problem in Canadian WAN.
Burst of SRM attempts to delete non-existing files caused Bestman to clog up.
Similar GPFS migration problem. Fixed in 1 hour.
GPFS migration slowdown and DDM errors. Fixed in 5 hours.
SRM problem caused deletion errors. Fixed in 6 hours.
Trivial problem with requested storage dumps. Fixed in 10 minutes.
One task had low efficiency due to WAN saturation from NET2 being used as a "nucleus" site.
SRM problem. Fixed in 10 minutes.
GPFS experiment during low PanDA production gone wrong. 7 hours to recover.
Request for storage dumps (not a site problem).
SRM deletion problem. Fixed in ~1 hour.