Australia Site Report Sean Crosby DPM Workshop – 13 December 2013

Australia-ATLAS
- Supports the ATLAS VO since inception; Tier 1 site is TRIUMF (Canada cloud)
- 10 Gb WAN link through Seattle
- Aim to deliver, and are delivering, 2% of ATLAS compute
- 2 (soon to be 3) full-time sysadmins
- 1000 cores, 900 TB storage
- Support 100 researchers (academics, postdocs, PhD and Masters students)

Australia-ATLAS
- DPM 1.8.7 / SL6
- 11 disk servers
- 2x10GbE to switch (most compute nodes 1GbE)

Australia-ATLAS
- Standalone head node, 1GbE
- SRMv2.2, xrootd, rfio, gsiftp, webdav, dpm, dpns
- Provides storage for physical cores (on our network) and Cloud cores (see later)
- Database moved to a dedicated DB server, which serves MySQL (MariaDB) and Postgres for all our other services
- Result: load on the head node never exceeds 0.2!

Progress this year
- Moved head node to SL6/Puppet control, and all storage nodes to SL6
- Drained 150 TB from old nodes and decommissioned them
- HammerCloud stress tested the setup: before the addition of new storage, on SL5, and on SL6
- Changed Cloud queues to use xrootd for stage-in (was rfio)
- Removed firewall from the TRIUMF link

HammerCloud tests
- Got great efficiency benefits, and stage-in time also improved
- Configurations tested:
  - SL5 compute and old storage (some with 1GbE connections)
  - SL5 compute and new storage (all 10GbE)
  - SL6 compute and new storage (all 10GbE)

HammerCloud tests
- Tests were done using rfio for stage-in
- Want to repeat the tests with SL6 / xrootd stage-in / xrootd direct I/O
- Have to wait until the next storage purchase, as the HC tests use approx. 15 TB of datasets as input, which must be stored on LOCALGROUPDISK. No room!
- Once the results are known, will switch physical cores to xrootd

ATLAS and Webdav
- ATLAS requires sites to activate a WebDAV endpoint for the Rucio namespace migration
- Once upgraded to 1.8.7 (and the related lcgdm-dav), the rename worked fine (see below for problems)
- Started on 6 November, completed 10 December

ATLAS and Webdav problems
- When the rename started, the head node was swamped
- The dpm daemon log showed all new requests stuck in DPM_QUEUED state, never processed
- DB connections reached approx. 600
- netstat showed around 30000 connections (didn't check the number when the dpm daemon wasn't processing requests, though)
- 98% of connections were in TIME_WAIT state to dpnsdaemon

ATLAS and Webdav problems
Changes made:
- NsPoolSize reduced to 12
- net.ipv4.tcp_tw_reuse = 1
- net.netfilter.nf_conntrack_tcp_timeout_established = 3600 (was 432000)
- net.netfilter.nf_conntrack_tcp_timeout_time_wait = 15 (was 120)
- net.ipv4.ip_local_port_range = 15001 61000
- ATLAS reduced concurrent WebDAV connections
Result: no more DPM "hangs" (persisting the kernel settings is sketched below)
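For reference, a minimal sketch of how the kernel-level settings above might be persisted, e.g. appended to /etc/sysctl.conf on the head node and loaded with `sysctl -p`; NsPoolSize is a DPM/DPNS daemon setting and is configured separately.

# TCP/conntrack tuning on the DPM head node (values from this slide)
# Reuse sockets in TIME_WAIT for new outbound connections
net.ipv4.tcp_tw_reuse = 1
# Expire established conntrack entries after 1 hour (was 432000 s)
net.netfilter.nf_conntrack_tcp_timeout_established = 3600
# Expire TIME_WAIT conntrack entries after 15 s (was 120 s)
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 15
# Widen the local ephemeral port range
net.ipv4.ip_local_port_range = 15001 61000

Note that the nf_conntrack keys only resolve once the conntrack modules are loaded.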

Current issues
- After the rename, all data in the old, non-Rucio locations is obviously dark
- How to find the number of files and their sizes?
- Tried a gfal2 mount of the SRM head node: directories show up as 0 bytes in size
- Will try a gfal2 mount of the xrootd/DAV head node next (an alternative walk is sketched below)
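Since the FUSE mount reports directories as 0 bytes, one alternative is to walk the old namespace with the gfal2 Python bindings and sum file sizes directly. A minimal sketch, assuming the gfal2-python package and a valid grid proxy; the endpoint and path below are hypothetical placeholders, not our real URLs.

#!/usr/bin/env python
# Sketch: count files and sum sizes under an old (pre-Rucio) DPM path
# using the gfal2 Python bindings. Endpoint and path are placeholders.
import stat
import gfal2

ctx = gfal2.creat_context()

def du(url):
    """Return (number_of_files, total_bytes) below url."""
    n_files, total = 0, 0
    for name in ctx.listdir(url):
        child = url.rstrip('/') + '/' + name
        st = ctx.stat(child)
        if stat.S_ISDIR(st.st_mode):
            n, t = du(child)
            n_files, total = n_files + n, total + t
        else:
            n_files, total = n_files + 1, total + st.st_size
    return n_files, total

if __name__ == '__main__':
    base = 'davs://dpm.example.org/dpm/example.org/home/atlas/atlasdatadisk'
    print(du(base))

The same walk should work over srm:// or root:// URLs, whichever endpoint turns out to report sizes correctly.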

Current issues
- The current setup is not balanced: during the drain of 8 old disk servers, 90% of the data (140 TB) went to the 3 new disk servers
- Need the dpm-addreplica API to support specifying a specific disk node/filesystem

Developments
- Starting to support the Belle2 VO
- Started a new GOC site: Australia-T2
- Will use storage provided for us in different cities in Australia

Developments
- Storage will be attached to VMs (also provided to us), mostly via iSCSI on the hypervisor (OpenStack Cinder)
- We should get 1 PB
- Storage nodes will be near the compute as well (also Cloud provided)
- Will also support ATLAS

Developments
Current idea:
- Single SE
- BELLEDATADISK and ATLASDATADISK tokens, distributed across all storage locations
- MELBELLESCRATCH, ADLBELLESCRATCH, SYDBELLESCRATCH, ..., to ensure output is written local to the compute node
- Can DPM support this? dCache?

Developments
- Our Tier 3/batch cluster needs /home and /data storage
  - multiple locations
  - 1000 cores (cloud provided)
- Current idea:
  - /home (40 TB) provided by CephFS out of Melbourne
  - /data/mel, /data/adl, /data/syd writable using NFS on the compute node, readable using the local xrootd federation from any compute node in any location (see the sketch after this slide)
  - local WAN caches to reduce latency
- Work in progress
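To illustrate the intended read path, a minimal sketch of reading part of a /data file through the local xrootd federation with the XRootD Python bindings; the redirector hostname and file path are hypothetical placeholders.

#!/usr/bin/env python
# Sketch: read part of a /data file via the local xrootd redirector.
# Hostname and path are hypothetical placeholders.
from XRootD import client

url = 'root://xrd-redirector.mel.example.org//data/adl/user/somefile.root'

f = client.File()
status, _ = f.open(url)
if not status.ok:
    raise RuntimeError(status.message)
status, data = f.read(offset=0, size=1024)  # read the first 1 kB
print('%d bytes read' % len(data))
f.close()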

Thanks
- Wahid for the xrootd ATLAS configs
- Sam and Wahid for troubleshooting and all-round help
- David, Oliver, Fabrizio, Alejandro and Adrien for great chats at CERN this year
- Everyone on the DPM mailing list

scrosby@unimelb.edu.au