Experience of Data Transfer to the Tier-1 from a DiRAC Perspective
Lydia Heck, Institute for Computational Cosmology
Manager of the DiRAC-2 Data Centric Facility COSMA
UK-T0 Workshop, 20-21 October 2015
Talk layout
● Introduction to DiRAC
● The DiRAC computing systems
● What is DiRAC?
● What type of science is done on the DiRAC facility?
● Why do we need to copy data to RAL?
● Copying data to RAL – network requirements
● Collaboration between DiRAC and RAL to produce the archive
● Setting up the archiving tools
● Archiving
● Open issues
● Conclusions
Introduction to DiRAC
● DiRAC – Distributed Research utilising Advanced Computing – established in 2009 with DiRAC-1
● Supports research in theoretical astronomy, particle physics and nuclear physics
● Funded by STFC, with infrastructure money allocated from the Department for Business, Innovation and Skills (BIS)
● The running costs, such as staff and electricity, are funded by STFC
Introduction to DiRAC, cont’d
● 2009 – DiRAC-1: 8 installations across the UK, of which COSMA-4 at the ICC in Durham is one; still a loose federation
● 2011/2012 – DiRAC-2: major funding of £15M for e-Infrastructure
  – bidding to host: 5 installations identified, judged by peers
  – successful bidders faced scrutiny and interview by representatives of BIS to confirm we could deliver by a tight deadline
Introduction to DiRAC, cont’d
● DiRAC has a full management structure.
● Computing time on the DiRAC facility is allocated through a peer-reviewed procedure.
● Current director: Dr Jeremy Yates, UCL
● Current technical director: Prof Peter Boyle, Edinburgh
The DiRAC computing systems
● Blue Gene – Edinburgh
● Cosmos – Cambridge
● Complexity – Leicester
● Data Centric – Durham
● Data Analytic – Cambridge
The DiRAC (Blue Gene)
● Edinburgh – IBM Blue Gene
  – cores
  – 1 Pbyte of GPFS storage
  – designed around (Lattice) QCD applications
DiRAC (Data Centric)
● Durham – Data Centric system – IBM iDataPlex
  – 6720 Intel Sandy Bridge cores
  – 53.8 TB of RAM
  – FDR10 InfiniBand, 2:1 blocking
  – 2.5 Pbyte of GPFS storage (2.2 Pbyte used!)
DiRAC (Complexity)
● Leicester – Complexity – HP system
  – 4352 Intel Sandy Bridge cores
  – 30 Tbyte of RAM
  – FDR InfiniBand, 1:1 non-blocking
  – 0.8 Pbyte of Panasas storage
DiRAC (SMP)
● Cambridge – COSMOS – SGI shared-memory system
  – 1856 Intel Sandy Bridge cores
  – 31 Intel Xeon Phi co-processors
  – 14.8 Tbyte of RAM
  – 146 Tbyte of storage
DiRAC (Data Analytic)
● Cambridge – Data Analytic – Dell
  – 4800 Intel Sandy Bridge cores
  – 19.2 TByte of RAM
  – FDR InfiniBand, 1:1 non-blocking
  – 0.75 PB of Lustre storage
What is DiRAC?
● A national service run, managed and allocated by the scientists who do the science, funded by BIS and STFC
● The systems are built around and for the applications with which the science is done.
● We do not rival a facility like ARCHER, as we do not aspire to run a general national service.
● DiRAC is classed by STFC as a major research facility, on a par with the big telescopes.
What is DiRAC, cont’d
● Long projects with a significant number of CPU hours, typically allocated for 3 years on a specific system – examples for 2012–2015:
  – Cosmos – dp002: ~20M cpu hours on Cambridge Cosmos
  – Virgo – dp004: 63M cpu hours on Durham DC
  – UK-MHD – dp010: 40.5M cpu hours on Durham DC
  – UK-QCD – dp008: ~700M cpu hours on Edinburgh BG
  – Exeter – dp005: ~15M cpu hours on Leicester Complexity
  – HPQCD – dp019: ~20M cpu hours on Cambridge Data Analytic
What type of science is done on DiRAC?
● For the highlights of science carried out on the DiRAC facility please see:
● Specific example: large-scale structure calculations with the Eagle run
  – 4096 cores
  – ~8 GB RAM/core
  – 47 days = 4,620,288 cpu hours
  – 200 TB of data
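As a quick sanity check on that figure: 4096 cores × 24 hours/day × 47 days = 4,620,288 core-hours, which is where the quoted cpu-hours number comes from.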
Why do we need to copy data (to RAL)?
● Original plan: each research project should make provisions for storing its research data
  – requires additional storage resource at researchers’ home institutions
  – not enough provision – would require additional funds
  – data creation considerably above expectation
  – if disaster struck, many cpu hours of calculations would be lost
Why do we need to copy data (to RAL)?, cont’d
● Research data must now be shared with/available to interested parties
● Installing DiRAC’s own archive requires funds, and currently there is no budget
● We needed to get started:
  – Jeremy Yates negotiated access to the RAL archive system
● Acquire expertise
● Identify bottlenecks and technical challenges
  – submitting 2,000,000 files created an issue at the file servers
● How can we collaborate and make use of previous experience?
● AND: copy data!
Copying data to RAL – network requirements
● Network bandwidth – situation for Durham, now:
  – currently possible: Mbytes/sec
  – required investment and collaboration from DU CIS
  – upgrade to 6 Gbit/sec to JANET – Sep 2014
  – will be 10 Gbit/sec by end of 2015 – infrastructure already installed
● Past: identified Durham-related bottlenecks – FIREWALL
Copying data to RAL – network requirements, cont’d
● Network bandwidth – situation for Durham:
  – investment to by-pass the external campus firewall: two new routers (~£80k), configured for throughput with a minimal ACL, enough to safeguard the site
  – deploying internal firewalls – part of the new security infrastructure, essential for such a venture
  – security now relies on the front-end systems of Durham DiRAC and Durham GridPP
Copying data to RAL – network requirements, cont’d
● Result for COSMA and GridPP in Durham:
  – guaranteed 2-3 Gbit/sec, with bursts of up to 3-4 Gbit/sec (3 Gbit/sec outside of term time)
  – pushed the network performance for Durham GridPP from the bottom 3 in the country to the top 5 of the UK GridPP sites
  – achieves up to 300-400 Mbyte/sec throughput to RAL when archiving, depending on file sizes
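For context, 300-400 Mbyte/sec corresponds to roughly 2.4-3.2 Gbit/sec, so the archiving throughput largely saturates the guaranteed 2-3 Gbit/sec share of the link.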
Collaboration between DiRAC and GridPP/RAL
● Durham’s Institute for Computational Cosmology (ICC) volunteered to be the prototype installation
● Huge thanks to Jens Jensen and Brian Davies – there were many e-mails exchanged, many questions asked and many answers given
● Resulting document: “Setting up a system for data archiving using FTS3” by Lydia Heck, Jens Jensen and Brian Davies
Setting up the archiving tools
● Identify appropriate hardware – could mean extra expense:
  – need freedom to modify and experiment with it – cannot have HPC users logged in and working!
  – free to apply the very latest security updates
  – requires an optimal connection to storage – InfiniBand card
Setting up the archiving tools, cont’d
● Create an interface to access the file/archiving service at RAL using the GridPP tools (a rough install sketch follows below):
  – gridftp – Globus Toolkit – also provides Globus Connect
  – trust anchors (egi-trustanchors)
  – voms tools (emi3-xxx)
  – fts3 (CERN)
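As a rough sketch of what installing these tools can look like on the archive front-end (a minimal sketch, assuming a Scientific Linux/CentOS host with the EGI and EMI/UMD repositories already configured; the package names below are assumptions and may differ between middleware releases):

    # Hedged sketch only – adjust package names for the local distribution/repositories.
    sudo yum install -y ca-policy-egi-core       # EGI trust anchors (needs the EGI-trustanchors repo)
    sudo yum install -y voms-clients             # voms-proxy-init and friends (EMI3-era package name assumed)
    sudo yum install -y globus-gass-copy-progs globus-proxy-utils   # globus-url-copy, grid-proxy-init
    sudo yum install -y fts-client               # FTS3 CLI (fts-transfer-submit, fts-transfer-delegation); name assumed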
Archiving?
● Long-lived voms proxy?
  – myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation
● How to create a proxy and delegation that lasts weeks, even months?
  – still an issue
● grid-proxy-init; fts-transfer-delegation (see the command sketch below)
  – grid-proxy-init -valid HH:MM
  – fts-transfer-delegation -e time-in-seconds
  – creates a proxy that lasts up to the certificate lifetime
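A minimal command sketch of the grid-proxy-init plus fts-transfer-delegation route described above (the FTS3 endpoint URL and the 30-day lifetime are illustrative placeholders, not values from the slides):

    # Create a grid proxy valid for up to 720 hours (30 days), capped at the certificate lifetime
    grid-proxy-init -valid 720:00
    # Delegate the credential to the FTS3 server so queued transfers keep running unattended;
    # -e is the requested delegation lifetime in seconds (30 days here)
    fts-transfer-delegation -s https://fts3.example.ac.uk:8446 -e 2592000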
Archiving
● Large files – optimal throughput limited by network bandwidth
● Many small files – limited by latency; using the ‘-r’ flag to fts-transfer-submit to re-use the connection (see the submission sketch below)
● Transferred so far:
  – ~40 Tbytes since 20 August
  – ~2M files – a challenge to the FTS service at RAL
● User education on creating lots of small files
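A hedged sketch of a single submission (the service and file URLs are placeholders; only the ‘-r’ re-use flag is taken from the slide):

    # Submit one copy from the COSMA gridftp front-end to the RAL endpoint;
    # -s names the FTS3 service, -r re-uses the connection, which helps with many small files
    fts-transfer-submit -s https://fts3.example.ac.uk:8446 -r \
        gsiftp://gridftp.example.ac.uk/cosma/data/dpXXX/file.dat \
        srm://srm.example.ac.uk/dirac/archive/dpXXX/file.dat
    # The command returns a job ID whose progress can be followed with fts-transfer-status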
Open issues
● Ownership and permissions are not preserved
● Depends on a single admin to carry out
● What happens when the content of directories changes?
  – complete new archive sessions?
  – it tries to archive all the files again, but then ‘fails’ as the file already exists
  – should be more like rsync (a possible pre-check sketch follows below)
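One possible, untested workaround for the re-archiving problem, assuming the gfal2 command-line utilities (gfal-stat) are available on the archive node: skip files that already exist at the destination before submitting. All endpoints and paths below are placeholders:

    SRC=gsiftp://gridftp.example.ac.uk/cosma/archive
    DST=srm://srm.example.ac.uk/dirac/archive
    # Queue only the files whose remote copy does not yet exist
    while read -r f; do
        if ! gfal-stat "$DST/$f" > /dev/null 2>&1; then
            echo "$SRC/$f $DST/$f" >> transfer.list
        fi
    done < files-to-archive.txt
    # transfer.list can then be handed to the FTS3 client as a bulk submission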
Conclusions
● With the right network speed we can archive the DiRAC data to RAL.
● The documentation has to be completed and shared with the system managers at the other DiRAC sites.
● Each DiRAC site will have its own dirac0X account.
● Start archiving – and keep on archiving.
● The collaboration between DiRAC and GridPP/RAL DOES work!
● Can we aspire to more?