Testing the UK Tier 2 Data Storage and Transfer Infrastructure C. Brew (RAL) Y. Coppens (Birmingham), G. Cowen (Edinburgh) & J. Ferguson (Glasgow) 9-13 October 2006

Outline
– What are we testing and why?
– What is the setup? Hardware and Software Infrastructure
– Test Procedures
– Lessons and Successes
– RAL Castor
– Conclusions and Future

What and Why
What:
– Set up systems and people to test the rates at which the UK Tier 2 sites can import and export data
Why:
– Once the LHC experiments are up and running, Tier 2 sites will need to absorb data from and upload data to the Tier 1s at quite alarming rates: ~1 Gb/s for a medium-sized Tier 2
– The UK has a number of "experts" in tuning DPM/dCache; these tests should spread some of that knowledge
– Get local admins at the sites to learn a bit more about their upstream networks

Why T2 → T2
CERN is driving the Tier 0 → Tier 1 and Tier 1 → Tier 1 transfers, but the Tier 2s need to get ready too. No experiment has a use case that calls for transfers between Tier 2 sites, so why test them?
– To test the network/storage infrastructure at each Tier 2 site
– Too many sites to test each against the T1
– The T1 is busy with T0 → T1 and T1 → T1 transfers
– T1 → T2 tests were run at the end of last year

Physical Infrastructure
Each UK Tier 2 site has an SRM Storage Element, either dCache or DPM. Generally the network path is:
– Departmental Network
– Site/University Network
– Metropolitan Area Network (MAN)
– JANET (the UK's educational/research backbone)
Connection speeds vary from a share of 100 Mb/s to 10 Gb/s, generally 1 or 2 Gb/s.

Network Infrastructure (diagram): Dept → Site/Uni → MAN → UK Backbone

Software Used
This is a test of the Grid software stack as well as the T2 hardware, so we try to use the production stack throughout:
– Data sink/source is the SRM-compliant SE
– Transfers are done using the File Transfer Service (FTS)
– The filetransfer script is used to submit and monitor the FTS transfers (a sketch of this workflow is given below)
– Transfers are generally done over the production network with the production software, without special short-term tweaks
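The filetransfer script itself is not reproduced in the slides; the sketch below is a hypothetical, stripped-down stand-in written in Python, assuming the standard gLite FTS command-line clients (glite-transfer-submit, glite-transfer-status) are on the PATH. The endpoint, SURLs and state names are placeholders and may differ between gLite releases.

```python
#!/usr/bin/env python
"""Hypothetical stand-in for the filetransfer driver: submit one SRM-to-SRM
copy through FTS and poll until it reaches a terminal state."""
import subprocess
import time

FTS_ENDPOINT = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder
SRC_SURL = "srm://se.source.example.ac.uk/dteam/canned/f001"     # placeholder source SURL
DST_SURL = "srm://se.sink.example.ac.uk/dteam/incoming/f001"     # placeholder destination SURL

def submit(src, dst):
    """Submit a single-file transfer job and return the FTS job id."""
    out = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, src, dst])
    return out.decode().strip()

def wait_for(job_id, poll_seconds=60):
    """Poll the job status until FTS reports a terminal state."""
    terminal = {"Done", "Finished", "FinishedDirty", "Failed", "Canceled"}  # names may vary by release
    while True:
        state = subprocess.check_output(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id]).decode().strip()
        if state in terminal:
            return state
        time.sleep(poll_seconds)

if __name__ == "__main__":
    job = submit(SRC_SURL, DST_SURL)
    print("Job", job, "finished in state", wait_for(job))
```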

File Transfer Service
A fairly recent addition to the LCG middleware:
– Manages the transfer of files from one SRM server to another; manages bandwidth and queues, and retries failed transfers
– Defines "channels" along which files are transferred between sites
– Generally each T2 has three channels defined: T1 → Site, T1 ← Site, Elsewhere → Site
– Each channel sets connection parameters, limits on the number of parallel transfers, VO shares, etc. (illustrated below)
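The slide gives no concrete channel settings; the snippet below is a purely illustrative Python description of the kind of parameters a channel carries. The field names, site names and values are invented for illustration and are not the actual FTS schema or the values used in these tests.

```python
# Illustrative sketch of what an FTS channel definition carries; all names
# and numbers are hypothetical, not the FTS schema or real GridPP settings.
CHANNELS = {
    "RAL-SITE": {                          # hypothetical T1 -> Site channel
        "source": "RAL-LCG2",
        "destination": "UKI-EXAMPLE-T2",   # placeholder site name
        "concurrent_files": 10,            # limit on parallel file transfers
        "streams_per_file": 5,             # GridFTP streams per transfer
        "state": "Active",
        "vo_shares": {"atlas": 50, "cms": 30, "lhcb": 20},  # percentage shares
    },
}
```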

Setup
– GridPP had done some T1 ↔ T2 transfer tests last year
– Three sites which had already demonstrated >300 Mb/s transfer rates in the previous tests were chosen as reference sites
– Each site to be tested nominated a named individual to "own" the tests for their site

Procedure
– Three weeks before the official start of the tests, the reference sites started testing against each other:
  – Confirmed that they could still achieve the necessary rates
  – Tested the software to be used in the test
– Each T2 site was assigned a reference site to act as its surrogate T1, and a time slot to perform 24-hour read and write tests
– The basic site test was:
  – Beforehand, copy 100 × 1 GB "canned" files to the source SRM
  – Repeatedly transfer these files to the sink for 24 hours
  – Reverse the flow and copy data from the reference site for 24 hours
  – Rate is simply (number of files successfully transferred × file size) / time (worked example below)
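As a worked example of the rate formula (the file count here is illustrative, not a measured result from the tests):

```python
# Average rate from a 24-hour test: (files transferred * file size) / elapsed time.
# The file count below is invented for illustration, not a measured result.
files_transferred = 3200            # hypothetical number of successful 1 GB copies
file_size_bits = 1e9 * 8            # 1 GB expressed in bits
elapsed_seconds = 24 * 3600         # 24-hour test window

rate_mbps = files_transferred * file_size_bits / elapsed_seconds / 1e6
print("Average rate: %.0f Mb/s" % rate_mbps)   # ~296 Mb/s for these numbers
```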

Issues / Lessons
– Loss of a reference site before we even started
  – Despite achieving very good rates in the previous tests, no substantive change at the site, and heroic efforts, it could not sustain >250 Mb/s
– Tight timescale
  – Tests using each reference site were scheduled for each working day, so if a site missed its slot or had a problem during the test there was no room to catch up

Issues / Lessons
– Lack of pre-test tests
  – Sites only had a 48-hour slot for two 24-hour tests, and reference sites were normally busy with other tests, so there was little opportunity for sites to tune their storage/channel before the main tests
– Network variability
  – Especially prevalent during the reference site tests
  – Performance could vary hour by hour by as much as 50% for no reason apparent on the LANs at either end
  – In the long term, changes upstream (a new firewall, or rate limiting by your MAN) can reduce previously good rates to a trickle

Issues / Lessons
– Needed a better recipe
  – With limited opportunities for site admins to try out the software, a better recipe for preparing and running the test would have helped
– Communication wasn't always ideal
  – Would have been better to get phone numbers for all the site contacts
– Ganglia bandwidth plots seem to underestimate the rate

What worked
Despite the above, the tests themselves worked:
– Community support
  – Reference sites got early experience running the tests and could help the early sites, who in turn could help the next wave, and so on
– Service reliability
  – The FTS was much more reliable than in previous tests
  – Some problems with the MyProxy service stopping, which caused transfers to stop
– Sites owning the tests

Where are we now?
– 14 out of 19 sites have participated, and have successfully completed 21 out of 38 tests
– >60 TB of data has been transferred between sites
– Max recorded transfer rate: 330 Mb/s
– Min recorded transfer rate: 27 Mb/s
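To put the headline rates in context, here is a quick conversion from sustained rate to data moved in a single 24-hour test window (simple arithmetic, not figures from the slides):

```python
# Data moved in one 24-hour test at a given sustained rate (decimal units).
def tb_per_day(rate_mbps):
    """Convert a sustained rate in Mb/s to TB transferred in 24 hours."""
    return rate_mbps * 1e6 * 24 * 3600 / 8 / 1e12

print("330 Mb/s for 24h ~ %.1f TB" % tb_per_day(330))   # ~3.6 TB
print(" 27 Mb/s for 24h ~ %.1f TB" % tb_per_day(27))    # ~0.3 TB
```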

Now

RAL Castor
– During the latter part of the tests the new CASTOR at RAL was ready for testing
– We had a large pool of sites which had already been tested, and admins familiar with the test software, who could quickly run the same tests with the new CASTOR as the endpoint
– This enabled us to run tests against CASTOR and get good results whilst still running the main tests
– This in turn helped the CASTOR team in their superhuman efforts to get CASTOR ready for CMS's CSA06 tests

Conclusions
– UK Tier 2s have started to prepare for the data challenges that LHC running will bring
– Network "weather" is variable and can have a big effect, as can any one of the upstream network providers

Future
– Work with sites with low rates to understand and correct them
– Keep running tests like this regularly: sites that can do 250 Mb/s now should be doing 500 Mb/s by next spring and 1 Gb/s by this time next year

Thanks…
Most of the actual work for this was done by Jamie, who co-ordinated everything; the sysadmins, Grieg, Mark, Yves, Pete, Winnie, Graham, Olivier, Alessandra and Santanu, who ran the tests; and Matt, who kept the central services running.