1
Testing the UK Tier 2 Data Storage and Transfer Infrastructure
C. Brew (RAL), Y. Coppens (Birmingham), G. Cowen (Edinburgh) & J. Ferguson (Glasgow)
9-13 October 2006
2
Outline
What are we testing and why?
What is the setup? Hardware and Software Infrastructure
Test Procedures
Lessons and Successes
RAL Castor
Conclusions and Future
3
What and Why
What:
–Set up systems and people to test the rates at which the UK Tier 2 sites can import and export data
Why:
–Once the LHC experiments are up and running, Tier 2 sites will need to absorb data from, and upload data to, the Tier 1s at quite alarming rates: ~1 Gb/s for a medium-sized Tier 2 (see the sketch below)
–The UK has a number of “experts” in tuning DPM/dCache; these tests should spread some of that knowledge
–Get local admins at the sites to learn a bit more about their upstream networks
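As a back-of-the-envelope sketch of what that figure means in practice (only the ~1 Gb/s rate comes from the slide; the daily-volume arithmetic is a simple illustration):

```python
# What a sustained 1 Gb/s import rate means in volume per day.
# Only the 1 Gb/s figure comes from the slide; the rest is arithmetic.
line_rate_gbps = 1.0                      # sustained rate, gigabits per second
seconds_per_day = 24 * 60 * 60
bits_per_day = line_rate_gbps * 1e9 * seconds_per_day
terabytes_per_day = bits_per_day / 8 / 1e12
print(f"{line_rate_gbps:.0f} Gb/s sustained is about {terabytes_per_day:.1f} TB/day")
# -> about 10.8 TB/day
```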
4
Why T2 → T2?
CERN is driving the Tier 0 → Tier 1 and Tier 1 → Tier 1 transfers, but the Tier 2s need to get ready. No experiment has a use case that calls for transfers between Tier 2 sites, so why test them?
–They exercise the network/storage infrastructure at each Tier 2 site
–There are too many sites to test each against the T1
–The T1 is busy with T0 → T1 and T1 → T1 transfers
–T1 → T2 tests were run at the end of last year
5
Physical Infrastructure
Each UK Tier 2 site has an SRM Storage Element, either dCache or DPM
Generally the network path is:
–Departmental Network
–Site/University Network
–Metropolitan Area Network (MAN)
–JANET (the UK’s educational/research backbone)
Connection speeds vary from a share of 100 Mb/s to 10 Gb/s, generally 1 or 2 Gb/s
6
Network Infrastructure (diagram): Dept → Site/Uni → MAN → UK Backbone
8
Software Used
This is a test of the Grid software stack as well as the T2 hardware, so we try to use the production stack:
–Data sink/source is the SRM-compliant SE
–Transfers are done using the File Transfer Service (FTS)
–filetransfer script used to submit and monitor the FTS transfers (sketched below): http://www.physics.gla.ac.uk/~graeme/scripts/filetransfer
–Transfers are generally done over the production network with the production software, without special short-term tweaks
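The filetransfer script itself is not reproduced in the slides. As a rough, hypothetical sketch of the submit-and-poll pattern it implements, here is a small driver around the gLite FTS command-line tools; the endpoint, SURLs, and the set of terminal state names are assumptions for illustration, not taken from the real script:

```python
# Hypothetical sketch of a submit-and-poll driver around the gLite FTS CLI.
# The real filetransfer script is not reproduced here; the endpoint and the
# terminal state names below are assumptions for illustration.
import subprocess
import time

FTS_ENDPOINT = "https://fts.example.ac.uk:8443/fts"   # placeholder endpoint

def submit(source_surl: str, dest_surl: str) -> str:
    """Submit one transfer via glite-transfer-submit and return the job ID."""
    out = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def wait_for(job_id: str, poll_seconds: int = 60) -> str:
    """Poll glite-transfer-status until the job reaches a terminal state."""
    terminal = {"Done", "Finished", "FinishedDirty", "Failed", "Canceled"}
    while True:
        out = subprocess.run(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
            check=True, capture_output=True, text=True)
        state = out.stdout.strip()
        if state in terminal:
            return state
        time.sleep(poll_seconds)
```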
9
File Transfer Service
A fairly recent addition to the LCG middleware:
–Manages the transfer of files from one SRM server to another; manages bandwidth and queues, and retries failed transfers
–Defines “Channels” to transfer files between sites
–Generally each T2 has three channels defined: T1 → Site, T1 ← Site, Elsewhere → Site
–Each channel sets connection parameters, limits on the number of parallel transfers, VO shares, etc. (illustrated below)
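The channel definitions themselves are not shown in the talk. Purely as an illustration of the kind of parameters the slide says a channel carries, a hypothetical channel might look like this (all field names and values are invented, not actual FTS channel syntax):

```python
# Purely illustrative: the kinds of knobs a channel carries per the slide.
# Field names and values are invented, not actual FTS channel syntax.
channel = {
    "name": "RAL-GLASGOW",       # hypothetical T1 -> Site channel
    "max_parallel_files": 10,    # limit on concurrent file transfers
    "streams_per_file": 5,       # TCP streams used per transfer
    "vo_shares": {"atlas": 50, "cms": 30, "lhcb": 20},   # percentage shares
}
```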
10
Setup
GridPP had done some T1 ↔ T2 transfer tests last year
Three sites which had already demonstrated >300 Mb/s transfer rates in the previous tests were chosen as reference sites
Each site to be tested nominated a named individual to “own” the tests for their site
11
Procedure
Three weeks before the official start of the tests, the reference sites started testing against each other:
–Confirmed that they could still achieve the necessary rates
–Tested the software to be used in the tests
Each T2 site was assigned a reference site to act as its surrogate T1, and a time slot in which to perform 24-hour read and write tests
Basic site test was:
–Beforehand, copy 100 1 GB “canned” files to the source SRM
–Repeatedly transfer these files to the sink for 24 hours
–Reverse the flow and copy data from the reference site for 24 hours
–Rate is simply (number of files successfully transferred × size) / time, as in the sketch below
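That rate formula is simple enough to write down directly. A minimal sketch (the 1 GB file size and 24-hour window are from the slide; the example file count is invented):

```python
# Rate bookkeeping as defined on the slide:
# rate = (number of files successfully transferred * file size) / elapsed time.
def achieved_rate_mbps(files_ok: int, file_size_gb: float, hours: float) -> float:
    bits_moved = files_ok * file_size_gb * 8e9    # one 1 GB file = 8e9 bits
    return bits_moved / (hours * 3600) / 1e6      # megabits per second

# Invented example: 400 one-GB files in a 24-hour test window
print(f"{achieved_rate_mbps(400, 1.0, 24):.0f} Mb/s sustained")   # -> 37 Mb/s
```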
12
Issues / Lessons
Loss of a reference site before we even started:
–Despite achieving very good rates in the previous tests, no substantive change at the site, and heroic efforts, it could not sustain >250 Mb/s
Tight timescale:
–Tests using each reference site were scheduled for each working day, so if a site missed its slot or had a problem during the test there was no room to catch up
13
Issues / Lessons
Lack of pre-test tests:
–Sites only had a 48-hour slot for two 24-hour tests, and reference sites were normally busy with other tests, so there was little opportunity for sites to tune their storage/channel before the main tests
Network variability:
–Especially prevalent during the reference site tests
–Performance could vary hour by hour by as much as 50% for no reason apparent on the LANs at either end
–In the long term, upstream changes (a new firewall, or rate limiting by your MAN) can reduce previously good rates to a trickle
14
Issues / Lessons
Needed a better recipe:
–With limited opportunities for site admins to try out the software, a better recipe for preparing and running the test would have helped
Email communication wasn’t always ideal:
–Would have been better to get phone numbers for all the site contacts
Ganglia bandwidth plots seem to underestimate the rate
15
What worked
Despite the above, the tests themselves worked
Community support:
–Reference sites got early experience running the tests and could help the early sites, who in turn could help the next wave, and so on
Service reliability:
–The FTS was much more reliable than in previous tests
–Some problems with the myproxy service stopping, which caused transfers to stop
Sites owning the tests
16
Where are we now?
–14 out of 19 sites have participated, and have successfully completed 21 out of 38 tests
–>60 TB of data has been transferred between sites
–Max recorded transfer rate: 330 Mb/s
–Min recorded transfer rate: 27 Mb/s
17
Now
18
RAL Castor
During the latter part of the tests, the new CASTOR at RAL was ready for testing
We had a large pool of sites which had already been tested, with admins familiar with the test software, who could quickly run the same tests with the new CASTOR as the endpoint
This enabled us to run tests against CASTOR and get good results whilst still running the main tests
In turn this helped the CASTOR team in their superhuman efforts to get CASTOR ready for CMS’s CSA06 tests
19
Conclusions
UK Tier 2s have started to prepare for the data challenges that LHC running will bring
Network “weather” is variable and can have a big effect, as can any one of the upstream network providers
20
Future
Work with sites with low rates to understand and correct them
Keep running tests like this regularly:
–Sites that can do 250 Mb/s now should be doing 500 Mb/s by next spring and 1 Gb/s by this time next year
21
Thanks… Most of the actual work for this was done by Jamie, who co-ordinated everything; the sysadmins, Grieg, Mark, Yves, Pete, Winnie, Graham, Olivier, Alessandra and Santanu, who ran the tests; and Matt, who kept the central services running.