
1 Data Distribution Performance Hironori Ito Brookhaven National Laboratory

2 Change in distribution model
Old – cloud centric
– Hierarchical distribution chain: T0 -> T1s -> T2s.
  » Well tested and monitored: regular tests between BNL and the T2s.
  » In the US, BNL -> US T2s at a sustained rate of at least 400 MB/s.
  » BNL can push at 20 Gb/s now and is going to 40 Gb/s.
  » Some T2s can show sustained transfer rates of 10 Gb/s.
– Pre-placement: data is placed well before its use, regardless of need.
New – non cloud centric
– T2Ds: T2s become part of all clouds and are used as a source of data.
– Minimum pre-placement (PD2P): data is placed as needed.

3 Further change
Many more, smaller consumers:
– T3s: 50 (or more) T3s coming soon worldwide.
– Individual users: dq2-get and xRootd bring new types of requests.
– Although each of them is small, the combined requests could rival or exceed the level of a T1.
– Inexperienced users and site administrators: unexpected, creative (not necessarily efficient) use of DDM resources.

4 Consequence of change
More point-to-point transfers
– T2s to all T1s: PanDA jobs.
– T2s to all T1s/T2s: DDM distributions.
  » T2s are no longer protected by the cloud FTS.
  » The number of requests could increase dramatically at certain times (by a factor of 10 or more).
More on-demand transfers
– Timely delivery of a particular dataset is more crucial; jobs/users won't wait forever.

5 Current status
Sonar tests
– Regular tests run by the central operation.
– Data are sent from various T1s/T2s to many T1s/T2s.
– Results are shown at http://bourricot.cern.ch/dq2/ftsmon/sonar_view/cached/
– Results are used to evaluate the eligibility of T2s for promotion to T2D (a rough sketch of such a check follows).
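As a rough illustration of how sonar-style results could feed the T2D eligibility decision, the sketch below applies a hypothetical "sustained rate on most channels" rule. The record layout, the channel names, and the 5 MB/s and 80% cuts are assumptions for illustration, not the official criteria.

```python
# Hypothetical sketch: check whether a T2's sonar-style throughput records
# pass a T2D-like promotion rule.  The record layout, channel names and the
# 5 MB/s / 80% thresholds are illustrative assumptions only.

# Each record: (source_site, destination_site, throughput in MB/s)
sonar_records = [
    ("MWT2", "T1_A", 42.0),
    ("MWT2", "T1_B", 7.5),
    ("MWT2", "T1_C", 3.9),
    # ... one entry per monitored channel
]

def meets_t2d_criteria(site, records, min_rate=5.0, min_fraction=0.8):
    """True if the site sustains at least min_rate MB/s on at least
    min_fraction of its monitored channels (illustrative rule only)."""
    rates = [rate for (src, _dst, rate) in records if src == site]
    if not rates:
        return False
    good = sum(1 for rate in rates if rate >= min_rate)
    return good / len(rates) >= min_fraction

print("MWT2 eligible:", meets_t2d_criteria("MWT2", sonar_records))
```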

6 Some Sonar results (MWT2)

7 More Sonar results (SWT2CPB)

8 Other tests
The US's own regular tests are being expanded to cover transfers from foreign T1s to the T2s.
– The opposite direction is also considered; all T1s to BNL and BNL to all T1s are already being tested.
– The test runs twice a day, using 10 large files to measure the network throughput (a minimal sketch of such a probe follows).
– The results are shown at the same place as before (http://www.usatlas.bnl.gov/dq2/throughput).
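A minimal sketch of such a throughput probe is below: it copies a fixed set of large files and reports the aggregate rate. The copy_file() helper, the file paths and the destination directory are placeholders for whatever transfer tool and storage the site actually uses, not the real BNL test harness.

```python
# Sketch of a twice-daily throughput probe: copy N large files and report
# the aggregate transfer rate.  copy_file(), the file list and DEST_DIR are
# placeholders; the real test would use the site's transfer tool of choice.
import os
import shutil
import time

TEST_FILES = ["/data/throughput/test_%02d.dat" % i for i in range(10)]
DEST_DIR = "/import/throughput_test"   # assumed destination path

def copy_file(src, dest_dir):
    """Placeholder for the real transfer command."""
    shutil.copy(src, dest_dir)

def measure_throughput(files, dest_dir):
    total_bytes = sum(os.path.getsize(f) for f in files)
    start = time.time()
    for f in files:
        copy_file(f, dest_dir)
    elapsed = time.time() - start
    return total_bytes / elapsed / 1e6   # MB/s

if __name__ == "__main__":
    rate = measure_throughput(TEST_FILES, DEST_DIR)
    print("Aggregate throughput: %.1f MB/s" % rate)
```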

9 Throughput test results (AGLT2)

10 Identification of a problem
Who will find problematic transfers?
– Shifters
  » Not capable of identifying any detail; they just look at green/yellow/red lights somewhere.
– Requesters: PanDA jobs, users, production
  » Requesters always know when they don't get what they want.
  » Sometimes the problem is noticed quite late.
  » There is usually some rudimentary debugging effort by experienced users.
  » The cause is almost always assumed to be the destination, e.g. "I can get the DSN at site A but not at B, therefore it must be B", not realizing that it also does not work at sites C, D, E and F (a sketch of a multi-site check follows).
– Site administrators
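The "it works at A but not at B, so it must be B" reasoning can be checked mechanically by trying the same dataset against several sites before filing a ticket. A minimal sketch, with the per-site results hard-coded as stand-ins for real dq2-get attempts:

```python
# Sketch: check the same dataset against several sites before blaming one
# destination.  The results below are hard-coded stand-ins; a real check
# would attempt the retrieval (e.g. with dq2-get) once per site.
attempts = {
    "SITE_A": True,    # retrieval succeeded
    "SITE_B": False,   # retrieval failed
    "SITE_C": False,
    "SITE_D": False,
}

failed = sorted(site for site, ok in attempts.items() if not ok)
if len(failed) > 1:
    print("Fails at %s: probably not a problem of a single destination"
          % ", ".join(failed))
elif failed:
    print("Only %s fails: start debugging there" % failed[0])
else:
    print("All retrievals succeeded")
```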

11 Who is responsible for fixing problematic transfers?
The cause of a "slow" transfer is notoriously difficult to find. Generally, a ticket is sent to the destination simply because shifters are trained that way, and because the DDM dashboard summarizes transfer failures by destination. Source and destination sites can only rule out their own performance as the cause of a problem.
– The problem could be anywhere between the source and the destination.
– Someone has to take ownership for quick resolution.

12 Monitoring/debugging
T2 administrators are completely blind when debugging DDM transfers from their own sites to foreign T1s/T2s.
– CERN DQ2 logs are not visible remotely.
– T1 FTS logs (except BNL's) are not visible remotely.
For transfers from foreign T1s/T2s to a site's own T2, the regular BNL DQ2 and FTS log viewers still show them.
Network performance between T2s and foreign T1s/T2s is not currently monitored, and that traffic will compete for bandwidth with general traffic.

13 How to find, identify and resolve problematic transfers
The DDM dashboard will show the failed transfers. Then:
– Analyze the monitor more carefully. Is it failing for all sites or just one (or a few)? Are there any correlations with other failing sites? (A sketch of this grouping follows.)
– Look at the results of the control data: the Sonar and US throughput results.
  » The US throughput test provides BNL administrators with the logs.
– Initiate a new test on demand.
– Check for a network issue: traceroute, contact the network engineer, etc.
Finding the location of the problem is currently difficult!
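As a small illustration of the "one site or many?" step above, the sketch below groups failed transfers by source and by destination to see which end dominates. The record format and site names are assumptions; the real input would come from the DDM dashboard or the FTS logs.

```python
# Sketch: group failed transfers by source and destination to localize the
# problem.  The (source, destination) records are illustrative; real data
# would be pulled from the DDM dashboard or FTS logs.
from collections import Counter

failed_transfers = [
    ("AGLT2", "T1_A"),
    ("AGLT2", "T1_B"),
    ("AGLT2", "T1_C"),
    ("MWT2",  "T1_A"),
]

by_source = Counter(src for src, _dst in failed_transfers)
by_destination = Counter(dst for _src, dst in failed_transfers)

print("Failures per source:     ", dict(by_source))
print("Failures per destination:", dict(by_destination))
```

If one source shows up across many destinations (as AGLT2 does in the illustrative data), the source side or its network path is the better suspect, and the next step is the control data (Sonar, US throughput tests) and a traceroute.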

14 Creative use of DDM clients by users
Users generally do not understand the physical limitations of getting data:
– the network bandwidth limit;
– the disk I/O bandwidth limit at the source or destination.
Assuming no limitation, the use can be quite creative and destructive.
– One user was found running about 1500 simultaneous downloads from two hosts (150 dq2-get processes with a concurrency option of 10), over a single 1 Gb/s NIC, writing to an NFS area limited to about 70 MB/s. Although it did not cause a problem at the source sites, the load on the two hosts running dq2-get was very high, making the client hosts unusable. The same performance could have been achieved with 1/100th of the concurrency, without crashing the client hosts (see the sketch below).
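A back-of-the-envelope check shows why the extra concurrency could not pay off: a 1 Gb/s NIC tops out around 125 MB/s and the NFS area at about 70 MB/s, so roughly 70 MB/s was the ceiling no matter how many parallel streams were running. Below is a minimal sketch of client-side throttling; download() is a stand-in for the real dq2-get invocation, and the limits and dataset names are illustrative.

```python
# Sketch: cap the number of concurrent downloads so the client host stays
# usable while the NIC/NFS ceiling (~70 MB/s here) is still reached.
# download() is a placeholder for one dq2-get call.
from concurrent.futures import ThreadPoolExecutor
import time

MAX_CONCURRENT = 8                                   # plenty for ~70 MB/s
DATASETS = ["dataset_%03d" % i for i in range(150)]  # illustrative names

def download(dsn):
    """Placeholder for a single dq2-get invocation."""
    time.sleep(0.01)   # simulate transfer work
    return dsn

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for finished in pool.map(download, DATASETS):
        print("finished", finished)
```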

15 More problems are coming
Good luck! Look at the existing information carefully to identify any problems. More tools and information would help.
– Tests on demand: the US throughput page will include the ability for a site administrator to initiate a transfer on demand and look at the transfer logs.
– perfSONAR: it would greatly help if pure network performance results were available.
Good and direct communication with network engineers and remote site administrators is needed.
– We probably need help from the T1s and the T0.

