Storage elements discovery

How the SE discovery works
Question: which are the best 4 storage elements of type disk to send my files to?
To answer this we need to know which storage elements are working at the moment:
- Periodic functional tests of all known SEs (currently every 2 h): add, get and remove a test file from a remote location
- The status of an SE can also be set by the administrators (unreliable storage elements)

ALICE Offline Week Storage elements discovery
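The add / get / remove cycle described above can be sketched as follows. This is only an illustration: `StorageClient`, `InMemorySE`, `functional_test`, `is_usable` and the test path are hypothetical names, not part of AliEn or MonALISA, and the in-memory class merely stands in for a real remote storage element.

```python
from typing import Dict, Optional, Protocol


class StorageClient(Protocol):
    """Hypothetical minimal interface to one storage element."""

    def add(self, path: str, data: bytes) -> None: ...
    def get(self, path: str) -> bytes: ...
    def remove(self, path: str) -> None: ...


class InMemorySE:
    """Stand-in storage element, used only to make the sketch runnable."""

    def __init__(self, broken: bool = False) -> None:
        self._files: Dict[str, bytes] = {}
        self._broken = broken

    def add(self, path: str, data: bytes) -> None:
        if self._broken:
            raise IOError("write failed")
        self._files[path] = data

    def get(self, path: str) -> bytes:
        return self._files[path]

    def remove(self, path: str) -> None:
        del self._files[path]


def functional_test(se: StorageClient, path: str = "/test/probe") -> bool:
    """Add, read back and remove a small test file; any failure means 'not working'."""
    payload = b"probe"
    try:
        se.add(path, payload)
        intact = se.get(path) == payload
        se.remove(path)
        return intact
    except Exception:
        return False


def is_usable(test_passed: bool, admin_override: Optional[bool] = None) -> bool:
    """An administrator-set status, when present, wins over the test result."""
    return test_passed if admin_override is None else admin_override
```

In a real deployment the test would be scheduled periodically (every 2 h according to the slide) and its result stored per SE; the scheduling is left out here.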

SE testing results

Network topology discovery
To determine which storage elements are closest to the client, network topology information is needed:
- MonALISA performs tracepath/traceroute between all VoBoxes
  - Recording the RTT to each router and to the target VoBox
  - Bandwidth estimation is part of the test suite
- Each SE is associated with a set of IP addresses
  - The IP of the VoBox on the site
  - The IPs of the xrootd redirector and nodes

Discovered network topology – all routers and RTT between nodes

Derived network topology
But if we
- group the routers in their respective Autonomous Systems (AS)
- compute the distance (RTT) between them
then we have a better understanding of the relation between sites
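The grouping step can be illustrated with a small sketch. It assumes that router-to-router RTT samples and a router-to-AS mapping are already available, and it averages the samples for each AS pair; the averaging, the function name `as_level_rtt` and the sample addresses and AS numbers (taken from the documentation ranges) are assumptions made for illustration, not the method used by MonALISA.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, FrozenSet, Iterable, List, Tuple

Link = Tuple[str, str, float]  # (router A, router B, RTT in ms)


def as_level_rtt(
    links: Iterable[Link], router_to_as: Dict[str, int]
) -> Dict[FrozenSet[int], float]:
    """Collapse router-level RTT samples into one RTT value per pair of AS numbers."""
    samples: Dict[FrozenSet[int], List[float]] = defaultdict(list)
    for router_a, router_b, rtt_ms in links:
        as_a = router_to_as.get(router_a)
        as_b = router_to_as.get(router_b)
        # Skip routers with unknown AS and links that stay inside one AS.
        if as_a is None or as_b is None or as_a == as_b:
            continue
        samples[frozenset((as_a, as_b))].append(rtt_ms)
    # Assumption: the mean of the samples represents the AS-to-AS distance.
    return {pair: mean(values) for pair, values in samples.items()}


router_to_as = {
    "192.0.2.1": 64496,
    "192.0.2.2": 64496,
    "198.51.100.1": 64497,
    "198.51.100.2": 64497,
}
links = [
    ("192.0.2.1", "198.51.100.1", 10.0),
    ("192.0.2.2", "198.51.100.2", 20.0),
    ("192.0.2.1", "192.0.2.2", 1.0),  # intra-AS link, ignored
]
as_rtt = as_level_rtt(links, router_to_as)
```

Using an unordered pair (`frozenset`) as the key treats the distance as symmetric, which is a simplification.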

Client to Storage distance
distance(IP, IP), from closest to farthest:
- Same C-class network
- Common domain name
- Same AS
- Same country (+ function of RTT between the respective AS numbers, if known)
- If the distance between the AS numbers is known, use it
- Same continent
- Far far away
distance(IP, Set<IP>): client's public IP to all known IPs for the storage
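The tiers above could be turned into a numeric metric roughly as follows. This is a sketch under stated assumptions: the numeric values of the tiers, the way the RTT term is scaled, the `Host` record, and the use of the minimum over the set for distance(IP, Set<IP>) are all illustrative choices that the slides do not specify. The separate "AS distance is known" tier is not modelled; the RTT is only used to refine the same-country tier.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical record of what is known about one IP address."""
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def same_c_class(ip_a: str, ip_b: str) -> bool:
    """True if both IPv4 addresses share their first 24 bits."""
    return int(IPv4Address(ip_a)) >> 8 == int(IPv4Address(ip_b)) >> 8


def distance(
    a: Host, b: Host, as_rtt_ms: Optional[Dict[FrozenSet[int], float]] = None
) -> float:
    """Return a value in [0, 1]; smaller means closer. Tier values are illustrative."""
    if same_c_class(a.ip, b.ip):
        return 0.0
    if a.domain == b.domain:
        return 0.1
    if a.asn == b.asn:
        return 0.2
    if a.country == b.country:
        rtt = (as_rtt_ms or {}).get(frozenset((a.asn, b.asn)))
        # Assumption: a known AS-to-AS RTT refines the tier, capped at 0.2.
        return 0.3 + (min(rtt / 1000.0, 0.2) if rtt is not None else 0.2)
    if a.continent == b.continent:
        return 0.7
    return 1.0  # "far far away"


def distance_to_storage(
    client: Host,
    storage_hosts: Iterable[Host],
    as_rtt_ms: Optional[Dict[FrozenSet[int], float]] = None,
) -> float:
    """Assumption: a storage is as close as its closest known IP."""
    return min(
        (distance(client, h, as_rtt_ms) for h in storage_hosts),
        default=float("inf"),
    )
```

Checking the tiers in order means that a stronger relation (same network, same domain) always wins over a weaker one (same country, same continent), whatever the RTT term contributes.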

Integration in AliEn
[Diagram. Components shown: Agent (Job or User), AliEn Authen service, cache of SE rankings for each site, SE Rank Optimizer, policies, MonALISA Repository, ML services at Site A to Site Z, functional tests, SE 1, SE 2, SE 3, access token. Parts of the flow are labelled "Synchronously" and "Asynchronously".]
Agent (Job or User): What are the 2 closest SEs of type "disk"?
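The agent's question could be answered from a per-site ranking cache along these lines. This is a minimal sketch, assuming each cached record carries a QoS tag, the latest working status and a precomputed distance to the client's site; `StorageElement`, `closest_ses` and the SE names are hypothetical and do not reflect the actual AliEn API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StorageElement:
    """Hypothetical cached ranking record for one SE, as seen from one site."""
    name: str
    qos: str          # QoS tag, e.g. "disk"
    working: bool     # result of the latest functional test or admin setting
    distance: float   # precomputed distance from the client's site


def closest_ses(ses: Iterable[StorageElement], qos: str, count: int) -> List[str]:
    """Return the names of the `count` closest working SEs with the requested tag."""
    candidates = [se for se in ses if se.working and se.qos == qos]
    # The name is a tie-breaker so that the result is deterministic.
    candidates.sort(key=lambda se: (se.distance, se.name))
    return [se.name for se in candidates[:count]]


cache = [
    StorageElement("SiteA::SE1", "disk", True, 0.2),
    StorageElement("SiteB::SE2", "disk", False, 0.0),  # closest, but not working
    StorageElement("SiteC::SE3", "tape", True, 0.1),   # wrong QoS tag
    StorageElement("SiteD::SE4", "disk", True, 0.7),
    StorageElement("SiteE::SE5", "disk", True, 0.5),
]
best_two = closest_ses(cache, "disk", 2)
```

Filtering on the working status before sorting is what provides the failover: an SE that fails its functional test simply drops out of the answer until it recovers.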

Samples
- Job executed at JINR
- Job executed at KOLKATA

What was gained
- Flexible storage configuration
  - QoS tags are all that users should know about the system
  - Not yet for reading, but getting there
- Maintenance-free system
  - Monitoring feedback on known elements, and automatic discovery and configuration of new resources
- Reliable and efficient file access
  - Thanks to auto discovery and failover, jobs no longer fail because of temporary problems
  - Use the closest working storage element(s) to where the application runs

Additional information
- The topology and available bandwidth are archived
- Now that we have some history, interesting effects show up in the long run
  - Upgrading to SLC5 has increased the average available bandwidth between all sites in the ALICE Grid (for one stream) by a factor of 7
- There is still room to optimize remote data access by tuning the kernel parameters on all machines involved in the system (storage servers, worker nodes, VoBoxes)

Where we have started from

Max available bandwidth within continents
Plot annotations:
- New upper limits from default SLC5 buffers
- Well tuned on both sides
- 100 Mbit still? Old machines / bad configuration
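Why buffer sizes put an upper limit on the bandwidth of one stream can be shown with a short calculation: a single TCP stream can have at most one window of unacknowledged data in flight per round trip, so its throughput is bounded by window / RTT. The sketch below only evaluates that bound; the function names and the buffer sizes are illustrative and are not the actual SLC5 defaults.

```python
def max_single_stream_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP stream's throughput (Mbit/s) for a given window and RTT."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(target_mbps: float, rtt_ms: float) -> float:
    """Window (bandwidth-delay product) needed to sustain target_mbps at the given RTT."""
    return target_mbps * 1e6 * (rtt_ms / 1000.0) / 8


# Illustrative values only: the same buffer gives 10x less throughput at 10x the RTT.
for rtt in (10.0, 100.0):
    print(f"128 KiB window, RTT {rtt:.0f} ms: "
          f"{max_single_stream_mbps(128 * 1024, rtt):.1f} Mbit/s at most")
```

For the bound to rise, the buffers have to be large enough on both the sending and the receiving side, which matches the "well tuned on both sides" annotation.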

Remarks
- The network performance has improved a lot
  - There is still room for optimizations
- Remote access to files can already be considered a viable option
  - Opening the door to removing the hard dependency between data location and where the jobs are running
  - The distance metric could be used to prefer jobs that access close-by data files
- Topology information is now critical for correct decisions
  - Some sites are blocking tracepath/traceroute to the VoBoxes; please make sure the following ports are allowed from the world:
    - UDP/ ICMP
    - TCP/1093 (for bandwidth estimation)

Evolution of throughput vs RTT
