Optimization of Large Scale HEP Data Analysis
A file staging approach for analysis jobs on Stoomboot
Daniela Remenska
B-physics Workshop - June 14, 2010
Approach: day 1
Q: What is the underlying problem?
A: “If you can build a file stager in three weeks from now, that would be perfect!” ?!
Q: What is the underlying problem?
A: We have a perception that the analysis jobs running on Stoomboot are inefficient.
The “Why”?
Why Stoomboot for analysis jobs, and not the Grid?
1. The Grid is not so intuitive for users
2. To test the correctness of algorithms
3. Because you can!
From the archive:
“Hi Dox,
As I understood it, running on stoomboot has been slow because of I/O issues: running is limited by reading speed of the files instead of by processing speed. It was rather busy on stoomboot these last couple of days. As for the tmpdir problem, I guess you should ask the admins to clean out the tmpdirs...?”
Bonnie++: benchmarking file systems
Problem analysis: Profiling Stoomboot
Two basic metrics collected:
- CPU time
- Wallclock time
(CPU efficiency = CPU time / wallclock time)
Number of files: 3
Total data volume: 6 GB (3 x 2 GB)
Number of events:
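A minimal sketch of how both metrics can be collected around a job; run_event_loop() is a hypothetical stand-in for the real DaVinci event loop:

    import os, time

    def run_event_loop():
        # Stand-in for the real analysis; just burns some CPU.
        sum(i * i for i in range(10 ** 7))

    wall0 = time.time()
    cpu0 = sum(os.times()[:2])          # user + system CPU used so far
    run_event_loop()
    cpu = sum(os.times()[:2]) - cpu0    # CPU time of the loop
    wall = time.time() - wall0          # wallclock time of the loop
    print("CPU efficiency: %.1f%%" % (100.0 * cpu / wall))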
Results: sequential access with DaVinci

File location     | CPU time [min] | Main EventLoop time [min] | Wallclock time [min] | CPU efficiency
grid2.fe.infn.it  |                |                           |                      | %
tbn18.nikhef.nl   |                |                           |                      | %
nfs partition     |                |                           |                      | %
local storage     |                |                           |                      | %
Latency?
-bash-3.2$ ping tbn18.nikhef.nl
PING tbn18.nikhef.nl ( ) 56(84) bytes of data.
64 bytes from tbn18.nikhef.nl ( ): icmp_seq=1 ttl=61 time=0.478 ms
--- tbn18.nikhef.nl ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.189/0.285/0.478/0.137 ms

-bash-3.2$ ping grid2.fe.infn.it
PING grid2.fe.infn.it ( ) 56(84) bytes of data.
64 bytes from grid2.fe.infn.it ( ): icmp_seq=1 ttl=53 time=26.1 ms
64 bytes from grid2.fe.infn.it ( ): icmp_seq=2 ttl=53 time=26.0 ms
--- grid2.fe.infn.it ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = /26.047/26.100/0.070 ms
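A back-of-the-envelope look at what this ~100x round-trip difference can cost a job that reads synchronously; the RTTs come from the ping output above, the number of reads is an illustrative assumption:

    rtt_local = 0.0003    # s, ~0.3 ms to tbn18.nikhef.nl (ping above)
    rtt_remote = 0.026    # s, ~26 ms to grid2.fe.infn.it (ping above)
    n_reads = 10000       # assumed synchronous read round trips per job

    print("local stall:  ~%.0f s" % (n_reads * rtt_local))    # ~3 s
    print("remote stall: ~%.0f s" % (n_reads * rtt_remote))   # ~260 s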
The stager approach: staging all files before the job starts
Advantage: the service (file access) is closer to the user
Drawback: storage on Stoomboot is not sufficient to keep all data for analysis jobs
The stager approach: staging and removing files one after another
Advantage: smaller storage demands
Drawback: the application blocks on I/O, so wallclock time is not reduced
The stager approach: prefetching data and overlapping CPU and I/O
Advantage: wallclock time significantly reduced
Drawback: the job is blocked only at the beginning, while the first file is staged
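A minimal sketch of this third variant, assuming hypothetical copy_remote() and process_file() helpers standing in for the real transfer tool (e.g. rfcp/lcg-cp) and the DaVinci step; a bounded queue keeps the stager a fixed number of files ahead:

    import os, queue, threading

    STAGE_DIR = "/tmp/stage"      # assumed local scratch area on the worker node

    def copy_remote(lfn, dest):
        # Placeholder for the real transfer command; here it only
        # creates an empty file so the sketch runs end to end.
        open(dest, "wb").close()

    def process_file(path):
        # Placeholder for the analysis step reading one staged file.
        pass

    def stager(lfns, staged):
        # Producer: downloads the next file(s) while the consumer computes.
        for lfn in lfns:
            dest = os.path.join(STAGE_DIR, os.path.basename(lfn))
            copy_remote(lfn, dest)
            staged.put(dest)      # blocks once the queue is full
        staged.put(None)          # sentinel: no more files

    def run(lfns):
        os.makedirs(STAGE_DIR, exist_ok=True)
        staged = queue.Queue(maxsize=2)   # stager runs at most 2 files ahead
        threading.Thread(target=stager, args=(lfns, staged), daemon=True).start()
        while True:
            path = staged.get()
            if path is None:
                break
            process_file(path)    # CPU work here overlaps the next transfer
            os.remove(path)       # free local disk immediately after use

    if __name__ == "__main__":
        run(["lfn:/grid/lhcb/data/file_%d.dst" % i for i in range(3)])

With the queue bound set to 2, only a couple of staged files occupy local disk at any moment, which is what keeps the storage demands small while still hiding the transfer time.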
Demo: performance evaluation
- “Feels” like running over local files
- The only “extra” time is due to the staging of the first file
- ~60% overhead of data transfer with rfio (see the worked check below)
- The file stager is insensitive to data locality

Run 1:
                  Stager demo | No stager used (rfio access)
Wallclock time    12 min      | 150 min
CPU time          8.51 min    | 8.57 min
CPU efficiency    70.9%       | 5.7%
Total transfer    8 GB        | 13.1 GB

Run 2:
                  Stager demo | No stager used (rfio access)
Wallclock time    11 min      | 18 min
CPU time          8.35 min    | 11.8 min
CPU efficiency    75.9%       | 65.5%
Total transfer    8 GB        | 13 GB
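A quick check of the quoted overhead figure against the demo numbers:

    useful = 8.0    # GB the job actually needs (staged case)
    rfio = 13.1     # GB moved when reading directly over rfio (run 1)
    print("overhead: %.0f%%" % (100.0 * (rfio / useful - 1)))   # ~64%, quoted as ~60%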
Design of the solution
Open questions for users
- Back-of-the-envelope calculations: what is the processing time of an event?
- Expectations: when is the “optimization” sufficient?
- User friendliness: frustrations?
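As an illustration of the first question, a sketch with an assumed event count (the slides do not give one); only the CPU time comes from the demo results above:

    cpu_min = 8.5        # CPU time of one demo job, from the results above
    n_events = 100000    # assumed event count; NOT taken from the slides
    print("~%.2f ms CPU per event" % (cpu_min * 60.0 * 1000.0 / n_events))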
Stoomboot HW/SW
- 32 worker nodes, each dual quad-core Intel Xeon 2 GHz; 2/3 GB memory/core
- Local disk space ~100 GB
- Scientific Linux CERN 5
- 1 Gbps/10 Gbps network
- Outside users have no access to Stoomboot, but grid files need to be accessed from it