Xrootd Present & Future: The Drama Continues
Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
HEPiX, 13 October 2005

Outline
- The state of performance
  - Single server
  - Clustered servers
- The SRM Debate
- The Next Big Thing
- Conclusion

Application Design Point
- Complex, embarrassingly parallel analysis
  - Determine particle decay products
- 1000's of parallel clients hitting the same data
- Small-block, sparse, random access
  - Median request size < 3 KB
  - Uniform seeks across the whole file (mean file size 650 MB)
  - Only about 22% of the file is read (mean 140 MB)

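To make this access pattern concrete, here is a minimal C++ sketch of a client that issues small, uniformly distributed random reads until roughly 22% of the file has been touched. The file path, RNG seed, and fixed 2 KB request size are illustrative assumptions chosen to match the medians above; this is not BetaMiniApp.

```cpp
// Sketch of the sparse random-access pattern described above.
// Assumed parameters: 650 MB file, ~2 KB reads, stop after ~22% of the file.
#include <cstdio>
#include <fcntl.h>
#include <random>
#include <unistd.h>
#include <vector>

int main() {
    const off_t  fileSize  = 650LL * 1024 * 1024;   // mean file size from the slide
    const size_t readSize  = 2048;                  // median request size < 3 KB
    const off_t  bytesGoal = fileSize * 22 / 100;   // ~22% of the file is read

    int fd = open("/data/mini/events.root", O_RDONLY);  // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<long long> offset(0, (long long)(fileSize - readSize));
    std::vector<char> buf(readSize);

    off_t bytesRead = 0;
    while (bytesRead < bytesGoal) {                 // uniform seeks across the whole file
        ssize_t n = pread(fd, buf.data(), readSize, (off_t)offset(rng));
        if (n <= 0) break;
        bytesRead += n;
    }
    close(fd);
    return 0;
}
```
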
Performance Measurements
- Goals
  - Very low latency
  - Handle many parallel clients
- Test setup
  - Sun V20z: dual 1.86 GHz Opteron, 2 GB RAM
  - 1 Gb on-board Broadcom NIC (same subnet)
  - Solaris 10 x86 and Linux RHEL (ELsmp)
  - Client running BetaMiniApp with the analysis code removed

Latency Per Request (xrootd)
  [chart]

Capacity vs. Load (xrootd)
  [chart]

xrootd Server Scaling
- Linear scaling relative to load
  - Allows deterministic sizing of the server: disk, NIC, CPU, memory
- Performance tied directly to hardware cost
- Competitive with best-in-class commercial file servers

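Because scaling is linear, sizing reduces to simple arithmetic: the supportable client count is whichever resource saturates first divided by the per-client demand. The rates below are illustrative assumptions, not measurements from this talk.

```cpp
// Back-of-the-envelope server sizing under linear scaling.
// All rates are assumed, illustrative numbers.
#include <algorithm>
#include <cstdio>

int main() {
    const double nicMBps   = 110.0;  // usable payload of a 1 Gb NIC (assumed)
    const double diskMBps  = 40.0;   // random-read throughput of the disk subsystem (assumed)
    const double perClient = 0.5;    // MB/s demanded by one sparse-access client (assumed)

    const double byNic  = nicMBps  / perClient;
    const double byDisk = diskMBps / perClient;
    std::printf("supportable clients: %.0f (limited by %s)\n",
                std::min(byNic, byDisk), byDisk < byNic ? "disk" : "NIC");
    return 0;
}
```
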
OS Impact on Performance
  [chart]

Device & Filesystem Impact
  [chart: CPU-limited vs. I/O-limited regions; 1 event ≈ 2 KB]
- UFS good on small reads
- VxFS good on big reads

Overhead Distribution
  [chart]

Network Overhead Dominates
  [chart]

Xrootd Clustering (SLAC)
  [diagram: client machines; data servers kan01, kan02, kan03, kan04 ... kanxx; redirectors bbr-olb03 and bbr-olb04 (kanolb-a); hidden details omitted]

Clustering Performance
- Design can scale to at least 256,000 servers
  - SLAC runs a 1,000-node test server cluster
  - BNL runs a 350-node production server cluster
- Self-regulating (via a minimal spanning tree algorithm)
  - 280 nodes self-cluster in about 7 seconds
  - 890 nodes self-cluster in about 56 seconds
- Client overhead is extremely low
  - Overhead is added only to meta-data requests (e.g., open): ~200 µs × log64(number of servers) / 2
  - Zero overhead for I/O

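A quick check of the open() overhead estimate above: the ~200 µs per-hop figure is from the slide, and the server counts chosen here are powers of 64 (matching the log64 in the formula), so ~256K servers cost about three hops.

```cpp
// Worked example of the meta-data overhead estimate on this slide:
//   overhead ≈ 200 µs × log64(servers) / 2, applied only to requests like open().
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double perHopUs = 200.0;                      // figure from the slide
    for (double servers : {64.0, 4096.0, 262144.0}) {   // 64^1, 64^2, 64^3 (~256K servers)
        const double overheadUs =
            perHopUs * (std::log(servers) / std::log(64.0)) / 2.0;
        std::printf("%8.0f servers -> ~%3.0f us added per open\n", servers, overheadUs);
    }
    return 0;
}
```
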
Cluster Fault Tolerance
- Servers and resources may come and go
  - New servers can be added or removed at any time
  - Files can be moved around in real time
  - Clients simply adjust to the new configuration
- Client-side interface handles the recovery protocol
  - Uses the real-time client steering protocol
  - Can be used to perform reactive client scheduling
  - Any volunteers for the bleeding edge?

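The recovery protocol lives entirely inside the client-side interface, so applications never see it. Purely as an illustration of the idea, here is a schematic follow-the-redirect / retry loop; the Result codes, host names, and the mock tryOpen() are invented for the example and are not the actual xrootd client API.

```cpp
// Schematic sketch of client-side recovery: follow redirections and retry when
// a server goes away. All names and the mock tryOpen() are illustrative only.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

enum class Result { Ok, Redirect, Retry, Fatal };
struct Response { Result code; std::string nextHost; };

// Stand-in for one network round trip to a redirector or data server.
Response tryOpen(const std::string& host, const std::string& path, int attempt) {
    (void)path;                                                   // a real request would carry it
    if (attempt == 0) return {Result::Redirect, "kan07"};         // redirector points at a server
    if (attempt == 1) return {Result::Retry, host};               // server busy or just removed
    return {Result::Ok, host};                                    // data server finally answers
}

bool openWithRecovery(std::string host, const std::string& path) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        const Response r = tryOpen(host, path, attempt);
        switch (r.code) {
          case Result::Ok:
              std::printf("opened %s on %s\n", path.c_str(), host.c_str());
              return true;
          case Result::Redirect:                                  // go where the redirector says
              host = r.nextHost;
              break;
          case Result::Retry:                                     // back off, then ask again
              std::this_thread::sleep_for(
                  std::chrono::milliseconds(100 << std::min(attempt, 6)));
              break;
          case Result::Fatal:
              return false;
        }
    }
    return false;
}

int main() { return openWithRecovery("kanolb-a", "/store/mini/events.root") ? 0 : 1; }
```
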
Current MSS Support
- Lightweight, agnostic interfaces provided
  - oss.mssgwcmd command
    - Invoked for each create, dirlist, mv, rm, and stat
  - oss.stagecmd |command
    - Long-running command using a request-stream protocol
    - Used to populate the disk cache (i.e., "stage-in")
  [diagram: xrootd (oss layer) -> mssgwcmd / stagecmd -> MSS]

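As an illustration of the stagecmd style of hook, here is a minimal long-running helper that reads one stage-in request per line on stdin and populates the disk cache. The line-oriented request and reply formats and the hpss_get transfer command are assumptions made for the sketch, not the actual oss protocol.

```cpp
// Minimal sketch of a long-running stage-in helper of the kind oss.stagecmd
// could launch. Request/reply formats and the transfer command are assumed.
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    std::string path;
    while (std::getline(std::cin, path)) {          // one stage-in request per line
        if (path.empty()) continue;

        // Hypothetical transfer: pull the file from the MSS into the cache area.
        const std::string cmd = "hpss_get " + path + " /cache" + path;
        const int rc = std::system(cmd.c_str());

        // Report completion back to the oss layer on stdout (format assumed).
        std::cout << (rc == 0 ? "OK " : "FAIL ") << path << std::endl;
    }
    return 0;
}
```
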
Future Leaf Node SRM
- The MSS interface is an ideal spot for an SRM hook
  - Use the existing hooks (mssgwcmd & stagecmd) or a new long-running hook: oss.srm |command
    - Processes external disk cache management requests
  - Should scale quite well
  [diagram: xrootd (oss layer) -> srm -> MSS and Grid]

BNL/LBL Proposal
  [diagram: xrootd alongside the generic, standard LBL SRM components (srm, drm, das) and BNL's Replica Registration Service & DataMover (dm, rc Replica Services)]

Alternative Root Node SRM
- Team the olbd with an SRM
  - File management & discovery
  - Tight management control
- Several issues need to be considered
  - Introduces many new failure modes
  - Will not generally scale
  [diagram: olbd (root node) -> srm -> MSS and Grid]

SRM Integration Status
- Unfortunately, the SRM interface is in flux
  - Heavy vs. light protocol
- Working with the LBL team
  - Working towards an OSG-sanctioned future proposal
- Trying to use the Fermilab SRM
  - Artem Trunov at IN2P3 is exploring the issues

The Next Big Thing
- High-performance data access servers plus efficient large-scale clustering
- Allows novel, cost-effective, super-fast massive storage
  - Optimized for sparse random access
- Imagine 30 TB of DRAM at commodity prices

Device Speed Delivery
  [chart]

Memory Access Characteristics
  [chart]
- Server: zsuntwo
- CPU: SPARC
- NIC: 100 Mb
- OS: Solaris 10
- Filesystem: standard UFS

The Peta-Cache
- Cost-effective memory access impacts science
  - It is the nature of all random-access analysis
  - Not restricted to just High Energy Physics
- Enables faster and more detailed analysis
  - Opens new analytical frontiers
- Have a 64-node test cluster
  - V20z nodes, each with 16 GB RAM
  - A 1 TB "toy" machine

Conclusion
- High-performance data access systems are achievable
  - The devil is in the details
  - Must understand the processing domain and deployment infrastructure
  - Requires a comprehensive, repeatable measurement strategy
- High performance and clustering are synergistic
  - Allows unique performance, usability, scalability, and recoverability characteristics
  - Such systems produce novel software architectures
- Challenges
  - Creating application algorithms that can make use of such systems
- Opportunities
  - Fast, low-cost access to huge amounts of data to speed discovery

Acknowledgements
- Fabrizio Furano, INFN Padova
  - Client-side design & development
- Bill Weeks
  - Performance measurement guru
  - 100's of measurements repeated 100's of times
- US Department of Energy
  - Contract DE-AC02-76SF00515 with Stanford University
- And our next mystery guest!