
1 Scalla/xrootd
Andrew Hanushevsky, SLAC
SLAC National Accelerator Laboratory, Stanford University
08-June-10, ANL Tier3 Meeting

2 What is Scalla?
Scalla: Structured Cluster Architecture for Low Latency Access
- Low latency access to data via xrootd servers
  - POSIX-style byte-level random access
  - By default, arbitrary data organized as files
  - Hierarchical directory-like name space
  - Protocol includes high performance features
- Structured clustering provided by cmsd servers
  - Exponentially scalable and self-organizing

3 What People Like About xrootd
- It's really simple and easy to administer
  - Requires basic file system administration knowledge
  - And becoming familiar with xrootd nomenclature
  - No 3rd-party software maintenance (i.e., self-contained)
- Handles heavy loads
  - E.g., >3,000 connections and >10,000 open files
- Resilient and forgiving
  - Failures handled in a natural way
  - Configuration changes can be done in real time (e.g., adding or removing servers)

4 The Basic Building Blocks
[Diagram: an application on a client machine reaching data files via a Linux NFS client/server pair, contrasted with the same application reaching data files via an xroot client and xroot server]
- Alternatively, xrootd is nothing more than an application-level file server & client using another protocol

5 Why Not Just Use NFS?
- NFS V2 & V3 are inadequate
  - Scaling problems with large batch farms
  - Unwieldy when more than one server is needed
- Doesn't NFS V4 support multiple servers? Yes, but…
  - Difficult to maintain a scalable uniform name space
  - Still has single point of failure problems
  - Performance on par with NFS V3

6 NFS & Multiple File Servers
[Diagram: a client runs cp /foo /tmp; the NFS client issuing open("/foo") has to know which of server machine A or B holds the file]
- NFS V4 uses manual referrals to redirect the client
- This still ties part of the name space to a particular server, requiring that you also deploy pNFS to spread data across servers

7 The Scalla Approach
[Diagram: the client runs xrdcp root://R//foo /tmp and issues open("/foo") to the redirector node R; (1) R asks its servers "Who has /foo?", (2) server B answers "I do!", (3) R tells the client "Try B", (4) the client reopens "/foo" on server B]
- The xroot client does all of these steps automatically, without application (user) intervention!

8 File Discovery Considerations
- The redirector does not have a catalog of files
  - It always asks each server, and caches the answers in memory for a "while"
  - So it won't ask again when asked about a past lookup
- Allows real-time configuration changes
  - Clients never see the disruption
- This is optimal for reading (the 80% use-case)
  - The lookup takes << 100 microseconds when files exist
  - Much longer when a requested file does not exist!
  - This is the write case, covered on the following slides

9 Events When File Does Not Exist
[Diagram: the client runs xrdcp root://R//foo /tmp and issues open("/foo") to the redirector; the redirector asks "Who has /foo?" and every server answers "Nope!"]
- The file is deemed not to exist if there is no response after 5 seconds!

10 File Discovery Considerations II
- The system is optimized for the "file exists" case!
  - Penalty for going after missing files
- Aren't new files, by definition, missing?
  - Yes, but that involves writing data!
  - The system is optimized for reading data
  - So, creating a new file may suffer a 5 second delay
- Can minimize the delay by using the xprep command
  - Primes the redirector's file memory cache ahead of time
- Can files appear to be missing any other way?

11 Missing File vs. Missing Server
- In xroot, files exist to the extent servers exist
  - The redirector cushions this effect for 10 minutes
  - Afterwards, the redirector cannot tell the difference
- This allows partially dead server clusters to continue
  - Jobs hunting for "missing" files will eventually die
- But jobs cannot rely on files actually being missing
  - xroot cannot provide a definitive answer to "no server s holds file x"
  - This requires manual safety for file creation

12 Safe File Creation
- Avoiding the basic problem....
  - Today's new file may be on yesterday's dead server
- Generally, do not re-use output file names
- Use temporary file names when creating new files
    xrdcp -f /tmp/myfile root://redirector//path/myfile.temp
    xrd redirector mv /path/myfile.temp /path/myfile
- Enable the xrootd posc feature
  - Files persist only when successfully closed
    xrdcp -f /tmp/myfile root://redirector//path/myfile?posc=t
- This is rarely a problem in practice
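A minimal sketch of the temporary-name pattern above, chained so the rename happens only if the copy succeeded; the redirector name and the /atlas/out path are placeholders, not taken from the slide:

    # copy to a temporary name, then rename into place only on success
    xrdcp -f /tmp/myfile root://redirector//atlas/out/myfile.temp && \
      xrd redirector mv /atlas/out/myfile.temp /atlas/out/myfile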

13 Nomenclature Review
- xrootd: server that provides access to data
- cmsd: server that glues xrootd's together
- Redirector: a special xrootd-cmsd pair
  - The head node that clients always contact
  - Looks just like a data server, but is responsible for directing clients to the correct server
- The xrootd and cmsd are always paired

14 Clustering Mechanics
- Clustering is provided by cmsd processes
  - Oversees the health and name space on each xrootd server
  - Maps file names to the servers that have the file
  - Informs the client, via an xrootd server, about the file's location
  - All done in real time without using any databases
- Each xrootd server process talks to a local cmsd process
  - They communicate over a Unix named (i.e., file system) socket
- Local cmsd's communicate with a manager cmsd elsewhere
  - They communicate over a TCP socket
- Each process has a specific role in the cluster

15 Role Nomenclature
- Manager role
  - Assigned to the head node xrootd-cmsd pair (i.e., the redirector)
  - Keeps track of the path to the file
  - Guides clients to the actual file
  - Decides what server is to be used for a request
- Server role
  - Assigned to data node xrootd-cmsd pairs
  - Keeps track of xrootd utilization and health
  - Reports statistics to the manager
  - Provides the actual data service

16 Example of a Simple Cluster
[Diagram: a manager node x.slac.stanford.edu and data server nodes a.slac.stanford.edu and b.slac.stanford.edu, each running an xrootd-cmsd pair; clients connect to the manager node]
Configuration file (the vdt config makes it), the same on every node:
    all.role server
    all.role manager if x.slac.stanford.edu
    all.manager x.slac.stanford.edu 1213
Note: all processes can be started in any order!
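A hedged sketch of starting the pair on one node by hand, illustrating that the processes can come up in any order; the -c (configuration file) and -l (log file) options are the usual ones, but the file locations are placeholders, and in practice the VDT/Start scripts mentioned later wrap this up for you:

    # on each node, as the unprivileged xrootd user, in any order
    xrootd -c /opt/xrootd/etc/xrootd.cf -l /var/adm/xrootd/logs/xrootd.log &
    cmsd   -c /opt/xrootd/etc/xrootd.cf -l /var/adm/xrootd/logs/cmsd.log &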

17 Preparing To Build A Cluster
- Select an unprivileged user to run xrootd and cmsd
  - This user will own mount points and administrative paths
- Decide exported file-system mount point(s)
  - Much easier if it's the same for all servers
- Decide where admin files will be written
  - Log and core files (for example, /var/adm/xrootd/logs)
  - Other special files (for example, /var/adm/xrootd/admin)
- Decide where the software will be installed
  - Should be the same everywhere (for example, /opt/xrootd)
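A minimal preparation sketch using the example paths from this slide; the account name "xrootd" and the exact useradd/mkdir/chown steps are assumptions about how a site might do this, not prescribed by the slide:

    # as root, once per node
    useradd -r xrootd                                     # unprivileged service account (assumed name)
    mkdir -p /var/adm/xrootd/logs /var/adm/xrootd/admin   # log/core files and special files
    mkdir -p /opt/xrootd                                  # software install location
    chown -R xrootd:xrootd /var/adm/xrootd /opt/xrootd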

18 Digression on Mount Points
- Easy if you have a single file system per node
- More involved when you have more
  - The solution is to use the oss linked file system (more later)
- You can simplify your life by…
  - Creating a single directory in '/', say /xrootd
  - Mounting all xrootd-handled file systems there (e.g., /xrootd/fs01, /xrootd/fs02, etc.)
- You can make them all available with a single directive:
    oss.cache public /xrootd/* xa

19 Multiple File Systems (the oss linked file system)
- The oss allows you to aggregate partitions
  - Each partition is mounted as a separate file system
  - An exported path can refer to all the partitions
  - The oss automatically handles it by creating symlinks
  - A file name in /atlas is a symlink to a real file in /xrootd/fs01 or /xrootd/fs02
[Diagram: /atlas, the file system used to hold exported file paths, contains symlinks into the mounted partitions /xrootd/fs01 and /xrootd/fs02, which hold the file data]
    oss.cache public /xrootd/* xa
    all.export /atlas
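A hedged illustration of what this looks like on disk, with hypothetical file names; the exact symlink targets the oss generates are an internal detail, so the right-hand sides below are only meant to convey the idea:

    # directives on each data server (from the slide)
    all.export /atlas
    oss.cache public /xrootd/* xa

    # resulting name space: exported paths are symlinks into the partitions
    /atlas/user/data1.root -> /xrootd/fs01/...   # file data lives on partition fs01
    /atlas/user/data2.root -> /xrootd/fs02/...   # another file lands on fs02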

20 Building A Simple Cluster
- Identify one* manager node and up to 64 data servers
- Verify that the prep work has been properly done
  - Directories created, file systems mounted, etc.
  - All of these directories must be owned by the xrootd user
  - This setup normally requires root privileges, but….
  - Root privileges are really not needed after this point
- Create a configuration file describing the setup above
  - Always the same one for each node
  - vdt tools automate the configuration build
- Install the same software & config file on each node
* Can have any number.

21 Running A Simple Cluster
- The system runs as an unprivileged user
  - The exported file system & special directory owner
  - Note: xrootd and cmsd refuse to run as user root
- Start an xrootd-cmsd pair on each node
  - Can use the startup/shutdown scripts provided: StartXRD, StopXRD, StartCMS, StopCMS
  - The StartXRD.cf provides the location of various programs
  - Good to add a cron job scout to restart the pair (sketch below)
- May want to replicate the head node
  - While simple, it requires some thought
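A hedged sketch of the "cron job scout" idea; the StartXRD/StartCMS scripts are named on the slide, but the script path, schedule, and restart logic below are assumptions about how a site might wire them into cron:

    # crontab entry for the xrootd service user (example schedule: every 5 minutes)
    */5 * * * * /opt/xrootd/etc/check_xrd.sh

    # /opt/xrootd/etc/check_xrd.sh -- hypothetical helper: restart a daemon if it is not running
    #!/bin/sh
    pgrep -x xrootd >/dev/null || /opt/xrootd/bin/StartXRD
    pgrep -x cmsd   >/dev/null || /opt/xrootd/bin/StartCMS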

22 Full Cluster Overview
[Diagram: an xrootd cluster of machines, each running an xrootd (X) and cmsd (C) pair plus a redirector; clients use the xrootd protocol for random I/O, while a Globus GridFTP server, with or without xrootdFS/FUSE, provides a grid protocol for sequential bulk I/O]
- Supports >200K data servers
- An xrootd-cmsd pair plus a redirector is the minimum for a cluster

23 Getting to xrootd hosted data
- Via the root framework
  - Automatic when files are named root://....
  - Manually, use the TXNetFile() object
  - Note: an identical TFile() object will not work with xrootd!
- xrdcp: the native copy command
- POSIX preload library: allows POSIX-compliant applications to use xrootd
- xprep (speeding access to files)
- gridFTP
- FUSE & xrootdFS
  - Linux only: an xrootd cluster as a mounted file system
[The slide groups these into a native set, simple additions, and a full grid set]

24 The Native Copy Command
    xrdcp [options] source dest
- Copies data to/from xrootd servers
- Some handy options:
  - -f     erase dest before copying source
  - -s     stealth mode (i.e., produce no status messages)
  - -S n   use n parallel streams (use only across a WAN)
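A hedged usage sketch built from the options listed above; the redirector names and file paths are placeholders:

    # pull a file from the cluster into the local /tmp
    xrdcp root://redirector.example.org//atlas/data/file1.root /tmp/file1.root

    # push a local file across a WAN with 4 parallel streams, overwriting any existing copy
    xrdcp -f -S 4 /tmp/file1.root root://remote.example.org//atlas/data/file1.root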

25 Speeding Access To Files
    xprep [options] host[:port] [path [...]]
- Prepares xrootd access via redirector host:port
- Minimizes wait time if you are creating many files
- Some handy options:
  - -w      file will be created or written
  - -f fn   file fn holds a list of paths, one per line
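A hedged usage sketch following the option descriptions above; the redirector name and paths are placeholders, and the port 1094 is only assumed to be the default xrootd port:

    # tell the redirector that these output files are about to be created
    xprep -w redirector.example.org:1094 /atlas/out/run1.root /atlas/out/run2.root

    # or prime it from a list of paths, one per line
    xprep -w -f /tmp/output-list.txt redirector.example.org:1094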

26 Providing Transparent Access
- POSIX preload library (libXrdPosixPreload.so)
  - Works with any POSIX I/O compliant program
  - Provides direct access to xrootd hosted data
- Does not need any changes to the application
  - Just run the binary as is
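A hedged sketch of how a preload library of this kind is typically used on Linux; the library location under /opt/xrootd/lib is an assumption based on the install path used earlier in this deck, and the redirector name and path are placeholders:

    # run an unmodified POSIX application against xrootd-hosted data
    LD_PRELOAD=/opt/xrootd/lib/libXrdPosixPreload.so \
        cat root://redirector.example.org//atlas/data/file1.root > /tmp/file1.root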

27 The Optional Additions
- FUSE
  - Allows a file system view of a cluster
  - Good for managing the name space
  - Not so good for actual data access (overhead)
  - People usually mount it on an admin node
- GridFTP
  - Bulk transfer program
  - Can use on top of FUSE or with the preload library

28 Using FUSE
- The redirector is mounted as a file system
  - Say, "/xrootd" on an interactive node
- Use this for typical operations
  - dq2-get, dq2-put, dq2-ls
  - mv, rm, etc.
- Can also use it as a base for GridFTP
  - Though the POSIX preload library works as well

29 Using GridFTP
[Diagram: the client's open(/a/b/c) first contacts a standard GSIftp server, which uses the preload library to reach a standard xrootd cluster for the subsequent data access]
- FTP servers can be firewalled and replicated for scaling
- The preload library can be replaced by FUSE

30 Typical Setup
[Diagram: a basic xrootd cluster (a manager node plus data nodes, each an xrootd-cmsd pair) with a dq2get node running gridFTP on top of the POSIX preload library or xrootdFS/FUSE]
- Basic xrootd cluster + dq2get node (gridFTP + POSIX preload lib, or gridFTP + FUSE) = simple grid access
- Even more effective if using a VMSS

31 Other Good Things To Consider
- Security
- Automatic server inventory
- Summary monitoring
  - In addition to the VDT Gratia Storage Probe
- Opportunistic clustering
- Expansive clustering

32 Security
- Normally r/o access needs no security
  - Allowing writes, however, does
- The VDT configuration defines a basic security model
  - Unix uid based (essentially like NFS)
  - Everyone has r/o access
  - Special users have r/w access
- The actual model can be arbitrarily complicated
  - Best to stick with the simplest one that works
- Privilege definitions reside in a special file

33 Security Example
- Security needs to be configured; much of this is done by the VDT configuration:
    xrootd.seclib /opt/xrootd/lib/libXrdSec.so
    sec.protocol unix
    ofs.authlib /opt/xrootd/authfile
    ofs.authorize
- Authorizations need to be declared, placed in /opt/xrootd/authfile (as stated above):
    u *    /atlas r
    u myid /atlas a

34 Simple Server Inventory (SSI)
- A central file inventory of each data server
  - Good for basic sites needing a server inventory
- The inventory is normally maintained on each redirector
  - Automatically recreated when lost
  - Updated using rolling log files
  - Effectively no performance impact
- "cns_ssi update" merges all of the inventories
  - Flat text file format: LFN, mode, physical partition, size, space token
- The "cns_ssi list" command provides formatted output (sketch below)
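A hedged sketch of how the two commands named above might be invoked; the inventory directory argument is an assumption for illustration only, so consult the cns_ssi documentation for the exact command-line syntax:

    # merge the per-server rolling logs into the central inventory
    cns_ssi update /var/adm/xrootd/cns      # hypothetical inventory path

    # print the merged inventory (LFN, mode, partition, size, space token)
    cns_ssi list /var/adm/xrootd/cns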

35 Summary Monitoring
- Needed information in almost any setting
- Xrootd can auto-report summary statistics
  - Specify the xrd.report configuration directive
  - Data sent to one or two locations
- Use the provided mpxstats as the feeder program
  - Multiplexes streams and parses the xml into key-value pairs
- Pair it with any existing monitoring framework
  - Ganglia, GRIS, Nagios, MonALISA, and perhaps more

36 Summary Monitoring Setup
[Diagram: data servers report to a monitoring host running mpxstats, which feeds ganglia]
- On each data server:
    xrd.report monhost:1999 all every 15s
- On the monitoring host: mpxstats listens on monhost:1999 and feeds ganglia
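A hedged sketch of the monitoring-host side; the mpxstats flags shown (listen port and output format) are assumptions about its command line, and the pipe into a ganglia feeder script is purely illustrative:

    # listen on the port named in xrd.report, flatten the xml into key-value pairs,
    # and hand the stream to whatever feeds the local monitoring framework
    mpxstats -p 1999 -f flat | /opt/monitoring/ganglia-feeder.sh   # hypothetical feeder script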

37 Opportunistic Clustering
- Xrootd is extremely efficient with machine resources
  - Ultra-low CPU usage with a memory footprint of roughly 20-80 MB
  - Ideal to cluster just about anything (e.g., proof clusters)
[Diagram: a clustered storage system leveraging batch node disks — a redirector in front of dedicated file servers plus batch nodes, each batch node running jobs alongside an xrootd-cmsd pair]

38 Opportunistic Clustering Caveats
- Using batch worker node storage is problematic
  - Storage services must compete with actual batch jobs
  - At best, this may lead to highly variable response time
  - At worst, it may lead to erroneous redirector responses
- Additional tuning will be required
  - Normally need to renice the cmsd and xrootd (sketch below)
    As root: renice -n -10 -p cmsd_pid
    As root: renice -n -5 -p xroot_pid
- You must not overload the batch worker node
  - Especially true if exporting local work space
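A hedged sketch of applying the renice advice above on a batch node; using pgrep to find the process IDs is an assumption about how a site might script it:

    # as root, after the pair has started
    renice -n -10 -p $(pgrep -x cmsd)     # give the cluster manager priority
    renice -n -5  -p $(pgrep -x xrootd)   # and the data server slightly less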

39 Expansive Clustering
- Xrootd can create ad hoc cross-domain clusters
  - Good for easily federating multiple sites
  - This is the ALICE model of data management
- Provides a mechanism for "regional" data sharing
  - Get missing data from close by before using dq2get
  - The architecture allows this to be automated & demand driven
- This implements a Virtual Mass Storage System

40 Virtual Mass Storage System
[Diagram: xrootd-cmsd clusters at SLAC, UTA, and UOM federate under a meta manager at BNL; root://atlas.bnl.gov/ includes the SLAC, UOM, and UTA xroot clusters]
- Meta manager (BNL) configuration:
    all.role meta manager
    all.manager meta atlas.bnl.gov:1312
- Site manager (e.g., SLAC) configuration:
    all.manager meta atlas.bnl.gov:1312
    all.role manager
- Meta managers can be geographically replicated!
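For concreteness, a hedged sketch of how a participating site might combine its local cluster configuration (slide 16) with the meta-manager subscription shown above; whether the directives combine exactly this way can depend on the xrootd release, so treat this as illustrative only:

    # same file on every node at the site
    all.export /atlas
    all.role server
    all.role manager if x.slac.stanford.edu
    all.manager x.slac.stanford.edu 1213
    # subscribe the site's manager to the BNL meta manager
    all.manager meta atlas.bnl.gov:1312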

41 What's Good About This?
- Fetch missing files in a timely manner
  - Revert to dq2get when a file is not in the regional cluster
- Sites can participate in an ad hoc manner
  - The cluster manager sorts out what's available
- Can use R/T WAN access when appropriate
- Can significantly increase the WAN xfer rate
  - Using multiple-source torrent-style copying

42 Expansive Clustering Caveats
- Federation & globalization are easy if....
  - Federated servers are not blocked by a firewall
  - No ALICE xroot servers are behind a firewall
- There are alternatives....
  - Implement firewall exceptions
    - Need to fix all server ports
  - Use proxy mechanisms
    - Easy for some services, more difficult for others
- All of these have been tried in various forms
  - The site's specific situation dictates the appropriate approach

43 In Conclusion...
- Xrootd is a lightweight data access system
  - Suitable for resource-constrained environments (human as well as hardware)
  - Geared specifically for efficient data analysis
- Supports various clustering models
  - E.g., PROOF, batch node clustering and WAN clustering
- Has the potential to greatly simplify Tier 3 deployments
- Distributed as part of the OSG VDT
  - Also part of the CERN root distribution
- Visit http://xrootd.slac.stanford.edu/

44 Acknowledgements
- Software contributors
  - Alice: Derek Feichtinger
  - CERN: Fabrizio Furano, Andreas Peters
  - Fermi: Tony Johnson (Java)
  - Root: Gerri Ganis, Bertrand Bellenet, Fons Rademakers
  - STAR/BNL: Pavel Jackl
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger
  - BeStMan LBNL: Alex Sim, Junmin Gu, Vijaya Natarajan (BeStMan team)
- Operational collaborators: BNL, FZK, IN2P3, RAL, UVIC, UTA
- Partial funding: US Department of Energy Contract DE-AC02-76SF00515 with Stanford University

