
Scalla In’s & Out’s (xrootd/cmsd)
Andrew Hanushevsky, SLAC National Accelerator Laboratory
OSG Administrator’s Workshop, Stanford University/SLAC, 13-November-08

Slide 2: Goals
A good understanding of…
- The xrootd structure
- Clustering & the cmsd
- How configuration directives apply
- Cluster interconnections, and how it really works
- The oss storage system & the CacheFS
- SRM & Scalla: the position of FUSE, xrootdFS, and cnsd
- The big picture

Slide 3: What is Scalla?
- SCA + LLA: Structured Cluster Architecture for Low Latency Access
- Low latency access to data via xrootd servers
  - The protocol includes high-performance features
- Structured clustering provided by cmsd servers
  - Exponentially scalable and self-organizing

Slide 4: What is xrootd?
- A specialized file server
  - Provides access to arbitrary files
  - Allows reads/writes with offset/length
  - Think of it as a specialized NFS server
- Then why not use NFS?
  - It does not scale well
  - It can’t map a single namespace onto all the servers
  - All xrootd servers can be clustered to look like “one” server

Slide 5: The xrootd Server (diagram)
The layers of an xrootd server process: Process Manager, Protocol Implementation, Logical File System, Physical Storage System, and Clustering Interface.

Slide 6: How Is xrootd Clustered?
- By a management service provided by cmsd processes
  - Oversees the health and name space on each xrootd server
  - Maps file names to the servers that have the file
  - Informs the client, via an xrootd server, about the file’s location
  - All done in real time without using any databases
- Each xrootd server process talks to a local cmsd process
  - They communicate over a Unix named (i.e., file system) socket
- Local cmsd’s communicate with a manager cmsd elsewhere
  - They communicate over a TCP socket
- Each process has a specific role in the cluster

Slide 7: xrootd & cmsd Relationships (diagram)
The clustering interface of the xrootd process connects to the local cmsd process on the same server, which in turn connects to a manager cmsd elsewhere.

Slide 8: How Are The Relationships Described?
- Relationships are described in a configuration file
  - You normally need only one such file for all servers
  - But all servers need such a file
- The file tells each component its role & what to do
  - Done via component-specific directives, one line per directive:
      component_name.directive [ parameters ]
  - The component name says who the directive applies to; the directive says what to do
  - Component names: all | acc | cms | sec | ofs | oss | xrd | xrootd
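
A minimal configuration sketch of the directive format just described; it uses only directives that appear later in these slides, and the paths shown are hypothetical.

    # Every line is: component_name.directive [ parameters ]
    all.role server                          # "all." directives apply to every component
    all.export /atlas                        # export a path (hypothetical path)
    xrootd.fslib /opt/xrootd/lib/XrdOfs.so   # "xrootd." directives apply only to xrootd (hypothetical path)
    oss.localroot /myfs                      # "oss." directives apply only to the storage system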

Slide 9: Directives versus Components (diagram)
Each component recognizes its own directive prefix: xrd., xrootd., ofs., oss., cms., and all. (for every component). For example, xrootd.fslib /…/XrdOfs.so is placed in the configuration file and applies to the xrootd component.

Slide 10: Where Can I Learn More?
- Start with “Scalla Configuration File Syntax”
- System-related parts have their own manuals:
  - Xrd/XRootd Configuration Reference: describes the xrd. and xrootd. directives
  - Scalla Open File System & Open Storage System Configuration Reference: describes the ofs. and oss. directives
  - Cluster Management Service Configuration Reference: describes the cms. directives
- Every manual tells you when you must use all.

Slide 11: The Bigger Picture (diagram)
Three nodes, each running an xrootd and a cmsd process: manager node x.slac.stanford.edu and data server nodes a.slac.stanford.edu and b.slac.stanford.edu. Which one do clients connect to?
Configuration file (the same one on every node):
    all.role server
    all.role manager if x.slac.stanford.edu
    all.manager x.slac.stanford.edu 1213
Note: all processes can be started in any order!
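
A commented version of that shared configuration file, as a sketch; the host names and port are the ones shown on the slide.

    # One configuration file, used unchanged on every node
    all.role server                              # by default this node serves data
    all.role manager if x.slac.stanford.edu      # ...except on x, which becomes the manager
    all.manager x.slac.stanford.edu:1213         # every node must know where the manager cmsd listens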

Slide 12: Then How Do I Get To A Server?
- Clients always connect to the manager’s xrootd
  - Clients think this is the right file server
  - But the manager only pretends to be a file server
  - Clients really don’t know the difference
- The manager finds out which server has the client’s file
- Then magic happens…

Slide 13: The Magic Is Redirection! (diagram)
The client sends open(“/foo”) to the manager node x.slac.stanford.edu. The manager’s cmsd asks the data server cmsd’s “Have /foo?”; node a.slac.stanford.edu answers “I have /foo!”. The manager tells the client “Go to a”, the client re-issues open(“/foo”) to a.slac.stanford.edu, and reads /foo from there.

Slide 14: Request Redirection
- Most requests are redirected to the “right” server
  - Provides point-to-point I/O
- Redirection for existing files takes a few milliseconds the 1st time
  - Results are cached; subsequent redirection is done in microseconds
- Allows load balancing
  - Many options; see the cms.perf & cms.sched directives
- Cognizant of failing servers
  - Can automatically choose another working server; see the cms.delay directive
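
An illustrative tuning sketch for the directives named above. The option names and values are assumptions drawn from my reading of the Cluster Management Service Configuration Reference, not from these slides, so check that manual before copying them.

    # Hypothetical load-balancing and failover tuning (values are illustrative)
    cms.sched cpu 50 space 50     # assumed: weight server selection by CPU load and free space
    cms.delay servers 1 startup 90  # assumed: minimum working servers and startup grace period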

Slide 15: Pause For Some Terminology
- Manager: the processes whose assigned role is “manager” (all.role manager)
  - Typically this is a distinguished node
- Redirector: the xrootd process on the manager’s node
- Server: the processes whose assigned role is “server” (all.role server)
  - This is the end-point node that actually supplies the file data

Slide 16: How Many Managers Can I Have?
- Up to eight, but usually you’ll want only two
  - Avoids single-point hardware and software failures
- Redirectors automatically cross-connect to all of the manager cmsd’s
- Servers automatically connect to all of the manager cmsd’s
- Clients randomly pick one of the working manager xrootd’s
- Redirectors algorithmically pick one of the working cmsd’s
  - Allows you to load balance manager nodes if you wish; see the all.manager directive
- This also allows you to do serial restarts
  - Eases administrative maintenance
- The cluster goes into safe mode if all the managers die or if too many servers die

Slide 17: A Robust Configuration (diagram)
Two central manager nodes (the redirectors), x.slac.stanford.edu and y.slac.stanford.edu, plus data server nodes a.slac.stanford.edu and b.slac.stanford.edu, each running an xrootd and a cmsd.
Configuration file:
    all.role server
    all.role manager if x.slac.stanford.edu
    all.manager x.slac.stanford.edu:1213
    all.role manager if y.slac.stanford.edu
    all.manager y.slac.stanford.edu:1213

Slide 18: How Do I Handle Multiple Managers?
- Ask your network administrator to assign the manager IP addresses (x.domain.edu, y.domain.edu) to a common host name, xy.domain.edu
  - Make sure that DNS load balancing does not apply!
- Use xy.domain.edu everywhere instead of x or y
  - root://xy.domain.edu// instead of root://x.domain.edu,y.domain.edu//
  - The client will choose one of x or y
- In the configuration file do one of the following:
    all.manager x.domain.edu:1213
    all.manager y.domain.edu:1213
  or
    all.manager xy.domain.edu+:1213
- Don’t forget the plus!
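
A short sketch contrasting the two equivalent forms above; the comment on what the plus does reflects my reading of the all.manager directive and should be checked against the Cluster Management Service Configuration Reference.

    # Either list each manager explicitly...
    all.manager x.domain.edu:1213
    all.manager y.domain.edu:1213
    # ...or name the multi-address host once; the trailing "+" tells the cmsd to treat
    # every IP address registered under xy.domain.edu as a manager (assumed semantics)
    all.manager xy.domain.edu+:1213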

Slide 19: A Quick Recapitulation
- The system is highly structured
  - Server xrootd’s provide the data
  - Manager xrootd’s provide the redirection
  - The cmsd’s manage the cluster: they locate files and monitor the health of all the servers
- Clients initially contact a redirector
  - They are then redirected to a data server
- The structure is described by the config file
  - Usually the same one is used everywhere

Slide 20: Things You May Want To Do
- Automatically restart failing processes
  - Best done via a crontab entry running a restart script
  - Most people use root, but you can use the xrootd/cmsd’s uid
- Renice server cmsd’s
  - As root: renice -n -10 -p cmsd_pid
  - Allows the cmsd to get CPU even when the system is busy
  - Can be automated via the start-up script
  - One reason why most people use root for start/restart

Slide 21: Things You Really Need To Do
- Plan for log and core file management
  - e.g., /var/adm/xrootd/core & /var/adm/xrootd/logs
  - Log rotation can be automated via command line options
- Override the default administrative path
  - See the all.adminpath directive
  - This is the place where Unix named sockets are created
  - /tmp is the (bad) default; consider using /var/adm/xrootd/admin
- Plan on configuring your storage space & SRM
  - These are xrootd-specific ofs & oss options
  - SRM requires you to run FUSE, cnsd, and BestMan
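
A minimal sketch of the administrative-path override suggested above; the directory is the one suggested on the slide and must already exist and be writable by the xrootd/cmsd user.

    # Keep Unix named sockets out of /tmp
    all.adminpath /var/adm/xrootd/admin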

Slide 22: Server Storage Configuration (all.export, oss.cache, oss.usage)
The questions to ask…
- What paths do I want to export (i.e., make available)?
- Will I have more than one file system on the server?
- Will I be providing SRM access?
- Will I need to support SRM space tokens?

Slide 23: Exporting Paths
- Use the all.export directive
  - Used by xrootd to allow access to exported paths
  - Used by cmsd to search for files in exported paths
- Many options are available
  - r/o and r/w are the two most common
- Refer to the manual: Scalla Open File System & Open Storage System Configuration Reference
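
An illustrative export sketch using hypothetical paths; only the r/o and r/w options mentioned on the slide are shown.

    all.export /atlas r/w      # hypothetical read/write export
    all.export /archive r/o    # hypothetical read-only export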

Slide 24: But My Exports Are Mounted Elsewhere!
- A common issue
  - Say you need to mount your file system on /myfs
  - But you want to export /atlas within /myfs
- What to do? Use the oss.localroot directive
  - Only the oss component needs to know about this:
      oss.localroot /myfs
      all.export /atlas
  - This makes /atlas a visible path but internally always prefixes it with /myfs
  - So open(“/atlas/foo”) actually opens “/myfs/atlas/foo”

Slide 25: Multiple File Systems (the oss CacheFS)
- The oss allows you to aggregate partitions
  - Each partition is mounted as a separate file system
  - An exported path can refer to all the partitions
- The oss automatically handles it by creating symlinks
  - A file name in /atlas is a symlink to an actual file in /mnt1 or /mnt2
  - The mounted partitions (/mnt1, /mnt2) hold the file data; the /atlas file system holds the exported file paths
- Configuration:
    oss.cache public /mnt1 xa
    oss.cache public /mnt2 xa
    all.export /atlas

Slide 26: OSS CacheFS Logic Example
- The client creates a new file, “/atlas/myfile”
- The oss selects a suitable partition
  - It searches for space in /mnt1 and /mnt2 using LRU order
- It creates a null file in the selected partition
  - Let’s call it /mnt1/public/00/file0001
- It creates two symlinks
  - /atlas/myfile -> /mnt1/public/00/file0001
  - /mnt1/public/00/file0001.pfn -> /atlas/myfile
- The client can then write the data
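
A sketch of the resulting on-disk layout after the steps above; the subdirectory and file names (public/00/file0001) are the illustrative ones from the slide.

    /atlas/myfile                -> /mnt1/public/00/file0001   # exported name, points at the data file
    /mnt1/public/00/file0001                                   # the actual data file (initially empty)
    /mnt1/public/00/file0001.pfn -> /atlas/myfile              # back-pointer recording the logical name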

Slide 27: Why Use The oss CacheFS?
- There is no need if you can have one file system
  - Use the OS volume manager if you have one and you are not worried about large logical partitions or fsck time
- However, we use the CacheFS to support SRM space tokens
  - Done by mapping tokens to virtual or physical partitions; the oss supports both

Slide 28: SRM Static Space Token Refresher
- Encapsulates fixed space characteristics
  - Type of space (e.g., permanence, performance, etc.)
  - Implies a specific quota
  - Uses a particular arbitrary name (e.g., atlasdatadisk, atlasmcdisk, atlasuserdisk, etc.)
- Typically used to create new files
- Think of it as a space profile

Slide 29: Partitions as a Space Token Paradigm
- Disk partitions map well to SRM space tokens
  - A set of partitions embodies a set of space attributes (performance, quota, etc.)
  - A static space token defines a set of space attributes
  - Partitions and static space tokens are interchangeable
- We take the obvious step
  - Use oss CacheFS partitions for SRM space tokens
  - Simply map space tokens onto a set of partitions
  - The oss CacheFS supports real and virtual partitions, so you really don’t need physical partitions here

Slide 30: Virtual vs. Real Partitions
- A simple two-step process
  - Define your real partitions (one or more); these are file system mount-points
  - Map virtual partitions on top of the real ones; virtual partitions can share real partitions
- By convention, virtual partition names equal static token names
  - Yields implicit SRM space token support
- Example (virtual partition name, then real partition mount; the first two virtual partitions share the same physical partition):
    oss.cache atlasdatadisk /store1 xa
    oss.cache atlasmcdisk /store1 xa
    oss.cache atlasuserdisk /store2 xa

Slide 31: Space Tokens vs. Virtual Partitions
- Partitions are selected by virtual partition name
- Configuration file:
    oss.cache atlasdatadisk /store1 xa
    oss.cache atlasmcdisk /store1 xa
    oss.cache atlasuserdisk /store2 xa
- New files are “cgi-tagged” with the space token name
  - root://host:1094//atlas/mcdatafile?cgroup=atlasmcdisk
  - The default is “public”
- Since space token names equal virtual partition names, the file will be allocated in the desired real/virtual partition

Slide 32: Virtual vs. Real Partitions
- Non-overlapping virtual partitions (R = V)
  - A real partition represents a hard quota
  - Implies a space token gets a fixed amount of space
- Overlapping virtual partitions (R ≠ V)
  - The hard quota applies to multiple virtual partitions
  - Implies a space token gets an undetermined amount of space
  - Needs usage tracking and external quota management

Slide 33: Partition Usage Tracking
- The oss tracks usage by partition
  - Automatic for real partitions
  - Configurable for virtual partitions: oss.usage {nolog | log dirpath}
- Since virtual partitions correspond to SRM space tokens, usage is also automatically tracked by space token
- POSIX getxattr() returns usage information
  - See the Linux man page
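
A sketch of enabling usage logging for virtual partitions using the directive form shown above; the log directory is hypothetical and must exist.

    # Track usage for virtual partitions (hypothetical directory)
    oss.usage log /var/adm/xrootd/usagelogs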

Slide 34: Partition Quota Management
- Quotas are applied by partition
  - Automatic for real partitions
  - Must be enabled for virtual partitions: oss.usage quotafile filepath
- Currently, quotas are not enforced by the oss
  - POSIX getxattr() returns quota information
  - Used by FUSE/xrootdFS to enforce quotas
  - Required to run a full-featured SRM

Slide 35: The Quota File
- Lists the quota for each virtual partition
  - Hence, also a quota for each static space token
- Simple multi-line format: vpname nnnn[k | m | g | t]\n
  - vpname’s are in 1-to-1 correspondence with space token names
- The oss re-reads it whenever it changes
- Useful only for FUSE/xrootdFS
  - Quotas need to apply to the whole cluster
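
A sketch tying the directive on slide 34 to the file format above. The file path and the quota values are illustrative; the virtual partition names are the ones used earlier in these slides.

    # In the configuration file (hypothetical path):
    oss.usage quotafile /var/adm/xrootd/quotas

The quota file itself would then contain one "vpname nnnn[k|m|g|t]" line per virtual partition, for example:

    atlasdatadisk 20t
    atlasmcdisk 10t
    atlasuserdisk 5t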

Slide 36: Considerations
- Files cannot easily be reassigned space tokens
  - You must manually “move” the file across partitions
  - You can always get the original space token name via a file-specific getxattr() call
- Quotas for virtual partitions are “soft”
  - Time causality prevents a real hard limit
  - Use real partitions if a hard limit is needed

Slide 37: SRM & Scalla: The Big Issue
- Scalla implements a distributed name space
  - Very scalable and efficient
  - Sufficient for data analysis
- SRM needs a single view of the complete name space
- This requires deploying additional components
  - Composite Name Space Daemon (cnsd): provides the complete name space
  - FUSE/xrootdFS: provides the single view via a file system interface
    - Compatible with all stand-alone SRM’s (e.g., BestMan & StoRM)

Slide 38: The Composite Name Space
- A new xrootd instance is used to maintain the complete name space for the cluster
  - It only holds the full paths & file sizes, no more
  - It normally runs on one of the manager nodes
- The cnsd needs to run on all the server nodes
  - Captures xrootd name space requests (e.g., rm)
  - Re-issues the request to the new xrootd instance
- This is the cluster’s composite name space
  - Composite because each server node adds to the name space
  - There is no pre-registration of names; it all happens on-the-fly

Slide 39: Composite Name Space Implemented (diagram)
The diagram shows the redirector (manager), the name space instance, and the data servers, each data server running a cnsd. The configuration lines shown are:
    ofs.forward 3way myhost:2094 mkdir mv rm rmdir trunc
    ofs.notify closew create |/opt/xrootd/bin/cnsd
    xrootd.redirect myhost:2094 dirlist
Name space requests (create/trunc, mkdir, mv, rm, rmdir) are forwarded from the data servers; a client’s opendir() refers to the directory structure maintained at myhost:2094 (not needed on the redirector because it already has access).
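
A consolidated sketch of those directives with comments on where each belongs, as I read the diagram; myhost:2094 stands for the name space xrootd instance and /opt/xrootd/bin/cnsd is the path shown on the slide.

    # On each data server node:
    ofs.notify closew create |/opt/xrootd/bin/cnsd        # hand name space events to the local cnsd
    ofs.forward 3way myhost:2094 mkdir mv rm rmdir trunc  # forward these requests to the name space host
    # On the redirector node:
    xrootd.redirect myhost:2094 dirlist                   # directory listings come from the composite name space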

Slide 40: Some Caveats
- The name space is reasonably accurate
  - Usually sufficient for SRM operations
  - cnsd’s do log events to circumvent transient failures
  - The log is replayed when the name space xrootd recovers
  - But the log is not infinite; invariably inconsistencies will arise
- The composite name space can be audited
  - This means comparing and resolving multiple name spaces
  - Time consuming in terms of elapsed time, but it can happen while the system is running
  - Tools to do this are still under development; consider contributing such software

Slide 41: The Single View
- Now that there is a composite cluster name space, we need an SRM-compatible view
- The easiest way is to use a file system view
  - BestMan and StoRM actually expect this
- The additional component is FUSE

Slide 42: What is FUSE?
- Filesystem in Userspace
  - Implements a file system as a user space program
  - Linux 2.4 and 2.6 only
  - Refer to the FUSE documentation
- Can use FUSE to provide xrootd access
  - Looks like a mounted file system; we call it xrootdFS
- Two versions currently exist
  - Wei Yang at SLAC (packaged with VDT)
  - Andreas Peters at CERN (packaged with Castor)

Slide 43: xrootdFS (Linux/FUSE/xrootd) (diagram)
On the client host, the SRM uses the POSIX file system interface; requests pass through the kernel to FUSE and the FUSE/Xroot interface in user space, which acts as an xrootd POSIX client. Name space operations (opendir, create, mkdir, mv, rm, rmdir) are directed at the name space instance (xrootd:2094), while file access goes through the redirector (xrootd:1094) on the redirector host. You should still run cnsd on the servers to capture non-FUSE events.

Slide 44: SLAC xrootdFS Performance
- Sun V20z: RHEL4, 2x 2.2GHz AMD Opteron, 4GB RAM, 1Gbit/sec Ethernet
- Client, VA Linux 1220: RHEL3, 2x 866MHz Pentium 3, 1GB RAM, 100Mbit/sec Ethernet
- Unix dd, globus-url-copy & uberftp: 5-7MB/sec with a 128KB I/O block size
- Unix cp: 0.9MB/sec with a 4KB I/O block size
- Conclusion: do not use it for data transfers!

Slide 45: More Caveats
- FUSE must be administratively installed
  - Requires root access
  - Difficult if many machines are involved (e.g., batch workers)
  - Easier if it only involves an SE node (i.e., the SRM gateway)
- Performance is limited
  - Kernel-FUSE interactions are not cheap
  - CERN’s modified FUSE shows very good transfer performance
  - Rapid file creation (e.g., tar) is limited
- Recommend that it be kept away from general users

Slide 46: Putting It All Together (diagram)
A basic xrootd cluster (data server nodes and a manager node, each running xrootd and cmsd) + the name space xrootd + cnsd on the data servers + an SRM node (BestMan, xrootdFS, gridFTP) = LHC Grid access.

Slide 47: Acknowledgements
- Software contributors
  - CERN: Derek Feichtinger, Fabrizio Furano, Andreas Peters
  - Fermi: Tony Johnson (Java)
  - Root: Gerri Ganis, Bertrand Bellenot
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger
- Operational collaborators: BNL, INFN, IN2P3
- Partial funding: US Department of Energy Contract DE-AC02-76SF00515 with Stanford University