Shared Access to Experiments' Software: Problems and Solutions
Vladimir Sapunenko, INFN-CNAF
Alessandro Brunengo, INFN, Genova
06/12/2018, WS CCR-GRID, Palau
What is the "software area"?
- A shared file system dedicated to the distribution of experiment-specific software
- Main requirements:
  - Managed by the VOs
  - Shared among all UIs and WNs
  - Fast and reliable
Experiments' software: typical numbers
- Tier1 site:
  - ~10 different experiments
  - ~5-7 releases per experiment
  - ~200K-2M files per experiment
  - ~200 GB of disk space per experiment
  - Frequent updates (~weekly)
  - Needs to be distributed to ~1000 WNs
- Tier2 site:
  - 1-2 different experiments
  - ~3 releases per experiment
  - ~100K-500K files per experiment
  - Needs to be distributed to ~100 WNs
A bit of history (CNAF)
- Pre-GPFS era: 4 stand-alone NAS boxes, each one serving a group of VOs
  - automount problems
  - Management and control of 4 independent systems
  - Downtime of one server kicks out several VOs
  - Frequent "Stale NFS file handle" errors
- GPFS era: let's put everything on GPFS!
  - First round: 2 disk servers + 2x2 TB of SAN-attached disk
  - Fully redundant configuration!
[Diagram: redundant configuration with two disk servers and 2x2 TB SAN-attached disks]
Several problems emerged when massive production started
- Slow response to "ls" from the UI
- Job initialization takes too long
- Access to all file systems blocked
- What's going on???
- From "mmfsadm dump waiters" output it seems we have a deadlock in the distributed token manager:
  0x8B8E938 waiting 3928.850049000 seconds, SharedHashTab fetch handler: on ThCond 0x90A6A184 (0xF926A184) (LkObj), reason 'change_lock_shark waiting to set acquirePending flag'
  0x8C23C10 waiting 9222.268972000 seconds, MMFS group recovery phase 3: on ThCond 0x8C36BCC (0x8C36BCC) (MsgRecord), reason 'waiting for RPC replies' for sgmMsgExeTMPhase
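As a rough sketch (not from the slides), long waiters like the ones above can be spotted from the shell; the 300-second threshold and the field positions are assumptions based on the output format shown above:

    # Flag waiters longer than 300 s on the local node (requires root and a running mmfsd).
    # Fields assume the "0x... waiting <seconds> seconds, ..." format shown above.
    /usr/lpp/mmfs/bin/mmfsadm dump waiters | awk '$2 == "waiting" && $3 + 0 > 300'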
Why does it happen?
- Reasons:
  - Applications access too many files
  - The cache size is too low
  - Some of the frequently accessed files are > 1 MB in size
- Can we overcome this problem? Yes! We just need to find the correct set of tuning parameters
Use of the "stat" system call
- From an strace of a typical Atlas job we saw a lot of stat system calls (see the sketch below for how to measure this)
  - Every job checks the whole directory structure of the shared area (which would be meaningful only if that area differed from node to node)
- This was also noted by GPFS support. Their comments:
  - "The stat command is file system extensive and causes GPFS to read up the entire directory structure to the top."
  - "If you have all client nodes executing this all the time, then of course we will have token contention, an application such as this is just not correct programming."
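A minimal sketch of how such behaviour can be measured; the job wrapper name is a placeholder and the relevant syscall names may differ between platforms:

    # Summarize the syscalls issued by a job and its children; "job.sh" is a placeholder.
    # "strace -c" prints per-syscall counts on exit, so a stat storm becomes visible.
    strace -f -c -e trace=stat,lstat,open ./job.sh 2> syscall-summary.txt
    grep -E 'stat|open' syscall-summary.txt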
Debug session
- Run /usr/lpp/mmfs/bin/mmfsadm dump fs and look for a line like this:
  OpenFile counts: total created 1226 (in use 1219, free 7) cached 1219, currently open 784+87, cache limit 200 (min 10, max 200), eff limit 881
- In this specific case, the number of files in cache is 1219, of which 784+87 are effectively open. The limit is 200, the effective limit 881 (784+87+10). In fact, files in use cannot be evicted from the cache.
- The "garbage collection" (cache cleaning) is executed asynchronously by the gpfsSwapdKproc daemon. If this daemon consumes CPU, it is continuously working to evict data from the cache.
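A quick way to pull just those statistics (the exact wording of the line may differ between GPFS releases):

    # Extract the OpenFile cache statistics from the file system dump.
    /usr/lpp/mmfs/bin/mmfsadm dump fs | grep 'OpenFile counts'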
How GPFS caching works
- Random reads: the blocks of the file are kept in the pagepool using an LRU (Least Recently Used) algorithm.
- Sequential access: used buffers are flushed out of the pagepool as soon as they are used, to prevent large files from completely wiping out any cached random or small files.
- Parameters:
  - seqDiscardThreshhold = 1M (default) controls the size of the "small files" that are left in the pagepool. Blocks of files larger than this threshold will not be kept in the pagepool after use.
  - writeBehindThreshhold = 512K, for sequential writing.
- If you want to keep all sequentially read or written pages in the pagepool, set these thresholds to a very large size:
  mmchconfig seqDiscardThreshhold=999G,writeBehindThreshhold=999G -i
- Then, if you want to cache an entire file, do one sequential read of it; all random access after that will use the cached blocks (assuming it all fits in the pagepool).
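A hedged sketch of checking the pagepool and pre-warming the cache with a sequential read, as suggested above; the file path is a placeholder:

    # Warm the pagepool with one sequential read; subsequent random access then hits the
    # cached blocks, provided the file fits in the pagepool ("/sw/atlas/bigfile" is a placeholder).
    dd if=/sw/atlas/bigfile of=/dev/null bs=1M
    # Current pagepool size on this node (the thresholds above may not be listed by default).
    mmlsconfig pagepool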
GPFS tuning parameters
- pagepool: portion of memory (pinned!) dedicated to GPFS operations
  - Can be changed on the fly with "mmchconfig pagepool=nM -i"
  - Default 64M; on a GPFS client this is the maximum cache size
- maxFilesToCache (number of files to keep in cache), default = 1000
- maxStatCache (number of inodes to keep in cache), default = 4*maxFilesToCache = 4000
- Number of token managers:
  - Each manager node can handle 600K tokens on a 64-bit kernel (300K on 32-bit), using at most 512M of memory from the pagepool
  - #managers * 600K > #nodes * (maxFilesToCache + maxStatCache)
  - Have enough of them so that taking one manager node down for maintenance does not overflow the others
  - You can increase the memory available for tokens on the existing manager nodes:
    mmchconfig tokenMemLimit=nG -N managernodes
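A sketch of the corresponding commands; the values are examples, the node class name "managernodes" is taken from the slide above, and some of these parameters only take effect after restarting GPFS on the affected nodes:

    mmchconfig pagepool=2G -i                    # pinned cache memory; -i applies it immediately
    mmchconfig maxFilesToCache=5000              # full file state kept in cache
    mmchconfig maxStatCache=20000                # additional stat-only entries
    mmchconfig tokenMemLimit=4G -N managernodes  # token memory on the manager nodes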
Tuning session
- Define maxFilesToCache (MFTC): how many files need to be cached?
  - lsof or (even better) strace can give you an idea
  - Example: an Atlas job opens 1800 files, CMS 600, CDF 100
  - Up to 5 jobs can run on the same node
  - We want to keep ~5000 files in cache
- Adjust the number of managers: how many nodes can be dedicated to this task? (the check is worked out in the sketch below)
  - Example: I don't have more than 4 (64-bit) servers for this
  - #managers * 600K > #nodes * (maxFilesToCache + maxStatCache)
  - 4*600K = 2400K < 600*(5K + 4*5K) = 15000K !
  - I need more manager nodes, or...
- Tune the managers: increase (if needed) the memory dedicated to token management
  - I need to increase the space for tokens on the managers by a factor of 8:
    mmchconfig tokenMemLimit=4G -N managernodes
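The sizing check above, written out as a small shell calculation (numbers from the Atlas example; adjust to your own site):

    NODES=600; MFTC=5000; MSC=$((4 * MFTC))   # maxStatCache = 4 * maxFilesToCache
    MANAGERS=4; TOKENS_PER_MANAGER=600000     # 64-bit manager node, default token memory
    NEEDED=$((NODES * (MFTC + MSC)))          # tokens the clients may ask for
    AVAILABLE=$((MANAGERS * TOKENS_PER_MANAGER))
    echo "needed=$NEEDED available=$AVAILABLE"
    # needed=15000000, available=2400000 -> add managers or raise tokenMemLimit (here by x8)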
Other considerations
- With a large number of nodes working against a few NSD server nodes, it is a good idea not to use the NSD servers as manager or quorum nodes: the big blocks being transferred interfere on the network with all the small token and lease-renewal requests.
- Be sure to tune TCP to have large window sizes for data block transfers on both NSD servers and client nodes. Set the TCP send/receive buffers larger than the largest file system block size (example settings below).
- Be sure Jumbo Frames are used on the Ethernet (not really important in this context).
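A hedged example of the TCP buffer tuning; the values are illustrative (sized here for a 4 MB block size) and should be adapted to the actual network and kernel:

    # Raise the socket buffer limits above the largest file system block size.
    sysctl -w net.core.rmem_max=8388608
    sysctl -w net.core.wmem_max=8388608
    sysctl -w net.ipv4.tcp_rmem="4096 262144 8388608"
    sysctl -w net.ipv4.tcp_wmem="4096 262144 8388608"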
GPFS or not GPFS?
- It seems too complicated to use for software distribution
- Lock/token management in GPFS is oriented towards sharing files in read/write mode and is stricter than needed in this case
- Even if the pagepool is big enough, it cannot keep more than 5*maxFilesToCache files (maxFilesToCache + maxStatCache)
- NFS can use all available memory for caching and seems to have no limit on the number of files
Checking if it works...
- Tests performed on different configurations:
  - GPFS access to a file system with data and metadata on the same NSDs (RAID6 on SATA disks)
  - GPFS access to a file system with metadata on dedicated NSDs (RAID6 on SAS disks)
    - with the standard cache configuration on the WN
    - with increased pagepool (4 GB) and maxFilesToCache (50000), as sketched below
  - NFS access to a GPFS file system
    - the NFS server is configured with increased pagepool and maxFilesToCache
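For reference, a sketch of how the "increased cache" WN configuration could be applied; the node name is a placeholder, and maxFilesToCache typically takes effect only after restarting GPFS on the node:

    mmchconfig pagepool=4G,maxFilesToCache=50000 -N test-wn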
Job description
- Standard test for the reconstruction of Atlas trigger information
- The job initializes and runs the trigger algorithms and some offline algorithms
- The test execution for the measurement is limited to a small number of events
- Thanks to Carlo Schiavi for the collaboration
File size distribution
- The accessed files were identified using strace
Execution time and efficiency
Concurrent runs
CPU in GPFS access: 60% is used by the mmfsd process; the user process waits for I/O from GPFS and does not use the CPU (~30% idle)
CPU in NFS access: fully available to the user process; mmfsd does not use the CPU at all
Adopted solution for Tier1: C-NFS
- HA Clustered NFS based on GPFS
  - GPFS provides a high-performance and fault-tolerant file system
  - C-NFS adds an HA layer using DNS load balancing, virtual IP migration and recovery of outstanding locks in case of a single server failure
- Read-only on all WNs
  - NFS mount parameters (example mount command below): ro,nosuid,soft,intr,rsize=32768,wsize=32768
- Separate cluster for software builds:
  - FS mounted in read/write using GPFS
  - 6 nodes (32-bit and 64-bit)
  - Separate queues in LSF (based on user role)
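For illustration, the read-only WN mount with these parameters would look roughly like this (server name and paths are placeholders, not the real CNAF ones):

    mount -t nfs -o ro,nosuid,soft,intr,rsize=32768,wsize=32768 \
          cnfs.example.infn.it:/gpfs/exp_software /opt/exp_software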
[Diagram: WNs mount a DNS-balanced C-NFS name resolving to virtual IPs (IP1-IP4) on the GPFS servers, which access the 2x2 TB SAN-attached disks]
Network traffic: NFS vs. GPFS
- In week 11 we switched back to GPFS for a few days
- Normal network traffic with C-NFS
Other solutions?
- AFS:
  - Pros: caching on local disk
  - Cons: needs authentication; difficult to deploy
  - A very good solution where AFS is already deployed; in use at Trieste and some other sites
- FS-Cache:
  - FS-Cache is a kernel facility by which a network file system or other service can cache data locally, trading disk space for performance improvements when accessing slow networks
  - There is a mailing list for FS-Cache-specific discussions: linux-cachefs@redhat.com
  - Great as an idea, but not in the mainstream kernel yet (expected to come with kernel v2.6.30)
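A hedged sketch of what FS-Cache usage would look like once it is available: start the local cache daemon, then mount NFS with the "fsc" option (server name and paths are placeholders):

    # cachefilesd manages the on-disk cache; "fsc" tells the NFS client to use FS-Cache.
    service cachefilesd start
    mount -t nfs -o ro,fsc,rsize=32768,wsize=32768 \
          cnfs.example.infn.it:/gpfs/exp_software /opt/exp_software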