Shared Access to Experiments’ Software

Shared Access to Experiments’ Software: Problems and Solutions. Vladimir Sapunenko (INFN-CNAF), Alessandro Brunengo (INFN Genova). WS CCR-GRID, Palau, 06/12/2018

What is the “software area”? A shared file system dedicated to the distribution of experiment-specific software. Main requirements:
- Managed by the VOs
- Shared among all UIs and WNs
- Fast and reliable

Experiments’ software: typical numbers
Tier1 site:
- ~10 different experiments
- ~5-7 releases per experiment
- ~200K-2M files per experiment
- ~200 GB of disk space per experiment
- Frequent updates (~weekly)
- Need to distribute to ~1000 WNs
Tier2 site:
- 1-2 experiments
- ~3 releases per experiment
- ~100K-500K files per experiment
- Need to distribute to ~100 WNs

A bit of history (CNAF)
Pre-GPFS era: 4 stand-alone NAS boxes, each one serving a group of VOs, exported via automount. Problems:
- Management and control of 4 independent systems
- Downtime of one server knocks out several VOs
- Frequent “Stale NFS file handle” errors
GPFS era: let’s put everything on GPFS!
First round: 2 disk servers + 2x2 TB of SAN-attached disk, in a fully redundant configuration.

[Diagram: first GPFS setup — two disk servers attached to a SAN with two 2 TB disk arrays]

Several problems emerged when massive production started:
- Slow response to “ls” from the UIs
- Job initialization takes too long
- Access to all file systems gets blocked
What’s going on? From the "mmfsadm dump waiters" output it seems we have a deadlock in the distributed token manager:
   0x8B8E938 waiting 3928.850049000 seconds, SharedHashTab fetch handler: on ThCond 0x90A6A184 (0xF926A184) (LkObj), reason 'change_lock_shark waiting to set acquirePending flag'
   0x8C23C10 waiting 9222.268972000 seconds, MMFS group recovery phase 3: on ThCond 0x8C36BCC (0x8C36BCC) (MsgRecord), reason 'waiting for RPC replies' for sgmMsgExeTMPhase
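A quick way to spot such stuck threads is to sort the waiters by wait time; a minimal sketch, assuming the standard GPFS install path and root privileges:

    # show the longest-waiting GPFS threads first (field 3 is the wait time in seconds)
    /usr/lpp/mmfs/bin/mmfsadm dump waiters | sort -k3 -rn | head -20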

Why does it happen? Reasons:
- Applications access too many files
- The cache size is too low
- Some of the frequently accessed files are larger than 1 MB
Can we overcome this problem? Yes! We just need to find the correct set of tuning parameters.

Use of the “stat” system call. From an strace of a typical Atlas job we saw a lot of stat system calls: every job checks the whole directory structure of the shared area (which would be meaningful only if that area differed from node to node). This fact was also noted by GPFS support; here are their comments:
“The stat command is file system extensive and causes GPFS to read up the entire directory structure to the top.”
“If you have all client nodes executing this all the time, then of course we will have token contention, an application such as this is just not correct programming.”
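To quantify this behaviour, the stat traffic of a single job can be traced; a sketch, where the job wrapper run_job.sh and the shared-area path /opt/exp_software are hypothetical and must be adapted to the local setup:

    # summary count of stat/lstat calls made by the job and its children
    strace -f -c -e trace=stat,lstat ./run_job.sh
    # which shared-area paths are stat'ed most often
    strace -f -e trace=stat,lstat ./run_job.sh 2>&1 | grep '/opt/exp_software' | sort | uniq -c | sort -rn | head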

Debug session: run /usr/lpp/mmfs/bin/mmfsadm dump fs and look for a line like this:
   OpenFile counts: total created 1226 (in use 1219, free 7)
     cached 1219, currently open 784+87, cache limit 200 (min 10, max 200), eff limit 881
In this specific case the number of files in cache is 1219, of which 784+87 are effectively open. The limit is 200, the effective limit 881 (784+87+10). Files in use cannot be evicted from the cache. The “garbage collection” (cache cleaning) is executed asynchronously by the gpfsSwapdKproc daemon: if this daemon consumes CPU, it means it is continuously working to evict data from the cache.
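A minimal way to keep an eye on this on a worker node (again assuming the standard GPFS install path and root privileges):

    # show the OpenFile cache counters
    /usr/lpp/mmfs/bin/mmfsadm dump fs | grep -A1 'OpenFile counts'
    # check whether the cache-cleaning kernel thread is burning CPU
    top -b -n 1 | grep gpfsSwapdKproc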

How GPFS caching works
Random reads: the blocks of the file are kept in the pagepool using an LRU (Least Recently Used) algorithm.
Sequential access: used buffers are flushed out of the pagepool as soon as they have been used, to prevent large files from completely wiping out any cached random or small files.
Parameters:
- seqDiscardThreshhold = 1M (default) controls the size of the “small files” that are left in the pagepool; blocks of files larger than this threshold are not kept in the pagepool after use.
- writeBehindThreshhold = 512K plays the same role for sequential writing.
If you want to keep all sequentially read or written pages in the pagepool, set these thresholds to a very large size:
   mmchconfig seqDiscardThreshhold=999G,writeBehindThreshhold=999G -i
Then, if you want to cache an entire file, do one sequential read of it, and all random access after that will use the cached blocks (assuming it all fits in the pagepool).
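As an illustration of that last point, a sketch of pinning one large file in the cache (parameter names as given above; the file path is purely hypothetical):

    # keep sequentially accessed blocks in the pagepool as well (-i applies the change immediately)
    mmchconfig seqDiscardThreshhold=999G,writeBehindThreshhold=999G -i
    # pre-warm the cache with one sequential read of the file
    dd if=/gpfs/sw_area/some_large_file of=/dev/null bs=1M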

GPFS tuning parameters
- pagepool: portion of memory (pinned!) dedicated to GPFS operations; on a GPFS client this is the maximum cache size. Default 64M; can be changed on the fly with “mmchconfig pagepool=nM -i”.
- maxFilesToCache: number of files to keep in cache; default 1000.
- maxStatCache: number of inodes to keep in cache; default 4*maxFilesToCache = 4000.
- Number of token managers: each manager node can handle 600K tokens on a 64-bit kernel (300K on 32-bit), using at most 512M of memory from the pagepool. You need
   #managers * 600K > #nodes * (maxFilesToCache + maxStatCache)
  and enough managers that taking one of them down for maintenance does not overflow the others. You can also increase the memory for tokens on the existing manager nodes:
   mmchconfig tokenMemLimit=nG -N managernodes
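Putting these parameters together, a sketch of the corresponding commands (the node class name worker_nodes is hypothetical; note that, unlike pagepool, some of these parameters may only take effect after a restart of the GPFS daemon):

    # client (WN) side: larger pinned cache and more cached files/inodes
    mmchconfig pagepool=4G,maxFilesToCache=5000,maxStatCache=20000 -N worker_nodes
    # manager side: more memory for token state
    mmchconfig tokenMemLimit=4G -N managernodes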

Tuning session
1. Define maxFilesToCache: how many files need to be cached? lsof or (even better) strace can give you an idea. E.g. an Atlas job opens ~1800 files, CMS ~600, CDF ~100; up to 5 jobs can run on the same node, so we want to keep ~5000 files in cache.
2. Adjust the number of managers: how many nodes can be dedicated to this task? E.g. I have no more than 4 (64-bit) servers for this, so
   #managers * 600K > #nodes * (maxFilesToCache + maxStatCache)
   4*600K = 2400K < 600*(5K + 4*5K) = 15000K !
   I need more manager nodes, or...
3. Tune the managers: increase (if needed) the memory dedicated to token management. Here I need to increase the space for tokens on the managers by a factor of 8:
   mmchconfig tokenMemLimit=4G -N managernodes
(Both the file-count estimate and the capacity check are sketched below.)
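A minimal sketch of both steps; the job wrapper name is hypothetical, and the numbers are the example values from this slide:

    # step 1: estimate how many distinct files one job opens
    strace -f -e trace=open -o /tmp/job.trace ./run_job.sh
    grep -o '"[^"]*"' /tmp/job.trace | sort -u | wc -l

    # step 2: does the token capacity of the managers cover the client caches?
    NODES=600; MFTC=5000; MSC=$((4*MFTC))        # maxStatCache = 4*maxFilesToCache
    MANAGERS=4; TOKENS_PER_MGR=600000            # 64-bit kernel, default token memory
    NEEDED=$((NODES*(MFTC+MSC))); AVAIL=$((MANAGERS*TOKENS_PER_MGR))
    echo "needed=$NEEDED available=$AVAIL"
    [ "$AVAIL" -gt "$NEEDED" ] || echo "add manager nodes or raise tokenMemLimit"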

Other considerations
- With a large number of nodes working against a few NSD server nodes, it is a good idea not to use the NSD servers as manager or quorum nodes: the big data blocks being transferred interfere on the network with all the small token and lease-renewal requests.
- Be sure to tune TCP for large window sizes for data block transfers, on both the NSD servers and the client nodes: set the TCP send/receive buffers larger than the largest filesystem blocksize (example settings are sketched below).
- Be sure jumbo frames are used on the Ethernet (not really important in this context).
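A sketch of the kind of TCP settings meant here; the values are purely illustrative and should be sized against the largest filesystem blocksize and the available memory:

    # allow large TCP windows (8 MB in this example)
    sysctl -w net.core.rmem_max=8388608
    sysctl -w net.core.wmem_max=8388608
    sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
    sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
    # jumbo frames on the data interface (the interface name is hypothetical)
    ifconfig eth1 mtu 9000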

GPFS or not GPFS?
- GPFS seems too complicated to use just for software distribution.
- Lock/token management in GPFS is oriented towards sharing files in read/write, and is stricter than needed in this case.
- Even if the pagepool is big enough, GPFS cannot keep more than ~5*maxFilesToCache files in cache.
- NFS can use all available memory for caching and seems to have no limit on the number of files.

Checking if it works... Tests were performed on different configurations:
- GPFS access to a file system with data and metadata on the same NSDs (RAID6 on SATA disks)
- GPFS access to a file system with metadata on dedicated NSDs (RAID6 on SAS disks):
  - with the standard cache configuration on the WN
  - with increased pagepool (4 GB) and maxFilesToCache (50000)
- NFS access to a GPFS file system, with the NFS server configured with increased pagepool and maxFilesToCache

Job description
- Standard test for the reconstruction of the Atlas trigger information
- The job initializes and runs the trigger algorithms and some offline algorithms
- For the measurements, the test execution is limited to a small number of events
Thanks to Carlo Schiavi for the collaboration.

File size distribution (chart); the files accessed were identified using strace.

Execution time and efficiency (chart).

Concurrent runs (chart).

With GPFS access, ~60% of the CPU is used by the mmfsd process: the user process waits for I/O from GPFS and does not use the CPU (~30% idle).

With NFS access, the CPU is fully available to the user process: mmfsd does not use the CPU at all.

Adopted solution for the Tier1: C-NFS
- HA Clustered NFS (C-NFS) built on top of GPFS: GPFS provides the high-performance, fault-tolerant file system; C-NFS adds an HA layer using DNS load balancing, virtual IP migration and recovery of outstanding locks in case of a single server failure.
- Mounted read-only on all WNs, with NFS mount parameters (ro,nosuid,soft,intr,rsize=32768,wsize=32768); an example mount command is sketched below.
- Separate cluster for software builds: file system mounted read/write using GPFS, 6 nodes (32-bit and 64-bit), separate LSF queues (based on user role).
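For illustration, the mount on a WN would look roughly like this (the DNS-balanced C-NFS name and the paths are hypothetical):

    mount -t nfs -o ro,nosuid,soft,intr,rsize=32768,wsize=32768 \
        cnfs-sw.example.org:/gpfs/exp_software /opt/exp_software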

[Diagram: the WNs mount a single DNS-balanced C-NFS name, which resolves to virtual IPs (IP1-IP4) on the GPFS/NFS servers; the servers access the 2x2 TB disks over the SAN.]

Network traffic, NFS vs. GPFS (chart): in week 11 we switched back to GPFS for a few days; with C-NFS the network traffic is at its normal level.

Other solutions?
AFS:
- Pros: caching on local disk; a very good solution where AFS is already deployed (in use at Trieste and some other sites)
- Cons: needs authentication; difficult to deploy
FS-Cache:
- FS-Cache is a kernel facility by which a network file system or other service can cache data locally, trading disk space for performance improvements when accessing slow networks
- There is a mailing list for FS-Cache-specific discussions: mailto:linux-cachefs@redhat.com
- Great as an idea, but not in the mainline kernel yet (expected to arrive with kernel 2.6.30)
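For reference, on a kernel where FS-Cache and its cachefiles backend are available, enabling local caching of an NFS mount is roughly a matter of running the cachefilesd daemon and adding the fsc mount option (server name and paths are hypothetical):

    # start the local cache daemon, then mount the software area with caching enabled
    service cachefilesd start
    mount -t nfs -o ro,fsc sw-server.example.org:/export/sw /opt/exp_software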