1
GPFS: short overview of architecture, installation, configuration and troubleshooting Vladimir Sapunenko INFN-CNAF
2
What is GPFS? GPFS= General Parallel File System
IBM's shared-disk, parallel cluster file system.
Shared disks – switching fabric – I/O nodes
Cluster: many nodes (tested), fast, reliable communication, common admin domain
Shared disk: all data and metadata reside on disks accessible from any node through a disk I/O interface (i.e., "any-to-any" connectivity)
Parallel: data and metadata flow between all of the nodes and all of the disks in parallel
RAS: reliability, availability, serviceability
General: supports a wide range of HPC application needs over a wide range of configurations
3
Cluster configuration
Shared-disk cluster (the most basic environment)
SAN storage attached to all nodes in the cluster via Fibre Channel (FC)
All nodes interconnected via LAN
Data flows via FC
Control information is transmitted via Ethernet
4
Network based block I/O
Block-level I/O interface over the network – Network Shared Disk (NSD)
GPFS transparently handles I/O whether NSD or direct attachment is used
Intra-cluster communication can be separated onto dedicated interfaces
5
GPFS on DAS (Directly Attached Storage)
The most economical solution, BUT there are some drawbacks:
To protect data from a single-server failure, data and metadata must be replicated -> usable space is reduced by half
Bus-to-disk bandwidth is only ~320 MB/s (2.5 Gb/s)
6
GPFS on SAN (Storage Area Network)
The most natural way to use GPFS
All servers access all disks
Failure of a single server only reduces the available bandwidth to storage to (N-1)/N (N = number of disk servers)
Bandwidth to the disks can easily be increased to 8 Gb/s (with 2 dual-channel FC2 HBAs or 1 dual-channel FC4 HBA)
7
Clustered NFS
GPFS version 3.2 offers Clustered NFS, a set of features and tools that support a high-availability solution for file systems exported over NFS:
Some or all of the nodes in the GPFS cluster can export the same file systems to NFS clients (supported on Linux systems only).
The Clustered NFS feature includes:
Monitoring: every node in the NFS cluster runs an NFS monitoring utility that monitors GPFS, the NFS server and the networking components on the node. After detecting a failure, the monitoring utility may invoke a failover.
Failover: the automatic failover procedure transfers the NFS serving load from the failing node to another node in the NFS cluster. The failure is managed by the GPFS cluster, including NFS server IP address failover and file system lock and state recovery.
Load balancing: load balancing is IP-address based. The IP address is the load unit that can be moved from one node to another for failover or load-balancing needs. This solution supports failover of all of a node's load as one unit to another node. However, if no locks are outstanding, individual IP addresses can be moved to other nodes for load-balancing purposes.
8
Known GPFS limitations
Number of file systems < 256
Number of storage pools < 9
Number of cluster nodes < 4096 (2441 tested)
Single disk (LUN) size: now limited only by the OS and the disk itself
File system size < 2^99 bytes (2 PB tested)
Number of files < 2*10^9
Does not support the Red Hat EL 4.0 uniprocessor (UP) kernel
Does not support the RHEL 3.0 and RHEL 4.0 hugemem kernels
Although GPFS is a POSIX-compliant file system, some exceptions apply:
Memory-mapped files are not supported in this release.
stat() is not fully supported: mtime, atime and ctime returned from the stat() system call may be updated slowly if the file has recently been updated on another node.
9
Where to Find Things in GPFS
Some useful GPFS directories:
/usr/lpp/mmfs
  /bin ... commands (binaries and scripts); most GPFS commands begin with "mm"
  /gpfsdocs ... pdf and html versions of the basic GPFS documents
  /include ... include files for GPFS-specific APIs, etc.
  /lib ... GPFS libraries (e.g., libgpfs.a, libdmapi.a)
  /samples ... sample scripts, benchmark codes, etc.
/var/adm/ras
  error log files ... mmfs.log.<time stamp>.<hostname> (a new log every time GPFS is restarted)
  links ... mmfs.log.latest, mmfs.log.previous
/tmp/mmfs
  used for GPFS dumps; the sysadmin must create this directory (see mmconfig and mmchconfig)
/var/mmfs
  GPFS configuration files
The same directory structure is used on both AIX and Linux systems.
Today's trivia question... Question: What does mmfs stand for? Answer: Multi-Media File System... the predecessor to GPFS in the research lab.
10
GPFS FAQs
Common GPFS FAQs
GPFS for AIX FAQs
GPFS for Linux FAQs
COMMENT: These web pages are very helpful documents on GPFS. The GPFS development team keeps them relatively up to date.
11
Comments on Selected GPFS Manuals in GPFS Documentation
Concepts, Planning and Installation Guide
One of the most helpful manuals on GPFS... it provides an excellent conceptual overview of GPFS. If this were a university class, this manual would be your assigned reading. :->
Administration and Programming Reference
Documents GPFS-related administrative procedures and commands as well as an API guide for GPFS extensions to the POSIX API. The command reference is identical to the man pages.
Problem Determination Guide
Many times, GPFS error messages in the mmfs.log files have an error number. You can generally find these referenced in this guide with a brief explanation regarding the cause of the message. They will often point to likely earlier error messages, helping you find the cause of the problem as opposed to its symptom.
Data Management API Guide
Documentation available online at... Available with the GPFS SW distribution in /usr/lpp/mmfs/gpfsdocs
Note to sysadmins... be sure to install this directory! Also install the man pages!
IBM Redbooks and Redpapers: do a search on GPFS
12
File system
A GPFS file system is built from a collection of disks which contain the file system data and metadata. A file system can be built from a single disk or contain thousands of disks, storing petabytes of data.
A GPFS cluster can contain up to 256 mounted file systems. There is no limit placed upon the number of simultaneously opened files within a single file system.
As an example, current GPFS customers are using single file systems up to 2 PB in size, and others containing tens of millions of files.
13
GPFS Features Main Features
Disk scaling, allowing large, single-instance global file systems (2 PB tested)
Node scaling (2300+ nodes), allowing large clusters and high bandwidth (many GB/s)
Multi-cluster architecture (i.e., grid)
Journaling (logging) file system – logs information about operations performed on the file system metadata as atomic transactions that can be replayed
Data Management API (DMAPI) – industry-standard interface that allows third-party applications (e.g. TSM) to implement hierarchical storage management
14
Data availability Fault tolerance File system health monitoring
Fault tolerance:
Clustering – node failure
Storage system failure – data replication
File system health monitoring:
Extensive logging and automated recovery actions in case of failure; the appropriate recovery action is taken automatically
Data replication available for: journal logs, data, metadata
Connection retries: if the LAN connection to a node fails, GPFS will automatically try to re-establish the connection before marking the node unavailable
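Replication is selected per file system; the sketch below shows how two-way replication of data and metadata might be requested using the mmcrfs options described later in these slides. It assumes a descriptor file (disk.lst) whose NSDs are spread over at least two failure groups; names are illustrative.
# Hypothetical example: keep 2 copies of both metadata and data
mmcrfs /gpfs gpfs -F disk.lst -m 2 -M 2 -r 2 -R 2 -B 1024k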
15
Installation
Very simple (2 steps):
1. Install 4 RPM packages:
gpfs.base
gpfs.msg.en_US
gpfs.docs
gpfs.gpl
2. Build the Linux portability interface (see /usr/lpp/mmfs/src/README):
cd /usr/lpp/mmfs/src
make Autoconfig
make World
make InstallImages
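As an illustration of step 1 (a sketch, assuming the GPFS RPMs for your release and architecture sit in the current directory; exact package file names vary):
rpm -ivh gpfs.base-*.rpm gpfs.gpl-*.rpm gpfs.msg.en_US-*.rpm gpfs.docs-*.rpm
rpm -qa | grep gpfs    # verify that the four packages are installed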
16
Installation (comments)
Updates are freely available from the official GPFS site
Passwordless access is needed from any node to any node within the cluster
rsh or ssh must be configured accordingly
Dependencies:
compat-libstdc++
xorg-x11-devel (imake is required for Autoconfig)
No need to repeat the portability layer build on all hosts: once compiled, copy the binaries (5 kernel modules) to all other nodes (with the same kernel and architecture/hardware).
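A minimal sketch of setting up password-less ssh for root between two nodes (the node name gpfs-01-02 is a placeholder; repeat for every node pair, in both directions):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # generate a key pair without a passphrase
ssh-copy-id root@gpfs-01-02                 # install the public key on the other node
ssh gpfs-01-02 date                         # must now work without a password prompt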
17
Administration SNMP interface
Consistent with standard Linux file system administration
Simple CLI; most commands can be issued from any node in the cluster
No dependency on Java or graphics libraries
Extensions for clustering aspects: a single command can perform an action across the entire cluster
Support for the Data Management API (IBM's implementation of the X/Open data storage management API)
Rolling upgrades: individual nodes in the cluster can be upgraded while the file system remains online
Quota management: enables control and monitoring of file system usage by users and groups across the cluster
Snapshot function: can be used to preserve the file system's contents at a single point in time
SNMP interface: allows monitoring by network management applications
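Hedged examples of the snapshot and quota functions mentioned above (device and names are illustrative; quotas must already be enabled on the file system, e.g. with the -Q yes option of mmcrfs):
mmcrsnapshot gpfs snap_20080604    # preserve the current contents of file system gpfs
mmlssnapshot gpfs                  # list existing snapshots
mmrepquota -u gpfs                 # report per-user quota usage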
18
"mm list" commands GPFS provides a number of commands to list parameter settings, configuration components and other things. COMMENT: By default, nearly all of the mm commands require root authority to execute. However, many sysadmins reset the permissions on mmls commands to allow programmers and others to execute them as they are very useful for the purposes of problem determination and debugging. GPFS fornisce una serie di comandi per elencare le impostazioni di parametro, la configurazione di componenti e altre cose. Commento: Per impostazione predefinita, quasi tutti i comandi mm richiedono autorità di root per essere eseguiti. Tuttavia, molti sysadmins reimpostano le autorizzazioni su comandi mmls per consentire ai programmatori e altri per l'esecuzione di essi dato che sono molto utili ai fini della individuazione dei problemi e il debug.
19
Selected "mmls" Commands
mmlsfs <device name>
Without specifying any options, it lists all file system attributes:
~]# mmlsfs gpfs
flag  value          description
 -f                  Minimum fragment size in bytes
 -i                  Inode size in bytes
 -I                  Indirect block size in bytes
 -m                  Default number of metadata replicas
 -M                  Maximum number of metadata replicas
 -r                  Default number of data replicas
 -R                  Maximum number of data replicas
 -j    cluster       Block allocation type
 -D    nfs           File locking semantics in effect
 -k    all           ACL semantics in effect
 -a                  Estimated average file size
 -n                  Estimated number of nodes that will mount file system
 -B                  Block size
 -Q    none          Quotas enforced
       none          Default quotas enabled
 -F                  Maximum number of inodes
 -V    ( )           File system version
 -u    yes           Support for large LUNs?
 -z    no            Is DMAPI enabled?
 -L                  Logfile size
 -E    no            Exact mtime mount option
 -S    no            Suppress atime mount option
 -K    whenpossible  Strict replica allocation option
 -P    system        Disk storage pools in file system
 -d    disk_hdb_gpfs_01_01;disk_hdb_gpfs_01_02;disk_hdb_gpfs_01_03  Disks in file system
 -A    yes           Automatic mount option
 -o    none          Additional mount options
 -T    /gpfs         Default mount point
20
Selected "mmls" Commands
mmlsconfig
Without specifying any options, it lists all current nodeset configuration info:
~]# mmlsconfig
Configuration data for cluster gpfs cr.cnaf.infn.it:
clusterName gpfs cr.cnaf.infn.it
clusterId
clusterType lc
autoload yes
minReleaseLevel
dmapiFileHandleSize 32
pagepool 256M
dmapiWorkerThreads 24

File systems in cluster gpfs cr.cnaf.infn.it:
/dev/gpfs
21
Other Selected "mmls" Commands
mmlsattr <file name>
query file attributes
mmlscluster
display current configuration information for a GPFS cluster
mmlsdisk <device> [-d "disk names list"]
display the current configuration and state of the disks in a file system
mmlsmgr
display which node is the file system manager for the specified file systems
mmlsnsd
display current NSD information in the GPFS cluster
NOTE: see the documentation for other parameters and options.
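Typical invocations on the test cluster described on the next slide (a sketch using names from these slides):
mmlscluster
mmlsnsd -m                               # map NSD names to local devices and server nodes
mmlsdisk gpfs -d "disk_hdb_gpfs_01_01"   # configuration and state of one disk of file system gpfs
mmlsmgr gpfs                             # which node is the file system manager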
22
Testbed
Almost all further examples refer to a simple 4-node cluster used at CNAF for testing purposes:
4 dual-CPU nodes:
gpfs-01-01 (I/O server)
gpfs-01-02 (I/O server)
gpfs-01-03 (I/O server)
TSM-TEST-1 (client)
NSDs: internal IDE HDD 20 GB (hdb) on each I/O server:
disk_hdb_gpfs_01_01
disk_hdb_gpfs_01_02
disk_hdb_gpfs_01_03
Interconnect: 1 Gb/s Ethernet between NSD servers and client
23
Selected "mm" Commands GPFS provides a number of commands needed to create the file system. These commands of necessity require root authority to execute. mmcrcluster - Creates a GPFS cluster from a set of nodes. >mmcrcluster -n gpfs.nodelist \ -p gpfs cr.cnaf.infn.it \ -s gpfs cr.cnaf.infn.it \ -r /usr/bin/ssh \ -R /usr/bin/scp \ -C test.cr.cnaf.infn.it \ -U cr.cnaf.infn.it > >cat gpfs.nodelist gpfs-01-01:quorum-manager gpfs-01-01:quorum GPFS fornisce una serie di comandi necessari per creare il FS. Questi comandi di necessità richiedono autorità root per essere eseguiti. mmcrcluster - Crea un cluster GPFS da una serie di nodi.
24
Selected "mm" Commands mmstartup and mmshutdown
startup and shutdown of the mmfsd daemons
if necessary, mount the file system after running mmstartup
properly configured, mmfsd will start automatically (n.b., no need to run mmstartup); if it cannot start for some reason, you will see runmmfs running and a lot of messages in /var/adm/ras/mmfs.log.latest
mmgetstate - displays the state of the GPFS daemon on one or more nodes:
~]# mmgetstate -a
 Node number  Node name     GPFS state
              gpfs-01-01    active
              gpfs-01-02    active
              gpfs-01-03    active
              TSM-TEST-1    active
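A typical manual start sequence, shown as a sketch (with autoload=yes and the file system's -A yes option, both steps normally happen automatically):
mmstartup -a       # start the mmfsd daemon on all nodes
mmgetstate -a      # wait until all nodes report "active"
mmmount gpfs -a    # mount file system gpfs on all nodes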
25
Selected "mm" Commands mmcrnsd
Creates and globally names Network Shared Disks (NSDs) for use by GPFS.
The mmfsd daemon must be running to execute mmcrnsd (i.e., run mmstartup first).
> mmcrnsd -F disk.lst
disk.lst is a disk descriptor file whose entries are in the format
DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
DiskName: the disk name as it appears in /dev
ServerList: a comma-separated list of NSD server nodes, up to eight NSD servers. The first server on the list is used preferentially; if the first server is not available, the NSD will use the next available server on the list.
DiskUsage: dataAndMetadata (default), dataOnly or metadataOnly
FailureGroup: GPFS uses this information during data and metadata placement to ensure that no two replicas of the same block can become unavailable due to a single failure. All disks that are attached to the same adapter or NSD server should be placed in the same failure group.
DesiredName: the name you want for the NSD to be created. Default format: gpfs<integer>nsd
disk.lst is modified for use as the input file to the mmcrfs command.
26
Selected "mm" Commands Disk Descriptor Files
> cat disk.lst
/dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_01
/dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_02
/dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_03
> mmcrnsd -F disk.lst
…
> cat disk.lst
# /dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_01
disk_hdb_gpfs_01_01:::dataAndMetadata:4001
# /dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_02
disk_hdb_gpfs_01_02:::dataAndMetadata:4002
# /dev/hdb:gpfs cr.cnaf.infn.it::::disk_hdb_gpfs_01_03
disk_hdb_gpfs_01_03:::dataAndMetadata:4003
NOTES
These are the results on a single node with internal (IDE) disks, using disk descriptor defaults.
The integer in the NSD disk names is based on a counter. If you delete and re-create the file system, the counter is generally not re-initialized.
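After mmcrnsd, the new NSDs can be checked with mmlsnsd; until a file system is created on them they are reported as free disks (illustrative):
mmlsnsd    # lists file system (or "(free disk)"), NSD name and server list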
27
Selected "mm" Commands mmcrfs <mountpoint> <device name> <options> Create a GPFS file system -F specifies a file containing a list of disk descriptors (one per line) this is the output file from mmcrnsd -A do we mount file system when starting mmfsd (default = yes) -B block size (16K, 64K, 128K, 256K, 512K, 1024K,2M,4M) -E specifies whether or not to report exact mtime values -m default number of copies (1 or 2) of i-nodes and indirect blocks for a file -M default max number of copies of inodes, directories, indirect blocks for a file -n estimated number of nodes that will mount the file system -N max number of files in the file system (default = sizeof(file system)/1M -Q activate quotas when the file system is mounted (default = NO) -r default number of copies of each data block for a file -R default maximum number of copies of data blocks for a file -S suppress the periodic updating of the value of atime -v verify that specified disks do not belong to an existing file system -z enable or disable DMAPI on the file system (default = no) Typical example mmcrfs /gpfs gpfs -F disk.lst -A yes -B 1024k -v no
28
"mm change" commands. GPFS provides a number of commands to change configuration and file system parameters after being initially set. There are some GPFS parameters which are initially set only by default; the only way to modify their value is using the appropriate mmch command. N.B., There are restrictions regarding changes that can be made to many of these parameters; be sure to consult the Concepts, Planning and Installation Guide for tables outlining what parameters can be changed and under which conditions they can be changed. See the Administration and Programming Reference manual for further paramter details.
29
Selected "mmch" Commands
mmchconfig
change GPFS configuration attributes originally set (explicitly or implicitly) by mmconfig; relative to mmconfig, parameter IDs may be different
mmchconfig Attribute=value[,Attribute=value...] [-i | -I] [-N {Node[,Node...] | NodeFile | NodeClass}]
Parameters and options:
-N list of node names (default is all nodes in the cluster); cannot be used with all options
autoload (same as -a)
dataStructureDump (same as -D)
maxblocksize: changes the maximum file system block size
maxMBpS: data rate estimate (MB/s) of how much data can be transferred in or out of one node. The value is used in calculating the amount of I/O that can be done to effectively prefetch data for readers and write-behind data from writers. By lowering this value, you can artificially limit how much I/O one node can put on all of the disk servers. This is useful in environments in which a large number of nodes can overrun a few virtual shared disk servers. The default is 150 MB/s, which can severely limit performance on HPS ("Federation")-based systems.
maxFilesToCache (same as -M)
maxStatCache: specifies the number of i-nodes to keep in the stat cache
pagepool (same as -p)
The following options apply only to dataStructureDump, maxblocksize and pagepool:
-i changes are immediate and permanent
-I changes are immediate, but do not persist after the mmfsd daemon is restarted
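For example (a sketch; values and node names are illustrative, not recommendations):
mmchconfig pagepool=512M -i -N gpfs-01-01,gpfs-01-02    # immediate and permanent, on two nodes only
mmchconfig maxMBpS=400                                  # normally takes effect at the next daemon restart
mmlsconfig                                              # verify the new settings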
30
Managing disks
mmdf - queries available file space on a GPFS file system.
~]# mmdf gpfs
disk                  disk size  failure  holds     holds  free KB         free KB
name                  in KB      group    metadata  data   in full blocks  in fragments
Disks in storage pool: system (Maximum disk size allowed is 61 GB)
disk_hdb_gpfs_01_01              yes       yes      ( 86%)  ( 0%)
disk_hdb_gpfs_01_02              yes       yes      ( 86%)  ( 0%)
disk_hdb_gpfs_01_03              yes       yes      ( 86%)  ( 0%)
(pool total)                                        ( 86%)  ( 0%)
              =============  ====================  ===================
(total)                                             ( 86%)  ( 0%)

Inode Information
Number of used inodes:
Number of free inodes:
Number of allocated inodes:
Maximum number of inodes:
31
Deleting a disk ~]# mmdeldisk gpfs "disk_hdb_gpfs_01_03" Deleting disks ... Scanning system storage pool Scanning file system metadata, phase 1 ... Scan completed successfully. Scanning file system metadata, phase 2 ... Scanning file system metadata, phase 3 ... Scanning file system metadata, phase 4 ... Scanning user file metadata ... 100 % complete on Wed Jun 4 17:22: Checking Allocation Map for storage pool 'system' tsdeldisk completed. mmdeldisk: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. ~]# mmdf gpfs disk disk size failure holds holds free KB free KB name in KB group metadata data in full blocks in fragments Disks in storage pool: system (Maximum disk size allowed is 61 GB) disk_hdb_gpfs_01_ yes yes ( 81%) ( 0%) disk_hdb_gpfs_01_ yes yes ( 81%) ( 0%) (pool total) ( 81%) ( 0%) ============= ==================== =================== (total) ( 81%) ( 0%) Inode Information Number of used inodes: Number of free inodes: Number of allocated inodes: Maximum number of inodes:
32
Filesets and Storage pools
New features of GPFS v3.1:
Storage pools allow the creation of disk groups within a file system (hardware partitioning).
A fileset is a sub-tree of the file system namespace (namespace partitioning). For example, filesets can be used as administrative boundaries to set quotas.
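A hedged sketch of creating a fileset in the /gpfs file system of the testbed (names are illustrative):
mmcrfileset gpfs users_fset                     # create the fileset
mmlinkfileset gpfs users_fset -J /gpfs/users    # attach it at a junction in the namespace
mmlsfileset gpfs                                # list filesets and their status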
33
Adding a disk (and a storage pool)
~]# mmadddisk gpfs "disk_hdb_gpfs_01_03:::dataOnly:::data"
The following disks of gpfs will be formatted on node gpfs cr.cnaf.infn.it:
    disk_hdb_gpfs_01_03: size KB
Extending Allocation Map
Creating Allocation Map for storage pool 'data'
Flushing Allocation Map for storage pool 'data'
Disks up to size 52 GB can be added to storage pool 'data'.
Checking Allocation Map for storage pool 'data'
Completed adding disks to file system gpfs.
mmadddisk: Propagating the cluster configuration data to all affected nodes.
This is an asynchronous process.
~]# mmdf gpfs
disk                  disk size  failure  holds     holds  free KB         free KB
name                  in KB      group    metadata  data   in full blocks  in fragments
Disks in storage pool: system (Maximum disk size allowed is 61 GB)
disk_hdb_gpfs_01_01              yes       yes      ( 65%)  ( 0%)
disk_hdb_gpfs_01_02              yes       yes      ( 65%)  ( 0%)
(pool total)                                        ( 65%)  ( 0%)
Disks in storage pool: data (Maximum disk size allowed is 52 GB)
disk_hdb_gpfs_01_03              no        yes      (100%)  ( 0%)
(pool total)                                        (100%)  ( 0%)
              =============  ====================  ===================
(data)                                              ( 77%)  ( 0%)
(metadata)                                          ( 65%)  ( 0%)
(total)                                             ( 77%)  ( 0%)

Inode Information
Number of used inodes:
Number of free inodes:
Number of allocated inodes:
Maximum number of inodes:
34
Initial Placement policy
Two storage pools:
system – data and metadata
data – data only
Placement policy example: use pool "data" until it is 99% full, then use pool "system":
RULE 'rule1' SET POOL 'data' LIMIT(99)
RULE 'default' SET POOL 'system'
Place all files with UID > 2048 in pool "data", and all others in "system":
RULE 'rule1' SET POOL 'data' WHERE USER_ID > 2048
RULE 'default' SET POOL 'system'
Install the placement policy:
mmchpolicy Device PolicyFilename -I yes
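Putting the two placement rules above into a file and installing them might look like this (the file name is illustrative):
cat > placement.pol <<'EOF'
RULE 'rule1' SET POOL 'data' LIMIT(99)
RULE 'default' SET POOL 'system'
EOF
mmchpolicy gpfs placement.pol -I yes    # install and activate the placement policy
mmlspolicy gpfs                         # show the policy currently in effect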
35
User-defined policies
File placement policies
Define where data will be created (in the appropriate storage pool)
Rules are determined by attributes such as: file name, user name, fileset
File management policies
Possibility to move data from one pool to another without changing the file location in the directory structure
Change replication status
Prune the file system (deleting files as defined by policy)
Determined by attributes such as: access time, path name, size of the file
36
Policy rules examples
If the storage pool named pool_1 has an occupancy percentage above 90% now, bring the occupancy percentage of pool_1 down to 70% by migrating the largest files to storage pool pool_2:
RULE 'mig1' MIGRATE FROM POOL 'pool_1' THRESHOLD(90,70) WEIGHT(KB_ALLOCATED) TO POOL 'pool_2'
Delete files from the storage pool named pool_1 that have not been accessed in the last 30 days, and are named like temporary files or appear in any directory that is named tmp:
RULE 'del1' DELETE FROM POOL 'pool_1' WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30) AND (lower(NAME) LIKE '%.tmp' OR PATH_NAME LIKE '%/tmp/%')
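Management rules like these are evaluated by mmapplypolicy; a cautious way to run them is to test first (a sketch, with a hypothetical rules file manage.pol):
mmapplypolicy gpfs -P manage.pol -I test    # report what would be migrated/deleted, change nothing
mmapplypolicy gpfs -P manage.pol            # actually perform the migrations/deletions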
37
Sharing data between clusters
GPFS allows data to be shared across clusters
Permits access to specific file systems from another GPFS cluster
Higher performance levels than file-sharing technologies like NFS or Samba
Requires a trusted kernel at both the owning and the accessing cluster
Both LAN and SAN can be used as the cluster interconnect
Multi-cluster configurations are possible with LAN-only and with mixed LAN and SAN connections
38
Cross-cluster file system access
39
Cross-cluster file system access (requirements)
OpenSSL must be installed on all nodes in the involved clusters
See the GPFS Frequently Asked Questions at publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html for current OpenSSL version requirements and for information on the supported cipher suites.
The procedure to set up remote file system access involves the generation and exchange of authorization keys between the two clusters:
the administrator of the GPFS cluster that owns the file system needs to authorize the remote clusters that are to access it
the administrator of the GPFS cluster that seeks access to a remote file system needs to define to GPFS the remote cluster and the file system whose access is desired
In the following example, cluster1 is the name of the cluster that owns and serves the file system to be mounted, and cluster2 is the name of the cluster that wants to access that file system.
40
Cross-cluster file system access
1. On cluster1, the system administrator generates a public/private key pair (the key pair is placed in /var/mmfs/ssl):
mmauth genkey new
2. On cluster1, the system administrator enables authorization by issuing:
mmauth update . -l AUTHONLY
This should be done while GPFS is stopped on all nodes.
3. The system administrator of cluster1 gives the file /var/mmfs/ssl/id_rsa.pub to the system administrator of cluster2, who desires access to the cluster1 file systems. This operation must occur outside of the GPFS command environment.
4. On cluster2, the system administrator generates a public/private key pair:
mmauth genkey new
5. On cluster2, the system administrator enables authorization by issuing:
mmauth update . -l AUTHONLY
6. The system administrator of cluster2 gives the file /var/mmfs/ssl/id_rsa.pub to the system administrator of cluster1. This operation must occur outside of the GPFS command environment.
41
Cross-cluster file system access (cont.)
7. On cluster1, the system administrator authorizes cluster2 to mount file systems owned by cluster1, using the key file received from the administrator of cluster2:
mmauth add cluster2 -k cluster2_id_rsa.pub
where:
cluster2 is the real name of cluster2 as given by the mmlscluster command in cluster2
cluster2_id_rsa.pub is the name of the file obtained from the administrator of cluster2 in Step 6
8. On cluster1, the system administrator authorizes cluster2 to mount specific file systems owned by cluster1:
mmauth grant cluster2 -f /dev/gpfs
9. On cluster2, the system administrator now must define the cluster name, contact nodes and public key for cluster1:
mmremotecluster add cluster1 -n node1,node2,node3 -k cluster1_id_rsa.pub
where:
cluster1 is the real name of cluster1 as given by the mmlscluster command
node1, node2, and node3 are nodes in cluster1; the hostname or IP address must refer to the communications adapter that is used by GPFS, as given by mmlscluster
cluster1_id_rsa.pub is the name of the file obtained from the administrator of cluster1 in Step 3
This gives the cluster that wants to mount the file system a means to locate the serving cluster and ultimately mount its file systems.
42
Cross-cluster file system access (cont.)
10. On cluster2, the system administrator issues one or more mmremotefs commands to identify the file systems in cluster1 that are to be accessed by nodes in cluster2:
mmremotefs add /dev/mygpfs -f /dev/gpfs -C cluster1 -T /mygpfs
where:
/dev/mygpfs is the device name under which the file system will be known in cluster2
/dev/gpfs is the actual device name for the file system in cluster1
cluster1 is the real name of cluster1 as given by the mmlscluster command on a node in cluster1
/mygpfs is the local mount point in cluster2
11. Mount the file system on cluster2 with the command:
mmmount /dev/mygpfs
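On cluster2 the remote definitions can then be reviewed with (illustrative):
mmauth show all             # authorization state with respect to other clusters
mmremotecluster show all    # remote clusters known to this cluster
mmremotefs show all         # remote file systems, local device names and mount points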
43
Cross-cluster file system access (summary)
Commands that the administrators of the two clusters need to issue so that the nodes in cluster2 can mount the remote file system fs1, owned by cluster1, assigning rfs1 as the local name with a mount point of /rfs1:
On cluster1:
mmauth genkey new
mmshutdown -a
mmauth update . -l AUTHONLY
mmstartup -a
On cluster2:
mmauth genkey new
mmshutdown -a
mmauth update . -l AUTHONLY
mmstartup -a
Exchange the public keys (file /var/mmfs/ssl/id_rsa.pub)
On cluster1:
mmauth add cluster2 ...
mmauth grant cluster2 -f fs1 ...
On cluster2:
mmremotecluster add cluster1 ...
mmremotefs add rfs1 -f fs1 -C cluster1 -T /rfs1
44
Got troubles?
Most common problems:
Network (problems in inter-node communication): ping of a node shows a high RTT
Loss of quorum: check the cluster status (mmgetstate -a)
Disk problems: check the status of the NSDs (mmlsdisk <device>)
Check for "waiters" (mmfsadm dump waiters)
Check for error messages in /var/adm/ras/mmfs.log.latest
Recovery:
Don't hurry, especially if you haven't understood what's going on
Take a coffee break (usually GPFS will auto-recover in ~10 min)
Node problem: restart GPFS on those nodes (mmshutdown -N node1, mmstartup -N node1), or reboot
Disk problems: if some disks are "down", bring them up with: mmchdisk <device> start -d "disk names list"
Follow the "Problem Determination Guide"
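A short checklist of the commands mentioned above, run from any node of the testbed (a sketch; device and disk names follow the earlier examples):
mmgetstate -a                                  # daemon state on all nodes (quorum)
mmlsdisk gpfs -e                               # list only the disks that are not up/ready
mmfsadm dump waiters                           # long-running waiters often point at the culprit
tail -50 /var/adm/ras/mmfs.log.latest          # recent GPFS messages on this node
mmchdisk gpfs start -d "disk_hdb_gpfs_01_03"   # bring a "down" disk back up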