1
Fabric Management
The European DataGrid Project Team
http://www.eu-datagrid.org
2
Contents
- Definition
- Fabric Management architecture overview
- User job management overview
- Present status (R1.2): LCFG, LCAS + edg_gatekeeper
In this tutorial we present the Fabric Management architecture and the present status of the implementation in more detail.
3
What is Fabric Management
Definitions:
- Cluster (or Farm): "A collection of computers on a network that can function as a single computing resource through the use of tools which hide the underlying physical structure."
- Fabric: "A complete set of computing resources (processors, memory, storage) operated in a coordinated fashion with a uniform set of management tools."
Functionality:
- Enterprise system administration, scalable to ~10K nodes: automated installation and maintenance of nodes; resource management (batch, interactive); monitoring of events and performance; fault tolerance and recovery actions; fabric configuration management.
- Provision for running grid jobs: authorization according to local policies; mapping grid credentials to local ones; publication of grid resources and job information.
- Provision for running local jobs: sharing of resources according to local policies.
4
Architecture logical overview
[Diagram: the fabric management subsystems (Fabric Gridification, Resource Management, Installation & Node Mgmt, Configuration Management, Monitoring & Fault Tolerance) sit between the Grid services (Resource Broker, Grid Info Services, Data Mgmt, Grid Data Storage) and the local resources: Farm A (LSF), Farm B (PBS), mass storage and disk pools, serving both Grid and local users.]
This slide shows the WP4 view of the EDG interactions. The WP4 services are Fabric Gridification, Resource Management, Installation & Node Management, Monitoring and Fault Tolerance, and Configuration Management.
5
Architecture logical overview
[Same architecture diagram, highlighting Fabric Gridification.]
- Interface between Grid-wide services and the local fabric;
- Provides local authentication, authorization and mapping of grid credentials.
The Gridification subsystem provides the interface from the Grid to the resources available inside a fabric for batch and interactive CPU services. It provides the interface for job submission/control and resource publication to the Grid services. It also provides functionality for local authentication, policy-based authorisation, and mapping of Grid credentials to local credentials.
6
Architecture logical overview
[Same architecture diagram, highlighting Resource Management.]
- Provides transparent access (both job and admin) to different cluster batch systems;
- Enhanced capabilities (extended scheduling policies, advance reservation, local accounting).
The Resource Management subsystem is a layer on top of the fabric's available cluster batch and interactive services. While the Grid Resource Broker manages workload distribution between fabrics, the Resource Management subsystem manages workload distribution and resource sharing of all batch and interactive services inside a fabric, according to defined policies and user allocations. Such cluster batch systems are also known as local resource management systems (LRMS).
7
Architecture logical overview
[Same architecture diagram, highlighting Installation & Node Mgmt.]
- Provides the tools to install and manage all software running on the fabric nodes;
- An agent to install, upgrade, remove and configure software packages on the nodes;
- Bootstrap services and software repositories.
The Installation Management subsystem handles the initial installation of computing fabric nodes. It also handles software distribution, configuration and maintenance according to information stored in the Configuration Management subsystem.
8
Architecture logical overview
[Same architecture diagram, highlighting Configuration Management.]
- Provides central storage and management of all fabric configuration information;
- A central DB and a set of protocols and APIs to store and retrieve information.
The Configuration Management subsystem provides the components to manage and store centrally all fabric configuration information. This includes the configuration of all Fabric Management subsystems as well as information about the fabric hardware, system and services.
9
Architecture logical overview
[Same architecture diagram, highlighting Monitoring & Fault Tolerance.]
- Provides the tools for gathering monitoring information on fabric nodes;
- A central measurement repository stores all monitoring information;
- Fault tolerance correlation engines detect failures and trigger recovery actions.
The Fabric Monitoring and Fault Tolerance subsystem provides the necessary components for gathering, storing and retrieving performance, functional, setup and environmental data for all fabric elements. It also provides the means to correlate that data and execute corrective actions.
10
User job management (Grid and local)
[Diagram: the same fabric, now showing the path of a user job through the Gridification, Resource Management and Monitoring subsystems.]
The next few slides go through the lifecycle of a grid job, as seen from the fabric point of view.
11
User job management (Grid and local)
[Diagram: the Grid user submits a job to the Resource Broker.]
USE CASE: Grid job submission. A Grid user wants to execute a simple job on a fabric via the Grid Resource Broker. The Fabric Management subsystems directly involved are the Gridification subsystem and the Resource Management subsystem.
12
User job management (Grid and local)
[Diagram: the fabric publishes resource and accounting information to the Grid Info Services.]
The Monitoring subsystem, through the grid fabric information service, supplies aggregate or abstracted information about the local fabric to the Grid Information and Monitoring service.
13
User job management (Grid and local)
[Diagram: the Resource Broker performs an optimized selection of a site.]
The Resource Broker selects a Computing Element.
14
User job management (Grid and local)
[Diagram: the Gridification subsystem authorizes the request and maps grid credentials to local credentials.]
The Computing Element creates a jobID and updates its repository with a new attribute set containing the jobID, the JDL description, and the presented credential. The Computing Element calls the Gridification subsystem (LCAS module) via the get_fabric_authorisation() interface. The LCAS calls all the registered authorisation modules in sequence, with the JDL and the credential as input. If a module rejects the request, the LCAS returns an error to the CE, and the CE returns an error to the Grid Resource Broker, stating that local authorisation has failed.
15
User job management (Grid and local)
[Diagram: the Resource Management subsystem selects an optimal batch queue, submits the job, and returns the job status and output.]
The Computing Element contacts the RMS Scheduler to prepare the request, put the job into the schedule, and submit it to the selected cluster batch system via its proxy. When the job is reported finished by the cluster batch system, the Scheduler returns the job return status (possibly including an error message) to the Computing Element, which in turn reports the job result to the Resource Broker.
16
Fabric Management @ Release 1.2
Snapshot of implemented Fabric Management services at Release 1.2:
- Installation and configuration: LCFG (Local ConFiGuration system)
- Gridification: LCAS + edg_gatekeeper
17
LCFG (Local ConFiGuration system)
LCFG was originally developed by the Computer Science Department of Edinburgh University. It handles automated installation, configuration and management of machines.
Basic features:
- automatic installation of the O.S.
- installation/upgrade/removal of all (rpm-based) software packages
- centralized configuration and management of machines
- extensible to configure and manage EDG middleware and custom application software
LCFG related links:
- Link to the original site at Edinburgh University:
- DataGrid documentation on LCFG:
- Tutorial on how to write LCFG objects:
18
LCFG system architecture
Server-side config files, e.g.:

  +inet.services          telnet login ftp
  +inet.allow             telnet login ftp sshd
  +inet.allow_telnet      ALLOWED_NETWORKS
  +inet.allow_login       ALLOWED_NETWORKS
  +inet.allow_ftp         ALLOWED_NETWORKS
  +inet.allow_sshd        ALL
  +inet.daemon_sshd       yes
  .....
  +auth.users             mickey
  +auth.userhome_mickey   /home/mickey
  +auth.usershell_mickey  /bin/tcsh

[Diagram: LCFG config files on the server are compiled ("Make XML Profile") into per-node XML profiles published by a web server; client nodes fetch them over HTTP (rdxprof), keep a local cache, and LCFG objects/components apply them to the system.]

Server side: the configuration information for all nodes, expressed as abstract configuration parameters, is stored in a central repository. A collection of agents reads the configuration parameters and either generates traditional config files or directly manipulates the various services.
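The source syntax above is what the server-side repository holds for every node. As a minimal, hypothetical sketch of a per-node source file (the #include convention assumes the config files are passed through the C preprocessor; all file and node names here are made up):

  /* pc001.example.org - hypothetical node source file */
  #include "site-defaults.h"        /* made-up site-wide defaults */

  +inet.daemon_sshd      yes        /* node-specific overrides    */
  +auth.users            mickey
  +auth.userhome_mickey  /home/mickey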
19
LCFG system architecture
[Same diagram: the server-side "Make XML Profile" step (mkxprof) and the per-node XML profile served over HTTP.]

XML profile fragment, e.g.:

  <inet>
    <allow cfg:template="allow_$ tag_$ daemon_$">
      <allow_RECORD cfg:name="telnet">
        <allow> , </allow>
      </allow_RECORD>
      .....
  </auth>
    <user_RECORD cfg:name="mickey">
      <userhome>/home/MickeyMouseHome</userhome>
      <usershell>/bin/tcsh</usershell>
    </user_RECORD>

Server side: when the config files are modified, a tool (mkxprof) recreates the new XML profile for all the nodes affected by the changes. This can be done manually, or with a daemon that periodically checks for config changes and calls mkxprof. mkxprof can also notify the affected nodes via UDP.
20
LCFG system architecture
[Same diagram: on the client node, rdxprof reads the profile into the local cache and the profile object dispatches to the LCFG components (inet, auth, ...).]

Client side: another tool (rdxprof) downloads the new profile from the server. It is usually activated by an LCFG component at boot, and can be configured to work as a daemon periodically polling the server, as a daemon waiting for notifications, or to be started by cron at predefined times. An LCFG object is notified when any of its configuration parameters has changed, so it can re-configure (and restart, if needed) the service it manages.
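For instance, the "started by cron at predefined times" mode boils down to an ordinary cron entry; the schedule, path and lack of arguments below are illustrative assumptions rather than EDG defaults:

  # hypothetical /etc/crontab entry: pull the profile twice an hour
  */30 * * * *  root  /usr/sbin/rdxprof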
21
LCFG system architecture
[Same diagram: the auth object on the client node regenerating /etc/passwd, /etc/group and /etc/shadow, e.g. the line "mickey:x:999:20::/home/Mickey:/bin/tcsh" in /etc/passwd.]

Client side: when the configuration parameters are modified, the configuration files are regenerated by the corresponding LCFG object.
23
LCFG: what’s a component?
Component == object. It is a simple shell script (but moving to Perl). Each component provides a number of "methods" (start, stop, configure, ...) which are invoked at appropriate times.
A simple and typical component behaviour:
- started when notified of a configuration change
- loads its configuration (locally cached)
- configures the appropriate services, by translating config parameters into a traditional config file and reloading a service if necessary (e.g. restarting an init.d service).
LCFG provides components to manage the configuration of the services of a machine: inet, auth, nfs, cron, ...
Admins/middleware developers can build new custom components to configure and manage their own applications.
24
LCFG component skeleton
#!/bin/bash2

class=mycomp          # set the name of the component (mycomp)
. /etc/obj/generic    # include std methods and definitions

Start() {             # Start the component
}

Configure() {         # Do the configuration
}

# 'main' program
case "$1" in
  configure) Configure;;
  *)         DoMethod;;
esac
25
LCFG component methods
Start() method:

  Start() {                      # 'Start' the component
    Generic_Start                # standard setup steps
    Configure                    # reconfigure the component
    if [ $? = 0 ]; then
      OK "Component mycomp started"
    else
      Fail "Starting component mycomp"
    fi
  }

Configure() method:

  Configure() {                  # Do the configuration
    LoadResources  myresource1 myresource2 ...
    CheckResources myresource1 myresource2 ...
    # your code
    do_whatever ...
    return $status
  }

The specific part of the Configure() method will usually contain three steps:

  # 1. Load the myconfig resource as an environment variable
  LoadResources  myconfig
  CheckResources myconfig
  # 2. Generate (or update) the config file
  sed -e "s%MYCONFIG%$myconfig%" \
      /etc/obj/conf/myconfigtemplate > /etc/myconfig
  # 3. Reload/restart a service if necessary
  /etc/rc.d/init.d/myservice reload

Relevant built-in standard functions:
- LoadResources resource1 resource2 ...: loads resources into the environment
- CheckResources resource1 resource2 ...: checks that resources are defined in the environment, fails otherwise
- Fail string: prints out a [FAIL] message and exits ($? = 1)
- Warn string: prints out a [WARNING] message
- OK string: prints out an [OK] message
- LogMessage string: adds a message to the logfile

Only the component-specific parts (the resource names and the code that generates the config file and reloads the service) differ from component to component; the rest is boilerplate.
26
LCFG component example (I)
ldconf component - configures /etc/ld.so.conf (Cal Loomis)

.def file:

  class    ldconf
  methods  start configure
  conffile
  paths

Resources defined on the server:

  conffile  /etc/ld.so.conf
  paths     /usr/local/lib /opt/globus/lib
27
LCFG component example (II)
Component: the Configure() method

  Configure() {
    LoadResources  conffile paths
    CheckResources conffile paths

    # Update ld.so.conf file.
    for i in $paths; do
      line=`grep $i $conffile`
      if [ -z "$line" ]; then
        echo "$i" >> $conffile
      fi
    done

    /sbin/ldconfig
    return $?
  }
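A usage sketch: the /etc/obj path below is inferred from the skeleton shown earlier (which sources /etc/obj/generic), so treat the exact invocation as an assumption.

  # hypothetical manual invocation of the configure method on a client node
  /etc/obj/ldconf configure
  # afterwards /etc/ld.so.conf should list /usr/local/lib and /opt/globus/lib,
  # and the linker cache has been rebuilt by /sbin/ldconfig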
28
Software distribution with LCFG
Software distribution is done with an LCFG tool called updaterpms. The standard system packaging format is used: rpm for Linux. It is managed via an LCFG component.
Functionality:
- compares the packages currently installed on the local node with the packages listed in the configuration
- computes the necessary install/deinstall/upgrade operations
- orders the transaction operations taking into account rpm dependency information
- invokes the packager (RPM) with the right operation transaction set
Packager functionality:
- reads the operations (transactions)
- downloads new packages from the repository
- executes the operations (installs/removes/upgrades)
Note: the rpm command-line tool is not used by updaterpms; updaterpms accesses the RPM C libraries directly.
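To make the comparison step concrete, here is a minimal shell sketch. It is not the actual updaterpms code (which talks to the RPM C libraries directly and handles versions, upgrades and dependency ordering); it assumes a made-up plain-text file listing the desired package names, one per line, and only reports what would have to change.

  #!/bin/bash
  # Sketch of the "compare installed vs. configured packages" step.
  WANTED=/var/obj/conf/rpmlist          # hypothetical desired-package list

  # Names of packages currently installed on the node, sorted.
  rpm -qa --qf '%{NAME}\n' | sort -u > /tmp/installed.$$
  sort -u "$WANTED"                   > /tmp/wanted.$$

  echo "== to install (configured but not installed) =="
  comm -13 /tmp/installed.$$ /tmp/wanted.$$

  echo "== to remove (installed but not configured) =="
  comm -23 /tmp/installed.$$ /tmp/wanted.$$

  rm -f /tmp/installed.$$ /tmp/wanted.$$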
29
Gridification Architecture
[Diagram: the Gridification components (Computing Element / EDG Gatekeeper with job repository and job manager, LCAS with plug-ins for uid/gid, static lists, tokens, wallclock time and quota checks, LCMAPS, FLIDS, FabNAT, GriFIS with its information providers) sitting between the external grid services (Scheduler, IMS from WP3) and the fabric-internal services (RMS, farms, Storage Element, Fabric Monitoring, Configuration Management and Installation Management). Only part of this design is implemented in R1.2.]

The Gridification subsystem is composed of the following basic components:
- CE, Computing Element: mediates requests (e.g. job execution, resource reservation) received from any grid entity (such as the Grid Scheduler from WP1) to the Resource Management System. This component is the only one that can be accessed from outside the fabric.
- LCAS, local centre authorisation service: provides local authorisation for requests made to the fabric by grid services.
- LCMAPS, local credential mapping service: provides all local credentials needed for jobs allowed into the fabric.
- FLIDS: an automated local certifying entity that can sign certificate requests according to a predefined policy list.
- GriFIS, grid fabric information service: supplies aggregate or abstracted information about the local fabric to the Grid Information and Monitoring service.
- FabNAT, fabric NAT gateway: provides a mechanism to support connections from individual farm nodes to locations outside the fabric, for example for MPI parallel programs, for interactive work or for graphics output.
Other acronyms: RMS - Resource Management System; IMS - Information System.
30
LCAS 1.0 (EDG release 1.2)
The Local Centre Authorization Service (LCAS) handles authorization requests to the local computing fabric. In this release the LCAS is a shared library, which is loaded dynamically by the Globus gatekeeper. The gatekeeper has been slightly modified for this purpose and is from now on referred to as the edg-gatekeeper. The authorization decision of the LCAS is based upon the user's certificate and the job specification in RSL (JDL) format. The certificate and RSL are passed to (plug-in) authorization modules, which grant or deny access to the fabric. Three standard authorization modules are provided by default:
- lcas_userallow.mod: checks if the user is allowed on the fabric (currently the gridmap file is checked)
- lcas_userban.mod: checks if the user should be banned from the fabric
- lcas_timeslots.mod: checks if the fabric is open at this time of the day for DataGrid jobs
First prototype:
- shared object (lcas.mod), invoked via dynamic library calls from the edg-gatekeeper
- 3 static authorization modules: allowed users (grid-mapfile or allowed_users.db), banned users (ban_users.db), available timeslots (timeslots.db)
- no plug-ins supported yet
- the LCFG object obj-lcas installs the authorization databases
Documentation:
31
Authentication control flow EDG gatekeeper
[Diagram: the "plain" Globus gatekeeper (TLS auth, assist_gridmap, jobmanager-*) compared with the Globus + LCAS edg-gatekeeper (TLS auth, LCAS shared object, assist_gridmap, jobmanager-*, with the job also stored in the job repository).]

The "plain" Globus-provided gatekeeper accepts jobs from the outside world, verifies that the user certificate has been signed by a trusted certification authority, and subsequently calls gss_assist_gridmapfile to obtain local account information. The grid-mapfile has a two-fold role in this process: it defines the authorized users based on their certificate (proxy) subject name, AND it gives the local uid to be used for this user.
LCAS plugins are typically:
- a banned-user list (so as to quickly dispose of users that abuse the system)
- CPU budget and quota checks (based on the RMS information)
- jobtype/class/role policies as expressed in more complex GridACLs
The LCAS system is implemented as a stand-alone daemon because many of these plugins are complex systems that are difficult to audit for security leaks (the gatekeeper runs as root).
Some acronyms: TLS - Transport Layer Security (TLSv1 is the successor of SSL v3); LCAS - Local Centre Authorization Service.
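For illustration, grid-mapfile entries simply pair a certificate subject DN with a local account name; the DNs and account names below are made up:

  # hypothetical grid-mapfile entries
  "/O=Grid/O=CERN/OU=cern.ch/CN=John Doe"      dteam001
  "/O=Grid/O=NIKHEF/OU=nikhef.nl/CN=Jane Roe"  dteam002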
32
Summary
Two main fabric management components are deployed in release 1.2:
- Installation and Configuration management functionality: LCFG (Local ConFiGuration system) handles automated installation, configuration and management of machines.
- Gridification functionality: LCAS (Local Centre Authorization Service) + edg_gatekeeper handle authorization requests to the local computing fabric.
More components are coming in the next releases, in the areas of fabric monitoring, resource management, and configuration management.
Experience and conclusions from this release: automatic installation and configuration of the DataGrid middleware was very useful for the testbed, given the complexity of the software.