EPCC, University of Edinburgh DIRAC and SAFE
DIRAC requirements
DIRAC serves a variety of different user communities.
–These have different computational requirements, best served by different types of computer.
–User communities are spread across many different institutions.
–Resources are geographically distributed and run by multiple organisations.
–Some of these resources are provided by existing services with existing procedures.
Funding is limited.
–Mostly only hardware was funded.
–Need to provide the rest of the service as efficiently as possible.
–Need to utilise existing infrastructure/processes where possible.
–Avoid unnecessary complications.
Stakeholders
DIRAC management
–Need overview of usage of resources to inform allocation policy.
–Need mechanisms to implement allocation policy.
Research communities
–Need resource usage information to manage community science programme.
–Need mechanisms to manage community membership.
–Need mechanisms to manage community resources.
Users
–Need to be able to request accounts (frequently at remote institutions).
–Need to access accounts remotely.
–Want to get on with science without additional complications.
Level of integration
Most requirements for integration are at the management level.
Experience suggests a strong correlation between user communities and compute resource.
–Communities will choose resources appropriate to their science.
–Users will want to access the unique features of these resources.
–Though projects may span resources, most individual users will probably stick to a single system.
Global accounts, single sign-on etc. are not essential.
GRID?
Computational grid not appropriate.
–Grids are designed to provide uniform access to interchangeable resources. DIRAC resources are complementary, not interchangeable.
–Provides a standard interface, but only to features common to all systems.
Data grid may be more relevant.
–Depends on the data handling requirements of user communities.
–Need to gather more requirements.
SAFE design principles
SAFE has been built to provide a single point of contact for users of national HPC services.
–Role is essentially that of the ITIL service desk.
–Originally deployed for the HPCx service, currently used for the HECToR service. Also used for internal EPCC services.
Provides a well-defined interface for service providers.
–Tries to express all requests as standard tickets.
–Supports multiple service providers with different support policies.
Has to make very few technological assumptions.
–Users can come from any academic institution. Can’t assume much more than email and Web.
–We usually bid to run a service in parallel with hardware procurement. We have little say over hardware or system software and need to adapt SAFE quickly to provide the service if the bid is successful.
SAFE design principles II
Has to be flexible rather than prescriptive.
–Requirements have changed constantly over the 10 years of SAFE development.
–Need to be able to quickly implement new reports or policies generated by RCs or policy panels.
–Need to maintain access to old data even when the current system/policy has changed.
–Need to be able to integrate new services into existing instances.
–Need to be able to adapt tickets to meet the needs of service teams and underlying infrastructure.
Controlling our own software gives us a great deal of flexibility.
–We have built up an extensive toolbox to allow rapid implementation of new requirements.
What can SAFE offer DIRAC?
Software already exists and is already managing the BG/Q service (minimal cost).
It is designed to handle distributed user communities from many different institutions.
–Many DIRAC users will already be familiar with it.
It is designed to handle multiple service providers with different operating policies.
While SAFE supports many features, sites only need to adopt those that fit their normal way of working.
SAFE as a service
Can use the BG/Q SAFE to provide a service for the whole of DIRAC.
–Host, install, maintain, modify where necessary.
–Generates necessary reports and statistics for the whole of DIRAC.
–Provides a single point to manage project membership, account creation etc.
–Lightweight and non-intrusive integration with service providers.
–Special handling to work within local policies.
–Choice over which features are adopted.
–Centralised service requires minimal changes to existing software and only needs O(N) interactions, not O(N^2).
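As a rough illustration of the O(N) versus O(N^2) point, the sketch below counts integration links for N parties (sites, communities, management) under two layouts. It is only arithmetic under the assumption that every party would otherwise need a direct link to every other party; the function names are illustrative and not part of SAFE.

```python
def hub_links(n_parties: int) -> int:
    """Links needed when every party talks only to one central service."""
    return n_parties


def pairwise_links(n_parties: int) -> int:
    """Links needed when every party integrates directly with every other."""
    return n_parties * (n_parties - 1) // 2


if __name__ == "__main__":
    # Compare the two layouts as the number of parties grows.
    for n in (4, 8, 16):
        print(f"N={n}: central={hub_links(n)}, pairwise={pairwise_links(n)}")
```

The gap widens quickly: doubling the number of parties doubles the central count but roughly quadruples the pairwise count.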
Account creation.
Accounts requested via SAFE.
–Sends request to project manager.
–Once approved, raises ticket with service provider.
–Default is to do this by email; XML available for scripts.

Hi Support,
This user has been authorised to have an account on one of our machines. Please create a new user account for them using the following information.
Task ID:
Machine: hector
Username: demo
User's Name: Dr Stephen P Booth
Consortium: z01 - USL
Project Group(s): z01
UID:
GID: 1001
Thanks,
The SAF.
P.S. You can see the current pending queue by looking at

[Pending queue listing: New User, Pending, hector, z01 USL, demo, z01, Dr Stephen Booth]
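A site that prefers scripts to email could read the XML form of the ticket along these lines. The slides do not show the actual SAFE XML format, so every element name below (NewUserRequest, Machine, Username and so on) is a hypothetical stand-in chosen to mirror the fields of the email above; treat this as a sketch of the approach only.

```python
import xml.etree.ElementTree as ET

# Hypothetical ticket layout; the real SAFE XML schema is not shown in the slides.
SAMPLE_TICKET = """\
<NewUserRequest>
  <Machine>hector</Machine>
  <Username>demo</Username>
  <FullName>Dr Stephen P Booth</FullName>
  <Consortium>z01</Consortium>
  <ProjectGroup>z01</ProjectGroup>
  <GID>1001</GID>
</NewUserRequest>
"""


def parse_ticket(xml_text: str) -> dict:
    """Turn a new-user ticket into a plain dict a site script can act on."""
    root = ET.fromstring(xml_text)
    return {
        "machine": root.findtext("Machine"),
        "username": root.findtext("Username"),
        "full_name": root.findtext("FullName"),
        "consortium": root.findtext("Consortium"),
        # A user may belong to several project groups.
        "groups": [g.text for g in root.findall("ProjectGroup")],
        "gid": int(root.findtext("GID")),
    }


if __name__ == "__main__":
    print(parse_ticket(SAMPLE_TICKET))
```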
Completing tickets.
Once the account is created, the service provider needs to notify SAFE via a web form.
–Manually via browser or automatically via script.
–Service provider can reject tickets.
–Initial (one-shot?) password returned to SAFE for retrieval by user.
–Similar mechanism possible for password resets.
We can gather more information if needed.
–IP address ranges have been requested.
We can encode local policies on usernames and UID/GID ranges into SAFE.
Or we can let the site choose UID/GID/username and return the values to SAFE when completing the ticket.
–UID/GID only need to be managed centrally if supporting file-system cross mounts.
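Completing a ticket from a script amounts to submitting the same form a browser would. The sketch below only builds such a request and does not send it. The URL and all form field names are hypothetical placeholders, since the slides do not specify the real SAFE form; the ticket ID is likewise a made-up value.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# Hypothetical endpoint; the real SAFE form URL is not given in the slides.
SAFE_FORM_URL = "https://safe.example.org/complete_ticket"


def build_completion_request(ticket_id, username, uid, gid,
                             initial_password, url=SAFE_FORM_URL):
    """Build (but do not send) a POST reporting a created account to SAFE."""
    fields = {
        "ticket_id": ticket_id,        # hypothetical field names throughout
        "status": "completed",
        "username": username,          # values chosen by the site
        "uid": uid,
        "gid": gid,
        "initial_password": initial_password,
    }
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    req = build_completion_request("T1", "demo", 5001, 1001, "change-me")
    # Show what would be submitted.
    print(req.get_method(), req.full_url)
    print(parse_qs(req.data.decode("ascii")))
```

Passing the returned object to `urllib.request.urlopen` would submit it; a real script would also need whatever authentication the service requires.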
Accounting/Reports
SAFE contains an extensive accounting sub-system.
Accounting data is parsed into DB tables.
–Do NOT mandate a fixed format; instead keep data close to raw format and define mappings to standard properties.
–Easier to change system/policy without re-importing old data.
–Easier to handle different service provider policies.
–Single reports may combine data from multiple tables in different formats, provided reports are based on common properties.
Service providers only need to provide DIRAC usage data in some convenient format.
–Normally upload data daily.
–Can also support storage accounting, though this currently uses a fixed format.
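The idea of keeping records close to raw format and mapping them to standard properties can be sketched as follows. This is not SAFE's implementation: the two record layouts, the site names and the property names (username, cpu_hours) are invented for illustration.

```python
# Raw records stay in whatever shape each provider supplies (invented layouts).
SITE_A_RECORDS = [{"user": "demo", "ncpus": 32, "walltime_s": 7200}]
SITE_B_RECORDS = [{"login": "demo", "cores": 16, "elapsed_hours": 1.5}]

# Per-format mappings from raw fields to the standard properties reports use.
MAPPINGS = {
    "site_a": {
        "username": lambda r: r["user"],
        "cpu_hours": lambda r: r["ncpus"] * r["walltime_s"] / 3600,
    },
    "site_b": {
        "username": lambda r: r["login"],
        "cpu_hours": lambda r: r["cores"] * r["elapsed_hours"],
    },
}


def standard_view(source: str, record: dict) -> dict:
    """Evaluate the standard properties for one raw record."""
    return {prop: fn(record) for prop, fn in MAPPINGS[source].items()}


def cpu_hours_by_user(tables: dict) -> dict:
    """One report over several tables, written only against standard properties."""
    totals = {}
    for source, records in tables.items():
        for record in records:
            view = standard_view(source, record)
            user = view["username"]
            totals[user] = totals.get(user, 0.0) + view["cpu_hours"]
    return totals


if __name__ == "__main__":
    print(cpu_hours_by_user({"site_a": SITE_A_RECORDS, "site_b": SITE_B_RECORDS}))
```

Because the raw records are never rewritten, a policy change only means editing a mapping; the old data does not have to be re-imported.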
Resource Management
SAFE can provide more detailed resource management.
Uses a 3-level model.
1. Project – Top level, corresponds to a grant of resources from the allocation panel; mostly internal to SAFE.
2. ProjectGroup – Internal project management grouping controlled by the project PI or designated managers through the web interface. These can be just compute budgets but may also correspond to unix groups if used to manage disk resources.
3. User – Individual user.
Though this gives a lot of fine control to PI/PM, it requires more integration with the service provider.
–Sites can choose to use local resource management procedures instead.
–Accounting does NOT depend on SAFE managing the resources.
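One way to picture the three levels is as nested records, sketched below. The class and field names are illustrative, and the two rules enforced here (group budgets may not exceed the project allocation; only group members may be charged) are assumptions made for the example, not documented SAFE behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    username: str


@dataclass
class ProjectGroup:
    name: str
    budget: float                      # compute budget, arbitrary units
    members: list = field(default_factory=list)
    used: float = 0.0

    def charge(self, user: User, amount: float) -> None:
        """Record usage against this group's budget (assumed policy)."""
        if user not in self.members:
            raise PermissionError(f"{user.username} is not in {self.name}")
        if self.used + amount > self.budget:
            raise ValueError(f"budget exceeded for {self.name}")
        self.used += amount


@dataclass
class Project:
    code: str
    allocation: float                  # grant from the allocation panel
    groups: list = field(default_factory=list)

    def add_group(self, group: ProjectGroup) -> None:
        """Assumed rule: group budgets together cannot exceed the allocation."""
        committed = sum(g.budget for g in self.groups)
        if committed + group.budget > self.allocation:
            raise ValueError("group budgets exceed project allocation")
        self.groups.append(group)


if __name__ == "__main__":
    demo = User("demo")
    project = Project("z01", allocation=1000.0)
    group = ProjectGroup("z01", budget=600.0, members=[demo])
    project.add_group(group)
    group.charge(demo, 100.0)
    print(group.used, "of", group.budget, "used")
```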