TeraGrid’s Common User Environment: Status, Challenges, Future
Annual Project Review, April 2008

The Pros and Cons of Uniformity
We want uniformity for…
  – Things that are visible to users and that users expect to be uniform (e.g., allocations process, remote compute interfaces, user portal)
  – Things that make cross-site operations more efficient (e.g., accounting, usage monitoring)
We do not want uniformity for…
  – Things that inhibit specialization of service offerings (e.g., system architectures, operational models)
  – Things where uniformity doesn’t benefit users
These categories are not mutually exclusive.
Our solution is by necessity a balance between uniformity and diversity.

A Taxonomy of Commonality
Adjectives
  – Common - we make it the same everywhere (aka uniform)
  – Coordinated - we define a common method, establish goals, and communicate where the common method is available
  – Uncoordinated - we make little or no attempt to define, enforce, or communicate commonalities and differences
Nouns
  – User Experience - allocations process, single sign-on, accounting, ticket system, CTSS, user portal, knowledge base, user documentation
  – Execution/Runtime Environment - compilers, processors, libraries, command-line tools, shells, file systems
  – Environment Discovery/Manipulation Mechanism - capability registry (and other TG-wide catalogs), softenv/modules
  – Operational Practices - behind-the-scenes operation of TeraGrid as a distributed system

Community Capabilities
Common definitions
  – Allows the TeraGrid community to define standard capabilities
      For example: compilers, remote login mechanism, remote compute mechanism, data movement mechanism, shared filesystems, data collections, metascheduling mechanism
  – Provides a terminology for use in directories (what’s available where) and agreements (what ought to be available where)
  – Also useful in discussions with other service providers (e.g., OSG, EGEE), verification/validation, documentation, allocation proposals
Common registration
  – A uniform mechanism for TeraGrid service providers to advertise availability and configuration (what, where, how) of capabilities
      Supports provider autonomy and automation
      Mirrors the highly successful WWW/Google publish/index model
  – Openly accessible to users, providers, third parties
      High availability service (99.5% uptime) including commercial hosting
      Supports many access modes (HTML, Web 2.0, Web services)
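
The publish/index model described on this slide can be illustrated with a minimal sketch. It is not the actual TeraGrid capability registry: the `Registration` and `CentralIndex` classes, the vocabulary terms, the site names, and the `example.org` hosts are all hypothetical, chosen only to show how a shared vocabulary (common definitions) and provider-published records (common registration) combine to answer "what is available where, and how".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Registration:
    """One provider's advertisement of a capability (hypothetical schema)."""
    capability: str  # what: a term from the shared vocabulary
    resource: str    # where: an illustrative resource name
    access: str      # how: the local configuration detail


@dataclass
class CentralIndex:
    """Aggregates registrations published by autonomous providers."""
    vocabulary: frozenset
    _by_capability: dict = field(default_factory=dict)

    def publish(self, reg: Registration) -> None:
        # A controlled vocabulary keeps directory entries comparable.
        if reg.capability not in self.vocabulary:
            raise ValueError(f"unknown capability: {reg.capability}")
        self._by_capability.setdefault(reg.capability, {})[reg.resource] = reg

    def where(self, capability: str) -> list:
        """Return the resources advertising a capability, sorted by name."""
        return sorted(self._by_capability.get(capability, {}))

    def how(self, capability: str, resource: str) -> str:
        """Return the access detail for a capability on one resource."""
        return self._by_capability[capability][resource].access


# Illustrative vocabulary and registrations (assumptions, not real data).
VOCABULARY = frozenset({"remote-login", "data-movement", "remote-compute"})

index = CentralIndex(VOCABULARY)
index.publish(Registration("remote-login", "site-b", "ssh login.site-b.example.org"))
index.publish(Registration("remote-login", "site-a", "ssh login.site-a.example.org"))
index.publish(Registration("data-movement", "site-a", "transfer endpoint on dm.site-a.example.org"))

print(index.where("remote-login"))
print(index.how("data-movement", "site-a"))
```

Each provider decides on its own what to publish, which is the autonomy the slide refers to; the index only rejects terms outside the agreed vocabulary.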

User Experience
What are the administrative elements of the user’s experience as a TeraGrid user?
  – Obtaining and managing an allocation
  – Tracking usage, accounting system
  – User portal features
  – TeraGrid documentation
  – TeraGrid knowledge base
  – TeraGrid-wide credentials and single sign-on
  – Help desk (800 number, ticket system)
  – User support (consulting and advanced support)
  – Education, outreach, training
Tremendous consolidation toward common mechanisms in all areas above via the GIG, Core, and HPCOPS programs

Execution/Runtime Environment
NSF solicitation process encourages diversity
  – Competitive environment puts a premium on specialization and innovation
  – No standard architectural requirements
  – TeraGrid includes N processor types, M vendors, L unique platforms
  – We believe that this is the right approach for TeraGrid, given the high rate of innovation in HPC and the diversity of user requirements
TeraGrid does not define or enforce a common (uniform) runtime environment
  – Shell environment, login ID/password, filesystems, compilers, debuggers/profilers, libraries
TeraGrid does provide a coordination mechanism
  – CTSS application development & runtime support kit focuses on registration (publishing what, where, and how) as opposed to commonality
  – This information is important for users before, during, and after their allocations
  – Automation and autonomy are key to making this work, plus a dash of standardization (schema, controlled vocabulary)

CTSS: Coordinated TeraGrid Software and Services
In DTF, CTSS stood for “Common TeraGrid Software Stack”
  – ETF’s transition to diverse resources required a change of model from software stack to coordinated capabilities
One mandatory capability
  – TeraGrid Core Integration covers the pieces necessary to integrate a resource into the operational environment
Many optional capabilities
  – E.g., data movement, remote login, workflow support
  – In practice, most capabilities are ubiquitous and heavily used (e.g., remote login)
  – A small number of capabilities are not widely available or not heavily used (e.g., data management)
Open definition process, merit-based deployment process
  – Anyone can propose a capability definition
  – Resource providers only deploy what their users need/want
  – Result is a merit-based selection/evolution process

Manipulating the Runtime Environment
Common mechanism
  – Since the DTF, TeraGrid has offered softenv to users for manipulating their runtime environment
  – Collaboration is currently exploring an alternate mechanism: “modules”
Provides vital flexibility
  – Resource providers can offer alternate versions of tools and libraries, and users can select the ones they need
  – This is a long-standing best practice among Unix hosts
Integration
  – TeraGrid’s capability registry details defaults and how to access alternate capabilities on each system
  – Inca system also uses softenv/modules
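
The idea behind this kind of mechanism can be sketched in a few lines. This is not how softenv or modules are implemented; it is an illustration, under assumed names, of a provider-maintained catalog of alternate versions with a default, and a user-side selection that rewrites `PATH`. The `CATALOG` contents, the `/opt/compiler/...` paths, and the `load` function are hypothetical.

```python
from typing import Optional

# Hypothetical provider catalog: one tool, two installed versions, one default.
CATALOG = {
    "compiler": {
        "default": "1.0",
        "versions": {
            "1.0": "/opt/compiler/1.0/bin",
            "2.0": "/opt/compiler/2.0/bin",
        },
    },
}


def load(env: dict, catalog: dict, name: str, version: Optional[str] = None) -> dict:
    """Return a copy of env with the selected version of a tool first on PATH."""
    entry = catalog[name]
    chosen = version or entry["default"]
    bindir = entry["versions"][chosen]

    parts = [p for p in env.get("PATH", "").split(":") if p]
    # Drop every known version of this tool so that only one is active.
    known = set(entry["versions"].values())
    parts = [p for p in parts if p not in known]

    new_env = dict(env)  # leave the caller's environment untouched
    new_env["PATH"] = ":".join([bindir] + parts)
    return new_env


base = {"PATH": "/usr/bin"}
with_default = load(base, CATALOG, "compiler")            # provider default
with_newer = load(with_default, CATALOG, "compiler", "2.0")  # user's choice
print(with_default["PATH"])
print(with_newer["PATH"])
```

The sketch returns a new environment instead of mutating the old one, which keeps the provider default and the user's selection easy to compare.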

TeraGrid “Coordination”
Definition
  – What are we making common, and for what purpose?
      Definitions must focus on users and user capabilities
  – What are the key technical requirements?
      The essential elements needed to support the users
Registered Participation
  – Which parts of the system have this commonality?
      The places designed to support the target users and use patterns
  – What are the local configuration details?
      The details needed to use the capability on a specific system

Key Moments - Runtime Environment
First DTF software stack (CTSS 1) defined
  – Common runtime: OS, compilers, libraries, software stack
  – Common mechanism for runtime environment customization
ETF stresses common software stack
  – Diverse runtime, still attempting common software stack (painfully)
Operational TG
  – Common software stack is characterized by many exceptions
2006/ Transition to coordination model
  – Software stack model becomes obsolete
  – First formal capability definitions drafted
  – Capability registry deployed and populated by RPs

Key Moments - Environment Discovery/Manipulation
DTF chooses softenv
  – Common mechanism for describing options and selecting from alternatives within each system
2006/ Capability registry added
  – Data about capability availability now accessible system-wide (local registries, central index)
2008? - Is softenv a requirement?
  – Re-examining user requirements

Key Moments - User Experience
DTF
  – Participation in existing allocation process
  – Common accounting (with technical issues)
  – Mostly separate user support, EOT activities
ETF
  – Integrated allocation process
  – Better central accounting (still issues)
Operational TG
  – Integrated user support
  – Coordinated user documentation
  – Coordinated EOT
Addition of User Portal, Knowledge Base

Coordinated Operational Practices
Usage tracking
V&V
Issue resolution
Incident response
Network
Education/Outreach/Training
Vulnerability analysis