Software Integration Highlights CY2008
Lee Liming, JP Navarro
GIG Area Directors for Software Integration
University of Chicago, Argonne National Laboratory
Expanding TeraGrid Capabilities
Moving capabilities from working groups to production
– Help working groups define new TG-wide capabilities (SGW support, Lustre & GPFS WAN, scheduling, etc.)
– Formally document new/enhanced capabilities and work out integration, testing, and support details
– Prepare software binaries and installers for TG systems
Operated central services
– Information service
– Software build & test service
– Speed page (data movement performance monitor)
– DMOVER and Lustre WAN
Initiated the Quality Assurance activity
– Predecessor to the QA/CUE working group
Capability Model
The original TeraGrid (DTF) was aimed at a narrow set of distributed HPC applications
– Single platform, narrow user base and target uses (distributed HPC)
– Heavy emphasis on an identical software environment
By its 2004 commissioning, TeraGrid had expanded in scope to cover all NSF HPC applications
– Very diverse user community and resources
– Very wide diversity of user scenarios and use patterns
In 2005, we retooled our software coordination process
– Emphasis on use cases and user scenarios enabled by software
– Bottom-up, user-driven capability model
– Open processes for community input into the system definition
CTSS 4 – A New Paradigm
Significant change in how we define CTSS
– CTSS 1 through 3: monolithic software stack
– CTSS 4: modular user capabilities
– Improved many aspects of capability delivery:
  – Better descriptions of the capabilities (esp. for users)
  – Better documentation
  – Clearer availability information
  – More focused delivery process (package, deploy, and configure)
  – Improved process for RPs to select and publish their offerings
Delivery timeline
– Designed in 2006
– Capability kits defined in Q1-Q
– Capabilities rolled out in Q2-Q
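As a rough illustration of the modular model (not the actual CTSS tooling), the sketch below shows one way a resource provider's kit selections might be represented and queried; the registry layout, resource names, and function are hypothetical, while the kit names follow the CTSS 4 kit list.

# Hypothetical sketch of the modular CTSS 4 idea: each resource provider (RP)
# publishes only the capability kits it chooses to offer, rather than
# installing one monolithic software stack. Illustrative only.

# Example registry: which kits each (hypothetical) RP resource publishes.
RP_OFFERINGS = {
    "rp-cluster-a": {"core-integration", "remote-login", "remote-compute"},
    "rp-cluster-b": {"core-integration", "data-movement", "wide-area-gpfs"},
}

def resources_with(kit: str) -> list[str]:
    """Return the resources that publish a given capability kit."""
    return [name for name, kits in RP_OFFERINGS.items() if kit in kits]

if __name__ == "__main__":
    print(resources_with("remote-compute"))  # ['rp-cluster-a']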
CTSS Capability Kits (April 2009)
Fourteen capability kits:
– TeraGrid Core Integration: Minimal components that integrate RP resources
– Remote Login: Remote login using TeraGrid credentials to a coordinated Unix environment
– Remote Compute: Remote job submission
– Application Development and Runtime: Compile and execute applications
– Data Management: Collaborative data management capabilities
– Data Movement: Data movement to/from RP resources
– Parallel Application Support: Identify and configure MPI runtime environment
– Science Gateway Support (new!): End-user counting, improved security for gateways
– On-demand Computation (new!): On-demand (little or no wait) computing
– Co-scheduling (new!): Reserving a set of resources for use at a specific time
– Science Workflow: Run an orchestrated collection of interdependent jobs
– Wide Area GPFS: Local access to TeraGrid-wide GPFS filesystems
– Wide Area Lustre (new!): Local access to TeraGrid-wide Lustre filesystems
– Visualization: Compile and execute visualization applications
2008 Availability & Usage
Key idea: Capability usage vs. component usage
Most CTSS capabilities were available on all (or nearly all) TG systems and were used heavily or frequently everywhere
– Remote compute was used heavily on some systems (like those appropriate for SGW usage) and not on others
– Visualization capability was used heavily at UC/Argonne and TACC (other TG resources offer diverse visualization capabilities)
– Science workflow capability was used less than once/day, but each use generated 100s or 1000s of jobs
Heavy use means more than 100 uses/day on a single system. Frequent use means 1–100 uses/day on a single system. Infrequent use means less than 1 use/day on a single system.
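A minimal sketch of the usage classification just defined. The thresholds (heavy above 100 uses/day, frequent 1–100 uses/day, infrequent below 1 use/day on a single system) come from the slide; the function itself is only illustrative.

def classify_usage(uses_per_day: float) -> str:
    """Map average daily uses of a capability on one system to a usage level."""
    if uses_per_day > 100:
        return "heavy"
    if uses_per_day >= 1:
        return "frequent"
    return "infrequent"

# Example: a science workflow capability used less than once per day counts as
# "infrequent" even though each use may launch hundreds or thousands of jobs.
print(classify_usage(0.5))   # infrequent
print(classify_usage(42))    # frequent
print(classify_usage(350))   # heavy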
2008 Operational Issues
In 2008, CTSS comprised 10 separate capabilities, with ~80 software components on 19 platforms
16 issues reported by RPs
– Installation docs incorrect/incomplete
– A GIG-provided installer doesn’t fit well with a system
– Issues with specific components (as provided by developers)
– Inca test not accurate in all situations
– Enhancement requests from admins
Capability Development & Expansion
VM hosting services support science teams that utilize highly tailored environments or service-oriented applications
– Provided by IU Quarry and Purdue Wispy
Science gateway support enables end-user tracking and improved security for gateways
– Defined and on track for PY4 availability
Client software distribution supports campus champions and related initiatives
– Released for evaluation
Public build/test system supports NSF SDCI/STCI and CISE program awardees
– On track for PY4 availability
Advanced Scheduling Capabilities
Documented designs and implementations for TeraGrid advanced scheduling capabilities
– On-demand computation
– Advance reservation
– Co-scheduling
Broadened availability of new capabilities
– On-demand at IU, NCAR, NCSA, SDSC, TACC, and UC/Argonne
– Advance reservation and co-scheduling at LONI, NCSA, SDSC
Automatic resource selection
– In development, still on schedule for end of PY4
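Automatic resource selection was still in development at this point; the sketch below is only a hypothetical illustration of the idea of choosing the eligible resource with the shortest estimated queue wait from published status data. The queue-status records, resource names, and selection rule are assumptions, not the TeraGrid design.

from typing import Optional

# Example queue-status data of the kind an information service might publish
# (resource name -> estimated wait in minutes and free cores). Hypothetical.
QUEUE_STATUS = {
    "resource-a": {"est_wait_min": 90, "free_cores": 256},
    "resource-b": {"est_wait_min": 15, "free_cores": 64},
    "resource-c": {"est_wait_min": 5,  "free_cores": 16},
}

def select_resource(cores_needed: int) -> Optional[str]:
    """Return the resource with enough free cores and the shortest estimated wait."""
    eligible = {
        name: info for name, info in QUEUE_STATUS.items()
        if info["free_cores"] >= cores_needed
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["est_wait_min"])

print(select_resource(32))   # resource-b (resource-c lacks enough free cores)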
Information Services Enhancements
TeraGrid’s Integrated Information Service is a vital communication channel for system-wide functions
– Used by Inca to plan verification tests
– Helps keep user documentation up-to-date
– Provides queue status data for user portal monitors
– Provides data for automatic resource selection
– Configures speed page test runs
– In general, enables automation of many routine housekeeping tasks
Expanded content
– Local HPC software registry, SGW-available science tools, resource descriptions
Expanded access methods
– REST application framework, multiple data formats
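A rough sketch of how a client might use the REST access methods to pull queue-status data as JSON. The base URL, path, and response fields below are hypothetical placeholders, not the actual Integrated Information Service API.

import json
from urllib.request import urlopen

BASE_URL = "https://info.example.org/rest"   # hypothetical endpoint

def fetch_queue_status(resource: str) -> dict:
    """Fetch queue-status data for one resource as a JSON document."""
    with urlopen(f"{BASE_URL}/queue-status/{resource}?format=json") as response:
        return json.load(response)

# A user-portal monitor might poll this periodically to display queue depth.
# status = fetch_queue_status("resource-a")
# print(status.get("jobs_queued"), status.get("jobs_running"))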
Questions?
Moving capabilities from working groups to operations
– Helping WGs move from ideas to production support
– Capability-oriented software coordination model
– Integration, testing, support planning
– Preparing software for deployment on TG resources
Specific capabilities
– Advanced scheduling capabilities
– Information services enhancements
– Enhanced science gateway security, end-user tracking
– VM hosting for highly specialized or service-oriented applications
– Software for campuses
– Helping SDCI/STCI and CISE awardees prepare software for TG