1
TeraGrid-Wide Operations DRAFT #2 Mar 31 Von Welch
2
Highlights
TeraGrid surpassed 1 petaflops of aggregate computing power.
–Aggregate compute power available increased 3.5x from 2007 to 2008.
–Primarily a result of the Track 2 systems at TACC and NICS coming online.
–NUs used and allocated increased ~4x from 2007 to 2008.
Significant improvements in instrumentation, including tracking of grid usage and data transfers.
Inca now provides historical tracking of software and service reliability, along with a new interface for both users and administrators.
An international security incident touched TeraGrid, resulting in a very strong incident response as well as improved procedures for a new attack vector.
Improvements in authentication procedures and cross-resource single sign-on.
3
Big Picture Resource Changes
Sun Constellation Cluster (Ranger) at TACC, Feb ’08
–Initially 504 Tflops; upgraded in July 2008 to ~63,000 compute cores and 580 Tflops
Cray XT4 (Kraken) at NICS, Aug ’08
–166 Tflops and 18,000 compute cores
Additional resources that entered production in 2008:
–Two Dell PowerEdge 1950 clusters: the 668-node system at LONI (QueenBee) and the 893-node system at Purdue (Steele)
–PSC’s SGI Altix 4700 shared-memory NUMA system (Pople)
–FPGA-based resource at Purdue (Brutus)
–Remote visualization system at TACC (Spur)
Other improvements:
–The Condor pool at Purdue grew from 7,700 to more than 22,800 processor cores.
–Indiana integrated its Condor resources with the Purdue flock, simplifying use.
Decommissioned systems:
–NCSA’s Tungsten, PSC’s Rachel, Purdue’s Lear, SDSC’s DataStar and Blue Gene, and TACC’s Maverick.
4
TeraGrid HPC Usage, 2008
[Usage chart: 3.9B NUs delivered in 2007; Ranger entered service Feb. 2008; Kraken entered service Aug. 2008; 3.8B NUs delivered in Q4 2008 alone.]
In 2008:
–Aggregate HPC power increased by 3.5x
–NUs requested and awarded quadrupled
–NUs delivered increased by 2.5x
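A rough cross-check inferred from the figures on this slide (not a stated total): 3.9B NUs delivered in 2007 × 2.5 ≈ 9.8B NUs delivered in 2008, and the 3.8B NUs delivered in Q4 2008 alone nearly matches the entire 2007 total.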
5
TeraGrid Operations Center
Created 7,762 tickets; resolved 2,652 tickets (34%)
Took 675 phone calls; resolved 454 phone calls (67%)
Manage the TG ticket system and 24x7 toll-free call center
Respond to all users and provide front-line resolution if possible (34% resolution rate)
Route remaining tickets to RP sites and other second-tier resolution centers
Maintain situational awareness across the TG project (upgrades, maintenance, etc.)
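The resolution percentages above follow directly from the counts; a tiny illustrative check (the counts are from this slide, the helper name is ours):

```python
def resolution_rate(resolved, total):
    """Front-line resolution rate as a percentage."""
    return 100.0 * resolved / total

print(f"tickets: {resolution_rate(2652, 7762):.0f}%")  # ~34%
print(f"calls:   {resolution_rate(454, 675):.0f}%")    # ~67%
```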
6
Instrumentation and Monitoring
Monitoring and statistics gathering for TG services
–E.g., backbone, grid services (GRAM, GridFTP)
Used for measuring adoption, detecting problems, and resource provisioning.
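For illustration, a minimal sketch of the kind of lightweight availability probe such instrumentation can build on: plain TCP reachability checks against the standard GRAM (2119) and GridFTP (2811) ports. The hostnames are placeholders, and the production instrumentation collected much richer data than this.

```python
import socket

# Standard default ports: pre-WS GRAM gatekeeper (2119) and GridFTP (2811).
SERVICES = {"gram": 2119, "gridftp": 2811}

def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical resource hostnames, for illustration only.
for host in ["login.example-rp1.teragrid.org", "login.example-rp2.teragrid.org"]:
    for name, port in SERVICES.items():
        status = "up" if probe(host, port) else "DOWN"
        print(f"{host:40s} {name:8s} {status}")
```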
7
Inca Grid Monitoring System
Automated, user-level testing improves reliability by detecting Grid infrastructure problems.
–Provides detailed information about tests and their execution to aid in debugging problems.
Originally designed for TeraGrid; also used in other large-scale projects including ARCS, DEISA, and NGS.
Improvements in 2008 include a new version of the Inca Web server, which provides custom views of the latest results.
–The TeraGrid User Portal uses a custom view of SSH and batch job tests in its resources viewer.
Added email notification upon test failures.
New historical views were created to summarize overall data trends.
Developed a plug-in that allows Inca to recognize scheduled downtimes.
20 new tests were written and 77 TeraGrid tests were modified; 2,538 pieces of test data are being collected.
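For intuition, a minimal sketch of the shape of a user-level test in the spirit of Inca's SSH login checks; this is not the Inca reporter API, just an illustrative stand-in with a hypothetical hostname.

```python
import subprocess, time

def ssh_login_test(host, timeout=30):
    """User-level check: can we open a non-interactive SSH session and run a command?"""
    start = time.time()
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            capture_output=True, timeout=timeout,
        )
        ok = proc.returncode == 0
        detail = proc.stderr.decode(errors="replace").strip()
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout}s"
    # Report a timestamped pass/fail record, as a monitoring harness would archive.
    return {"test": "ssh_login", "host": host, "pass": ok,
            "elapsed_sec": round(time.time() - start, 2), "detail": detail}

# Hypothetical resource login node, for illustration only.
print(ssh_login_test("login.example-rp.teragrid.org"))
```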
8
TeraGrid Backbone Network
Provides a dedicated high-speed interconnect between TG high-end resources.
The TeraGrid 10 Gb/s backbone runs from Chicago to Denver to Los Angeles; contracted from NLR.
Dedicated 10 Gb/s link(s) from each RP to one of the three core routers.
[Map image from Indiana University]
9
Security
Gateway Summit to develop an understanding of the security needs of RPs and Gateways.
–Co-organized with the Science Gateways team
–30 attendees from RP sites and Gateways
User Portal password reset procedure
Risk assessments for Science Gateways and the User Portal
TAGPMA participation and leadership
Uncovered a large-scale attack in collaboration with EU Grid partners.
–Established secure communications: secure Wiki, SELS
10
Single Sign-On
Java-based GSI-SSHTERM application added to the User Portal
–Consistently in the top 5 apps.
–Augments command-line functionality already in place.
Replicating the MyProxy CA at PSC to provide catastrophic failover for the server at NCSA.
–Implemented client changes on RPs and the User Portal for failover.
Developed a set of guidelines for management of grid identities (X.509 distinguished names) in the TeraGrid Central Database (TGCDB) and at RP sites.
–Tests written for TGCDB; Inca tests for RPs will follow.
Started technical implementation of Shibboleth support for the User Portal
–TeraGrid is now a member of InCommon (as a service provider)
–Will transfer to the new Internet Framework.
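A minimal sketch of the client-side failover behavior described above, using the standard myproxy-logon command line; the hostnames and wrapper are illustrative assumptions, not the actual RP or portal implementation.

```python
import subprocess

# Hypothetical hostnames: the slide describes the primary MyProxy CA at NCSA
# with a replica at PSC for catastrophic failover.
MYPROXY_SERVERS = ["myproxy.ncsa.example.org", "myproxy.psc.example.org"]

def get_proxy(username, lifetime_hours=12):
    """Try each MyProxy server in order; return the first one that issues a credential."""
    for server in MYPROXY_SERVERS:
        # myproxy-logon prompts for the passphrase and, on success, writes the
        # proxy credential to its default location.
        result = subprocess.run(
            ["myproxy-logon", "-s", server, "-l", username, "-t", str(lifetime_hours)]
        )
        if result.returncode == 0:
            return server
        print(f"warning: could not obtain a credential from {server}, trying next server")
    raise RuntimeError("all MyProxy servers unreachable")

# Example: get_proxy("tg_username")
```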
11
END OF PRESENTATION
Reference material and future-plans slides for Towns follow.
12
Allocation Statistics
14
New Resources for 2009
The NICS Kraken system was upgraded in February 2009 to a 66,048-core, 600-Tflops Cray XT5 system.
NCSA placed the 192-node GPU-accelerated Dell PowerEdge 1950 cluster, Lincoln, into production.
Further planned additions for 2009 include NCAR’s Sun Ultra 40 system dedicated to data analysis and visualization.
15
Inca Plans for 2009
Integration of Inca into the Internet Framework.
Create an interface for RP administrators to execute tests on demand.
Integrate with ticket systems to connect tickets to tests.
Start work on a knowledge base for errors, causes, and solutions.
Develop and maintain views based on the needs and output of the QA and CUE groups.
16
SSO Plans for 2009
Complete PSC deployment of the backup MyProxy service.
Complete integration of Shibboleth support into the Internet Framework
–Develop a full trust model for TeraGrid/campuses
–Start recruiting campuses and growing usage
Work on bridging authorization with OSG and EGEE to support other activities.
17
Other Continuing Tasks
TOC
–24x7x365 point of contact
–Trouble ticket creation and management
Helpdesk
–First-tier support at RP sites integrated with the TOC
Instrumentation services
Backbone network and network coordination
Security coordination, TAGPMA, etc.