Progress on TeraGrid Stability for the LEAD project
History
Reliability problems for LEAD
  – 2006 Unidata workshop
  – Spring 2007 Weather Challenge
  – Continued “heroics” needed by staff every time more than a handful of users used the gateway
10/25/08 ARCH call topic raised by Dane
  “I think it's time to raise this discussion in the broader venue of the ARCH meeting. We have been raising the profile of this investigation and trying to come to a persistent resolution of the problem. We're putting attention on this among the management and it should be reflected to the working teams. There also continue to be misunderstanding and different expectations and it would be good to set those clearly.”
Gateway-debug calls initiated 10/25/07
Goal
  – Stable systems for LEAD to conduct student Weather Challenge with 67 universities
      Runs start 1/28/08
  – Improve stability of grid services for all users at all TeraGrid sites
      Eliminate need for staff heroics
Get the right staff on the gateway-debug calls
Original request for
  – knowledgeable LEAD rep
  – knowledgeable Globus rep
  – knowledgeable NCSA RP rep
  – knowledgeable IU RP rep
  – knowledgeable GRAM rep
  – knowledgeable gridftp rep
  – knowledgeable Inca rep
  – knowledgeable TG operations rep
Gateway-debug activities
Understand the problems
  – Suresh creates
With some humor
Overloaded GridFTP servers
m/v/4wp3m1vg06Q&hl=en
Create testbed where we can implement solutions rapidly
  – Only at sites LEAD was trying to use
      ANL, NCSA, IU
Software and hardware configuration changes on the testbed
  – Non-striped GridFTP servers
  – Globus which includes GRAM scalability improvements
  – RFT improvements
Develop tests that simulate what LEAD does
  – GRAM, GridFTP, javaCOG
Inca
Use Inca to run LEAD tests
  – Inca run once per day on production sites
      Version tests, limited functionality tests
  – Frequency greatly increased for testbed
      Every 5 min. “are you alive” tests
      Once an hour “can I get a job into a queue” test
  – These can be tuned, back off when a service proves it is stable
  – Automatic admin notification
  – These last two were the key!!
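The tuning idea on this slide (test often, back off once a service proves stable, notify admins on failure) can be sketched in a few lines of Python. This is only an illustration, not how Inca is implemented or configured: the class name `ProbeSchedule`, the doubling rule, and the `stable_after` threshold are all assumptions made for the example. Only the 5-minute and once-per-day intervals come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeSchedule:
    """Hypothetical per-service test schedule (illustration only, not an Inca API)."""
    base_interval_s: int = 300      # every 5 min, as on the testbed
    max_interval_s: int = 86400     # once per day, as on production sites
    stable_after: int = 12          # assumed: consecutive passes needed before backing off
    interval_s: int = field(init=False, default=0)
    consecutive_passes: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Start at the most frequent setting.
        self.interval_s = self.base_interval_s

    def record(self, passed: bool) -> bool:
        """Record one test result; return True if admins should be notified."""
        if passed:
            self.consecutive_passes += 1
            if self.consecutive_passes >= self.stable_after:
                # Service has proven stable: test half as often, up to the cap.
                self.interval_s = min(self.interval_s * 2, self.max_interval_s)
                self.consecutive_passes = 0
            return False
        # Any failure: return to frequent testing and notify.
        self.consecutive_passes = 0
        self.interval_s = self.base_interval_s
        return True
```

A caller would run the test, pass the result to `record()`, send mail to the site admins when it returns True, and wait `interval_s` seconds before the next run.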
Inca results reviewed at each call
http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
  – Still lots of errors this past week
Summary sent before gateway-debug
  – Issues addressed on the call
  – Follow-up on actions from previous week
Gateway-debug work moving to ops-wg
Maintain testbed
  – For now, maintain as stable infrastructure for LEAD
      Having trouble today with testbed stability
  – In the future
      Use testbed and Inca structure to verify reliability of new versions of CTSS before it goes into production
      Improve simulated scalability tests and produce benchmark (before asking Users/Gateways to participate)
Turn focus on production systems
  – Increase testing frequency enough to be able to determine stability
      Once per day is not enough
  – Automatic notification of sys admins
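A rough way to see why once per day is not enough: if an outage starts at a random time relative to a periodic test, the chance that at least one test falls inside it is the outage length divided by the test interval, capped at 1. The sketch below encodes that. The function name and the 30-minute outage in the usage note are assumptions chosen for illustration, not measurements from the testbed.

```python
def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance that a periodic test lands inside an outage of length outage_s.

    Assumes the outage start is uniformly random relative to the test
    schedule and that a test run during the outage always fails.
    """
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    return min(1.0, outage_s / interval_s)
```

Under these assumptions, a hypothetical 30-minute outage is caught by a once-per-day test about 2% of the time (1800 / 86400), while a test every 5 minutes always catches it.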
Let’s learn from this experience
Increased testing
Automatic sys admin notification
Having the right staff on the calls as needed
Weekly reviews of test results
The above items are what moved us along
We need to continue paying attention if we expect to have a stable environment for Gateways and users of grid services
Stay tuned for progress in ops-wg
Thank You
To lots of folks, but especially
Suresh Marru
Doru Marcusiu
Kate Ericson, Shava Smallen
Derek Simmel, Robert Budden
Mike Lowe, Jenette Tillotson
Stu Martin, Dan Fraser
Raj Kettimuthu, John Bresnahan
Ravi Madduri