Report on 2002 Fault Tolerance Workshop
  Patricia D. Hough
  Computational Sciences and Mathematics Research Department
  Sandia National Laboratories
Motivation
  Large COTS systems are prone to failures
    »Lots of parts; complex configurations
    »Applications stress the systems
    »Few options for application survival
  University resources are untapped
    »DOE researchers unfamiliar with fault tolerance experts
    »University researchers unfamiliar with DOE problem domain
  Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.
Basic Info
  June 10-11, 2002 in Albuquerque, NM
  ~40 attendees
    »Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
  Interest exceeded capacity
  Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
  Sponsored by the CSRI
Agenda
  11 invited talks + 2 hours focused discussion on:
    »Application descriptions and needs
    »System monitoring
    »MPI fault tolerance
    »Traditional approaches with a twist
  Topics not covered
    »Checkpoint-free algorithms
    »Preventative measures
    »System services
    »Migration
    »Redistribution
    »Validation
    »Run-time environments
Conclusions
  MPI support is needed
    »Programming model needs to be considered
    »Balance research with timely delivery of capabilities
  New ideas are needed
    »Leverage hardware
    »More systematic, integrated approach
  There are still outstanding issues
    »Transparency vs. intrusiveness
    »Can traditional approaches be made scalable?
  Workshop was a great success!
For more information…