1
Clusters, Fault Tolerance, and Other Thoughts
Daniel S. Katz, JPL/Caltech
SOS7 Meeting, 4 March 2003
2
Cluster 2002
http://www.mcs.anl.gov/cluster2002/
2002 IEEE International Conference on Cluster Computing, Chicago, 23-26 Sep. 2002
Next 2 meetings: December 2003 in Hong Kong; September 2004 in San Diego
Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
Motivation: the conference series and its sponsor, the Task Force on Cluster Computing (TFCC), were created to:
  Bring together the cluster community
  Establish best practices
  Provide educational material
  Cross-fertilize ideas between industry and academia
3
Cluster 2002 Topics
Running a cluster and making it usable
  Software for management, including configuration
  Middleware software
Building a cluster
  Software and hardware for networking
  Choosing node hardware
  Packaging hardware
Making use of a cluster
  New and innovative applications
4
Cluster 2002 Results and Conclusions
Positives:
  Software tools are getting better - management, configuration, and administration
  Interesting and promising work ongoing in: self-tuning software, component redundancy, applications
  Clusters are enabling platforms due to low entry cost
Negatives:
  Large (possibly heterogeneous) systems are not easy to build or maintain
  Systems administration is normally underestimated and un(der)funded
  Component failure in large systems can be a problem
Other:
  Clusters are good for work for which we know they are good
  Minimum-cost clusters can handle some jobs well
  A cluster should be designed and built to suit application needs
5
FALSE 2002
http://false2002.vanderbilt.edu/
Workshop on Fault-Adaptive Large-Scale Real-Time Systems
Held at Vanderbilt, 14-15 Nov. 2002
Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
Of the 42 attendees at FALSE 2002 and the 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
Motivation:
  The High Energy Physics community wants to build systems to monitor experiments
  Others (DARPA, NASA) have an interest in similar systems
  An occasion to share knowledge and plan future research
Topics:
  Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
  Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
  How to make fault responses domain-specific (tools that let the user specify the response to different failures and implement those responses throughout the system; see the sketch after this slide)
Results/Consensus:
  No results from this initial meeting; just information sharing (with complete consensus)
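The domain-specific fault-response idea above can be illustrated with a small sketch. This is my own illustration, not code from the BTeV project or the workshop: the FaultManager class, the failure-type strings, and the handlers are hypothetical names, showing only the pattern of letting the user register a response for each kind of failure.

```python
# Hypothetical sketch of domain-specific fault responses.
# Names (FaultManager, failure types, handlers) are illustrative only.
from typing import Callable, Dict


class FaultManager:
    """Maps failure categories to user-supplied responses."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def register(self, failure_type: str, handler: Callable[[dict], None]) -> None:
        # The application (domain expert) decides what each failure means.
        self._handlers[failure_type] = handler

    def report(self, failure_type: str, context: dict) -> None:
        # Dispatch to the domain-specific response; fall back to a default.
        handler = self._handlers.get(failure_type, self._default)
        handler(context)

    @staticmethod
    def _default(context: dict) -> None:
        print(f"Unhandled fault: {context}")


# Usage: a physics run might discard a corrupted event but restart on node loss.
manager = FaultManager()
manager.register("corrupt_event", lambda ctx: print(f"Dropping event {ctx['event_id']}"))
manager.register("node_failure", lambda ctx: print(f"Restarting work lost on node {ctx['node']}"))
manager.report("corrupt_event", {"event_id": 42})
```

In a real system of this kind the registered responses would have to be propagated to every node so the same policy is applied throughout the system.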
6
General Thoughts
Fault tolerance is becoming important to large-scale systems
  Embedded and non-embedded systems
  Real-time and non-real-time systems
Is there a common solution (or partial solution) to this issue?
“There is no software problem an additional layer of abstraction won’t solve”
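To make the quip concrete, here is a minimal sketch of one such “additional layer of abstraction”: a wrapper that retries a computation when it fails. It is purely illustrative; the fault_tolerant decorator, the retry policy, and the run_task example are assumptions of mine, not anything proposed in this talk.

```python
# Illustrative only: a tiny abstraction layer that adds fault tolerance by
# retrying a function. Names and retry policy are assumed, not from the talk.
import time
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")


def fault_tolerant(retries: int = 3, delay_s: float = 0.5) -> Callable:
    """Wrap a computation so transient failures are retried transparently."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            last_error = None
            for _ in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:  # in practice, catch specific fault classes
                    last_error = exc
                    time.sleep(delay_s)
            raise last_error  # all retries exhausted; surface the fault
        return wrapper
    return decorator


@fault_tolerant(retries=3)
def run_task(data):
    return sum(data) / len(data)


print(run_task([1.0, 2.0, 3.0]))
```

A real large-scale system would of course need to distinguish fault classes and coordinate recovery across nodes, which is exactly where a common (or partial) solution remains an open question.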
7
Thanks
Questions?