Clusters, Fault Tolerance, and Other Thoughts Daniel S. Katz JPL/Caltech SOS7 Meeting 4 March 2003.


1 Clusters, Fault Tolerance, and Other Thoughts Daniel S. Katz JPL/Caltech SOS7 Meeting 4 March 2003

2 Cluster 2002
  http://www.mcs.anl.gov/cluster2002/
  2002 IEEE International Conference on Cluster Computing, Chicago, 23-26 Sep. 2002
  Next 2 meetings are:
    December 2003 in Hong Kong
    September 2004 in San Diego
  Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
  Motivation: The series of conferences and their sponsor, the Task Force on Cluster Computing (TFCC), were created to:
    Bring together the cluster community
    Establish best practices
    Provide educational material
    Cross-fertilize ideas between industry and academia

3 Cluster 2002 Topics
  Running a cluster and making it usable
    Software for management, including configuration
    Middleware software
  Building a cluster
    Software and hardware for networking
    Choosing node hardware
    Packaging hardware
  Making use of a cluster
    New and innovative applications

4 Cluster 2002 Results and Conclusions
  Positives:
    Software tools are getting better: management, configuration, and administration
    Interesting and promising work ongoing in:
      Self-tuning software
      Component redundancy
      Applications
    Clusters are enabling platforms due to low entry cost
  Negatives:
    Large (possibly heterogeneous) systems are not easy to build or maintain
    Systems administration is normally underestimated and un(der)funded
    Component failure in large systems can be a problem
  Other:
    Clusters are good for work for which we know they are good
    Minimum cost clusters can handle some jobs well
    Should design and build cluster to suit application needs

5 FALSE 2002
  http://false2002.vanderbilt.edu/
  Workshop on Fault-Adaptive Large-Scale Real-Time Systems
  Held at Vanderbilt, 14-15 Nov. 2002
  Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
  Of the 42 attendees at FALSE 2002 and 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
  Motivation:
    High Energy Physics community wants to build systems to monitor experiments
    Others (DARPA, NASA) have an interest in similar systems
    An occasion to share knowledge and plan future research
  Topics:
    Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
    Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
    How to make fault responses domain-specific (tools that enable the user to specify the response to different failures, and to implement these responses throughout the system)
  Results/Consensus:
    No results from this initial meeting; just information sharing (w/ complete consensus)

6 General Thoughts
  Fault tolerance is becoming important to large-scale systems
    Embedded and non-embedded systems
    Real-time and non-real-time systems
  Is there a common solution (or partial solution) to this issue?
  “There is no software problem an additional layer of abstraction won’t solve”

7 Thanks Questions?

