Clusters, Fault Tolerance, and Other Thoughts Daniel S. Katz JPL/Caltech SOS7 Meeting 4 March 2003.


Cluster 2002
- IEEE International Conference on Cluster Computing, Chicago, September 2002
- Next 2 meetings: December 2003 in Hong Kong; September 2004 in San Diego
- Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
- Motivation: the conference series and its sponsor, the Task Force on Cluster Computing (TFCC), were created to:
  - Bring together the cluster community
  - Establish best practices
  - Provide educational material
  - Cross-fertilize ideas between industry and academia

Cluster 2002 Topics
- Running a cluster and making it usable
  - Software for management, including configuration
  - Middleware software
- Building a cluster
  - Software and hardware for networking
  - Choosing node hardware
  - Packaging hardware
- Making use of a cluster
  - New and innovative applications

Cluster 2002 Results and Conclusions
- Positives:
  - Software tools are getting better: management, configuration, and administration
  - Interesting and promising work ongoing in: self-tuning software, component redundancy, applications
  - Clusters are enabling platforms due to low entry cost
- Negatives:
  - Large (possibly heterogeneous) systems are not easy to build or maintain
  - Systems administration is normally underestimated and un(der)funded
  - Component failure in large systems can be a problem
- Other:
  - Clusters are good for work for which we know they are good
  - Minimum-cost clusters can handle some jobs well
  - Should design and build a cluster to suit application needs
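To see why component failure becomes a problem specifically in large systems, a back-of-the-envelope calculation helps (an illustrative sketch, not from the slides; the MTBF numbers are hypothetical and failures are assumed independent and exponentially distributed):

```python
# Illustrative: system-level failure rate for a cluster of independent nodes.
# A node MTBF that sounds comfortable becomes a near-daily event at scale.
HOURS_PER_YEAR = 8760

def cluster_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Mean time between failures for the whole cluster, counting any
    single node failure as a system event (rates add for independent
    exponential failures)."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

# One node with a 3-year MTBF: a failure every ~26,000 hours.
# A 1000-node cluster of the same hardware: a failure roughly every 26 hours.
print(round(cluster_mtbf_hours(3.0, 1), 1))     # -> 26280.0
print(round(cluster_mtbf_hours(3.0, 1000), 1))  # -> 26.3
```

This is the arithmetic behind the "negatives" bullet: the per-component numbers do not change, but the aggregate failure rate scales linearly with node count.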

FALSE 2002
- Workshop on Fault-Adaptive Large-Scale Real-Time Systems, held at Vanderbilt, Nov 2002
- Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
- Of the 42 attendees at FALSE 2002 and 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
- Motivation:
  - The High Energy Physics community wants to build systems to monitor experiments
  - Others (DARPA, NASA) have an interest in similar systems
  - An occasion to share knowledge and plan future research
- Topics:
  - Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
  - Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
  - How to make fault responses domain-specific (tools that enable the user to specify the response to different failures, and to implement these responses throughout the system)
- Results/Consensus:
  - No results from this initial meeting; just information sharing (with complete consensus)
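One way to read "tools that enable the user to specify the response to different failures" is a registry that maps fault types to application-supplied handlers, with a generic fallback for everything else. The sketch below is a hypothetical illustration of that idea only; the class and fault-type names are invented, and this is not the BTeV or FALSE participants' actual design:

```python
# Hypothetical sketch of domain-specific fault responses: the application
# registers its own handler per fault type; faults with no registered
# handler fall back to a generic default (here, checkpoint-and-restart).
from typing import Callable, Dict

class FaultManager:
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], str]] = {}

    def on(self, fault_type: str, handler: Callable[[dict], str]) -> None:
        """Let the application choose the response for this fault type."""
        self._handlers[fault_type] = handler

    def handle(self, fault_type: str, context: dict) -> str:
        """Dispatch to the domain-specific handler, or the default."""
        handler = self._handlers.get(
            fault_type, lambda ctx: "checkpoint-and-restart")
        return handler(context)

mgr = FaultManager()
# A physics-monitoring application might prefer to shed work rather
# than stall the whole experiment when a node dies:
mgr.on("node-loss", lambda ctx: f"reroute work from node {ctx['node']}")

print(mgr.handle("node-loss", {"node": 17}))  # -> reroute work from node 17
print(mgr.handle("disk-error", {}))           # -> checkpoint-and-restart
```

The point of the pattern is the separation of concerns the workshop topic names: the infrastructure detects and dispatches faults, while the domain expert decides what each response should be.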

General Thoughts
- Fault tolerance is becoming important to large-scale systems
  - Embedded and non-embedded systems
  - Real-time and non-real-time systems
- Is there a common solution (or partial solution) to this issue?
- "There is no software problem an additional layer of abstraction won't solve"

Thanks Questions?