NetSolve
Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

Objectives
  Harnessing vast computational resources on the network
    Hardware
    Software
  Convenient for the scientific computing community
  Reducing installation and programming overhead
  Masking complexity related to distributed computing

Computation-Sharing Models: Proxy Computing
  [Diagram: Client, Server; labels Data, Code, Data, Code]
  Computation on the server

Computation-Sharing Models: Code Shipping
  [Diagram: Client, Server; labels Code, Data, Code]
  Computation on the client

Computation-Sharing Models: Remote Computation
  [Diagram: Client, Server; labels Data, Code]
  Computation on the server

Design Issues
  Platform independence to accommodate heterogeneity
  User friendliness
  Extensibility
  Load balancing
  Fault tolerance

NetSolve Architecture
  [Diagram; surviving labels: “OS”, Resources]

NetSolve Organization and Operation

NetSolve Client Interface
  C, Fortran, Java, Matlab, and Mathematica
  Blocking call from Matlab:
    >> a = rand(100); b = rand(100,1);
    >> x = netsolve('ax = b', a, b);
  Non-blocking call from Matlab:
    >> a = rand(100); b = rand(100,1);
    >> request = netsolve_nb('send', 'ax = b', a, b);
    >> x = netsolve_nb('probe', request);
    Not ready
    >> x = netsolve_nb('wait', request);

NetSolve Wrappers
  Problem description file for Parallel Sub-Surface Flow
    [Excerpt fragment: STRING CHAR FILE CHAR infile]
  Compiled into wrappers around scientific libraries
  XDR for platform-independent data transfer
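The last point on this slide, XDR for platform-independent data transfer, can be illustrated with a small sketch. It is not NetSolve's wrapper code. It only assumes the standard XDR layout for a variable-length array of doubles (a 4-byte big-endian count followed by 8-byte big-endian IEEE 754 values), and the function names are hypothetical.

```python
import struct

def xdr_encode_double_array(values):
    """Encode a list of floats as an XDR-style variable-length array:
    a 4-byte big-endian unsigned count, then 8-byte big-endian doubles."""
    count = len(values)
    return struct.pack(">I", count) + struct.pack(">%dd" % count, *values)

def xdr_decode_double_array(payload):
    """Inverse of xdr_encode_double_array; returns a list of floats."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(">%dd" % count, payload, 4))

if __name__ == "__main__":
    wire = xdr_encode_double_array([1.0, -2.5, 3.25])
    print(len(wire), xdr_decode_double_array(wire))
```

Because the byte order is fixed on the wire, a little-endian client and a big-endian server decode the same values without knowing each other's native format.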

NetSolve Load Balancing
  Assigning a task to the “best” machine
    Establishing a performance model
      Network delay, server properties, task properties
    Measuring and monitoring dynamic system states
  Load balancing at a finer granularity
    Parallelism through non-blocking interface
    Task migration
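A minimal sketch of how such a performance model might rank servers is given here. The formula is an assumption made for illustration, not the model NetSolve uses: estimated time is taken as network delay plus transfer time plus compute time, with the server's peak rate divided by (1 + load). All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float      # network delay between client and server
    bandwidth_bps: float  # bytes per second on the client-server link
    peak_flops: float     # server property: peak floating-point rate
    load: float           # dynamic state: jobs already running on the server

def estimated_time(server, data_bytes, task_flops):
    """Assumed model: network delay + data transfer + computation."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bps
    effective_flops = server.peak_flops / (1.0 + server.load)
    return transfer + task_flops / effective_flops

def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))

if __name__ == "__main__":
    far_fast = Server("far_fast", 0.05, 1e6, 1e9, 3.0)
    near_slow = Server("near_slow", 0.001, 1e7, 2e8, 0.0)
    print(pick_best([far_fast, near_slow], 8e5, 1e9).name)
```

Under this model a compute-heavy task favors the faster machine even when it is loaded and far away, while a small task favors the nearby one, which is why both network and server state have to be monitored.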

NetSolve Fault Tolerance
  Inter-server fault tolerance
    Fault tolerance among NetSolve servers
  Intra-server fault tolerance
    Fault tolerance within a NetSolve server

NetSolve Fault Tolerance
  Inter-server Fault Tolerance
    Performed by NetSolve agents
    Basic approach
      Failure detection + task reallocation
      Overload detection + task migration
    Introducing NetSolve storage servers
      Store checkpoints or any information related to fault tolerance (must be platform-independent)
      No reliance on failed or overloaded server for task migration
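The "failure detection + task reallocation" step can be sketched as follows. This is an illustration under stated assumptions rather than the NetSolve agent's logic: failure is assumed to be detected by a heartbeat timeout, and orphaned tasks are assumed to go to the live server with the fewest tasks. Restarting a task from a checkpoint held on a storage server is not modeled. All names are hypothetical.

```python
def detect_failed(last_heartbeat, now, timeout):
    """Return the set of servers whose last heartbeat is older than timeout."""
    return {s for s, t in last_heartbeat.items() if now - t > timeout}

def reallocate(assignments, servers, failed):
    """Move tasks off failed servers onto the least-loaded live server.

    assignments maps task -> server; a new mapping is returned.
    """
    live = [s for s in servers if s not in failed]
    if not live:
        raise RuntimeError("no live servers left")
    counts = {s: 0 for s in live}
    result = {}
    # Keep tasks that are already on live servers where they are.
    for task, server in assignments.items():
        if server not in failed:
            result[task] = server
            counts[server] += 1
    # Reassign orphaned tasks one at a time to the least-loaded live server.
    for task, server in assignments.items():
        if server in failed:
            target = min(live, key=lambda s: counts[s])
            result[task] = target
            counts[target] += 1
    return result

if __name__ == "__main__":
    beats = {"s1": 100.0, "s2": 91.0, "s3": 99.0}
    failed = detect_failed(beats, now=101.0, timeout=5.0)
    tasks = {"t1": "s1", "t2": "s2", "t3": "s2", "t4": "s1"}
    print(reallocate(tasks, ["s1", "s2", "s3"], failed))
```

The reallocation never consults the failed server, which matches the slide's requirement that migration must not rely on a failed or overloaded machine.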

NetSolve Fault Tolerance
  Intra-server Fault Tolerance
    Not a new problem
    Could be invisible to NetSolve
    Can take advantage of platform-specific features for fault tolerance
    Possible integration with inter-server fault tolerance

Diskless Checkpointing: Checksums and Reverse Computation
  Diskless checkpointing eliminates the need for stable storage
  N servers + a checkpointing server
  At any point, consistent checkpoints taken at N servers (stored in memory)
  A checksum of checkpoints stored at the checkpointing server
  Rollback using reverse computation
  State recovery using the checksum
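The checksum-based recovery on this slide can be sketched in a few lines. The slide does not say which checksum is used, so the sketch assumes bytewise XOR parity over equal-length in-memory checkpoints, which can rebuild the state of one lost server. Rollback by reverse computation is not modeled, and the function names are hypothetical.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def make_checksum(checkpoints):
    """Checksum held by the checkpointing server: XOR of all N checkpoints."""
    parity = bytes(len(checkpoints[0]))  # all-zero buffer of the same length
    for cp in checkpoints:
        parity = xor_bytes(parity, cp)
    return parity

def recover(surviving_checkpoints, checksum):
    """Rebuild the one missing checkpoint from the checksum and the survivors."""
    state = checksum
    for cp in surviving_checkpoints:
        state = xor_bytes(state, cp)
    return state

if __name__ == "__main__":
    cps = [b"alpha---", b"bravo---", b"charlie-"]
    parity = make_checksum(cps)
    # Server 1 fails; its in-memory checkpoint is rebuilt from the other two.
    print(recover([cps[0], cps[2]], parity))
```

With XOR parity the checkpointing server stores one checkpoint-sized buffer regardless of N, at the cost of tolerating only a single failure at a time.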

Applications
  MCell with NetSolve
    Large code, small data
  Matlab with NetSolve
    Tradeoffs between parallelism and overhead
  IPARS with NetSolve
  ImageVision with NetSolve

Integration with ScaLAPACK

Integration with Condor

Integration with Ninf

Conclusion
  An interesting infrastructure for sharing computational resources
    Both software and hardware
    Convenience, performance, and reliability
  Playground for fault tolerance
    Both general and specific