SUMA: A Scientific Metacomputer
Yudith Cardinale, Carlos Figueira, Emilio Hernández, Eduardo Baquero, Luis Berbín, Roberto Bouza, Eric Gamess, Pedro García
Universidad Simón Bolívar

MOTIVATION
• In 1999, two projects from USB, one in Chemistry and one in Geophysics, received funding from a strategic alliance between the government and the oil industry (Agenda Petróleo).
• We wanted to build a system that:
  • Provides uniform access, from researchers' desktop computers, to distributed, heterogeneous campus resources
  • Efficiently supports high-level scientific programming
  • Offers advanced services (performance, fault tolerance, specialized clients)

BASIC FEATURES
• Execution architecture composed of loosely interconnected (heterogeneous) clusters, workstations, and specialized hardware
• Executes Java bytecode, both sequential and parallel
• Supports fault tolerance and recovery
• Provides efficient execution and performance modeling
• Built on standard, flexible, and portable platforms: Java, CORBA, an object-oriented approach

CONTENTS
• Execution basics
• System architecture
• SUMA Components
• Performance
• Fault Tolerance
• Parallel execution
• Current prototype
• Conclusions and future work

Execution basics
• Two execution modes:
  • On-line: a program is supplied to SUMA (e.g., SUMAjava main.class); input and output are redirected between the remote node and the client machine
  • Off-line: all classes and input files needed by the application are packed and delivered for execution; results can be retrieved later
• A number of execution attributes can be provided along with the program, for instance scheduling constraints or classes and data files to be preloaded (see the sketch below)
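To make this concrete, here is a rough sketch of the kind of information that could travel with a job in either mode. The class ExecutionUnitSketch and its fields are illustrative assumptions; the slides only say that an Execution Unit object exists, not what it contains.

    import java.io.Serializable;
    import java.util.Map;

    // Illustrative container for a submitted job (fields are assumptions,
    // not the actual SUMA Execution Unit layout).
    public class ExecutionUnitSketch implements Serializable {
        public enum Mode { ONLINE, OFFLINE }

        public final Mode mode;                       // on-line or off-line execution
        public final String mainClass;                // e.g. "main.class"
        public final Map<String, byte[]> packedFiles; // classes/inputs shipped up front (off-line)
        public final Map<String, String> attributes;  // scheduling constraints, preload hints, ...

        public ExecutionUnitSketch(Mode mode, String mainClass,
                                   Map<String, byte[]> packedFiles,
                                   Map<String, String> attributes) {
            this.mode = mode;
            this.mainClass = mainClass;
            this.packedFiles = packedFiles;
            this.attributes = attributes;
        }
    }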

Execution basics (inside SUMA)
• Once a client machine hands SUMA the main class of the program to be executed:
  • SUMA transparently finds a server (i.e., a cluster or machine) for execution and sends a request message to that server.
  • An execution agent at the designated server starts executing the program, dynamically loading the required classes and input data from the client and sending the output back.
  • For off-line jobs, the output is kept in SUMA until the user requests it.
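The dynamic loading of classes from the client maps naturally onto a custom Java ClassLoader. The sketch below is a minimal illustration assuming a hypothetical ClientCallback that asks the client stub for bytecode; it is not SUMA's actual loader.

    // Minimal sketch of a class loader that pulls bytecode from the
    // submitting client on demand (ClientCallback is a placeholder,
    // not part of SUMA).
    public class ClientClassLoader extends ClassLoader {

        public interface ClientCallback {
            byte[] fetchClassBytes(String className); // ask the client stub for a class file
        }

        private final ClientCallback client;

        public ClientClassLoader(ClientCallback client) {
            this.client = client;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            byte[] bytes = client.fetchClassBytes(name);
            if (bytes == null) {
                throw new ClassNotFoundException(name);
            }
            return defineClass(name, bytes, 0, bytes.length);
        }
    }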

System architecture

Core

SUMA Components: Core Engine
• Coordinates execution:
  • Receives the Execution Unit object from the client stub
  • Checks for permission
  • Asks the Scheduler for a suitable server
  • Delivers the Execution Unit to the designated server
  • Interacts with the Application Monitor
  • Handles results in the case of off-line jobs
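A simplified, hypothetical rendering of that request path follows; the interfaces (UserControl, Scheduler, Server, AppMonitor) merely stand in for the real components and do not reflect SUMA's actual CORBA interfaces.

    // Hypothetical Engine request path: authenticate, schedule, dispatch, monitor.
    public class EngineSketch {

        interface UserControl { boolean hasPermission(String user); }
        interface Scheduler   { Server pickServer(Object unit); }
        interface Server      { void deliver(Object unit); }
        interface AppMonitor  { void track(Object unit, Server server); }

        private final UserControl users;
        private final Scheduler scheduler;
        private final AppMonitor monitor;

        EngineSketch(UserControl users, Scheduler scheduler, AppMonitor monitor) {
            this.users = users;
            this.scheduler = scheduler;
            this.monitor = monitor;
        }

        // Receives an Execution Unit from the client stub and forwards it.
        void execute(String user, Object unit) {
            if (!users.hasPermission(user)) {
                throw new SecurityException("user not authorized: " + user);
            }
            Server server = scheduler.pickServer(unit);  // ask the Scheduler for a suitable server
            server.deliver(unit);                        // hand the unit to the Execution Agent
            monitor.track(unit, server);                 // let the Application Monitor follow it
        }
    }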

SUMA Components: Core Scheduler
• Obtains status and load information from the servers.
• Responds to Engine requests based on the applications' requirements.
• Maintains load balance among the servers (one possible policy is sketched below).
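The slide does not specify the scheduling policy, so the following is only one plausible interpretation, stated as an assumption: pick the least-loaded registered server that satisfies a minimum-memory requirement taken from the job's attributes.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // One possible (assumed) policy: least-loaded server that satisfies
    // a minimum-memory requirement.
    public class SchedulerSketch {

        public static class ServerInfo {
            final String name;
            final double load;     // dynamic information reported by the server
            final int memoryMB;    // static information kept by the Resource Control

            public ServerInfo(String name, double load, int memoryMB) {
                this.name = name;
                this.load = load;
                this.memoryMB = memoryMB;
            }
        }

        public Optional<ServerInfo> pickServer(List<ServerInfo> servers, int minMemoryMB) {
            return servers.stream()
                    .filter(s -> s.memoryMB >= minMemoryMB)        // honour the application's requirement
                    .min(Comparator.comparingDouble(s -> s.load)); // balance load across servers
        }
    }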

SUMA Components: Core Application Monitor
• Consists of a Coordinator and several Application Monitor Slaves
• Receives status information from Execution Agents (crash, exit, ...)
• Provides information for implementing:
  • Fault tolerance (based on checkpointing and recovery)
  • Performance modeling and profiling

SUMA Components: Execution Agent
• One per server, concurrent.
• Registers itself with the Resource Control.
• Executes programs:
  • Receives the Execution Unit from the Engine.
  • Starts execution, possibly loading classes and files dynamically from the client.
  • Sends the result to the client.
• For a parallel platform, the Execution Agent plays the role of the front end.
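A hypothetical sketch of that behaviour: register once with the Resource Control, then execute each incoming unit concurrently and call back with the result. All interface names below are placeholders, not SUMA's API.

    // Hypothetical Execution Agent: register once, then run each unit
    // in its own thread and return the output through a callback.
    public class ExecutionAgentSketch {

        interface ResourceControl { void register(String serverName); }
        interface ClientCallback  { void deliverResult(byte[] output); }
        interface Unit            { byte[] run(); /* may load classes/files from the client */ }

        ExecutionAgentSketch(ResourceControl resources, String serverName) {
            resources.register(serverName);     // announce this server to SUMA
        }

        void execute(Unit unit, ClientCallback client) {
            new Thread(() -> {
                byte[] output = unit.run();     // may trigger dynamic class/data loading
                client.deliverResult(output);   // send the result back to the client stub
            }).start();
        }
    }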

SUMA Components: Administration
• Resource Control:
  • Used for registering SUMA resources, i.e., servers.
  • Keeps static and dynamic information about the servers, such as memory size, available libraries, and load.
• User Control:
  • Used for user registration.
  • Provides user authentication.

SUMA Components: Client stub
• The client stub is a library for implementing SUMA clients.
• Provides services for on-line and off-line execution, and for retrieving results and performance profiles.
• Creates and delivers the Execution Unit and Information Unit.
• Serves callbacks from Execution Agents.
• Two types of clients: User and Administrator.
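The interface below summarizes those services in code form; every method name and signature is an assumption made for illustration, not the actual client-stub API.

    import java.util.Map;

    // Illustrative summary of the client-stub services.
    public interface ClientStubSketch {

        /** On-line execution: I/O is redirected to this client while the job runs. */
        int executeOnline(String mainClass, Map<String, String> attributes) throws Exception;

        /** Off-line execution: returns a job id; results stay in SUMA until fetched. */
        String submitOffline(byte[] packedApplication, Map<String, String> attributes) throws Exception;

        /** Retrieve the results and performance profile of a finished off-line job. */
        void retrieveResults(String jobId, String localDirectory) throws Exception;

        /** Callback used by Execution Agents to request classes or data files on demand. */
        byte[] fetchFile(String name);
    }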

Performance: SUMA optimizations
• Keep a pool of processes at the servers, with pre-loaded virtual machines (see the sketch below)
• Remote class loading and pre-loading
• Compiling to native code at the servers
• Others (see Parallel execution)
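The first optimization can be pictured as a warm pool of JVM processes started ahead of time, so that a job never pays virtual-machine start-up cost. The sketch below assumes a hypothetical WorkerMain class that waits for work on its standard input; the real pooling mechanism is not described on the slide.

    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Toy sketch of a warm-JVM pool: processes are launched up front and
    // handed out to jobs as needed.
    public class VmPoolSketch {

        private final BlockingQueue<Process> idle;

        public VmPoolSketch(int size) throws IOException {
            idle = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                // Each entry is a JVM already running, waiting for work on stdin.
                idle.add(new ProcessBuilder("java", "WorkerMain").start());
            }
        }

        /** Borrow a warm JVM; the caller ships the job over the process's stdin. */
        public Process acquire() throws InterruptedException {
            return idle.take();
        }

        /** Return the JVM to the pool once the job has finished. */
        public void release(Process vm) {
            idle.offer(vm);
        }
    }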

Performance: Application performance feedback
• Provides the user with relevant information about the performance of an application execution (e.g., the architecture it ran on)
• Allows for performance tuning, architecture selection, etc.

Fault Tolerance
• Provided at two levels:
  • At the SUMA level, by replicating SUMA components.
  • At the execution-server level, by providing checkpointing and recovery, both sequential and parallel.
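At the execution-server level, the simplest way to picture checkpointing is periodic serialization of application state, as in the conceptual sketch below. How SUMA actually captures and restores state is not described on this slide, so this is only an illustration of the idea.

    import java.io.*;

    // Conceptual sketch only: application state is periodically saved so a
    // failed run can restart from the last checkpoint.
    public class CheckpointSketch {

        public static void checkpoint(Serializable state, File file) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(state);   // persist the current application state
            }
        }

        public static Object recover(File file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return in.readObject();   // resume from the last saved state
            }
        }
    }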

Parallel execution
• Parallel platforms in SUMA are predefined clusters.
• A parallel platform must provide:
  • MPI
  • Numerical libraries
  • Support for executing parallel Java applications with calls to mpiJava.

Parallel execution: services
• mpiJava is a set of Java classes that allows a native MPI (1.1) implementation to be called from Java.
• plapackJava is a set of Java classes that allows users to call PLAPACK functions from Java.
• plapackSUMA and mpiSUMA are implementations of the above libraries built with the Cygnus Java compiler.
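For reference, a tiny program in the mpiJava style: rank 0 sends an integer to rank 1 through the native MPI library underneath. The capitalized method names follow the mpiJava 1.x bindings, but exact signatures vary between versions, so treat this as an approximation rather than a definitive example.

    import mpi.MPI;

    // Approximate mpiJava usage: point-to-point send/receive of one int.
    public class MpiJavaHello {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);

            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            if (rank == 0) {
                int[] token = { 42 };
                MPI.COMM_WORLD.Send(token, 0, 1, MPI.INT, 1, 0);   // send to rank 1, tag 0
            } else if (rank == 1) {
                int[] token = new int[1];
                MPI.COMM_WORLD.Recv(token, 0, 1, MPI.INT, 0, 0);   // receive from rank 0, tag 0
                System.out.println("rank 1 of " + size + " received " + token[0]);
            }

            MPI.Finalize();
        }
    }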

Parallel execution: experiment
• Compares the execution of the PLAPACK interfaces (Java and C implementations).
• The experiment solves a linear algebra problem (LU factorization) on a cluster of 8 Pentium II (400 MHz) nodes with 512 MB of RAM, connected by 100 Mbps Ethernet.

Current prototype
• Centralized Core; public-domain CORBA implementation (JacORB 1.14), JDK 1.2, Cygnus compiler.
• mpiJava implemented on LAM, for Linux.
• Straightforward scheduling and fault tolerance.
• Runs on Solaris and Linux.

Experiments and results

Conclusions and future work
• A basic, expandable, flexible platform for executing Java bytecode, with support for efficient parallel execution
• A long list of future developments; we will focus on fault tolerance, and on performance tuning and modeling