European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies Experiences.

Presentation transcript:

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies Experiences Deploying Parallel Applications on a Large-scale Grid Rob van Nieuwpoort, Jason Maassen, Andrei Agapi, Ana-Maria Oprescu, Thilo Kielmann Vrije Universiteit, Amsterdam

H.M. Beatrix, Koningin der Nederlanden (Queen of the Netherlands)

The N-Queens Contest
Challenge: find the most board solutions within 1 hour
Testbed:
  – Grid5000, DAS-2, some smaller clusters
  – Globus, NorduGrid, LCG, ???
  – In fact, little precise information was available in advance...
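
For readers unfamiliar with the problem, the following is a minimal sequential sketch of counting N-Queens solutions with bitmasks. It is not the contest entry; the class name `NQueensCounter` and its methods are made up for illustration. Each choice of column in the first row yields an independent subproblem, which is why the problem suits divide-and-conquer.

```java
/** Hypothetical helper for illustration; not taken from the contest entry. */
final class NQueensCounter {

    /** Counts all placements of n non-attacking queens on an n x n board. */
    static long countSolutions(int n) {
        if (n < 1 || n > 30) {
            throw new IllegalArgumentException("n must be in 1..30");
        }
        int all = (1 << n) - 1; // one bit per column, all columns set
        return place(all, 0, 0, 0);
    }

    /**
     * Places one queen per row. The masks hold the columns attacked in the
     * current row: directly (cols) or along the two diagonal directions.
     */
    private static long place(int all, int cols, int diagLeft, int diagRight) {
        if (cols == all) {
            return 1; // every column is occupied, so n queens are placed
        }
        long count = 0;
        int free = all & ~(cols | diagLeft | diagRight);
        while (free != 0) {
            int bit = free & -free; // lowest free column
            free -= bit;
            // Each recursive call is an independent subproblem.
            count += place(all, cols | bit, (diagLeft | bit) << 1, (diagRight | bit) >>> 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(NQueensCounter.countSolutions(8)); // prints 92
    }
}
```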

Computing in an Unknown Grid?
Heterogeneous machines (architectures, compilers, etc.)
  – Use Java: write once, run anywhere. Use Ibis!
Heterogeneous machines (fast / slow, small / big clusters)
  – Use automatic load balancing (divide-and-conquer). Use Satin!
Heterogeneous middleware (job submission interfaces, etc.)
  – Use the Grid Application Toolkit (GAT)!

The Ibis Grid Programming System

The Ibis System
Java-centric: write once, run anywhere
Efficient communication (pure Java or native)
Parallel programming models:
  – RMI (remote method invocation)
  – GMI (group method invocation): collective communication (MPI-like) and more
  – RepMI (replicated method invocation): strong consistency
  – Satin (divide and conquer)
  – MPJ (Java binding for MPI)

Satin: Divide-and-conquer
Effective paradigm for Grid applications (hierarchical)
Satin: Grid-friendly load balancing (aware of cluster hierarchy)
Also support for:
  – Fault tolerance
  – Malleability
  – Migration

Satin Example: Fibonacci
Single-threaded Java:
    class Fib {
        int fib (int n) {
            if (n < 2) return n;
            int x = fib(n-1);
            int y = fib(n-2);
            return x + y;
        }
    }
[Figure: call tree of fib(5), with nodes fib(4), fib(3), fib(2), fib(1) and fib(0)]

Satin Example: Fibonacci
    public interface FibInter extends ibis.satin.Spawnable {
        public int fib (int n);
    }

    class Fib extends ibis.satin.SatinObject implements FibInter {
        public int fib (int n) {
            if (n < 2) return n;
            int x = fib(n-1); /* spawned */
            int y = fib(n-2); /* spawned */
            sync();
            return x + y;
        }
    }
(uses byte code rewriting to generate parallel code)
[Figure: clusters in Leiden, Delft, Rennes and Sophia connected via the Internet]
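
Satin itself is not shown here. As a rough analogue that runs on a stock JVM, the same spawn/sync structure can be sketched with the standard `java.util.concurrent` fork/join framework. This is a swapped-in technique: it steals work between threads inside a single JVM, not between Grid nodes, and the class name `ForkJoinFib` is made up for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Fork/join analogue of the spawn/sync pattern; not Satin code. */
final class ForkJoinFib extends RecursiveTask<Integer> {
    private final int n;

    ForkJoinFib(int n) {
        this.n = n;
    }

    @Override
    protected Integer compute() {
        if (n < 2) {
            return n;
        }
        ForkJoinFib x = new ForkJoinFib(n - 1);
        ForkJoinFib y = new ForkJoinFib(n - 2);
        x.fork();                  // roughly "spawn": another worker may steal this task
        int yValue = y.compute();  // compute the second branch in the current thread
        return x.join() + yValue;  // roughly "sync": wait for the forked branch
    }

    /** Convenience entry point using the JVM-wide common pool. */
    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinFib(n));
    }

    public static void main(String[] args) {
        System.out.println(ForkJoinFib.fib(10)); // prints 55
    }
}
```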

Satin: Grid-friendly load balancing (aware of cluster hierarchy)
Random Stealing (RS)
  – Provably optimal on a single cluster (Cilk)
  – Problems on multiple clusters:
      a fraction (C-1)/C of steal attempts goes over the WAN (C = number of clusters)
      synchronous protocol
Satin: Cluster-aware Random Stealing (CRS)
  – When idle:
      send an asynchronous steal request to a random node in a different cluster
      in the meantime, steal locally (synchronously)
      only one wide-area steal request at a time
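
The victim-selection rule of CRS can be sketched as follows. This is a hypothetical illustration under simple assumptions (nodes are numbered 0..N-1 and a static array maps each node to its cluster); the class and method names are invented and do not come from Satin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of CRS victim selection; not Satin's actual code. */
final class CrsVictimPicker {
    private final int[] clusterOf; // clusterOf[node] = cluster id of that node
    private final int self;        // index of the idle node doing the stealing
    private final Random random;
    private boolean wideAreaStealPending = false;

    CrsVictimPicker(int[] clusterOf, int self, Random random) {
        this.clusterOf = clusterOf.clone();
        this.self = self;
        this.random = random;
    }

    /** Random node in our own cluster (never ourselves), or -1 if there is none. */
    int pickLocalVictim() {
        return pick(true);
    }

    /**
     * Random node in a different cluster, or -1 if a wide-area steal is
     * already outstanding (only one at a time) or no remote node exists.
     */
    int pickWideAreaVictim() {
        if (wideAreaStealPending) {
            return -1;
        }
        int victim = pick(false);
        if (victim >= 0) {
            wideAreaStealPending = true;
        }
        return victim;
    }

    /** To be called when the asynchronous wide-area reply arrives. */
    void wideAreaReplyReceived() {
        wideAreaStealPending = false;
    }

    private int pick(boolean sameCluster) {
        List<Integer> candidates = new ArrayList<>();
        for (int node = 0; node < clusterOf.length; node++) {
            if (node != self && (clusterOf[node] == clusterOf[self]) == sameCluster) {
                candidates.add(node);
            }
        }
        return candidates.isEmpty() ? -1 : candidates.get(random.nextInt(candidates.size()));
    }
}
```

An idle node would call `pickWideAreaVictim()` once, send that request asynchronously, and keep calling `pickLocalVictim()` for synchronous local steals until work arrives.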

Summary: Ibis
Java: write once, run anywhere
Ibis provides efficient communication among parallel processes
Satin provides highly efficient load balancing among nodes of multiple clusters
But how do we deploy our Ibis / Satin application?

The Grid Application Toolkit (GAT)

The Grid Application Toolkit (GAT)
Simple and uniform API to various Grid middleware:
  – Globus 2, 3, 4, ssh, Unicore, ...
Job submission, remote file access, job monitoring and steering
Implementations:
  – C, with wrappers for C++ and Python
  – Java

The Beatrix Architecture (or: what we thought we could get away with)

The Beatrix Architecture (what we finally ended up with)

Why all these changes?
Scalability
  – DAS-2 has O(100) nodes
  – Grid5000 has O(1000) nodes
Network connectivity
  – Theory meets practice
The devil is in the details...

Changing Satin: Scalability
Satin uses random work stealing (CRS), which requires connections between all Ibis nodes
Old solution: all nodes connect to each other at startup time
  – firewalls have previously flagged this as a denial-of-service attack
Problem: connections fail due to the TCP server socket backlog
  – typically, 50 pending connection requests are queued
  – the rest fail
New solution: connect on demand, together with steal requests
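
The connect-on-demand idea can be sketched as a lazily filled connection table. This is a hypothetical illustration, not Ibis or Satin code: the class `OnDemandConnections` and the pluggable `connector` function are assumptions standing in for whatever actually opens a connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical lazy connection table. C stands for whatever connection
 * type the communication layer uses.
 */
final class OnDemandConnections<C> {
    private final Map<String, C> open = new ConcurrentHashMap<>();
    private final Function<String, C> connector;

    OnDemandConnections(Function<String, C> connector) {
        this.connector = connector;
    }

    /** Returns the connection to a peer, opening it on first use only. */
    C connectionTo(String peer) {
        return open.computeIfAbsent(peer, connector);
    }

    /** Number of connections opened so far. */
    int openCount() {
        return open.size();
    }
}
```

With this shape, a node that never steals from a given peer never connects to it, so startup no longer produces a burst of simultaneous connection requests. On the accepting side, `java.net.ServerSocket` also has a constructor that takes a backlog argument, although the operating system may cap the value.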

Changing the Ibis Name Server: Scalability
Ibis name server provides totally ordered joins
  – Broadcast join messages to all nodes
  – Old: 1-by-1, sequentially to all nodes
  – Problem: one hanging connection slows down all nodes
  – New: message combining and multiple sending threads
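
One way to picture message combining with multiple sending threads is the sketch below. It is a hypothetical illustration, not the Ibis name server: the class `JoinBroadcaster`, the `sender` callback and the one-minute wait are all assumptions. Joins that arrive between two flushes are combined into one ordered batch, and a small thread pool delivers that batch so that one slow receiver does not hold up the others.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Hypothetical sketch; not the Ibis name server implementation. */
final class JoinBroadcaster {
    private final List<String> pendingJoins = new ArrayList<>();
    private final BiConsumer<String, List<String>> sender; // (node, combined joins) -> send
    private final int senderThreads;

    JoinBroadcaster(int senderThreads, BiConsumer<String, List<String>> sender) {
        this.senderThreads = senderThreads;
        this.sender = sender;
    }

    /** Records a join; the order of calls defines the total order. */
    synchronized void nodeJoined(String node) {
        pendingJoins.add(node);
    }

    /** Takes all pending joins as one ordered, read-only batch. */
    private synchronized List<String> drain() {
        List<String> batch = Collections.unmodifiableList(new ArrayList<>(pendingJoins));
        pendingJoins.clear();
        return batch;
    }

    /**
     * Sends the combined batch to every node using several threads.
     * Call from a single thread so that successive batches do not overlap.
     */
    void flush(List<String> nodes) {
        List<String> batch = drain();
        if (batch.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(senderThreads);
        for (String node : nodes) {
            pool.execute(() -> sender.accept(node, batch));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```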

Changing the TCP Driver: Connectivity
Nodes advertise their IP address (to connect to) to the name server
Grid5000 nodes have up to 5 IP addresses: which one?
  – Global and local (RFC1918), like 10.x.y.z
  – Address registered in DNS
  – Some are routed, some are not
Problem: incomplete and inconsistent information
Solution: manual configuration, TCP-Ibis tries out multiple addresses
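
Trying out several advertised addresses can be sketched with plain `java.net` sockets. This is a minimal illustration, not the TCP-Ibis driver: the class `MultiAddressConnector`, the sequential order of attempts and the timeout handling are assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

/** Hypothetical sketch; not the TCP-Ibis implementation. */
final class MultiAddressConnector {

    /**
     * Tries each candidate address in order and returns the first socket
     * that connects within the timeout, or null if none is reachable.
     */
    static Socket connectToAny(List<InetSocketAddress> candidates, int timeoutMillis) {
        for (InetSocketAddress candidate : candidates) {
            Socket socket = new Socket();
            try {
                socket.connect(candidate, timeoutMillis);
                return socket;
            } catch (IOException unreachable) {
                // Refused, timed out or unroutable: close and try the next address.
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing a failed socket fails.
                }
            }
        }
        return null;
    }
}
```

A short timeout matters here: an unrouted private address typically does not refuse the connection but simply never answers.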

Results achieved on the Grid5000 Testbed
Solved n=22 in 25 minutes
4.7 million jobs, 800,000 load balancing messages

More Results...

Conclusions
We used a mix of software:
  – Ibis with Satin
  – Java GAT with ProActive and ssh adaptors
Scaling from 100 to 1000 processors required some redesign of Ibis and Satin
Connectivity in Grid5000 is non-trivial (we still cannot connect Grid5000 with DAS-2)
Ibis / Satin / GAT / ProActive has proven to be a viable and efficient Grid computing platform