* QADPZ * An Open System for Distributed Computing Zoran Constantinescu 16-Jan-2003

2 outline
– need for more CPU
– supercomputers
– clusters of PCs
– distributed computing (grid computing)
– existing systems
– QADPZ project: description, advantages, architecture, application domains, current status, future work

3 problem(s)
– larger and larger amounts of data are generated every day (simulations, measurements, etc.)
– the software applications used for handling this data require more and more CPU power: simulation, visualization, data processing
– complex algorithms, e.g. evolutionary algorithms: large populations, very time-consuming evaluations
→ need for parallel processing

4 solution 1
use parallel supercomputers
– access to tens/hundreds of CPUs, e.g. NOTUR/NTNU embla (512) + gridur (384)
– high-speed interconnect, shared memory
– usually for batch-mode processing
– very expensive (price, maintenance, upgrades)
– these CPUs are not so powerful anymore, e.g. 500 MHz RISC vs. 2.4 GHz Pentium 4

5 solution 2
use clusters of PCs (Beowulf)
– networks of personal computers
– usually running the Linux operating system
– powerful CPUs (Pentium 3/4, Athlon)
– high-speed networking (100 Mbps, 1 Gbps, Myrinet)
– much cheaper than supercomputers
– still quite expensive (upgrades, maintenance)
– trade higher availability and/or greater performance for lower cost

6 solution 3
use distributed computing
– uses existing networks of workstations (PCs connected by LAN in labs, offices, etc.)
– usually running the Windows or Linux operating system (also MacOS, Solaris, IRIX, etc.)
– powerful CPUs (Pentium 3/4, Athlon)
– high-speed networking (100 Mbps)
– already-installed computers, so very cheap
– easy to have a network of tens/hundreds of computers

7 distributed computing
– specialized client applications run on each individual computer
– they talk to one or more central servers
– they download a task, solve it, and send back the results
– more suited (easier) for task-parallel applications, where the application can be decomposed into independent tasks
– can also be used for data-parallel applications
– the number of available CPUs is more dynamic
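A minimal sketch of this download/solve/return cycle in C++; the Task and Result types and the fetch_task, execute, and send_result helpers are made up for illustration and are not the QADPZ code:

// Conceptual work cycle of a distributed-computing worker (illustrative only).
#include <iostream>
#include <optional>
#include <string>

struct Task   { int id; std::string input; };
struct Result { int id; std::string output; };

std::optional<Task> fetch_task() {            // ask the central server for work
    static int next = 0;
    if (next >= 3) return std::nullopt;       // pretend the server runs out of tasks
    return Task{next++, "input data"};
}

Result execute(const Task& t) {               // run the downloaded task locally
    return Result{t.id, "processed " + t.input};
}

void send_result(const Result& r) {           // upload the result to the server
    std::cout << "task " << r.id << " -> " << r.output << "\n";
}

int main() {
    while (auto task = fetch_task()) {        // download, solve, send back, repeat
        send_result(execute(*task));
    }
}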

8 existing systems
SETI@home
– search for extraterrestrial intelligence
– analysis of data from radio telescopes
– client application is very specialized
– uses the Internet to connect clients to the server and to download/upload a task
– no framework for other applications
– no source code available

9 existing systems
distributed.net
– one of the largest "computers" in the world (~20 TFlops)
– used for solving computational challenges: RC5, Optimal Golomb rulers
– client application is very specialized
– uses the Internet to connect clients to the server
– no framework for other applications
– no source code available

10 existing systems
Condor project (Univ. of Wisconsin)
– more research-oriented computational projects
– more advanced features, user applications
– very difficult to install, problems with some OSes (started from a Unix environment)
– restrictive license (closed system)
other commercial projects: Entropia, Parabon

11 a new system
QADPZ project (NTNU)
– initial application domains: large-scale visualization, genetic algorithms, neural networks
– a prototype existed in early 2001 but was abandoned (too visualization-oriented)
– development started in July 2001; first release v0.1 in Aug 2001
– now close to release v0.8 (Feb-Mar 2003)
– system independent of any specific application domain
– open source project on SourceForge.net

12 qadpz.sourceforge.net

13 QADPZ description
QADPZ project (NTNU)
– similar in many ways to Condor (submits computing tasks to idle computers in a network)
– easy to install, use, and maintain
– modular and extensible
– open source project, implemented in C++
– support for many OSes (Linux, Windows, Unix, …)
– support for multiple users, encryption
– logging and statistics

14 QADPZ architecture
client ↔ master ↔ slave (many clients and slaves, one master)
– master: management of slaves, scheduling tasks, monitoring tasks, controlling tasks
– slave: background process, downloads tasks/data, executes tasks
– client: user interface, submits tasks/data
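The same three roles and their responsibilities from the diagram, written as C++ interfaces; the class and method names are illustrative assumptions, not the actual QADPZ classes:

// Conceptual view of the three roles; the responsibilities mirror the slide above.
#include <string>

struct TaskData { std::string id; std::string payload; };

class Client {                                            // user interface
public:
    virtual void submit(const TaskData& task) = 0;        // submit tasks/data to the master
    virtual ~Client() = default;
};

class Master {                                            // central coordinator
public:
    virtual void manageSlaves() = 0;                      // keep track of registered slaves
    virtual void scheduleTask(const TaskData&) = 0;       // pick a slave for each task
    virtual void monitorTasks() = 0;                      // watch running tasks
    virtual void controlTasks() = 0;                      // start/stop/restart tasks
    virtual ~Master() = default;
};

class Slave {                                             // background process on a workstation
public:
    virtual TaskData download(const std::string& id) = 0; // fetch task code/data
    virtual void execute(const TaskData&) = 0;            // run the task locally
    virtual ~Slave() = default;
};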

15 parallelism
task-parallelism ("coarse grain")
– multiple independent code segments/programs are run concurrently
– same or different initial data, same or different code
data-parallelism ("fine grain")
– the same code runs concurrently on different data elements
– usually requires synchronization (a better network)

16 task-parallelism: e.g. image processing (diagram of client, master, and slaves)
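A small sketch of why image processing fits the task-parallel model: each tile of the image becomes an independent task that could be handed to a separate slave. The Tile type and submit_task() are invented for the example:

// Hypothetical task-parallel decomposition of an image into independent tiles.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Tile { int x, y, width, height; };

static void submit_task(const Tile& t) {          // stand-in for handing a task to the master
    std::printf("task: tile at (%d,%d) size %dx%d\n", t.x, t.y, t.width, t.height);
}

int main() {
    const int imageW = 1024, imageH = 768, tile = 256;
    std::vector<Tile> tiles;
    for (int y = 0; y < imageH; y += tile)        // cut the image into independent tiles
        for (int x = 0; x < imageW; x += tile)
            tiles.push_back({x, y, std::min(tile, imageW - x), std::min(tile, imageH - y)});

    for (const Tile& t : tiles)                   // no tile depends on another,
        submit_task(t);                           // so the tasks can run concurrently
}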

17 control flow and data flow
– control messages flow between client, master, and slaves
– task data is served from a repository: an external web server or ftp server, or the internal lightweight web server

18 client
– can be automatic or manual (driven by a user)
– describes the project in a project file
– prepares the task code
– prepares the input files
– submits the tasks
– then either waits for the results (stays connected to the master), or detaches from the tasks and gets the results later (the master will keep all messages)

19 client programming
– basic level: the user has an executable to be run on multiple computers and uses our generic client to submit tasks
– intermediate level: the submission script (XML interface) is changed
– advanced level: the user writes his own client application using our API
– hacker level: modify the QADPZ source code for extra functionality
→ see the manual
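A hypothetical sketch of what an "advanced level" client might look like: submitting a task through a client-side API and waiting for the result. The QadpzClient class, its methods, and the host name are invented for illustration; they are not the real QADPZ API:

// Imaginary client-API usage: submit a task, then block for the result.
#include <iostream>
#include <string>

class QadpzClient {                                   // made-up wrapper around the client protocol
public:
    explicit QadpzClient(const std::string& masterHost) : master(masterHost) {}
    int submit(const std::string& exe, const std::string& input) {
        std::cout << "submitting " << exe << " with " << input
                  << " to " << master << "\n";
        return nextId++;                              // pretend the master assigned a task id
    }
    std::string waitForResult(int taskId) {           // wait until the master reports completion
        return "result of task " + std::to_string(taskId);
    }
private:
    std::string master;
    int nextId = 1;
};

int main() {
    QadpzClient client("master.example.org");
    int id = client.submit("sphere_solver", "sph/sphere.txt");
    std::cout << client.waitForResult(id) << "\n";    // or detach here and collect results later
}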

20 jobs, tasks, subtasks
– a job consists of groups of tasks, executed sequentially or in parallel
– a task can consist of subtasks, used for parallel tasks (the same executable is run, but with different input data)
– each task is submitted individually; the user specifies which OS and the minimum resource requirements (disk, memory)
– the master allocates the most suitable slave for executing the task and notifies the client
– when a task is finished, the results are stored as specified and the client is notified
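A data-model sketch of the job/task/subtask hierarchy described above, including the per-task OS and minimum resource requirements; all type and field names are illustrative assumptions, not QADPZ's internal structures:

// Hypothetical model of jobs, tasks, and subtasks.
#include <string>
#include <vector>

struct Requirements {
    std::string os;          // e.g. "Linux", "Win32"
    int minDiskMB = 0;       // minimum free disk space
    int minMemMB  = 0;       // minimum free memory
};

struct SubTask {
    std::string inputFile;   // same executable, different input data
};

struct Task {
    std::string executable;
    Requirements req;        // used by the master to pick a suitable slave
    std::vector<SubTask> subtasks;
};

struct Job {
    std::string name;
    std::vector<std::vector<Task>> groups;  // groups run sequentially; tasks within a group may run in parallel
};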

21 the tasks
– a regular binary executable: no modifications required, but it must be compiled for each of the platforms
– a regular interpreted program: shell script, Perl, Python; a Java program requires an interpreter/VM on each slave
– a dynamically loaded slave library (our API): better performance, more flexibility
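A sketch of how a dynamically loaded task library could work on a Unix slave: the slave loads the task code with dlopen() and calls an entry point in-process. The entry-point name run_task and the library path are assumptions for illustration; this is not the actual QADPZ slave-library API:

// Loading and running a task library at run time (POSIX dlopen/dlsym).
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* lib = dlopen("./libtask.so", RTLD_NOW);      // load the task code at run time
    if (!lib) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    using TaskFn = int (*)(const char* inputFile);
    TaskFn run_task = reinterpret_cast<TaskFn>(dlsym(lib, "run_task"));
    if (!run_task) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); dlclose(lib); return 1; }

    int status = run_task("sph/sphere.txt");           // execute the task in-process (no fork/exec)
    std::printf("task finished with status %d\n", status);
    dlclose(lib);
    return status;
}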

22 e.g. job description
– an XML job description for the sphere.prj project: target OS Win32, the parameters 2 and 50, and the input files sph/sphere.txt and sph/layout.txt
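One possible shape for that XML job description; every element and attribute name here is invented for illustration, and the roles of the values 2 and 50 are guesses, so this should not be read as the real QADPZ schema:

<!-- hypothetical reconstruction: element/attribute names are assumptions;
     the meaning of the values 2 and 50 is not recoverable from the slide -->
<job name="sphere.prj" os="Win32">
  <param>2</param>
  <param>50</param>
  <input>sph/sphere.txt</input>
  <input>sph/layout.txt</input>
</job>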

23 the master
– keeps account of all existing slaves (status, specifications)
– usually one master is enough; more can be used if there are too many slaves (the communication protocol allows one master to act as another master's client, but this is not fully implemented yet)
– keeps account of all submitted jobs/tasks
– keeps a list of accepted users (based on username/password)
– gathers statistics about slaves and tasks
– can optionally run the internal web server for the repository
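An illustrative sketch of how a master might pick "the most suitable slave" for a task, using only the criteria named in these slides (matching OS and minimum disk/memory). This is a guess at the policy, not QADPZ's actual scheduler:

// Simple first-fit slave selection based on OS and resource requirements.
#include <cstddef>
#include <string>
#include <vector>

struct SlaveInfo {
    std::string os;
    int freeDiskMB;
    int freeMemMB;
    bool busy;
};

struct TaskReq {
    std::string os;
    int minDiskMB;
    int minMemMB;
};

// Return the index of a suitable idle slave, or -1 if none is available.
int pickSlave(const std::vector<SlaveInfo>& slaves, const TaskReq& req) {
    for (std::size_t i = 0; i < slaves.size(); ++i) {
        const SlaveInfo& s = slaves[i];
        if (!s.busy && s.os == req.os &&
            s.freeDiskMB >= req.minDiskMB && s.freeMemMB >= req.minMemMB)
            return static_cast<int>(i);
    }
    return -1;   // no match: the task stays queued until a slave frees up
}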

24 installation
– one of our computer labs (Rose salen): ~80 PCs, Pentium 3, 733 MHz, 128 MBytes
– dual boot: Win2000 and FreeBSD
– running for several months
– when a student logs in on a computer, the slave running on that computer is set into disabled mode (no new computing tasks are accepted; any current task is killed and/or restarted on another computer)
– obtained results worth weeks of computation in just a couple of days

25 master

26 status

27 communication
– layered communication protocol
– exchanged messages are in XML (with compression and encryption)
– uses UDP with a reliable layer on top
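A conceptual sketch of that layering: an XML message is compressed, then encrypted, then handed to a reliability layer over UDP. The function names are placeholders, and the compress-before-encrypt ordering is an assumption; only the three layers themselves come from the slide:

// Message layering sketch: XML -> compress -> encrypt -> reliable UDP.
#include <iostream>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

Bytes compressXml(const std::string& xml) {        // placeholder: a real system might use zlib here
    return Bytes(xml.begin(), xml.end());
}

Bytes encryptData(const Bytes& data) {             // placeholder for the encryption step
    return data;
}

void reliableUdpSend(const Bytes& packet) {        // placeholder: UDP send plus ack/retransmit logic
    std::cout << "sending " << packet.size() << " bytes\n";
}

void sendMessage(const std::string& xmlMessage) {
    reliableUdpSend(encryptData(compressXml(xmlMessage)));
}

int main() {
    sendMessage("<message type='status'><slave id='42'/></message>");
}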

28 OO design

29 installation
– slave: qadpz_slave (daemon) + slave.cfg
– master: qadpz_master (daemon) + master.cfg + qadpz_admin + users.txt + privkey
– client: qadpz_run + client.cfg + pubkey
– platforms: Linux, Win32 (9x, 2K, XP), SunOS, IRIX, FreeBSD, Darwin (MacOS X)

30 future work
– local caching of executables on the slaves
– different scheduling protocols on the master
– web interface to the client: easier job creation (with input data), starting/stopping jobs, monitoring the execution of jobs, easy access to the output of execution
– should decrease the learning effort for using the system

31 the team
QADPZ = Atle Pedersen, Diego Federici, Pavel Petrovic, Zoran Constantinescu
from the Division of Intelligent Systems (DIS)

32 thank you ?