Distributed Resource Management and Parallel Computation
Dr Michael Rudgyard, Streamline Computing Ltd.

Spin-out of Warwick (& Oxford) University
Specialising in distributed (technical) computing
– Cluster and GRID computing technology
14 employees & growing; focussed expertise in:
– Scientific Computing
– Computer systems and support
– Presently 5 PhDs in HPC and Parallel Computation
– Expect growth to 20+ people in 2003

Strategy
Establish an HPC systems integration company...
...but re-invest profits into software
– Exploiting IP and significant expertise
– First software product released
– Two more products in prototype stage
Two complementary ‘businesses’
– Both high growth

Track Record (2001 – date)
Installations include:
– Largest Sun HPC cluster in Europe (176 proc)
– Largest Sun / Myrinet cluster in the UK (128 proc)
– AMD, Intel and Sun clusters at 21 UK universities
– Commercial clients include Akzo Nobel, Fujitsu, McLaren F1, Rolls-Royce, Schlumberger, Texaco...
Delivered a 264-proc Intel/Myrinet cluster:
– 1.3 Tflop/s peak!
– Forms part of the White Rose Computational Grid

Streamline and Grid Computing
Pre-configured ‘grid’-enabled systems:
– Clusters and farms
– The SCore parallel environment
– Virtual ‘desktop’ clusters
Grid-enabled software products:
– The Distributed Debugging Tool
– Large-scale distributed graphics
– Scalable, intelligent & fault-tolerant parallel computing

‘Grid’-enabled turnkey clusters
Choice of DRMs and schedulers:
– (Sun) GridEngine
– PBS / PBS-Pro
– LSF / ClusterTools
– Condor
– Maui Scheduler
Globus 2.x gatekeeper (Globus 3?)
Customised access portal

The SCore parallel environment
Developed by the Real World Computing Partnership in Japan
Unique features that are unavailable in most parallel environments:
– Low-latency, high-bandwidth MPI drivers
– Network transparency: Ethernet, Gigabit and Myrinet
– Multi-user time-sharing (gang scheduling)
– O/S-level checkpointing and failover
– Integration with PBS and SGE
– MPICH-G port
– Cluster management functionality

‘Desktop’ Clusters
Linux Workstation Strategy
– Integrated software stack for HPTC (compilers, tools & libraries), cf. UNIX workstations
Aim to provide a GRID at point of sale:
– Single point of administration for several machines
– Files served from the front-end
– Resource management
– Globus-enabled
– Portal
A cluster with monitors!

The Distributed Debugging Tool
A debugger for distributed parallel applications
– Launched at Supercomputing 2002
Aim is to be the de facto HPC debugging tool
– Linux ports for GNU, Absoft, Intel and PGI
– IA64 and Solaris ports; AIX and HP-UX soon…
– Commodity pricing structure!
Existing architecture lends itself to the GRID:
– Thin-client GUI + XML middleware + back-end
– Expect GRID-enabled version in 2003

Distributed Graphics Software
Aims
– To enable very large models to be viewed and manipulated using commodity clusters
– Visualisation on a (local or remote) graphics client
Technology
– Sophisticated data-partitioning and parallel I/O tools
– Compression using distributed model simplification
– Parallel (real-time) rendering
To be GRID-enabled within the e-Science ‘Gviz’ project

Parallel Compiler and Tools Strategy
Aim to invest in new computing paradigms
Developing parallel applications is far from trivial:
– OpenMP does not map well onto cluster architectures
– MPI is too low-level
– Few skills in the marketplace!
– Yet growth of MPPs is exponential…
Most existing applications are not GRID-friendly:
– # of processors fixed
– No fault tolerance
– Little interaction with the DRM
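
As an illustration of the "MPI is too low-level" point, the following is a minimal sketch (C with MPI, not taken from the slides) of a nearest-neighbour halo exchange: even this trivial communication pattern leaves ranks, neighbours, tags and boundary handling entirely in the programmer's hands.

/* Minimal halo exchange: each rank swaps one value with its neighbours. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double halo_left = 0.0, halo_right = 0.0, local = 1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* The ordering, tags and null-rank handling are all explicit. */
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                 &halo_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, left, 1,
                 &halo_right, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d: halo_left=%g halo_right=%g\n",
           rank, size, halo_left, halo_right);
    MPI_Finalize();
    return 0;
}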

DRM for Parallel Computation
Throughput of parallel jobs is limited by:
– Static submission model: ‘mpirun –np …’
– Static execution model: # of processors fixed
– Scalability; many jobs use too many processors!
– Job starvation
Available tools can only solve some issues:
– Advanced reservation and back-fill (e.g. Maui)
– Multi-user time-sharing (gang scheduling)
The application itself must take responsibility!

Dynamic Job Submission
The job scheduler should decide the available processor resource!
The application then requires:
– In-built partitioning / data management
– An appropriate parallel I/O model
– Hooks into the DRM
The DRM requires:
– Typical memory and processor requirements
– LOS information
– Hooks into the application
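
A minimal sketch (C with MPI, not from the slides) of the submission side of this idea: instead of assuming a fixed ‘-np’, the application sizes its block decomposition from whatever processor count the scheduler and launcher actually granted. The global problem size NCELLS is an illustrative placeholder.

#include <mpi.h>
#include <stdio.h>

#define NCELLS 1000000   /* global problem size (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block decomposition derived from the granted processor count,
     * not from a compile-time or command-line constant. */
    long base = NCELLS / nprocs, rem = NCELLS % nprocs;
    long lo = rank * base + (rank < rem ? rank : rem);
    long n  = base + (rank < rem ? 1 : 0);

    if (rank == 0)
        printf("scheduler granted %d processes; ~%ld cells each\n",
               nprocs, base);
    printf("rank %d owns cells [%ld, %ld)\n", rank, lo, lo + n);

    MPI_Finalize();
    return 0;
}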

Dynamic Parallel Execution
Additional resources may become available, or be required by other applications, during execution…
Ideal situation:
– DRM informs the application
– Application dynamically re-partitions itself
Other issues:
– DRM requires knowledge of the application (benefit of data redistribution must outweigh cost!)
– Frequency of dynamic scheduling
– Message passing must have dynamic capabilities
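
To make the cost/benefit point concrete, here is a hypothetical check (not from the slides) that an application or DRM might apply before accepting a new processor count: the predicted saving over the remaining iterations must exceed the estimated cost of redistributing the data. The timing model is deliberately crude.

#include <stdio.h>

/* Crude model: time per iteration scales as work/nprocs, and the
 * redistribution cost is a one-off penalty in seconds. */
static int worth_repartitioning(double work_per_iter,       /* s on 1 proc */
                                int    iters_left,
                                int    nprocs_now,
                                int    nprocs_offered,
                                double redistribution_cost)  /* seconds */
{
    double t_now   = iters_left * work_per_iter / nprocs_now;
    double t_after = iters_left * work_per_iter / nprocs_offered
                   + redistribution_cost;
    return t_after < t_now;
}

int main(void)
{
    /* Example: 2000 iterations left, 0.8 s/iteration on one processor,
     * offered 64 procs instead of 32, redistribution costs ~20 s. */
    int ok = worth_repartitioning(0.8, 2000, 32, 64, 20.0);
    printf("accept new partition? %s\n", ok ? "yes" : "no");
    return 0;
}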

The Intelligent Parallel Application
Optimal scheduling requires more information:
– How well the application scales
– Peak and average memory requirements
– Application performance vs. architecture
The application ‘cookie’ concept:
– The application (and/or DRM) should gather information about its own capabilities
– The DRM can then limit the # of available processors
– Ideally requires hooks into the programming paradigm…
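
One possible reading of the ‘cookie’ concept, sketched below in C; the file name and record format are hypothetical, not part of any existing DRM interface. After each run the application appends what it has learned about itself (processors used, wall time, peak memory), so a scheduler or wrapper script could later cap the resources it is offered.

#include <stdio.h>
#include <time.h>

static void write_cookie(const char *path, int nprocs,
                         double wall_seconds, double peak_mem_mb)
{
    FILE *f = fopen(path, "a");
    if (!f) return;
    /* One line per run: timestamp, processors used, wall time, peak memory */
    fprintf(f, "%ld %d %.1f %.1f\n",
            (long)time(NULL), nprocs, wall_seconds, peak_mem_mb);
    fclose(f);
}

int main(void)
{
    /* Illustrative values; a real code would measure these itself. */
    write_cookie("solver.cookie", 32, 1842.5, 512.0);
    return 0;
}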

Fault Tolerance
On large MPPs, processors/components will fail!
Applications need fault tolerance:
– Checkpointing + RAID-like redundancy (cf. SCore)
– Dynamic repartitioning capabilities
– Interaction with the DRM
– Transparency from the user’s perspective
Fault tolerance relies on many of the capabilities described above…
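
A minimal application-level checkpoint/restart sketch (C with MPI, not from the slides, and far simpler than SCore's O/S-level checkpointing): each rank periodically writes its local state to a per-rank file and, on restart, resumes from the last saved iteration. File names, sizes and intervals are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N 1024

static void checkpoint(int rank, int iter, const double *u)
{
    char name[64];
    snprintf(name, sizeof name, "ckpt_rank%04d.dat", rank);
    FILE *f = fopen(name, "wb");
    if (!f) return;
    fwrite(&iter, sizeof iter, 1, f);
    fwrite(u, sizeof *u, N, f);
    fclose(f);
}

static int restart(int rank, double *u)   /* returns iteration to resume at */
{
    char name[64];
    int iter = 0;
    snprintf(name, sizeof name, "ckpt_rank%04d.dat", rank);
    FILE *f = fopen(name, "rb");
    if (!f) return 0;                      /* no checkpoint: start from 0 */
    if (fread(&iter, sizeof iter, 1, f) != 1) iter = 0;
    else if (fread(u, sizeof *u, N, f) != (size_t)N) iter = 0;
    fclose(f);
    return iter;
}

int main(int argc, char **argv)
{
    int rank;
    double u[N] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int first = restart(rank, u);
    for (int iter = first; iter < 1000; iter++) {
        u[iter % N] += 1.0;                /* stand-in for real work */
        if (iter % 100 == 0)
            checkpoint(rank, iter, u);     /* periodic checkpoint */
    }
    checkpoint(rank, 1000, u);
    MPI_Finalize();
    return 0;
}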

Conclusions
Commitment to near-term GRID objectives:
– Turnkey clusters, farms and storage installations
– Ongoing development of ‘GRID-enabled’ tools
– Driven by existing commercial opportunities…
‘Blue-sky’ project for next-generation applications:
– Exploits existing IP and an advanced prototype
– Expect moderate income from focussed exploitation
– Strategic positioning: existing paradigms will ultimately be a barrier to the success of (V-)MPP computers / clusters!