Clusters Part 2 - Hardware
Lars Lundberg
The slides in this presentation cover Part 2 (Chapters 5-7) in Pfister's book.

Exposed vs. Enclosed Clusters
[Diagram: an enclosed cluster and an exposed cluster, each with its intra-cluster communication path indicated]

Exposed Clusters
- The nodes must communicate by messages, since public standard communication is always message-based.
- Communication has high overhead, since it is based on standard protocols.
- The communication channel itself is not secure, so additional work must be done to ensure the privacy of intra-cluster communication.
- It is relatively easy to include computers that are spread out across a campus area or a company.
- These clusters are easy to build. In fact, you do not have to build them at all; it is just a matter of running the right software.

Enclosed Clusters
- Communication can be by a number of means: shared disk, shared memory, messages, etc.
- It is possible to obtain communication with low overhead.
- The security of the communication is implicit.
- It is easier to implement cluster software on enclosed clusters, since security is not an issue and the cluster cannot be split into two parts that may have to be merged later.

“Glass-House” vs. “Campus-Wide” Clusters
In the “glass-house” case the computers are fully dedicated to their use as shared computational resources and will therefore be located in a geographically compact arrangement (the glass house).
In the “campus-wide” case (also known as NOW - Network Of Workstations) the computers are located on the users’ desks. Campus-wide clusters operate in a less controlled environment, and they must quickly and totally relinquish use of a node to a user.

The Four Categories of Cluster Hardware
- I/O-Attached Message-Based
- I/O-Attached Shared Storage
- Memory-Attached Shared Storage
- Memory-Attached Message-Based

I/O-Attached Message-Based
[Diagram: two nodes, each containing a Processor, Memory and I/O, connected through their I/O systems by a network such as a LAN, FDDI, ATM, etc.]

I/O-Attached Shared Storage
[Diagram: two nodes, each containing a Processor, Memory and I/O, both attached through their I/O systems to common shared storage]

Memory-Attached Shared Storage (Global shared memory)
[Diagram: two nodes, each containing a Processor, Memory and I/O, both attached to a separate Shared Memory unit]

Memory-Attached Shared Storage (Distributed shared memory)
[Diagram: two nodes, each containing a Processor, Memory and I/O, with their memories interconnected directly]
This architecture can also be used for Memory-Attached Message-Based clusters, even though no such systems are available at the moment.

I/O- vs. Memory-Attached
- I/O-attached message-passing is the only possibility for heterogeneous systems.
- Memory attachment is in general harder than I/O attachment, for two reasons:
  - The hardware of most machines is designed to accept foreign attachments in its I/O system.
  - The software for the basic memory-to-memory communication is more difficult to construct.
- When memory attachment is operational, it can potentially provide communication that is dramatically faster than that of I/O attachment.

Shared Storage vs. Message-Based
- Shared storage is considered to be easier to use and program (Pfister is not only considering shared-disk clusters but also SMP computers).
- Message passing is considered to be more portable and scalable.
- The hardware aspect is mainly a performance issue, whereas the programming model concerns the usability of the system; e.g., a shared-memory (or shared-disk) model can be obtained without physically sharing the memory or disk.

Communication Requirements
- The required bandwidth between the cluster nodes is (obviously) highly dependent on the workload.
- For I/O-intensive workloads the intra-cluster communication bandwidth should at least equal the aggregate bandwidth of all other I/O sources that each node has (an illustrative calculation follows below).
- The bandwidth requirements are particularly difficult to meet in shared-nothing (message-based) clusters.
- A number of techniques have been developed for increasing the intra-cluster communication bandwidth (see Section 5.5 in Pfister’s book).
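As a rough illustration of the rule of thumb above (the numbers are invented for the example, not taken from Pfister): consider a node with two disk channels of 100 MB/s each and one 100 MB/s external network interface. Its aggregate non-cluster I/O bandwidth is

  2 x 100 MB/s + 100 MB/s = 300 MB/s

so for an I/O-intensive workload the intra-cluster interconnect should offer at least 300 MB/s to that node.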

Symmetric Multiprocessors (SMPs)
[Diagram: processors sharing a single Memory and I/O system, with a Disk and a LAN attached to the I/O system]

SMP Caches
[Diagram: the SMP structure from the previous slide, with a Cache added at each processor]

NUMA Multiprocessors
[Diagram: several processor nodes, each containing a Processor, an MMU and a local Memory, connected by an interconnect]

CC-NUMA Multiprocessors
[Diagram: several processor nodes, each containing a Processor, a Cache, an MMU and a local Memory]

COMA Multiprocessors
[Diagram: several processor nodes, each containing a Processor, a Cache, an MMU and an Attraction Memory]

Running serial programs on a cluster
- It is simple (almost trivial), but very useful, to run a number of serial jobs on a cluster. The relevant performance metric in this case is throughput.
- Three types of serial workloads can be distinguished:
  - Batch processing
  - Interactive logins, e.g. one can log onto a cluster without specifying a node. Useful in number-crunching applications with intermediate results.
  - Multijob parallel, e.g. a sequence of coarse-grained jobs (almost the same as batch processing).

Running parallel programs on a cluster
We classify parallel programs into two categories:
- Programs that justify a large effort to make them run efficiently on a cluster, e.g.:
  - Grand-challenge problems: global weather simulation, etc.
  - Heavily used programs: DBMS, LINPACK, etc.
  - Academic research
- Programs where only a minimal effort is justified for making them run efficiently on a cluster, e.g.:
  - Database applications - use a parallel DBMS
  - Technical computing - use parallel LINPACK, etc.
  - Programs that are parallelized automatically by the compiler

Amdahl’s Law
Total execution time = serial part + parallel part.
If we use N processors (computers), the best we can hope for is:

  Total execution time = serial part + (parallel part / N)

For instance, if the serial part is 5% of the total execution time, the best we can hope for is a speedup of 20, even if we use hundreds or thousands of processors.
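The same statement as a formula, with s denoting the serial fraction of the one-processor execution time (the symbol s is introduced here; it is not on the slide):

  speedup(N) = (serial + parallel) / (serial + parallel/N) = 1 / (s + (1 - s)/N) <= 1/s

With s = 0.05 this bound is 1/0.05 = 20, which is where the figure on the slide comes from.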

Programming models
- Programs written to exploit SMP parallelism will not work (efficiently) on clusters.
- Programs written to exploit message-based cluster parallelism will not work (efficiently) on SMPs.
- Pfister has a long discussion about this in Chapter 9.

Serial program

do forever
  max_change = 0;
  for y = 2 to N-1
    for x = 2 to N-1
      old_value = v[x,y]
      v[x,y] = (v[x-1,y] + v[x+1,y] + v[x,y-1] + v[x,y+1])/4
      max_change = max(max_change, abs(old_value - v[x,y]))
    end for x
  end for y
  if max_change < close_enough then leave do forever
end do forever

Parallel program - first attempt

do forever
  max_change = 0;
  forall y = 2 to N-1
    forall x = 2 to N-1
      old_value = v[x,y]
      v[x,y] = (v[x-1,y] + v[x+1,y] + v[x,y-1] + v[x,y+1])/4
      max_change = max(max_change, abs(old_value - v[x,y]))
    end forall x
  end forall y
  if max_change < close_enough then leave do forever
end do forever

Parallel program - second attempt

do forever
  max_change = 0;
  forall y = 2 to N-1
    forall x = 2 to N-1
      old_value = v[x,y]
      v[x,y] = (v[x-1,y] + v[x+1,y] + v[x,y-1] + v[x,y+1])/4
      acquire(max_change_lock)
      max_change = max(max_change, abs(old_value - v[x,y]))
      release(max_change_lock)
    end forall x
  end forall y
  if max_change < close_enough then leave do forever
end do forever

Parallel program - third attempt

do forever
  max_change = 0;
  forall y = 2 to N-1
    row_max = 0;
    for x = 2 to N-1
      old_value = v[x,y]
      v[x,y] = (v[x-1,y] + v[x+1,y] + v[x,y-1] + v[x,y+1])/4
      row_max = max(row_max, abs(old_value - v[x,y]))
    end for x
    acquire(max_change_lock)
    max_change = max(max_change, row_max)
    release(max_change_lock)
  end forall y
  if max_change < close_enough then leave do forever
end do forever
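For concreteness, a minimal C/pthreads rendering of this third attempt is sketched below. It is not Pfister's code: the interleaved assignment of rows to threads, the thread count and the array size are assumptions made for the example. What it preserves is the slide's point that the global lock is taken only once per row rather than once per element.

#include <math.h>
#include <pthread.h>

#define N        1024
#define NTHREADS 4

static double v[N + 1][N + 1];     /* 1-based indexing, as in the pseudocode */
static double max_change;
static pthread_mutex_t max_change_lock = PTHREAD_MUTEX_INITIALIZER;

static void *relax_rows(void *arg)
{
    long id = (long)arg;
    /* Each thread owns an interleaved set of rows: y = 2+id, 2+id+NTHREADS, ...
     * As in the slide's forall version, neighbouring rows may be read while
     * another thread is updating them (chaotic relaxation). */
    for (int y = 2 + (int)id; y <= N - 1; y += NTHREADS) {
        double row_max = 0.0;
        for (int x = 2; x <= N - 1; x++) {
            double old_value = v[x][y];
            v[x][y] = (v[x-1][y] + v[x+1][y] + v[x][y-1] + v[x][y+1]) / 4.0;
            row_max = fmax(row_max, fabs(old_value - v[x][y]));
        }
        pthread_mutex_lock(&max_change_lock);      /* one lock per row */
        if (row_max > max_change) max_change = row_max;
        pthread_mutex_unlock(&max_change_lock);
    }
    return NULL;
}

/* One outer iteration; the caller repeats this until the result is small enough. */
double relax_once(void)
{
    pthread_t t[NTHREADS];
    max_change = 0.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, relax_rows, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return max_change;
}

A driver would call relax_once() repeatedly until the returned value drops below close_enough, mirroring the outer do forever loop; compile with something like cc -pthread relax.c -lm.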

Commercial programming models
For systems with a small number (< 16) of processors:
- Threads
- Processes that share a memory segment (a minimal sketch follows below)
For larger systems:
- Global I/O, i.e. all computers use the same file system
- RPC (Remote Procedure Calls)
- Global locks
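As an illustration of the "processes that share a memory segment" model, here is a minimal POSIX sketch; the segment name, its size and the use of fork are choices made for the example, not something prescribed by the slides.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Create a named shared segment and map it into this process. */
    const char *name = "/demo_segment";          /* hypothetical segment name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(long));
    long *counter = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    *counter = 0;

    if (fork() == 0) {        /* child process: writes through its own mapping */
        *counter = 42;
        return 0;
    }
    wait(NULL);               /* parent: the child's update is visible here */
    printf("counter = %ld\n", *counter);

    munmap(counter, sizeof(long));
    shm_unlink(name);
    return 0;
}

Real code would of course check the return values and protect concurrent updates with a lock or atomic operations; on some systems linking requires -lrt for the shm_* calls.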