Clustering Technology

Clustering Schematic

Cluster Components
- Cluster hardware (processor, main memory, hard disk, …)
- Cluster network (Fast Ethernet, Gigabit Ethernet, Myrinet, …)
- Cluster software (operating system, programming environment, …)

Cluster Operating System Characteristics
- Manageability: Remote and intuitive system administration is an absolute necessity; this is often associated with a Single System Image (SSI), which can be realized on different levels, ranging from a high-level set of special scripts down to real state-sharing at the OS level.
- Stability: The most important characteristics are robustness against crashing processes, failure recovery by dynamic reconfiguration, and usability under heavy load.
- Performance: The performance-critical parts of the OS, such as memory management, the process and thread scheduler, file I/O and the communication protocols, should work as efficiently as possible.
- Extensibility: The OS should allow easy integration of cluster-specific extensions, which will most likely be related to inter-node cooperation. A good example of this is the MOSIX system, which is based on Linux.
- Scalability: The scalability of a cluster is mainly determined by how well additional nodes can be accommodated, which is dominated by the performance characteristics of the interconnect.
- Support: Many intelligent and technically superior approaches in computing have failed due to lack of support in its various aspects: which tools, hardware drivers and middleware environments are available.
- Heterogeneity: Clusters provide a dynamic and evolving environment in that they can be extended or updated with standard hardware as the user needs to or can afford. Therefore, a cluster environment does not necessarily consist of homogeneous hardware.

Cluster Solution
- The hardware of the cluster nodes is typical PCs (this choice reduces the cost/performance ratio).
- Fast Ethernet was used as the cluster interconnect (17 nodes connected by a Fast Ethernet infrastructure).
- Linux was chosen as the cluster OS to meet our needs as far as possible (we can recompile and tune the kernel to meet our needs).
- VMware software was used for virtualizing our computing resources.
- The Message Passing Interface (MPI) was selected as the parallel programming environment; a minimal sketch of an MPI program is shown below.
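As a brief illustration of the selected programming environment (a minimal sketch, not code from this thesis), an MPI program in C has the following shape:

/* Minimal MPI sketch (illustrative only). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (0..size-1)    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes        */

    printf("Hello from node %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down the MPI runtime        */
    return 0;
}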

Configuring the Cluster
- Configuring the cluster nodes (network configuration, package installation, …).
- Optimizing and securing the Linux OS to extract the maximum utilization from the cluster resources.
- Cluster administration (Samba service, ssh, rlogin, rcp, administration scripts, …).

Algorithm Identification

Integer Factorization

Sieving

Trial Division

QS Algorithm

MPQS Algorithm

SIQS Algorithm

Algorithm Complexity Improvement: QS → MPQS → SIQS → NFS

Optimizing Serial Implementation

- Algorithm level optimizations (the most important step in optimizing serial code is to reduce the complexity of the algorithm as much as possible).
- Code level optimizations (in this phase we apply techniques such as loop unrolling, function inlining and gcc inline assembly).
- Compiler level optimizations (gcc optimization options).

Algorithm Optimization In computation-intensive programs, we will often find that 99% of the CPU time is spent in the innermost loop. Identifying the most critical part of your software (for example, with a profiler) is therefore necessary if you want to improve the speed of computation. Study the algorithm used in the critical part of your code and see if it can be improved.

Innermost loop (Conventional Sieving)
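The slide presumably showed the conventional sieving inner loop; as a rough, hypothetical sketch in C (the fb_entry type and variable names are assumptions, not the thesis code), it has the following shape:

/* Hypothetical sketch of conventional QS/SIQS sieving (not the thesis code). */
#include <stddef.h>

typedef struct {
    unsigned int  prime;    /* factor-base prime p                     */
    unsigned int  root;     /* first sieve index hit by p              */
    unsigned char log_p;    /* rounded log of p added to sieve values  */
} fb_entry;

void sieve_interval(unsigned char *sieve, size_t M,
                    const fb_entry *fb, size_t fb_size)
{
    for (size_t j = 0; j < fb_size; j++) {
        unsigned int p = fb[j].prime;
        for (size_t i = fb[j].root; i < M; i += p)
            sieve[i] += fb[j].log_p;   /* scattered writes across the whole
                                          interval: poor cache locality      */
    }
}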

Pentium4 Memory Access Times

An Optimized Sieving Approach (1)

An Optimized Sieving Approach (2)
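Given the memory access time figures on the previous slide, one common way to optimize sieving is cache blocking; the following is a generic sketch of that idea (an assumption for illustration, not necessarily the approach used in this work), reusing the fb_entry type from the sketch above:

/* Generic cache-blocked sieving sketch (assumed technique, not the thesis code):
 * process the interval in blocks that fit in cache so the scattered updates
 * stay cache-resident. The caller initializes next[j] = fb[j].root.           */
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)   /* assumed block size, tuned to the cache */

void sieve_blocked(unsigned char *sieve, size_t M,
                   const fb_entry *fb, size_t *next, size_t fb_size)
{
    for (size_t base = 0; base < M; base += BLOCK_SIZE) {
        size_t end = (base + BLOCK_SIZE < M) ? base + BLOCK_SIZE : M;
        for (size_t j = 0; j < fb_size; j++) {
            unsigned int p = fb[j].prime;
            size_t i = next[j];            /* next index hit by p          */
            for (; i < end; i += p)
                sieve[i] += fb[j].log_p;
            next[j] = i;                   /* resume here in the next block */
        }
    }
}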

Code Level Optimization Techniques
- Loop unrolling: Unrolling amortizes the branch overhead, since it eliminates branches and some of the code that manages induction variables. It also allows you to schedule (or pipeline) the loop more aggressively to hide latencies.
- Function inlining: We can instruct the compiler to insert the body of a function into the code of its callers, at the point where the call would be made. Inlining removes the function-call overhead and exposes more opportunities for optimization to the compiler.
- gcc inline assembly: Assembly routines written as inline functions; they are handy, fast, and very useful.
A short illustrative sketch of the first two techniques follows.
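As a generic illustration (not the actual thesis code; the function names are hypothetical), unrolling a simple accumulation loop by four and marking a small helper as inline might look like this:

/* Illustrative loop unrolling and function inlining (not the thesis code). */
#include <stddef.h>

static inline unsigned int square_mod(unsigned int x, unsigned int m)
{
    return (unsigned int)(((unsigned long long)x * x) % m);
}

unsigned int sum_of_squares_mod(const unsigned int *a, size_t n, unsigned int m)
{
    unsigned int s = 0;
    size_t i = 0;

    /* Main loop unrolled by 4: fewer branches and induction-variable updates,
     * and more independent work for the CPU to schedule.                      */
    for (; i + 4 <= n; i += 4) {
        s += square_mod(a[i],     m);
        s += square_mod(a[i + 1], m);
        s += square_mod(a[i + 2], m);
        s += square_mod(a[i + 3], m);
    }
    /* Epilogue handles the remaining 0..3 elements. */
    for (; i < n; i++)
        s += square_mod(a[i], m);

    return s;
}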

- Built-in gcc functions (__builtin_prefetch): this function is used to minimize cache-miss latency by moving data into the cache before it is accessed.
- Using the "unsigned int" type only: use 32-bit integers instead of smaller integer types (16-bit or 8-bit) to reduce the machine cycles needed.
- Division-free arithmetic: replace divisions with multiplications by precomputed reciprocals.
- Release allocated memory blocks when they are no longer needed.
- …
A small sketch of prefetching and reciprocal multiplication follows.
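The following is a generic sketch of __builtin_prefetch and division-free arithmetic under assumed data (not the thesis code); the prefetch distance and locality hint are tuning assumptions:

/* Illustrative prefetching and reciprocal multiplication (assumed example). */
#include <stddef.h>

void scale_and_prefetch(double *out, const double *in, size_t n, double divisor)
{
    const double recip = 1.0 / divisor;      /* compute the reciprocal once  */

    for (size_t i = 0; i < n; i++) {
        /* Hint to the hardware to start loading data needed a few iterations
         * ahead; distance 16 and locality hint 1 are assumptions to tune.    */
        if (i + 16 < n)
            __builtin_prefetch(&in[i + 16], 0, 1);

        out[i] = in[i] * recip;              /* multiply instead of divide   */
    }
}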

Loop unrolling (code level opt.)

gcc compiler optimizations

Parallel Algorithm Design

Parallel Algorithm Design Methodology
- Partitioning (domain decomposition or functional decomposition)
- Communication
- Agglomeration
- Mapping

Methodical Design (1)
- Partitioning: The computation that is to be performed and the data operated on by this computation are decomposed into small tasks. Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution.
- Communication: The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined.
- Agglomeration: The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs.
- Mapping: Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms.

Methodical Design (2)

Load Balancing Mechanism For load balancing, a master/slave mechanism was used. The master node sends the initial data and assigns the jobs to the slave nodes.

Data Decomposition Algorithm Using the SPMD model (an SPMD program that creates exactly one task per processor). In SIQS we can sieve with multiple polynomials; to generate these polynomials, we must first compute the 'a' coefficients. Sieving with separate 'a' values can be done independently on different processors, so we need to build the 'a' values in the different tasks without any coordination. Duplicated 'a' values would lead to weak concurrency, since the same polynomials would be sieved more than once. A sketch of the master/slave job distribution is shown below.
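As a rough illustration of the master/slave scheme described above (the tags, the job representation, and the sieve_with_a() helper are assumptions, not the thesis code), the dispatch loops might look like this:

/* Hypothetical MPI master/slave job distribution (illustrative sketch only;
 * for simplicity it assumes num_jobs >= num_slaves).                         */
#include <mpi.h>

#define TAG_JOB    1   /* master -> slave: index of an 'a' value to sieve with */
#define TAG_RESULT 2   /* slave -> master: number of relations found           */
#define TAG_STOP   3   /* master -> slave: no more work                        */

int sieve_with_a(int job);   /* hypothetical: sieve with the job-th 'a' value */

void master(int num_slaves, int num_jobs)
{
    int next_job = 0, done = 0;
    MPI_Status st;

    /* Hand one job to every slave, then keep feeding whoever finishes first. */
    for (int s = 1; s <= num_slaves && next_job < num_jobs; s++) {
        MPI_Send(&next_job, 1, MPI_INT, s, TAG_JOB, MPI_COMM_WORLD);
        next_job++;
    }
    while (done < num_jobs) {
        int relations;
        MPI_Recv(&relations, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);
        done++;
        if (next_job < num_jobs) {
            MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB,
                     MPI_COMM_WORLD);
            next_job++;
        } else {
            MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                     MPI_COMM_WORLD);
        }
    }
}

void slave(void)
{
    MPI_Status st;
    for (;;) {
        int job;
        MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        int relations = sieve_with_a(job);
        MPI_Send(&relations, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}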

Data Decomposition Algorithm (1) ( Initialization Data)

Data Decomposition Algorithm (2) (Determining the number and size of ‘a’ value’s factors)

Data Decomposition Algorithm (3) (Computing the factors of ‘a’ values)

Master Node Algorithm

Slave Node Algorithm

Double Large Prime Variation effects

Master Node Algorithm (1) (Improved version)

Master Node Algorithm (2) (Improved version)

Slave Node Algorithm (Improved version)

Cluster Benchmarks

Performance Evaluation
- Speedup: S_p = T_1 / T_p, where T_1 is the serial execution time and T_p the execution time on p nodes. Amdahl's law gives the ideal speedup S_p = 1 / ((1 - f) + f/p), where f is the fraction of the computation that can be parallelized.
- Efficiency: The efficiency, E_p, of a p-node computation with speedup S_p is given by E_p = S_p / p.
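For illustration (these are not the measured benchmark numbers): if 95% of the computation is parallelizable, Amdahl's law gives S_16 = 1 / (0.05 + 0.95/16) ≈ 9.1 on 16 nodes, and the corresponding efficiency is E_16 = S_16 / 16 ≈ 0.57.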

Total Execution Time

Sieving Execution Time

Total Speedup

Sieving Speedup

Total Efficiency

Sieving Efficiency

Sources of Inefficiency
- Static load balancing (we may overestimate our needs and waste cluster resources).
- Communication overhead (Ethernet protocol and TCP/IP stack operations).
- Load imbalance.
- Non-parallelized stages of the application, such as the linear algebra stage (Amdahl's law).
- Inefficient software and hardware technologies (MPI overhead and OS inefficiencies).

Conclusions and Future Work Distributed computing with cluster technology is still an open problem. The MOSIX project is one of the most successful solutions; it builds a single system image by connecting Linux kernels so that running processes see one single system. However, MOSIX and similar solutions do not work optimally for every application. In fact, supplying an SSI at the operating system level, while a definite boon in terms of manageability, drastically inhibits scalability. Another issue is the availability of the source code together with the possibility to extend (and thus modify) the operating system on this basis: this has a negative influence on the stability and manageability of the system, since over time many variants of the operating system will develop, and the different extensions may conflict when there is no single supplier.

Conclusions and Future Work In this thesis our goal was to present a general approach to running computational applications on distributed systems optimally. Algorithm optimization is the major step of the presented approach: in computation-intensive software, the first step is to identify the most critical part of the software and optimize its algorithm as much as possible (this is the most effective step). In the second phase, we implement the best optimized algorithm serially to make maximum use of local resources (in fact, distributed computing only pays off when the local computation cost is greater than the cost of remote execution). In the last stage, the serial algorithm is parallelized: focus on the most computation-intensive part of the serial algorithm and try to parallelize that part efficiently.

Conclusions and Future Work The performance benchmarks show that our strategy works well, and therefore the method can be applied to other applications in a similar way. Future work includes:
- Using the NFS (Number Field Sieve) algorithm for RSA key cracking.
- Parallelizing the linear algebra stage for the factorization of very large numbers (greater than 120 digits).