Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters Brett A. Becker and Alexey Lastovetsky Heterogeneous Computing Laboratory.

Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
ISPDC’07, Hagenberg, July 6, 2007

Outline
● Motivation and Goals
● Background
● The ‘Square-Corner’ Partitioning
● MPI Experiments / Results
● Conclusion / Future Work

Motivation
● Partitioning algorithms for parallel computing that are designed for n nodes result in partitionings that are not always optimal on a small number of nodes
● We previously presented a new ‘Square-Corner’ partitioning strategy for computing matrix products on two clusters, which has two advantages over existing strategies:
  – Reducing the Total Volume of Communication
  – Overlapping Communication and Computation

Goal
● Our goal is to determine whether the Square-Corner partitioning is applicable as the top-level partitioning of a hierarchical partitioning algorithm for parallel computations on three clusters
● To do so we model three clusters with three processors: a controllable, tunable environment
● A top-level partitioning across clusters would treat clusters as aggregates, or individual nodes
● After the top-level partitioning, clusters can perform local partitioning according to local architecture

Goal
The three-processor model is viable because, just as with clusters, local communications are often an order of magnitude faster than inter-processor/inter-cluster communications. We assume perfect computational load balance.

Background

Rectangular Partitionings for Matrix Multiplication
[Figures: two-node and three-node rectangular partitionings, each annotated with its Total Volume of Inter-Processor Communication (TVC); the TVC expressions were shown as images and are not reproduced in the transcript]
Arrows represent the data movement necessary for each node to compute its partial product. Each node ‘owns’ the correspondingly shaded partitions.
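Since the TVC expressions are not recoverable from the transcript, a brute-force count can at least illustrate what the TVC measures. In this sketch the column-wise partition shape, the toy 12×12 size, and the assumption that A, B, and C share one ownership map are illustrative choices, not the paper's:

```python
# Brute-force count of the Total Volume of Communication (TVC) for
# C = A x B on an n x n grid, given an ownership map over C.
# To compute C[i][j], its owner needs row i of A and column j of B;
# we charge one unit for every element of A or B it must fetch from
# another node.  A and B are assumed to be distributed exactly like C
# (an illustrative assumption, not the paper's setup).

def tvc(owner, n, num_nodes):
    total = 0
    for node in range(num_nodes):
        needed_a = set()          # (row, col) elements of A this node reads
        needed_b = set()          # (row, col) elements of B this node reads
        for i in range(n):
            for j in range(n):
                if owner[i][j] != node:
                    continue
                for k in range(n):
                    needed_a.add((i, k))
                    needed_b.add((k, j))
        # only elements owned by some other node are communicated
        total += sum(1 for (r, c) in needed_a if owner[r][c] != node)
        total += sum(1 for (r, c) in needed_b if owner[r][c] != node)
    return total

def column_partition(n, widths):
    """Ownership map for a vertical (column-wise) rectangular partition."""
    owner = [[0] * n for _ in range(n)]
    node, used = 0, 0
    for j in range(n):
        if j - used >= widths[node]:
            used += widths[node]
            node += 1
        for i in range(n):
            owner[i][j] = node
    return owner

n = 12
owner = column_partition(n, [6, 3, 3])   # widths proportional to speeds 2:1:1
print(tvc(owner, n, 3))                  # 288 for this toy example
```

With a single node owning everything the count is zero, which is a quick sanity check on the counter.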

Square-Corner Partitioning
[Figures: two-node and three-node Square-Corner partitionings]

The Square-Corner Partitioning: Reducing the total volume of communication (TVC)

Half-Perimeters and the Lower Bound
TVC is proportional to the sum of all partition perimeters. For simplicity we use the sum of the half-perimeters (P). The lower bound (L) of P is attained when all partitions are square.
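The bound itself appeared as an image in the slide. In the notation standard for this problem (a unit square partitioned into p rectangles of widths w_i, heights h_i, and areas s_i proportional to node speeds, with the s_i summing to 1), a reconstruction consistent with the slide's statement, though not a verbatim copy, is:

```latex
P \;=\; \sum_{i=1}^{p} (w_i + h_i),
\qquad
L \;=\; 2 \sum_{i=1}^{p} \sqrt{s_i}.
```

Since w_i h_i = s_i, the AM–GM inequality gives w_i + h_i ≥ 2√s_i for each partition, with equality exactly when w_i = h_i, i.e. when every partition is a square.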

Rectangular Partitioning
[Figures: two-node and three-node rectangular partitionings]

Square-Corner Partitioning
[Figures: two-node and three-node Square-Corner partitionings]

Restriction
In the Square-Corner partitioning, the squares cannot ‘overlap’, which imposes a restriction on the relative node speeds (the formula was shown as an image in the transcript).
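The restriction formula itself is not recoverable from the transcript. A reconstruction, offered as an assumption but consistent with the "greater than 2:1:1" boundary quoted on the conclusions slide, is that the two corner squares assigned to Nodes 2 and 3 must fit inside the unit square without overlapping:

```python
from math import sqrt

def squares_fit(s1, s2, s3):
    """Non-overlap check for the three-node Square-Corner partitioning.

    s1, s2, s3 are relative speeds (s1 >= s2 >= s3) normalised so that
    s1 + s2 + s3 == 1.  Nodes 2 and 3 each receive a corner square of
    area equal to their speed fraction, so the squares have sides
    sqrt(s2) and sqrt(s3); placed in opposite corners they overlap
    unless their sides sum to at most the side of the unit square.
    (Reconstructed restriction, not the slide's own formula.)
    """
    return sqrt(s2) + sqrt(s3) <= 1.0

print(squares_fit(0.50, 0.25, 0.25))   # ratio 2:1:1 sits exactly on the boundary
print(squares_fit(1/3, 1/3, 1/3))      # ratio 1:1:1 violates the restriction
```

Note that with S2 = S3 this check admits exactly the ratios of 2:1:1 and above, matching the range reported in the conclusions.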

Network Topology
With three nodes we have a choice of network topology: fully connected or linear array (non-wraparound). We must investigate each topology independently.

Fully Connected Network
When is the Square-Corner sum of half-perimeters (and therefore the TVC) less than that of the Rectangular partitioning on a fully connected network?
[Figure] The hatched area violates the stated restriction; the striped area is where Square-Corner has a lower TVC, at ratios of about 8:1:1 and greater.

Linear Array Network
[Figures: Rectangular and Square-Corner partitionings on a linear array]
Node 1 (the fastest node) is in the middle of the array.

Linear Array Network
When is the Square-Corner TVC less than that of the Rectangular partitioning on the linear array network when Node 1 is the middle node? For all power ratios, subject to the stated restriction.

The Square-Corner Partitioning: Overlapping Communication and Computation

Overlapping Communication and Computation
A sub-partition of Node 1’s C partition is immediately calculable: no communications are necessary to compute C1 = A1 × B1.
[Figure: Node 1’s partitions]
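The overlap can be sketched as a communication thread running alongside the local product. The thread structure, the simulated receive, and the toy A1/B1 values below are illustrative assumptions, not the paper's MPI implementation:

```python
import threading
import time

def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def fetch_remote_blocks(inbox):
    """Stand-in for the receives of the remote A/B sub-partitions."""
    time.sleep(0.1)                      # simulated network latency
    inbox["remote"] = ([[1, 0], [0, 1]], [[2, 0], [0, 2]])

# Node 1 already owns A1 and B1, so C1 = A1 x B1 needs no communication.
A1 = [[1, 2], [3, 4]]
B1 = [[5, 6], [7, 8]]

inbox = {}
comm = threading.Thread(target=fetch_remote_blocks, args=(inbox,))
comm.start()                             # communication proceeds ...
C1 = matmul(A1, B1)                      # ... while C1 is computed locally
comm.join()

A_r, B_r = inbox["remote"]
C_rest = matmul(A_r, B_r)                # remaining products need remote data
print(C1)                                # [[19, 22], [43, 50]]
```

The point of the sketch is only the ordering: the local sub-product starts before the communication completes, so its cost is hidden behind the transfer.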

Square-Corner Partitioning: Overlapping Communication and Computation
[Figure]

Results
MPI experiments: matrix-matrix multiplication
● N = 5000
● Bandwidth = 100 Mb/s
● Three identical nodes with CPU-limiting software (to achieve different power ratios)
● Node speed ratio expressed as S1:S2:S3; recall S1 ≥ S2 ≥ S3
● For simplicity, S2 = S3 and S1 + S2 + S3 = 100

Linear Array Communication Time
A lower Total Volume of Communication translates to lower communication times. Average reduction in communication time = 40%.

Linear Array Execution Time
Lower communication times result in lower execution times. Overlapping further reduces execution time.

Fully Connected Communication Time
A lower Total Volume of Communication translates to lower communication times at the expected ratios (above 8:1:1).

Fully Connected Execution Time
Overlapping further reduces execution time. Overlapping also broadens the range of ratios where the Square-Corner partitioning is faster, from >8:1:1 to >3:1:1.

Results
● Similar results were observed for different bandwidth values and power ratios (including when S2 ≠ S3)

Conclusions
● We successfully applied the Square-Corner partitioning to three nodes
● The Square-Corner partitioning approaches the theoretical lower bound for the TVC, unlike existing rectangular partitionings
● For a fully connected network, the Square-Corner partitioning reduces the TVC and the communication time when the power ratio is ~8:1:1 or greater
● For the linear array network it results in a lower TVC and communication time for all power ratios (due to the restriction, greater than 2:1:1)
● These lower communication times directly influence execution times

● The possibility of overlapping communication and computation brings further reductions in execution time
● Overlapping broadens the ratio range where the Square-Corner partitioning outperforms the Rectangular from >8:1:1 to >3:1:1
● The Square-Corner partitioning is viable for experimentation as the top-level partitioning of a hierarchical partitioning algorithm on three clusters

Future Work
● Apply the three-node algorithm to three clusters
● Investigate optimal overlapping of communications and computations
● Investigate 4+ nodes/clusters

Acknowledgements
This work was supported by: