Experiments with SmartGridSolve: Achieving Higher Performance by Improving the GridRPC Model
Thomas Brady, Michele Guidolin, Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin

Introduction

GridSolve
- Programming system for distributed computing
- Based on RPC

SmartGridSolve is an extension of GridSolve
- Aims to achieve higher performance
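To make "based on RPC" concrete, below is a minimal GridRPC client sketch in C, using the standard GridRPC calls that GridSolve implements (grpc_initialize, grpc_function_handle_default, grpc_call). The task name "ex_task", the config-file argument, and the argument list are hypothetical.

/* Minimal GridRPC client sketch. The task name "ex_task" and its
 * argument signature are hypothetical; real signatures come from the
 * tasks registered with the GridSolve agent. */
#include <stdio.h>
#include "grpc.h"

int main(int argc, char *argv[])
{
    grpc_function_handle_t handle;
    double in[100], out[100];

    if (grpc_initialize(argv[1]) != GRPC_NO_ERROR)   /* agent config file */
        return 1;

    /* Bind the handle to a task; the agent picks a server (mapping). */
    grpc_function_handle_default(&handle, "ex_task");

    /* Blocking call: inputs are sent, the task runs, outputs return. */
    if (grpc_call(&handle, in, out) != GRPC_NO_ERROR)
        fprintf(stderr, "remote call failed\n");

    grpc_function_handle_destruct(&handle);
    grpc_finalize();
    return 0;
}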

Motivation

Core aspects of GridSolve affecting application performance:
- Mapping: how tasks are assigned to servers
- Execution model: how tasks are executed in the distributed environment

Motivation: GridSolve Overview

GridSolve maps tasks individually on a star network.
- Mapping: each task is mapped individually
- Execution model: client-server, star network

Motivation: SmartGridSolve Overview

SmartGridSolve maps a group of tasks on a fully connected network.
- Mapping: a group of tasks is mapped collectively
- Execution model: fully connected network

Motivation: Performance Increase

Sources of the performance increase:
- Improved load balancing of computation
- Reduced volume of communication
- Improved load balancing of communication

Motivation: Performance Increase

GridSolve
- Tasks are mapped individually
- Slow servers may be assigned large tasks

SmartGridSolve improves load balancing of computation:
- Each task in the group is known prior to mapping
- The workload of the whole group is balanced across the servers

Motivation: Performance Increase

GridSolve
- Data transfers go over client links only
- Unnecessary data transfers
- Client links are heavily loaded

SmartGridSolve reduces the volume of communication:
- Data dependencies are known in advance
- Data transfers are reduced: bridge communication through the client is eliminated by direct server-server communication and by caching results on servers

SmartGridSolve improves load balancing of communication:
- Volumes of data transfers are known in advance
- Communication load is distributed more evenly (a short worked example follows)
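A hedged worked example of the volume saving: suppose task A on server 1 produces an N-byte result that task B on server 2 consumes. In the star model the result crosses the client link twice (server 1 → client → server 2), i.e. 2N bytes over the most loaded link. With direct server-server communication it moves once, N bytes, over a link the client does not share; and if B is instead mapped to server 1, caching the result there reduces the transfer to zero.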

Design: GridSolve Mapping

(Diagram) Star network discovery and task discovery (time, flops) feed the individual task mapping heuristic; a minimal sketch of such a heuristic follows.

Design: GridSolve Execution Model

(Diagram) The client maps the task, sends its input arguments to the chosen server, the server executes the task, and the client receives the output arguments.

Design: SmartGridSolve Mapping a Group of Tasks

(Diagram) The performance model of the fully connected network and the task graph are fed to one of a set of pluggable mapping heuristics (mapping heuristic 1, 2, …, N).

Design: SmartGridSolve Executing a Group of Tasks

(Diagram) The client maps the whole group; execution then uses direct server-server communication and caching of results on the servers.

Design: SmartGridSolve Extensions

(Diagram) Two pipelines side by side:
- Processes for mapping individual tasks on a star network (GridSolve): star network discovery and individual task discovery feed the mapping heuristic, which executes on a client-server execution model.
- Processes for mapping a group of tasks on a fully connected network (SmartGridSolve): fully connected network discovery and group-of-tasks discovery feed the mapping heuristic, which executes on a communication-enabled execution model.

Design: Network Discovery

Star network discovery:
- Static performance (LINPACK benchmark)
- Dynamic performance (CPU load)
- Dynamic client-server bandwidth

Fully connected network discovery adds:
- Dynamic server-server bandwidth

A sketch of the two resulting models follows.
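A hedged sketch of the data these two discovery phases could produce; the type and field names are illustrative, not GridSolve internals. The star model needs only per-server speed and client-server bandwidth; the fully connected model adds a server-server bandwidth matrix, needed to cost direct transfers and caching.

#define MAX_SERVERS 64

typedef struct {
    int    n;                       /* number of registered servers        */
    double speed[MAX_SERVERS];      /* LINPACK score scaled by CPU load    */
    double cs_bw[MAX_SERVERS];      /* client-server bandwidth (star)      */
} star_model_t;

typedef struct {
    star_model_t star;                       /* everything above, plus:    */
    double ss_bw[MAX_SERVERS][MAX_SERVERS];  /* server-server bandwidth    */
} full_model_t;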

Design: GridSolve Task Discovery

Static parameters:
- Number of arguments
- Argument direction (input, output)
- Argument kind (scalar/non-scalar)
- Argument object types (matrix, vector, …)
- Argument data types (int, double, …)
- Function for complexity (flops)

Run-time parameters:
- Dimensions of arguments, i.e. the variables of the complexity function

Design: GridSolve Task Discovery

- Static parameters are discovered when the server registers the task
- Run-time parameters are discovered when the task is called

(Diagram) Per-call record: complexity, input argument sizes, output argument sizes. A hedged sketch of such a task description follows.
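A hedged sketch in C of a task description assembled from these parameters; field names are illustrative, not GridSolve's internal types. Static fields are filled at registration, run-time fields (the actual dimensions) when the task is called.

typedef enum { ARG_IN, ARG_OUT, ARG_INOUT } arg_mode_t;
typedef enum { OBJ_SCALAR, OBJ_VECTOR, OBJ_MATRIX } arg_object_t;
typedef enum { TYPE_INT, TYPE_DOUBLE } arg_data_t;

typedef struct {
    arg_mode_t   mode;       /* input, output, or both                 */
    arg_object_t object;     /* scalar / vector / matrix               */
    arg_data_t   type;       /* element type                           */
    int          rows, cols; /* run-time: actual dimensions            */
} arg_desc_t;

typedef struct {
    const char *name;        /* task name as registered                */
    int         nargs;
    arg_desc_t *args;
    /* estimated flops as a function of the argument dimensions */
    double    (*complexity)(const arg_desc_t *args, int nargs);
} task_desc_t;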

Design: SmartGridSolve Group of Tasks Discovery

Build a task graph before any of the tasks are called.

Design: SmartGridSolve Group of Tasks Discovery

GridSolve
- Discovery, mapping, and execution are one atomic operation

SmartGridSolve
- To map a group of tasks, discovery and mapping are separated from the execution of the tasks

Addition to the GridSolve API:

gs_smart_map("ex_map") {
    /* group of tasks */
}

Application: Hydropad

A real-world astrophysics application that simulates the clustering of galaxies from the Big Bang to the present.

Internal Structure

Hydropad consists of four parts:
- Initialisation
- Gravitation (FFT)
- Dark matter (N-body)
- Baryonic matter (PPM)

Individual Mapping

With plain GridSolve, each grpc_call (or grpc_call_async) is one atomic discover → map → execute step, so the cycle repeats for every task on every iteration of the evolution loop:

for (i = 0; i < nb_evolutions; i++) {
    ...
    grpc_call(grav_hndl, phiold, ...);
    grpc_call_async(dark_hndl, &sid_dark, x3dm, ...);
    grpc_call_async(bary_hndl, &sid_bary, v3bm, ...);
    /* wait for non-blocking calls to finish */
    grpc_wait(sid_dark);
    grpc_wait(sid_bary);
    ...
}

Group Mapping

With SmartGridSolve, the loop is wrapped in gs_smart_map. The block is executed twice: a discovery pass records the calls without executing them, building a task graph (diagram) and a network performance model (diagram); the whole group is then mapped, and only on the execution pass do the tasks actually run. A hedged sketch of a task-graph node follows.

gs_smart_map("ex_map") {
    for (i = 0; i < nb_evolutions; i++) {
        ...
        grpc_call(grav_hndl, phiold, ...);
        grpc_call_async(dark_hndl, &sid_dark, x3dm, ...);
        grpc_call_async(bary_hndl, &sid_bary, v3bm, ...);
        /* wait for non-blocking calls to finish */
        grpc_wait(sid_dark);
        grpc_wait(sid_bary);
        ...
    }
}
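A hedged sketch of what a node of that task graph could record; the names are illustrative, not SmartGridSolve's internal types. Each node carries the task's estimated costs and its data dependencies, so the mapper sees the whole group before anything runs.

typedef struct task_node {
    const char        *task_name;  /* e.g. "grav", "dark", "bary"         */
    double             flops;      /* estimated computation cost          */
    double             in_bytes;   /* total input volume                  */
    double             out_bytes;  /* total output volume                 */
    int                ndeps;
    struct task_node **deps;       /* tasks whose outputs this one reads  */
    int                server;     /* filled in by the mapping heuristic  */
} task_node_t;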

GridSolve Mapping of Evolution Stage

(Diagram) Tasks barmatter (7), fields (5), and darkmatter (1) mapped one at a time onto servers hcl08 (3) and hcl06 (1): poor load balancing and forced bridge communication through the client.

SmartGridSolve Mapping of Evolution Stage

(Diagram) The same tasks mapped as a group onto hcl08 (3) and hcl06 (1): improved load balancing, reduced communication volume, direct server-server communication, and caching.

Results

Published in the paper:
- ~2 times speedup (client and servers on a local network)

Results

New enhancements:
- Client and server broadcast
- Asynchronous communication

With these enhancements:
- ~5 times speedup (client and servers on a local network)
- ~13 times speedup (client and servers on remote networks)

Conclusion

SmartGridSolve improves performance through:
- Improved load balancing of computation
- Reduced volume of communication
- Improved load balancing of communication

Future Work

- Fault tolerance
- Improved synchronisation of tasks
- Functional performance model
- ADL language and compiler