War of the Worlds -- Shared-memory vs. Distributed-memory
- In the distributed world, we have heavyweight processes (nodes) rather than threads
- Nodes communicate by exchanging messages
  – We do not have shared memory
- Communication is much more expensive
  – Sending a message takes much more time than sending data through a channel
  – Communication costs are possibly non-uniform
- We only have 1-to-1 communication (no many-to-many channels)
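To make the message-passing model concrete, here is a minimal point-to-point exchange written with the MPIGAP calls that appear later in these slides (MPI_Binsend, MPI_Probe, MPI_Get_count, MPI_Recv, UNIX_MakeString and the serialization helpers). Treating the global processId (used later in GetWork) as the node's rank is an assumption; this is a sketch, not code from the talk.

# A minimal sketch of 1-to-1 message passing between two MPI nodes, built
# only from the MPIGAP calls shown later in these slides.  Assumes that the
# global processId (used in GetWork) holds this node's rank.
PingPong := function()
  local msg, data;
  if processId = 0 then
    # Node 0 serializes a small list of points and sends it to node 1.
    msg := SerializeToNativeString([1,2,3]);
    MPI_Binsend(msg, 1, Length(msg));
  elif processId = 1 then
    # Node 1 blocks until the message arrives, then unpacks it.
    MPI_Probe();
    msg := UNIX_MakeString(MPI_Get_count());
    MPI_Recv(msg);
    data := DeserializeNativeString(msg);
    Print("node 1 received ", data, "\n");
  fi;
end;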

Initial Distributed-memory Settings
- We consider settings where there is no multithreading within a single MPI node
- We consider systems where the communication latency between different nodes is
  – Low
  – Uniform

Good Shared Memory Orbit Version
[Diagram: worker threads apply the generators f1, ..., f5 to points taken from a shared task pool [x1, ..., xm]; hash server threads 1-3 each hold a fragment O1, O2, O3 of the hash table, e.g. {z1, z4, z5, ...}, {z2, z3, z8, ...}, {z6, z7, z9, ...}]

Why is this version hard to port to MPI?
- Single task pool!
  – Requires a shared structure to which all of the hash servers write data and from which all of the workers read data
- Not easy to implement using MPI, where we only have 1-to-1 communication
- We could have a dedicated node that holds the task queue
  – Workers send messages to it to request work
  – Hash servers send messages to it to push work
  – This would make that node a potential bottleneck and would involve a lot of communication

MPI Version 1
- Maybe merge workers and hash servers?
- Each MPI node acts both as a hash server and as a worker
- Each node has its own task pool
- If a node's task pool is empty, the node tries to steal work from some other node
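A rough sketch of how such a merged node could be organised is shown below. The helpers HandleIncomingPoints, TrySteal and LocalLookupAndEnqueue are hypothetical placeholders, not functions from the actual MPIGAP code; the sketch only illustrates how the hash-server and worker roles interleave in a single node.

# Sketch only: HandleIncomingPoints, TrySteal and LocalLookupAndEnqueue are
# hypothetical helpers; the real MPI Version 1 code is not shown on the slides.
MergedNode := function(gens, op)
  local pool, pt, g, x;
  pool := [];                           # this node's private task pool
  while true do
    HandleIncomingPoints(pool);         # hash-server role: receive points from
                                        # other nodes, drop duplicates, queue new ones
    if Length(pool) = 0 then
      if not TrySteal(pool) then        # try to steal work from some other node
        return;                         # nothing left anywhere (termination omitted)
      fi;
    fi;
    pt := Remove(pool);                 # worker role: expand one point
    for g in gens do
      x := op(pt, g);
      LocalLookupAndEnqueue(x, pool);   # look x up in the (distributed) hash table;
                                        # if it is new, it becomes a task somewhere
    od;
  od;
end;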

MPI Version 1
[Diagram: three MPI nodes; each holds a fragment of the hash table {z...}, its own task pool [x...], and itself applies the generators f1, ..., f5]

MPI Version 1 is Bad!
- Bad performance, especially for smaller numbers of nodes
- The same process both does hash-table lookups and applies generator functions to points
  – It cannot do both at the same time => something has to wait
  – This creates contention

MPI Version 2
- Separate hash servers and workers, after all
- Hash server nodes
  – Keep parts of the hash table
  – Also keep parts of the task pool
- Worker nodes just apply generators to points
- Workers obtain work from hash server nodes using work stealing

MPI Version 2
[Diagram: worker nodes apply the generators f1, ..., f5 to their point lists [x...]; hash server nodes hold the hash-table fragments O1, O2, O3 together with the task pools T1, T2, T3]

MPI Version 2
- Much better performance than MPI Version 1 (on low-latency systems)
- The key is that hash lookups and the application of generators to points are separated onto different nodes

Big Issue with MPI Versions 1 and 2 -- Detecting Termination!
- We need to detect the situation where all of the hash server nodes have empty task pools and no new work will be produced by the hash servers!
  – Even detecting that all task pools are empty and all hash servers and workers are idle is not enough, as there may be messages still in flight that will create more work!
  – Woe unto me! What are we to do?
- Good ol' Dijkstra comes to the rescue - we use a variant of the Dijkstra-Scholten termination detection algorithm

Termination Detection Algorithm
- Each hash server keeps two counters
  – Number of points sent (my_nr_points_sent)
  – Number of points received (my_nr_points_rcvd)
- We enumerate the hash servers H_0, ..., H_n
- Hash server H_0, when idle, sends a token to hash server H_1
  – It attaches a token count (my_nr_points_sent, my_nr_points_rcvd) to the token
- When a hash server H_i receives the token
  – If it is active (has tasks in its task pool), it sends the token back to H_0
  – If it is idle, it increases each component of the count attached to the token and sends the token to H_{i+1}
  – If the received token count was (pts_sent, pts_rcvd), the new token count is (my_nr_points_sent + pts_sent, my_nr_points_rcvd + pts_rcvd)
- If H_0 receives the token and the token count (pts_sent, pts_rcvd) satisfies pts_rcvd = num_gens * pts_sent, then termination is detected
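A sketch of the token handling at a hash server, in the style of the MPIGAP code on the following slides. The assumption that hash servers occupy the consecutive ranks minHashId .. minHashId+nrHashes-1 (suggested by the Worker code), the globals nrGens, myNrPointsSent and myNrPointsRcvd, and the BroadcastFinish helper are all introduced here for illustration; this is not the actual implementation.

# Sketch of the token handling at a hash server.  Assumptions: hash servers
# occupy the consecutive ranks minHashId .. minHashId+nrHashes-1, processId
# is this node's rank, nrGens is the number of generators, myNrPointsSent /
# myNrPointsRcvd are this server's counters, and BroadcastFinish is a
# hypothetical helper that sends the "finish" message (seen in GetWork)
# to all nodes.
HandleToken := function(token, amIdle)
  local ptsSent, ptsRcvd, next;
  ptsSent := token[2];
  ptsRcvd := token[3];
  if processId = minHashId then
    # H_0 got the token back: terminate iff every point that was sent has
    # been received num_gens times, so nothing is queued or still in flight.
    if ptsRcvd = nrGens * ptsSent then
      BroadcastFinish();
    fi;
  elif not amIdle then
    # An active server (non-empty task pool) returns the token to H_0.
    OrbSendMessage(["token", ptsSent, ptsRcvd], minHashId);
  else
    # An idle server adds its own counters and forwards the token to H_{i+1}
    # (wrapping around to H_0 after the last server).
    next := minHashId + ((processId - minHashId + 1) mod nrHashes);
    OrbSendMessage(["token", ptsSent + myNrPointsSent,
                             ptsRcvd + myNrPointsRcvd], next);
  fi;
end;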

MPIGAP Code for MPI Version 2
- Not trivial (~400 lines of GAP code)
- Explicit message passing using the low-level MPI bindings
  – This version is hard to implement using the task abstraction

MPIGAP Code for MPI Version 2

Worker := function(gens,op,f)
  local g,j,n,m,res,t,x;
  n := nrHashes;
  while true do
    t := GetWork();                            # ask a hash server for a batch of points
    if IsIdenticalObj(t, fail) then return; fi;
    m := QuoInt(Length(t)*Length(gens)*2, n);
    res := List([1..n], x -> EmptyPlist(m));   # one output buffer per hash server
    for j in [1..Length(t)] do
      for g in gens do
        x := op(t[j], g);                      # apply generator g to point t[j]
        Add(res[f(x)], x);                     # f(x) picks the destination hash server
      od;
    od;
    for j in [1..n] do
      if Length(res[j]) > 0 then
        OrbSendMessage(res[j], minHashId+j-1); # send new points to their hash server
      fi;
    od;
  od;
end;

MPIGAP Code for MPI Version 2

GetWork := function()
  local msg, tid;
  tid := minHashId;                            # ask the first hash server for work
  OrbSendMessage(["getwork", processId], tid); # request work, identifying ourselves
  msg := OrbGetMessage(true);                  # block until the reply arrives
  if msg[1] <> "finish" then
    return msg;                                # a list of points to process
  else
    return fail;                               # termination has been detected
  fi;
end;

MPIGAP Code for MPI Version 2

OrbGetMessage := function(blocking)
  local test, msg, tmp;
  if blocking then
    test := MPI_Probe();                       # wait until a message is available
  else
    test := MPI_Iprobe();                      # just check, do not block
  fi;
  if test then
    msg := UNIX_MakeString(MPI_Get_count());   # buffer of the right size
    MPI_Recv(msg);
    tmp := DeserializeNativeString(msg);
    return tmp;
  else
    return fail;
  fi;
end;

OrbSendMessage := function(raw,dest)
  local msg;
  msg := SerializeToNativeString(raw);
  MPI_Binsend(msg, dest, Length(msg));
end;
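The hash-server side of MPI Version 2 is not shown on these slides. A rough sketch of its receive loop, assuming the Orb package's HTValue/HTAdd hash-table operations and the "getwork"/"finish" protocol used by GetWork above, could look as follows; replying "finish" whenever the pool is momentarily empty is a simplification, since the real code does so only after termination has been detected.

# Rough sketch of a hash-server node's main loop for MPI Version 2 (not the
# actual implementation).  HTValue/HTAdd are assumed to be the hash-table
# operations of the GAP Orb package; everything else reuses helpers from
# these slides.
HashServer := function(ht)
  local pool, msg, x;
  pool := [];
  while true do
    msg := OrbGetMessage(true);                # block for the next message
    if IsList(msg) and Length(msg) = 2 and msg[1] = "getwork" then
      # A work request from worker msg[2].
      if Length(pool) > 0 then
        OrbSendMessage(Remove(pool), msg[2]);  # hand over a batch of points
      else
        OrbSendMessage("finish", msg[2]);      # simplification: the real code
                                               # replies "finish" only once
                                               # termination has been detected
      fi;
    else
      # A batch of freshly generated points from some worker: keep only the
      # points we have not seen before and queue them as new work.
      for x in msg do
        if HTValue(ht, x) = fail then
          HTAdd(ht, x, true);
          Add(pool, [x]);
        fi;
      od;
    fi;
  od;
end;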

Work in Progress - Extending MPI Version 2 to Systems with Non-Uniform Latency
- Communication latencies between nodes might differ
- Where to place hash server nodes? And how many?
- How to do work distribution?
  – Is work stealing still a good idea in a setting where the communication distance between a worker and the different hash servers is not uniform?
- We can look at the shared memory + MPI world as a special case of this
  – Multithreading within MPI nodes
  – Threads on the same node can communicate fast
  – Nodes communicate much more slowly