1 Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication
Martin Rinard, Laboratory for Computer Science, Massachusetts Institute of Technology
Pedro Diniz, Information Sciences Institute, University of Southern California

2 Context
[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization]

3 Context
[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization]
Basic Idea: View computation as atomic operations on objects. If all pairs of operations in a given phase commute (generate the same final result in both execution orders), the compiler generates parallel code.

4 Context
[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization → Synchronization Optimization (Lock Coarsening, Adaptive Replication) → Optimized Parallel Program with Mutual Exclusion Synchronization and Data Replication]

5 Outline
Example
Model of Computation
Basic Issues
Interaction with Lock Coarsening
Experimental Results
Conclusion

6 Example
[Figure: example graph of nodes with edge weights for the traversal]

7 Example
[Figure: the completed example traversal; the updates accumulate to a final sum of 14 at the contended node]

8 Outline of Algorithm
Graph Traversal:
Acquire Lock in Object
Update Sum
Release Lock
In Parallel, Recursively Traverse Left Child and Right Child

9 Parallel Program

class node {
  lock mutex;
  node *left, *right;
  int left_weight;
  int right_weight;
  int sum;
};

void node::traverse(int weight) {
  mutex.acquire();
  sum += weight;        // update protected by the object's lock
  mutex.release();
  if (left != NULL) spawn left->traverse(left_weight);
  if (right != NULL) spawn right->traverse(right_weight);
}

10-22 Example (animation)
[Figures: successive frames of the parallel traversal of the example graph; the sum at the contended node grows 2, 3, 4, 6, 8, 9, 12, 14 as the updates arrive.]

23 Synchronization Bottleneck
Lots of Updates to One Object
Because of Mutual Exclusion, Updates Execute Sequentially
Processors Spend Time Waiting to Acquire the Lock in the Object
Performance Suffers

24 Solution in Example
Replicate Object that Causes Bottleneck
Give Each Processor Its Own Local Copy
Each Processor Updates Local Copy
Combine Copies at End of Parallel Phase
[Figure: the example graph with the contended node marked "Replicate This Object"]
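A minimal hand-written sketch of this idea (illustration only; the compiler-generated version appears on slides 41-43). Each thread accumulates into its own copy of the sum, and the copies are combined in a serial step; the values below reproduce the per-processor partial sums 3, 3, 3, 5 from the animation on the next slides.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int P = 4;                              // four processors
    std::vector<int> work = {2, 1, 2, 3, 1, 2, 1, 2};
    std::vector<int> local(P, 0);                 // one replica of sum per thread
    std::vector<std::thread> threads;
    for (int p = 0; p < P; p++)
        threads.emplace_back([&, p] {
            for (size_t i = p; i < work.size(); i += P)
                local[p] += work[i];              // update local copy: no lock
        });
    for (auto& t : threads) t.join();
    int sum = 0;                                  // serial phase: combine copies
    for (int p = 0; p < P; p++) sum += local[p];
    std::cout << sum << "\n";                     // prints 14
    return 0;
}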

25-28 Example with Four Processors (animation)
[Figures: processors 0-3 each get a local copy of the contended node. After the first number is added the copies hold 2, 1, 2, 3; after the second they hold 3, 3, 3, 5; combining the copies gives the final result 3 + 3 + 3 + 5 = 14.]

29 Goal: Automate Technique of Replicating Objects to Eliminate Synchronization Bottlenecks

30 Object-Based Model of Computation
Objects
Instance variables (left, right, sum, …) represent state of each object
Operations on Receiver Objects
In example, traverse is an operation
Updated graph node is receiver object
Operation Execution
Updates instance variables in receiver
Invokes other operations

31 Execution of Application
Consists of an Alternating Sequence of Serial Phases and Parallel Phases
[Diagram: Serial Phase → Parallel Phase → Serial Phase → Parallel Phase → Serial Phase]

32 Operations in Parallel Phases
Instance variable updates execute atomically
Each object has mutual exclusion lock
Lock acquired before updates
Lock released after updates
Invoked operations execute in parallel

33 Legality of Replicating Objects
Is it always legal to replicate objects? No.
All updates to an object in the parallel phase must be replicatable.
Updates of the form v = v + exp are replicatable, where + is a commutative, associative operator with a zero, and the variables in exp are not updated during the parallel phase.
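Some hypothetical examples of this condition (not from the slides); the first two updates are replicatable, the last two are not:

sum   = sum + weight;      // replicatable: + is commutative, associative, zero 0
total = total * factor;    // replicatable: * qualifies too (identity 1, so
                           // replicas must be initialized to 1, not 0)
last  = weight;            // not replicatable: plain overwrite, order-dependent
sum   = sum + other->sum;  // not replicatable: exp reads a variable that is
                           // updated during the parallel phase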

34 Which Objects to Replicate?
Why Not Just Replicate All Replicatable Objects?
Some Objects Don’t Cause Bottlenecks
Replication Overhead
Space for Copies
Time to Create and Initialize Copies
Goal
Identify Objects With High Contention
Replicate Only Those Objects

35 Basic Approach
Dynamically Measure Contention At Each Object
If Contention Is High
Replicate Object (Dynamically)
Perform Update on Local Copy
If No Contention
Perform Update on Original Object
Pay Replication Overhead Only When There is a Payoff in Parallelism

36 Details
What is the replication policy? A processor attempts to acquire the lock and creates a local copy only if the acquire fails.
Where are replicas stored? In a hash table.
Can't the space overhead grow too large? No. Impose a space limit: if a replication would exceed the limit, don't replicate the object; wait for the lock instead.
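The slides use lookup and insert helpers without defining them. A minimal per-processor sketch, assuming one table per processor keyed by the address of the original object (names and layout are illustrative, not the paper's actual runtime):

#include <cstddef>
#include <unordered_map>

struct node;                                   // the class from slide 9

struct ReplicaTable {
    std::unordered_map<node*, node*> map;      // original -> local replica
    std::size_t allocated = 0;                 // bytes consumed by replicas
    std::size_t limit = 1 << 20;               // space limit on replication

    node* lookup(node* original) {             // existing local copy, if any
        auto it = map.find(original);
        return it == map.end() ? nullptr : it->second;
    }
    void insert(node* original, node* replica) {
        map[original] = replica;               // record copy for later combining
    }
};

// One table per processor, so no locking is needed on the table itself.
thread_local ReplicaTable replicas;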

37-40 More Details (animation)
What happens at end of parallel phase?
Generated code traverses hash tables
Finds replicas
Combines contributions into original objects
Deallocates replicas
[Figures: the hash tables of processors 0 and 1 point at replicas holding 6 and 8; the contributions are combined into the original object (6 + 8 = 14) and the replicas are freed.]
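A sketch of this combining step under the same assumptions as the ReplicaTable above (illustrative; the slides do not show the generated combining code). It assumes the node class from slide 9, whose replicatable field is sum:

void combine_phase(ReplicaTable& table) {
    for (auto& [original, replica] : table.map) {
        original->sum += replica->sum;  // fold contribution into the original
        delete replica;                 // deallocate the replica
    }
    table.map.clear();
    table.allocated = 0;                // reset the space accounting
}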

41 Generated Code

void node::traverse(int weight) {
  node *replica = lookup(this);            // Check for existing local copy
  if (replica) {
    replica->replicaTraverse(weight);      // Update existing copy
  } else if (mutex.tryAcquire()) {         // No copy; try to acquire lock
  original:
    sum += weight;                         // Perform update on original object
    mutex.release();
    if (left != NULL) spawn left->traverse(left_weight);
    if (right != NULL) spawn right->traverse(right_weight);
  } else {                                 // No copy, failed to acquire lock
    replica = this->replicate();           // Try to replicate object
    if (replica)
      replica->replicaTraverse(weight);    // Update new copy
    else {                                 // Replication failed (space limit):
      mutex.acquire();                     // wait for lock, update original
      goto original;
    }
  }
}

42 Updating A Replica

void node::replicaTraverse(int weight) {
  sum += weight;        // No lock: this copy is processor-local
  if (left != NULL) spawn left->traverse(left_weight);
  if (right != NULL) spawn right->traverse(right_weight);
}

Updates Execute Without Synchronization

43 Replicating An Object

node *node::replicate() {
  // Check to see if space limit exceeded
  if (allocated + sizeof(node) > limit) return NULL;
  // Allocate new copy
  node *replica = new node;
  allocated += sizeof(node);
  // Zero out updated fields
  replica->sum = 0;
  // Copy other fields
  replica->left = left;
  replica->left_weight = left_weight;
  replica->right = right;
  replica->right_weight = right_weight;
  // Insert replica into hash table
  insert(this, replica);
  return replica;
}

44 Adaptive Replication Summary
Static Analysis to Discover Replicatable Objects
Dynamic Measurement of Contention to Determine Which Objects to Replicate
Generated Code
Measures Contention
Replicates Objects
Updates Original and Replica Objects
Combines Results in Replicas Back Into Original Objects

45 Lock Coarsening

Before:
  obj.mutex.acquire();
  update obj
  obj.mutex.release();
  unsynchronized computation
  obj.mutex.acquire();
  update obj
  obj.mutex.release();

After:
  obj.mutex.acquire();
  update obj
  unsynchronized computation
  update obj
  obj.mutex.release();

46 Lock Coarsening

Before:
  while (c) {
    unsynchronized computation
    obj.mutex.acquire();
    update obj
    obj.mutex.release();
  }

After:
  obj.mutex.acquire();
  while (c) {
    unsynchronized computation
    update obj
  }
  obj.mutex.release();
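A concrete C++ rendering of the loop transformation above (hypothetical example using std::mutex; the slides use abstract pseudocode):

#include <mutex>

struct Obj { std::mutex mutex; long total = 0; };

// Before coarsening: one acquire/release pair per iteration.
void before(Obj& obj, const int* data, int n) {
    for (int i = 0; i < n; i++) {
        int v = data[i] * data[i];  // unsynchronized computation
        obj.mutex.lock();
        obj.total += v;             // update obj
        obj.mutex.unlock();
    }
}

// After coarsening: one acquire/release pair for the whole loop. Less lock
// overhead, but the critical section now also covers the unsynchronized
// work, which is the serialization risk discussed on the next slide.
void after(Obj& obj, const int* data, int n) {
    obj.mutex.lock();
    for (int i = 0; i < n; i++) {
        int v = data[i] * data[i];
        obj.total += v;
    }
    obj.mutex.unlock();
}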

47 Lock Coarsening Tradeoffs
Advantage: Fewer Executed Lock Constructs (Acquires, Releases), Less Lock Overhead
Disadvantage: Critical Sections Larger
May Cause Additional Serialization
In Some Cases, Completely Serializes Parallel Phase

48 Lock Coarsening Tradeoffs With Adaptive Replication
Advantages:
Fewer Executed Lock and Replication Constructs (Replica Lookups, Lock Acquires and Releases)
Less Lock and Replication Overhead
No Additional Serialization
Disadvantage: Potential For Increased Memory Usage

49 Result
Automatically Generated Code That Replicates Objects to Eliminate Synchronization Bottlenecks
Replication Policy Dynamically Adapts to the Amount of Contention for Each Object on Each Processor
Lock Coarsening Plus Adaptive Replication Increases Granularity and Reduces Overhead Without Increasing Serialization

50 Experimental Results
Prototype Implementation
In Context of Parallelizing Compiler
Commutativity Analysis
Lock Coarsening, Adaptive Replication
Four Versions
Adaptive Replication, Lock Coarsening
Adaptive Replication, No Lock Coarsening
No Replication, Best Lock Coarsening
Full Replication, Lock Coarsening

51 Applications and Hardware Platform
Three Applications: Water, Barnes-Hut, String
Hardware Platform
SGI Challenge XL, R4400 MIPS Processors
IRIX Operating System, Version 6.2
MipsPro Compiler, Version 7.1

52-60 Experimental Results (graphs)
[Speedup, time breakdown, and peak memory graphs for Water, Barnes-Hut, and String. Each graph compares four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication without Lock Coarsening; No Replication (without Lock Coarsening for Water and String, with Lock Coarsening for Barnes-Hut); Always Replicate with Lock Coarsening.]

61 Related Work
Reduction Analysis for Loop Nests
Pinter and Pinter (POPL 91)
Fisher and Ghuloum (PLDI 94)
Callahan (LCPC 91)
Hall, Amarasinghe, Murphy, Liao, and Lam (Supercomputing 95)
Replication for Concurrent Reads (Caching)

62 Conclusion
Basic Idea: Replicate Objects to Eliminate Synchronization Bottlenecks
Adaptive: Dynamically Identifies and Replicates High-Contention Objects Only
Synergistic Interaction with Lock Coarsening
Robust: Enables Good Performance Without Running Risk of Excessive Memory Consumption or Run-Time Overhead
Algorithm for Analysis and Transformation of Explicitly Parallel Programs

63 Context
Commutativity Analysis (IPPS 96, PLDI 96)
Semantic Foundations (EuroPar 96)
Lock Optimizations
Lock Coarsening (LCPC 96)
General Transformations (POPL 97)
Dynamic Feedback (PLDI 97)
Optimistic Synchronization (PPoPP 97)
Adaptive Replication (ICS 99)

64 Maximum Speedup Comparison
[Graph]

