
1 Analyses and Optimizations for Multithreaded Programs Martin Rinard, Alex Salcianu, Brian Demsky MIT Laboratory for Computer Science John Whaley IBM Tokyo Research Laboratory

2 Motivation
- Threads are Ubiquitous
  - Parallel Programming for Performance
  - Manage Multiple Connections
  - System Structuring Mechanism
- Overhead
  - Thread Management
  - Synchronization
- Opportunities
  - Improved Memory Management

3 What This Talk is About
- New Abstraction: Parallel Interaction Graph
  - Points-To Information
  - Reachability and Escape Information
  - Interaction Information (Caller-Callee Interactions, Starter-Startee Interactions)
  - Action Ordering Information
- Analysis Algorithm
- Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)

4 Outline
- Example
- Analysis Representation and Algorithm
- Lightweight Threads
- Results
- Conclusion

5 Sum Sequence of Numbers: 9 8 1 5 3 7 2 6

6 Group in Subsequences: (9 8) (1 5) (3 7) (2 6)

7 Sum Subsequences (in Parallel): 9+8 = 17, 1+5 = 6, 3+7 = 10, 2+6 = 8

8-12 Add Sums Into Accumulator: as the four sums (17, 6, 10, 8) are added, the accumulator goes 0, 17, 23, 33, 41

13 Common Schema
- Set of tasks
- Chunk tasks to increase granularity
- Tasks have both
  - Independent computation
  - Updates to shared data

14 Realization in Java

class Accumulator {
    int value = 0;
    synchronized void add(int v) { value += v; }
}

15 Realization in Java

class Task extends Thread {
    Vector work;
    Accumulator dest;
    Task(Vector w, Accumulator d) { work = w; dest = d; }
    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }
}

[Diagram: a Task whose work field points to a Vector (holding 6, 2) and whose dest field points to an Accumulator (value 0)]

16 Realization in Java

[Same code as slide 15; the diagram adds an Enumeration node created by work.elements() inside run]

17 Realization in Java

void generateTask(int l, int u, Accumulator a) {
    Vector v = new Vector();
    for (int j = l; j < u; j++)
        v.addElement(new Integer(j));
    Task t = new Task(v, a);
    t.start();
}

void generate(int n, int m, Accumulator a) {
    for (int i = 0; i < n; i++)
        generateTask(i*m, (i+1)*m, a);
}
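The fragments on slides 14-17 can be assembled into one runnable program. A sketch follows; the generics, the returned Task handles, and the join are additions so the final total can be observed (they are not on the slides), and the chunking bound is written as (i+1)*m so that task i sums the range [i*m, (i+1)*m).

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Vector;

public class SumDemo {
    static class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }

    static class Task extends Thread {
        Vector<Integer> work;
        Accumulator dest;
        Task(Vector<Integer> w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration<Integer> e = work.elements();
            while (e.hasMoreElements())
                sum += e.nextElement().intValue();
            dest.add(sum);
        }
    }

    static Task generateTask(int l, int u, Accumulator a) {
        Vector<Integer> v = new Vector<>();
        for (int j = l; j < u; j++) v.addElement(j);
        Task t = new Task(v, a);
        t.start();
        return t;
    }

    // n tasks of m numbers each; task i sums [i*m, (i+1)*m)
    static int generate(int n, int m) {
        Accumulator a = new Accumulator();
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++)
            tasks.add(generateTask(i * m, (i + 1) * m, a));
        for (Task t : tasks) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return a.value;
    }

    public static void main(String[] args) {
        System.out.println(generate(4, 5)); // sums 0..19
    }
}
```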

18-25 Task Generation

[Animation: generateTask repeatedly builds a Vector of Integers and a Task whose work field points to the Vector and whose dest field points to the shared Accumulator (value 0). Successive frames add the first Vector's contents (6, 2) and its Task, a second Vector (9, 8) with its Task, and a third Vector (5, 1) with its Task, all sharing the one Accumulator]

26 Analysis

27 Analysis Overview
- Interprocedural
- Interthread
- Flow-sensitive
  - Statement ordering within thread
  - Action ordering between threads
- Compositional, Bottom Up
- Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
- Partial Program Analysis

28 Analysis Result for run Method

public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
        sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
}

Abstraction: Points-to Graph
- Nodes Represent Objects
- Edges Represent References

[Graph: this points to the Task; work points to the Vector; dest points to the Accumulator; an Enumeration node]

29 Analysis Result for run Method (same code)

Inside Nodes
- Objects Created Within Current Analysis Scope
- One Inside Node Per Allocation Site
- Represents All Objects Created At That Site

30 Analysis Result for run Method (same code)

Outside Nodes
- Objects Created Outside Current Analysis Scope
- Objects Accessed Via References Created Outside Current Analysis Scope

31 Analysis Result for run Method (same code)

Outside Nodes
- One per Static Class Field
- One per Parameter
- One per Load Statement (Represents Objects Loaded at That Statement)

32 Analysis Result for run Method (same code)

Inside Edges
- References Created Inside Current Analysis Scope

33 Analysis Result for run Method (same code)

Outside Edges
- References Created Outside Current Analysis Scope
- Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part

34 Concept of Escaped Node
Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope:
- parameter nodes, load nodes
- static class field nodes
- nodes passed to unanalyzed methods
- nodes reachable from unanalyzed but started threads
- nodes reachable from escaped nodes
A Node is Captured if it is Not Escaped

35 Why the Escaped Concept is Important
- Completeness of Analysis Information
  - Complete information for captured nodes
  - Potentially incomplete for escaped nodes
- Lifetime Implications
  - Captured nodes are inaccessible once the analyzed part of the program terminates
- Memory Management Optimizations
  - Stack allocation
  - Per-thread heap allocation
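To make the captured/escaped distinction concrete, here is a toy example (the class and method names are illustrative, not from the talk; the comments state what the analysis would conclude):

```java
import java.util.Enumeration;
import java.util.Vector;

public class EscapeExample {
    static Vector<Integer> shared = new Vector<>();

    // The Enumeration created here is never stored in a field, never
    // returned, and never passed to an unanalyzed method: it is captured,
    // its lifetime is bounded by this call, and it is a candidate for
    // stack allocation.
    static int sumCaptured(Vector<Integer> work) {
        int sum = 0;
        Enumeration<Integer> e = work.elements();  // captured node
        while (e.hasMoreElements()) sum += e.nextElement();
        return sum;
    }

    // This Vector is stored into a static class field and returned, so it
    // is reachable outside the current analysis scope: it escapes, and the
    // analysis can make no completeness claims about it.
    static Vector<Integer> makeEscaped() {
        Vector<Integer> v = new Vector<>();  // escaped node
        v.addElement(42);
        shared = v;          // escapes via static class field
        return v;            // also escapes via return
    }
}
```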

36 Intrathread Dataflow Analysis
- Computes a points-to escape graph for each program point
- A points-to escape graph is a triple:
  - I: set of inside edges
  - O: set of outside edges
  - e: escape information for each node

37 Dataflow Analysis
- Initial state:
  - I: formals point to parameter nodes, classes point to class nodes
  - O: Ø
- Transfer functions:
  - I′ = (I − Kill_I) ∪ Gen_I
  - O′ = O ∪ Gen_O
- Confluence operator is ∪

38 Intraprocedural Analysis
Must define transfer functions for:
- copy statement: l = v
- load statement: l1 = l2.f
- store statement: l1.f = l2
- return statement: return l
- object creation site: l = new cl
- method invocation: l = l0.op(l1, ..., lk)

39-40 copy statement l = v
- Kill_I = edges(I, l)
- Gen_I = {l} × succ(I, v)
- I′ = (I − Kill_I) ∪ Gen_I
[Diagrams: l's existing edge is killed; a new edge from l to each of v's targets is generated]
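The copy rule can be prototyped over an explicit edge set. A minimal sketch, not the paper's implementation: Strings stand in for variables and nodes, and the variable-to-node edge set I is a map from each variable to the set of nodes it points to.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CopyTransfer {
    // Transfer function for l = v over variable edges I.
    static Map<String, Set<String>> copy(Map<String, Set<String>> I,
                                         String l, String v) {
        // Copy the input graph so the transfer function is pure.
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : I.entrySet())
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        // Kill_I = edges(I, l): l's old edges are strongly updated away.
        // Gen_I = {l} x succ(I, v): l now points wherever v points.
        out.put(l, new HashSet<>(I.getOrDefault(v, Set.of())));
        return out;
    }
}
```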

41-42 load statement l1 = l2.f
- S_E = {n2 ∈ succ(I, l2) | escaped(n2)}
- S_I = ∪ {succ(I, n2, f) | n2 ∈ succ(I, l2)}
- Case 1: l2 does not point to an escaped node (S_E = Ø)
  - Kill_I = edges(I, l1)
  - Gen_I = {l1} × S_I
[Diagrams: l1 is redirected to the f-targets of l2's targets]

43-44 load statement l1 = l2.f
- Case 2: l2 does point to an escaped node (S_E ≠ Ø)
  - Kill_I = edges(I, l1)
  - Gen_I = {l1} × (S_I ∪ {n})
  - Gen_O = (S_E × {f}) × {n}
  - where n is the load node for l1 = l2.f
[Diagrams: an outside edge labeled f is generated from each escaped node to the load node n, and l1 points to n]
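The two load cases can be sketched in the same style as the copy rule. In this illustration (the representation and naming conventions are assumptions, not the paper's data structures), a node whose name starts with "esc" is treated as escaped, and a fresh load node is introduced when case 2 applies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LoadTransfer {
    Map<String, Set<String>> vars = new HashMap<>();                 // variable -> nodes
    Map<String, Map<String, Set<String>>> inside = new HashMap<>();  // inside heap edges
    Map<String, Map<String, Set<String>>> outside = new HashMap<>(); // outside heap edges
    int loadCounter = 0;

    static boolean escaped(String n) { return n.startsWith("esc"); }

    Set<String> succ(Map<String, Map<String, Set<String>>> h, String n, String f) {
        return h.getOrDefault(n, Map.of()).getOrDefault(f, Set.of());
    }

    // Transfer function for l1 = l2.f
    void load(String l1, String l2, String f) {
        Set<String> sI = new HashSet<>();   // S_I: known f-targets via inside edges
        Set<String> sE = new HashSet<>();   // S_E: escaped targets of l2
        for (String n2 : vars.getOrDefault(l2, Set.of())) {
            sI.addAll(succ(inside, n2, f));
            if (escaped(n2)) sE.add(n2);
        }
        if (!sE.isEmpty()) {
            // Case 2: some target escapes. Introduce the load node n and
            // record the outside edges Gen_O = (S_E x {f}) x {n}.
            String n = "escLoad" + (loadCounter++);  // load nodes escape too
            for (String n2 : sE)
                outside.computeIfAbsent(n2, k -> new HashMap<>())
                       .computeIfAbsent(f, k -> new HashSet<>()).add(n);
            sI.add(n);
        }
        vars.put(l1, sI);   // Kill_I = edges(I, l1); Gen_I = {l1} x S_I
    }
}
```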

45-46 store statement l1.f = l2
- Gen_I = (succ(I, l1) × {f}) × succ(I, l2)
- I′ = I ∪ Gen_I
[Diagrams: an f edge is generated from each of l1's targets to each of l2's targets; no edges are killed]

47-48 object creation site l = new cl
- Kill_I = edges(I, l)
- Gen_I = {⟨l, n⟩}, where n is the inside node for l = new cl
[Diagrams: l's existing edge is killed and l points to the fresh inside node n]

49 Method Call
Analysis of a method call:
1. Start with points-to escape graph before the call site
2. Retrieve the points-to escape graph from analysis of callee
3. Map outside nodes of callee graph to nodes of caller graph
4. Combine callee graph into caller graph
5. Result is the points-to escape graph after the call site

50-54 Mapping a Call: t = new Task(v, a)
- Start with the points-to escape graph before the call
- Retrieve the points-to escape graph from the analysis of Task(w, d)
- Map parameters from callee to caller (this to t's node, w to v's target, d to a's target)
- Transfer edges from callee to caller (the work and dest edges appear in the caller graph)
- Discard the parameter nodes from the callee

55-61 More General Example: call to x.foo()
- Start with the points-to escape graph before the call and the graph from the analysis of foo()
- Initialize Mapping: map formals to actuals (this to x's targets)
- Extend Mapping: match inside and outside edges (mapping is unidirectional, from callee to caller)
- Complete Mapping: automap load and inside nodes reachable from mapped nodes
- Combine Mapping: project edges from the callee into the combined graph
- Discard the callee graph
- Discard outside edges from captured nodes

62 Interthread Analysis
- Augment Analysis Representation
  - Parallel Thread Set
  - Action Set (read, write, sync, create edge)
  - Action Ordering Information (relative to thread start actions)
- Thread Interaction Analysis
  - Combine points-to graphs
  - Induces combination of other information
- Can perform interthread analysis at any point to improve precision of results

63-75 Combining Points-to Graphs at x.start()
- Start with the points-to escape graph sometime after the call to x.start() and the graph from the analysis of run()
- Initialize Mapping: map the startee thread (this) to the starter thread's node for x
- Extend Mapping: match inside and outside edges (mapping is bidirectional, from startee to starter and from starter to startee)
- Complete Mapping: automap load and inside nodes reachable from mapped nodes
- Combine Graphs: project edges through the mappings into the combined graph
- Discard the startee thread node
- Discard outside edges from captured nodes

76 Life is not so Simple
- Dependences between phases
- Mapping is best framed as a constraint satisfaction problem
- Solved using a constraint satisfaction algorithm

77 Interthread Analysis With Actions and Ordering

78 Analysis Result for generateTask
[Points-to graph: t points to the Task node a; work points to the Vector (nodes c, d); dest points to the Accumulator (nodes b, e)]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b
Action Ordering: "All actions happen before thread a starts executing"

79 Analysis Result for run
[Points-to graph over nodes 1 (Task, this), 2 and 5 (Accumulator), 3 and 4 (Vector and its contents), 6 (Enumeration)]
Parallel Threads: none
Actions: rd 1, rd 2, rd 3, rd 4, rd 5, wr 5, sync 2, sync 5, rd 6, wr 6, edge(1,2), edge(1,5), edge(2,3), edge(3,4)
Action Ordering: no parallel threads

80 Role of edge(1,2) Actions
- One edge action for each outside edge
- Action order for edge actions improves precision of interthread analysis
  - If the starter thread reads a reference before the startee thread is started, then the reference was not created by the startee thread
  - Outside edge actions record this order
  - Inside edges from the startee are matched only against parallel outside edges

81 Edge Actions in Combining Points-to Graphs
[Graphs with nodes 1, 2, 3 sometime after x.start(), plus the graph from run()] Action Ordering: edge(1,2) || 1

82 Edge Actions in Combining Points-to Graphs
[Same graphs] Action Ordering: none (i.e., edge(1,2) created before thread 1 started)

83 Analysis Result After Interaction
[Points-to graph as on slide 78]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b; from thread a: (rd a, a), (rd b, a), (rd c, a), (rd d, a), (rd e, a), (wr e, a), (sync b, a), (sync e, a)
Action Ordering: "All actions from current thread happen before thread a starts executing"

84 Roles of Intrathread and Interthread Analyses
Basic Analysis:
- Intrathread analysis delivers a parallel interaction graph at each program point
  - records parallel threads
  - does not compute thread interaction
- Choose a program point (end of method)
- Interthread analysis delivers additional precision at that program point
- Does not exploit ordering information from thread join constructs

85 Join Ordering

t = new Task();
t.start();
// computation that runs in parallel with task t
t.join();
// computation that runs after task t

t.run();  // computation from task t

86 Exploiting Join Ordering
At the join point:
- Interthread analysis delivers a new (more precise) parallel interaction graph
- Intrathread analysis uses the new graph
- No parallel interactions between the thread and the computation after the join
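The happens-before edge that join provides is what makes this precise: after t.join() returns, the parent can read the child's writes without synchronization, and the two computations never interact in parallel. A minimal demonstration (the class and field names are illustrative):

```java
public class JoinOrdering {
    static class Task extends Thread {
        int result;                          // plain field, no synchronization
        public void run() {
            int sum = 0;
            for (int i = 0; i < 100; i++) sum += i;
            result = sum;                    // the computation from task t
        }
    }

    static int compute() {
        Task t = new Task();
        t.start();
        // ... computation that runs in parallel with task t ...
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        // join() establishes a happens-before edge, so this unsynchronized
        // read is guaranteed to see the child's write to result.
        return t.result;
    }
}
```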

87 Extensions
- Partial program analysis
  - can analyze a method independent of its callers
  - can analyze a method independent of the methods it invokes
  - can incrementally analyze callees to improve precision
- Dial down precision to improve efficiency
- Demand-driven formulations

88 Key Ideas
- Explicitly represent potential interactions between analyzed and unanalyzed parts
  - Inside versus outside nodes and edges
  - Escaped versus captured nodes
  - Precisely bound ignorance
- Exploit ordering information
  - intrathread (flow sensitive)
  - interthread (starts, edge orders, joins)

89 Analysis Uses: Overheads in Standard Execution and How to Eliminate Them

90 Intrathread Analysis Result from End of run Method
[Points-to graph as on slide 79]
- Enumeration object is captured
  - Does not escape to caller
  - Does not escape to parallel threads
- Lifetime of Enumeration object is bounded by lifetime of run
- Can allocate Enumeration object on call stack instead of heap

91 Interthread Analysis Result from End of generateTask Method
[Points-to graph and actions as on slide 83]
- Vector object is captured
- Multiple threads synchronize on the Vector object
- But synchronizations from different threads do not occur concurrently
- Can eliminate synchronization on the Vector object

92 Interthread Analysis Result from End of generateTask Method
[Points-to graph and actions as on slide 83]
- Vectors, Tasks, Integers captured
- Parent and child access the objects
- Parent completes its accesses before the child starts its accesses
- Can allocate the objects on the child's per-thread heap

93 Thread Overhead
- Inefficient Thread Implementations
  - Thread Creation Overhead
  - Thread Management Overhead
  - Stack Overhead
- Use a more efficient thread implementation
  - User-level thread management
  - Per-thread heaps
  - Event-driven form

94-95 Standard Thread Implementation
[Diagram: a per-thread stack of call frames (return address, frame pointer, locals x, y and a, b, c) plus a save area]
- Call frames allocated on stack
- Context Switch: save state on stack, resume another thread
- One stack per thread

96 Event-Driven Form
[Diagram: call frames on the stack; a heap continuation holding live variables c and x plus a resume method]
- Call frames allocated on stack
- Context Switch
  - Build continuation on heap
  - Copy out live variables
  - Return out of computation
  - Resume another continuation
- One stack per processor
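The event-driven transformation can be illustrated with a hand-written continuation: instead of blocking on the stack, each step copies its live variables into a heap object and returns out of the computation, and a scheduler resumes it later. A toy sketch; the Cont interface and the scheduler loop are illustrative assumptions, not the Flex compiler's actual output:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class EventDriven {
    interface Cont { Cont resume(); }     // heap-allocated continuation; null means done

    static int total;                     // where the finished computation leaves its result

    // One step of summing 0..n-1. Instead of looping on the stack, each
    // step copies its live variables (i, sum, n) into a new continuation
    // object on the heap and returns out of the computation.
    static Cont step(int i, int sum, int n) {
        if (i == n) { total = sum; return null; }
        return () -> step(i + 1, sum + i, n);
    }

    // The scheduler: one stack per processor, a queue of ready continuations.
    static int run(int n) {
        Deque<Cont> ready = new ArrayDeque<>();
        Cont c = step(0, 0, n);
        if (c != null) ready.add(c);
        while (!ready.isEmpty()) {
            Cont next = ready.poll().resume();   // resume, collect follow-up continuation
            if (next != null) ready.add(next);
        }
        return total;
    }
}
```

With real threads the queue would also hold continuations parked on I/O, which the scheduler resumes when the asynchronous operation completes.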

97 Complications
- Standard thread models use blocking I/O
  - Automatically convert blocking I/O to asynchronous I/O
  - Scheduler manages interleaving of thread executions
- Stack-Allocatable Objects May Be Live Across Blocking Calls
  - Transfer the allocation to the per-thread heap

98 Opportunity
- On a uniprocessor, the compiler controls placement of context switch points
- If the program does not hold a lock across a blocking call, it can eliminate the lock

99 Experimental Results
- MIT Flex Compiler System
  - Static Compiler
  - Native code for StrongARM
- Server Benchmarks: http, phone, echo, time
- Scientific Computing Benchmarks: water, barnes

100 Server Benchmark Characteristics

Benchmark | IR Size (instrs) | Methods | Pre-Analysis Time (s) | Intrathread Analysis Time (s) | Interthread Analysis Time (s)
echo      | 4,639            | 131     | 28                    | 74                            | 73
time      | 4,573            | 136     | 29                    | 70                            | 74
http      | 10,643           | 292     | 103                   | 199                           | 269
phone     | 9,547            | 267     | 75                    | 191                           | 256

101 Percentage of Eliminated Synchronization Operations
[Bar chart, 0-100%, for http, phone, time, echo, mtrt: Intrathread only vs. Interthread]

102 Compilation Options for Performance Results
- Standard: kernel threads, synchronization included
- Event-Driven: event-driven, no synchronization at all
- +Per-Thread Heap: event-driven, no synchronization at all, per-thread heap allocation

103 Throughput (Responses per Second)
[Bar chart, 0-400, for echo, time, http 2K, http 20K, phone: Standard vs. Event-Driven vs. +Per-Thread Heap]

104 Scientific Benchmark Characteristics

Benchmark | IR Size (instrs) | Methods | Pre-Analysis Time (s) | Total Analysis Time (s)
water     | 25,583           | 335     | 380                   | 1,156
barnes    | 19,764           | 364     | 129                   | 491

105 Compiler Options
- 0: Sequential C++
- 1: Baseline (Kernel Threads)
- 2: Lightweight Threads
- 3: Lightweight Threads + Stack Allocation
- 4: Lightweight Threads + Stack Allocation − Synchronization

106 Execution Times
[Bar chart, proportion of sequential C++ execution time (0-1), for water small, water, barnes: Baseline vs. +Light vs. +Stack vs. −Synch]

107 Related Work
Pointer Analysis for Sequential Programs:
- Chatterjee, Ryder, Landi (POPL 99)
- Sathyanathan & Lam (LCPC 96)
- Steensgaard (POPL 96)
- Wilson & Lam (PLDI 95)
- Emami, Ghiya, Hendren (PLDI 94)
- Choi, Burke, Carini (POPL 93)

108 Related Work
Pointer Analysis for Multithreaded Programs:
- Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
- We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
Escape Analysis:
- Blanchet (POPL 98)
- Deutsch (POPL 90, POPL 97)
- Park & Goldberg (PLDI 92)

109 Related Work
Synchronization Optimizations:
- Diniz & Rinard (LCPC 96, POPL 97)
- Plevyak, Zhang, Chien (POPL 95)
- Aldrich, Chambers, Sirer, Eggers (SAS 99)
- Blanchet (OOPSLA 99)
- Bogda, Hoelzle (OOPSLA 99)
- Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
- Ruf (PLDI 00)

110 Conclusion
- New Analysis Algorithm
  - Flow-sensitive, compositional
  - Multithreaded programs
  - Explicitly represents interactions between analyzed and unanalyzed parts
- Analysis Uses
  - Synchronization elimination
  - Stack allocation
  - Per-thread heap allocation
- Lightweight Threads

