1
Analyses and Optimizations for Multithreaded Programs Martin Rinard, Alex Salcianu, Brian Demsky MIT Laboratory for Computer Science John Whaley IBM Tokyo Research Laboratory
2
Motivation Threads are Ubiquitous Parallel Programming for Performance Manage Multiple Connections System Structuring Mechanism Overhead Thread Management Synchronization Opportunities Improved Memory Management
3
What This Talk is About New Abstraction: Parallel Interaction Graph Points-To Information Reachability and Escape Information Interaction Information Caller-Callee Interactions Starter-Startee Interactions Action Ordering Information Analysis Algorithm Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)
4
Outline Example Analysis Representation and Algorithm Lightweight Threads Results Conclusion
5
Sum Sequence of Numbers: 9 8 1 5 3 7 2 6
6
Group in Subsequences: 9 8 | 1 5 | 3 7 | 2 6
7
Sum Subsequences (in Parallel): 9 8 (sum 17), 1 5 (sum 6), 3 7 (sum 10), 2 6 (sum 8)
8
Add Sums Into Accumulator: subsequence sums 17, 6, 10, 8; Accumulator = 0
9
Add Sums Into Accumulator: Accumulator = 17
10
Add Sums Into Accumulator: Accumulator = 23
11
Add Sums Into Accumulator: Accumulator = 33
12
Add Sums Into Accumulator: Accumulator = 41
13
Common Schema Set of tasks Chunk tasks to increase granularity Tasks have both Independent computation Updates to shared data
14
Realization in Java
class Accumulator {
  int value = 0;
  synchronized void add(int v) { value += v; }
}
15
Realization in Java
class Task extends Thread {
  Vector work;
  Accumulator dest;
  Task(Vector w, Accumulator d) { work = w; dest = d; }
  public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
      sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
  }
}
[Diagram: a Task object with work → Vector (holding 2, 6) and dest → Accumulator (value 0)]
16
Realization in Java (same Task code as above)
[Diagram: as before, now also showing the Enumeration object returned by work.elements()]
17
Realization in Java
void generateTask(int l, int u, Accumulator a) {
  Vector v = new Vector();
  for (int j = l; j < u; j++) v.addElement(new Integer(j));
  Task t = new Task(v, a);
  t.start();
}
void generate(int n, int m, Accumulator a) {
  for (int i = 0; i < n; i++)
    generateTask(i*m, (i+1)*m, a);  // each task sums the chunk [i*m, (i+1)*m)
}
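A tiny driver can exercise this code; the sketch below is purely illustrative and not from the slides: the Generator wrapper class, the chunk sizes, and the sleep-based wait are all invented, and it assumes the Accumulator, Task, and generate definitions above are available.

```java
// Hypothetical driver for the example above (not part of the original slides).
// Assumes the Accumulator, Task, and generate/generateTask code shown earlier,
// here wrapped in an invented Generator class.
class Driver {
    public static void main(String[] args) throws InterruptedException {
        Accumulator a = new Accumulator();
        new Generator().generate(4, 2, a);  // 4 tasks, each summing a chunk of 2 integers
        Thread.sleep(1000);                 // crude wait; the slides never join the tasks
        System.out.println(a.value);        // 0+1+...+7 = 28 with this chunking
    }
}
```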
18
Task Generation [diagram: the shared Accumulator, value 0]
19
Task Generation [diagram: Accumulator (0); an empty Vector being built]
20
Task Generation [diagram: Accumulator (0); Vector holding 2]
21
Task Generation [diagram: Accumulator (0); Vector holding 2, 6]
22
Task Generation [diagram: a Task with work → Vector (2, 6) and dest → Accumulator (0)]
23
Task Generation [diagram: the first Task and its Vector (2, 6); a second Vector (9, 8) being filled]
24
Task Generation [diagram: two Tasks with Vectors (2, 6) and (9, 8), both with dest → Accumulator (0)]
25
Task Generation [diagram: three Tasks with Vectors (2, 6), (9, 8), and (5, 1), all with dest → Accumulator (0)]
26
Analysis
27
Analysis Overview Interprocedural Interthread Flow-sensitive Statement ordering within thread Action ordering between threads Compositional, Bottom Up Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts Partial Program Analysis
28
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Abstraction: Points-to Graph Nodes Represent Objects Edges Represent References workdest Task Vector Enumeration this
29
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Inside Nodes Objects Created Within Current Analysis Scope One Inside Node Per Allocation Site Represents All Objects Created At That Site workdest Task Vector Enumeration this
30
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Outside Nodes Objects Created Outside Current Analysis Scope Objects Accessed Via References Created Outside Current Analysis Scope workdest Task Vector Enumeration this
31
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Outside Nodes One per Static Class Field One per Parameter One per Load Statement Represents Objects Loaded at That Statement workdest Task Vector Enumeration this
32
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Inside Edges References Created Inside Current Analysis Scope workdest Task Vector Enumeration this
33
Analysis Result for run Method Accumulator public void run() { int sum = 0; Enumeration e = work.elements(); while (e.hasMoreElements()) sum += ((Integer) e.nextElement()).intValue(); dest.add(sum); } Outside Edges References Created Outside Current Analysis Scope Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part workdest Task Vector Enumeration this
34
Concept of Escaped Node Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope parameter nodes, load nodes static class field nodes nodes passed to unanalyzed methods nodes reachable from unanalyzed but started threads nodes reachable from escaped nodes Node is Captured if it is Not Escaped
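As a concrete (invented) illustration of escaped versus captured, consider the following fragment; the class and method names are made up, not from the slides:

```java
// Hypothetical illustration of captured vs. escaped nodes (names are invented).
class EscapeExample {
    static Object globalField;                     // a static class field: its referents escape

    void example(Object param) {                   // param is represented by a parameter (outside) node
        StringBuilder local = new StringBuilder(); // inside node: captured, no reference leaves example()
        local.append("captured");

        Object shared = new Object();              // inside node that escapes:
        globalField = shared;                      // it becomes reachable from a static class field

        unknownLibraryCall(param);                 // param also escapes into an unanalyzed method
    }

    native void unknownLibraryCall(Object o);      // stands in for code outside the analysis scope
}
```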
35
Why Escaped Concept is Important Completeness of Analysis Information Complete information for captured nodes Potentially incomplete for escaped nodes Lifetime Implications Captured nodes are inaccessible when analyzed part of the program terminates Memory Management Optimizations Stack allocation Per-Thread Heap Allocation
36
Intrathread Dataflow Analysis Computes a points-to escape graph for each program point. A points-to escape graph is a triple ⟨I, O, e⟩: I, the set of inside edges; O, the set of outside edges; e, escape information for each node.
37
Dataflow Analysis Initial state: I: formals point to parameter nodes, classes point to class nodes; O: Ø. Transfer functions: I′ = (I – Kill_I) ∪ Gen_I; O′ = O ∪ Gen_O. Confluence operator is ∪ (set union).
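One plausible way to lay out the ⟨I, O, e⟩ triple as a data structure is sketched below; this is only an illustration of the abstraction, not the actual analysis implementation (a later sketch uses an equivalent adjacency-map form):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the points-to escape graph triple <I, O, e> described above.
// Node and Edge are simplified stand-ins; the real analysis distinguishes inside,
// parameter, load, and class nodes and records richer escape information.
class PointsToEscapeGraph {
    static class Node { final String name; Node(String n) { name = n; } }
    static class Edge {                             // source --field--> target
        final Node source; final String field; final Node target;
        Edge(Node s, String f, Node t) { source = s; field = f; target = t; }
    }

    final Set<Edge> insideEdges  = new HashSet<>(); // I: references created in the analyzed code
    final Set<Edge> outsideEdges = new HashSet<>(); // O: references read from the unanalyzed part
    final Set<Node> escaped      = new HashSet<>(); // e: nodes accessible outside the analysis scope
}
```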
38
Intraprocedural Analysis Must define transfer functions for: copy statement l = v; load statement l1 = l2.f; store statement l1.f = l2; return statement return l; object creation site l = new cl; method invocation l = l0.op(l1, …, lk)
39
copy statement l = v: Kill_I = edges(I, l); Gen_I = {l} × succ(I, v); I′ = (I – Kill_I) ∪ Gen_I [diagram: existing edges]
40
copy statement l = v: Kill_I = edges(I, l); Gen_I = {l} × succ(I, v); I′ = (I – Kill_I) ∪ Gen_I [diagram: generated edges, l now points to the nodes v points to]
41
load statement l1 = l2.f: S_E = {n2 in succ(I, l2) | escaped(n2)}; S_I = ∪ {succ(I, n2, f) | n2 in succ(I, l2)}. Case 1: l2 does not point to an escaped node (S_E = Ø): Kill_I = edges(I, l1); Gen_I = {l1} × S_I [diagram: existing edges]
42
load statement l1 = l2.f, case 1 (S_E = Ø): Kill_I = edges(I, l1); Gen_I = {l1} × S_I [diagram: generated edges, l1 now points to the f-successors of the nodes l2 points to]
43
load statement l1 = l2.f. Case 2: l2 does point to an escaped node (S_E ≠ Ø): Kill_I = edges(I, l1); Gen_I = {l1} × (S_I ∪ {n}); Gen_O = (S_E × {f}) × {n}, where n is the load node for l1 = l2.f [diagram: existing edges]
44
load statement l1 = l2.f, case 2 (S_E ≠ Ø): same equations [diagram: generated edges, outside edges labeled f from the escaped nodes to the load node n, plus an inside edge from l1 to n]
45
store statement l1.f = l2: Gen_I = (succ(I, l1) × {f}) × succ(I, l2); I′ = I ∪ Gen_I [diagram: existing edges]
46
store statement l1.f = l2: Gen_I = (succ(I, l1) × {f}) × succ(I, l2); I′ = I ∪ Gen_I [diagram: generated edges, f edges from the nodes l1 points to into the nodes l2 points to]
47
object creation site l = new cl: Kill_I = edges(I, l); Gen_I = {⟨l, n⟩}, where n is the inside node for l = new cl [diagram: existing edges]
48
object creation site l = new cl: Kill_I = edges(I, l); Gen_I = {⟨l, n⟩}, where n is the inside node for l = new cl [diagram: generated edge from l to the new inside node n]
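To make the four transfer functions above concrete, here is a self-contained, simplified sketch: variables map to node sets and heap edges are adjacency maps; it illustrates the Kill/Gen rules only and is not the actual compiler code (it omits, for example, escape propagation along edges).

```java
import java.util.*;

// Simplified sketch of the intraprocedural transfer functions above (illustration only).
class TransferFunctions {
    static class Node {
        final String label; final boolean isOutside;
        Node(String l, boolean o) { label = l; isOutside = o; }
        public String toString() { return label; }
    }

    Map<String, Set<Node>> vars = new HashMap<>();                    // variable -> nodes it points to
    Map<Node, Map<String, Set<Node>>> insideEdges  = new HashMap<>(); // I: node --field--> nodes
    Map<Node, Map<String, Set<Node>>> outsideEdges = new HashMap<>(); // O: node --field--> nodes
    Set<Node> escaped = new HashSet<>();                              // e
    Map<String, Node> insideNodes = new HashMap<>();                  // one inside node per allocation site
    Map<String, Node> loadNodes   = new HashMap<>();                  // one load node per load statement

    Set<Node> succ(String v) { return vars.getOrDefault(v, Set.of()); }
    Set<Node> succ(Map<Node, Map<String, Set<Node>>> g, Node n, String f) {
        return g.getOrDefault(n, Map.of()).getOrDefault(f, Set.of());
    }
    void addEdge(Map<Node, Map<String, Set<Node>>> g, Node s, String f, Node t) {
        g.computeIfAbsent(s, k -> new HashMap<>())
         .computeIfAbsent(f, k -> new HashSet<>()).add(t);
    }

    // copy statement l = v: kill l's edges, gen {l} x succ(v)
    void copy(String l, String v) { vars.put(l, new HashSet<>(succ(v))); }

    // object creation site l = new cl: kill l's edges, gen <l, n> with n the site's inside node
    void newObject(String l, String site) {
        Node n = insideNodes.computeIfAbsent(site, s -> new Node(s, false));
        vars.put(l, new HashSet<>(Set.of(n)));
    }

    // store statement l1.f = l2: no kill, gen (succ(l1) x {f}) x succ(l2)
    void store(String l1, String f, String l2) {
        for (Node n1 : succ(l1))
            for (Node n2 : succ(l2)) addEdge(insideEdges, n1, f, n2);
    }

    // load statement l1 = l2.f: case 1 reads only inside edges; case 2 adds a load node
    void load(String l1, String l2, String f, String loadSite) {
        Set<Node> targets = new HashSet<>();        // S_I
        Set<Node> escapedSources = new HashSet<>(); // S_E
        for (Node n2 : succ(l2)) {
            targets.addAll(succ(insideEdges, n2, f));
            if (escaped.contains(n2)) escapedSources.add(n2);
        }
        if (!escapedSources.isEmpty()) {            // case 2: the read may see an unanalyzed reference
            Node n = loadNodes.computeIfAbsent(loadSite, s -> new Node(s, true));
            escaped.add(n);
            for (Node n2 : escapedSources) addEdge(outsideEdges, n2, f, n);
            targets.add(n);
        }
        vars.put(l1, targets);                      // kill l1's old edges
    }

    public static void main(String[] args) {
        TransferFunctions g = new TransferFunctions();
        g.newObject("v", "new Vector");
        g.newObject("t", "new Task");
        g.store("t", "work", "v");            // inside edge: Task node --work--> Vector node
        g.copy("u", "t");
        g.load("w", "u", "work", "load@17");  // captured case: w ends up pointing to the Vector node
        System.out.println(g.vars.get("w"));  // prints [new Vector]
    }
}
```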
49
Method Call Analysis of a method call: Start with points-to escape graph before the call site Retrieve the points-to escape graph from analysis of callee Map outside nodes of callee graph to nodes of caller graph Combine callee graph into caller graph Result is the points-to escape graph after the call site
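A greatly simplified sketch of two of these steps, mapping formals to actuals and projecting the callee's inside edges into the caller, is shown below; the outside-edge matching, automapping, and constraint-satisfaction refinements described on the following slides are omitted, and all names are invented for illustration.

```java
import java.util.*;

// Simplified sketch of combining a callee graph into a caller graph at a call site.
// Only the first and last steps are shown; the real algorithm also matches outside
// edges against inside edges and automaps reachable nodes.
class CallSiteCombine {
    static class Node { final String label; Node(String l) { label = l; }
                        public String toString() { return label; } }
    static class Edge { final Node src; final String field; final Node dst;
                        Edge(Node s, String f, Node d) { src = s; field = f; dst = d; } }
    static class Graph {
        Set<Edge> insideEdges = new HashSet<>();
        Map<String, Set<Node>> vars = new HashMap<>();   // variables/formals -> nodes
    }

    // Initialize the mapping: each callee parameter node may denote the nodes
    // the corresponding actual points to in the caller graph.
    static Map<Node, Set<Node>> mapFormalsToActuals(Graph caller, Graph callee,
                                                    List<String> formals, List<String> actuals) {
        Map<Node, Set<Node>> mapping = new HashMap<>();
        for (int i = 0; i < formals.size(); i++) {
            Set<Node> actualTargets = caller.vars.getOrDefault(actuals.get(i), Set.of());
            for (Node paramNode : callee.vars.getOrDefault(formals.get(i), Set.of()))
                mapping.computeIfAbsent(paramNode, k -> new HashSet<>()).addAll(actualTargets);
        }
        return mapping;
    }

    // Project callee inside edges into the caller graph through the mapping;
    // unmapped nodes (e.g. the callee's own inside nodes) map to themselves.
    static void projectInsideEdges(Graph caller, Graph callee, Map<Node, Set<Node>> mapping) {
        for (Edge e : callee.insideEdges)
            for (Node s : mapping.getOrDefault(e.src, Set.of(e.src)))
                for (Node d : mapping.getOrDefault(e.dst, Set.of(e.dst)))
                    caller.insideEdges.add(new Edge(s, e.field, d));
    }
}
```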
50
v t a Points-to Escape Graph before call to t = new Task(v,a) Start With Graph Before Call
51
work dest v t a this w d Points-to Escape Graph before call to t = new Task(v,a) Points-to Escape Graph from analysis of Task(w,d) Retrieve Graph from Callee
52
work dest v t a this w d Points-to Escape Graph before call to t = new Task(v,a) Points-to Escape Graph from analysis of Task(w,d) Map Parameters from Callee to Caller
53
work dest v t a this w d Combined Graph after call to t = new Task(v,a) Points-to Escape Graph from analysis of Task(w,d) Transfer Edges from Callee to Caller work dest
54
v t a Combined Graph after call to t = new Task(v,a) Discard Parameter Nodes from Callee work dest
55
Points-to Escape Graph before call to x.foo() Points-to Escape Graph from analysis of foo() this x More General Example y z
56
Points-to Escape Graph before call to x.foo() Points-to Escape Graph from analysis of foo() this x Initialize Mapping Map Formals to Actuals y z
57
Points-to Escape Graph before call to x.foo() Points-to Escape Graph from analysis of foo() this x Extend Mapping Match Inside and Outside Edges y Mapping is Unidirectional From Callee to Caller z
58
Points-to Escape Graph before call to x.foo() Points-to Escape Graph from analysis of foo() this x Complete Mapping Automap Load and Inside Nodes Reachable from Mapped Nodes y z
59
Combined Graph after call to x.foo() Points-to Escape Graph from analysis of foo() this x Combine Mapping Project Edges from Callee Into Combined Graph y z
60
Combined Graph after call to x.foo() x Discard Callee Graph z
61
Combined Graph after call to x.foo() x Discard Outside Edges From Captured Nodes z
62
Interthread Analysis Augment Analysis Representation Parallel Thread Set Action Set (read,write,sync,create edge) Action Ordering Information (relative to thread start actions) Thread Interaction Analysis Combine points-to graphs Induces combination of other information Can perform interthread analysis at any point to improve precision of results
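A rough sketch of what the augmented representation carries, with invented field names, just to make the list above concrete:

```java
import java.util.*;

// Sketch of the parallel interaction graph components listed above; names are invented.
class ParallelInteractionGraph {
    enum ActionKind { READ, WRITE, SYNC, CREATE_EDGE }
    static class Node { final String label; Node(String l) { label = l; } }
    static class Action {
        final ActionKind kind; final Node target;
        Action(ActionKind k, Node t) { kind = k; target = t; }
    }

    Set<Node> parallelThreads = new HashSet<>();  // threads started but not yet joined
    Set<Action> actions = new HashSet<>();        // reads, writes, syncs, edge creations
    // Ordering information: actions known to happen before a given thread-start action.
    Map<Node, Set<Action>> happensBeforeThreadStart = new HashMap<>();
    // Plus the points-to escape graph itself (inside edges, outside edges, escape info).
}
```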
63
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Combining Points-to Graphs xthis
64
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Initialize Mapping Map Startee Thread to Starter Thread xthis
65
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Extend Mapping Match Inside and Outside Edges xthis
66
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Extend Mapping Match Inside and Outside Edges xthis
67
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Extend Mapping Match Inside and Outside Edges xthis Mapping is Bidirectional From Startee to Starter From Starter to Startee
68
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Complete Mapping Automap Load and Inside Nodes Reachable from Mapped Nodes xthis
69
Combined Points-to Escape Graph sometime after call to x.start() Combine Graphs Project Edges Through Mappings Into Combined Graph xthis
70
Combined Points-to Escape Graph sometime after call to x.start() Combine Graphs Project Edges Through Mappings Into Combined Graph xthis
71
Combined Points-to Escape Graph sometime after call to x.start() Combine Graphs Project Edges Through Mappings Into Combined Graph xthis
72
Combined Points-to Escape Graph sometime after call to x.start() Combine Graphs Project Edges Through Mappings Into Combined Graph xthis
73
Combined Points-to Escape Graph sometime after call to x.start() Discard Startee Thread Node x this
74
Combined Points-to Escape Graph sometime after call to x.start() Discard Startee Thread Node x
75
Combined Points-to Escape Graph sometime after call to x.start() Discard Outside Edges From Captured Nodes x
76
Life is not so Simple Dependences between phases Mapping best framed as constraint satisfaction problem Solved using constraint satisfaction algorithm
77
Interthread Analysis With Actions and Ordering
78
Analysis Result for generateTask
Points-to Graph [diagram: t → Task node a, with work → Vector and dest → Accumulator; additional nodes b through e]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b
Action Ordering: "All actions happen before thread a starts executing"
79
Analysis Result for run
Points-to Graph [diagram: this → Task node 1, with work → Vector (2) and dest → Accumulator (5); nodes 3 and 4 reached from the Vector; Enumeration node 6]
Parallel Threads: none
Actions: rd 1, rd 2, rd 3, rd 4, rd 5, wr 5, rd 6, wr 6, sync 2, sync 5, edge(1,2), edge(1,5), edge(2,3), edge(3,4)
Action Ordering: none
80
Role of edge(1,2) Actions One edge action for each outside edge Action order for edge actions improves precision of interthread analysis If starter thread reads a reference before startee thread is started Then reference was not created by startee thread Outside edge actions record order Inside edges from startee matched only against parallel outside edges
81
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Edge Actions in Combining Points-to Graphs 1 2 3 xthis Action Ordering edge(1,2) || 1
82
Points-to Escape Graph sometime after call to x.start() Points-to Escape Graph from analysis of run() Edge Actions in Combining Points-to Graphs 1 2 3 x this Action Ordering: edge(1,2) created before thread 1 started, so the startee's inside edge is not matched against it (none)
83
Analysis Result After Interaction
Points-to Graph [diagram: t → Task node a, with work → Vector and dest → Accumulator; additional nodes b through e]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b; performed by thread a: rd a, rd b, rd c, rd d, rd e, wr e, sync b, sync e
Action Ordering: "All actions from current thread happen before thread a starts executing"
84
Roles of Intrathread and Interthread Analyses Basic Analysis Intrathread analysis delivers parallel interaction graph at each program point records parallel threads does not compute thread interaction Choose program point (end of method) Interthread analysis delivers additional precision at that program point Does not exploit ordering information from thread join constructs
85
Join Ordering
t = new Task();
t.start();
"computation that runs in parallel with task t"
t.join();
"computation that runs after task t"
t.run(): "computation from task t"
86
Exploiting Join Ordering At join point Interthread analysis delivers new (more precise) parallel interaction graph Intrathread analysis uses new graph No parallel interactions between Thread Computation after join
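A hypothetical fragment (all names are invented) showing the payoff: because of t.join(), the parent's accesses to results cannot run in parallel with the task's accesses, so the analysis can treat results as having no parallel interactions after the join and, since results is otherwise captured, eliminate the Vector's synchronization.

```java
import java.util.Vector;

// Hypothetical illustration of exploiting join ordering (not from the slides).
class JoinExample {
    int collect() throws InterruptedException {
        Vector results = new Vector();     // Vector's methods are synchronized
        Worker t = new Worker(results);
        t.start();
        // ... computation that runs in parallel with the task (does not touch results) ...
        t.join();
        // After the join there are no parallel interactions on results, and results is
        // captured in collect(), so the synchronization on it can be eliminated.
        return results.size();
    }
}

class Worker extends Thread {
    private final Vector out;
    Worker(Vector out) { this.out = out; }
    public void run() { out.addElement("task result"); }
}
```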
87
Extensions Partial program analysis can analyze method independent of callers can analyze method independent of methods it invokes can incrementally analyze callees to improve precision Dial down precision to improve efficiency Demand-driven formulations
88
Key Ideas Explicitly represent potential interactions between analyzed and unanalyzed parts Inside versus outside nodes and edges Escaped versus captured nodes Precisely bound ignorance Exploit ordering information intrathread (flow sensitive) interthread (starts, edge orders, joins)
89
Analysis Uses Overheads in Standard Execution and How to Eliminate Them
90
Intrathread Analysis Result from End of run Method
[Points-to graph as before: this → Task (1), work → Vector (2), dest → Accumulator (5), Enumeration (6)]
Enumeration object is captured: does not escape to caller, does not escape to parallel threads.
Lifetime of Enumeration object is bounded by lifetime of run.
Can allocate Enumeration object on call stack instead of heap.
91
Interthread Analysis Result from End of generateTask Method
[Points-to graph and action set as on the interaction slide above]
Vector object is captured. Multiple threads synchronize on the Vector object, but synchronizations from different threads do not occur concurrently.
Can eliminate synchronization on the Vector object.
92
Interthread Analysis Result from End of generateTask Method
[Points-to graph and action set as on the interaction slide above]
Vectors, Tasks, and Integers are captured. Parent and child access the objects, but the parent completes its accesses before the child starts its accesses.
Can allocate the objects on the child's per-thread heap.
93
Thread Overhead Inefficient Thread Implementations Thread Creation Overhead Thread Management Overhead Stack Overhead Use a more efficient thread implementation User-level thread management Per-thread heaps Event-driven form
94
Standard Thread Implementation
[Diagram: a stack of call frames, each holding a return address, frame pointer, and locals]
Call frames allocated on stack.
Context Switch: save state on stack, resume another thread.
One stack per thread.
95
Standard Thread Implementation
[Diagram: as above, with a save area on the stack holding the suspended thread's state]
Call frames allocated on stack.
Context Switch: save state on stack, resume another thread.
One stack per thread.
96
Event-Driven Form
[Diagram: call frames on the stack; heap-allocated continuations, each holding its live variables and a resume method]
Call frames allocated on stack.
Context Switch: build continuation on heap, copy out live variables, return out of computation, resume another continuation.
One stack per processor.
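A hypothetical sketch of the transformation: instead of blocking, the method captures its live variables in a heap-allocated continuation and returns, and a scheduler resumes the continuation when the I/O completes. The Scheduler and Continuation interfaces and all names are invented for illustration.

```java
// Hypothetical sketch of the event-driven form described above (names are invented).
interface Continuation { void resume(byte[] ioResult); }
interface Scheduler { void readAsync(Continuation k); }

class Handler {
    // In the original blocking style, this method would perform a blocking read here
    // and hold its stack frame (and its whole stack) until the data arrived.
    void handle(Scheduler sched, int count) {
        // Build the continuation on the heap, copying out the live variable `count`...
        sched.readAsync(new Continuation() {
            public void resume(byte[] data) { process(data, count); }
        });
        // ...then return out of the computation; one stack per processor suffices.
    }

    void process(byte[] data, int count) { /* the rest of the computation */ }
}
```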
97
Complications Standard thread models use blocking I/O Automatically convert blocking I/O to asynchronous I/O Scheduler manages interleaving of thread executions Stack Allocatable Objects May Be Live Across Blocking Calls Transfer allocation to per-thread heap
98
Opportunity On a uniprocessor, compiler controls placement of context switch points If program does not hold lock across blocking call, can eliminate lock
99
Experimental Results MIT Flex Compiler System Static Compiler Native code for StrongARM Server Benchmarks http, phone, echo, time Scientific Computing Benchmarks water, barnes
100
Server Benchmark Characteristics
Benchmark | IR Size (instrs) | Number of Methods | Pre Analysis Time (secs) | Intra Thread Analysis Time (secs) | Inter Thread Analysis Time (secs)
echo | 4,639 | 131 | 28 | 74 | 73
time | 4,573 | 136 | 29 | 70 | 74
http | 10,643 | 292 | 103 | 199 | 269
phone | 9,547 | 267 | 75 | 191 | 256
101
Percentage of Eliminated Synchronization Operations
[Bar chart, 0 to 100 percent, for http, phone, time, echo, and mtrt: "Intrathread only" versus "Interthread"]
102
Compilation Options for Performance Results
Standard: kernel threads, synch included
Event-Driven: event-driven, no synch at all
+Per-Thread Heap: event-driven, no synch at all, per-thread heap allocation
103
Throughput (Responses per Second)
[Bar chart, 0 to 400, for echo, time, http 2K, http 20K, and phone: Standard versus Event-Driven versus +Per-Thread Heap]
104
Scientific Benchmark Characteristics
Benchmark | IR Size (instrs) | Number of Methods | Pre Analysis Time (secs) | Total Analysis Time (secs)
water | 25,583 | 335 | 380 | 1156
barnes | 19,764 | 364 | 129 | 491
105
Compiler Options
0: Sequential C++
1: Baseline (Kernel Threads)
2: Lightweight Threads
3: Lightweight Threads + Stack Allocation
4: Lightweight Threads + Stack Allocation - Synchronization
106
Execution Times
[Bar chart: proportion of sequential C++ execution time (0 to 1) for water small, water, and barnes under Baseline, +Light, +Stack, and -Synch]
107
Related Work Pointer Analysis for Sequential Programs Chatterjee, Ryder, Landi (POPL 99) Sathyanathan & Lam (LCPC 96) Steensgaard (POPL 96) Wilson & Lam (PLDI 95) Emami, Ghiya, Hendren (PLDI 94) Choi, Burke, Carini (POPL 93)
108
Related Work Pointer Analysis for Multithreaded Programs Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional) We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional) Escape Analysis Blanchet (POPL 98) Deutsch (POPL 90, POPL 97) Park & Goldberg (PLDI 92)
109
Related Work Synchronization Optimizations Diniz & Rinard (LCPC 96, POPL 97) Plevyak, Zhang, Chien (POPL 95) Aldrich, Chambers, Sirer, Eggers (SAS99) Blanchet (OOPSLA 99) Bogda, Hoelzle (OOPSLA 99) Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99) Ruf (PLDI 00)
110
Conclusion New Analysis Algorithm Flow-sensitive, compositional Multithreaded programs Explicitly represent interactions between analyzed and unanalyzed parts Analysis Uses Synchronization elimination Stack allocation Per-thread heap allocation Lightweight Threads