
1 Analyses and Optimizations for Multithreaded Programs Martin Rinard, Alex Salcianu, Brian Demsky MIT Laboratory for Computer Science John Whaley IBM Tokyo Research Laboratory

2 Motivation
- Threads are Ubiquitous
  - Parallel Programming for Performance
  - Manage Multiple Connections
  - System Structuring Mechanism
- Overhead
  - Thread Management
  - Synchronization
- Opportunities
  - Improved Memory Management

3 What This Talk is About
- New Abstraction: Parallel Interaction Graph
  - Points-To Information
  - Reachability and Escape Information
  - Interaction Information (Caller-Callee Interactions, Starter-Startee Interactions)
  - Action Ordering Information
- Analysis Algorithm
- Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)

4 Outline
- Example
- Analysis Representation and Algorithm
- Lightweight Threads
- Results
- Conclusion

5 Sum Sequence of Numbers: 9 8 1 5 3 7 2 6

6 Group in Subsequences: (9 8) (1 5) (3 7) (2 6)

7 Sum Subsequences (in Parallel): 9+8 = 17, 1+5 = 6, 3+7 = 10, 2+6 = 8

8-12 Add Sums Into Accumulator: as the four sums (17, 6, 10, 8) are added, the accumulator goes 0, 17, 23, 33, 41

13 Common Schema
- Set of tasks
- Chunk tasks to increase granularity
- Tasks have both
  - Independent computation
  - Updates to shared data

14 Realization in Java

class Accumulator {
    int value = 0;
    synchronized void add(int v) { value += v; }
}

15 Realization in Java

class Task extends Thread {
    Vector work;
    Accumulator dest;
    Task(Vector w, Accumulator d) { work = w; dest = d; }
    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }
}

[Diagram: a Task whose work field points to a Vector (holding 6, 2) and whose dest field points to an Accumulator (value 0)]

16 Realization in Java

[Same code as slide 15; the diagram adds an Enumeration node created by work.elements() inside run]

17 Realization in Java

void generateTask(int l, int u, Accumulator a) {
    Vector v = new Vector();
    for (int j = l; j < u; j++)
        v.addElement(new Integer(j));
    Task t = new Task(v, a);
    t.start();
}

void generate(int n, int m, Accumulator a) {
    for (int i = 0; i < n; i++)
        generateTask(i*m, (i+1)*m, a);
}
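The fragments on slides 14-17 can be assembled into one runnable program. A sketch follows; the generics, the returned Task handles, and the join are additions so the final total can be observed (they are not on the slides), and the chunking bound is written as (i+1)*m so that task i sums the range [i*m, (i+1)*m).

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Vector;

public class SumDemo {
    static class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }

    static class Task extends Thread {
        Vector<Integer> work;
        Accumulator dest;
        Task(Vector<Integer> w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration<Integer> e = work.elements();
            while (e.hasMoreElements())
                sum += e.nextElement().intValue();
            dest.add(sum);
        }
    }

    static Task generateTask(int l, int u, Accumulator a) {
        Vector<Integer> v = new Vector<>();
        for (int j = l; j < u; j++) v.addElement(j);
        Task t = new Task(v, a);
        t.start();
        return t;
    }

    // n tasks of m numbers each; task i sums [i*m, (i+1)*m)
    static int generate(int n, int m) {
        Accumulator a = new Accumulator();
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++)
            tasks.add(generateTask(i * m, (i + 1) * m, a));
        for (Task t : tasks) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return a.value;
    }

    public static void main(String[] args) {
        System.out.println(generate(4, 5)); // sums 0..19
    }
}
```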

18-25 Task Generation

[Animation: generateTask repeatedly builds a Vector of Integers and a Task whose work field points to the Vector and whose dest field points to the shared Accumulator (value 0). Successive frames add the first Vector's contents (6, 2) and its Task, a second Vector (9, 8) with its Task, and a third Vector (5, 1) with its Task, all sharing the one Accumulator]

26 Analysis

27 Analysis Overview
- Interprocedural
- Interthread
- Flow-sensitive
  - Statement ordering within thread
  - Action ordering between threads
- Compositional, Bottom Up
- Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
- Partial Program Analysis

28 Analysis Result for run Method

public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
        sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
}

Abstraction: Points-to Graph
- Nodes Represent Objects
- Edges Represent References

[Graph: this points to the Task; work points to the Vector; dest points to the Accumulator; an Enumeration node]

29 Analysis Result for run Method (same code)

Inside Nodes
- Objects Created Within Current Analysis Scope
- One Inside Node Per Allocation Site
- Represents All Objects Created At That Site

30 Analysis Result for run Method (same code)

Outside Nodes
- Objects Created Outside Current Analysis Scope
- Objects Accessed Via References Created Outside Current Analysis Scope

31 Analysis Result for run Method (same code)

Outside Nodes
- One per Static Class Field
- One per Parameter
- One per Load Statement (Represents Objects Loaded at That Statement)

32 Analysis Result for run Method (same code)

Inside Edges
- References Created Inside Current Analysis Scope

33 Analysis Result for run Method (same code)

Outside Edges
- References Created Outside Current Analysis Scope
- Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part

34 Concept of Escaped Node
Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope:
- parameter nodes, load nodes
- static class field nodes
- nodes passed to unanalyzed methods
- nodes reachable from unanalyzed but started threads
- nodes reachable from escaped nodes
A Node is Captured if it is Not Escaped

35 Why the Escaped Concept is Important
- Completeness of Analysis Information
  - Complete information for captured nodes
  - Potentially incomplete for escaped nodes
- Lifetime Implications
  - Captured nodes are inaccessible once the analyzed part of the program terminates
- Memory Management Optimizations
  - Stack allocation
  - Per-thread heap allocation
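To make the captured/escaped distinction concrete, here is a toy example (the class and method names are illustrative, not from the talk; the comments state what the analysis would conclude):

```java
import java.util.Enumeration;
import java.util.Vector;

public class EscapeExample {
    static Vector<Integer> shared = new Vector<>();

    // The Enumeration created here is never stored in a field, never
    // returned, and never passed to an unanalyzed method: it is captured,
    // its lifetime is bounded by this call, and it is a candidate for
    // stack allocation.
    static int sumCaptured(Vector<Integer> work) {
        int sum = 0;
        Enumeration<Integer> e = work.elements();  // captured node
        while (e.hasMoreElements()) sum += e.nextElement();
        return sum;
    }

    // This Vector is stored into a static class field and returned, so it
    // is reachable outside the current analysis scope: it escapes, and the
    // analysis can make no completeness claims about it.
    static Vector<Integer> makeEscaped() {
        Vector<Integer> v = new Vector<>();  // escaped node
        v.addElement(42);
        shared = v;          // escapes via static class field
        return v;            // also escapes via return
    }
}
```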

36 Intrathread Dataflow Analysis
- Computes a points-to escape graph for each program point
- A points-to escape graph is a triple:
  - I: set of inside edges
  - O: set of outside edges
  - e: escape information for each node

37 Dataflow Analysis
- Initial state:
  - I: formals point to parameter nodes, classes point to class nodes
  - O: Ø
- Transfer functions:
  - I′ = (I − Kill_I) ∪ Gen_I
  - O′ = O ∪ Gen_O
- Confluence operator is ∪

38 Intraprocedural Analysis
Must define transfer functions for:
- copy statement: l = v
- load statement: l1 = l2.f
- store statement: l1.f = l2
- return statement: return l
- object creation site: l = new cl
- method invocation: l = l0.op(l1, ..., lk)

39-40 copy statement l = v
- Kill_I = edges(I, l)
- Gen_I = {l} × succ(I, v)
- I′ = (I − Kill_I) ∪ Gen_I
[Diagrams: l's existing edge is killed; a new edge from l to each of v's targets is generated]
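The copy rule can be prototyped over an explicit edge set. A minimal sketch, not the paper's implementation: Strings stand in for variables and nodes, and the variable-to-node edge set I is a map from each variable to the set of nodes it points to.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CopyTransfer {
    // Transfer function for l = v over variable edges I.
    static Map<String, Set<String>> copy(Map<String, Set<String>> I,
                                         String l, String v) {
        // Copy the input graph so the transfer function is pure.
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : I.entrySet())
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        // Kill_I = edges(I, l): l's old edges are strongly updated away.
        // Gen_I = {l} x succ(I, v): l now points wherever v points.
        out.put(l, new HashSet<>(I.getOrDefault(v, Set.of())));
        return out;
    }
}
```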

41-42 load statement l1 = l2.f
- S_E = {n2 ∈ succ(I, l2) | escaped(n2)}
- S_I = ∪ {succ(I, n2, f) | n2 ∈ succ(I, l2)}
- Case 1: l2 does not point to an escaped node (S_E = Ø)
  - Kill_I = edges(I, l1)
  - Gen_I = {l1} × S_I
[Diagrams: l1 is redirected to the f-targets of l2's targets]

43-44 load statement l1 = l2.f
- Case 2: l2 does point to an escaped node (S_E ≠ Ø)
  - Kill_I = edges(I, l1)
  - Gen_I = {l1} × (S_I ∪ {n})
  - Gen_O = (S_E × {f}) × {n}
  - where n is the load node for l1 = l2.f
[Diagrams: an outside edge labeled f is generated from each escaped node to the load node n, and l1 points to n]
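The two load cases can be sketched in the same style as the copy rule. In this illustration (the representation and naming conventions are assumptions, not the paper's data structures), a node whose name starts with "esc" is treated as escaped, and a fresh load node is introduced when case 2 applies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LoadTransfer {
    Map<String, Set<String>> vars = new HashMap<>();                 // variable -> nodes
    Map<String, Map<String, Set<String>>> inside = new HashMap<>();  // inside heap edges
    Map<String, Map<String, Set<String>>> outside = new HashMap<>(); // outside heap edges
    int loadCounter = 0;

    static boolean escaped(String n) { return n.startsWith("esc"); }

    Set<String> succ(Map<String, Map<String, Set<String>>> h, String n, String f) {
        return h.getOrDefault(n, Map.of()).getOrDefault(f, Set.of());
    }

    // Transfer function for l1 = l2.f
    void load(String l1, String l2, String f) {
        Set<String> sI = new HashSet<>();   // S_I: known f-targets via inside edges
        Set<String> sE = new HashSet<>();   // S_E: escaped targets of l2
        for (String n2 : vars.getOrDefault(l2, Set.of())) {
            sI.addAll(succ(inside, n2, f));
            if (escaped(n2)) sE.add(n2);
        }
        if (!sE.isEmpty()) {
            // Case 2: some target escapes. Introduce the load node n and
            // record the outside edges Gen_O = (S_E x {f}) x {n}.
            String n = "escLoad" + (loadCounter++);  // load nodes escape too
            for (String n2 : sE)
                outside.computeIfAbsent(n2, k -> new HashMap<>())
                       .computeIfAbsent(f, k -> new HashSet<>()).add(n);
            sI.add(n);
        }
        vars.put(l1, sI);   // Kill_I = edges(I, l1); Gen_I = {l1} x S_I
    }
}
```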

45-46 store statement l1.f = l2
- Gen_I = (succ(I, l1) × {f}) × succ(I, l2)
- I′ = I ∪ Gen_I
[Diagrams: an f edge is generated from each of l1's targets to each of l2's targets; no edges are killed]

47-48 object creation site l = new cl
- Kill_I = edges(I, l)
- Gen_I = {⟨l, n⟩}, where n is the inside node for l = new cl
[Diagrams: l's existing edge is killed and l points to the fresh inside node n]

49 Method Call
Analysis of a method call:
1. Start with points-to escape graph before the call site
2. Retrieve the points-to escape graph from analysis of callee
3. Map outside nodes of callee graph to nodes of caller graph
4. Combine callee graph into caller graph
5. Result is the points-to escape graph after the call site

50-54 Mapping a Call: t = new Task(v, a)
- Start with the points-to escape graph before the call
- Retrieve the points-to escape graph from the analysis of Task(w, d)
- Map parameters from callee to caller (this to t's node, w to v's target, d to a's target)
- Transfer edges from callee to caller (the work and dest edges appear in the caller graph)
- Discard the parameter nodes from the callee

55-61 More General Example: call to x.foo()
- Start with the points-to escape graph before the call and the graph from the analysis of foo()
- Initialize Mapping: map formals to actuals (this to x's targets)
- Extend Mapping: match inside and outside edges (mapping is unidirectional, from callee to caller)
- Complete Mapping: automap load and inside nodes reachable from mapped nodes
- Combine Mapping: project edges from the callee into the combined graph
- Discard the callee graph
- Discard outside edges from captured nodes

62 Interthread Analysis
- Augment Analysis Representation
  - Parallel Thread Set
  - Action Set (read, write, sync, create edge)
  - Action Ordering Information (relative to thread start actions)
- Thread Interaction Analysis
  - Combine points-to graphs
  - Induces combination of other information
- Can perform interthread analysis at any point to improve precision of results

63-75 Combining Points-to Graphs at x.start()
- Start with the points-to escape graph sometime after the call to x.start() and the graph from the analysis of run()
- Initialize Mapping: map the startee thread (this) to the starter thread's node for x
- Extend Mapping: match inside and outside edges (mapping is bidirectional, from startee to starter and from starter to startee)
- Complete Mapping: automap load and inside nodes reachable from mapped nodes
- Combine Graphs: project edges through the mappings into the combined graph
- Discard the startee thread node
- Discard outside edges from captured nodes

76 Life is not so Simple
- Dependences between phases
- Mapping is best framed as a constraint satisfaction problem
- Solved using a constraint satisfaction algorithm

77 Interthread Analysis With Actions and Ordering

78 Analysis Result for generateTask
[Points-to graph: t points to the Task node a; work points to the Vector (nodes c, d); dest points to the Accumulator (nodes b, e)]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b
Action Ordering: "All actions happen before thread a starts executing"

79 Analysis Result for run
[Points-to graph over nodes 1 (Task, this), 2 and 5 (Accumulator), 3 and 4 (Vector and its contents), 6 (Enumeration)]
Parallel Threads: none
Actions: rd 1, rd 2, rd 3, rd 4, rd 5, wr 5, sync 2, sync 5, rd 6, wr 6, edge(1,2), edge(1,5), edge(2,3), edge(3,4)
Action Ordering: no parallel threads

80 Role of edge(1,2) Actions
- One edge action for each outside edge
- Action order for edge actions improves precision of interthread analysis
  - If the starter thread reads a reference before the startee thread is started, then the reference was not created by the startee thread
  - Outside edge actions record this order
  - Inside edges from the startee are matched only against parallel outside edges

81 Edge Actions in Combining Points-to Graphs
[Graphs with nodes 1, 2, 3 sometime after x.start(), plus the graph from run()] Action Ordering: edge(1,2) || 1

82 Edge Actions in Combining Points-to Graphs
[Same graphs] Action Ordering: none (i.e., edge(1,2) created before thread 1 started)

83 Analysis Result After Interaction
[Points-to graph as on slide 78]
Parallel Threads: a
Actions: wr a, wr b, wr c, wr d, sync b, rd b; from thread a: (rd a, a), (rd b, a), (rd c, a), (rd d, a), (rd e, a), (wr e, a), (sync b, a), (sync e, a)
Action Ordering: "All actions from current thread happen before thread a starts executing"

84 Roles of Intrathread and Interthread Analyses
Basic Analysis:
- Intrathread analysis delivers a parallel interaction graph at each program point
  - records parallel threads
  - does not compute thread interaction
- Choose a program point (end of method)
- Interthread analysis delivers additional precision at that program point
- Does not exploit ordering information from thread join constructs

85 Join Ordering

t = new Task();
t.start();
// computation that runs in parallel with task t
t.join();
// computation that runs after task t

t.run();  // computation from task t

86 Exploiting Join Ordering
At the join point:
- Interthread analysis delivers a new (more precise) parallel interaction graph
- Intrathread analysis uses the new graph
- No parallel interactions between the thread and the computation after the join
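The happens-before edge that join provides is what makes this precise: after t.join() returns, the parent can read the child's writes without synchronization, and the two computations never interact in parallel. A minimal demonstration (the class and field names are illustrative):

```java
public class JoinOrdering {
    static class Task extends Thread {
        int result;                          // plain field, no synchronization
        public void run() {
            int sum = 0;
            for (int i = 0; i < 100; i++) sum += i;
            result = sum;                    // the computation from task t
        }
    }

    static int compute() {
        Task t = new Task();
        t.start();
        // ... computation that runs in parallel with task t ...
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        // join() establishes a happens-before edge, so this unsynchronized
        // read is guaranteed to see the child's write to result.
        return t.result;
    }
}
```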

87 Extensions
- Partial program analysis
  - can analyze a method independent of its callers
  - can analyze a method independent of the methods it invokes
  - can incrementally analyze callees to improve precision
- Dial down precision to improve efficiency
- Demand-driven formulations

88 Key Ideas
- Explicitly represent potential interactions between analyzed and unanalyzed parts
  - Inside versus outside nodes and edges
  - Escaped versus captured nodes
  - Precisely bound ignorance
- Exploit ordering information
  - intrathread (flow sensitive)
  - interthread (starts, edge orders, joins)

89 Analysis Uses: Overheads in Standard Execution and How to Eliminate Them

90 Intrathread Analysis Result from End of run Method
[Points-to graph as on slide 79]
- Enumeration object is captured
  - Does not escape to caller
  - Does not escape to parallel threads
- Lifetime of Enumeration object is bounded by lifetime of run
- Can allocate Enumeration object on call stack instead of heap

91 Interthread Analysis Result from End of generateTask Method
[Points-to graph and actions as on slide 83]
- Vector object is captured
- Multiple threads synchronize on the Vector object
- But synchronizations from different threads do not occur concurrently
- Can eliminate synchronization on the Vector object

92 Interthread Analysis Result from End of generateTask Method
[Points-to graph and actions as on slide 83]
- Vectors, Tasks, Integers captured
- Parent and child access the objects
- Parent completes its accesses before the child starts its accesses
- Can allocate the objects on the child's per-thread heap

93 Thread Overhead
- Inefficient Thread Implementations
  - Thread Creation Overhead
  - Thread Management Overhead
  - Stack Overhead
- Use a more efficient thread implementation
  - User-level thread management
  - Per-thread heaps
  - Event-driven form

94-95 Standard Thread Implementation
[Diagram: a per-thread stack of call frames (return address, frame pointer, locals x, y and a, b, c) plus a save area]
- Call frames allocated on stack
- Context Switch: save state on stack, resume another thread
- One stack per thread

96 Event-Driven Form
[Diagram: call frames on the stack; a heap continuation holding live variables c and x plus a resume method]
- Call frames allocated on stack
- Context Switch
  - Build continuation on heap
  - Copy out live variables
  - Return out of computation
  - Resume another continuation
- One stack per processor
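The event-driven transformation can be illustrated with a hand-written continuation: instead of blocking on the stack, each step copies its live variables into a heap object and returns out of the computation, and a scheduler resumes it later. A toy sketch; the Cont interface and the scheduler loop are illustrative assumptions, not the Flex compiler's actual output:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class EventDriven {
    interface Cont { Cont resume(); }     // heap-allocated continuation; null means done

    static int total;                     // where the finished computation leaves its result

    // One step of summing 0..n-1. Instead of looping on the stack, each
    // step copies its live variables (i, sum, n) into a new continuation
    // object on the heap and returns out of the computation.
    static Cont step(int i, int sum, int n) {
        if (i == n) { total = sum; return null; }
        return () -> step(i + 1, sum + i, n);
    }

    // The scheduler: one stack per processor, a queue of ready continuations.
    static int run(int n) {
        Deque<Cont> ready = new ArrayDeque<>();
        Cont c = step(0, 0, n);
        if (c != null) ready.add(c);
        while (!ready.isEmpty()) {
            Cont next = ready.poll().resume();   // resume, collect follow-up continuation
            if (next != null) ready.add(next);
        }
        return total;
    }
}
```

With real threads the queue would also hold continuations parked on I/O, which the scheduler resumes when the asynchronous operation completes.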

97 Complications
- Standard thread models use blocking I/O
  - Automatically convert blocking I/O to asynchronous I/O
  - Scheduler manages interleaving of thread executions
- Stack-Allocatable Objects May Be Live Across Blocking Calls
  - Transfer the allocation to the per-thread heap

98 Opportunity
- On a uniprocessor, the compiler controls placement of context switch points
- If the program does not hold a lock across a blocking call, it can eliminate the lock

99 Experimental Results
- MIT Flex Compiler System
  - Static Compiler
  - Native code for StrongARM
- Server Benchmarks: http, phone, echo, time
- Scientific Computing Benchmarks: water, barnes

100 Server Benchmark Characteristics

Benchmark | IR Size (instrs) | Methods | Pre-Analysis Time (s) | Intrathread Analysis Time (s) | Interthread Analysis Time (s)
echo      | 4,639            | 131     | 28                    | 74                            | 73
time      | 4,573            | 136     | 29                    | 70                            | 74
http      | 10,643           | 292     | 103                   | 199                           | 269
phone     | 9,547            | 267     | 75                    | 191                           | 256

101 Percentage of Eliminated Synchronization Operations
[Bar chart, 0-100%, for http, phone, time, echo, mtrt: Intrathread only vs. Interthread]

102 Compilation Options for Performance Results
- Standard: kernel threads, synchronization included
- Event-Driven: event-driven, no synchronization at all
- +Per-Thread Heap: event-driven, no synchronization at all, per-thread heap allocation

103 Throughput (Responses per Second)
[Bar chart, 0-400, for echo, time, http 2K, http 20K, phone: Standard vs. Event-Driven vs. +Per-Thread Heap]

104 Scientific Benchmark Characteristics

Benchmark | IR Size (instrs) | Methods | Pre-Analysis Time (s) | Total Analysis Time (s)
water     | 25,583           | 335     | 380                   | 1,156
barnes    | 19,764           | 364     | 129                   | 491

105 Compiler Options
- 0: Sequential C++
- 1: Baseline (Kernel Threads)
- 2: Lightweight Threads
- 3: Lightweight Threads + Stack Allocation
- 4: Lightweight Threads + Stack Allocation − Synchronization

106 Execution Times
[Bar chart, proportion of sequential C++ execution time (0-1), for water small, water, barnes: Baseline vs. +Light vs. +Stack vs. −Synch]

107 Related Work
Pointer Analysis for Sequential Programs:
- Chatterjee, Ryder, Landi (POPL 99)
- Sathyanathan & Lam (LCPC 96)
- Steensgaard (POPL 96)
- Wilson & Lam (PLDI 95)
- Emami, Ghiya, Hendren (PLDI 94)
- Choi, Burke, Carini (POPL 93)

108 Related Work
Pointer Analysis for Multithreaded Programs:
- Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
- We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
Escape Analysis:
- Blanchet (POPL 98)
- Deutsch (POPL 90, POPL 97)
- Park & Goldberg (PLDI 92)

109 Related Work
Synchronization Optimizations:
- Diniz & Rinard (LCPC 96, POPL 97)
- Plevyak, Zhang, Chien (POPL 95)
- Aldrich, Chambers, Sirer, Eggers (SAS 99)
- Blanchet (OOPSLA 99)
- Bogda, Hoelzle (OOPSLA 99)
- Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
- Ruf (PLDI 00)

110 Conclusion
- New Analysis Algorithm
  - Flow-sensitive, compositional
  - Multithreaded programs
  - Explicitly represents interactions between analyzed and unanalyzed parts
- Analysis Uses
  - Synchronization elimination
  - Stack allocation
  - Per-thread heap allocation
- Lightweight Threads

