
1 Praveen Yedlapalli, Emre Kultursay, Mahmut Kandemir. The Pennsylvania State University

2
• Motivation
• Introduction
• Cooperative Parallelization
• Programmer’s Input
• Evaluation
• Conclusion

3
• Program parallelization is a difficult task
• Automatic parallelization helps in parallelizing sequential applications
• Most parallelizing techniques focus on array-based applications
• There is limited support for parallelizing pointer-intensive applications

4 Tree Traversal and List Traversal

void traverse_tree(Tree *tree) {
    if (tree->left) traverse_tree(tree->left);
    if (tree->right) traverse_tree(tree->right);
    process(tree);
}

void traverse_list(List *list) {
    List *node = list;
    while (node != NULL) {
        process(node);
        node = node->next;
    }
}

5
• Program parallelization is a two-fold problem
• First problem: finding where parallelism, if any, is available in the application
• Second problem: deciding how to efficiently exploit the available parallelism

6
• Use static analysis to perform dependence checking and identify independent parts of the program
• Target regular structures like arrays and for loops
• Pointer-intensive code cannot be analyzed accurately with static analysis

7
• Pointer-intensive applications typically have
  ◦ data structures built from the input
  ◦ while loops that traverse those data structures
• Without points-to information and without loop counts, there is very little we can do at compile time

8
• In array-based applications with for loops, sets of iterations are distributed to different threads
• In pointer-intensive applications, information about the data structure is needed to run the parallel code

9
• The programmer has a high-level view of the program and can give hints about it
• Hints can indicate things like
  ◦ whether a loop can be parallelized
  ◦ whether function calls are independent
  ◦ the structure of the working data
• All of these bits of information are vital for program parallelization

10
• To efficiently exploit parallelism in pointer-intensive applications we need runtime information
  ◦ the size and shape of the data structure (dependent on the input)
  ◦ points-to information
• Using the points-to information, we determine the work distribution (see the sketch below)
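As a concrete illustration for the linked-list case, a helper pass might make a single sweep over the list and record the head of every fixed-size segment as an independent subproblem; these segment descriptors then drive the work distribution. This is a minimal sketch of the idea under assumed names (node_t, CHUNK, subproblem_t are ours, not taken from the paper):

#include <stdlib.h>

#define CHUNK 1024   /* hypothetical subproblem size, in nodes */

typedef struct node { struct node *next; /* ... payload ... */ } node_t;
typedef struct { node_t *start; int count; } subproblem_t;

/* One pass over the list: record the head of every CHUNK-node
 * segment as a subproblem. Returns the number of subproblems. */
static int find_subproblems(node_t *head, subproblem_t **out) {
    int cap = 16, n = 0, i = 0;
    subproblem_t *subs = malloc(cap * sizeof *subs);
    for (node_t *p = head; p != NULL; p = p->next, i++) {
        if (i % CHUNK == 0) {                 /* start a new segment */
            if (n == cap)
                subs = realloc(subs, (cap *= 2) * sizeof *subs);
            subs[n].start = p;
            subs[n].count = 0;
            n++;
        }
        subs[n - 1].count++;
    }
    *out = subs;
    return n;
}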

11 Cooperative Parallelization
[diagram: Programmer (hints), Compiler, and Runtime System cooperate to turn a Sequential Program into a Parallel Program]

12
• Cooperation between the programmer, the compiler, and the runtime system to identify and efficiently exploit parallelism in pointer-intensive applications
• The task of identifying parallelism in the code is delegated to the programmer
• The runtime system is responsible for monitoring the program and efficiently executing the parallel code

13
• Pointer-intensive applications
  ◦ a data structure is built from the input
  ◦ the data structure is traversed several times and its nodes are processed
• The operations on the nodes are typically independent
• This fact can be obtained from the programmer as a hint

14 Function from the perimeter benchmark

int perimeter(QuadTree tree, int size) {
    int retval = 0;
    if (tree->color == grey) {   /* node has children */
        retval += perimeter(tree->nw, size/2);
        retval += perimeter(tree->ne, size/2);
        retval += perimeter(tree->sw, size/2);
        retval += perimeter(tree->se, size/2);
    } else if (tree->color == black) {
        ... /* do something on the node */
    }
    return retval;
}

[diagram: quadtree root with nw ... se subtrees]

15 Function from the em3d benchmark

void compute_node(node_t *nodelist) {
    int i;
    while (nodelist != NULL) {
        for (i = 0; i < nodelist->from_count; i++) {
            node_t *other_node = nodelist->from_nodes[i];
            double coeff = nodelist->coeffs[i];
            double value = other_node->value;
            nodelist->value -= coeff * value;
        }
        nodelist = nodelist->next;
    }
}

[diagram: list head split into sublist 1 ... sublist n]

16
• Processing of different parts of the data structure (subproblems) can be done in parallel
• This needs access to multiple subproblems at runtime
• The task of finding these subproblems in the data structure is done by a helper thread

17
• The helper thread goes over the data structure and finds multiple independent subproblems
• The helper thread doesn't need to traverse the whole data structure to find the subproblems (see the tree sketch below)
• Using a separate thread for finding the subproblems reduces the overhead
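For trees, a partial traversal is enough: descending only cut levels exposes up to 2^cut independent subtrees (4^cut for a quadtree) while the helper never visits the bulk of the structure below the cut. A minimal sketch for a binary tree, with names of our own choosing:

typedef struct tree { struct tree *left, *right; /* ... payload ... */ } Tree;

/* Collect the roots of the subtrees at depth 'cut' into subs[]; the
 * caller sizes subs[] for at most 2^cut entries. The helper visits
 * only O(2^cut) nodes, not the whole tree. */
static void collect_subtrees(Tree *t, int cut, Tree **subs, int *n) {
    if (t == NULL)
        return;
    if (cut == 0) {              /* reached the cut: one subproblem */
        subs[(*n)++] = t;
        return;
    }
    collect_subtrees(t->left,  cut - 1, subs, n);
    collect_subtrees(t->right, cut - 1, subs, n);
}

The few nodes above the cut (at most 2^cut - 1) can be processed by the main thread, e.g. during the merge step.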

18 [diagram: sequential execution runs the loop on one thread; parallel execution runs the helper thread first, then splits the loop across the application threads]

19
helper thread:
    wait for signal from main thread
    find subproblems in the data structure
    signal main thread

application thread:
    wait for signal from main thread
    work on the subproblems assigned to this thread
    signal main thread

main thread:
    signal helper thread when the data structure is ready
    wait for signal from helper thread
    distribute subproblems to application threads
    signal application threads
    wait for signal from application threads
    merge results from all the application threads
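This hand-off maps naturally onto counting semaphores. Below is a condensed sketch of the main thread's side; run_parallel_region, distribute_subproblems, and merge_results are hypothetical names, not the paper's, and all semaphores are assumed to be initialized to 0 with sem_init at startup (not shown):

#include <semaphore.h>

#define NUM_THREADS 4                  /* hypothetical thread count */

static sem_t helper_go, helper_done;   /* main <-> helper thread */
static sem_t work_go, work_done;       /* main <-> application threads */

static void distribute_subproblems(void) { /* ... assign work ... */ }
static void merge_results(void)           { /* ... reduction ...  */ }

/* Main thread's side of the protocol on the slide. */
static void run_parallel_region(void) {
    sem_post(&helper_go);                     /* data structure is ready */
    sem_wait(&helper_done);                   /* subproblems found */
    distribute_subproblems();
    for (int i = 0; i < NUM_THREADS; i++)
        sem_post(&work_go);                   /* release every worker */
    for (int i = 0; i < NUM_THREADS; i++)
        sem_wait(&work_done);                 /* wait for all workers */
    merge_results();
}

Each application thread is the mirror image: sem_wait(&work_go), process its assigned subproblems, then sem_post(&work_done).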

20
• The runtime information collected is used to determine the profitability of parallelization
• This decision can be driven by the programmer through a hint
• The program is parallelized only if the data structure is "big" enough (a possible check is sketched below)
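One possible form of this check, reusing the earlier sketches (find_subproblems, NUM_THREADS, run_parallel_region, and compute_node stand in for whatever the real implementation uses): parallelize only when the helper found enough independent subproblems to keep every thread busy.

#include <stdlib.h>

/* Hypothetical profitability gate around the em3d-style computation. */
void compute(node_t *list) {
    subproblem_t *subs;
    int n = find_subproblems(list, &subs);   /* helper pass (see above) */
    if (n >= NUM_THREADS)
        run_parallel_region();               /* big enough: go parallel */
    else
        compute_node(list);                  /* too small: stay sequential */
    free(subs);
}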

21
• Interface between the programmer and the compiler
• Should be simple to use, with minimal essential information

#parallel tree function (threads) (degree) (struct) {children} threshold [reduction]
#parallel llist function (threads) (struct) (next_node) threshold [number]
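Applied to the earlier examples, the hints might look as follows. The parameter meanings here are our reading of the grammar above (thread count, tree degree, node type, child fields, profitability threshold, optional reduction variable); the concrete values are placeholders, not taken from the paper:

#parallel tree perimeter (4) (4) (QuadTree) {nw,ne,sw,se} 1024 [retval]
#parallel llist compute_node (4) (node_t) (next) 1024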

22
• Implemented a source-to-source translator
• Modified the C language grammar to understand the hints

[diagram: the modified C grammar feeds a parser generator, which produces the translator; the translator turns a C program with hints into a parallel program]

23 Platform: Simics simulator, 16-core hardware, 32-bit Linux OS

Benchmark   Data Structure
bisort      Binary Tree
treeAdd     Binary Tree
tsp         Binary Tree
perimeter   Quad Tree
em3d        Singly Linked List
mst         Singly Linked List
otter       Singly Linked List

All benchmarks except otter are from the Olden suite.

24 [chart: evaluation results showing 15x speedup]

25
• The helper thread can be invoked before the main thread reaches the computation, overlapping the overhead of finding the subproblems
• The helper thread generally traverses only part of the data structure and takes very little time compared to the original function

26
• OpenMP 3.0 supports task parallelism
  ◦ directives can be added to the code to parallelize while loops and recursive functions
• OpenMP tasks don't take application runtime information into consideration
• Tasks tend to be fine-grained
• Significant performance overhead (see the sketch below)
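For contrast, an OpenMP 3.0 task version of the tree traversal from slide 4 would look roughly like this (Tree and process are the types from that example); every recursive call becomes its own fine-grained task, created with no knowledge of how large the tree actually is:

#include <omp.h>

void traverse_tree_omp(Tree *tree) {
    if (tree->left) {
        #pragma omp task            /* one task per subtree */
        traverse_tree_omp(tree->left);
    }
    if (tree->right) {
        #pragma omp task
        traverse_tree_omp(tree->right);
    }
    #pragma omp taskwait            /* children finish before the parent */
    process(tree);
}

/* Launched once from a parallel region:
 *   #pragma omp parallel
 *   #pragma omp single
 *   traverse_tree_omp(root);
 */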

27
• Speculative parallelization can help in parallelizing programs that are difficult to analyze
• But it comes at the cost of executing instructions that might not be useful
  ◦ power and performance overhead
• Our approach is a non-speculative approach to parallelization

28
• Traditional parallelization techniques cannot efficiently parallelize pointer-intensive code
• By combining the programmer's knowledge with application runtime information, we can exploit parallelism in such code
• The idea presented is not limited to trees and linked lists; it can be extended to other dynamic structures such as graphs

29 Questions?

