Praveen Yedlapalli, Emre Kultursay, Mahmut Kandemir (The Pennsylvania State University)

 Motivation  Introduction  Cooperative Parallelization  Programmer’s Input  Evaluation  Conclusion

 Program parallelization is a difficult task  Automatic parallelization helps in parallelizing sequential applications  Most parallelization techniques focus on array-based applications  There is limited support for parallelizing pointer-intensive applications

Tree Traversal:

void traverse_tree (Tree *tree) {
    if (tree->left) traverse_tree(tree->left);
    if (tree->right) traverse_tree(tree->right);
    process(tree);
}

List Traversal:

void traverse_list (List *list) {
    List *node = list;
    while (node != NULL) {
        process(node);
        node = node->next;
    }
}

 Program parallelization is a two-fold problem  First problem: finding where parallelism, if any, is available in the application  Second problem: deciding how to efficiently exploit the available parallelism

 Use static analysis to perform dependence checking and identify independent parts of the program  Target regular structures such as arrays and for loops  Pointer-intensive codes cannot be analyzed accurately with static analysis

 Pointer-intensive applications typically have ◦ Data structures built from the input ◦ while loops that traverse those data structures  Without points-to information and without loop counts, there is very little we can do at compile time

 In array-based applications with for loops, sets of iterations are distributed to different threads  In pointer-intensive applications, information about the data structure is needed to run the parallel code

 The programmer has a high-level view of the program and can give hints about it  Hints can indicate things like ◦ Whether a loop can be parallelized ◦ Whether function calls are independent ◦ The structure of the working data  All of these pieces of information are vital for program parallelization

 To efficiently exploit parallelism in pointer-intensive applications we need runtime information ◦ The size and shape of the data structure (which depend on the input) ◦ Points-to information  Using the points-to information, we determine the work distribution

Cooperative Parallelization

[Diagram: the programmer (hints), the compiler, and the runtime system cooperate to turn a sequential program into a parallel program]

 Cooperation between the programmer, the compiler, and the runtime system to identify and efficiently exploit parallelism in pointer-intensive applications  The task of identifying parallelism in the code is delegated to the programmer  The runtime system is responsible for monitoring the program and efficiently executing the parallel code

 Pointer-intensive applications ◦ A data structure is built from the input ◦ The data structure is traversed several times and nodes are processed  The operations on nodes are typically independent  This fact can be obtained from the programmer as a hint

int perimeter (QuadTree tree, int size) {
    int retval = 0;
    if (tree->color == grey) {   /* node has children */
        retval += perimeter(tree->nw, size/2);
        retval += perimeter(tree->ne, size/2);
        retval += perimeter(tree->sw, size/2);
        retval += perimeter(tree->se, size/2);
    } else if (tree->color == black) {
        ... /* do something on the node */
    }
    return retval;
}

Function from the perimeter benchmark
[Figure: the quad tree root `tree` with its nw ... se subtrees]

void compute_node (node_t *nodelist) {
    int i;
    while (nodelist != NULL) {
        for (i = 0; i < nodelist->from_count; i++) {
            node_t *other_node = nodelist->from_nodes[i];
            double coeff = nodelist->coeffs[i];
            double value = other_node->value;
            nodelist->value -= coeff * value;
        }
        nodelist = nodelist->next;
    }
}

Function from the em3d benchmark
[Figure: the list head followed by sublist 1 ... sublist n]

 Processing different parts of the data structure (sub-problems) can be done in parallel  This requires access to multiple sub-problems at runtime  The task of finding these sub-problems in the data structure is done by a helper thread
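
To make this concrete, here is a minimal sketch, in C, of how one sub-problem of the em3d list could be processed by an application thread once the helper thread (described next) has recorded where each sublist starts and how many nodes it contains. The type and function names (subproblem_t, compute_subproblem) are assumptions for illustration, not the authors' actual code.

typedef struct node {
    struct node  *next;
    struct node **from_nodes;
    double       *coeffs;
    double        value;
    int           from_count;
} node_t;

/* One sub-problem: the first node of a sublist and its length. */
typedef struct {
    node_t *head;
    int     count;
} subproblem_t;

/* Each application thread runs the original loop body, restricted to
 * the nodes of the sub-problem it was assigned. */
void compute_subproblem (subproblem_t *sp) {
    node_t *n = sp->head;
    for (int k = 0; k < sp->count && n != NULL; k++, n = n->next) {
        for (int i = 0; i < n->from_count; i++) {
            n->value -= n->coeffs[i] * n->from_nodes[i]->value;
        }
    }
}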

 The helper thread goes over the data structure and finds multiple independent sub-problems  The helper thread does not need to traverse the whole data structure to find the sub-problems  Using a separate thread to find the sub-problems reduces the overhead
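
The following is a minimal sketch of what such a helper thread routine might look like for a binary tree; the names and the fixed-depth cut-off are assumptions for illustration, not the paper's implementation. It descends only a few levels and records the subtree roots it finds there as sub-problems, so it touches only a small fraction of the structure.

#define MAX_SUBPROBLEMS 64

typedef struct tree {
    struct tree *left, *right;
    /* ... application payload ... */
} Tree;

static Tree *subproblems[MAX_SUBPROBLEMS];
static int   num_subproblems = 0;

/* Collect the roots of independent subtrees, cutting the tree at a
 * fixed depth.  The few nodes above the cut would be handled
 * separately (e.g. by the main thread). */
void find_subproblems (Tree *t, int depth) {
    if (t == NULL || num_subproblems >= MAX_SUBPROBLEMS)
        return;
    if (depth == 0 || (t->left == NULL && t->right == NULL)) {
        subproblems[num_subproblems++] = t;   /* one sub-problem */
        return;
    }
    find_subproblems(t->left,  depth - 1);
    find_subproblems(t->right, depth - 1);
}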

[Figure: sequential vs. parallel execution of the loop; in the parallel case a helper thread first finds the sub-problems, then the application threads execute the loop concurrently]

helper thread:
    wait for signal from main thread
    find sub-problems in the data structure
    signal main thread

application thread:
    wait for signal from main thread
    work on the sub-problems assigned to this thread
    signal main thread

main thread:
    signal helper thread when data structure is ready
    wait for signal from helper thread
    distribute sub-problems to application threads
    signal application threads
    wait for signal from application threads
    merge results from all the application threads
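
This signalling could be implemented in many ways; the sketch below uses POSIX semaphores purely for illustration. The thread count, the names, and the choice of semaphores are assumptions, not details from the paper.

#include <pthread.h>
#include <semaphore.h>

#define NUM_THREADS 4

/* All semaphores are assumed to be initialised to 0 with sem_init(). */
static sem_t helper_go, helper_done;
static sem_t work_go[NUM_THREADS], work_done[NUM_THREADS];

void *helper_thread (void *arg) {
    (void)arg;
    sem_wait(&helper_go);      /* wait until the data structure is ready */
    /* find sub-problems in the data structure, e.g. find_subproblems() */
    sem_post(&helper_done);    /* tell the main thread they are available */
    return NULL;
}

void *application_thread (void *arg) {
    int id = *(int *)arg;
    sem_wait(&work_go[id]);    /* wait for the assigned sub-problems */
    /* work on the sub-problems assigned to this thread */
    sem_post(&work_done[id]);  /* report completion to the main thread */
    return NULL;
}

void run_parallel_region (void) {
    sem_post(&helper_go);                    /* data structure is ready */
    sem_wait(&helper_done);
    for (int i = 0; i < NUM_THREADS; i++)    /* distribute sub-problems */
        sem_post(&work_go[i]);
    for (int i = 0; i < NUM_THREADS; i++)    /* wait for all threads */
        sem_wait(&work_done[i]);
    /* merge results from all the application threads here */
}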

 The runtime information collected is used to determine the profitability of parallelization  This decision can be driven by the programmer using a hint  The program is parallelized only if the data structure is “big” enough
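
One possible form of that check, again only as an illustrative sketch with assumed names:

/* Parallelize only when the helper thread found enough independent
 * work; `threshold` would come from the programmer's hint. */
int should_parallelize (int num_subproblems, long total_nodes, long threshold) {
    return num_subproblems > 1 && total_nodes >= threshold;
}

The main thread would evaluate such a predicate after the helper thread finishes and fall back to the original sequential function when it fails.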

 Interface between the programmer and the compiler  Should be simple to use and require only minimal, essential information:

#parallel tree function (threads) (degree) (struct) {children} threshold [reduction]
#parallel llist function (threads) (struct) (next_node) threshold [number]
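
For illustration only, the two templates might be instantiated as follows for the perimeter and em3d functions shown earlier; the concrete field values are hypothetical, inferred from the template layout rather than taken from the paper.

/* quad tree: 4 threads, branching degree 4, node type QuadTree,
 * child fields nw/ne/sw/se, parallelize only above 10000 nodes,
 * combine the per-thread return values with a + reduction */
#parallel tree perimeter (4) (4) (QuadTree) {nw, ne, sw, se} 10000 [+]

/* linked list: 4 threads, node type node_t, next field `next`,
 * parallelize only above 10000 nodes */
#parallel llist compute_node (4) (node_t) (next) 10000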

 Implemented a source-to-source translator  Modified the C language grammar to understand the hints

[Figure: tool flow; the modified C grammar is fed to a parser generator to produce the translator, which converts a C program with hints into a parallel program]

Platform: Simics simulator, 16-core hardware, 32-bit Linux OS

Benchmark      Data Structure
bisort         Binary Tree
treeAdd        Binary Tree
tsp            Binary Tree
perimeter      Quad Tree
em3d           Singly Linked List
mst            Singly Linked List
otter          Singly Linked List

All benchmarks except otter are from the Olden suite

[Results chart: 15x speedup]

 The helper thread can be invoked before the main thread reaches the computation, overlapping the overhead of finding the sub-problems with useful work  The helper thread in general traverses only part of the data structure and takes much less time than the original function

 OpenMP 3.0 supports task parallelism ◦ Directives can be added in the code to parallelize while loops and recursive functions  OpenMP tasks do not take application runtime information into consideration  Tasks tend to be fine-grained  Significant performance overhead
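
For comparison, the tree traversal from the earlier slide written with OpenMP 3.0 tasks might look like the sketch below; every recursive call becomes its own task, which is the fine-grained behaviour (and overhead) referred to above. The Tree type and process() body here are stand-ins for illustration.

#include <omp.h>

typedef struct tree { struct tree *left, *right; int data; } Tree;

static void process (Tree *t) { t->data += 1; }  /* stand-in for the real work */

void traverse_tree_omp (Tree *tree) {
    if (tree == NULL) return;
    #pragma omp task firstprivate(tree)
    traverse_tree_omp(tree->left);
    #pragma omp task firstprivate(tree)
    traverse_tree_omp(tree->right);
    #pragma omp taskwait              /* children first, as in the original */
    process(tree);
}

/* Typically invoked once from inside a parallel region:
 *   #pragma omp parallel
 *   #pragma omp single
 *   traverse_tree_omp(root);
 */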

 Speculative parallelization can help in parallelizing programs that are difficult to analyze  That comes at the cost of executing instructions that might not be useful ◦ Power and performance overhead  Our approach is a non-speculative way of parallelizing such programs

 Traditional parallelization techniques cannot efficiently parallelize pointer-intensive codes  By combining the programmer's knowledge with application runtime information, we can exploit parallelism in such codes  The idea presented is not limited to trees and linked lists; it can be extended to other dynamic structures such as graphs

Questions?