Demand-driven Execution of Directed Acyclic Graphs Using Task Parallelism
Prabhanjan Kambadur, Open Systems Lab, Indiana University
with Anshul Gupta (IBM T. J. Watson), Torsten Hoefler (IU), and Andrew Lumsdaine (IU)

Overview: Motivation, Background, DAG execution, Case study, Conclusion

Motivation
Parallelism is ubiquitous: multi-core, many-core, and GPGPUs.
Support for efficient parallel execution of DAGs is a must; DAGs are a powerful means of expressing application-level parallelism.
Task parallelism does not yet offer complete support for it.
This talk is not about DAG scheduling!

Dataflow models
A powerful parallelization model for applications: Id, Sisal, LUSTRE, and the BLAZE family of languages.
Models are classified by the order in which DAG nodes execute:
Data-driven dataflow model: a computation is initiated as soon as all of its inputs become available.
Demand-driven dataflow model: a computation is initiated only when its result is needed.
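For a single node computing c = a + b, the distinction can be sketched in a few lines of C++ (an illustration written for this transcript, not code from the talk):

// Data-driven vs. demand-driven execution of one DAG node c = a + b.
// All names here are illustrative.
#include <functional>
#include <iostream>

struct Node {
    int value = 0;
    bool ready = false;
    std::function<int()> compute;   // how to produce the value on demand
};

// Data-driven: the node fires as soon as all of its inputs are available.
void data_driven_step(Node& a, Node& b, Node& c) {
    if (a.ready && b.ready) {       // inputs arrived -> execute now
        c.value = a.value + b.value;
        c.ready = true;
    }
}

// Demand-driven: nothing runs until someone asks for the node's result;
// the demand propagates backwards and pulls the inputs into existence.
int demand(Node& n) {
    if (!n.ready) { n.value = n.compute(); n.ready = true; }
    return n.value;
}

int main() {
    Node a, b, c;
    a.compute = [] { return 1; };
    b.compute = [] { return 2; };
    c.compute = [&] { return demand(a) + demand(b); };
    std::cout << demand(c) << std::endl;  // demanding c drives the whole DAG
}

In the data-driven style the producers push results forward; in the demand-driven style the single request for c is what causes a and b to be computed at all.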

Fibonacci
int fib (int n) {
  if (0 == n || 1 == n) return n;
  else return fib (n-1) + fib (n-2);
}

Task parallelism and Cilk
The program is broken down into smaller tasks, and independent tasks are executed in parallel.
A generic model of parallelism that subsumes data parallelism and SPMD parallelism.
Cilk (Leiserson et al.) is the best-known implementation: C and C++, shared memory.
Cilk introduced the work-stealing scheduler, which guarantees bounds on space and time because of its fully strict computation model.
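The work-stealing idea behind Cilk's scheduler can be sketched as follows. This is a deliberately simplified, mutex-protected version written for this transcript; the real Cilk runtime uses a lock-free protocol:

// Simplified work-stealing deque: the owning worker pushes and pops
// tasks LIFO at the bottom, idle workers steal FIFO from the top.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

class WorkStealingDeque {
    std::deque<std::function<void()>> tasks;
    std::mutex m;
public:
    void push(std::function<void()> t) {            // owner: spawn a task
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {    // owner: newest task first
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        auto t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<std::function<void()>> steal() {  // thief: oldest task first
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        auto t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

Owners work depth-first on their own deque, which preserves locality, while thieves steal the oldest (typically largest) tasks, which keeps the number of steals low.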

Parallel Fibonacci
cilk int fib (int n) {
  if (0 == n || 1 == n) return n;
  else {
    int x = spawn fib (n-1);
    int y = spawn fib (n-2);
    sync;
    return x + y;
  }
}
1. Each task has exactly one parent.
2. All tasks return to their respective parents.
Demand-driven execution!

Classic task-parallel DAG execution
(Figure: a DAG annotated with the flow of data and the flow of demand.)
Data-driven!

Demand-driven parallel DAG execution
(Figure: a DAG annotated with the flow of data and the flow of demand.)
Does not follow the fully strict model: multiple completion notifications.
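To make the contrast concrete, here is an illustrative C++ sketch, written for this transcript rather than taken from the paper's runtime, of a shared DAG node that executes only when demanded and then notifies every parent that demanded it; allowing more than one completion notification is exactly where the fully strict model is abandoned:

// Sequential sketch of demand-driven execution with multiple
// completion notifications. Names and structure are illustrative.
#include <functional>
#include <vector>

struct DagNode {
    std::function<void()> work;                  // the node's computation
    std::vector<DagNode*> inputs;                // nodes this one depends on
    std::vector<std::function<void()>> waiters;  // pending completion notifications
    bool done = false;

    // A parent demands this node; the demand flows down the DAG and the
    // completion notification flows back up to every registered parent.
    void demand(std::function<void()> on_complete) {
        if (done) { on_complete(); return; }     // shared node already computed
        waiters.push_back(std::move(on_complete));
        for (DagNode* in : inputs)               // pull the inputs on demand
            in->demand([] {});
        work();                                  // all inputs are now available
        done = true;
        for (auto& w : waiters) w();             // notify every demanding parent
        waiters.clear();
    }
};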

What is different?
In a large DAG, the spawning and completion order of the nodes changes, which alters data locality; the lifetime of dynamically allocated memory is also affected, which alters the memory profile.
In a DAG with very few roots, the user gains control over parallelization and can, for example, shut off parallelism at the lower levels.

PFunc: An overview
A library-based solution for task parallelism: C and C++ APIs, shared memory.
Extends the feature set of existing task-parallel tools such as Cilk, Threading Building Blocks (TBB), and Fortran M.
Customizable task scheduling: cilkS, prioS, fifoS, and lifoS policies are provided.
Multiple task-completion notifications on demand, deviating from the strict computation model.
(Speaker notes: OpenMP, Tascell, Tpascal, and MulTScheme were left out on purpose. The idea here is to identify the core features of task parallelism and lift them, like any good generic programmer!)
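As a rough illustration of the last two features, the sketch below shows a scheduling policy chosen at compile time and a task manager that fires several completion notifications. The policy names (cilkS, prioS, fifoS, lifoS) come from the slide, but task_manager, spawn, notify_on_completion, and the rest are invented for this sketch and do not reproduce the actual PFunc API:

// Hypothetical sketch: compile-time scheduling policy + multiple
// completion notifications. Not the real PFunc interface.
#include <functional>
#include <queue>
#include <vector>

struct fifoS {                        // one possible scheduling policy
    std::queue<std::function<void()>> q;
    void put(std::function<void()> t) { q.push(std::move(t)); }
    std::function<void()> get() { auto t = q.front(); q.pop(); return t; }
    bool empty() const { return q.empty(); }
};

template <typename SchedulingPolicy>  // e.g. cilkS, prioS, fifoS, lifoS
struct task_manager {
    SchedulingPolicy policy;
    std::vector<std::function<void()>> notifications;

    void spawn(std::function<void()> work) { policy.put(std::move(work)); }
    void notify_on_completion(std::function<void()> f) {
        notifications.push_back(std::move(f));
    }
    void run() {                      // sequential stand-in for the worker threads
        while (!policy.empty()) policy.get()();
        for (auto& f : notifications) f();   // multiple completion notifications
    }
};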

Case Study

Sparse unsymmetric Cholesky factorization
(Figure: frontal matrix, update matrix, L', and U'; the flow of data and the flow of demand are marked.)
Each node allocates memory; the memory is freed once all of its children have executed.
Short and stubby DAGs with one root.
(Speaker notes: The DAGs were generated in the symbolic phase of the sparse unsymmetric multifrontal Cholesky solver. Two pieces of information are available: the structure and the weight of each node. From the weights we can compute both the memory requirements and the computational cost of each node. The only difference is that in the real problem the computed LU factors are not always evenly distributed to the children, so the memory access patterns may differ slightly from the original; implementing an actual solver would have taken too much time.)
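The memory pattern described above, where a node's matrices stay live until every child has consumed them, can be sketched with a per-node reference count (an illustration written for this transcript, not the solver's code):

// Each node allocates its update/frontal matrix and releases it only
// after the last of its children has executed.
#include <atomic>
#include <cstddef>
#include <vector>

struct NodeBuffer {
    std::vector<double> data;          // the node's update/frontal matrix
    std::atomic<int> pending_children; // children that still need the data

    NodeBuffer(std::size_t n, int num_children)
        : data(n), pending_children(num_children) {}

    // Called by each child task once it has finished consuming the data.
    void child_done() {
        if (pending_children.fetch_sub(1) == 1) {
            // The last child has executed: free the node's memory.
            std::vector<double>().swap(data);
        }
    }
};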

Demand-driven DAG execution

DAG execution: Runtime
Hardware: dual Intel X5365 quad-cores (8 cores). Software: Cilk 5.4.6, SunCC 5.10, GCC 4.3.2, Linux 2.6.27, TBB 2.1.

DAG execution: Peak memory usage
Hardware: dual Intel X5365 quad-cores (8 cores). Software: Cilk 5.4.6, SunCC 5.10, GCC 4.3.2, Linux 2.6.27, TBB 2.1.

Conclusions
Demand-driven DAG execution needs first-class support, because it promotes user-driven optimizations.
PFunc extends tasking support with demand-driven DAG execution, multiple completion notifications, and customizable task-scheduling policies.
Future work: parallelize more applications and incorporate support for GPGPUs.
https://projects.coin-or.org/PFunc

Questions?