RUNTIME DATA FLOW SCHEDULING OF MATRIX COMPUTATIONS
Ernie Chan

ABSTRACT

We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the coarse data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. We have developed a scheduling algorithm that addresses both load balance and data locality simultaneously, and we show its performance benefits.

CHOLESKY FACTORIZATION

A → L L^T, where A is a symmetric positive definite matrix and L is a lower triangular matrix.

[Figure: Blocked right-looking algorithm (left) and implementation (right) for Cholesky factorization. The implementation is reproduced below.]

FLA_Error FLASH_Chol_l_var3( FLA_Obj A )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*--------------------------------------------------------------*/

    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );

    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );

    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );

    /*--------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}

ALGORITHM-BY-BLOCKS

Reformulate a blocked algorithm into an algorithm-by-blocks by decomposing its sub-operations into component operations on blocks (tasks). The algorithm-by-blocks for Cholesky factorization given a 4×4 matrix of blocks generates the following tasks:

ITERATION 1
  CHOL 0:  A0,0 := CHOL( A0,0 )
  TRSM 1:  A1,0 := A1,0 A0,0^-T
  TRSM 2:  A2,0 := A2,0 A0,0^-T
  TRSM 3:  A3,0 := A3,0 A0,0^-T
  SYRK 4:  A1,1 := A1,1 - A1,0 A1,0^T
  GEMM 5:  A2,1 := A2,1 - A2,0 A1,0^T
  GEMM 6:  A3,1 := A3,1 - A3,0 A1,0^T
  SYRK 7:  A2,2 := A2,2 - A2,0 A2,0^T
  GEMM 8:  A3,2 := A3,2 - A3,0 A2,0^T
  SYRK 9:  A3,3 := A3,3 - A3,0 A3,0^T

ITERATION 2
  CHOL 10: A1,1 := CHOL( A1,1 )
  TRSM 11: A2,1 := A2,1 A1,1^-T
  TRSM 12: A3,1 := A3,1 A1,1^-T
  SYRK 13: A2,2 := A2,2 - A2,1 A2,1^T
  GEMM 14: A3,2 := A3,2 - A3,1 A2,1^T
  SYRK 15: A3,3 := A3,3 - A3,1 A3,1^T

ITERATION 3
  CHOL 16: A2,2 := CHOL( A2,2 )
  TRSM 17: A3,2 := A3,2 A2,2^-T
  SYRK 18: A3,3 := A3,3 - A3,2 A3,2^T

ITERATION 4
  CHOL 19: A3,3 := CHOL( A3,3 )
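These tasks can be generated mechanically with three nested loops. The following is a minimal, self-contained C sketch, not the FLAME/SuperMatrix API; the enqueue() helper is a hypothetical stand-in for appending a task to a runtime's global task queue. For n = 4 it prints the twenty tasks in exactly the order tabulated above.

#include <stdio.h>

static int task_id = 0;

/* Hypothetical stand-in: a real runtime would record the task and its
   operand blocks instead of printing a description. */
static void enqueue( const char *desc )
{
    printf( "task %2d: %s\n", task_id++, desc );
}

/* Unroll right-looking Cholesky on an n x n matrix of blocks into tasks. */
void chol_by_blocks( int n )
{
    char buf[128];
    for ( int k = 0; k < n; k++ )
    {
        /* Factor the diagonal block. */
        snprintf( buf, sizeof buf, "CHOL  A%d,%d := CHOL( A%d,%d )", k, k, k, k );
        enqueue( buf );

        /* Triangular solves against the factored diagonal block. */
        for ( int i = k + 1; i < n; i++ )
        {
            snprintf( buf, sizeof buf, "TRSM  A%d,%d := A%d,%d A%d,%d^-T",
                      i, k, i, k, k, k );
            enqueue( buf );
        }

        /* Symmetric rank-k and general updates of the trailing matrix. */
        for ( int j = k + 1; j < n; j++ )
        {
            snprintf( buf, sizeof buf, "SYRK  A%d,%d := A%d,%d - A%d,%d A%d,%d^T",
                      j, j, j, j, j, k, j, k );
            enqueue( buf );
            for ( int i = j + 1; i < n; i++ )
            {
                snprintf( buf, sizeof buf, "GEMM  A%d,%d := A%d,%d - A%d,%d A%d,%d^T",
                          i, j, i, j, i, k, j, k );
                enqueue( buf );
            }
        }
    }
}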
DIRECTED ACYCLIC GRAPH

Map an algorithm-by-blocks to a directed acyclic graph by viewing tasks as the nodes and data dependencies between tasks as the edges in the graph. For example, TRSM 1 reads A0,0 after CHOL 0 overwrites it; this situation leads to a flow dependency (read-after-write) between these two tasks.

[Figure: Directed acyclic graph for Cholesky factorization given a 4×4 matrix of blocks.]

SUPERMATRIX

SuperMatrix [1, 2, 3] uses separation of concerns to divide the code that implements a linear algebra algorithm from the runtime system that exploits the parallelism of an algorithm-by-blocks mapped to a directed acyclic graph. The runtime consists of two phases.

ANALYZER
Delay the execution of operations and instead store all tasks sequentially in a global task queue. Internally calculate all data dependencies between tasks using only the input and output parameters of each task. Implicitly construct a directed acyclic graph (DAG) from the tasks and their data dependencies.

DISPATCHER
Once the analyzer completes, the dispatcher is invoked to schedule and dispatch tasks to threads in parallel (a C sketch of this loop is given at the end of this document):

foreach task in DAG do
  if task is ready then
    Enqueue task
  end
end

while tasks are available do
  Dequeue task
  Execute task
  foreach dependent task do
    Update dependent task
    if dependent task is ready then
      Enqueue dependent task
    end
  end
end

QUEUEING THEORY

[Figure: Multi-queue multi-server system (left) and single-queue multi-server system (right); processing elements PE 1 through PE p perform Enqueue and Dequeue operations.]

A single-queue multi-server system attains better load balance than a multi-queue multi-server system. Matrix computations exhibit coarse data granularity, so the cost of performing the Enqueue and Dequeue routines is amortized over the cost of executing the tasks.

We developed the cache affinity scheduling algorithm, which uses a single priority queue to address load balance and a software cache with each thread to address data locality, handling both concerns simultaneously (a sketch of this heuristic also appears at the end of this document).

PERFORMANCE

[Figure: Performance of Cholesky factorization using several high-performance implementations (left) and finding the best block size for each problem size using SuperMatrix with cache affinity (right).]

CONCLUSION

Separation of concerns facilitates programmability and allows us to experiment with different scheduling algorithms and heuristics.

Data locality is as important as load balance for scheduling matrix computations due to the coarse data granularity of this problem domain.

REFERENCES

[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, USA, June 2007.

[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, February 2008.

[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.

ACKNOWLEDGMENTS

This research was partially funded by two National Science Foundation (NSF) CCF grants and by Intel and Microsoft Corporations. We would like to thank the rest of the FLAME team for their support, namely Robert van de Geijn and Field Van Zee.
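As a concrete illustration of the dispatcher loop from the SUPERMATRIX section, here is a minimal sketch in C using POSIX threads. The task structure, its dependents array, and the single mutex-protected queue are hypothetical simplifications, not the SuperMatrix implementation. Before the workers start, every task with no unsatisfied dependencies is enqueued and tasks_left is set to the total task count, mirroring the initialization loop in the pseudocode above.

#include <pthread.h>
#include <stddef.h>

typedef struct task {
    void (*execute)( struct task * );  /* runs CHOL, TRSM, SYRK, or GEMM */
    int            n_deps;             /* unsatisfied incoming dependencies */
    int            n_dependents;       /* outgoing edges in the DAG */
    struct task  **dependents;
    struct task   *next;               /* queue link */
} task;

static task           *queue_head = NULL;
static int             tasks_left = 0;   /* set to total task count at start */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Push at the head for brevity; a FIFO or priority queue would serve the
   same role.  Caller must hold the lock. */
static void enqueue( task *t )
{
    t->next = queue_head;
    queue_head = t;
}

static task *dequeue( void )             /* caller must hold the lock */
{
    task *t = queue_head;
    if ( t != NULL ) queue_head = t->next;
    return t;
}

/* Each processing element runs this loop until every task has executed. */
void *dispatcher( void *arg )
{
    for ( ;; )
    {
        pthread_mutex_lock( &lock );
        if ( tasks_left == 0 )           /* all tasks executed: done */
        {
            pthread_mutex_unlock( &lock );
            break;
        }
        task *t = dequeue();
        pthread_mutex_unlock( &lock );

        if ( t == NULL ) continue;       /* no ready task yet; retry */

        t->execute( t );                 /* Execute task */

        pthread_mutex_lock( &lock );
        tasks_left--;
        for ( int i = 0; i < t->n_dependents; i++ )
        {
            task *d = t->dependents[i];  /* Update dependent task */
            if ( --d->n_deps == 0 )
                enqueue( d );            /* dependent task is ready */
        }
        pthread_mutex_unlock( &lock );
    }
    return arg;
}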
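The cache affinity heuristic can be sketched in the same spirit, again as a hypothetical simplification rather than the SuperMatrix implementation. Each thread keeps a small software cache recording the blocks it recently wrote; when dequeuing from the shared queue, it prefers a ready task whose output block hits that software cache and otherwise falls back to the head of the queue. The task type is reused from the previous sketch, and output_block() is an assumed accessor for the block a task overwrites.

#define CACHE_SLOTS 8                    /* blocks tracked per thread */

typedef struct {
    const void *block[CACHE_SLOTS];      /* recently written blocks */
    int         next;                    /* round-robin replacement */
} soft_cache;

/* Record that this thread just wrote a block. */
static void cache_touch( soft_cache *c, const void *block )
{
    c->block[c->next] = block;
    c->next = ( c->next + 1 ) % CACHE_SLOTS;
}

static int cache_hit( const soft_cache *c, const void *block )
{
    for ( int i = 0; i < CACHE_SLOTS; i++ )
        if ( c->block[i] == block )
            return 1;
    return 0;
}

/* Scan the shared queue (caller holds its lock) and prefer the first ready
   task whose output block is in this thread's software cache; the caller
   unlinks the chosen task from the queue. */
task *dequeue_with_affinity( task *head, const soft_cache *c,
                             const void *(*output_block)( const task * ) )
{
    for ( task *t = head; t != NULL; t = t->next )
        if ( cache_hit( c, output_block( t ) ) )
            return t;
    return head;                         /* no hit: plain dequeue */
}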