RUNTIME DATA FLOW SCHEDULING OF MATRIX COMPUTATIONS
Ernie Chan

ABSTRACT

We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the coarse data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. We have developed a scheduling algorithm that addresses both load balance and data locality simultaneously, and we show its performance benefits.

CHOLESKY FACTORIZATION

A → L L^T, where A is a symmetric positive definite matrix and L is a lower triangular matrix.

[Figure: Blocked right-looking algorithm (left) and implementation (right) for Cholesky factorization. The implementation is reproduced below.]

FLA_Error FLASH_Chol_l_var3( FLA_Obj A )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*--------------------------------------------------------------*/

    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );

    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );

    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );

    /*--------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}

ALGORITHM-BY-BLOCKS

Reformulate a blocked algorithm into an algorithm-by-blocks by decomposing its sub-operations into component operations on blocks (tasks). The algorithm-by-blocks for Cholesky factorization given a 4×4 matrix of blocks generates the following tasks:

ITERATION 1
  CHOL 0:  A0,0 := CHOL( A0,0 )
  TRSM 1:  A1,0 := A1,0 A0,0^-T
  TRSM 2:  A2,0 := A2,0 A0,0^-T
  TRSM 3:  A3,0 := A3,0 A0,0^-T
  SYRK 4:  A1,1 := A1,1 - A1,0 A1,0^T
  GEMM 5:  A2,1 := A2,1 - A2,0 A1,0^T
  GEMM 6:  A3,1 := A3,1 - A3,0 A1,0^T
  SYRK 7:  A2,2 := A2,2 - A2,0 A2,0^T
  GEMM 8:  A3,2 := A3,2 - A3,0 A2,0^T
  SYRK 9:  A3,3 := A3,3 - A3,0 A3,0^T

ITERATION 2
  CHOL 10: A1,1 := CHOL( A1,1 )
  TRSM 11: A2,1 := A2,1 A1,1^-T
  TRSM 12: A3,1 := A3,1 A1,1^-T
  SYRK 13: A2,2 := A2,2 - A2,1 A2,1^T
  GEMM 14: A3,2 := A3,2 - A3,1 A2,1^T
  SYRK 15: A3,3 := A3,3 - A3,1 A3,1^T

ITERATION 3
  CHOL 16: A2,2 := CHOL( A2,2 )
  TRSM 17: A3,2 := A3,2 A2,2^-T
  SYRK 18: A3,3 := A3,3 - A3,2 A3,2^T

ITERATION 4
  CHOL 19: A3,3 := CHOL( A3,3 )
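These tasks can be generated mechanically with three nested loops. The following is a minimal, self-contained C sketch, not the FLAME/SuperMatrix API; the enqueue() helper is a hypothetical stand-in for appending a task to a runtime's global task queue. For n = 4 it prints the twenty tasks in exactly the order tabulated above.

#include <stdio.h>

static int task_id = 0;

/* Hypothetical stand-in: a real runtime would record the task and its
   operand blocks instead of printing a description. */
static void enqueue( const char *desc )
{
    printf( "task %2d: %s\n", task_id++, desc );
}

/* Unroll right-looking Cholesky on an n x n matrix of blocks into tasks. */
void chol_by_blocks( int n )
{
    char buf[128];
    for ( int k = 0; k < n; k++ )
    {
        /* Factor the diagonal block. */
        snprintf( buf, sizeof buf, "CHOL  A%d,%d := CHOL( A%d,%d )", k, k, k, k );
        enqueue( buf );

        /* Triangular solves against the factored diagonal block. */
        for ( int i = k + 1; i < n; i++ )
        {
            snprintf( buf, sizeof buf, "TRSM  A%d,%d := A%d,%d A%d,%d^-T",
                      i, k, i, k, k, k );
            enqueue( buf );
        }

        /* Symmetric rank-k and general updates of the trailing matrix. */
        for ( int j = k + 1; j < n; j++ )
        {
            snprintf( buf, sizeof buf, "SYRK  A%d,%d := A%d,%d - A%d,%d A%d,%d^T",
                      j, j, j, j, j, k, j, k );
            enqueue( buf );
            for ( int i = j + 1; i < n; i++ )
            {
                snprintf( buf, sizeof buf, "GEMM  A%d,%d := A%d,%d - A%d,%d A%d,%d^T",
                          i, j, i, j, i, k, j, k );
                enqueue( buf );
            }
        }
    }
}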
DIRECTED ACYCLIC GRAPH

Map an algorithm-by-blocks to a directed acyclic graph by viewing tasks as the nodes and data dependencies between tasks as the edges in the graph. For example, TRSM 1 reads A0,0 after CHOL 0 overwrites it; this situation leads to a flow dependency (read-after-write) between these two tasks.

[Figure: Directed acyclic graph for Cholesky factorization given a 4×4 matrix of blocks.]

SUPERMATRIX

SuperMatrix [1, 2, 3] uses separation of concerns to divide the code that implements a linear algebra algorithm from the runtime system that exploits the parallelism of an algorithm-by-blocks mapped to a directed acyclic graph. The runtime consists of two phases.

ANALYZER
Delay the execution of operations and instead store all tasks sequentially in a global task queue. Internally calculate all data dependencies between tasks using only the input and output parameters of each task. Implicitly construct a directed acyclic graph (DAG) from the tasks and their data dependencies.

DISPATCHER
Once the analyzer completes, the dispatcher is invoked to schedule and dispatch tasks to threads in parallel (a C sketch of this loop is given at the end of this document):

foreach task in DAG do
  if task is ready then
    Enqueue task
  end
end

while tasks are available do
  Dequeue task
  Execute task
  foreach dependent task do
    Update dependent task
    if dependent task is ready then
      Enqueue dependent task
    end
  end
end

QUEUEING THEORY

[Figure: Multi-queue multi-server system (left) and single-queue multi-server system (right); processing elements PE 1 through PE p perform Enqueue and Dequeue operations.]

A single-queue multi-server system attains better load balance than a multi-queue multi-server system. Matrix computations exhibit coarse data granularity, so the cost of performing the Enqueue and Dequeue routines is amortized over the cost of executing the tasks.

We developed the cache affinity scheduling algorithm, which uses a single priority queue to address load balance and a software cache with each thread to address data locality, handling both concerns simultaneously (a sketch of this heuristic also appears at the end of this document).

PERFORMANCE

[Figure: Performance of Cholesky factorization using several high-performance implementations (left) and finding the best block size for each problem size using SuperMatrix with cache affinity (right).]

CONCLUSION

Separation of concerns facilitates programmability and allows us to experiment with different scheduling algorithms and heuristics.

Data locality is as important as load balance for scheduling matrix computations due to the coarse data granularity of this problem domain.

REFERENCES

[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, USA, June 2007.

[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, February 2008.

[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.

ACKNOWLEDGMENTS

This research was partially funded by two National Science Foundation (NSF) CCF grants and by Intel and Microsoft Corporations. We would like to thank the rest of the FLAME team for their support, namely Robert van de Geijn and Field Van Zee.
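As a concrete illustration of the dispatcher loop from the SUPERMATRIX section, here is a minimal sketch in C using POSIX threads. The task structure, its dependents array, and the single mutex-protected queue are hypothetical simplifications, not the SuperMatrix implementation. Before the workers start, every task with no unsatisfied dependencies is enqueued and tasks_left is set to the total task count, mirroring the initialization loop in the pseudocode above.

#include <pthread.h>
#include <stddef.h>

typedef struct task {
    void (*execute)( struct task * );  /* runs CHOL, TRSM, SYRK, or GEMM */
    int            n_deps;             /* unsatisfied incoming dependencies */
    int            n_dependents;       /* outgoing edges in the DAG */
    struct task  **dependents;
    struct task   *next;               /* queue link */
} task;

static task           *queue_head = NULL;
static int             tasks_left = 0;   /* set to total task count at start */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Push at the head for brevity; a FIFO or priority queue would serve the
   same role.  Caller must hold the lock. */
static void enqueue( task *t )
{
    t->next = queue_head;
    queue_head = t;
}

static task *dequeue( void )             /* caller must hold the lock */
{
    task *t = queue_head;
    if ( t != NULL ) queue_head = t->next;
    return t;
}

/* Each processing element runs this loop until every task has executed. */
void *dispatcher( void *arg )
{
    for ( ;; )
    {
        pthread_mutex_lock( &lock );
        if ( tasks_left == 0 )           /* all tasks executed: done */
        {
            pthread_mutex_unlock( &lock );
            break;
        }
        task *t = dequeue();
        pthread_mutex_unlock( &lock );

        if ( t == NULL ) continue;       /* no ready task yet; retry */

        t->execute( t );                 /* Execute task */

        pthread_mutex_lock( &lock );
        tasks_left--;
        for ( int i = 0; i < t->n_dependents; i++ )
        {
            task *d = t->dependents[i];  /* Update dependent task */
            if ( --d->n_deps == 0 )
                enqueue( d );            /* dependent task is ready */
        }
        pthread_mutex_unlock( &lock );
    }
    return arg;
}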
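The cache affinity heuristic can be sketched in the same spirit, again as a hypothetical simplification rather than the SuperMatrix implementation. Each thread keeps a small software cache recording the blocks it recently wrote; when dequeuing from the shared queue, it prefers a ready task whose output block hits that software cache and otherwise falls back to the head of the queue. The task type is reused from the previous sketch, and output_block() is an assumed accessor for the block a task overwrites.

#define CACHE_SLOTS 8                    /* blocks tracked per thread */

typedef struct {
    const void *block[CACHE_SLOTS];      /* recently written blocks */
    int         next;                    /* round-robin replacement */
} soft_cache;

/* Record that this thread just wrote a block. */
static void cache_touch( soft_cache *c, const void *block )
{
    c->block[c->next] = block;
    c->next = ( c->next + 1 ) % CACHE_SLOTS;
}

static int cache_hit( const soft_cache *c, const void *block )
{
    for ( int i = 0; i < CACHE_SLOTS; i++ )
        if ( c->block[i] == block )
            return 1;
    return 0;
}

/* Scan the shared queue (caller holds its lock) and prefer the first ready
   task whose output block is in this thread's software cache; the caller
   unlinks the chosen task from the queue. */
task *dequeue_with_affinity( task *head, const soft_cache *c,
                             const void *(*output_block)( const task * ) )
{
    for ( task *t = head; t != NULL; t = t->next )
        if ( cache_hit( c, output_block( t ) ) )
            return t;
    return head;                         /* no hit: plain dequeue */
}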