Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Intel talk, March 22, 2010
Motivation
"The Free Lunch Is Over" – Herb Sutter: parallelize or perish.
Popular libraries such as the Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten to exploit parallelism:
FORTRAN 77
Column-major order matrix storage
187+ operations for each datatype
One routine (algorithm) per operation
Teaser: better theoretical peak performance.
Goals
Programmability: use the tools provided by FLAME.
Parallelism: directed acyclic graph (DAG) scheduling.
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
SuperMatrix
Formal Linear Algebra Methods Environment (FLAME): high-level abstractions for expressing linear algebra algorithms.
Running example: Cholesky factorization, A → L L^T.
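For readers less familiar with the blocked algorithm that the FLAME loop below implements, here is the standard derivation of the per-iteration updates (general linear algebra, not specific to this talk):

$$
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\;\Rightarrow\;
A_{11} = L_{11} L_{11}^T,\quad
L_{21} = A_{21} L_{11}^{-T},\quad
A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T .
$$

Each iteration therefore factors the diagonal block (CHOL), solves a triangular system against the panel below it (TRSM), applies a symmetric rank-k update to the trailing matrix (SYRK), and then continues with the trailing submatrix A22.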
SuperMatrix: blocked FLAME/C implementation of the Cholesky factorization.

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                      /* ************* */ /* ******************** */
                                           &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                         /* *************** */ /* ***************** */
                            &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix: Cholesky factorization, blocked algorithm by iteration (figure).
Iteration 1: CHOL: Chol(A11); TRSM: A21 := A21 A11^{-T}; SYRK: A22 := A22 - A21 A21^T.
Iteration 2: the same CHOL, TRSM, and SYRK updates applied to the repartitioned trailing submatrix.
SuperMatrix: LAPACK-style implementation.

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
SuperMatrix
FLASH: storage-by-blocks and algorithm-by-blocks.
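As an illustration of storage-by-blocks, here is a minimal conceptual sketch in C; the types and indexing are hypothetical and are not the actual FLASH hierarchical objects.

/* Conceptual sketch of storage-by-blocks (hypothetical types).  An
   n x n matrix is viewed as an N x N grid of pointers to contiguously
   stored b x b blocks, so tasks read and write whole blocks. */
typedef struct {
    int      b;       /* block (tile) size                    */
    int      N;       /* number of blocks per dimension       */
    double **blocks;  /* N*N block pointers, column-major grid */
} blocked_matrix;

/* Element (i, j) lives in block (i/b, j/b) at offset (i%b, j%b). */
static double *elem( blocked_matrix *A, int i, int j )
{
    double *blk = A->blocks[ (j / A->b) * A->N + (i / A->b) ];
    return &blk[ (j % A->b) * A->b + (i % A->b) ];
}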
SuperMatrix: algorithm-by-blocks implementation with FLASH; the matrix is a matrix of blocks, so the repartitioning advances one block at a time.

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                      /* ************* */ /* ******************** */
                                           &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                         /* *************** */ /* ***************** */
                            &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix: Cholesky factorization DAG, built one iteration at a time for a 3×3 matrix of blocks (figures).
Iteration 1: CHOL0: Chol(A0,0); TRSM1: A1,0 := A1,0 A0,0^{-T}; TRSM2: A2,0 := A2,0 A0,0^{-T}; SYRK3: A1,1 := A1,1 - A1,0 A1,0^T; GEMM4: A2,1 := A2,1 - A2,0 A1,0^T; SYRK5: A2,2 := A2,2 - A2,0 A2,0^T.
Iteration 2: CHOL6: Chol(A1,1); TRSM7: A2,1 := A2,1 A1,1^{-T}; SYRK8: A2,2 := A2,2 - A2,1 A2,1^T.
Iteration 3: CHOL9: Chol(A2,2).
SuperMatrix: separation of concerns.
Analyzer: decomposes subproblems into component tasks, stores the tasks sequentially in a global task queue, and internally calculates all dependencies between tasks, using only the input and output parameters of each task; the dependencies form a DAG.
Dispatcher: spawns threads, then schedules and dispatches tasks to threads in parallel.
SuperMatrix: the analyzer.
Detects flow (read-after-write), anti (write-after-read), and output (write-after-write) dependencies.
Embeds pointers into the hierarchical matrices; the block size manifests as the size of the contiguously stored blocks.
The analysis can be performed statically. A sketch of such a dependence analysis follows.
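A minimal sketch, in C, of how an analyzer can derive the DAG purely from each task's input and output blocks; the task structure and limits here are hypothetical placeholders, not the SuperMatrix implementation. Tasks are appended in program order, and every read-after-write (flow), write-after-read (anti), and write-after-write (output) overlap with an earlier task becomes an edge.

#include <stdbool.h>

#define MAX_ARGS 4
#define MAX_DEPS 64

typedef struct task {
    const void  *in[MAX_ARGS];   int n_in;    /* blocks read       */
    const void  *out[MAX_ARGS];  int n_out;   /* blocks written    */
    struct task *dep[MAX_DEPS];  int n_dep;   /* predecessor tasks */
} task_t;

static bool touches( const void *set[], int n, const void *blk )
{
    for ( int i = 0; i < n; i++ )
        if ( set[i] == blk ) return true;
    return false;
}

/* Compare a newly appended task t against all earlier tasks.  Duplicate
   edges are harmless for the purposes of this sketch. */
static void add_edges( task_t *earlier[], int n_earlier, task_t *t )
{
    for ( int k = 0; k < n_earlier; k++ ) {
        task_t *p = earlier[k];

        for ( int i = 0; i < t->n_in; i++ )    /* flow: p writes, t reads */
            if ( touches( p->out, p->n_out, t->in[i] ) )
                t->dep[t->n_dep++] = p;

        for ( int i = 0; i < t->n_out; i++ )   /* anti and output deps    */
            if ( touches( p->in,  p->n_in,  t->out[i] ) ||
                 touches( p->out, p->n_out, t->out[i] ) )
                t->dep[t->n_dep++] = p;
    }
}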
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Scheduling: the dispatcher loop.

Enqueue ready tasks
while tasks are available do
    Dequeue task
    Execute task
    foreach dependent task do
        Update dependent task
        if dependent task is ready then
            Enqueue dependent task
    end
end
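The same loop in C-like form, as a hedged sketch: the queue primitives and task structure are assumed placeholders rather than the actual SuperMatrix runtime. Every thread runs this loop; a dependent task is enqueued by whichever thread clears its last outstanding dependency.

#include <stdatomic.h>
#include <stddef.h>

typedef struct task {
    void        (*run)( struct task * );   /* wraps the BLAS-like kernel  */
    struct task  *succ[8];                 /* dependent (successor) tasks */
    int           n_succ;
    atomic_int    deps_left;               /* unsatisfied dependencies    */
} task_t;

/* Shared ready queue; these helpers are assumed thread-safe elsewhere. */
extern void    enqueue_ready( task_t *t );
extern task_t *dequeue_ready( void );      /* NULL if nothing is ready    */
extern int     tasks_remaining( void );

void dispatcher( void )                    /* executed by every thread    */
{
    while ( tasks_remaining() ) {
        task_t *t = dequeue_ready();
        if ( t == NULL ) continue;         /* nothing ready yet; retry    */

        t->run( t );                       /* execute the task            */

        /* Update dependents; the last cleared dependency enqueues.       */
        for ( int i = 0; i < t->n_succ; i++ )
            if ( atomic_fetch_sub( &t->succ[i]->deps_left, 1 ) == 1 )
                enqueue_ready( t->succ[i] );
    }
}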
Scheduling analogies.
Supermarket: p lines, one per each of p cashiers (per-thread queues). Enqueue and dequeue are efficient, but the schedule depends on the task-to-thread assignment.
Bank: one line for p tellers (a single shared queue). Enqueue and dequeue can become bottlenecks, but tasks are dispatched to threads dynamically.
Scheduling with a single queue (figure): the set of all ready and available tasks is held in one queue (FIFO or priority ordered), and all processing elements PE0 … PEp-1 enqueue to and dequeue from it.
Scheduling with multiple queues (figure): one queue per processing element, enabling work stealing and data affinity.
Scheduling: work stealing.

Enqueue ready tasks
while tasks are available do
    Dequeue task
    if task ≠ Ø then
        Execute task
        Update dependent tasks …
    else
        Steal task
end

Enqueue: place all dependent tasks on the queue of the same thread that executed the task.
Steal: select a random thread and remove a task from the tail of its queue.
Scheduling: work stealing with a mailbox.
Each thread has an associated mailbox. When a task is enqueued onto a queue, it is also placed in a mailbox; tasks can be assigned to mailboxes using a 2D distribution.
Before attempting a steal, a thread first checks its mailbox, optimizing for data locality instead of stealing at random. The mailbox is only checked when a steal would otherwise occur. A sketch of this task-selection path follows.
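A sketch of the resulting task-selection path in C; the deque, mailbox, and task helpers are illustrative assumptions, not the actual runtime. The thread's own queue is tried first, then its mailbox, and only then the tail of a randomly chosen victim's queue.

#include <stdlib.h>
#include <stddef.h>

typedef struct task task_t;                 /* opaque for this sketch     */

/* Assumed thread-safe helpers (hypothetical, not a real API). */
extern task_t *deque_pop_head( int thread );    /* own queue, head end    */
extern task_t *deque_steal_tail( int thread );  /* victim queue, tail end */
extern task_t *mailbox_take( int thread );      /* tasks assigned by the  */
                                                /* 2D distribution        */

task_t *next_task( int me, int nthreads )
{
    task_t *t = deque_pop_head( me );       /* 1. own queue               */
    if ( t != NULL ) return t;

    t = mailbox_take( me );                 /* 2. mailbox: locality first */
    if ( t != NULL ) return t;

    int victim = rand() % nthreads;         /* 3. steal from a random     */
    if ( victim == me ) return NULL;        /*    victim's tail           */
    return deque_steal_tail( victim );      /* may be NULL; caller retries */
}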
Scheduling: data affinity.
Assign all tasks that write to a particular block to the same thread (owner-computes rule), using a 2D block cyclic distribution.
Execution trace, Cholesky factorization of a 4000×4000 matrix: total time with 2D data affinity is roughly the same as with a FIFO queue, but threads sit idle ≈ 27% of the time under 2D data affinity versus ≈ 17% under the FIFO queue. A small sketch of the owner computation follows.
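Under the owner-computes rule with a 2D block cyclic distribution, the owning thread of a block follows directly from its block indices; a minimal sketch in C (the r × c thread-grid shape is an assumed parameter, not taken from the talk):

/* Block (i, j) of the matrix of blocks is always updated by the same
   thread: wrap the block indices onto an r x c grid of threads. */
static int owner( int i, int j, int r, int c )
{
    return (i % r) * c + (j % c);
}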
Scheduling: data granularity and queue choice.
The cost of executing a task far exceeds the cost of an enqueue or dequeue.
Single vs. multiple queues: a FIFO queue improves load balance, while 2D data affinity reduces data communication. The goal is to combine the best aspects of both.
Scheduling: cache affinity (figure: PE0 … PEp-1 share a single ready queue, and each PE has its own software cache).
A single priority queue sorted by task height.
Per-thread software cache: LRU replacement, fully associative, one cache line per block.
Scheduling: cache affinity.
Enqueue: insert the task and keep the queue sorted by task heights.
Dequeue: search the queue for a task whose output block is in this thread's software cache; if found, return that task, otherwise return the head task.
Dispatcher: update the software caches via a cache coherency protocol with write invalidation.
A sketch of this dequeue policy follows.
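A hedged sketch of that dequeue policy in C; the queue and software-cache helpers are hypothetical placeholders.

#include <stddef.h>

typedef struct task  task_t;
typedef struct queue queue_t;    /* priority queue sorted by task height */
typedef struct cache cache_t;    /* LRU software cache whose lines are blocks */

extern task_t *queue_head( queue_t *q );
extern task_t *queue_next( queue_t *q, task_t *t );
extern task_t *queue_remove( queue_t *q, task_t *t );
extern void   *task_output_block( task_t *t );
extern int     cache_contains( cache_t *c, void *block );

task_t *dequeue_cache_affinity( queue_t *q, cache_t *my_cache )
{
    if ( queue_head( q ) == NULL ) return NULL;        /* nothing ready  */

    /* Prefer a task whose output block this thread already caches.      */
    for ( task_t *t = queue_head( q ); t != NULL; t = queue_next( q, t ) )
        if ( cache_contains( my_cache, task_output_block( t ) ) )
            return queue_remove( q, t );

    return queue_remove( q, queue_head( q ) );          /* else: head task */
}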
Scheduling: optimizations.
Prefetching: with N the number of cache lines (blocks), touch the first N blocks accessed by the DAG to preload the caches before execution starts; a small sketch appears below.
Thread preference: allow the thread that enqueues a task to dequeue it before other threads have the opportunity, limiting the variability of blocks migrating between threads.
(Figure: processing elements with private caches kept coherent over shared memory.)
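A small sketch of the prefetching pass, with the same kind of illustrative helpers as above (assumed names, not a real API): before any thread starts executing, walk the tasks in program order and preload the first N distinct blocks they reference.

typedef struct task  task_t;
typedef struct cache cache_t;

extern int   task_num_args( task_t *t );
extern void *task_arg_block( task_t *t, int a );
extern int   cache_contains( cache_t *c, void *block );
extern void  cache_insert( cache_t *c, void *block );  /* touches/loads block */

/* N = number of software cache lines (blocks). */
void prefetch_dag( task_t *tasks[], int n_tasks, cache_t *cache, int N )
{
    int loaded = 0;
    for ( int k = 0; k < n_tasks && loaded < N; k++ )
        for ( int a = 0; a < task_num_args( tasks[k] ) && loaded < N; a++ ) {
            void *blk = task_arg_block( tasks[k], a );
            if ( !cache_contains( cache, blk ) ) {
                cache_insert( cache, blk );             /* preload        */
                loaded++;
            }
        }
}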
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Performance: target architecture.
Four-socket 2.66 GHz Intel Dunnington, 24 cores, 16 MB shared L3 cache per socket, running Linux and Windows.
OpenMP: Intel compiler 11.1. BLAS: Intel MKL 10.2.
Performance: implementations compared.
SuperMatrix + serial MKL (with both the FIFO-queue and the cache-affinity schedulers), FLAME + multithreaded MKL, multithreaded MKL alone, and PLASMA + serial MKL.
All use double-precision real floating-point arithmetic with a tuned block size.
Performance: Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA), from the Innovative Computing Laboratory, University of Tennessee.
Creates persistent POSIX threads and uses static pipelining: all threads execute the sequential algorithm by tiles; if a task is ready it is executed, otherwise the thread stalls. The DAG is not explicitly constructed.
The matrix is copied from column-major order storage to a block data layout and back to column-major order.
Does not address programmability.
Performance (results plots; figures not reproduced).
Performance: inversion of a symmetric positive definite matrix, in three sweeps.
CHOL: Cholesky factorization, A → L L^T.
TRINV: inversion of a triangular matrix, R := L^{-1}.
TTMM: triangular matrix multiplication by its transpose, A^{-1} := R^T R.
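The three sweeps compose into the inverse through a short standard identity (general linear algebra, not specific to this implementation):

$$
A = L L^T
\;\Rightarrow\;
A^{-1} = (L L^T)^{-1} = L^{-T} L^{-1} = R^T R,
\qquad R := L^{-1} ,
$$

so CHOL produces L, TRINV produces R, and TTMM forms A^{-1} = R^T R. Since each sweep is itself an algorithm-by-blocks, its tasks can be fed into the same task queue and DAG that SuperMatrix already builds.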
Performance (results plots; figures not reproduced).
Performance: LU factorization.
LU factorization with partial pivoting, P A = L U, is numerically stable in practice.
LU factorization with incremental pivoting maps well to algorithm-by-blocks and has only slightly worse numerical behavior than partial pivoting.
(Figure: pivot row and column nb+i, with the corresponding row swaps applied to L and U.)
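For context, a hedged sketch of how incremental pivoting differs from partial pivoting; this is the standard formulation of incremental (tile) pivoting, stated as background rather than taken from the slide, and the block names below are my notation. Partial pivoting chooses each pivot from an entire column of the matrix. Incremental pivoting first factors the diagonal block, then couples the resulting upper triangular factor with one subdiagonal block at a time,

$$
P_i \begin{pmatrix} U_{11} \\ A_{i1} \end{pmatrix}
= \begin{pmatrix} L^{(i)}_{11} \\ L_{i1} \end{pmatrix} \tilde U_{11} ,
$$

applying the corresponding transformations only to the two affected block rows. Every task therefore touches a bounded number of blocks, which is what lets the factorization be expressed as an algorithm-by-blocks, at the cost of a somewhat weaker pivoting strategy.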
Performance (results plots; figures not reproduced).
Performance: summary of results.
Cache affinity vs. the FIFO queue; SuperMatrix out-of-order execution vs. PLASMA in-order execution; the high variability of work stealing vs. the predictable performance of cache affinity.
These results are representative of the performance of other dense linear algebra operations.
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Conclusion
Separation of concerns allows us to experiment with different scheduling algorithms.
Locality, locality, locality: data communication is as important as load balance when scheduling matrix computations.
Conclusion: future work.
Intel Single-chip Cloud Computer: a master-slave approach with software-managed cache coherency, using the RCCE API (RCCE_send, RCCE_recv, RCCE_shmalloc).
Acknowledgments
Robert van de Geijn and Field Van Zee; I thank the other members of the FLAME team for their support.
Funding: Intel, Microsoft, and NSF grants CCF– and CCF–.
Conclusion
More information. Questions?