Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model
Jakub Kurzak, Jack Dongarra (University of Tennessee)
John Shalf (Lawrence Berkeley National Laboratory)
SIAM Parallel Processing for Scientific Computing
MS18: Dataflow 2.0: The Re-emergence and Evolution of Dataflow Programming Techniques for Parallel Computing
March 13, 2008

New Class of Algorithms
• LU, QR, Cholesky (one-sided factorizations)
• Inspired by out-of-memory algorithms
• Processing of the input by small tiles
• Block data layout
• Fine granularity
• Suitable for dynamic scheduling
• LAPACK Working Notes 184, 190, 191
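
For reference, a minimal sketch of the sequential tile Cholesky (lower triangular, right-looking) that these algorithms reorganize: the whole computation is expressed in terms of NB x NB tiles in block data layout, which is what enables the fine granularity and dynamic scheduling. The TILE layout macro, the tile count NT, the LAPACKE/CBLAS entry points, and the sign conventions are assumptions of this sketch, not part of the original slides.

#include <stddef.h>
#include <cblas.h>
#include <lapacke.h>

/* Assumed block (tile) data layout: NT x NT tiles of NB x NB elements each;
   tile (m, n) starts at A + (m + n * NT) * NB * NB. */
#define TILE(A, m, n) ((A) + ((size_t)(m) + (size_t)(n) * NT) * NB * NB)

void tile_spotrf_lower(float *A, int NT, int NB)
{
    for (int k = 0; k < NT; k++) {
        /* Factor the diagonal tile. */
        LAPACKE_spotrf(LAPACK_COL_MAJOR, 'L', NB, TILE(A, k, k), NB);
        /* Triangular solves against the tiles below the diagonal. */
        for (int m = k + 1; m < NT; m++)
            cblas_strsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, NB, NB, 1.0f, TILE(A, k, k), NB,
                        TILE(A, m, k), NB);
        /* Update the trailing submatrix, one NB x NB tile at a time. */
        for (int m = k + 1; m < NT; m++) {
            cblas_ssyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                        -1.0f, TILE(A, m, k), NB, 1.0f, TILE(A, m, m), NB);
            for (int n = k + 1; n < m; n++)
                cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0f, TILE(A, m, k), NB,
                            TILE(A, n, k), NB, 1.0f, TILE(A, m, n), NB);
        }
    }
}

Each call in this loop nest becomes one task of the dataflow formulation shown later in the slides.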

New Class of Algorithms
• Classic multi-core (x86)
  • Faster than MKL
  • Faster than ACML
  • Way faster than LAPACK
• Exotic multi-core (CELL)
  • Extremely fast
  • No alternatives

New Algorithms – LU on x86

New Algorithms – QR on x86

New Algorithms – Cholesky on x86

New Algorithms – Cholesky on the CELL

Going “Out of Sync”
• CELL Cholesky – 8 cores
• CELL Cholesky – 16 cores

PLASMA Framework
• Don't reinvent the wheel
  • SCHEDULE
  • HeNCE
  • Parametrized DAGs (Cosnard, Jeannot)
  • Dataflow systems (Simulink, Gedae)
  • Systolic arrays

PLASMA Goals (WHAT)
• Scalability
  • Massive parallelism
• Performance
  • Efficient runtime
• Portability
  • Cluster / MPP
  • SMP / CMP
  • Out-of-memory / CELL

PLASMA Principles (HOW)
• Data-driven (dynamic) execution
• Explicit expression of parallelism
• Implicit communication
• Component architecture
  • “Language”
  • Runtime
  • Tools
    • Dataflow visualization
    • DAG visualization
    • Profiling
    • Tracing
    • …
There is no such thing as autoparallelization.
BTW, there is no such thing as auto-SIMD'ization.
(see LAPACK Working Note 189)

Current Infrastructure
[diagram: the ScaLAPACK software stack, with ScaLAPACK layered on PBLAS, LAPACK, BLACS, BLAS, and MPI]

Target Infrastructure
[diagram: PLASMA “Language” and PLASMA Runtime layered on μ-kernels (“μBLAS”) and RDMA-like communication (“μMPI”), with TRACE, VIS, and PROF tools alongside]

DAG / Task Graph
• View the computation as a DAG
• Use dependency / data-driven scheduling
• Pursue the critical path

DAG / Task Graph
• DAGs can be huge
  • Difficult / impossible to store
  • Difficult / impossible to manage
• I don't need to build the DAG
• I need the ability to traverse the DAG

DAG / Task Graph
• Producer task
  • knows where its output goes
• Consumer task
  • knows where its input comes from
• Nobody knows the DAG
  • Parents know their children
  • Children know their parents
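
A minimal sketch of what this local-knowledge view of the DAG could look like inside a runtime: each task instance stores only its own coordinates, a count of unsatisfied dependencies, and links to its children, so the full graph is never materialized. All type and field names are assumptions for illustration, not the PLASMA API; in a real multithreaded runtime the counter decrement would have to be atomic.

#include <stddef.h>

typedef struct task {
    int            coord[3];       /* task coordinates, e.g. (k, m, n)       */
    int            deps_missing;   /* dependencies not yet satisfied         */
    int            data_missing;   /* input tiles not yet delivered          */
    struct task  **children;       /* consumers of this task's output        */
    size_t         nchildren;
    void         (*body)(struct task *);  /* kernel body, e.g. a BLAS call   */
} task_t;

/* A finished task notifies only its children; a child whose last dependency
   was just satisfied can start requesting its input data. */
static void task_completed(task_t *t, void (*on_ready)(task_t *))
{
    for (size_t i = 0; i < t->nchildren; i++) {
        task_t *child = t->children[i];
        if (--child->deps_missing == 0)
            on_ready(child);
    }
}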

Task ID / Task Coordinates
[diagram: a small task graph with P and T tasks, each identified by its coordinates, e.g. P(0) through P(3) and T(0,0), T(0,1), T(0,2), T(1,0), T(1,1), T(2,0)]

Task Definition
[diagram: the same P / T task graph]

// Task Name
P
// Execution Space
i = 0 .. N
// Parallel Partitioning
i % Ncores == coreID
// Parameters
input  <- T(i-1, 0)
output -> T(i, 0 .. i)

Task Definition
[diagram: the same P / T task graph]

// Task Name
T
// Execution Space
i = 0 .. N-1
j = 0 .. N-1-i
// Parallel Partitioning
i % Ncores == coreID
// Parameters
input1 <- P(i)
input2 <- T(i-1, j+1)
output -> T(i+1, j-1)
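
A minimal sketch of how the execution space and the parallel partitioning formula of task T translate into the set of task instances a given core owns; run_T is an assumed placeholder for executing one instance of the task body, not part of the original slides.

void run_T(int i, int j);   /* assumed kernel entry point for task T(i, j) */

/* Enumerate the T(i, j) instances owned by one core:
   execution space  i = 0 .. N-1, j = 0 .. N-1-i,
   partitioning     i % Ncores == coreID. */
void enumerate_local_T_tasks(int N, int Ncores, int coreID)
{
    for (int i = 0; i < N; i++) {
        if (i % Ncores != coreID)
            continue;                 /* instance owned by another core */
        for (int j = 0; j <= N - 1 - i; j++)
            run_T(i, j);              /* inputs: P(i) and T(i-1, j+1)   */
    }
}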

// Name
POTRF(k)
// Execution space
k = 0..SIZE
// Parallel partitioning
k % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
INOUT T <- (k == 0) ? IN(k, k) : T SYRK(k, k)
        -> T TRSM(k, k+1..SIZE)
        -> OUT(k, k)
// Body
lapack_spotrf(lapack_lower, NB, T, NB, &info);

// Name
SYRK(k, n)
// Execution space
k = 0..SIZE
n = 0..k-1
// Parallel partitioning
k % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN    A <- C TRSM(n, k)
INOUT T <- (n == 0) ? IN(k, k) : T SYRK(k, n-1)
        -> (n == k-1) ? T POTRF(k) : T SYRK(k, n+1)
// Body
cblas_ssyrk(CblasColMajor, CblasLower, CblasNoTrans,
            NB, NB, 1.0, A, NB, 1.0, T, NB);

// Name
TRSM(k, m)
// Execution space
k = 0..SIZE
m = k+1..SIZE
// Parallel partitioning
m % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN    T <- T POTRF(k)
INOUT C <- (k == 0) ? IN(m, k) : C GEMM(k, m, k-1)
        -> A GEMM(m, m+1..SIZE, k)
        -> B GEMM(k+1..m-1, m, k)
        -> A SYRK(m, k)
        -> OUT(k, m)
// Body
cblas_strsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
            NB, NB, 1.0, T, NB, B, NB);

// Name
GEMM(k, m, n)
// Execution space
k = 0..SIZE
m = k+1..SIZE
n = 0..k-1
// Parallel partitioning
m % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN    A <- C TRSM(n, k)
IN    B <- C TRSM(n, m)
INOUT C <- (n == 0) ? IN(m, n) : C GEMM(k, m, n-1)
        -> (n == k-1) ? C TRSM(k, m) : C GEMM(k, m, n+1)
// Body
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans,
            NB, NB, NB, 1.0, B, NB, A, NB, 1.0, C, NB);

// Globals
GRIDrows, GRIDcols, NB

Task Life-Cycle
[diagram: dependency requests, then data requests, then execution, then completion notifications]

Task Life-Cycle
[diagram: a TRSM task receives completion notifications from POTRF and GEMM (dependencies satisfied), issues data requests and receives responses (data available), executes, and then sends its own completion notifications]

Task Life-Cycle
• No dependencies satisfied
• Some dependencies satisfied
  • waiting for all dependencies
• All dependencies satisfied
• Some data delivered
  • waiting for all data
• All data delivered
  • waiting for execution
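
A minimal sketch of the same life-cycle expressed as states a runtime could track per task instance; the enum and its names are an assumption for illustration only.

/* Per-task progress through the life-cycle listed above (illustrative names). */
typedef enum {
    TASK_NO_DEPS_SATISFIED,     /* no dependencies satisfied                     */
    TASK_WAITING_FOR_DEPS,      /* some dependencies satisfied, waiting for all  */
    TASK_ALL_DEPS_SATISFIED,    /* all dependencies satisfied, requesting data   */
    TASK_WAITING_FOR_DATA,      /* some data delivered, waiting for all          */
    TASK_READY,                 /* all data delivered, waiting for execution     */
    TASK_DONE                   /* executed, completion notifications sent       */
} task_state_t;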

Compilation
• Existing compiler technology
  • task body – standard C / Fortran
  • formulas – C / Fortran expressions
  • metadata – human-readable ASCII

Runtime
• Scheduler
  • management of task bins
• Communication system
  • shared memory – pointer aliasing
  • distributed memory – messaging
  • out-of-memory – a mix of both
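
A minimal sketch of why communication can stay implicit: the runtime resolves the same "deliver a tile to a consumer task" operation as pointer aliasing in shared memory and as a message in distributed memory. Function and type names are assumptions for illustration, not the PLASMA API, and the matching send on the producer's rank is omitted.

#include <string.h>
#include <mpi.h>

typedef struct { float *ptr; int owner_rank; } tile_ref_t;

/* Shared memory: no copy at all, the consumer simply aliases the
   producer's tile through its pointer. */
static inline const float *deliver_tile_shared(const tile_ref_t *t)
{
    return t->ptr;
}

/* Distributed memory: a remote tile has to be moved with a message
   (the matching MPI_Send on the owner's rank is omitted here). */
static void deliver_tile_distributed(const tile_ref_t *t, float *local_buf,
                                     int NB, int my_rank, MPI_Comm comm)
{
    if (t->owner_rank == my_rank)
        memcpy(local_buf, t->ptr, (size_t)NB * NB * sizeof(float));
    else
        MPI_Recv(local_buf, NB * NB, MPI_FLOAT, t->owner_rank,
                 0 /* tag */, comm, MPI_STATUS_IGNORE);
}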

Visual Tools
• Standalone ASCII code
  • little / no error checking
• Visual tools to assist development
  • plot the dataflow diagram
  • instantiate a small DAG

Dataflow Visualization
[diagram: dataflow graph of the Cholesky tasks (POTRF, TRSM, SYRK, GEMM) with edges labeled by the parameters they carry (T, C, A, B) and by IN / OUT]

Task Graph Visualization
• Tile LU
• Tile QR
• Tile Cholesky

Connectivity / Topology
• 4D cube
• quad-core

Connectivity / Topology
The communication matrix should resemble the connectivity matrix.
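
A minimal sketch of what such a communication matrix could be built from: the number of bytes each pair of cores exchanges along the task graph's edges, which can then be compared against the hardware connectivity (topology) matrix. The edge record and the function below are assumptions for illustration.

#include <stddef.h>

typedef struct { int producer_core; int consumer_core; size_t bytes; } edge_t;

/* Accumulate an ncores x ncores communication matrix (row-major) by summing
   the bytes carried by every producer-to-consumer edge between distinct cores. */
void build_comm_matrix(const edge_t *edges, size_t nedges,
                       size_t *comm, int ncores)
{
    for (size_t i = 0; i < nedges; i++) {
        int p = edges[i].producer_core;
        int c = edges[i].consumer_core;
        if (p != c)                         /* count only off-core traffic */
            comm[(size_t)p * ncores + c] += edges[i].bytes;
    }
}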

Thank You! Questions?