Anatomy of a High-Performance Many-Threaded Matrix Multiplication
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee

Introduction
- Shared-memory parallelism for GEMM.
- Many-threaded architectures require more sophisticated methods of parallelism.
- We explore the opportunities for parallelism in order to explain which ones we exploit.
- Finer-grained parallelism is needed.

Outline: GotoBLAS approach (up next); Opportunities for Parallelism; Many-threaded Results.

GotoBLAS Approach. The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n.
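For reference, the operation on this slide can be written as a plain triple loop in C. This is only a minimal sketch of the semantics of C += A B (column-major storage is assumed; the slides show no code), not a high-performance implementation:

```c
#include <stddef.h>

/* Reference semantics of GEMM: C += A * B, with C m-by-n, A m-by-k and
 * B k-by-n, all stored in column-major order with leading dimensions
 * ldc, lda, ldb. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p)
            for (size_t i = 0; i < m; ++i)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```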

[Figure: the memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers) through which the operands of C += A B are staged.]

[Figure: C and B are partitioned into column panels of width n_c.]

[Figure: A and the n_c-wide panel of B are partitioned in the k dimension by k_c, yielding a k_c × n_c panel of B.]

[Figure: the k_c-wide panel of A is partitioned into m_c × k_c blocks; such a block is packed to fit in the L2 cache.]

[Figure: the panel of B is partitioned into micro-panels of width n_r; a micro-panel of B is kept in the L1 cache.]

[Figure: the block of A is partitioned into micro-panels of height m_r; the micro-kernel updates an m_r × n_r block of C held in registers.]
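Putting the preceding slides together: the GotoBLAS approach wraps five loops around a small micro-kernel, choosing n_c, k_c, m_c, n_r and m_r so that each operand lands in the intended level of the memory hierarchy. The sketch below shows only the loop structure; the packing of B and A into contiguous buffers is indicated by comments rather than implemented, the micro-kernel is written as plain loops instead of vectorized code, and the blocking parameters are illustrative, not tuned values.

```c
#include <stddef.h>

/* Illustrative blocking parameters; real values are tuned per architecture. */
enum { NC = 4096, KC = 256, MC = 96, NR = 4, MR = 8 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Schematic GotoBLAS-style loop structure for C += A * B (column-major). */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t jc = 0; jc < n; jc += NC) {                  /* n_c-wide panels of B and C      */
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {              /* k_c: the k_c x n_c panel of B   */
            size_t kc = min_sz(KC, k - pc);                  /* would be packed here (L3 cache) */
            for (size_t ic = 0; ic < m; ic += MC) {          /* m_c: the m_c x k_c block of A   */
                size_t mc = min_sz(MC, m - ic);              /* would be packed here (L2 cache) */
                for (size_t jr = 0; jr < nc; jr += NR) {     /* n_r-wide micro-panel of B (L1)  */
                    size_t nr = min_sz(NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += MR) { /* m_r-high micro-panel of A       */
                        size_t mr = min_sz(MR, mc - ir);
                        /* Micro-kernel: update an m_r x n_r block of C held in registers. */
                        for (size_t j = 0; j < nr; ++j)
                            for (size_t p = 0; p < kc; ++p)
                                for (size_t i = 0; i < mr; ++i)
                                    C[(ic + ir + i) + (jc + jr + j) * ldc] +=
                                        A[(ic + ir + i) + (pc + p) * lda] *
                                        B[(pc + p) + (jc + jr + j) * ldb];
                    }
                }
            }
        }
    }
}
```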

Outline: GotoBLAS approach; Opportunities for Parallelism (up next); Many-threaded Results.

3 Loops to Parallelize in GotoBLAS [figure: the blocked C += A B diagram, with the three loops that the GotoBLAS approach exposes for parallelism highlighted].

5 Opportunities for Parallelism [figure: the same diagram, with all five loops around the micro-kernel marked as candidates for parallelism].

Multiple Levels of Parallelism: the i_r loop (first loop around the micro-kernel)
- All threads share a micro-panel of B.
- Each thread has its own micro-panel of A.
- Fixed number of iterations (m_c / m_r).
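A sketch of this strategy (illustrative, not the BLIS source): the i_r loop is parallelized with OpenMP, so the m_r-row micro-panels of A and the corresponding rows of C are divided among threads, while every thread reads the same micro-panel of B. For clarity the operands are left unpacked and the micro-kernel is written as plain loops.

```c
#include <stddef.h>
#include <omp.h>

/* Parallelizing the first loop around the micro-kernel (the i_r loop).
 * A is an mc x kc block, B a kc x nr micro-panel, and C the matching
 * mc x nr piece of C; all column-major and unpacked in this sketch. */
void ir_loop_parallel(size_t mc, size_t nr, size_t kc, size_t mr,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc)
{
    #pragma omp parallel for schedule(static)  /* mc/mr iterations: a fixed, known count */
    for (size_t ir = 0; ir < mc; ir += mr) {
        size_t mrb = (mc - ir < mr) ? mc - ir : mr;
        /* Micro-kernel: mrb x nr block of C += micro-panel of A * micro-panel of B. */
        for (size_t j = 0; j < nr; ++j)
            for (size_t p = 0; p < kc; ++p)
                for (size_t i = 0; i < mrb; ++i)
                    C[(ir + i) + j * ldc] += A[(ir + i) + p * lda] * B[p + j * ldb];
    }
}
```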

Multiple Levels of Parallelism: the j_r loop (second loop around the micro-kernel)
- All threads share the block of A.
- Each thread has its own micro-panel of B.
- Fixed number of iterations (n_c / n_r).
- Good if the threads share an L2 cache.
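The analogous sketch for this choice (again illustrative, not the BLIS source): the j_r loop is the one distributed across threads, so each thread owns n_r-wide micro-panels of B and the corresponding columns of C, while the whole m_c × k_c block of A is read by every thread. That shared reuse of A is why this option favors a shared L2 cache.

```c
#include <stddef.h>
#include <omp.h>

/* Parallelizing the second loop around the micro-kernel (the j_r loop).
 * Operands are unpacked and column-major; the i_r loop and micro-kernel
 * run sequentially inside each thread's share of the j_r iterations. */
void jr_loop_parallel(size_t mc, size_t nc, size_t kc, size_t mr, size_t nr,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc)
{
    #pragma omp parallel for schedule(static)      /* nc/nr iterations: fixed count */
    for (size_t jr = 0; jr < nc; jr += nr) {
        size_t nrb = (nc - jr < nr) ? nc - jr : nr;
        for (size_t ir = 0; ir < mc; ir += mr) {   /* i_r loop stays sequential */
            size_t mrb = (mc - ir < mr) ? mc - ir : mr;
            for (size_t j = 0; j < nrb; ++j)       /* micro-kernel as plain loops */
                for (size_t p = 0; p < kc; ++p)
                    for (size_t i = 0; i < mrb; ++i)
                        C[(ir + i) + (jr + j) * ldc] +=
                            A[(ir + i) + p * lda] * B[p + (jr + j) * ldb];
        }
    }
}
```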

Multiple Levels of Parallelism: the i_c loop (third loop around the micro-kernel)
- All threads share the panel of B.
- Each thread has its own block of A.
- The number of iterations is not fixed (it depends on m).
- Good if there are multiple L2 caches.
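A sketch of the i_c-loop option (illustrative, not the BLIS source). Each thread packs its own m_c × k_c block of A into a private buffer, which is what makes this choice attractive when every core has its own L2 cache; all threads read the same panel of B, and the trip count now depends on the problem size m. The pack_A routine and the per-thread buffer handling are simplified stand-ins for what the library actually does.

```c
#include <stddef.h>
#include <omp.h>

/* Pack an mc x kc block of A (column-major, leading dimension lda) into a
 * contiguous column-major buffer Ap.  A simplified stand-in for real packing. */
static void pack_A(size_t mc, size_t kc, const double *A, size_t lda, double *Ap)
{
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < mc; ++i)
            Ap[i + p * mc] = A[i + p * lda];
}

/* Parallelizing the i_c loop over m_c x k_c blocks of A.  A is an m x kc
 * slice, B a kc x nc panel, and C the matching m x nc piece of C.
 * Ap_buffers holds one mc_max x kc scratch block per thread. */
void ic_loop_parallel(size_t m, size_t nc, size_t kc, size_t mc_max,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc,
                      double *Ap_buffers)
{
    #pragma omp parallel
    {
        double *Ap = Ap_buffers + (size_t)omp_get_thread_num() * mc_max * kc;

        #pragma omp for schedule(dynamic)          /* ceil(m/m_c) iterations: not fixed */
        for (size_t ic = 0; ic < m; ic += mc_max) {
            size_t mc = (m - ic < mc_max) ? m - ic : mc_max;
            pack_A(mc, kc, &A[ic], lda, Ap);       /* each thread owns its block of A */
            for (size_t j = 0; j < nc; ++j)        /* macro-kernel as plain loops     */
                for (size_t p = 0; p < kc; ++p)
                    for (size_t i = 0; i < mc; ++i)
                        C[(ic + i) + j * ldc] += Ap[i + p * mc] * B[p + j * ldb];
        }
    }
}
```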

Multiple Levels of Parallelism: the p_c loop (partitioning the k dimension)
- Each iteration updates the entire matrix C.
- Iterations of the loop are not independent.
- Requires a mutex when updating C, or a reduction.
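A sketch of why this loop is the awkward one (illustrative; the slide offers a mutex or a reduction, and the reduction variant is shown here). The k dimension is split into k_c-sized pieces across threads, each thread accumulates its rank-k_c updates into a private copy of C, and the private copies are then folded into C inside a critical section.

```c
#include <stdlib.h>
#include <omp.h>

/* Parallelizing the p_c (k-dimension) loop of C += A * B, column-major.
 * Every iteration touches all of C, so each thread accumulates into a
 * private buffer and the buffers are reduced into C at the end. */
void pc_loop_parallel(size_t m, size_t n, size_t k, size_t kc_max,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc)
{
    #pragma omp parallel
    {
        double *Cpriv = calloc(m * n, sizeof *Cpriv);   /* per-thread partial result */

        #pragma omp for schedule(static)
        for (size_t pc = 0; pc < k; pc += kc_max) {
            size_t kc = (k - pc < kc_max) ? k - pc : kc_max;
            for (size_t j = 0; j < n; ++j)              /* rank-k_c update into Cpriv */
                for (size_t p = 0; p < kc; ++p)
                    for (size_t i = 0; i < m; ++i)
                        Cpriv[i + j * m] += A[i + (pc + p) * lda] * B[(pc + p) + j * ldb];
        }

        #pragma omp critical                            /* fold each partial result into C */
        {
            for (size_t j = 0; j < n; ++j)
                for (size_t i = 0; i < m; ++i)
                    C[i + j * ldc] += Cpriv[i + j * m];
        }

        free(Cpriv);
    }
}
```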

Multiple Levels of Parallelism: the j_c loop (outermost loop)
- All threads share the matrix A.
- Each thread has its own panel of B.
- The number of iterations is not fixed (it depends on n).
- Good if there are multiple L3 caches.
- Good for NUMA reasons.
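A sketch of parallelizing the outermost (j_c) loop. The n_c-wide column panels of B and C are divided among threads and every thread reads all of A; because each thread would pack its own panel of B, that packed buffer can be placed in memory local to the thread's socket, which is the NUMA advantage the slide refers to. gemm_panel is a hypothetical helper (declared but not defined here) standing in for the blocked panel update of the earlier slides.

```c
#include <stddef.h>
#include <omp.h>

/* Hypothetical helper: blocked C_panel += A * B_panel for one n_c-wide panel
 * of B and C (declaration only in this sketch). */
void gemm_panel(size_t m, size_t nc, size_t k,
                const double *A, size_t lda,
                const double *B_panel, size_t ldb,
                double *C_panel, size_t ldc);

/* Parallelizing the outermost j_c loop over column panels of B and C. */
void jc_loop_parallel(size_t m, size_t n, size_t k, size_t nc_max,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc)
{
    #pragma omp parallel for schedule(dynamic)   /* ceil(n/n_c) iterations: not fixed */
    for (size_t jc = 0; jc < n; jc += nc_max) {
        size_t nc = (n - jc < nc_max) ? n - jc : nc_max;
        /* Each thread handles its own panel of B (and of C); A is shared by all. */
        gemm_panel(m, nc, k, A, lda, &B[jc * ldb], ldb, &C[jc * ldc], ldc);
    }
}
```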

Outline: GotoBLAS approach; Opportunities for Parallelism; Many-threaded Results (up next).

Intel Xeon Phi
- Many threads: 60 cores, 4 threads per core; at least two threads per core must be used to fully utilize the FPU.
- We do not block for the L1 cache: it is difficult to amortize the cost of updating C with 4 threads sharing an L1 cache, so we treat part of the L2 cache as a virtual L1.
- Each core has its own L2 cache.
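One plausible way to use all 60 × 4 hardware threads is to spread them over two of the loops, for example an outer team of 60 (one thread per core, over the i_c loop, so each core's private L2 holds its own block of A) and an inner team of 4 (the threads sharing a core, over the j_r loop). The sketch below is only an illustration built with nested OpenMP: the exact loop factorization and threading mechanism BLIS uses on the Xeon Phi are not given in this transcript, and core_work is a hypothetical stand-in for the packing and macro-kernel work done per block of A.

```c
#include <stddef.h>
#include <omp.h>

/* Hypothetical stand-in: pack the block of A starting at row ic and run the
 * macro-kernel, with the j_r iterations split among jr_nthreads threads
 * according to jr_id (declaration only). */
void core_work(size_t ic, size_t mc, int jr_id, int jr_nthreads);

/* Illustrative mapping of 60 cores x 4 threads onto the i_c and j_r loops. */
void many_threaded_sketch(size_t m, size_t mc_max)
{
    omp_set_max_active_levels(2);                 /* allow nested parallel regions */

    #pragma omp parallel num_threads(60)          /* one thread per core: i_c loop */
    {
        #pragma omp for schedule(dynamic)
        for (size_t ic = 0; ic < m; ic += mc_max) {
            size_t mc = (m - ic < mc_max) ? m - ic : mc_max;

            #pragma omp parallel num_threads(4)   /* 4 threads per core: j_r loop */
            core_work(ic, mc, omp_get_thread_num(), 4);
        }
    }
}
```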

[Figures: many-threaded performance results on the Intel Xeon Phi; no text was captured in the transcript.]

IBM Blue Gene/Q
- (Not quite as) many threads: 16 cores, 4 threads per core; at least two threads per core must be used to fully utilize the FPU.
- We do not block for the L1 cache: it is difficult to amortize the cost of updating C with 4 threads sharing an L1 cache, so we treat part of the L2 cache as a virtual L1.
- A single large, shared L2 cache.

[Figures: many-threaded performance results on the IBM Blue Gene/Q; no text was captured in the transcript.]

Thank You. Questions? Source code available at: code.google.com/p/blis/