Generation of SIMD Dense Linear Algebra Kernels with Analytical Models
Richard M. Veras, Tze Meng Low & Franz Franchetti, Carnegie Mellon University
Tyler Smith & Robert van de Geijn, University of Texas at Austin
Introduction: What is the problem? What is our solution?
Kernel within a kernel:
- Enumerate a space of potential implementations for the kernel-in-kernel and estimate their theoretical bounds.
- Mechanically convert a candidate kernel-in-kernel into a high-performance implementation (getting as close to the bounds as we can).
[Figure: the BLIS layered approach to GEMM — the loops around the micro-kernel, the packing of A_i into A~_i and B_p into B~_p, and the memory hierarchy from main memory through the L3, L2, and L1 caches down to the registers.]
Introduction: What is our solution?
Kernel within a kernel:
- An approach for finding high-performance candidate implementations.
- A mechanical method for transforming that candidate into a high-performance implementation.
Overview
- Introduction
- Background
- Kernel within a kernel: candidate components that cover the kernel
- Implementation: reaching theoretical performance
  - Keeping the non-obvious bottlenecks away
  - Statically scheduling to deal with the remaining bottlenecks
- Results and Analysis
- Summary
Background: A kernel for the BLIS micro-kernel
[Figure: the micro-kernel update C += A B expressed as a sequence of outer products. One outer product of the vectors A0, A1 with the elements B0, B1 updates the register block C00, C01, C10, C11, and can be realized either with loads (LDA, LDB) plus MUL/ADD or with a PERM-based variant.]
Background: Outer Product to Component
[Figure: two ways of turning one outer-product update into a "component". The broadcast component pairs a vector load of A (LDA: A0–A3) with a broadcast load of a single element of B (LDB) and an FMA. The permute component loads the full vector B0 B1 B2 B3 once (LDB) and derives the remaining alignments B1 B0 B3 B2, B2 B3 B0 B1, and B3 B2 B1 B0 with PERM instructions, issuing an FMA for each.]
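To make the two components concrete, the following is a minimal sketch in C intrinsics, assuming 4-wide double-precision vectors with FMA support (as on Haswell; on the older SSE/AVX-only targets the FMA would be a separate MUL and ADD). The helper names are ours, not the generator's. The broadcast component pairs one broadcast load with one FMA per column, while the permute component reuses a single vector load of B and derives the other alignments with in-lane and cross-lane permutes.

#include <immintrin.h>

/* One outer-product step: a (4x1) times b (1x4), double precision.
 * Broadcast (LDB) component: one broadcast + one FMA per column of C. */
static inline void step_broadcast(__m256d a, const double *b, __m256d c[4])
{
    for (int j = 0; j < 4; j++) {
        __m256d bj = _mm256_broadcast_sd(&b[j]);   /* LDB: bj bj bj bj    */
        c[j] = _mm256_fmadd_pd(a, bj, c[j]);       /* FMA: c[j] += a * bj */
    }
}

/* Permute (PERM) component: load b once, derive the other alignments with
 * permutes.  Each accumulator then holds C in a shuffled layout that must be
 * unscrambled when the register block is written back to memory. */
static inline void step_permute(__m256d a, const double *b, __m256d c[4])
{
    __m256d b0 = _mm256_loadu_pd(b);                   /* b0 b1 b2 b3 */
    __m256d b1 = _mm256_permute_pd(b0, 0x5);           /* b1 b0 b3 b2 */
    __m256d b2 = _mm256_permute2f128_pd(b0, b0, 0x1);  /* b2 b3 b0 b1 */
    __m256d b3 = _mm256_permute_pd(b2, 0x5);           /* b3 b2 b1 b0 */
    c[0] = _mm256_fmadd_pd(a, b0, c[0]);
    c[1] = _mm256_fmadd_pd(a, b1, c[1]);
    c[2] = _mm256_fmadd_pd(a, b2, c[2]);
    c[3] = _mm256_fmadd_pd(a, b3, c[3]);
}

The trade-off the figure captures: the broadcast component spends a load per FMA, while the permute component trades those loads for shuffle operations and a shuffled accumulator layout.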
Implementing in the context of the kernel
[Figure: covering an 8×4 register block of C with components. One component loads a vector of A (A0–A3) and of B (B0–B3), performs a MUL/ADD, and follows with three PERM + MUL/ADD steps; the next component loads the second vector of A (A4–A7) and reuses the already loaded and permuted values of B for its four MUL/ADD steps.]
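As a rough sketch of how components tile such an 8×4 block across the k loop, here is a broadcast-based micro-kernel in AVX/FMA intrinsics. The function name, the packed-buffer layout (8 contiguous elements of A and 4 of B per step), and the choice of the broadcast rather than the permute component are assumptions for illustration, not the generated code. Note that the 8 accumulators, 2 vectors of A, and 4 broadcasts of B fit in the 16 ymm registers.

#include <immintrin.h>

/* 8x4 double-precision micro-kernel: C += A~ * B~, with C column-major
 * (leading dimension ldc), A~ packed 8 elements per step, B~ packed 4. */
void ukernel_8x4(int k, const double *A, const double *B, double *C, int ldc)
{
    __m256d c00 = _mm256_setzero_pd(), c10 = _mm256_setzero_pd();
    __m256d c01 = _mm256_setzero_pd(), c11 = _mm256_setzero_pd();
    __m256d c02 = _mm256_setzero_pd(), c12 = _mm256_setzero_pd();
    __m256d c03 = _mm256_setzero_pd(), c13 = _mm256_setzero_pd();

    for (int p = 0; p < k; p++) {
        __m256d a0 = _mm256_loadu_pd(A + 8*p);          /* LDA: A0..A3 */
        __m256d a1 = _mm256_loadu_pd(A + 8*p + 4);      /* LDA: A4..A7 */
        __m256d b0 = _mm256_broadcast_sd(B + 4*p + 0);  /* LDB         */
        __m256d b1 = _mm256_broadcast_sd(B + 4*p + 1);
        __m256d b2 = _mm256_broadcast_sd(B + 4*p + 2);
        __m256d b3 = _mm256_broadcast_sd(B + 4*p + 3);
        c00 = _mm256_fmadd_pd(a0, b0, c00);  c10 = _mm256_fmadd_pd(a1, b0, c10);
        c01 = _mm256_fmadd_pd(a0, b1, c01);  c11 = _mm256_fmadd_pd(a1, b1, c11);
        c02 = _mm256_fmadd_pd(a0, b2, c02);  c12 = _mm256_fmadd_pd(a1, b2, c12);
        c03 = _mm256_fmadd_pd(a0, b3, c03);  c13 = _mm256_fmadd_pd(a1, b3, c13);
    }

    /* Accumulate the register block back into C. */
    _mm256_storeu_pd(&C[0*ldc+0], _mm256_add_pd(_mm256_loadu_pd(&C[0*ldc+0]), c00));
    _mm256_storeu_pd(&C[0*ldc+4], _mm256_add_pd(_mm256_loadu_pd(&C[0*ldc+4]), c10));
    _mm256_storeu_pd(&C[1*ldc+0], _mm256_add_pd(_mm256_loadu_pd(&C[1*ldc+0]), c01));
    _mm256_storeu_pd(&C[1*ldc+4], _mm256_add_pd(_mm256_loadu_pd(&C[1*ldc+4]), c11));
    _mm256_storeu_pd(&C[2*ldc+0], _mm256_add_pd(_mm256_loadu_pd(&C[2*ldc+0]), c02));
    _mm256_storeu_pd(&C[2*ldc+4], _mm256_add_pd(_mm256_loadu_pd(&C[2*ldc+4]), c12));
    _mm256_storeu_pd(&C[3*ldc+0], _mm256_add_pd(_mm256_loadu_pd(&C[3*ldc+0]), c03));
    _mm256_storeu_pd(&C[3*ldc+4], _mm256_add_pd(_mm256_loadu_pd(&C[3*ldc+4]), c13));
}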
Implementing a High Performance Kernel
- Remove any obstacles that would place the bottleneck anywhere but the multiply-add units.
- Statically schedule what we have to ensure that multiply-add instructions retire at the predicted peak rate.
Implementation: Avoiding Bottlenecks
- Picking instructions with shorter encodings:
  movapd xmm0, xmm0 -> 66 0f 28 c0
  movaps xmm0, xmm0 -> 0f 28 c0
- Keeping offsets between -128 and 127, so they fit in a single byte:
  add 0x40, rdi  -> 48 83 c7 40
  add 0x100, rdi -> 48 81 c7 00 01 00 00
- Using low-numbered registers, to avoid extra REX prefix bytes:
  addpd xmm0, xmm0 -> 66 0f 58 c0
  addpd xmm0, xmm9 -> 66 44 0f 58 c8
Statically Scheduling Instructions
We have the vector instructions that will get us the bounds we would like, and we have tweaked those instructions so that we are still bound by the multiply-add units. Now we have to go the last mile: ensuring that instruction stalls do not prevent us from executing at the peak rate of the floating-point unit. Static scheduling is necessary to prevent stalls, even on out-of-order cores, because it helps hide the bubbles that occur as the processor computes near peak. To do this we need to manage our usage of registers, break components down into pieces that can be scheduled, and then schedule the component as a whole.
Example (software pipelining) [Lam, 1988]:
for (p = 0; p < K; p++)
    r1 = a[p]       // load:  1 cycle
    r2 = r1 + 1     // add:   2 cycles
    a[p] = r2       // store: 1 cycle
Software-pipelined schedule (one row per cycle; columns are the read, add, and write stages):
// Prolog:
READ(0)
READ(1)   ADD(0)
READ(2)   ADD(1)
// Steady state:
for (p = 3; p < K; p++)
    READ(p)   ADD(p-1)   WRITE(p-3)
// Epilog:
          ADD(K-1)   WRITE(K-3)
                     WRITE(K-2)
                     WRITE(K-1)
[Figure: pipeline diagrams (LD / ALU / WR activity per clock cycle) for the original and the pipelined loop.]
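A minimal C rendering of the same transformation (the function name is ours and plain C variables stand in for registers; the generator does this at the machine-instruction level, and a compiler is free to reschedule C code, so this only illustrates the shape of the prolog, steady state, and epilog):

#include <stddef.h>

/* Software-pipelined form of: for (p = 0; p < K; p++) a[p] = a[p] + 1;
 * Each steady-state step issues READ(p), ADD(p-1), and WRITE(p-3), so the
 * two-cycle add latency never stalls the pipeline. */
void add_one_pipelined(double *a, size_t K)
{
    if (K < 4) {                                   /* too short to pipeline */
        for (size_t p = 0; p < K; p++) a[p] += 1.0;
        return;
    }
    /* Prolog: fill the pipeline. */
    double rd = a[0];                              /* READ(0)            */
    double sum3 = rd + 1.0; rd = a[1];             /* ADD(0), READ(1)    */
    double sum2 = rd + 1.0; rd = a[2];             /* ADD(1), READ(2)    */
    /* Steady state: READ(p), ADD(p-1), WRITE(p-3). */
    for (size_t p = 3; p < K; p++) {
        a[p-3] = sum3;                             /* WRITE(p-3)         */
        double sum1 = rd + 1.0;                    /* ADD(p-1)           */
        rd = a[p];                                 /* READ(p)            */
        sum3 = sum2; sum2 = sum1;                  /* rotate "registers" */
    }
    /* Epilog: drain the pipeline. */
    double sum1 = rd + 1.0;                        /* ADD(K-1)           */
    a[K-3] = sum3;                                 /* WRITE(K-3)         */
    a[K-2] = sum2;                                 /* WRITE(K-2)         */
    a[K-1] = sum1;                                 /* WRITE(K-1)         */
}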
Managing our Usage of Registers
[Figure: the full component — LDA, LDB, then the chain of PERM/broadcast and MUL/ADD steps across the 8×4 block — is broken into smaller partial components, each a load or permute together with the MUL/ADDs that consume it, so that values stay live in registers only as long as they are needed.]
Scheduling the Partial Components
[Figure: the partial components are interleaved into a single static instruction schedule. The loads and permutes of one partial component are issued ahead of time, underneath the MUL/ADDs of the previous one, so that the multiply-add units never wait on an operand.]
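A sketch of what that interleaving looks like for the broadcast-based 8×4 kernel sketched earlier (again with assumed names and packed layout, software-pipelined by one step so the loads and broadcasts that start step p+1 are issued among the FMAs of step p; assumes k >= 1):

#include <immintrin.h>

/* 8x4 kernel with the partial components of step p+1 scheduled among the
 * FMAs of step p, so the multiply-add units never wait on an operand. */
void ukernel_8x4_scheduled(int k, const double *A, const double *B,
                           double *C, int ldc)
{
    __m256d c[4][2];
    for (int j = 0; j < 4; j++)
        c[j][0] = c[j][1] = _mm256_setzero_pd();

    /* Prolog: issue the loads and broadcasts of step 0. */
    __m256d a0 = _mm256_loadu_pd(A), a1 = _mm256_loadu_pd(A + 4);
    __m256d b[4];
    for (int j = 0; j < 4; j++) b[j] = _mm256_broadcast_sd(B + j);

    /* Steady state: FMAs of step p, loads/broadcasts of step p+1. */
    for (int p = 0; p + 1 < k; p++) {
        __m256d a0n = _mm256_loadu_pd(A + 8*(p+1));
        c[0][0] = _mm256_fmadd_pd(a0, b[0], c[0][0]);
        c[0][1] = _mm256_fmadd_pd(a1, b[0], c[0][1]);
        __m256d a1n = _mm256_loadu_pd(A + 8*(p+1) + 4);
        c[1][0] = _mm256_fmadd_pd(a0, b[1], c[1][0]);
        c[1][1] = _mm256_fmadd_pd(a1, b[1], c[1][1]);
        __m256d b0n = _mm256_broadcast_sd(B + 4*(p+1) + 0);
        __m256d b1n = _mm256_broadcast_sd(B + 4*(p+1) + 1);
        c[2][0] = _mm256_fmadd_pd(a0, b[2], c[2][0]);
        c[2][1] = _mm256_fmadd_pd(a1, b[2], c[2][1]);
        __m256d b2n = _mm256_broadcast_sd(B + 4*(p+1) + 2);
        __m256d b3n = _mm256_broadcast_sd(B + 4*(p+1) + 3);
        c[3][0] = _mm256_fmadd_pd(a0, b[3], c[3][0]);
        c[3][1] = _mm256_fmadd_pd(a1, b[3], c[3][1]);
        a0 = a0n; a1 = a1n;                          /* rotate to step p+1 */
        b[0] = b0n; b[1] = b1n; b[2] = b2n; b[3] = b3n;
    }

    /* Epilog: FMAs of the last step, then accumulate into C (column-major). */
    for (int j = 0; j < 4; j++) {
        c[j][0] = _mm256_fmadd_pd(a0, b[j], c[j][0]);
        c[j][1] = _mm256_fmadd_pd(a1, b[j], c[j][1]);
        _mm256_storeu_pd(&C[j*ldc],
                         _mm256_add_pd(_mm256_loadu_pd(&C[j*ldc]), c[j][0]));
        _mm256_storeu_pd(&C[j*ldc + 4],
                         _mm256_add_pd(_mm256_loadu_pd(&C[j*ldc + 4]), c[j][1]));
    }
}

An out-of-order core can do some of this reordering on its own, but fixing the order statically keeps the issue pattern regular and the register lifetimes short, which is the point of the partial components.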
Overview
- Introduction
- Background
- Kernel within a kernel: candidate components that cover the kernel
- Implementation: reaching theoretical performance
  - Keeping the non-obvious bottlenecks away
  - Statically scheduling to deal with the remaining bottlenecks
- Results and Analysis
- Summary
Sequential Results: HSW and SNB
Sequential Results: NEH and PEN
Spilling Results
Partial Component Shape
[Figure: two alternative shapes for the partial components that cover the 8×4 block — the same LDA/LDB/PERM and MUL/ADD chains grouped in two different ways — compared side by side.]
Instruction Selection Model
Moving Forward
- We wanted to take the mystery out of implementing a high-performance BLIS kernel, using the double-precision outer-product kernel within the micro-kernel.
- Look at other options and data types.
- Capture other architectures, since the same motifs recur across them.
Summary
- Distilled the problem to a kernel within a kernel.
- Determined how we estimate the best theoretical options.
- Showed how we can mechanically turn that selection into an implementation.
- Presented options for the future.
Background: BLIS
[Figure: the BLIS layered algorithm for matrix-matrix multiplication — the loops around the micro-kernel, the packing of A_i into A~_i and B_p into B~_p, and the level of the memory hierarchy (main memory, L3, L2, L1 cache, registers) that each block is sized to live in. The micro-kernel itself is a sequence of outer-product updates.]
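For reference, here is a plain-C sketch of that loop structure. The blocking parameters, names, and the scalar reference micro-kernel are placeholders; real BLIS packs A_i and B_p into contiguous micro-panels and calls a SIMD micro-kernel instead.

enum { MC = 96, KC = 256, NC = 4096, MR = 8, NR = 4 };   /* assumed blockings */

/* Reference micro-kernel: kc rank-1 (outer-product) updates of an
 * mr x nr block of C.  All matrices are column-major. */
static void ukernel_ref(int kc, int mr, int nr,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < nr; j++)
            for (int i = 0; i < mr; i++)
                C[j*ldc + i] += A[p*lda + i] * B[j*ldb + p];
}

/* C (m x n) += A (m x k) * B (k x n), blocked in the BLIS loop order. */
void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    for (int jc = 0; jc < n; jc += NC)                    /* 5th loop */
      for (int pc = 0; pc < k; pc += KC)                  /* 4th loop */
        for (int ic = 0; ic < m; ic += MC) {              /* 3rd loop */
          int nc = n - jc < NC ? n - jc : NC;
          int kc = k - pc < KC ? k - pc : KC;
          int mc = m - ic < MC ? m - ic : MC;
          for (int jr = 0; jr < nc; jr += NR)             /* 2nd loop */
            for (int ir = 0; ir < mc; ir += MR) {         /* 1st loop */
              int nr = nc - jr < NR ? nc - jr : NR;
              int mr = mc - ir < MR ? mc - ir : MR;
              ukernel_ref(kc, mr, nr,
                          &A[pc*lda + ic + ir], lda,
                          &B[(jc + jr)*ldb + pc], ldb,
                          &C[(jc + jr)*ldc + ic + ir], ldc);
            }
        }
}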
The Tools We Have to Work With
[Figure: the vector instructions used as building blocks — loads of A (LDA) and B (LDB), broadcasts, permutes (PERM), and fused multiply-adds (FMA) — shown acting on the vectors A0 A1 and B0 B1.]
Finding Prime Components
[Figure: the prime components built from the instruction set — the permute (PERM) component, which loads B0 B1 B2 B3 once and derives the alignments B1 B0 B3 B2, B2 B3 B0 B1, and B3 B2 B1 B0, and the broadcast (LDB) component, which loads a single element of B — each paired with a vector load of A (LDA) and an FMA.]
Enumerating the Possible Tilings
[Figure: given the vector of A (A0–A3) and the alignments of B produced by the components, enumerate the ways these tiles can cover the register block of C.]