How a Domain-Specific Language Enables the Automation of Optimized Code for Dense Linear Algebra DxT – Design by Transformation 1 Bryan Marker, Don Batory,

Slides:



Advertisements
Similar presentations
Advanced Piloting Cruise Plot.
Advertisements

1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
Chapter 10 Architectural Design.
Chapter 7 System Models.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 1 Embedded Computing.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
UNITED NATIONS Shipment Details Report – January 2006.
Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination.
18 Copyright © 2005, Oracle. All rights reserved. Distributing Modular Applications: Introduction to Web Services.
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Title Subtitle.
Determine Eligibility Chapter 4. Determine Eligibility 4-2 Objectives Search for Customer on database Enter application signed date and eligibility determination.
My Alphabet Book abcdefghijklm nopqrstuvwxyz.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 5 second questions
Year 6 mental test 10 second questions
2010 fotografiert von Jürgen Roßberg © Fr 1 Sa 2 So 3 Mo 4 Di 5 Mi 6 Do 7 Fr 8 Sa 9 So 10 Mo 11 Di 12 Mi 13 Do 14 Fr 15 Sa 16 So 17 Mo 18 Di 19.
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Week 2 The Object-Oriented Approach to Requirements
Addison Wesley is an imprint of © 2010 Pearson Addison-Wesley. All rights reserved. Chapter 10 Arrays and Tile Mapping Starting Out with Games & Graphics.
Configuration management
DOROTHY Design Of customeR dRiven shOes and multi-siTe factorY Product and Production Configuration Method (PPCM) ICE 2009 IMS Workshops Dorothy Parallel.
ABC Technology Project
EU Market Situation for Eggs and Poultry Management Committee 21 June 2012.
1 Undirected Breadth First Search F A BCG DE H 2 F A BCG DE H Queue: A get Undiscovered Fringe Finished Active 0 distance from A visit(A)
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
2 |SharePoint Saturday New York City
Green Eggs and Ham.
VOORBLAD.
Name Convolutional codes Tomashevich Victor. Name- 2 - Introduction Convolutional codes map information to code bits sequentially by convolving a sequence.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
© 2012 National Heart Foundation of Australia. Slide 2.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
Exponential and Logarithmic Functions
Copyright 2001 Advanced Strategies, Inc. 1 Data Bridging An Overview Prepared for DIGIT By Advanced Strategies, Inc.
Don Batory, Bryan Marker, Rui Gonçalves, Robert van de Geijn, and Janet Siegmund Department of Computer Science University of Texas at Austin Austin, Texas.
H to shape fully developed personality to shape fully developed personality for successful application in life for successful.
Januar MDMDFSSMDMDFSSS
Week 1.
Chapter 10: The Traditional Approach to Design
Analyzing Genes and Genomes
Systems Analysis and Design in a Changing World, Fifth Edition
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Intracellular Compartments and Transport
PSSA Preparation.
Immunobiology: The Immune System in Health & Disease Sixth Edition
Essential Cell Biology
Chapter 13 The Data Warehouse
Energy Generation in Mitochondria and Chlorplasts
CpSc 3220 Designing a Database
1Model Driven Architecture – 3. März 2008 – Siegfried Nolte 1.UML – What is it and what is it good for ? 2.MDA – What is it and what is it good for ? 3.MDA.
From Model-based to Model-driven Design of User Interfaces.
User Defined Functions Lesson 1 CS1313 Fall User Defined Functions 1 Outline 1.User Defined Functions 1 Outline 2.Standard Library Not Enough #1.
June 13-15, 2010SPAA Managing the Complexity of Lookahead for LU Factorization with Pivoting Ernie Chan.
Presentation transcript:

How a Domain-Specific Language Enables the Automation of Optimized Code for Dense Linear Algebra DxT – Design by Transformation 1 Bryan Marker, Don Batory, Jack Poulson, Robert van de Geijn

The Development Cycle 2

Why? We already modularize – Functions, libraries, features – You don’t inline re-used code, you put it in a function – Develop DSLs to reduce redundant development for domains We still encode architecture/algorithm specifics – Can’t re-use next time around – Doable a few times, not thousands – Manually break through layers to optimize Why don’t we encode reusable, high-level knowledge? – About algorithms, operations, or architectures 3

PartitionDownDiagonal ( A, ATL, ATR, ABL, ABR, 0 ); while( ABR.Height() > 0 ) { RepartitionDownDiagonal ( ATL, /**/ ATR, A00, /**/ A01, A02, /*************/ /****************/ /**/ A10, /**/ A11, A12, ABL, /**/ ABR, A20, /**/ A21, A22 ); A21_VC_Star.AlignWith( A22 ); A21_MC_Star.AlignWith( A22 ); A21_MR_Star.AlignWith( A22 ); // // A11_Star_Star = A11; lapack::internal::LocalChol( Lower, A11_Star_Star ); A11 = A11_Star_Star; A21_VC_Star = A21; blas::internal::LocalTrsm ( Right, Lower, ConjugateTranspose, NonUnit, (F)1, A11_Star_Star, A21_VC_Star ); A21_MC_Star = A21_VC_Star; A21_MR_Star = A21_VC_Star; // (A21^T[*,MC])^T A21^H[*,MR] = A21[MC,* ] A21^H[*,MR] // = (A21 A21^H)[MC,MR] blas::internal::LocalTriangularRankK ( Lower, ConjugateTranspose, (F)-1, A21_MC_Star, A21_MR_Star, (F)1, A22 ); A21 = A21_MC_Star; // // A21_VC_Star.FreeAlignments(); A21_MC_Star.FreeAlignments(); A21_MR_Star.FreeAlignments(); SlidePartitionDownDiagonal ( ATL, /**/ ATR, A00, A01, /**/ A02, /**/ A10, A11, /**/ A12, /*************/ /*****************/ ABL, /**/ ABR, A20, A21, /**/ A22 ); } 4

Performance of Elemental on 8192 cores 5

What Does an Expert Do? Starts with algorithm Chooses architecture-specific implementation – Break through abstraction boundaries (when brave enough) – In-line code to expose details and inefficiencies Optimizes the code Transforms from algorithm to high-performance code Uses a DSL (when possible) – Abstract functionality into DSL – Maintain structure in code – Make transformations common across algorithms 6

Start with the algorithm 7

Inline Some Parallelization Choices 8

Inline Some Redistribution Choices 9

Optimize 10

Automate the Process! For operations in DSL, give implementations options – Architecture-dependent – Code an expert would inline Give transformations to optimize patterns found in DSL Transform just like an expert Use Model-Drive Engineering (MDE) Design by Transformation (DxT) 11

DSLs Enable This Many years of software engineering efforts DSLs provide well-layered and abstracted codes DSLs define and limit set of operations to be performed DSL make it easier to see common transformations in domain 12 LAPACK ScaLAPACK PLAPACK FLAME Elemental LINPACK

Speaking of DSLs Distinguished by constructs specific to domain Allows definition of domain-specific relationships compactly (preferably declaratively) In our opinion, this rules out traditional view of libraries – (Controversial) E.g. relational SQL (database queries) 13

View as DAG in the spirit of MDE (this is a DSL) 14

Transform with Implementations (choose refinement of abstractions) 15

Transform to Optimize 16

Grammars and Meta-Models Models are algorithms/code Meta-model defines correct models Ensures proper types and properties in code – Code compiles Meta-model for domain is grammar for DSL 17

Design by Transformation Two types of transformations come naturally – Box rewrites to specify abstraction implementations – Optimizations API for common abstractions already developed – BLAS, LAPACK, MPI – These are the abstraction boxes – Implementations/refinements are known Optimizations known to experts 18

Correct by Construction Start with correct model – E.g. derived to be correct with FLAME Apply correct transformations End with correct model – Compiles for target architecture – Gets same answer as starting model 19

The big idea… Encode transformations to be reused not code that is disposable 20

A Mechanical System Would… Have many instances of the two transformations Apply these to an input algorithm Transform the algorithm to many implementations of varying efficiency – Combinatorial explosion 21

How Much Does it Cost? An expert uses rough idea of runtimes to make implementation choices – Know which implementations are better than others For DxT generate all implementation and estimate costs (e.g. runtime or power consumption) – Search the space of possibilities – Attach cost to each of DSL’s possible operations Choose “best” implementations from the entire space 22

Cost Functions 23 Include machine-specific and problem-size parameters First-order approximations Just meant to separate bad choices from good

Prototype System Takes input algorithm graph Generates all implementations from known transformations 10s-10,000s of implementations – 2 Cholesky variants – 1 TRSM variant – 3 GEMM variants – 1 variant of preprocessor operation for generalized eigenvalue problem Same or better implementations as hand-generated These are indicative of many more operations that can be supported by DxT 24

Cholesky Cost Estimates 25

What DSLs Do For Us Enable us to layer code/functionality – Understand the layers – Encode transformations to break through layers Enable us to define meta-models/grammar to guide transformations and ensure correctness Limit the amount of operations (and cost functions) we need to support because functionality abstracted and re-used 26

DSLs Enable Us To… Encode transformations to be reused not code that is disposable 27

Future Work Encode more transformations Target SMP and sequential algorithms – Low-level BLAS kernels Improve cost estimates Algorithmic transformations (variants) Try other domains in HPC Replace libraries like libflame and Elemental with libraries of algorithms and transformations – Not just auto-tune 28

Questions? Read our SC11 Submission FLAME and libflame – Elemental – code.google.com/p/elemental Thanks to NSF and Sandia fellowships Rui Gonçalves, Taylor Riche, Andy Terrel 29

// // //A 11 = Chol(A 11 ) A11_Star_Star = A11; lapack::internal::LocalChol( Lower, A11_Star_Star ); A11 = A11_Star_Star; //A 21 = A 21 TRIL(A 11 ) -T A21_VC_Star = A21; blas::internal::LocalTrsm ( Right, Lower, ConjugateTranspose, NonUnit, (F)1, A11_Star_Star, A21_VC_Star ); //A 22 = A 22 – TRIL(A 21 A 21 T ) A21_MC_Star = A21_VC_Star; A21_MR_Star = A21_VC_Star; blas::internal::LocalTriangularRankK ( Lower, ConjugateTranspose, (F)-1, A21_MC_Star, A21_MR_Star, (F)1, A22 ); A21 = A21_MC_Star; // // 30

Elemental’s Layering 31

What This Gets Us Encode knowledge about component operations – Generate implementations and optimize with transformations Libraries of transformations used to generate libraries of code (DxT) – Not libraries of specific code for operation A on architecture B with characteristics C – Libraries of how to use many A’s, B’s, and C’s – Next generation libflame and Elemental consist of algorithms and transformations 32