Don Batory, Bryan Marker, Rui Gonçalves, Robert van de Geijn, and Janet Siegmund
Department of Computer Science, University of Texas at Austin, Austin, Texas

Introduction
Software Engineering (SE) largely aims at techniques and tools to aid masses of programmers whose code is used by hordes; these programmers need all the help they can get.
There are also many areas where programming tasks are so difficult that only a few expert programmers can do them – and their code, too, is used by hordes; these experts need all the help they can get as well.

Our Focus is CBSE for…
Dataflow domains:
– nodes are computations
– edges denote node inputs and outputs
General examples: Virtual Instruments (LabVIEW), applications of streaming languages…
Our domains:
– Distributed-Memory Dense Linear Algebra Kernels
– Parallel Relational Query Processors
– Crash Fault-Tolerant File Servers
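As a hypothetical illustration (not the authors' tooling), a dataflow graph of this kind can be represented in a few lines; the node and operation names below are made up.

    # Hypothetical sketch: a minimal dataflow-graph representation in which
    # nodes are computations and edges record which node outputs feed which
    # node inputs.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        op: str                                        # computation performed, e.g. "LocalGemm"
        inputs: list = field(default_factory=list)     # producer nodes whose outputs flow in

    a = Node("LocalGemm")                  # a two-node graph:
    b = Node("SumScatter", inputs=[a])     # the output of a flows into b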

Approach
In CBSE, experts produce "Big Bang" spaghetti diagrams (dataflow graphs) directly.
In DxT, we instead derive dataflow graphs from domain knowledge, one transformation at a time.
When we have proofs of each step, the derived design is correct by construction. Details later…
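As a rough sketch of what "deriving by transformation" can look like (illustrative only; these are not actual DxT rewrite rules), a rule replaces an abstract node with a subgraph that implements it, and each rule can carry a proof that the replacement preserves behavior.

    # Hypothetical sketch: refine every abstract "Gemm" node into a two-node
    # implementation (redistribute the inputs, then multiply locally). The
    # graph is a dict: name -> (operation, list of input names); "a" and "b"
    # are graph inputs. Operation names are illustrative only.
    def refine_gemm(graph):
        refined = {}
        for name, (op, inputs) in graph.items():
            if op == "Gemm":
                refined[name + "_redist"] = ("Redistribute", inputs)
                refined[name] = ("LocalGemm", [name + "_redist"])
            else:
                refined[name] = (op, inputs)
        return refined

    abstract = {"c": ("Gemm", ["a", "b"])}
    print(refine_gemm(abstract))
    # {'c_redist': ('Redistribute', ['a', 'b']), 'c': ('LocalGemm', ['c_redist'])}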

State of the Art for Distributed-Memory Dense Linear Algebra (DLA) Kernels
Portability of DLA kernels is a problem:
– they may not work: distributed-memory kernels don't work on sequential machines
– they may not perform well: the choice of algorithms to use may be different
– you cannot "undo" optimizations and reapply others
– if the hardware is different enough, kernels are coded from scratch

Why? Because Performance is Key!
Applications that make DLA kernel calls are common in scientific computing: simulation of airflow, climate change, weather forecasting.
These applications run on extraordinarily expensive machines:
– time on these machines = $$
– higher performance means quicker/cheaper runs or more accurate results
Application developers naturally want peak performance to justify costs.

Distributed DLA Kernels
Deals with SPMD (Single Program, Multiple Data) architectures: the same program is run on each processor, but with different inputs.
The expected operations to support are fixed – but with lots of variants.
Level 3 Basic Linear Algebra Subprograms (BLAS3) are basically matrix-matrix operations:

    BLAS3   # of Variants   Operation
    Gemm    12              general matrix-matrix multiply
    Hemm     8              Hermitian matrix-matrix multiply
    Her2k    4              Hermitian rank-2k update
    Herk     4              Hermitian rank-k update
    Symm     8              symmetric matrix-matrix multiply
    Syr2k    4              symmetric rank-2k update
    Trmm    16              triangular matrix-matrix multiply
    Trsm    16              solve a non-singular triangular system of equations
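For reference, Gemm computes C := alpha·A·B + beta·C; a plain sequential NumPy sketch (not one of the distributed variants counted above) makes the semantics concrete.

    import numpy as np

    def gemm(alpha, A, B, beta, C):
        """Sequential reference for BLAS3 Gemm: C := alpha*A*B + beta*C."""
        return alpha * (A @ B) + beta * C

    A, B = np.random.rand(4, 3), np.random.rand(3, 5)
    C = gemm(2.0, A, B, 0.0, np.zeros((4, 5)))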

12 Variants of Distributed Gemm
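To give a feel for why so many variants exist, here is a hypothetical mpi4py sketch of just one simple choice (row-block A and C, replicated B, so C never moves); the actual variants differ in which matrix stays put and how the others are redistributed. This is not the generated Elemental code.

    # Hypothetical sketch of one distributed Gemm variant: A and C are
    # distributed by row blocks, B is replicated, so only B is communicated
    # and each process computes its rows of C locally.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()

    n = 4 * p                              # global size, divisible by p for simplicity
    A_local = np.random.rand(n // p, n)    # my block of rows of A
    B = np.random.rand(n, n) if rank == 0 else np.empty((n, n))
    comm.Bcast(B, root=0)                  # replicate B on every process

    C_local = A_local @ B                  # local Gemm yields my block of rows of C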

Further
We want to optimize "LAPACK-level" algorithms, which call DLA and BLAS3 operations:
– solvers
– decomposition functions (e.g. Cholesky factorization)
– eigenvalue problems
We have to generate high-performance algorithms for these operations too.
Our work mechanizes the decisions of experts on van de Geijn's FLAME project, in particular the Elemental library (J. Poulson), and rests on 20 years of polishing – creating elegant, layered designs of DLA libraries and their computations.

Performance Results
Benchmarked against ScaLAPACK:
– the vendors' standard option for distributed-memory machines; auto-tuned or manually tuned
– the only alternative available for the target machines except for FLAME
DxT automatically generated & optimized BLAS3 and Cholesky FLAME algorithms.

Target machines:

    Machine                                      # of Cores   Peak Performance
    Argonne's BlueGene/P (Intrepid)              8,…          … TFLOPS
    Texas Advanced Computing Center (Lonestar)   …            … TFLOPS

Cholesky Factorization
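As background (a textbook sketch, not the generated distributed code), Cholesky factorization computes a lower-triangular L with A = L·Lᵀ for a symmetric positive-definite A.

    import numpy as np

    def chol_lower(A):
        """Textbook right-looking Cholesky: returns lower-triangular L with L @ L.T == A.
        Assumes A is symmetric positive definite."""
        A = A.copy()
        n = A.shape[0]
        for j in range(n):
            A[j, j] = np.sqrt(A[j, j])
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j+1:, j])
        return np.tril(A)

    M = np.random.rand(5, 5)
    A = M @ M.T + 5 * np.eye(5)          # symmetric positive-definite test matrix
    L = chol_lower(A)
    print(np.allclose(L @ L.T, A))       # True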

DxT is Not Limited to DLA
DLA components are stateless, but DxT does not require stateless components – DxT was originally developed for stateful Crash-Fault-Tolerant Servers.
DxT is Correct by Construction, can design high-performing programs, and – best of all – can be taught to undergrads!
– We gave a project to an undergraduate class of 30+ students.
– We had them build Gamma – the classical parallel join algorithm from circa the 1990s – using the same DxT techniques we used for DLA code generation.
– We asked them to compare this with the "big bang" approach, which directly implements the spaghetti diagram (the final design).
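For context, the core of Gamma's parallel join is hash-partitioning both relations on the join key and joining matching partitions independently; below is a single-process Python sketch of that idea (hypothetical, not the students' code).

    # Hypothetical sketch of Gamma-style hash-partitioned join: partition both
    # relations on the join key, then hash-join each partition pair; in Gamma,
    # each partition pair is processed on its own node in parallel.
    def partition(tuples, key, n):
        parts = [[] for _ in range(n)]
        for t in tuples:
            parts[hash(t[key]) % n].append(t)
        return parts

    def hash_join(r, s, rkey, skey):
        table = {}
        for t in r:
            table.setdefault(t[rkey], []).append(t)
        return [(t1, t2) for t2 in s for t1 in table.get(t2[skey], [])]

    def gamma_join(R, S, rkey, skey, n=4):
        Rp, Sp = partition(R, rkey, n), partition(S, skey, n)
        out = []
        for i in range(n):                       # each iteration could run on its own node
            out += hash_join(Rp[i], Sp[i], rkey, skey)
        return out

    R = [(1, "a"), (2, "b"), (2, "c")]
    S = [(2, "x"), (3, "y")]
    print(gamma_join(R, S, rkey=0, skey=0))      # [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]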

Compared to "Big Bang"
Preliminary user study numbers: DxT – 25/28 = 89%.

They Really Loved It
– "I have learned the most from this project than any other CS project I have ever done."
– "Honestly, I don't believe that software engineers ever have a source (to provide a DxT explanation) in real life. If there was such a thing we would lose our jobs, because there is an explanation which even a monkey can implement. It's so much easier to implement (using DxT)."
– "The big-bang makes it easy to make so many errors, because you can't test each section separately. DxT might take a bit longer, but saves you so much time debugging, and is a more natural way to build things. You won't get lost in your design trying to do too many things at once."
– "I even made my OS group do DxT implementation on the last 2 projects due to my experience implementing gamma."

What are the Secrets Behind DxT?