Automatically Tuned Linear Algebra Software (ATLAS)
R. Clint Whaley, University of Tennessee
www.netlib.org/atlas

Presentation transcript:

Automatically Tuned Linear Algebra Software (ATLAS)
R. Clint Whaley, University of Tennessee

What is ATLAS?
- A package that adapts to differing architectures via AEOS techniques
  - Initially, it supplies the BLAS
- Automated Empirical Optimization of Software (AEOS)
  - The machine itself searches the optimization space
  - Finds the architecture as the application actually perceives it
- AEOS requires:
  - A method of code variation: code generation, multiple implementations, or parameterization (see the search sketch below)
  - Sophisticated timers
  - A robust search heuristic
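To make the AEOS idea concrete, the following is a minimal sketch of a parameterized empirical search: time one candidate blocking factor NB at a time and keep the fastest. It is illustrative only, not ATLAS's actual search code; the problem size N, the NB candidate set, and the naive blocked kernel are all assumptions made for this example.

    /* Sketch of an AEOS-style parameterized search: measure each code
       variant (here, each blocking factor NB) and select the fastest. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 256   /* assumed problem size for timing runs */

    /* One parameterized code variant: square-blocked matrix multiply. */
    static void blocked_mm(int nb, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < N; ii += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int kk = 0; kk < N; kk += nb)
                    for (int i = ii; i < ii + nb && i < N; i++)
                        for (int j = jj; j < jj + nb && j < N; j++) {
                            double s = C[i*N + j];
                            for (int k = kk; k < kk + nb && k < N; k++)
                                s += A[i*N + k] * B[k*N + j];
                            C[i*N + j] = s;
                        }
    }

    int main(void) {
        double *A = malloc(N*N * sizeof *A);
        double *B = malloc(N*N * sizeof *B);
        double *C = malloc(N*N * sizeof *C);
        for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 0.5; }

        int best_nb = 0;
        double best = 1e30;
        for (int nb = 8; nb <= 128; nb *= 2) {   /* candidate blocking factors */
            for (int i = 0; i < N*N; i++) C[i] = 0.0;
            clock_t t0 = clock();
            blocked_mm(nb, A, B, C);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("NB = %3d: %.4f s\n", nb, t);
            if (t < best) { best = t; best_nb = nb; }
        }
        printf("selected NB = %d\n", best_nb);
        free(A); free(B); free(C);
        return 0;
    }

A real AEOS search times generated kernels in isolation with careful (cache-flushing) timers and explores a far richer space of variants (unrollings, latencies, fetch patterns), but the select-by-measurement loop is the same.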

Why ATLAS is needed
- Hand-tuned BLAS require many man-hours per platform
  - Tuning is only done where there is a financial incentive, so many platforms will never have an optimal version
  - Tuned code lags behind the hardware
  - May not be affordable by everyone
  - ATLAS can even improve on vendor code
- Allows for portably optimal codes
  - Obsolescence insurance
- Some operations may be important, yet not general enough to be standardized

ATLAS Software
- Currently provided:
  - Full BLAS (C & Fortran77 interfaces; see the usage example below)
    - Level 3 BLAS: generated GEMM (1-2 hours of install time per precision), plus recursive GEMM-based Level 3 BLAS (Antoine Petitet)
    - Level 2 BLAS: GEMV & GER kernels
    - Level 1 BLAS
  - Some LAPACK: LU and LL^T (Cholesky) factorizations
- Coming soon:
  - pthreads support
  - Open-source kernels: SSE & 3DNow!; Goto's ev5/ev6 BLAS
  - Performance for banded and packed storage
  - More LAPACK
- Coming not-so-soon:
  - Sparse support
  - User customization
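For orientation, here is a small usage example of the C interface to the BLAS that ATLAS supplies. cblas_dgemm is the standard CBLAS entry point; the suggested link line (-lcblas -latlas) is typical of an ATLAS install but varies by system.

    /* Compute C = 1.0*A*B + 0.0*C for 2x2 row-major matrices.
       Build with something like: cc gemm_demo.c -lcblas -latlas */
    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */
        return 0;
    }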

Algorithmic Approach for Matrix Multiply
- The only generated code is the on-chip (L1-resident) multiply
- All BLAS operations are written in terms of the generated on-chip multiply
- All transpose cases are coerced, via data copy, into a single case of on-chip multiply
  - Only one case need be generated per platform (see the structural sketch below)
[Diagram: C (M x N) = A (M x K) * B (K x N), with all three matrices partitioned into NB x NB blocks]
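A minimal sketch of this copy-plus-kernel structure (illustrative, not ATLAS source): one fixed-size "on-chip" multiply, with the full GEMM built by copying NB x NB blocks into contiguous buffers and invoking only that one kernel. The value of NB, row-major storage, and dimensions being exact multiples of NB are all assumptions made for brevity.

    #include <string.h>

    #define NB 40   /* empirically chosen L1 blocking factor (assumed) */

    /* The only "generated" code: multiply two contiguous NB x NB blocks. */
    static void on_chip_mm(const double *a, const double *b, double *c) {
        for (int i = 0; i < NB; i++)
            for (int j = 0; j < NB; j++) {
                double s = c[i*NB + j];
                for (int k = 0; k < NB; k++)
                    s += a[i*NB + k] * b[k*NB + j];
                c[i*NB + j] = s;
            }
    }

    /* Copy an NB x NB block out of a row-major matrix with leading dim ldx. */
    static void copy_in(const double *X, int ldx, double *blk) {
        for (int i = 0; i < NB; i++)
            memcpy(blk + i*NB, X + i*ldx, NB * sizeof(double));
    }

    /* C += A*B, row-major, with M, N, K all multiples of NB. */
    void gemm_nn(int M, int N, int K,
                 const double *A, const double *B, double *C) {
        double a[NB*NB], b[NB*NB], c[NB*NB];   /* contiguous on-chip buffers */
        for (int i = 0; i < M; i += NB)
            for (int j = 0; j < N; j += NB) {
                copy_in(C + i*N + j, N, c);
                for (int k = 0; k < K; k += NB) {
                    copy_in(A + i*K + k, K, a);
                    copy_in(B + k*N + j, N, b);
                    on_chip_mm(a, b, c);
                }
                for (int r = 0; r < NB; r++)   /* copy result block back */
                    memcpy(C + (i + r)*N + j, c + r*NB, NB * sizeof(double));
            }
    }

The copy step is also where transpose cases are handled in the approach the slide describes: copying A^T or B^T into the contiguous buffers reduces every transpose combination to this single non-transposed kernel.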

Algorithmic Approach for Level 3 BLAS
- Recur down to the L1 cache block size
- A kernel is needed at the bottom of the recursion
  - Use a GEMM-based kernel for portability (see the TRMM sketch below)
[Diagram: recursive TRMM, splitting the triangular matrix into quadrants]
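The recursive TRMM the slide depicts can be sketched as follows (illustrative, not ATLAS source). For B := A*B with A lower triangular, splitting A into quadrants turns the off-diagonal update into a pure GEMM, so only a GEMM kernel is needed at the bottom of the recursion. Row-major storage, the lower-triangular non-transposed case, and the tiny base-case cutoff are assumptions for this example.

    /* C (m x n) += A (m x k) * B (k x n); all three are blocks inside
       row-major matrices sharing leading dimension ld. */
    static void gemm_acc(int m, int n, int k, int ld,
                         const double *A, const double *B, double *C) {
        for (int i = 0; i < m; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + p] * B[p*ld + j];
    }

    /* B := A*B, with A (n x n) lower triangular and B (n x bn),
       both row-major with leading dimension ld. */
    void trmm_lower(int n, int bn, int ld, const double *A, double *B) {
        if (n <= 4) {   /* base case: direct triangular multiply */
            for (int i = n - 1; i >= 0; i--)   /* descend so rows k < i are still old */
                for (int j = 0; j < bn; j++) {
                    double s = 0.0;
                    for (int k = 0; k <= i; k++)
                        s += A[i*ld + k] * B[k*ld + j];
                    B[i*ld + j] = s;
                }
            return;
        }
        int h = n / 2;
        const double *A11 = A, *A21 = A + h*ld, *A22 = A + h*ld + h;
        double *B1 = B, *B2 = B + h*ld;
        trmm_lower(n - h, bn, ld, A22, B2);       /* B2 := A22*B2  (recursive TRMM) */
        gemm_acc(n - h, bn, h, ld, A21, B1, B2);  /* B2 += A21*B1  (pure GEMM)      */
        trmm_lower(h, bn, ld, A11, B1);           /* B1 := A11*B1  (recursive TRMM) */
    }

The same quadrant split turns each Level 3 BLAS routine (SYMM, SYRK, TRSM, and so on) into recursive calls on smaller triangles plus GEMM updates, which is why a single tuned GEMM kernel suffices for the whole Level 3 BLAS.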

500x500 DGEMM Across Various Architectures
[Performance chart]

500x500 Double Precision Recursively Blocked (RB) LU Factorization
[Performance chart]

500x500 Recursive BLAS on UltraSparc 2200
[Performance chart]